You're a tech lead who's just joined a team and on your first day, the CTO tells you:
"Our video processing API is very slow, looks like it's a problem with AWS EFS that we are using. Take a look and fix it."
You login to AWS and check EFS. There's a single big EFS with 2TB storage mounted on 12 EC2s. You explore all the options on AWS EFS UI, but nothing unusual stands out. Nothing unusual in Cloudwatch too.
You're not sure how you should really test EFS for slowness.
With that, you try to figure out just how slow is the video processing API. There's no observability, so it's difficult to know latencies and other metrics. You login to a few EC2 machines on prod and start tailing nginx logs in hopes of finding something.
You write some bash pipelines with grep, awk and tr to filter logs for that specific API and redirect the output to a file. You see some requests take less than 800 ms while some requests take more than 8 seconds.
"What's going on?"
- you're thinking.
Still, there's no way to know why this API is sometimes slow. There's no trace, it's just the plain logs that you have. You decide to read the source code in the hopes of finding how this API is using EFS.
The code is a bit complex, no tests or documentation. The code is reading a bunch of video files from EFS, combines these files together, and stores the final file back in EFS.
You think you have found the culprit.
- reading the files from EFS,
- combining them, and
- writing them back to the EFS
is causing all the latency.
You decide to add good old print statements in the code - printing the timestamps for each of the above actions along with the unique request id. To your surprise, all the EFS reading, combining, and writing operation is done in milliseconds. You're confused. Where on earth is the latency of more than 8 seconds coming from?
You take another look at the code.
At the end of the function, the code is marking a bunch of rows as "processed" in a DB. That can't be the reason, can it? You add another set of print statements and deploy the code. It is indeed that this DB operation is taking 7-8 seconds for some API requests.
Unaware of the DB schema design, you rope in a DBA for debugging. The DBA tells you that there's some stored procedure running on the same set of tables by a background worker.
You probe further.
You find out that the background worker has a config, called "processing delay" in seconds and it's set to 1 on production. That means the background process tries to process records older than just 1 second. You're still not sure what's really going on.
You ask whether this API was always slow or something changed recently.
"It was working okay before", people say, but no one knows when it really started becoming slow.
We got to know about this issue when customers reported that their video processing is broken or slow.
You ask around and find out that the previous value for "processing delay" was 300 seconds, i.e. 5 mins! The local config value of 1 second got pushed to prod in a recent commit and that's when latencies started increasing.
You fix the config value and sure enough, the latencies are back to normal, less than 1 sec for all APIs. Workers were processing data older than just 1 second, hence resulting in lot of queries and possibly lock contention on DB. That's what was causing the APIs to become slow.
You started with
- debugging EFS
- to tailing logs to find API latency
- to debugging DB schema and queries
- finally landed on the config file fix
Real life debugging is pretty messy. The more situations you have to debug, the better you become.
I write such stories on software engineering. There's no specific frequency as I don't make up these. If you liked this one, you might love - Unblocking Team members