Vanishing Acts: The Mystery of Failing Database Writes
πŸ“¦

Vanishing Acts: The Mystery of Failing Database Writes

As an SRE at a growth-stage company, you're on call this week. A PagerDuty alert wakes you up. It's related to the main transactional database, so you check immediately.
The inserts to one of the tables in the database are failing. You start investigating.
  • Is the database down? (no, it's up)
  • Are we able to serve read queries (yes, we are)
  • Are we able to insert/update/delete records in other tables (yes, we are)
You're wondering why the inserts in only one table are failing. This is strange. Also, there are no API failures, just failures in that one table.
You find out what that table contains. It's the event_log table that contains logs from various API endpoints. You can query the data in this table but cannot insert any new data.
You're still not sure what the root cause could be. Out of curiosity, you run select MAX(ID) from event_log to find the maximum ID value from the table. It outputs this number:
2147483647
Suddenly, you have a lightbulb moment!
This number is the max value of the integer. So looks like the auto-increment primary key column ID uses an integer as data type, and you've hit the limit of the integer range.
You now know the problem, but how to fix it? You think of the following approaches:
  1. Change the column type from int to bigint?
  1. Can we use the negative int range as the negative integer space is available?
Let's talk about these approaches and their pros and cons:
  1. int --> bigint
    1. Altering the table to change the column type (int to bigint) would require about 5+ hours (given the number of records in the table), so you rule out this option.
  1. Using Negative int range
    1. To get the service running quickly, you ask the dev team to check if we can use negative number space (-2147483648 or -2^31 to 0) for adding new rows. But, the underlying database library doesn't support this use case.
You're trying to think what to do. The inserts to this table are still failing and it's been more than 30 minutes. You're worried we'll use up our SLO budget just due to this incident alone.
You have handled many downtimes in the past, but this is a unique one. You force yourself to do the most difficult thing during any incident.
"Take a step back and think calmly."
5 minutes in, you start pacing the room, forcing yourself to take a step back. You lock your computer and walk toward the water cooler to get some water. You're so engrossed in thoughts that you don't realize water is overflowing in the glass, just like the integer overflow in your database table.
You hurriedly get back to the computer and join the incident call again. People on the call are still trying to figure out some solution. You propose the following option.
"Let's create a new table and write all new event_logs to it. Later, we'll migrate data from the existing table to the new table."
You ask the backend team to estimate the effort to do this change
"Not much. We'll need to change just one file, and we're good."
You ask the team to proceed with this change. The team creates a new table, this time with bigint ID, and starts the autoincrement from 2147483648. This way, you can migrate data from the old table to this table, and there won't be any conflict in the records.
This change is pushed to prod within the next few minutes, and sure enough, it works! The inserts are working again.
Later, you write a script to copy the records from the old table to the new table and then drop the old table.

Lessons:

  • It's super important to take a step back during the incident. It's counter-intuitive, but great SREs do that.
  • Tech debt bites in unexpected ways. Nobody thought this table would get so big that we'd run out of integers.
  • For large tables, consider using bigint or UUID as the default primary key data type instead of int.

I write such stories on software engineering.
There's no specific frequency, as I don't make up these.
If you liked this one, you might love -
πŸ‘»
A story about a nightmare scenario for every SRE
Follow me on LinkedIn and TwitterΒ for more such stuff, straight from the production oven!
Β