(Database reliability story 1)
You join a team as lead SRE, and the CTO asks you to manage and scale a self-managed 6-node MySQL cluster on production.
Here's a story of how you turn this around - from many database failures per month to three nines of uptime for the database.
To get the context, you ask some questions:
- Why self-manage the database instead of using a cloud SaaS offering?
- Who set this up, and who's managing it currently?
- Any observability set up for the DB?
- What are the traffic patterns, peak RPS, data size, VM specs, etc.?
You have many questions, but you start with these.
The CTO candidly replies.
- Why self-manage: We started with a small self-hosted MySQL. The business grew so much that we kept scaling vertically.
- Who set it up: The lead who set this up is no longer with the company.
- Observability: We have a basic Grafana dashboard with infra metrics for the cluster, nothing more.
- Traffic pattern, scale, etc.: Peak of 20k read requests and 5k write requests per second; data size 1.5 TB, growing 6 GB/day; 150+ tables; 6-node cluster with 128 GB RAM and 32 vCPUs per node.
With this, you suggest these improvements:
- Zero downtime schema migrations
- Separating OLTP and OLAP workloads
- Read-write traffic routing
- Controlling replication lag
- Improving the security posture
- 99.9% uptime
And finally, migrating to a managed offering, e.g., AWS RDS.
[Each of these improvements deserves a separate post (and a conference talk). But for now, I'll write about one specific outcome.]
"Zero downtime schema migration"
Before the solution, let me talk about why schema migration on large tables is a challenge in MySQL.
A schema migration in MySQL with a plain ALTER TABLE causes many problems:
- On large tables (millions of rows), it causes replication lag
- Replicas can't process other statements while applying the migration
- Reads from replicas return stale data due to the lag
- The operation can't be paused once it's running on replicas
You find out that at least 7 product features are blocked due to the inability to perform schema migrations on large tables in production.
The engineering team has created PRs for these features, but they can't be merged as the team currently doesn't have zero downtime schema migrations.
You know that you're not the first to face this problem (MySQL has been around for a long time, so there must be a way).
You do the research and find two approaches and tools:
- trigger-based (Percona's pt-online-schema-change, i.e. pt-osc) and
- binlog-based (GitHub's gh-ost)
You research more and prepare a document comparing the two approaches.
Comparison of pt-osc and gh-ost for MySQL schema migration
After much consideration, you go ahead with GitHub's gh-ost, as it gives you an asynchronous data copy mechanism, controllable switchover, and throttling support.
But what do these terms mean?
See, zero downtime MySQL schema migration works this way:
- Create a table similar to the source table (the shadow table)
- Apply the schema change to the shadow table while it's empty
- Copy data from the source table to the shadow table
- Rename the tables atomically
- Optionally, delete the older table
Both gh-ost and pt-osc work this way. But, they differ in step 3 (how data is copied from the source table to the shadow table).
pt-osc uses triggers; gh-ost tails the MySQL binlog.
Both mechanisms have their own pros and cons. For this use case, you go ahead with gh-ost.
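The five-step flow above can be sketched end to end. This is an illustrative sketch using SQLite (from Python's standard library) as a stand-in, with made-up table and column names; in MySQL, gh-ost and pt-osc perform the final swap with a single atomic RENAME TABLE statement, which SQLite approximates here with two renames:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Source table with some rows (hypothetical schema for illustration)
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
cur.executemany("INSERT INTO users (email) VALUES (?)",
                [(f"user{i}@example.com",) for i in range(1000)])

# Steps 1-2: create an empty shadow table and apply the schema change to it
cur.execute("""CREATE TABLE _users_shadow (
    id INTEGER PRIMARY KEY,
    email TEXT,
    created_at TEXT DEFAULT ''  -- the new column being added
)""")

# Step 3: copy data in small chunks so the source table stays available
# (this is the step where pt-osc and gh-ost differ in how they keep the
# shadow table in sync with ongoing writes)
CHUNK = 100
last_id = 0
while True:
    rows = cur.execute(
        "SELECT id, email FROM users WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, CHUNK)).fetchall()
    if not rows:
        break
    cur.executemany(
        "INSERT INTO _users_shadow (id, email) VALUES (?, ?)", rows)
    last_id = rows[-1][0]

# Step 4: swap tables (in MySQL: one atomic
# `RENAME TABLE users TO _users_old, _users_shadow TO users`)
cur.execute("ALTER TABLE users RENAME TO _users_old")
cur.execute("ALTER TABLE _users_shadow RENAME TO users")

# Step 5: optionally drop the old table
cur.execute("DROP TABLE _users_old")
conn.commit()

print(cur.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1000
```

The chunked copy is the key design choice: small batches keep locks short, and the tool can pause between chunks, which is what makes throttling possible.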
You set up a replica of the production environment where you'd try out gh-ost and its various configuration options.
The two most important config flags are:
- --max-load (e.g., Threads_running=25): throttle when the number of active threads on the primary crosses this threshold
- --max-lag-millis: throttle when the replication lag on the replicas exceeds this value
You go through gh-ost documentation and even the source code to find out how it works. You read through most use cases and lurk in forums to understand production challenges. You simulate prod-like scale (400M rows) and run multiple migrations on such tables with various flags.
Fast forwarding the story:
Once you have the confidence, you perform your first migration on a 10M-row table. It works well, with zero downtime. Gradually, you move on to migrating a 400M-row table. It takes 12+ hours, but things are mostly smooth. You document the runbook.
Thanks to your efforts,
- Product teams can ship features faster
- Database reliability is improved
- Anyone in the team can schema migrate large tables (based on your runbook)
So far, the team has run hundreds of schema migrations without any problems.
Scaling stateless systems is easy. Stateful, less so.
- Learn to go deep in your stack
- Read the documentation, code, participate in forums
- Try out tools with production scale, not hello-world scale
- Automate the process
- Give it time; don't expect results on day 1
I write such stories on software engineering.
There's no specific frequency, as I don't make these up.
If you liked this one, you might love - Building Pull Request-based ephemeral Preview environments on Kubernetes
Oh, by the way, if you need help with database reliability and scaling, reach out on Twitter or LinkedIn via DMs. We have worked at terabyte scale with both relational and non-relational databases.
This was one of the reasons I started One2N - to help growing orgs scale sustainably.