A story about a nightmare scenario for every SRE

Story time.
It's about cloud security failures and why good engineering practices matter, especially during the One to N journey.
(A former colleague I met at a conference told me this story from a few years ago.)
Here goes!
As a team lead, you're woken up by an AWS billing alarm!

It's your on-call turn this week. Early in the morning, a Slack alert about a cost anomaly fires, and you get paged.
You acknowledge the alert and start investigating.
You find that a new Lambda function is responsible for the sudden cost increase, and hence the alarm.
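In practice this kind of alert usually comes from AWS Cost Anomaly Detection or a CloudWatch billing alarm. Just to illustrate the idea, here's a toy version of such a check (the threshold and numbers are mine, not from the story): flag the latest day if spend jumps well above the recent baseline.

```python
# Toy cost-anomaly check (illustrative only; real alerts come from
# AWS Cost Anomaly Detection or CloudWatch billing alarms).

def is_cost_anomaly(daily_spend, threshold=2.0):
    """Flag the latest day if it exceeds `threshold` times the
    average of the preceding days."""
    *history, today = daily_spend
    if not history:
        return False
    baseline = sum(history) / len(history)
    return today > threshold * baseline

# A sudden Lambda-driven spike stands out against a flat baseline:
print(is_cost_anomaly([120, 118, 125, 122, 540]))  # → True
```

A real detector is smarter (seasonality, per-service breakdowns), but the principle is the same: a rogue workload shows up as spend that history can't explain.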

This Lambda function is in a completely different AWS region from the ones you normally use.
You wonder why there's a random Lambda function in a region where you have no other infrastructure.
You dig for more info about the function and also ping a dev team member.

The developer knows nothing about this Lambda and confidently tells you there shouldn't be any infra in that region.
You know something is wrong.
At the same time, you receive AWS Support and Trusted Advisor notifications about EC2 abuse.

You decide to stop and delete the Lambda function.
You also find that similar Lambdas are running in all the other active AWS regions. You quickly delete those, too.
It looks like these Lambdas were created by a recently added IAM user that you don't recognize at all.
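Sweeping every region for resources you didn't expect is the kind of check that finds these fast. A sketch of that logic (the region allowlist and inventory shape are my assumptions; in practice you'd build the inventory with boto3's `lambda` client, calling `list_functions()` per region):

```python
# Flag Lambda functions living outside the regions you actually use.
# In practice the inventory comes from boto3, roughly:
#   boto3.client("lambda", region_name=r).list_functions()
# for each region; here it's a plain dict so the logic stands alone.

def unexpected_functions(inventory, allowed_regions):
    """Return {region: [function names]} for regions not in the allowlist."""
    return {
        region: names
        for region, names in inventory.items()
        if names and region not in allowed_regions
    }

inventory = {
    "us-east-1": ["checkout-api"],
    "eu-north-1": ["x9f2-miner"],   # hypothetical rogue function
    "ap-south-1": [],
}
print(unexpected_functions(inventory, {"us-east-1"}))
# → {'eu-north-1': ['x9f2-miner']}
```

Running something like this on a schedule (or enabling AWS Config rules) turns "we accidentally noticed" into "we got alerted".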

You're in fire-fighting mode.
You know your infra has been compromised and the attacker is using your AWS account to DDoS some targets.
You track down the IAM user and its access credentials.
You disable this user and all its access keys.

This breaks some parts of your application and infrastructure, but you fix those promptly with the help of the dev team.
You focus on stabilizing production and then investigating the root cause in detail.
It takes a few hours for you and the team to get everything back to normal.

Later, you find out that the attacker somehow got hold of a set of AWS access keys.
Since the keys had no permission limits, the attacker was able to script the creation of Lambda functions and run a DDoS attack.
You write a detailed RCA (root cause analysis) with next steps.

There's a lot to fix:
  • Implement the principle of least privilege for all roles
  • Rotate access keys every 30 days
  • Don't spin up infra in a public subnet unless needed
  • Lock down all open ports unless needed
  • Things that can be private should be private
  • Use IAM roles or a secrets manager instead of long-lived access keys
  • Don't make application secrets available on EC2 machines in plaintext
  • Limit prod infra access to internal team members
  • Don't share IAM users and credentials across services
  • ...and many more
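For the first item, "least privilege" concretely means scoping each IAM policy to the exact actions and resources a service needs, instead of wildcards. A sketch (the bucket name and action are placeholders, not from the story):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadOnlyAppBucket",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-app-bucket/*"
    }
  ]
}
```

Compare that with a policy allowing `"Action": "*"` on `"Resource": "*"`: with the scoped version, a leaked credential can read one bucket; with the wildcard, it can spin up Lambdas in every region.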

You work with other team members and engineering leadership to prioritize, implement, and roll out these fixes.
It's not just about tech. You work to build awareness of the shift-left approach to security in the team, and do the same for infra automation and other DevOps practices.

This effort takes a few months.
You can't really say you're 100% successful in "transforming" the org, as that requires habit change.
Habits take a long time to shift, and people have to actually be willing to change.
But you leave them in a much better place than before.

This is an old story, but I see things aren't magically different today for many companies.
During the scaling phase, the focus is typically on "growth at all costs".
I get it, but orgs often miss out on the engineering culture and practices that enable sustainable growth.

This is what I call Pragmatic Engineering: maintaining a good balance between two extremes, over-engineering and the big ball of mud.
This was one of the reasons I started One2N - to help growing orgs scale sustainably.
Do reach out to me if you're facing scaling problems.

I write such stories on software engineering.
There's no specific frequency, as I don't make these up.
Follow me on LinkedIn and Twitter for more such stuff, straight from the production oven!