Here's a story of a pragmatic tech lead who understands networking fundamentals like packet routing and saved thousands of $$$ on cloud costs.
If you think low-level fundamentals don't matter in 2023, be ready for an awakening!
So, you're running some workloads on GCP Kubernetes that need to process large volumes of data (think 100+ TB per month) from the internet. For simplicity, imagine a web-crawler-like use case that makes HTTP calls and processes the HTML responses.
Your networking cost is thousands of $$.
You dig into the cost breakdown. Most of it comes from GCP's Cloud NAT service. Cloud NAT is Google's managed Network Address Translation service. It's a high-performance, high-availability, and scalable service that lets workloads with only private IPs reach the internet.
Cloud NAT cost, in turn, consists of three sub-components.
You find out that your cost is dominated by the second factor - "Cost per GiB of Data Processed by the Gateway", since you're processing 100+ TB of data.
In fact, 80% of the NAT cost is due to this second factor.
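To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The per-GiB rate is an assumption used for illustration only (check the current Cloud NAT pricing page for your region), and the function names are made up for this example.

```python
# Rough estimate of the "data processed" part of a Cloud NAT bill.
# ASSUMPTION: the rate below is illustrative, not an authoritative price.
ASSUMED_PRICE_PER_GIB_USD = 0.045

BYTES_PER_TB = 10**12
BYTES_PER_GIB = 2**30


def data_processing_cost(tb_per_month: float,
                         price_per_gib: float = ASSUMED_PRICE_PER_GIB_USD) -> float:
    """Monthly cost of pushing tb_per_month terabytes through the gateway."""
    gib = tb_per_month * BYTES_PER_TB / BYTES_PER_GIB
    return gib * price_per_gib


def total_nat_cost(data_cost: float, data_share: float = 0.80) -> float:
    """Back out the total NAT bill if data processing is data_share of it."""
    return data_cost / data_share


if __name__ == "__main__":
    data_cost = data_processing_cost(100)
    print(f"data processing: ~${data_cost:,.0f} per month")
    print(f"implied total NAT bill: ~${total_nat_cost(data_cost):,.0f} per month")
```

The point is not the exact figure but the shape: the charge grows linearly with traffic, so at 100+ TB it dwarfs everything else.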
You're wondering whether it makes sense to make the nodes in the Kubernetes cluster public. But, nah, that wouldn't be a good idea. You don't want to expose all nodes publicly just to save some costs. And GCP doesn't allow a Kubernetes cluster with a mix of public and private nodes.
You also talk to the GCP team and see if they have any better alternatives. Finally, you decide to run your own self-hosted NAT in GCP. You go through all the Cloud NAT documentation and also revise your networking fundamentals - iptables, Linux routing tables, etc.
Your plan is:
- Run NAT VMs in each zone in GCP and configure these with iptables rules
- Configure Kubernetes nodes to use these custom NAT VMs instead of the Cloud NAT service
- Take care of high availability, load balancing, fault tolerance, etc.
Let's expand further 👇
Running the NAT VMs involves:
- Spinning up VMs via Terraform/Pulumi
- Enabling IP forwarding
- Making forwarding persistent in the sysctl.conf file
- Enabling NAT using an iptables -t nat -A POSTROUTING rule
- Saving iptables rules
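The on-VM steps above can be sketched as a small helper that only builds the shell commands (it does not run them). The interface name and the rules file path are assumptions: they depend on your VM image and on how you persist iptables rules, and the function name is hypothetical.

```python
from typing import List


def nat_vm_setup_commands(out_interface: str = "ens4",
                          rules_file: str = "/etc/iptables/rules.v4") -> List[str]:
    """Return the shell commands that turn a Linux VM into a simple NAT box.

    ASSUMPTIONS: out_interface is the VM's primary NIC, and rules_file is
    where your distro's persistence tooling expects saved rules.
    """
    return [
        # Enable IP forwarding right now.
        "sysctl -w net.ipv4.ip_forward=1",
        # Make forwarding survive a reboot.
        "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf",
        # Rewrite the source address of packets leaving via out_interface.
        f"iptables -t nat -A POSTROUTING -o {out_interface} -j MASQUERADE",
        # Save the rules so they can be restored on boot.
        f"iptables-save > {rules_file}",
    ]


if __name__ == "__main__":
    for cmd in nat_vm_setup_commands():
        print(cmd)
```

You would run these as root from a startup script or your provisioning tool. Note that the VM itself also has to be allowed to forward packets at the GCP level (the instance's IP forwarding setting), otherwise the iptables rule never sees the traffic.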
Next, you point the Kubernetes nodes at the custom NAT VMs. You do this via:
- Tagging Kubernetes nodes with a custom tag
- Configuring Route rules
- Routing all traffic from the tagged Kubernetes nodes to the previously created custom NAT VMs
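As a sketch of the routing side, the helper below builds one gcloud compute routes create command per NAT VM. The network name, tag, VM names, zones, and priority value are all placeholders invented for this example. It only prints the commands so you can review them first.

```python
from typing import List, Tuple


def nat_route_commands(network: str,
                       node_tag: str,
                       nat_vms: List[Tuple[str, str]],
                       priority: int = 800) -> List[str]:
    """Build one default route per NAT VM, applied only to tagged nodes.

    nat_vms is a list of (instance_name, zone) pairs.
    ASSUMPTION: priority must beat the default internet route (a lower
    number wins); 800 is just an illustrative value.
    """
    commands = []
    for name, zone in nat_vms:
        commands.append(
            "gcloud compute routes create "
            f"nat-route-{name} "
            f"--network={network} "
            "--destination-range=0.0.0.0/0 "
            f"--next-hop-instance={name} "
            f"--next-hop-instance-zone={zone} "
            # Only instances carrying this network tag use the route.
            f"--tags={node_tag} "
            # Every route gets the same priority value.
            f"--priority={priority}"
        )
    return commands


if __name__ == "__main__":
    vms = [("nat-a", "us-central1-a"), ("nat-b", "us-central1-b")]
    for cmd in nat_route_commands("my-vpc", "use-custom-nat", vms):
        print(cmd)
```

One design detail worth keeping in mind: the NAT VMs themselves should not carry the node tag, or their own outbound traffic would be routed back to a NAT VM.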
That leaves high availability, load balancing, and fault tolerance. For this, you configure multiple routes with the same priority, one per NAT VM. This way, GCP automatically load balances traffic across all available VMs, and if any VM goes down, traffic is switched to the others.

You deploy this to a staging environment and load-test the system with production-like workloads for a few weeks.
The test is successful, saving 80% of the networking costs.
Your effort in learning about networking fundamentals and iptables has paid off massively.
Key takeaways:
- Learn to lift the veil on the layers of technology; don't treat things as a black box
- Read the documentation!
- Test your solution with production-scale workloads before rolling it out
"Appreciate knowledge about fundamentals."
It's still relevant!
I write such stories on software engineering. There's no specific frequency, as I don't make these up. If you liked this one, you might love - My app is slow. Can you fix it?