Taming GCP networking cloud costs

Here's a story of how a pragmatic tech lead who understands networking fundamentals like
  • iptables
  • packet routing and
  • NAT
saved thousands of $$$ on cloud costs.
If you think low-level fundamentals don't matter in 2023, be ready for an awakening!

Use case

So, you're running some workloads on GCP Kubernetes that need to process large volumes of data (think 100+ TB per month) from the internet. For simplicity, imagine a web-crawler-like use case that makes HTTP calls and processes the HTML responses.
Your networking cost is thousands of $$.

Cost breakdown

You dig into the cost breakdown and find that the cost is for GCP's Cloud NAT service. Cloud NAT is a managed Network Address Translator by Google. It's a high-performance, high-availability, and scalable service that allows workloads with only private IPs to access the internet.
Cloud NAT cost, in turn, consists of three sub-components.
[Image: the three Cloud NAT pricing sub-components]
You find out that your cost is dominated by the second factor - "Cost per GiB of Data Processed by the Gateway", since you're processing 100+ TB of data.
In fact, 80% of the NAT cost is due to this second factor.
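To get a feel for why the per-GiB charge dominates at this volume, here is a rough back-of-the-envelope sketch. The rate of $0.045 per GiB is an assumption used for illustration (check the current Cloud NAT pricing for your region), and "100 TB" is treated as 100 TiB.

```python
GIB_PER_TIB = 1024


def nat_data_processing_cost(tib_per_month: float, usd_per_gib: float) -> float:
    """Monthly Cloud NAT data-processing charge in USD for a given volume."""
    return tib_per_month * GIB_PER_TIB * usd_per_gib


if __name__ == "__main__":
    # Assumed rate, not a figure from this post.
    cost = nat_data_processing_cost(100, usd_per_gib=0.045)
    print(f"~${cost:,.0f} per month for data processing alone")
```

Under these assumptions, the data-processing charge alone lands in the thousands of dollars per month, and it grows linearly with traffic.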

What can you do?

You wonder whether it makes sense to make the nodes in the Kubernetes cluster public. But, nah, that wouldn't be a good idea: you don't want to expose all nodes publicly just to save some costs, and GCP doesn't allow a Kubernetes cluster with a mix of public and private nodes.
You also talk to the GCP team to see if they have any better alternatives. Finally, you decide to run your own self-hosted NAT in GCP. You go through all the Cloud NAT documentation and also brush up on your networking fundamentals: iptables, Linux routing tables, etc.
Your plan is:
  1. Run NAT VMs in each zone in GCP and configure these with iptables rules
  2. Configure Kubernetes nodes to use these custom NAT VMs instead of the Cloud NAT service
  3. Take care of high availability, load balancing, fault tolerance, etc.
Let's expand further 👇

1. Running and configuring NAT VMs

This involves:
  • Spinning up VMs via Terraform/Pulumi
  • Enabling IP forwarding
  • Making forwarding persistent in the sysctl.conf file
  • Enabling NAT using an iptables -t nat -A POSTROUTING rule
  • Saving iptables rules
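The steps above can be sketched as a small helper that prints the commands to run as root on each NAT VM. This is a minimal sketch under assumptions, not the exact setup from the story: the interface name `ens4`, the rules file `/etc/iptables/rules.v4`, the use of MASQUERADE as the POSTROUTING target, and the function name are all illustrative.

```python
import shlex


def nat_vm_setup_commands(egress_iface: str = "ens4",
                          rules_file: str = "/etc/iptables/rules.v4") -> list[str]:
    """Return the shell commands (to be run as root) that turn a VM into a NAT box."""
    iface = shlex.quote(egress_iface)
    rules = shlex.quote(rules_file)
    return [
        # Enable IP forwarding in the running kernel.
        "sysctl -w net.ipv4.ip_forward=1",
        # Make forwarding persistent across reboots.
        "echo 'net.ipv4.ip_forward=1' >> /etc/sysctl.conf",
        # Rewrite the source address of packets leaving via the egress interface.
        f"iptables -t nat -A POSTROUTING -o {iface} -j MASQUERADE",
        # Save the rules so they can be restored at boot (path is an assumption).
        f"iptables-save > {rules}",
    ]


if __name__ == "__main__":
    print("\n".join(nat_vm_setup_commands()))
```

Note that on GCP the VM itself also has to be allowed to forward packets (the instance-level IP forwarding setting), which is typically set when the VM is created via Terraform/Pulumi.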

2. Configure Kubernetes nodes to use custom NAT VMs instead of the Cloud NAT service

You do this via:
  • Tagging Kubernetes nodes with a custom tag
  • Configuring route rules
  • Routing all internet-bound traffic from the tagged Kubernetes nodes to the previously created custom NAT VMs
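As a rough illustration of the route rules, the sketch below builds a `gcloud compute routes create` command that sends traffic from tagged nodes to a NAT VM. The route, network, tag, VM, and zone names are hypothetical, and the priority of 800 is an assumption chosen to win over GCP's default internet route (priority 1000; a lower number means higher priority).

```python
import shlex


def nat_route_command(route_name: str, network: str, node_tag: str,
                      nat_vm: str, nat_vm_zone: str, priority: int = 800) -> list[str]:
    """Build a gcloud command that routes tagged nodes' egress via a NAT VM."""
    return [
        "gcloud", "compute", "routes", "create", route_name,
        f"--network={network}",
        "--destination-range=0.0.0.0/0",            # all internet-bound traffic
        f"--next-hop-instance={nat_vm}",            # the custom NAT VM
        f"--next-hop-instance-zone={nat_vm_zone}",
        f"--tags={node_tag}",                       # applies only to tagged nodes
        f"--priority={priority}",
    ]


if __name__ == "__main__":
    # Hypothetical NAT VMs, one per zone, all at the same priority.
    for vm, zone in [("nat-vm-a", "us-central1-a"), ("nat-vm-b", "us-central1-b")]:
        cmd = nat_route_command(f"egress-via-{vm}", "my-vpc", "k8s-nat-egress", vm, zone)
        print(shlex.join(cmd))
```

The NAT VMs themselves should not carry the node tag, otherwise their own egress would be routed back to a NAT VM instead of out to the internet.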

3. High availability, load balancing, fault tolerance

For this, you configure multiple routes with the same priority. This way, GCP automatically spreads traffic across all available NAT VMs, and if any VM goes down, traffic is switched to the others. You deploy this on a staging environment and load-test the system with production-like workloads for a few weeks.
The test is successful, saving 80% of the networking costs.
Your effort in learning about networking fundamentals and iptables has paid off massively.
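To build intuition for how equal-priority routes spread traffic, here is a toy model that picks a NAT VM by hashing a flow's 5-tuple. It is only an illustration: the hash function, the VM names, and the flow values are assumptions, and GCP's actual hashing is internal to the platform.

```python
import hashlib


def pick_next_hop(flow: tuple, available_nat_vms: list[str]) -> str:
    """Pick a NAT VM for a flow by hashing its 5-tuple (toy model)."""
    if not available_nat_vms:
        raise ValueError("no NAT VM available")
    key = "|".join(map(str, flow)).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(available_nat_vms)
    return available_nat_vms[index]


if __name__ == "__main__":
    # (protocol, source IP, source port, destination IP, destination port)
    flows = [("tcp", "10.0.0.5", 40000 + i, "93.184.216.34", 443) for i in range(6)]
    vms = ["nat-vm-a", "nat-vm-b", "nat-vm-c"]
    for flow in flows:
        print(flow[2], "->", pick_next_hop(flow, vms))
    # If one VM goes away, flows are spread over the remaining ones.
    for flow in flows:
        print(flow[2], "->", pick_next_hop(flow, vms[:2]))
```

The same flow always maps to the same VM while the set of VMs is unchanged, which keeps a connection on one NAT box; when a VM disappears, its flows are redistributed among the rest.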

Lessons

  • Learn to lift the veil on the layers of technology; don't treat things as a black box
  • Read the documentation!
  • Test your solution with production-scale workloads before rolling it out
Above all:
"Appreciate knowledge about fundamentals."
It's still relevant!

I write such stories on software engineering. There's no specific frequency, as I don't make these up. If you liked this one, you might love:
My app is slow. Can you fix it?
Follow me on LinkedIn and Twitter for more such stuff.