A short debugging story to start off the new year.
Developer (D): Hey, can you join a call? I need some help in debugging a webhook API connectivity issue. The customer team is also on the call.
Team lead (You): Okay sure, add me to the call.
You: So, what are you trying to do and what's the problem?
D: Our backend has a webhook API that a third party will invoke over the internet. The customer team from that third party is on the call. They are getting a 500 error when making this API call.
You: Okay, what's the error? And before that, can you tell me the HTTP request flow?
D: Yeah, so the error is related to TLS and I am not sure how to debug this. The request flow is as follows:
Third-party --> Firewall --> Nginx --> Our backend
You: Okay, have you tried exposing the API on staging on http instead of https?
D: Yes, this is on a staging environment, and the http API works. When I switch to https, the third party gets an error.
You: Okay, where does the TLS termination happen in the above request flow?
D: (long pause) I think it happens at Nginx.
D: (thinking...) Yes, I have configured the certs in Nginx.
You: So the Firewall is Layer 4 and Nginx acts as Layer 7 and terminates TLS?
You: Can you try making a call to the webhook API from your local machine via curl or Postman?
D: (tries to demo this, but the call times out)
You: Is there any IP whitelisting at the Firewall level to make sure we allow the webhook API call only from that third party and no one else?
You: Can you try removing the whitelisting for testing purposes right now and try the request from the local machine again?
D: (demos this use case, curl works okay, returns 4xx due to missing auth headers)
You: Can you open the URL subdomain.example.com/api/webhook (used only as an example here) in your browser?
D: (demos this use case, on Chrome, it shows "Not secure" even though the protocol is https)
You: Interesting, show me more details about the certificate on Chrome.
D: (shows the details) Where are we going with this?
You: See, the certificate you're using is not valid for the subdomain. It's only valid for the main example.com domain. The third party strictly checks the certificate, so its request fails with a TLS error.
(you continue): I wonder why it works via curl, though. It should also perform a strict check, as we didn't pass the -k or --insecure flag.
D: Oh, I have set up a default config in curl to always use the --insecure flag to make testing easier.
You: (smiling) Ah, that explains why! So, the TLS error is due to the bad certificate. You'll need to use a cert issued for the subdomain, or a wildcard cert that covers the subdomains, and update your Nginx config accordingly.
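To make the mismatch concrete, here is a minimal Python sketch of the kind of name check a strict TLS client performs. It is a simplification, not the full rules real clients apply, and the function names name_matches and cert_covers are made up for illustration. It assumes the certificate's names are already available as a plain list of strings.

```python
def name_matches(pattern: str, hostname: str) -> bool:
    """Simplified check of one certificate name against a hostname.

    Assumption: a wildcard is only honoured as the whole leftmost label
    and stands for exactly one label.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")

    # A name can only match a hostname with the same number of labels.
    if len(pattern_labels) != len(host_labels):
        return False

    if pattern_labels[0] == "*":
        # "*.example.com" covers "subdomain.example.com",
        # but not "example.com" or "a.b.example.com".
        return pattern_labels[1:] == host_labels[1:]

    return pattern_labels == host_labels


def cert_covers(cert_names: list[str], hostname: str) -> bool:
    """True if any name on the (hypothetical) certificate covers the hostname."""
    return any(name_matches(name, hostname) for name in cert_names)


if __name__ == "__main__":
    # The situation in the story: a cert for example.com only.
    print(cert_covers(["example.com"], "subdomain.example.com"))
    # One possible fix: a cert that also carries a wildcard name.
    print(cert_covers(["example.com", "*.example.com"], "subdomain.example.com"))
```

With a cert that only names example.com, the check fails for subdomain.example.com, which is what the browser and the third party were complaining about. A client running with --insecure skips this check entirely.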
D: (hours later) Thanks, this issue was fixed. How do I learn to debug like this?
You: Here are the lessons.
- Learn the end-to-end request flow and the whole stack (from Layer 4 to Layer 7, at least)
- Understand how proxies operate, what TLS and DNS are, and how certs work
- Get familiar with basic networking utilities - curl, nslookup, netstat, telnet, tcpdump, and more
- Try to form a mental model about how things work and a hypothesis about where the problem could be
- Only change one variable at a time when debugging
- Only change what's relevant to your hypothesis and revisit your hypothesis and mental model
- Practice and learn from past incidents and war stories from seniors
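One way to practice the "one variable at a time" lesson is to test each layer of the request flow separately and stop at the first one that fails. The sketch below does this in Python using only the standard library. The step names and the first_failure helper are made up for illustration, and the timeout and port values are assumptions. It is a rough aid for forming a hypothesis, not a replacement for curl or tcpdump.

```python
import socket
import ssl


def check_dns(host, port):
    # Fails with socket.gaierror if the name does not resolve.
    socket.getaddrinfo(host, port)


def check_tcp(host, port, timeout=5.0):
    # Fails on a refused connection or a timeout (e.g. a firewall dropping packets).
    with socket.create_connection((host, port), timeout=timeout):
        pass


def check_tls(host, port, timeout=5.0):
    # The default context verifies the certificate chain and the hostname,
    # similar to a strict client (no "--insecure" equivalent here).
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass


DEFAULT_STEPS = [("dns", check_dns), ("tcp", check_tcp), ("tls", check_tls)]


def first_failure(host, port=443, steps=DEFAULT_STEPS):
    """Run the steps in order; return (step_name, error) for the first failure, else None."""
    for name, step in steps:
        try:
            step(host, port)
        except OSError as exc:  # gaierror, timeouts and ssl.SSLError are all OSError subclasses
            return name, exc
    return None


if __name__ == "__main__":
    # Example hostname from the story; replace it with the host you are debugging.
    print(first_failure("subdomain.example.com"))
```

In the story above, the timeout from the local machine would show up at the "tcp" step (the IP whitelisting), and the bad certificate at the "tls" step.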
I write such stories on software engineering.
There's no specific frequency, as I don't make these up.
If you liked this one, you might love - Migrating Terabytes of metrics data with zero downtime.
Follow me - Chinmay Naik for more such stuff, straight from the production oven!