Had another minor troubleshooting adventure recently! Not quite as exciting or far-ranging as my NTP story, but it was still fun.
I get pulled into a call. Something happened over the weekend, and I’m getting too much information from too many people. They’re asking me about load balancers, WAN circuits, a bunch of firewalls, and more. Not sure yet what actually happened. I have them slow down a bit and give me one thing which is affected. I confirm the path, and it goes through one firewall, over a WAN link to another datacenter, through another firewall, then to the server. No load balancers! It also skipped several other firewalls people had been asking about.
I ask for more problematic traffic flows. Eventually, I find an even simpler one: in one interface on the second firewall and out another interface on that same firewall. No load balancers. No routers. Just switches and one firewall.
I start packet captures on all the relevant interfaces. While waiting for a report of issues between those endpoints, we realize something interesting. You see, a bunch of separate teams have all been troubleshooting issues on their own. They’ve been keeping track of the issue timings separately. When we get enough together, we see that while the affected flows are rarely the same, *something* is happening every 20 minutes like clockwork.
The owner of the systems I’m watching reports they’ve had a traffic issue, so I kill the captures and pull them down. After a few seconds in the client-side capture, I’ve found the problem. Sure enough, some traffic just goes missing. Client sends some data, server doesn’t respond for several seconds. I look at the capture on the server-side interface. It’s different. The server sends the packet which was missing on the client-side capture. And the server-side capture doesn’t have some packets the client sent. So the firewall is definitely dropping or severely delaying some traffic. What gives?
Shortly before the issue is expected to occur, I start a debug which gives me a brief message about every packet the firewall drops along with a timestamp. I wait a few minutes, issues are reported, issues go away, and I kill it.
Copy it down, open it up, and I start by just scrolling through. The timestamps zip by, then I start seeing a lot of the same second. Then a *LOT* of the next second. I let it run for about five minutes, and 95% of the file has timestamps from a 15 second span. Huh.
This particular firewall brand works like a load balancer and a bunch of tiny firewalls in a box. It has worker processes which actually handle applying the rules, inspecting the data, and so on, and it has dispatcher processes which pick which traffic goes to which worker. A lot of the drops are problems enqueueing packets for a particular worker. And it’s all the same worker.
I look for that worker in the debug, and I find about 200k drops, all in the span of 15 seconds. None of the drops are for flows reporting problems, though. I do a little RegEx witchery to get just the source IP, destination IP, and destination port, then sort the result and pipe it to ‘uniq -c’ to get a feel for heavy hitters. Lots of tuples which appear one or two times, then two which appear 98,000 times each. It’s UDP 514 traffic from one client to a few destinations which were decommissioned some time ago. So this client is basically DoSing the firewall worker process, which prevents it from handling the other flows which are directed to it. The firewall has around 65 of these workers, so that explains why the issue is so sporadic! It only affects whatever ~1.5% of traffic is on the affected worker, and that changes all the time.
UDP port 514 is used for syslog. While waiting on someone from the team which owns the client sending all this traffic, I run another capture. I limit it to 500 packets, and I look for any traffic from that client to the two destinations it is trying to reach. It runs for a few minutes, then stops. I pull down the file and open it up.
Sure enough, it’s syslog traffic. It’s all from a sendmail process complaining it has no more space available in the filesystem which contains the mail spool. It can’t move messages around or accept any more mail.
The team which owns it joins and confirms they don’t need it, because they never actually finished setting it up. They approve shutting it off. 20 minutes later, no reports of problems! An hour later, and it’s officially declared resolved.
Lessons learned (really, reinforced):
• Document, document, document! Troubleshooting is a science, and that means you should follow the scientific method. Document your observations, tests you made, and what the tests told you. Document changes to your variables.
• Isolate a simple part of the issue and focus on that. If you find an even simpler version, change your focus to that simpler version. Once you learn what’s going on there, you can usually generalize it to the broader problem.
• Set up monitoring for your stuff early. Ideally before you set up any services or cronjobs on it.
• Decommission stuff you don’t actually use.
• It can be really hard to tell when an issue is just your problem versus when it’s something broader. If you’re dealing with weird, sporadic issues, record the timing and see if anybody else in similar teams has weird, sporadic issues. See if they mesh in any way.