2008-12-01 – Weird network bug

I just spent a couple of hours debugging a strange network problem, which I haven’t seen before.

The initial description was that email relayed from one departmental email server to another via my email server sometimes failed with a timeout error. This only occurred with messages with multiple recipients.

My initial suspicion was that it might be related to an old (now fixed) Exim callout verification bug, which had caused us some problems in the past. If a callout’s target server was slow, and there were many recipients needing callout verification, and the client used pipelining, the total time for the callout delays could add up to more than the client’s timeout, so it wouldn’t see Exim’s pipelined replies. Exim now flushes its replies before doing a callout to avoid this problem.

However, some testing showed that the relevant servers were quite swift, which cast some doubt on this diagnosis.

The problem report included some good details which were unfortunately too old for me to be able to correlate with our logs. I had a look to see if there was anything more recent that I could examine. It turned out that there were a couple of messages on our queues that had been delayed by this problem. I ran a test delivery of one of them and it did in fact time out between us sending the message envelope and them sending the reply. Consistent with my guess, but not consistent with other testing.

I tried a delivery with full debugging from Exim, but there was no enlightenment. I tried to send a copy of the problem message’s envelope manually. This worked. Um, WTF? What was the difference between what Exim did and what I did?

I ran tcpdump to analyze the two connections. It turned out that Exim was sending the whole envelope (MAIL, RCPT, RCPT, …, DATA) in one packet, whereas my cut-and-paste into telnet split the envelope into a packet for the MAIL command and one for the rest. So I created a file containing the envelope and used a little shell script to drive telnet. Bingo. I had reproduced the problem.

At this point I thought it might be an MTU-related problem, e.g. blocking ICMP “Fragmentation Needed” messages. I tried pinning down the packet size at which the problem started occurring, with a guess of something related to 512 bytes. After a couple of goes I noticed that the problem message had an envelope of 513 bytes, and an envelope of 512 bytes worked OK.

Then I tried a larger envelope, and - boggle - it worked. 512 or less worked, 513 failed, 514 or greater worked. This also explained why the problem affected message envelopes but not message data. The usual symptom of MTU firewall problems is timeouts at the message data stage, not during the envelope, and in this case message data was getting through fine. (The overhead of message headers makes it very unlikely that a message will have as little as 513 bytes of data.)

A couple of friends suggested testing 1025 as well, and it demonstrated similar but weirder problems. Again 1024 worked, 1025 failed, and 1026 worked. However in the failure case, the timeout occurred after I received the destination’s replies to the first half of the envelope - enough commands to fit in 512 bytes. A closer look at both the 513 and 1025 cases revealed that I was getting TCP ACKs for all the data I sent, but something was going wrong after that.

I guess the problem is a firewall that’s doing some kind of TCP interception and re-segmentation, and getting it wrong. The ACKs would therefore have been generated by the firewall and not by the actual destination. Someone needs to be given a good whipping with a copy of end-to-end arguments in system design before having their firewall programming licence revoked.

⇐ 2008-11-18 ⇐ Licence revoked ⇐ ⇒ RJ45 is too fat ⇒ 2009-01-16 ⇒