Today I rolled out a significant improvement to the automatic recovery system on Cambridge University's recursive DNS servers. This change was prompted by three bugs.
BIND RPZ catatonia
The first bug is that sometimes BIND will lock up for a few seconds doing RPZ maintenance work. This can happen with very large and frequently updated response policy zones such as the Spamhaus Domain Block List.
When this happens on my servers, keepalived starts a failover process after a couple of seconds - it is deliberately configured to respond quickly. However, BIND soon recovers, so a few seconds later keepalived fails back.
BIND lost listening socket
This brief keepalived flap has an unfortunate effect on BIND. It sees the service addresses disappear, so it closes its listening sockets, then the service addresses reappear, so it tries to reopen its listening sockets.
Now, because the server is fairly busy, it doesn't have time to clean up all the state from the old listening socket before BIND tries to open the new one, so BIND gets an "address already in use" error.
Sadly, BIND gives up at this point - it does not keep trying periodically to reopen the socket, as you might hope.
Holy health check script, Batman!
At this point BIND is still listening on most of the interface addresses, except for a TCP socket on the public service IP address. Ideally this should have been spotted by my health check script, which should have told keepalived to fail over again.
But there's a gaping hole in the health checker's coverage: it only tests the loopback interfaces!
In a fix
Ideally all three of these bugs should be fixed. I'm not expert enough to fix the BIND bugs myself, since they are in some of the gnarliest bits of the code, so I'll leave them to the good folks at ISC.org. Even if they are fixed, I still need to fix my health check script so that it actually checks the user-facing service addresses, and there's no-one else I can leave that to.
Previously...
I wrote about my setup for recursive DNS server failover with keepalived when I set it up a couple of years ago. My recent work leaves the keepalived configuration basically unchanged, and concentrates on the health check script.
For the purpose of this article, the key feature of my keepalived configuration is that it runs the health checker script many times per second, in order to fake up dynamically reconfigurable server priorities. The old script did DNS queries inline, which was OK when it was only checking loopback addresses, but the new script needs to make typically 16 queries, which is getting a bit much.
Daemonic decoupling
The new health checker is split in two.
The script called by keepalived now just examines the contents of a status file, so it runs predictably fast regardless of the speed of DNS responses.
There is a separate daemon which performs the actual health checks, and writes the results to the status file.
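As a rough illustration, the keepalived side of the split can be as dumb as reading the status file and turning it into an exit status. This is only a sketch with details made up for the example - the path, the "OK <unix-timestamp>" file format, and the staleness check are inventions, and the real script also plays games to fake up the dynamic priorities described above:

    # Sketch of a keepalived-side check script that only reads a status
    # file maintained by a separate health-check daemon, so it returns
    # quickly no matter how slow DNS is being.
    # Assumed details: path and "OK <unix-timestamp>" file format.

    import sys
    import time

    STATUS_FILE = "/run/dns-health/status"   # hypothetical location
    MAX_AGE = 10  # seconds; a stale file means the daemon is wedged

    def main():
        try:
            with open(STATUS_FILE) as f:
                state = f.read().split()
        except OSError:
            sys.exit(1)                       # no status file: report failure
        if len(state) < 2 or state[0] != "OK":
            sys.exit(1)
        try:
            if time.time() - float(state[1]) > MAX_AGE:
                sys.exit(1)                   # daemon stopped updating
        except ValueError:
            sys.exit(1)                       # mangled status file
        sys.exit(0)

    if __name__ == "__main__":
        main()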
The speed thing is nice, but what is really important is that the daemon is naturally stateful in a way the old health checker could not be. When I started I knew statefulness was necessary because I clearly needed some kind of hysteresis or flap damping or hold-down or something.
This is much more complex
https://www.youtube.com/watch?v=DNb4VKln1uw
There is this theory of the Möbius: a twist in the fabric of space where time becomes a loop
BIND observes the list of network interfaces, and opens and closes listening sockets as addresses come and go.
The health check daemon verifies that BIND is responding properly on all the network interface addresses.
keepalived polls the health checker and brings interfaces up and down depending on the results.
Without care it is inevitable that unexpected interactions between these components will destroy the Enterprise!
Winning the race
The health checker gets into races with the other daemons when interfaces are deleted or added.
The deletion case is simpler. The health checker gets the list of addresses, then checks them all in turn. If keepalived deletes an address during this process then the checker can detect a failure - but actually, it's OK if we don't get a response from a missing address! Fortunately there is a distinctive error message in this case which the health checker can treat as an alternative successful response.
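To make that concrete, here is a sketch of a check that treats a vanished address as an alternative success. It assumes the checker binds its test query to the address under test, so that an address keepalived has just deleted shows up immediately as EADDRNOTAVAIL ("Cannot assign requested address") - the exact mechanism and error in the real script may differ, and the query construction here is purely illustrative (IPv4 only):

    import errno
    import socket

    def check_address(addr, timeout=1.0):
        """True if addr answers DNS, or if addr has just gone away."""
        # Minimal DNS query: header with RD set, question "." IN NS.
        query = bytes.fromhex("0000010000010000000000000000020001")
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
                s.bind((addr, 0))          # fails fast if addr was deleted
                s.settimeout(timeout)
                s.sendto(query, (addr, 53))
                s.recvfrom(512)
                return True                # got an answer: healthy
        except OSError as e:
            if e.errno == errno.EADDRNOTAVAIL:
                return True                # address gone: alternative success
            return False                   # timeout or other error: unhealthy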
New interfaces are more tricky, because the health checker needs to give BIND a little time to open its sockets. It would be really bad if the server appeared to be healthy, so keepalived brought up the addresses, then the health checker tested them before BIND was ready and immediately declared a failure - a huge flap.
Back off
The main technique that the new health checker uses to suppress flapping is exponential backoff.
Normally, when everything is working, the health checker queries every network interface address, writes an OK to the status file, then sleeps for 1 second before looping.
When a query fails, it immediately writes BAD to the status file, and sleeps for a while before looping. The sleep time increases exponentially as more failures occur, so repeated failures cause longer and longer intervals before the server tries to recover.
Exponential backoff handles my original problem somewhat indirectly: if there's a flap that causes BIND to lose a listening socket, there will then be a (hopefully short) series of slower and slower flaps until eventually a flap is slow enough that BIND is able to re-open the socket and the server recovers. I will probably have to tune the backoff parameters to minimize the disruption in this kind of event.
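In outline the daemon's main loop looks something like the sketch below, where check_all() and write_status() stand in for the real query and reporting code, the backoff cap is an invented number, and the success branch is simplified (the next section refines it):

    import time

    MIN_SLEEP = 1.0     # normal interval between rounds of checks
    MAX_SLEEP = 64.0    # cap on the exponential backoff (made-up value)

    def run(check_all, write_status):
        sleep = MIN_SLEEP
        while True:
            if check_all():                  # query every interface address
                write_status("OK")           # e.g. "OK <timestamp>", atomically
                sleep = MIN_SLEEP            # simplified; see "Hold down"
            else:
                write_status("BAD")
                # each further failure doubles the pause before the next
                # recovery attempt, up to the cap
                sleep = min(sleep * 2, MAX_SLEEP)
            time.sleep(sleep)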
Hold down
Another way to suppress flapping is to avoid false recoveries.
When all the test queries succeed, the new health checker decreases the failure sleep time, rather than zeroing it, so if more failures occur the exponential backoff can continue. It still reports the success immediately to keepalived, because I want true recoveries to be fast, for instance if the server accidentally crashes and is restarted.
The hold-down mechanism is linked to the way the health checker keeps track of network interface addresses.
After an interface goes away the checker does not decrease the sleep time for several seconds even if the queries are now working OK. This hold-down is supposed to cover a flap where the interface immediately returns, in which case we want exponential backoff to continue.
Similarly, to avoid those tricky races, we also record the time when each interface is brought up, so we can ignore failures that occur in the first few seconds.
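A sketch of that bookkeeping, with names and timings invented for illustration: the daemon tracks when each address appeared and when any address last disappeared, so a failure on a freshly added address can be ignored, and a successful round shortly after an address vanished does not shrink the backoff.

    import time

    GRACE = 5.0        # ignore failures this soon after an address appears
    HOLD_DOWN = 10.0   # keep the backoff this long after an address vanishes

    class AddressTracker:
        def __init__(self):
            self.up_since = {}        # address -> time it (re)appeared
            self.last_removal = 0.0   # time any address last went away

        def update(self, current_addrs):
            now = time.time()
            for addr in current_addrs:
                self.up_since.setdefault(addr, now)
            for addr in list(self.up_since):
                if addr not in current_addrs:
                    del self.up_since[addr]
                    self.last_removal = now

        def in_grace(self, addr):
            # probably just BIND still opening its sockets: don't count it
            return time.time() - self.up_since.get(addr, 0.0) < GRACE

        def in_hold_down(self):
            # an address just went away; if this is a flap, keep backing off
            return time.time() - self.last_removal < HOLD_DOWN

On a successful round the daemon then decreases the failure sleep time only when in_hold_down() is false, and it does not count failures against addresses for which in_grace() is true. How much it decreases by each time is a tunable detail; the point is that it shrinks gradually rather than resetting.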
Result
It took quite a lot of headscratching and trial and error, but in the end I think I came up with something reasonably simple. Rather than targeting it specifically at failures I have observed in production, I have tried to use general purpose robustness techniques, and I hope this means it will behave OK if some new weird problem crops up.
Actually, I hope NO new weird problems crop up!
PS. the ST:TNG quote above is because I have recently been listening to my old Orbital albums again - https://www.youtube.com/watch?v=RlB-PN3M1vQ