This afternoon I reckon I was six deep in a stack of yaks that I needed to shave to finish this job, and four of them turned up today. I feel like everything I try to do reveals some undiscovered problem that needs fixing…
When the network is a bit broken, my DNS servers soon stop being able to provide answers, because the most popular sites insist on tiny TTLs so they can move fast and break things.
As a result the DNS gets the blame for network problems, and helpdesk issues get misdirected, and confusion reigns.
Serve-Stale to the rescue! It was implemented towards the end of last year in BIND and is a feature of the 9.12 releases.
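For reference, turning it on only takes a few lines of named.conf; here is a minimal sketch using the serve-stale options from the 9.12 ARM (the TTL values are illustrative, not what I plan to run in production):

    options {
        // hand out answers from the cache even after their TTLs have
        // expired, when the authoritative servers cannot be reached
        stale-answer-enable yes;

        // how long an expired record may be retained for stale answers
        max-stale-ttl 86400;

        // the TTL to attach to the stale answers we serve
        stale-answer-ttl 30;
    };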
Let’s deploy it! First attempt in March with 9.12.1.
CVE-2018-5737 appears!
Roll back!
The logging is too noisy for production, so we need to wait for 9.12.2, which includes a separate logging category for serve-stale.
Time passes…
Deploy 9.12.2 earlier this week, more carefully.
Let’s make sure everything is sorted before we turn on serve-stale again! (Now we get to today.)
The logging settings need revising: serve-stale is enough of a shove to make it worth reviewing other noisy log categories.
Can we leave most of them off most of the time, and use the built-in default_debug channel to let us turn them on when necessary? This means the debug 1 level needs to be not completely appalling. Let’s try it!
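Something like this sketch is what I have in mind; the channel and file names, and my pick of resolver and lame-servers as the noisy categories, are illustrative rather than my real config:

    logging {
        // normal operation: a modest general log
        channel main_log {
            file "/var/log/named/named.log" versions 5 size 20m;
            severity info;
            print-time yes;
            print-category yes;
        };
        category default     { main_log; };
        category serve-stale { main_log; };   // new category in 9.12.2

        // noisy categories go only to the built-in default_debug
        // channel, which writes to named.run and only says anything
        // when the server's debug level is above zero
        category resolver     { default_debug; };
        category lame-servers { default_debug; };
    };

Then rndc trace raises the debug level (and rndc notrace drops it back to zero), so the chatty stuff only hits the disk while I’m actually looking at it.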
Hmm, this RPZ debug log looks a bit broken. Let’s fix it!
Two little patches, one cosmetic, one a possible minor bug fix.
Need to rebase my hack branch onto master to test the patches.
Fix dratted merge conflicts.
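The mechanics are the usual dance (branch and file names here are placeholders):

    # bring my hack branch up to date with upstream
    git fetch origin
    git checkout hack
    git rebase origin/master

    # for each conflict: fix up the file, then
    git add lib/dns/whatever.c
    git rebase --continue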
Build patched server!
Build fails :-( why?
No enlightenment from commit logs.
Sigh, let’s git bisect the build system to work out which commit broke things…
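The bisect itself is routine once it’s automated; roughly this, where the known-good tag and the build command are just examples:

    # current master fails to build; the last release tag built fine
    git bisect start
    git bisect bad master
    git bisect good v9_12_2

    # let git drive the search: the script's exit status marks each
    # candidate commit as good (0) or bad (non-zero)
    git bisect run sh -c './configure --quiet && make -s'

    git bisect reset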
Success! The culprit is found!
Submit bug report.
Work around bug, and get a successful build!
Test patched server!
The little patches seem OK, but while repeatedly restarting the server, a more worrying bug turns up!
Sometimes when the server starts, my monitoring queries get stuck with SERVFAIL responses when they should succeed! Why?
Really don’t want this to be anything that might affect production, so it needs investigation.
Turn off noisy background activity, and reproduce the problem with a simpler query stream. It’s still hard to characterize the bug.
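By “simpler query stream” I mean something like a shell loop poking the server with dig and logging the rcode of each response (the query name and server address here are placeholders):

    # one simple query per second; record the time and the rcode
    while true; do
        rcode=$(dig @127.0.0.1 example.org A +tries=1 +time=2 +noall +comments |
                sed -n 's/.*status: \([A-Z]*\).*/\1/p')
        echo "$(date +%T) ${rcode:-timeout}"
        sleep 1
    done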
I’ll need to test this in a less weird and more easily reconfigured server than my toy server. Let’s spin up a VM.
Dammit, my virtualbox setup was broken by the jessie -> stretch upgrade!
Work out that this is because virtualbox is no longer included in stretch and the remnants from jessie are not compatible with the stretch kernel.
Reinstall virtualbox direct from Oracle. It now works again.
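Roughly, for anyone else hitting this: purge the jessie leftovers and install from Oracle’s apt repository (the package names and repository line below are illustrative; check the VirtualBox download page for the current ones):

    # clear out the remnants of the jessie-era packages
    sudo apt-get purge virtualbox virtualbox-dkms virtualbox-qt

    # add Oracle's repository and signing key for stretch
    echo 'deb [arch=amd64] https://download.virtualbox.org/virtualbox/debian stretch contrib' |
        sudo tee /etc/apt/sources.list.d/virtualbox.list
    wget -qO- https://www.virtualbox.org/download/oracle_vbox_2016.asc |
        sudo apt-key add -

    sudo apt-get update
    sudo apt-get install virtualbox-5.2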
Install BIND on the new VM with a simplified version of my toy config. Reproduce the bug.
Is it related to serve-stale? No. QNAME minimization? No. RPZ? No.
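The elimination was just a matter of turning each suspect off in the VM’s config and retesting; roughly this (the RPZ zone name is a placeholder, and the qname-minimization option only exists in builds that have the feature, which arrived during the 9.13 development series):

    options {
        // suspect 1: serve-stale -- turned off, bug still there
        stale-answer-enable no;

        // suspect 2: QNAME minimization -- turned off, bug still there
        qname-minimization off;

        // suspect 3: RPZ -- disabled by removing the response-policy
        // clause entirely; bug still there
        // response-policy { zone "rpz.example"; };
    };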
After much headscratching and experimentation, enlightenment slowly, painfully dawns.
Submit bug report.
Actually, writing the bug report, and especially testing the unfounded assertions and guesses I made as I wrote it, was a key part of pinning down this weirdness.
I think this is one of the most obscure DNS interoperability problems I have investigated!
OK, that’s it for now. I still have two patches to submit and a revised logging configuration to finalize, so that I can put serve-stale into production, which should make it easier in some situations for my colleagues to tell the difference between a network problem and a DNS problem.