This afternoon I reckon I was six deep in a stack of yaks that I needed to shave to finish this job, and four of them turned up today. I feel like everything I try to do reveals some undiscovered problem that needs fixing…
-
When the network is a bit broken, my DNS servers soon stop being able to provide answers, because the most popular sites insist on tiny TTLs so they can move fast and break things.
As a result, the DNS gets the blame for network problems, helpdesk issues get misdirected, and confusion reigns.
-
Serve-Stale to the rescue! It was implemented towards the end of last year in BIND and is a feature of the 9.12 releases.
-
Let’s deploy it! First attempt in March with 9.12.1.
-
CVE-2018-5737 appears!
Roll back!
-
The logging is too noisy for production, so we need to wait for 9.12.2, which includes a separate logging category for serve-stale.
-
Time passes…
-
Deploy 9.12.2 earlier this week, more carefully.
-
Let’s make sure everything is sorted before we turn on serve-stale again! (Now we get to today.)
-
The logging settings need revising: serve-stale is enough of a shove to make it worth reviewing other noisy log categories.
-
Can we leave most of them off most of the time, and use the default_debug channel to let us turn them on when necessary?
-
This means the debug 1 level needs to be not completely appalling. Let’s try it!
-
Hmm, this RPZ debug log looks a bit broken. Let’s fix it!
-
Two little patches, one cosmetic, one a possible minor bug fix.
-
Need to rebase my hack branch onto master to test the patches.
-
Fix dratted merge conflicts.
-
Build patched server!
-
Build fails :-( why?
-
No enlightenment from commit logs.
-
Sigh, let’s git bisect the build system to work out which commit broke things…
-
While the workstation churns away repeatedly building BIND, let’s get coffee!
-
Success! The culprit is found!
-
Submit bug report
-
Work around bug, and get a successful build!
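The bisection a few steps back is, at heart, a binary search over the commit history. Here is a minimal sketch of that idea, assuming a linear history whose oldest commit builds and whose newest does not. The function name, the commit labels and the builds_ok callback are hypothetical stand-ins; in practice git bisect (optionally driven by git bisect run) does this bookkeeping for you.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str],
                     builds_ok: Callable[[str], bool]) -> str:
    """Return the earliest commit whose build fails.

    Assumes (hypothetically) that `commits` is ordered oldest to newest,
    that commits[0] builds, that commits[-1] does not, and that once the
    build breaks it stays broken.
    """
    good, bad = 0, len(commits) - 1
    # Invariant: commits[good] builds, commits[bad] does not.
    while bad - good > 1:
        mid = (good + bad) // 2
        if builds_ok(commits[mid]):
            good = mid
        else:
            bad = mid
    return commits[bad]


# Toy history: ten made-up commits, with the build broken from c6 onwards.
history = [f"c{i}" for i in range(10)]
culprit = first_bad_commit(history, lambda c: int(c[1:]) < 6)
print(culprit)
```

Each probe stands for a full rebuild, so the number of builds grows with the logarithm of the number of commits rather than linearly, which is what makes a coffee break long enough.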
-
Test patched server!
-
The little patches seem OK, but while repeatedly restarting the server, a more worrying bug turns up!
Sometimes when the server starts, my monitoring queries get stuck with SERVFAIL responses when they should succeed! Why?
-
Really don’t want this to be anything that might affect production, so it needs investigation.
-
Turn off noisy background activity, and reproduce the problem with a simpler query stream. It’s still hard to characterize the bug.
-
I’ll need to test this in a less weird and more easily reconfigured server than my toy server. Let’s spin up a VM.
-
Damnit, my virtualbox setup was broken by the jessie -> stretch upgrade!
-
Work out that this is because virtualbox is no longer included in stretch and the remnants from jessie are not compatible with the stretch kernel.
-
Reinstall virtualbox direct from Oracle. It now works again.
-
Install BIND on the new VM with a simplified version of my toy config. Reproduce the bug.
-
Is it related to serve-stale? no. QNAME minimization? no. RPZ? no.
-
After much headscratching and experimentation, enlightenment slowly, painfully dawns.
-
Submit bug report
Actually, the writing of the bug report, and especially the testing of the unfounded assertions and guesses as I wrote it, was a key part of pinning down this weirdness.
I think this is one of the most obscure DNS interoperability problems I have investigated!
-
OK, that’s it for now. I still have two patches to submit, and a revised logging configuration to finalize, so I can put serve-stale into production, so I can make it easier in some situations for my colleagues to tell the difference between a network problem and a DNS problem.
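For a rough idea of where this is heading, here is a minimal named.conf sketch, not the actual production configuration. The serve-stale option names and the default_debug channel are the ones I believe the BIND 9.12 documentation describes, the serve-stale category is the one mentioned above as arriving in 9.12.2, and the values and the choice of categories are illustrative assumptions to check against the ARM for your version.

```
options {
    // Serve-stale knobs; the values are illustrative assumptions.
    stale-answer-enable yes;   // answer from expired cache entries when lookups fail
    stale-answer-ttl 1;        // TTL in seconds placed on stale answers
    max-stale-ttl 604800;      // keep expired records for up to a week
};

logging {
    // default_debug is BIND's built-in debug channel. It only produces
    // output when the server's debug level is non-zero, so these
    // categories stay quiet until someone runs "rndc trace 1".
    category serve-stale { default_debug; };
    category rpz { default_debug; };
};
```

With this shape, the noisy categories cost nothing day to day; rndc trace 1 turns them on while investigating and rndc notrace turns them off again.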