Last week I rolled out my new DNS servers. It was reasonably successful - a few snags but no showstoppers.
Authoritative DNS rollout playbook
I have already written about scripting the recursive DNS rollout. I also used Ansible for the authoritative DNS rollout. I set up the authdns VMs with different IP addresses and hostnames (which I will continue to use for staging/testing purposes); the rollout process was:
- Stop the Solaris Zone on the old servers using my zoneadm Ansible module;
- Log into the staging server and add the live IP addresses;
- Log into the live server and delete the staging IP addresses;
- Update the hostname.
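On the Linux side, the interesting part amounts to a handful of commands like these, run on the new server once the old Solaris Zone is down (a sketch only: the addresses, prefix length, and interface name are illustrative, and the real work was done by Ansible tasks):

ip addr add 131.111.12.37/24 dev eth0     # take over the live service address
ip addr del 131.111.12.137/24 dev eth0    # drop the staging address
hostname authdns1.csx.cam.ac.uk           # take over the live hostname (and update /etc/hostname)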
There are a couple of tricks with this process.
You need to send a gratuitous ARP to get the switches to update their forwarding tables quickly when you move an IP address. Solaris does this automatically but Linux does not, so I used an explicit arping -U command. On Debian/Ubuntu you need the iputils-arping package to get a version of arping that can send gratuitous ARPs. (The arping package is not the one you want; thanks to Peter Maydell for helping me find the right one!)
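The invocation is something like this, run right after adding the live address (same illustrative address as above):

arping -U -c 3 -I eth0 131.111.12.37    # announce the moved address so neighbours update their ARP caches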
If you remove a "primary" IPv4 address from an interface on Linux, it also deletes all the other IPv4 addresses on the same subnet. This is not helpful when you are renumbering a machine. To avoid this problem you need to set sysctl net.ipv4.conf.eth0.promote_secondaries=1.
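That is, the staging-address deletion in the sketch above needs to be preceded by something like this (addresses illustrative again):

sysctl net.ipv4.conf.eth0.promote_secondaries=1    # keep the other addresses when the primary goes
ip addr del 131.111.12.137/24 dev eth0             # now this no longer takes the live address with it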
Pre-rollout configuration checking
The BIND configuration on my new DNS servers is rather different to the old ones, so I needed to be careful that I had not made any mistakes in my rewrite. Apart from re-reading configurations several times, I used a couple of tools to help me check.
bzl
I used bzl, the BIND zone list tool by JP Mens to get the list of configured zones from each of my servers. This helped to verify that all the differences were intentional.
The new authdns servers both host the same set of zones, which is the union of the zones hosted on the old authdns servers. The new servers have identical configs; the old ones did not.
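The comparison itself was nothing fancier than sorted lists and diff (file names are made up; the zone lists come from bzl):

sort -u old-authdns0.zones old-authdns1.zones > old-zones    # union of the old servers' zones
sort -u new-authdns.zones > new-zones
diff -u old-zones new-zones    # every difference here should be one I can explain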
The new recdns servers differ from the old ones mainly because I have been a bit paranoid about avoiding queries for martian IP address space, so I have lots of empty reverse zones.
nsdiff
I used my tool nsdiff to verify that the new DNS build scripts produce the same zone files as the old ones. (Except for the HINFO records, which the new scripts omit.)
(This is not quite an independent check, because nsdiff is part of the new DNS build scripts.)
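A typical check looked something like this (zone name and file paths illustrative); nsdiff emits an nsupdate script describing the differences, so no output means the two builds agree:

nsdiff cam.ac.uk old-build/cam.ac.uk new-build/cam.ac.uk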
Announcement
On Monday I sent out the DNS server upgrade announcement, with some wording improvements suggested by my colleagues Bob Dowling and Helen Sargan.
It was rather arrogant of me to give the expected outage times without any allowance for failure. In the end I managed to hit 50% of the targets.
The order of rollout had to be recursive servers first, since I did not want to swap the old authoritative servers out from under the old recursive servers. The new recursive servers get their zones from the new hidden master, whereas the old recursive servers get them from the authoritative servers.
The last server to be switched was authdns0, because that was the old master server, and I didn't want to take it down without being fairly sure I would not have to roll back.
ARP again
The difference in running time between my recdns and authdns scripts bothered me, so I investigated and discovered that IPv4 was partially broken. Rob Bricheno helped by getting the router's view of what was going on. One of my new Linux boxes was ARPing for a testdns IP address, even after I had deconfigured it!
I fixed it by rebooting, after which it continued to behave correctly through a few rollout / backout test runs. My guess is that the problem was caused when I was getting gratuitous ARPs working - maybe I erroneously added a static ARP entry.
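With hindsight I could have checked for (and cleared) a stray static entry without rebooting, along these lines (address illustrative):

ip neigh show dev eth0                 # static entries show up flagged PERMANENT
ip neigh del 131.111.12.137 dev eth0   # delete the stray entry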
After that all switchovers took about 5 - 15 seconds. Nice.
Status checks
I wrote a couple of scripts for checking rollout status and progress. wheredns tells me where each of our service addresses is running (old or new); pingdns repeatedly polls a server. I used pingdns to monitor when service was lost and when it returned during the rollout process.
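pingdns is nothing clever; its guts amount to a loop along these lines (a simplified sketch rather than the real script):

#!/bin/sh
# poll a DNS server once a second and log when it stops and starts answering
server=$1
while :
do
    if dig +time=1 +tries=1 @"$server" cam.ac.uk soa >/dev/null
    then echo "$(date +%T) $server ok"
    else echo "$(date +%T) $server NO ANSWER"
    fi
    sleep 1
done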
Step 1: recdns1
On Tuesday shortly after 18:00, I switched over recdns1. This is our busier recursive server, running at about 1500 - 2000 queries per second during the day.
This rollout went without a hitch, yay!
Afterwards I needed to reduce the logging because it was rather too noisy. The logging on the old servers was rather too minimal for my tastes, but I turned up the verbosity a bit too far in my new configuration.
Step 2a: recdns0
On Wednesday morning shortly after 08:00, I switched over recdns0. It is a bit less busy, running about 1000 - 1500 qps.
This did not go so well. For some reason Ansible appeared to hang when connecting to the new recdns cluster to push the updated keepalived configuration.
Unfortunately my back-out scripts were not designed to cope with a partial rollout, so I had to restart the old Solaris Zone manually, and recdns0 was unavailable for a minute or two.
Mysteriously, Ansible connected quickly outside the context of my rollout scripts, so I tried the rollout again and it failed in the same way.
As a last try, I ran the rollout steps manually, which worked OK although I don't type as fast as Ansible runs a playbook.
So in all there was about 5 minutes downtime.
I'm not sure what went wrong; perhaps I just needed to be a bit more patient...
Step 2b: authdns1
After doing recdns0 I switched over authdns1. This was a bit less stressy since it isn't directly user-facing. However it was also a bit messy.
The problem this time was that I forgot to uncomment authdns1 in the Ansible inventory (its list of hosts). Actually, I should not have needed to uncomment it manually - I should have scripted that step. The silly thing is that I had the testdns servers in the inventory for testing the authdns rollout scripts, and they had been causing me some benign irritation (connection failures) when running Ansible over the previous week or so. I should not have ignored that irritation: I should have automated it away, as I did with the recdns rollout script.
Anyway, after a partial rollout and manual rollback, it took me a few ansible-playbook --check runs to work out why Ansible was saying "host not found". The problem was due to the Jinja expansion in the following remote command, where the "to" variable was set to "authdns1.csx.cam.ac.uk", which was not in the inventory.
ip addr add {{hostvars[to].ipv6}}/64 dev eth0
You can reproduce this with a command like,
ansible -m debug -a 'msg={{hostvars["funted"]}}' all
After fixing that, by uncommenting the right line in the inventory, the rollout worked OK.
The other post-rollout fix was to ensure all the secondary zones had transferred OK. I had not managed to get all of our masters to add my staging servers to their ACLs, but this was not too hard to sort out using the BIND 9.10 JSON statistics server and the lovely jq command. Secondary zones which have not yet transferred show up with serial 4294967295, so the following finds them and tells BIND to refresh them:
curl http://authdns1.csx.cam.ac.uk:853/json | jq -r '.views[].zones[] | select(.serial == 4294967295) | .name' | xargs -n1 rndc -s authdns1.csx.cam.ac.uk refresh
After that, I needed to reduce the logging again, because the authdns servers get a whole different kind of noise in the logs!
Lurking bug: rp_filter
One mistake sneaked out of the woodwork on Wednesday, with fortunately small impact.
My colleague Rob Bricheno reported that client machines on 131.111.12.0/24 (the same subnet as recdns1) were not able to talk to recdns0, 131.111.8.42. I could see the queries arriving with tcpdump, but they were being dropped somewhere in the kernel.
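Watching them arrive was a one-liner along these lines (interface and filter illustrative):

tcpdump -ni em1 udp port 53 and dst host 131.111.8.42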
Malcolm Scott helpfully suggested that this was due to Linux reverse path filtering on the new recdns servers, which are multihomed on both subnets. Peter Benie advised me of the correct setting,
sysctl net.ipv4.conf.em1.rp_filter=2
(A value of 2 is "loose" reverse path filtering: the source address only has to be routable via some interface, not necessarily the one the packet arrived on, which is what you want on a multihomed box.)
Step 3: authdns0
On Thursday evening shortly after 18:00, I did the final switch-over of authdns0, the old master.
This went fine, yay! (Actually, more like 40s than the expected 15s, but I was patient, and it was OK.)
There was a minor problem that I forgot to turn off the old DNS update cron job, so it bitched at us a few times overnight when it failed to send updates to its master server. Poor lonely cron job.
One more thing
Over the weekend my email servers complained that some of their zones had not been refreshed recently. This was because four of our RFC 1918 private reverse DNS zones had not been updated since before the switch-over.
There is a slight difference in the cron job timings on the old and new setups: previously updates happened at 59 minutes past the hour, now they happen at 53 minutes past (same as the DNS port number, for fun and mnemonics). Both setups use Unix time serial numbers, so they were roughly in sync, but because of the cron schedule the old servers ended up with serial numbers about 300 higher.
BIND on my mail servers was refusing to refresh these zones because its copies, transferred from the old servers, had higher serial numbers than the new servers were serving.
I did a sneaky nsupdate add and delete on the relevant zones to update their serial numbers and everything is happy again.
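The bump was roughly the following for each affected zone (a sketch: the zone, record name, and key file are illustrative, and by default nsupdate sends the update to the server named in the zone's SOA MNAME):

nsupdate -k /etc/bind/ddns.key <<EOF
zone 10.in-addr.arpa
update add serial-bump.10.in-addr.arpa 300 TXT "bump"
send
update delete serial-bump.10.in-addr.arpa TXT
send
EOF

Each update bumps the serial, and the add followed by the delete leaves the zone contents unchanged.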
To conclude
They say a clever person can get themselves out of situations a wise person would not have got into in the first place. I think the main wisdom to take away from this is not to ignore minor niggles, and to write rollout/rollback scripts that can work forwards or backwards after being interrupted at any point. I won against the niggles on the ARP problem, but lost against them on the authdns inventory SNAFU.
But in the end it pretty much worked, with only a few minutes downtime and only one person affected by a bug. So on the whole I feel a bit like Mat Ricardo.