Recursive DNS rollout plan - and backout plan!

The last couple of weeks have been a bit slow, being busy with email and DNS support, an unwell child, and surprise 0day. But on Wednesday I managed to clear the decks so that on Thursday I could get down to some serious rollout planning.

My aim is to do a forklift upgrade of our DNS servers - a tier 1 service - with negligible downtime, and with a backout plan in case of fuckups.

Solaris Zones

Our existing DNS service is based on Solaris Zones. The nice thing about this is that I can quickly and safely halt a zone - which stops the software and unconfigures the network interface - and if the replacement does not work I can boot the zone again - which brings the interface and the software back up.

Even better, the old servers have a couple of test zones which I can bounce up and down without a care. These give me enormous freedom to test my migration scripts without worrying about breaking things and with a high degree of confidence that my tests are very similar to the real thing.

Testability gives you confidence, and confidence gives you productivity.

Before I started setting up our new recursive DNS servers, I ran zoneadm -z testdns* halt on the old servers so that I could use the testdns addresses for developing and testing our keepalived setup. So I had the testdns zones in reserve for developing and testing the rollout/backout scripts.

Rollout plans

The authoritative and recursive parts of the new setup are quite different, so they require different rollout plans.

On the authoritative side we will have a virtual machine for each service address. I have not designed the new authoritative servers for any server-level or network-level high availability, since the DNS protocol should be able to cope well enough. This is similar in principle to our existing Solaris Zones setup. The vague rollout plan is to set up new authdns servers on standby addresses, then renumber them to take over from the old servers. This article is not about the authdns rollout plan.

On the recursive side, there are four physical servers, any of which can host any of the recdns or testdns addresses, managed by keepalived. The vague rollout plan is to disable a zone on the old servers, then enable its service address on the keepalived cluster.

Ansible - configuration vs orchestration

So far I have been using Ansible in a simple way as a configuration management system, treating it as a fairly declarative language for stating what the configuration of my servers should be, and then being able to run the playbooks to find out and/or fix where reality differs from intention.

But Ansible can also do orchestration: scripting a co-ordinated sequence of actions across disparate sets of servers. Just what I need for my rollout plans!

When to write an Ansible module

The first thing I needed was a good way to drive zoneadm from Ansible. I have found that using Ansible as a glorified shell script driver is pretty unsatisfactory, because its shell and command modules are too general to provide proper support for its idempotence and check-mode features. Rather than messing around with shell commands, it is much more satisfactory (in terms of reward/effort) to write a custom module.

My zoneadm module does the bare minimum: it runs zoneadm list -pi to get the current state of the machine's zones, checks if the target state matches the current state, and if not it runs zoneadm boot or zoneadm halt as required. It can only handle zone states that are "installed" or "running". 60 lines of uncomplicated Python, nice.
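The core of that logic can be sketched roughly as follows. This is not the actual module: the function names, the check_mode flag, and the injectable run parameter are illustrative assumptions, and the Ansible module boilerplate is left out. It relies only on the first three colon-separated fields (zoneid, zonename, state) of the parseable zoneadm list output.

```python
import subprocess

# The sketch, like the module described above, only understands these states.
SUPPORTED = ("installed", "running")


def parse_zone_states(listing):
    """Map zone name -> state from `zoneadm list -pi` output.

    Parseable lines are colon-separated, starting zoneid:zonename:state:...
    """
    states = {}
    for line in listing.splitlines():
        fields = line.split(":")
        if len(fields) >= 3:
            states[fields[1]] = fields[2]
    return states


def plan(current, target):
    """Return "boot", "halt", or None when the zone is already as wanted."""
    for state in (current, target):
        if state not in SUPPORTED:
            raise ValueError("unsupported zone state: %s" % state)
    if current == target:
        return None
    return "boot" if target == "running" else "halt"


def ensure_zone(name, target, check_mode=False, run=subprocess.check_output):
    """Bring one zone to the target state; report whether a change was needed.

    `run` is a hypothetical hook so the logic can be tested without zoneadm.
    """
    states = parse_zone_states(run(["zoneadm", "list", "-pi"]).decode())
    if name not in states:
        raise ValueError("no such zone: %s" % name)
    action = plan(states[name], target)
    if action is not None and not check_mode:
        run(["zoneadm", "-z", name, action])
    return {"changed": action is not None, "action": action}
```

Comparing current and target state before acting is what makes the operation idempotent, and returning the planned action without running it is all that check mode needs.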

Start stupid and expect to fail

After I had a good way to wrangle zones it was time to do a quick hack to see if a trial rollout would work. I wrote the following playbook, which does three things: it moves the testdns1 zone from running to installed, changes the Ansible configuration to enable testdns1 on the keepalived cluster, then pushes the new keepalived configuration to the cluster.

---
- hosts: helen2.csi.cam.ac.uk
  tasks:
    - zoneadm: name=testdns1 state=installed
- hosts: localhost
  tasks:
    - command: bin/vrrp_toggle rollout testdns1
- hosts: rec
  roles:
    - keepalived

This is quick and dirty, hardcoded all the way, except for the vrrp_toggle command which is the main reality check.

The vrrp_toggle script just changes the value of an Ansible variable called vrrp_enable which lists which VRRP instances should be included in the keepalived configuration. The keepalived configuration is generated from a Jinja2 template, and each vrrp_instance (testdns1 etc.) is emitted if the instance name is not commented out of the vrrp_enable list.
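As a rough illustration of the toggling step, here is a minimal sketch. It assumes the variable lives in a YAML file containing only the vrrp_enable list, one instance per line, with disabled instances commented out using "#". That layout and the function names are guesses for the sake of the example, not the real script.

```python
import re


def set_instance(vars_text, instance, enabled):
    """Comment or uncomment one entry of the (assumed) vrrp_enable list."""
    pattern = re.compile(r"^([ \t]*)(#[ \t]*)?-[ \t]*%s[ \t]*$" % re.escape(instance))
    out = []
    found = False
    for line in vars_text.splitlines():
        match = pattern.match(line)
        if match:
            found = True
            indent = match.group(1)
            if enabled:
                line = "%s- %s" % (indent, instance)
            else:
                line = "%s# - %s" % (indent, instance)
        out.append(line)
    if not found:
        raise KeyError(instance)
    return "\n".join(out) + "\n"


def enabled_instances(vars_text):
    """Names of list entries that are not commented out."""
    return re.findall(r"^[ \t]*-[ \t]*(\S+)[ \t]*$", vars_text, re.M)
```

With that layout, set_instance(text, "testdns1", True) would correspond to a rollout and set_instance(text, "testdns1", False) to a backout.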

Fail.

Ansible does not re-read variables if you change them in the middle of a playbook like this. Good. That is the right thing to do.

The other way in which this playbook is stupid is that there are actually 8 of them: a rollout and a backout for each of the 2 recdns and 2 testdns addresses. Writing them individually is begging for typos; repeated code that is similar but systematically different is one of the most common ways to introduce bugs.

Learn from failure

So the right thing to do is tweak the variable then run the playbook. And note the vrrp_toggle command arguments describe almost everything you need to know to generate the playbook! (The only thing missing is the mapping from instance name (like testdns1) to parent host (like helen2).)

So I changed the vrrp_toggle script into a rec-rollout / rec-backout script, which tweaks the vrrp_enable variable and generates the appropriate playbook. The playbook consists of just two tasks, whose order depends on whether we are doing rollout or backout, and which have a few straightforward place-holder substitutions.
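The generation step might look something like the sketch below. It is an illustration under assumptions, not the real script: the PARENT_HOST table contains only the one pairing mentioned in this article, the function and template names are made up, and I am assuming that backout is simply rollout in reverse (reconfigure keepalived first, then boot the zone again).

```python
# Hypothetical mapping from instance name to the old server hosting its zone.
# Only the testdns1 entry comes from the article; a real table needs them all.
PARENT_HOST = {"testdns1": "helen2.csi.cam.ac.uk"}

ZONE_PLAY = """\
- hosts: {host}
  tasks:
    - zoneadm: name={instance} state={state}
"""

KEEPALIVED_PLAY = """\
- hosts: rec
  roles:
    - keepalived
"""


def make_playbook(action, instance):
    """Return playbook text for rolling one instance out or backing it out."""
    host = PARENT_HOST[instance]
    if action == "rollout":
        # Halt the old zone first, then bring the address up on the cluster.
        zone = ZONE_PLAY.format(host=host, instance=instance, state="installed")
        plays = [zone, KEEPALIVED_PLAY]
    elif action == "backout":
        # Assumed reverse order: release the address, then boot the old zone.
        zone = ZONE_PLAY.format(host=host, instance=instance, state="running")
        plays = [KEEPALIVED_PLAY, zone]
    else:
        raise ValueError("action must be rollout or backout: %s" % action)
    return "---\n" + "".join(plays)
```

Because all eight playbooks come from the same two templates, a mistake in a template shows up in every case at once, including the testdns ones.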

The nice thing about this kind of templating is that if you screw it up (like I did at first), usually a large proportion of the cases fail, probably including your test cases; whereas with clone-and-hack there will be a nasty surprise in a case you didn't test.

Consistent and quick rollouts

In the playbook I quoted above I am using my keepalived role, so I can be absolutely sure that my rollout/backout plan remains consistent with my configuration management setup. Nice!

However the keepalived role does several configuration tasks, most of which are not necessary in this situation. In fact all I need to do is copy across the templated configuration file and tell keepalived to reload it if the file has changed.

Ansible tags are for just this kind of optimization. I added a line to my keepalived.conf task:

    tags: quick

Only one task needed tagging because the keepalived.conf task has a handler to tell keepalived to reload its configuration when that changes, which is the other important action. So now I can run my rollout/backout playbooks with a --tags quick argument, so only the quick tasks (and if necessary their handlers) are run.

Result

Once I had got all that working, I was able to easily flip testdns0 and testdns1 back and forth between the old and new setups. Each switchover takes about ten seconds, which is not bad - it is less than a typical DNS lookup timeout.

There are a couple more improvements to make before I do the rollout for real. I should improve the molly guard to make better use of ansible-playbook --check. And I should pre-populate the new servers' caches with the Alexa Top 1,000,000 list to reduce post-rollout latency. (If you have a similar UK-centric popular domains list, please tell me so I can feed that to the servers as well!)