A weird BIND DNSSEC resolution bug, with a fix.

The central recursive DNS servers in Cambridge act as stealth slaves for most of our local zones, and we recommend this configuration for other local DNS resolvers. This has the slightly odd effect that the status bits in answers have AD (authenticated data) set for most DNSSEC-signed zones, except for our local ones, which have AA (authoritative answer) set. This is not a very big deal since client hosts should do their own DNSSEC validation and ignore any AD bits they get over the wire.

It is a bit more of a problem for the toy nameserver I run on my workstation. As well as being my validating resolver, it is also the master for my personal zones, and it slaves some of the Cambridge zones. This mixed recursive / authoritative setup is not really following modern best practices, but it's OK when I am the only user, and it makes experimental playing around easier. Still, I wanted it to validate answers from its authoritative zones, especially because there's no security on the slave zone transfers.

I had been putting off this change because I thought the result would be complicated and ugly. But last week one of the BIND developers, Mark Andrews, posted a description of how to validate slaved zones to the dns-operations list, and it turned out to be reasonably OK - no need to mess around with special TSIG keys to get queries from one view to another.

The basic idea is to have one view that handles recursive queries and which validates all its answers, and another view that holds the authoritative zones and which only answers non-recursive queries. The recursive view has "static-stub" zone configurations mirroring all of the zones in the authoritative view, to redirect queries to the local copies.

Here's a simplified version of the configuration I tried out. To make it less annoying to maintain, I wrote a script to automatically generate the static-stub configurations from the authoritative zones.

  view rec {
    match-recursive-only yes;
    zone cam.ac.uk         { type static-stub; server-addresses { ::1; }; };
    zone private.cam.ac.uk { type static-stub; server-addresses { ::1; }; };
  };

  view auth {
    recursion no;
    allow-recursion { none; };
    zone cam.ac.uk         { type slave; file "cam";  masters { ucam; }; };
    zone private.cam.ac.uk { type slave; file "priv"; masters { ucam; }; };
  };
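
The generator script itself is not shown here. As a rough idea of what such a script has to do, here is a minimal Python sketch. It assumes the zone names are already available as a plain list; the function name and its parameters are made up for illustration, and pulling the names out of the auth view is left out.

```python
def static_stub_view(zones, view="rec", server="::1"):
    """Render a recursive view with one static-stub zone per authoritative zone.

    `zones` is a list of zone names. How that list is obtained (for example
    by reading the auth view) is outside the scope of this sketch.
    """
    # Pad zone names so the generated clauses line up like hand-written ones.
    width = max((len(z) for z in zones), default=0)
    lines = [f"view {view} {{", "  match-recursive-only yes;"]
    for zone in zones:
        lines.append(
            f"  zone {zone.ljust(width)} "
            f"{{ type static-stub; server-addresses {{ {server}; }}; }};"
        )
    lines.append("};")
    return "\n".join(lines)


if __name__ == "__main__":
    print(static_stub_view(["cam.ac.uk", "private.cam.ac.uk"]))
```

Run on the two zones above, this prints a view in the same shape as the hand-written rec view.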

This seemed to work fine, until I tried to resolve names in private.cam.ac.uk - then I got a server failure. In my logs was the following (which I have slightly abbreviated):

  client ::1#55687 view rec: query: private.cam.ac.uk IN A +E (::1)
  client ::1#60344 view auth: query: private.cam.ac.uk IN A -ED (::1)
  client ::1#54319 view auth: query: private.cam.ac.uk IN DS -ED (::1)
  resolver: DNS format error from ::1#53 resolving private.cam.ac.uk/DS:
    Name cam.ac.uk (SOA) not subdomain of zone private.cam.ac.uk -- invalid response
  lame-servers: error (FORMERR) resolving 'private.cam.ac.uk/DS/IN': ::1#53
  lame-servers: error (no valid DS) resolving 'private.cam.ac.uk/A/IN': ::1#53
  query-errors: client ::1#55687 view rec:
    query failed (SERVFAIL) for private.cam.ac.uk/IN/A at query.c:7435

You can see the original recursive query that I made, then the resolver querying the authoritative view to get the answer and validate it. The situation here is that private.cam.ac.uk is an unsigned zone, so a DNSSEC validator has to check its delegation in the parent zone cam.ac.uk and get a proof that there is no DS record, to confirm that it is OK for private.cam.ac.uk to be unsigned. Something is going wrong with BIND's attempt to get this proof of nonexistence.

When BIND gets a non-answer it has to classify it as a referral to another zone or an authoritative negative answer, as described in RFC 2308 section 2.2. It is quite strict in its sanity checks: in particular, it checks that the SOA record refers to the expected zone. This check often discovers problems with misconfigured DNS load balancers which are given a delegation for www.example.com but which think their zone is example.com, leading them to hand out malformed negative responses to AAAA queries.
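
To make the check concrete, here is a simplified Python sketch of the idea. It is not BIND's code: the function names are invented, names are compared as plain dotted strings, and the real resolver looks at more than this one condition. It only captures the rule that the owner name of the SOA in a negative answer must be at or below the zone the resolver thinks it is querying.

```python
def labels(name):
    """Split a domain name into lower-cased labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_subdomain(name, zone):
    """True if `name` is equal to `zone` or lies beneath it."""
    n, z = labels(name), labels(zone)
    # Compare whole labels from the right, so notcam.ac.uk is not under cam.ac.uk.
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def negative_answer_soa_ok(soa_owner, expected_zone):
    """Sanity check: the SOA owner must be at or below the expected zone."""
    return is_subdomain(soa_owner, expected_zone)


if __name__ == "__main__":
    # The load balancer case: delegated www.example.com, answers as example.com.
    print(negative_answer_soa_ok("example.com", "www.example.com"))
    # A well-behaved negative answer from the expected zone.
    print(negative_answer_soa_ok("cam.ac.uk", "cam.ac.uk"))
```

With the names from the log above, an SOA owned by cam.ac.uk checked against an expected zone of private.cam.ac.uk fails this test, which matches the "not subdomain of zone" complaint.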

This negative answer SOA sanity check is what failed in the above log extract. Very strange - the resolver seems to be looking for the private.cam.ac.uk DS record in the private.cam.ac.uk zone, not the cam.ac.uk zone, so when it gets an answer from the cam.ac.uk zone it all goes wrong. Why is it looking in the wrong place?

In fact the same problem occurs for the cam.ac.uk zone itself, but in this case the bug turns out to be benign:

  client ::1#16276 view rec: query: cam.ac.uk IN A +E (::1)
  client ::1#65502 view auth: query: cam.ac.uk IN A -ED (::1)
  client ::1#61409 view auth: query: cam.ac.uk IN DNSKEY -ED (::1)
  client ::1#51380 view auth: query: cam.ac.uk IN DS -ED (::1)
  security: client ::1#51380 view auth: query (cache) 'cam.ac.uk/DS/IN' denied
  lame-servers: error (chase DS servers) resolving 'cam.ac.uk/DS/IN': ::1#53

You can see my original recursive query, and the resolver querying the authoritative view to get the answer and validate it. But it sends the DS query to itself, not to the name servers for the ac.uk zone. When this query fails, BIND re-tries by working down the delegation chain from the root, and this succeeds so the overall query and validation works despite tripping up.

This bug is not specific to the weird two-view setup. If I revert to my old configuration, without views, and just slaving cam.ac.uk and private.cam.ac.uk, I can trigger the benign version of the bug by directly querying for the cam.ac.uk DS record:

  client ::1#30447 (cam.ac.uk): query: cam.ac.uk IN DS +E (::1)
  lame-servers: error (chase DS servers) resolving 'cam.ac.uk/DS/IN': 128.232.0.18#53

In this case the resolver sent the upstream DS query to one of the authoritative servers for cam.ac.uk, and got a negative response from the cam.ac.uk zone apex per RFC 4035 section 3.1.4.1. This did not fail the SOA sanity check but it did trigger the fall-back walk down the delegation chain.

In the simple slave setup, queries for private.cam.ac.uk do not fail because they are answered from authoritative data without going through the resolver. If you change the zone configurations from slave to stub or static-stub then the resolver is used to answer queries for names in those zones, and so queries for private.cam.ac.uk explode messily as BIND tries really hard (128 times!) to get a DS record from all the available name servers but keeps checking the wrong zone.

I spent some time debugging this on Friday evening, which mainly involved adding lots of logging statements to BIND's resolver to work out what it thought it was doing. Much confusion and headscratching and eventually understanding.

BIND has some functions called findzonecut() which take an option saying whether the caller wants the child side or the parent side of a zone cut. This works OK for dns_db_findzonecut() which looks in the cache, but dns_view_findzonecut() gets it wrong. This function works out whether to look for the name in a locally-configured zone, and if so which one, or otherwise in the cache, or otherwise to work down from the root hints. In the case of a locally-configured zone it ignores the option and always returns the child side of the zone cut. This causes the resolver to look for DS records in the wrong place, hence all the breakage described above.
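
The intended behaviour can be illustrated with a small Python model. This is neither BIND's C code nor the patch: `find_zone` and its helpers are hypothetical, and the model only covers the choice among locally-configured zones. The point is that a DS query for a zone apex belongs to the parent side of the cut, so the zone whose apex equals the query name must be skipped.

```python
def labels(name):
    """Split a domain name into lower-cased labels, ignoring a trailing dot."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def is_subdomain(name, zone):
    """True if `name` is equal to `zone` or lies beneath it."""
    n, z = labels(name), labels(zone)
    return len(n) >= len(z) and n[len(n) - len(z):] == z


def find_zone(qname, qtype, zones):
    """Pick the closest enclosing local zone for a query, or None.

    None stands for "not covered by a local zone", where a resolver would
    fall back to the cache or the root hints.
    """
    best = None
    for zone in zones:
        if not is_subdomain(qname, zone):
            continue
        # DS lives on the parent side of the cut: skip the zone whose apex is qname.
        if qtype.upper() == "DS" and labels(zone) == labels(qname):
            continue
        if best is None or len(labels(zone)) > len(labels(best)):
            best = zone
    return best


if __name__ == "__main__":
    local = ["cam.ac.uk", "private.cam.ac.uk"]
    print(find_zone("private.cam.ac.uk", "A", local))
    print(find_zone("private.cam.ac.uk", "DS", local))
    print(find_zone("cam.ac.uk", "DS", local))
```

In this model the A query for private.cam.ac.uk goes to private.cam.ac.uk, the DS query goes to cam.ac.uk, and the DS query for cam.ac.uk matches no local zone at all, since its parent ac.uk is not configured locally.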

I worked out a patch to fix this DS record resolution problem, and I have sent details of the bug and my fix to bind9-bugs@isc.org. And I now have a name server that correctly validates its authoritative zones :-)