.@ Tony Finch – blog


I haven't got round to setting up proper performance monitoring for our DNS servers yet, so I have been making fairly ad-hoc queries against the BIND statistics channel and pulling out numbers with jq.

Last week I changed our new DNS setup for more frequent DNS updates. As part of this change I reduced the TTL on all our records from one day to one hour. The obvious question was, how would this affect the query rate on our servers?

So I wrote a simple monitoring script. The first version did,

    while sleep 1
    do
        fetch-and-print-stats
    done

But the fetch-and-print-stats part took a significant fraction of a second, so the queries-per-second numbers were rather bogus.

A better way to do this is to run `sleep` in the background, while you fetch-and-print-stats in the foreground. Then you can wait for the sleep to finish and loop back to the start. The loop should take almost exactly a second to run (provided fetch-and-print-stats takes less than a second). This is pretty similar to an alarm()/wait() sequence in C. (Actually no, that's bollocks.)

My dnsqps script also abuses `eval` a lot to get a shonky Bourne shell version of associative arrays for the per-server counters. Yummy.

So now I was able to get queries-per-second numbers from my servers, what was the effect of dropping the TTLs? Well, as far as I can tell from eyeballing, nothing. Zilch. No visible change in query rate. I expected at least some kind of clear increase, but no.

The current version of my dnsqps script is:

    #!/bin/sh
    
    while :
    do
        sleep 1 & # set an alarm
        
        for s in "$@"
        do
	    total=$(curl --silent http://$s:853/json/v1/server |
                    jq -r '.opcodes.QUERY')
            eval inc='$((' $total - tot$s '))'
            eval tot$s=$total
            printf ' %5d %s' $inc $s
        done
        printf '\n'
        
        wait # for the alarm
    done