A Curious Case of Intermittent BIND 9 System Test Failures

In early 2018, we started using GitLab as the development platform for BIND 9. One of the reasons for that move was to take advantage of GitLab’s continuous integration (CI) capabilities to test every change to the source code before merging it. Once we made BIND’s test suite and the CI environment play nicely with each other, the testing process became stable enough to warrant requiring every merge request to pass the CI process before being accepted. However, stabilizing the test process is not a one-time effort - it is more of a never-ending game of whack-a-mole, as any change, especially one affecting core named components, can introduce subtle variations in behavior, causing some checks to fail intermittently. We also keep adding new tests all the time. Thus, not all issues can be detected before a change is merged, and sometimes it takes hundreds or thousands of test runs to trigger a specific failure mode. We investigate such issues as they appear; most of them turn out to be imperfections in the test scripts themselves, but some are caused by genuine BIND bugs.

Recently, one change turned out to be particularly disruptive for the stability of the BIND 9 test suite: refactoring the task manager. This should come as no surprise, given that the way this part of the code operates essentially defines how the numerous tasks inside named are scheduled and performed. We started going through all the test failures and quickly noticed that one CI machine stood out in terms of system test error counts. A quick look at a few of the failed test jobs on the problematic machine revealed that some named instances on it suffered from inexplicable, several-second stalls at seemingly random times. This machine was configured with the same CI job concurrency settings as other CI machines and was similar (though not identical) to them in terms of CPU core count, memory, and storage characteristics, yet no other machine exhibited the same problem. We were worried that the increased test error rate on that machine was a symptom of some bug in the refactored task manager code that only caused issues in certain environments.

First, we tried manually reproducing the failures observed in the CI environment by moving the problematic host out of the CI pool and running the tests on it by hand. At first, all the tests appeared to be working fine, which was slightly confusing. However, we quickly noticed that worst-case CI loads were higher than the load we had been generating manually, and we subsequently managed to confirm the issue by upping the concurrency level of the BIND system test runs.
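
For illustration, a crude way to approximate that kind of load manually is to start several independent test runs at once. The run-system-tests.sh wrapper and the per-run worktree-N directories below are hypothetical stand-ins for however a single system test run is normally launched:

    # Approximate worst-case CI load by starting a dozen system test runs at
    # once, each from its own copy of the source tree so the runs do not
    # interfere with one another; ./run-system-tests.sh is a hypothetical
    # stand-in for whatever command starts a single run.
    for i in $(seq 1 12); do
        (cd "worktree-$i" && ./run-system-tests.sh > testrun.log 2>&1) &
    done
    wait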

Since the problem could apparently only be triggered under high load (and even then reproduction was intermittent), we wanted to get a better idea of the circumstances in which these mysterious stalls were happening, hoping that we could attribute them to a specific piece of code (either in the test scripts or in the C code being tested). Thus, we prepared a “quick and dirty” script to find and list large gaps between consecutive lines in named log files. While this approach was obviously quite crude, as some named instances used in system tests are expected to remain idle for extended periods of time, it allowed us to compare how specific named instances behaved during failed test jobs on the problematic host with how their counterparts behaved on properly working CI machines. It soon turned out that most stalls were apparently happening when named was writing data to disk, but that was it in terms of specificity - the stalls were appearing in a vast number of different situations.
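
As a rough sketch of that gap-finding approach: the 5-second threshold, the ns1/named.run path, and the assumption that every line starts with a default-format named timestamp (e.g. "28-Feb-2022 12:30:40.968") are all illustrative, and gawk is needed for mktime.

    # Print any gap longer than 5 seconds between consecutive log lines,
    # assuming each line starts with a default-format named timestamp.
    awk '
    BEGIN {
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= 12; i++) month[names[i]] = i
    }
    {
        split($1, d, "-"); split($2, t, ":")
        ts = mktime(d[3] " " month[d[2]] " " d[1] " " t[1] " " t[2] " " int(t[3]))
        ts += t[3] - int(t[3])
        if (NR > 1 && ts - prev > 5)
            printf "%.3f-second gap before line %d: %s\n", ts - prev, NR, $0
        prev = ts
    }' ns1/named.run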

This was a surprising find because the problematic machine was equipped with fast NVMe SSDs and, while running BIND 9 system tests involves quite a lot of small-block I/O, it should not be nearly enough to make a modern server choke, even with a dozen test jobs being run in parallel. Thus, to test the theory that the I/O subsystem was to blame, we performed a number of test runs on the problematic machine with all I/O confined to tmpfs mounts. Not a single test failed. After switching back to normal storage, the tests began failing again.
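
A minimal sketch of how such a tmpfs-confined run can be set up (the mount point, the size, and the source directory below are illustrative):

    # Confine all test I/O to a RAM-backed filesystem so that nothing the
    # tests write ever touches the SSDs; paths and size are illustrative.
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
    cp -a bind9 /mnt/ramdisk/
    cd /mnt/ramdisk/bind9    # build and run the system tests from here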

These results confirmed that some I/O issue was indeed the likely culprit. We could have permanently switched to tmpfs mounts for running system tests on the problematic machine, but that would make the test environment less realistic, so we still wanted to figure out what was wrong. We went back to comparing the properly working CI machines with the problematic one, concentrating on I/O-related matters.

This eventually drew our attention to the mount options used on the problematic machine, specifically the discard option, which was set for the / mount point on the problematic machine but not on the other CI machines. Let the man page speak:

discard
       Disable/enable the discard mount option. The discard function issues frequent commands
       to let the block device reclaim space freed by the filesystem. This is useful for SSD
       devices, thinly provisioned LUNs and virtual machine images, but may have a significant
       performance impact. (The fstrim command is also available to initiate batch trims from
       userspace.)

Apparently the warning in the excerpt quoted above is not overstated because, lo and behold, running mount -o remount,nodiscard / solved the problem - the tests no longer failed. It looks like the SSDs used in the problematic machine did not cope well with at least some of the waves of TRIM commands sent by the operating system when a large number of BIND system tests were run concurrently. Once we confirmed that the problem resurfaced after the discard mount option was re-enabled, we became confident that removing the discard option from the /etc/fstab entry for the / mount point was enough to fix the problems we were observing.
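
For reference, the runtime change, a quick way to verify it, and an illustrative /etc/fstab line without the option look roughly like this (the UUID and the ext4 filesystem type are placeholders); as the man page notes, periodic batch trims via fstrim remain available once continuous discard is turned off:

    # Drop the discard flag at runtime and confirm it is gone.
    mount -o remount,nodiscard /
    findmnt -o TARGET,OPTIONS /

    # Illustrative /etc/fstab entry with the discard option removed
    # (UUID and filesystem type are placeholders):
    # UUID=xxxx-xxxx  /  ext4  defaults  0  1

    # If trimming is still desired, batch trims can be scheduled instead,
    # e.g. with systemd's fstrim timer:
    systemctl enable --now fstrim.timer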

The last piece of the puzzle was to determine how just one machine ended up with the discard mount option set. The answer turned out to be simple: all the other machines were provisioned automatically by the hosting provider whose machines we were using for CI purposes, while the problematic one was set up manually by a member of the BIND team. Coincidentally, the problematic machine was added to the CI pool around the time the task manager was refactored, which prevented us from noticing its unusually high test failure rate earlier.

Solving this issue allowed us to retain confidence in the way the BIND task manager was refactored. While the problem turned out to be caused by a system configuration issue and not directly by BIND 9 itself, we still plan to make our CI environment more varied than it is now, in the hopes that this will increase the chances of finding bugs before the code changes causing them are even merged. While we are aware that we will never achieve a 100% success rate in this department, we can - and will - still try to get close to it.
