Finding the “Second Bug” in glibc’s Condition Variable

by Malte Skarupke

I continue to have no time for big programming projects, so here is a short blog post. Two years ago I looked into a bug in the glibc implementation of condition variables: Sometimes pthread_cond_signal() doesn’t do anything, which can easily hang your program. The bug is still not fixed, partially because a mitigation patch was available right away that seemed to make it go away. Except that people kept on showing up in the bug report saying that they still hit the bug sometimes, raising the suspicion that there might be a second bug. I finally got around to looking into this. I found that the mitigation patch only helps a little, it’s still the same bug, and the patch I submitted (unreviewed, don’t use yet) would actually fix it.

As I mentioned last time, one of the affected programming languages is OCaml. Its master lock occasionally doesn’t notify a sleeper because sometimes pthread_cond_signal() doesn’t do anything. And then the whole process hangs forever, because that’s what happens when someone doesn’t get woken up in cooperative multithreading.

Checking this code in TLA+ happens to be easier than what I was checking last time, because this bug results in a deadlock, which can be checked for quickly. Last time I had to write temporal formulas, which make TLA+ run slowly, but a deadlock is easy to find: Find a state that has no successor states. All TLA+ has to do is enumerate all states. So whenever you can, try to write your TLA+ code so that it causes a deadlock on error.

The OCaml masterlock can be directly translated into the PlusCal language of TLA+. It just wraps a normal mutex and adds a count of waiters, to make cooperative multithreading slightly more efficient: You can quickly check if anyone actually wants the lock, so you don’t have to give it up if nobody is waiting. For our purposes we don’t even need the yield function; st_masterlock_acquire() and st_masterlock_release() ended up being enough. Then all we have to do is call those in a loop on multiple threads. Here is the code in PlusCal (it’s calling the mutex and condition-variable functions we wrote last time):

procedure acquire_masterlock()
{
acquire_masterlock_start:
  call lock_mutex();
acquire_masterlock_loop:
  while (busy)
  {
    waiters := waiters + 1;
    call cv_wait();
  acquire_lock_after_wait:
    waiters := waiters - 1;
  };
acquire_masterlock_after_loop:
  busy := TRUE;
  call unlock_mutex();
  return;
}

procedure release_masterlock()
{
release_masterlock_start:
  call lock_mutex();
release_masterlock_after_lock:
  busy := FALSE;
  call unlock_mutex();
release_masterlock_signal:
  call cv_signal();
  return;
};

fair process (Proc \in Procs)
variable num_loops = MaxNumLoops;
{
proc_start:
  while (num_loops > 0)
  {
    num_loops := num_loops - 1;
    either { call acquire_masterlock(); }
    or     { goto proc_done; };
  proc_loop_done:
    call release_masterlock();
  };
proc_done:
  skip;
}

If you click on the github link, you will see that this is a direct translation. But the point is that it’s fairly straightforward code. Now we just have to run this with the magic number 3 (three Procs and MaxNumLoops=3), and three hours later we have a trace that shows how you can hang even with the mitigation patch. (full reproduce steps at the bottom of this blog post)
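
In case you don’t want to click through, the C being modeled looks roughly like this. This is a paraphrase from memory of the masterlock in the OCaml runtime, not a verbatim copy; exact names, types, and details may differ between OCaml versions:

#include <pthread.h>

/* Rough paraphrase of the OCaml runtime's master lock (not a verbatim copy). */
typedef struct {
  pthread_mutex_t lock;
  pthread_cond_t is_free;
  int busy;      /* "busy" in the PlusCal model above */
  int waiters;   /* "waiters" in the PlusCal model above */
} st_masterlock;

static void st_masterlock_acquire(st_masterlock *m)
{
  pthread_mutex_lock(&m->lock);
  while (m->busy) {
    m->waiters++;
    pthread_cond_wait(&m->is_free, &m->lock); /* sleeps forever if the wakeup is lost */
    m->waiters--;
  }
  m->busy = 1;
  pthread_mutex_unlock(&m->lock);
}

static void st_masterlock_release(st_masterlock *m)
{
  pthread_mutex_lock(&m->lock);
  m->busy = 0;
  pthread_mutex_unlock(&m->lock);
  pthread_cond_signal(&m->is_free); /* the call that sometimes does nothing */
}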

One reason why I didn’t find this problem last time is that the trace is even more complicated. The mitigation patch does help for smaller traces, with fewer threads or smaller numbers for MaxNumLoops. To hit the bug you need a very specific long interleaving. This will be gibberish unless you have spent many hours studying the glibc condition variable code:

  1. Find the magic interleaving that makes you hit the “potential steal” code path in pthread_cond_wait(), discussed last time
  2. Run the pthread_cond_broadcast() from the mitigation patch while no other thread is waiting, so that it will early-out without doing anything
  3. Then signal again and trigger a call to quiesce_and_switch_g1()
  4. At the same time as step 3 have a thread wait, consuming the leftover signal from the “potential steal” in step 1, then wait again
  5. Now have step 3 finish with g_size=2 (because of the two waits) for the new g1, then signal after the switch, leaving g_size=1
  6. Then have step 4 finish the second wait, consuming the signal from step 5

You’re now left with g_size=1 for the new g1 even though nobody is waiting on it. The next call to pthread_cond_signal() just reduces g_size to 0 without waking anyone. Any signal after that will work again.
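
To make that last step concrete, here is a toy model of the signal path. This is not the glibc source: the struct layout and the helper function are made-up simplifications (the real code uses atomics, packs extra flag bits into g_signals, and has more bookkeeping), but it shows how stale g_size accounting is enough to strand a signal:

#include <stdbool.h>
#include <stdio.h>

/* Toy condition variable state: two waiter groups, as in glibc's design. */
struct toy_condvar {
    int g1;            /* which of the two groups is currently being signaled */
    int g_size[2];     /* how many unsignaled waiters the condvar thinks G1 has */
    int g_signals[2];  /* signals posted to each group, consumed by waiters */
};

/* Stand-in for quiesce_and_switch_g1(): pretend there is nothing to switch to. */
static bool toy_quiesce_and_switch_g1(struct toy_condvar *cv) { (void)cv; return false; }

static void toy_cond_signal(struct toy_condvar *cv)
{
    int g1 = cv->g1;
    if (cv->g_size[g1] != 0 || toy_quiesce_and_switch_g1(cv)) {
        cv->g_signals[g1]++;  /* post the signal to group g1... */
        cv->g_size[g1]--;     /* ...and account for one fewer waiter to signal */
        /* the real code then futex-wakes g_signals[g1]; if nobody sleeps on
           that futex, the wake is a no-op */
    }
}

int main(void)
{
    /* The broken state from the trace: g_size claims there is still one
       waiter to signal in g1, but the only real waiter is in the other group. */
    struct toy_condvar cv = { .g1 = 0, .g_size = {1, 0}, .g_signals = {0, 0} };
    toy_cond_signal(&cv);
    printf("g_signals = {%d, %d}\n", cv.g_signals[0], cv.g_signals[1]);
    /* Prints g_signals = {1, 0}: the signal went to the empty group, so the
       waiter in the other group stays asleep. */
    return 0;
}

The point is just that pthread_cond_signal() decides where to deliver based on g_size, not based on whether anyone is actually sleeping in that group.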

This is still really complicated for me, so I’m not sure exactly where the problem is, but it is suspicious that the pthread_cond_broadcast() in my trace would just early-out because nobody was waiting, so that the mitigation patch doesn’t result in any change to the state. If that can happen, TLA+ just had to find an interleaving where the lingering signal from the “potential steal” causes problems later.

Why didn’t I find this last time? Because I stopped looking after I had found the initial problem, and after the mitigation patch made the problem go away with my trace. Turns out I needed to run an even bigger job to find the problem with the mitigation patch.

So now the state of the glibc condition variable is:

  1. It mostly works except occasionally a thread can take too long to wake up, which will cause it to steal a signal from a later wakeup. (and then the later thread doesn’t wake)
  2. To fix that the code tries to detect whether we potentially stole a signal. But the response to that can leave the condition variable in a broken state, so that a later pthread_cond_signal() will signal on the wrong futex, causing sleepers to not be woken.
  3. To fix that there is a mitigation patch which various distributions are running. But that mitigation patch has two issues:
    • it doesn’t work for the reasons that the author thinks it does (discussed last time)
    • it sometimes doesn’t do anything, which allows the broken state to cause problems later. (discovered in this blog post)

Both of these fixes make the issue less likely to happen, but they also make it harder to understand (and debug). What’s the solution? I submitted a patch for this already a year ago: Fix the start of the chain of errors by broadening the scope of grefs, so that no waiter can ever be missed when we close a group. Then the “potential stealing” code path becomes unnecessary, the bug in that path goes away, and the mitigation patch becomes unnecessary as well. The implementation also becomes simpler to reason about: My “patch” is actually a series of six patches in which the first one fixes the bug and the remaining five just clean up code. (but a word of warning: this patch is unreviewed and I wouldn’t be at all surprised if there is still a problem with it. This is tricky code. See comment by Ricardo below)

If anyone has specific steps I can take to get this into glibc, I will try to do them. Last time I gave up a bit too easily because I got burned out on this problem after having spent way too many hours debugging it. Actually getting the fix in was too much extra work as a project to do in my spare time. But it’s been two years, so my burnout on this particular problem is gone and I’m game. As long as there are steps that actually work to get this in.

Appendix: Detailed Steps to Reproduce

  1. Download TLA+ https://github.com/tlaplus/tlaplus/releases/tag/v1.7.1#latest-tla-files
  2. Run the toolbox
  3. Click “File -> Open Spec -> Add New Spec” then select ocaml_mutex.tla
  4. Click “TLC Model Checker -> New Model”, click OK
  5. In the model under
    • What is the behavior spec: leave as default “Spec”
    • What is the model?
      • UseSpinlockForCVMutex <- TRUE
      • AlwaysSetGrefs <- FALSE (this flag controls whether my patch should be used)
      • Procs <- {P0, P1, P2}, also select “Set of model values” and “Symmetry set”
      • UseSpinlockForMutex <- TRUE
      • UsePatch <- TRUE (this flag controls whether the mitigation patch should be used)
      • MaxNumLoops <- 3
    • What to check?
      • Check “Deadlock” (should be checked by default)
      • Invariants: leave empty
      • Properties: leave empty
  6. Click “TLC Model Checker -> Run Model”
  7. Watch the progress in the “Model Checking Results” column. Once per minute it should populate the table “Time, Diameter, States Found, Distinct States, Queue Size.” After ~2.5 hours, when “Diameter” reaches 342, a new panel should open up titled “TLC Errors”, and it should say “Deadlock reached.” in the first window.
  8. Now you have 341 states to look at in the Error-Trace, in full detail. The “pc” variable shows you which line each process was on for every state. All other variables are also visible. They are a pretty direct translation from the C code, so it should be possible to figure them out by putting the C code and the TLA+ code side by side and looking at the minor differences. I uploaded a table with only the relevant information here. (this was from a run with slightly different source code, with AlwaysSetGrefs commented out. This means your trace will be slightly different, and the line numbers of the spreadsheet won’t align if you try to add more information. You’ll need to create your own by exporting the trace)

(optional step before 6: Click on “Additional TLC Options”, increase “Number of worker threads” to match your machine, increase “Fraction of physical memory allocated to TLC”, and set “Profiling” to “Off”; this should speed up the checking)