On Tuesday this week I spent some quality time with the Berkeley DB documentation. The motivation for this was to write a script that would reliably back up a Berkeley DB transactional database. This is simple, because Berkeley DB comes with the necessary utilities, which work independently of whatever client application is built on top of the database: you just checkpoint, archive, and remove cruft. Recovery is supposed to be similarly simple.

However I ended up getting distracted by various funky features, like replication, distributed transactions, running Berkeley DB as an RPC server, etc. One of the things that particularly caught my eye was the page about nested transactions. The documentation describes them as being useful for decomposing large transactions into smaller parts, so that if a child transaction encounters a conflict it can abort without having to unwind the whole parent transaction. This allows complex transactions to proceed even in the face of heavy contention.

However what the documentation doesn’t say is that nested transactions make it easier to create simple library APIs without compromising flexibility. A library which provides a function implemented using a Berkeley DB transaction can simply take a parameter for an optional parent transaction; code which uses the library can then call the same function in isolation or as part of a larger transaction. In the absence of nested transactions, the library will probably want to provide two versions of the function: one with the rubric for starting, retrying, and committing the transaction, and a raw-guts version for use inside larger transactions. This breaks the library’s abstraction.

This reminded me of a paper I read in about February this year, called Composable Memory Transactions. The paper was written by two members of the Haskell mafia and two members of the STM mafia (who were all working at MICROS~1 Research in Cambridge at the time), and it combined the two in a mind-blowingly beautiful way.

Software Transactional Memory is mostly a subset of a concurrent programming technique known as lock-free programming. Traditionally, concurrent programs have used locks to ensure that changes to data structures appear atomic - the data moves directly from one consistent state to another, from the point of view of other threads in the program. It is difficult to program correctly with locks, and they do not scale well as the complexity of the program or the number of processors increases. The aim of lock-free programming is to create concurrent data structures which can make effective use of large multi-processor machines, which are becoming increasingly common - most high-end CPUs nowadays are multi-core. It uses the processor’s atomic operations (e.g. compare-and-swap) to manipulate the data structure directly. Keir Fraser’s PhD thesis “Practical Lock Freedom” contains several examples and is really nicely written.
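
For a flavour of this low-level style in Haskell (my sketch, using modern GHC rather than anything from the thesis): GHC’s atomicModifyIORef' is implemented with a compare-and-swap loop in the runtime, so many threads can update a shared counter without ever taking a lock.

    import Data.IORef

    -- Bump a shared counter without a lock. Under the hood GHC
    -- retries a compare-and-swap until it wins, so concurrent
    -- bumps never block each other.
    bump :: IORef Int -> IO Int
    bump counter = atomicModifyIORef' counter (\n -> (n + 1, n + 1))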

However lock-free programming at this level is hard: it’s possible for a really studly programmer to implement a reasonably simple data structure such as a skip list, but it isn’t feasible for mortals to implement lock-free versions of the kind of ad-hoc structures you get in large programs. So we need a higher-level way of programming in this style. What software transactional memory does is to encapsulate the complexities of lock-free programming in a general-purpose way, so that it looks like the transaction processing that is familiar from database programming. You start a transaction, manipulate the data structure, and then commit (or retry if there was a conflict), and the changes you made appear atomic to the rest of the program. Nice and easy.
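
For a taste of the interface, here is a sketch using GHC’s Control.Concurrent.STM (my example, not one from the paper):

    import Control.Concurrent.STM

    -- Move an amount between two shared balances. The two writes
    -- appear atomic to every other thread: the transaction commits
    -- as a whole, or is automatically re-run if it conflicted.
    transfer :: TVar Int -> TVar Int -> Int -> IO ()
    transfer from to amount = atomically $ do
        balance <- readTVar from
        writeTVar from (balance - amount)
        other <- readTVar to
        writeTVar to (other + amount)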

What the Composable Memory Transactions paper did was to turn STM into a general-purpose IPC mechanism. Inter-process communication (including inter-thread communication) generally involves two things: data transmission and rendezvous. STM takes care of data transmission via shared data structures, but it does not necessarily provide a mechanism for a thread to block until some condition is satisfied. For example, in a producer-consumer setup, if the queue is empty the consumer needs to wait until the producer puts something in it.

The way Haskell’s STM supports blocking is with the retry operator. If the conditions aren’t right for the thread to complete its transaction (e.g. the queue is empty), it calls retry. This doesn’t retry immediately - that would be pointless - instead it aborts the transaction and waits until something has changed which may allow the thread to complete the transaction successfully. The STM system has a pretty good idea of which somethings to keep an eye on, because it keeps a log of the variables that the thread read during the transaction, so it can block the thread until some other transaction writes to one of those variables.
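
As a sketch (mine, not the paper’s), a consumer for a queue kept in a TVar looks like this:

    import Control.Concurrent.STM

    -- Take the first item from a shared queue, blocking while it
    -- is empty: retry aborts the transaction and sleeps until some
    -- other transaction writes to a TVar this one has read.
    takeItem :: TVar [a] -> STM a
    takeItem queue = do
        items <- readTVar queue
        case items of
            []       -> retry
            (x : xs) -> do
                writeTVar queue xs
                return x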

Blocking operations like this are usually the enemies of compositionality. The paper uses the example of two C functions which use select() internally to wait until data is available on a socket; a program that uses these functions can’t easily compose them into a higher-level function which waits for either of the inner functions to succeed. The Haskell STM system fixes this problem with an orElse combinator which composes two child transactions. If the first one calls retry, the second is run. If both of them retry then the combined transaction retries. If there is no outer orElse then the thread blocks until any variable read by either of the child transactions is modified.
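
Reusing the takeItem sketch from above, waiting on either of two queues is a one-liner:

    -- Wait for an item from either queue: try the first, and if it
    -- retries try the second; block only when both are empty.
    takeEither :: TVar [a] -> TVar [a] -> STM a
    takeEither q1 q2 = takeItem q1 `orElse` takeItem q2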

A little thing that really excited me was one of the examples in the paper that showed how you could use these combinators to turn a blocking operation into a polling operation by wrapping it in a function. (With more familiar IPC systems only the opposite is possible.) The following tries a transaction, and if it succeeds immediately it returns True; if the transaction attempts to block, it returns False.

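    -- tryTXN :: STM a -> STM Bool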
    tryTXN t = do { t; return True }
               `orElse` return False
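
For example, with the queue sketches above, atomically (tryTXN (takeItem queue)) polls the queue: it consumes an item and returns True if one was available, and returns False straight away instead of blocking if the queue was empty.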

When I first read the paper I thought it would be cool if the same neat programming interface could be applied to data on disk as well as in memory, via a transactional database such as Berkeley DB; however I didn’t pursue the idea. Having been reminded of it recently, I have thought about it a little more - it helps that I now know a bit more about Berkeley DB’s advanced features than I did then. In particular, this style of composable transactions requires support for nested transactions: each alternative of an orElse is run as a nested transaction, so that if it calls retry it can be rolled back independently of the parent transaction, and the other alternative sees the data structure as it was before the orElse.
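
To make the shape of the idea concrete, here is a purely hypothetical sketch of what such an interface might look like, mirroring the STM combinators. None of these types or functions exist in Berkeley DB or in any binding I know of, and the bodies are just stubs:

    -- Entirely hypothetical: a composable transaction interface
    -- over a Berkeley DB environment, shaped like Haskell's STM.
    data DBEnv    -- a transactional database environment
    data DBTxn a  -- a computation run inside a database transaction

    -- begin, run, commit; on a conflict or a wakeup, run again
    runTxn :: DBEnv -> DBTxn a -> IO a
    runTxn = undefined  -- stub

    -- abort, then block until an entry this transaction read is written
    dbRetry :: DBTxn a
    dbRetry = undefined  -- stub

    -- run each alternative as a nested transaction, as described above
    dbOrElse :: DBTxn a -> DBTxn a -> DBTxn a
    dbOrElse = undefined  -- stub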

What else is required? We need to record which database entries are read during a transaction, and we need to be able to block until any of them are written. Now this could be done as a wrapper around the Berkeley DB calls, but that is only sufficient if every program accessing the database uses these wrappers. This is not so good for compositionality. It’s also duplicating work, because Berkeley DB’s transaction log already records which database entries were read. So I think the feature should be built into Berkeley DB.

I suggest the addition of a DB_TXN->wait method. This unwinds the transaction’s log, releases all its locks, and waits until another transaction modifies a database entry that this transaction read. Then it returns DB_RETRY, which indicates that the caller should retry the transaction as if it were recovering from a deadlock. However, if this is a child transaction, DB_TXN->wait does not release the locks and block; instead it frees the transaction after unwinding it and returns DB_TRYOTHER. This indicates that the caller should return to its parent, which can try an alternative child transaction. When the parent runs out of alternatives, it calls DB_TXN->wait, which will either block or tell the parent to return to its own parent.

In a similar manner to Berkeley DB’s existing handling of nested transactions, when DB_TXN->wait returns DB_TRYOTHER the log of database entries which would unblock this thread is merged from the child transaction into the parent, so that when the parent does block, any change that might allow a retried child to succeed will wake it up. This is at a somewhat lower level than the Haskell STM, but that is consistent with Berkeley DB’s existing APIs, which, for example, require you to explicitly code loops to retry transactions that conflict.

Berkeley DB already has some limited support for IPC rendezvous, but only for producer-consumer relationships implemented via the queue access method. If you call DB->get with DB_CONSUME_WAIT then your thread will block until entries are added to an empty queue. So there is at least a little of the necessary infrastructure to implement the much more general DB_TXN->wait.

All that remains is for someone to write the code…