After fixing the performance and scalability problems, we’re on our way to getting a stable Pijul. In this post, I explain what I’ve been up to in recent months.
Pijul has always been advertised as a research project, trying to implement a theory of patches that would be sound and fast. This is an ambitious goal, and it became even more ambitious than initially envisioned.
One of the hardest challenges is that source code is in essence stateful, which makes it much harder to iterate over algorithm designs, as normal research projects do. For example, in order to get from our last published version to our current design, we went through many different variants, and there wasn’t much to publish along the way.
Moreover, the UX aspect is what matters most in the end, and testing it on a real-world project is the only way to get it right. However, unlike a compiler, where bootstrapping is done one step at a time and previous versions are always available to compile the current one, a version control system has the additional problem that previous versions might not be easily accessible if there is a bug.
One of the criticisms I’ve heard since I realised that better datastructures were possible is that I was “working secretly”. I certainly understand this feeling, but it is based on a misunderstanding of how research works. When I first had the idea I’m explaining in this post, I realised that a complete rewrite would be needed. But for a very long time, almost nothing other than unusable, unreadable prototypes came out of it.
Back then, there wasn’t much to show, since it wasn’t even clear that the basic datastructures would work. And even when they started working at a large enough scale, it took quite a bit of testing on large repositories before they worked reliably.
This also implies that there wasn’t much to show for quite a while, since the new algorithm wasn’t usable until very recently, and any repository started before now would have become obsolete in a matter of days.
There were also personal and professional reasons for this silence, described at the end of this post.
Pijul depends on two other projects I’ve started.
One of these projects is Sanakirja, which is “just” a key-value store, but with the extra feature that databases can be cloned efficiently. I would have loved to just use an existing library, but there isn’t any with this cloning feature. However, the scope of Sanakirja is still quite modest: it does one thing and does it well. Obviously, it took some time to find the memory-management bugs, but I am fairly confident that this is now done.
In previous releases of Pijul, databases were implemented as a single mmapped file containing the binary representation of B trees. Despite their lower write performance (compared to alternatives such as log-structured merge-trees), and the complexity of the deletion code, B trees are very well suited to this use case: indeed, since they are trees, reference-counting the nodes is enough to implement efficient clones.
One of the remaining issues was that in order to grow the database, we needed to unmap the file, grow it, and mmap it again. Since applying a single change in Pijul must be an atomic operation, we had to cancel the transaction when that happened, and restart it with a bigger file.
Another issue is that I wanted the next libpijul to compile on platforms that don’t have mmap, such as WASM. However, while reallocating an mmapped file has very low complexity (even though it does have a non-zero cost in terms of system calls), reallocating a chunk of memory often requires copying everything. This completely defeats the point of the algorithms in Pijul, which rely on a particular on-disk representation of the datastructures.
The main innovation in Sanakirja 0.13 is to use a vector of memory blocks (either in memory or mmapped from a file) of exponentially-increasing size. The overhead is just one extra indirection, and the complexity of adding items is unchanged (since creating an extra block is an $O(1)$ operation). The exponentially-increasing sizes mean that the allocated memory is always at least half-full.
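As an illustration, here is a toy sketch of the indirection this scheme needs (my own code, not Sanakirja’s, with a hypothetical block size `BASE`): with block sizes `BASE, BASE, 2·BASE, 4·BASE, …`, block $k$ (for $k \geq 1$) starts at byte $\mathtt{BASE} \cdot 2^{k-1}$, so a global offset maps to a (block, offset-in-block) pair with a couple of bit operations, in $O(1)$:

```rust
const BASE: u64 = 4096; // hypothetical size of the first two blocks

/// Map a global byte offset to (block index, offset inside that block).
fn locate(offset: u64) -> (usize, u64) {
    if offset < BASE {
        return (0, offset);
    }
    // For offset >= BASE, block k >= 1 covers [BASE << (k-1), BASE << k[.
    let k = 64 - (offset / BASE).leading_zeros() as u64;
    let start = BASE << (k - 1);
    (k as usize, offset - start)
}

fn main() {
    assert_eq!(locate(0), (0, 0));
    assert_eq!(locate(BASE - 1), (0, BASE - 1));
    assert_eq!(locate(BASE), (1, 0)); // block 1 starts at BASE
    assert_eq!(locate(2 * BASE), (2, 0)); // block 2 starts at 2 * BASE
    assert_eq!(locate(4 * BASE - 1), (2, 2 * BASE - 1));
    assert_eq!(locate(4 * BASE), (3, 0)); // block 3 starts at 4 * BASE
}
```

Because each new block doubles the total capacity, allocated memory stays at least half-full, and growing never copies existing data.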
The other one is Thrussh. That library implements the SSH protocol, and tries to handle a number of key formats. The former is a surprisingly easy goal, and keeping up with Tokio versions has historically been the hardest bit, while the latter is the most horrendous hydra-like task, with new heads and legacy formats showing up every time you think you’re done.
Old-style repositories represented a single file by a directed graph $G = (V, E)$ of lines, where each vertex $v\in V$ represented a line, and an edge from $u \in V$ to $v\in V$, labelled by some change (also called patch) number $c$, could be read as “according to change $c$, line $u$ comes before $v$”.
This means that changes could introduce vertices and edges, as in the following example, where a line $D$ is introduced between $A$ and $B$:
Here, the thick line represents the change from the file containing the lines $A$, $B$, $C$ to the file with the new line $D$. An important feature to note is that vertices are uniquely identified, by the hash of the change that introduced them, along with a position in that change. This means that two lines with the same content, introduced by different changes, will be different. It also means that a line keeps its identity, even if the change is applied in a totally different context.
Moreover, this system is append-only, in the sense that deletions are handled by a more sophisticated labelling of the edges. In the example above, if we want to delete line $D$, we just need to make a change mapping the edge introduced by $c_0$ to a deleted edge, which we label by the name $c_1$ of the change that introduces it:
From now on, we call the full edges alive, and the dashed ones dead.
We have just described the two basic kinds of actions in Pijul; there are no others. One kind adds vertices to the graph, along with “alive” edges around them, and the other maps an existing edge label onto a different one. In order to fully describe the system, I also need to mention that edge labels are given by two parameters: their status (alive, deleted, and a few others related to multiple files and technical details explained below) and the change that introduced them.
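To make this concrete, here is a minimal model of that graph, with hypothetical names of my own (libpijul’s actual types differ): an edge label is a (status, introducing change) pair, and the two kinds of actions are adding alive edges around a new vertex and relabelling an existing edge:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Vertex { change: u32, pos: u32 } // hash of the change + position in it

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status { Alive, Deleted }

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Label { status: Status, introduced_by: u32 }

type Graph = HashMap<(Vertex, Vertex), Label>;

/// First kind of action: an alive edge around a new vertex.
fn add_edge(g: &mut Graph, from: Vertex, to: Vertex, change: u32) {
    g.insert((from, to), Label { status: Status::Alive, introduced_by: change });
}

/// Second kind of action: map an existing edge label onto a different one.
fn relabel(g: &mut Graph, from: Vertex, to: Vertex, new: Label) {
    g.insert((from, to), new);
}

fn main() {
    let a = Vertex { change: 0, pos: 0 };
    let b = Vertex { change: 0, pos: 1 };
    let d = Vertex { change: 1, pos: 0 }; // identical content elsewhere would still differ
    let mut g = Graph::new();
    // Change 1 inserts D between A and B.
    add_edge(&mut g, a, d, 1);
    add_edge(&mut g, d, b, 1);
    // Change 2 deletes D by relabelling the edge pointing to it.
    relabel(&mut g, a, d, Label { status: Status::Deleted, introduced_by: 2 });
    assert_eq!(g[&(a, d)].status, Status::Deleted);
    assert_eq!(g[&(d, b)].status, Status::Alive);
}
```

Note how deletion is append-only in this model: nothing is ever removed from the graph, an edge label is just mapped to a different one.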
This scheme allows us to define dependencies between changes:
If a change $c$ adds a vertex, we must have its “context”, i.e. the lines before and after it, hence the changes that introduced these lines are in the dependencies of $c$.
If a change $c$ deletes a vertex, or in other words maps an existing edge introduced by a change $d$, then $c$ must depend on $d$.
Of course, this is just the minimal set of dependencies needed to make sense of the text edits. Hooks and scripts may add extra language-dependent dependencies based on semantics.
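The minimal rule can be sketched as follows (hypothetical types of my own, not libpijul’s): each action in a change contributes the changes it structurally needs, and the dependency set is their union:

```rust
use std::collections::BTreeSet;

/// The two kinds of actions, reduced to what matters for dependencies.
enum Action {
    /// A new vertex whose up and down contexts were introduced by these changes.
    NewVertex { context_changes: Vec<u64> },
    /// Relabelling (e.g. deleting) an edge introduced by this change.
    Relabel { edge_change: u64 },
}

/// Minimal dependencies of a change, following the two rules above.
fn dependencies(actions: &[Action]) -> BTreeSet<u64> {
    let mut deps = BTreeSet::new();
    for a in actions {
        match a {
            Action::NewVertex { context_changes } => {
                deps.extend(context_changes.iter().copied())
            }
            Action::Relabel { edge_change } => {
                deps.insert(*edge_change);
            }
        }
    }
    deps
}

fn main() {
    // A change inserting a line between lines from changes 0 and 3,
    // and deleting an edge introduced by change 3.
    let c = [
        Action::NewVertex { context_changes: vec![0, 3] },
        Action::Relabel { edge_change: 3 },
    ];
    assert_eq!(dependencies(&c).into_iter().collect::<Vec<_>>(), vec![0, 3]);
}
```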
Our goal is to find the smallest possible system, both for mathematical aesthetics (why store useless stuff?) and for performance. Therefore, one immediate question comes to mind: why even keep the change number on the edges?
In order to answer that question, suppose we don’t keep the labels, meaning that the maps happen between statuses only. Then, consider the following two situations:
Change inverses
The first issue happens when two authors delete a line in parallel, and one of the authors reverts their change. Applying these changes yields the following diagram, where the two deletions get merged into one, and the inverse applies to both:
However, this is not what we expect, since one of the authors explicitly reverted the deletion, while the other performed the same deletion in parallel. By keeping the labels, this is what we get instead:
Missing contexts
For the sake of clarity, in the rest of this post, we name two users Alice (with pronouns “she/her”) and Bob (with pronouns “he/his”).
Consider the situation where Alice writes something in the middle of a paragraph $p$, while Bob deletes $p$ in parallel. One issue here is that the situation is not symmetric: when Bob applies Alice’s change, he can tell immediately that something is wrong, because the context of Alice’s edits is labelled as deleted in his repository.
However, Alice’s situation is different: indeed, consider the case where, instead of deleting $p$ in parallel with her change, Bob deleted $p$ after applying Alice’s change. The deleted edges are exactly the same, but this is not a conflict, as shown in the following diagram:
The situation is further complicated by the fact that this system doesn’t behave symmetrically with the contexts above and below the new line. Indeed, if Bob deleted the down context of the line (i.e. if he deleted line $C$) instead of the up context (line $B$), Alice could detect the conflict, since in that case, $C$ would have both an alive and a dead edge pointing to it ($C$ is called a “zombie vertex” internally), as shown in the following diagram:
Keeping the change identifiers on each edge allows us to solve this. In Pijul 0.12, Bob would add the labels of all the edges around the deleted lines to the dependencies of his change. Then, Alice can tell whether Bob knows of her change before applying it. The changes are in conflict if and only if Bob doesn’t know of the new lines.
However, this behaviour was counter-intuitive, as noted by @tae.
A finer analysis of what dependencies are led to a different behaviour in the new Pijul. Changes now have two different sets of dependencies: one is the set of strict dependencies, which are changes we require in order to apply the current change; the other is merely a set of “known” changes, which the apply algorithm checks to decide whether or not to mark a conflict.
According to what we described so far, there are two types of conflicts in Pijul: order conflicts, where two alive vertices are left unordered by the graph (for example, two lines inserted at the same place in parallel), and zombie conflicts, where a vertex has both alive and dead edges pointing to it, as in the examples above.
Moreover, it is easy to show that Pijul implements a conflict-free replicated datatype (CRDT): indeed, we’re just adding vertices and edges to a graph, or mapping edge labels which we know exist because of dependencies.
However, Pijul’s datastructure models, in a conflict-free way, the conflicts that can happen over a text file. In fact, Pijul would remain a CRDT with or without the design choices explained above about edge labels: for example, we could decide that the “deleted” status has higher precedence. But as shown above, that wouldn’t model conflicts accurately.
The scenario I want to talk about now is the sequential (i.e. single-user) situation where we start with many lines. Our user deletes all of them, adds one new line at the very beginning, and one at the very end, as shown in the following diagram:
If we represented this situation naively like in that diagram, the complexity of applying changes and outputting the repository would depend linearly on the size of the graph, as we would need to traverse the entire thing to know about line $C$, and know it comes after $B$.
The trick is to use what we call “pseudo-edges”, which are not part of any change, but are just here to keep the “alive subgraph” (the subgraph of the alive vertices) connected. Every time we delete an edge, we add pseudo-edges between the source of the edge and all the descendants of the target, like the dotted edge in the graph below:
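Here is a hedged toy sketch of that idea (my own simplification: for brevity it tracks only the alive subgraph and drops the dead edge, whereas Pijul keeps the dead edge alongside the pseudo-edges): when a vertex dies, its parents get pseudo-edges straight to its alive children, so reachability in the alive subgraph never requires traversing dead vertices:

```rust
use std::collections::{HashMap, HashSet};

type Graph = HashMap<u32, Vec<u32>>; // alive edges only, in this toy model

/// Mark `v` dead: replace every alive edge through `v` by pseudo-edges
/// from v's parents to v's alive children, keeping the alive subgraph
/// connected. (Real Pijul also keeps the dead edge itself.)
fn delete_vertex(g: &mut Graph, dead: &mut HashSet<u32>, v: u32) {
    dead.insert(v);
    let children = g.remove(&v).unwrap_or_default();
    for targets in g.values_mut() {
        if let Some(i) = targets.iter().position(|&t| t == v) {
            targets.remove(i);
            for &c in &children {
                if !dead.contains(&c) && !targets.contains(&c) {
                    targets.push(c); // pseudo-edge, part of no change
                }
            }
        }
    }
}

fn main() {
    // A -> B -> C; deleting B leaves a pseudo-edge A -> C,
    // so C stays reachable in one step from A.
    let mut g: Graph = HashMap::new();
    g.insert(0, vec![1]);
    g.insert(1, vec![2]);
    g.insert(2, vec![]);
    let mut dead = HashSet::new();
    delete_vertex(&mut g, &mut dead, 1);
    assert_eq!(g[&0], vec![2]);
}
```

With pseudo-edges, the cost of outputting the repository depends on the size of its alive subgraph, not on the total size of its history.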
The extension of this scheme to multiple files is done in the following way:
We introduce another type of edge label to indicate that edges represent files, and hence are not transitive. In particular, when we delete a file, all descendant vertices below it must be deleted.
Each file or directory is represented by two separate vertices: one is its name, the other one is an “inode” vertex representing the file itself. This allows directory renames to commute with file renames, and file renames to commute with edits at the beginning of the file. The naive representation where files are represented as just their name would cause the kind of conflicts described above when the contexts of new vertices are deleted in parallel.
The non-conflicting case is where the graph of alive files, reduced by contracting the edges from and to name vertices, is a tree, and each file has exactly one name.
Negating this last sentence yields four different types of conflicts:
In the previous Pijul, the contents of each line-vertex was stored in a table in the database. This was not optimal, since:
As wasteful as this may seem, on the repositories we analysed, this was only about half of the problem. The other half of the disk space was taken by the graph. Not only were repositories taking a lot of space on the disk, but this was also making non-trivial use cases extremely hard to debug, as the graph would quickly become impossible to plot, even on a codebase as small as Pijul itself.
Moreover, some potential applications of Pijul called for a word-based diff. But the size of these graphs was already out of control for lines, and words would only make it worse, because that would require one vertex per word.
This was a major scalability issue, even bigger than parsing your particular format of SSH key (which is possibly the most frequently reported problem).
I did consider abandoning the project at that point, but just before that I wanted to try two ideas out:
Since many graphs I was looking at had long paths of consecutive lines, could we store them “packed” together as a single vertex, until another change needs to split them?
To be honest, I wasn’t too happy about this idea, because there was still a problem with words: we would have to choose, at the start of the repository and for all its files, whether the graph represented lines or words. Also, this would involve at least a major refactoring for a potentially small benefit.
At the same time, I tried to solve the problem of how to store the content. After all, if I had to abandon the project, then at least Sanakirja could be a nice by-product if I managed to make databases smaller on disk.
In the end, I didn’t “fix” Sanakirja, but managed to solve both problems at the same time. If we could store groups of lines, we might as well store arbitrary groups of bytes, without making the graph any larger. And as an extra benefit, if graph vertices store references to the byte offsets of a change, we don’t have to store a table mapping line numbers to contents: we can simply load contents from the change files directly.
However, this idea meant that a completely new Pijul had to be written just to benchmark it. I wrote the core algorithms during my Christmas holidays at the end of 2019, and the results were quite satisfactory: I was able to import a large number of large files without a problem. Even better, I was able to graphviz the result.
I wasn’t super confident about how to announce that publicly, since a number of people had started using Pijul, and even though they knew the project was still experimental, this move would be a breaking change in all possible imaginable ways. Moreover, it wasn’t clear to me that I was still legally allowed to work on Pijul (more on this below).
We first revisit the first two diagrams in this post: adding and deleting lines. Vertices are now referenced by a change number (for example $c_0$) and a byte interval in that change (for example $[0, n[$, which means “bytes from offset $0$ included to offset $n$ excluded”). Note that vertices can now represent an arbitrary number of lines. Moreover, the size they occupy in memory is independent of $n$ (assuming $n < 2^{64}$).
Starting from a single vertex $c_0:[0, n[$ with $n$ bytes (containing an arbitrary number of lines), introduced by change $c_0$, adding a line is done by first splitting $c_0:[0, n[$ at some offset $i < n$, and inserting the new vertex just like before.
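A minimal sketch of that splitting operation (hypothetical names; libpijul’s actual representation differs): a vertex is just a change identifier and a byte interval, so splitting is pure interval arithmetic, and the footprint is a few integers regardless of how many bytes or lines the interval covers:

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Vertex { change: u64, start: u64, end: u64 } // c:[start, end[

/// Split `v` at an internal offset `i` (start < i < end) into two vertices.
fn split(v: Vertex, i: u64) -> (Vertex, Vertex) {
    assert!(v.start < i && i < v.end);
    (
        Vertex { change: v.change, start: v.start, end: i },
        Vertex { change: v.change, start: i, end: v.end },
    )
}

fn main() {
    // c_0:[0, 100[ split at offset 40, to insert a new vertex in between.
    let v = Vertex { change: 0, start: 0, end: 100 };
    let (left, right) = split(v, 40);
    assert_eq!(left, Vertex { change: 0, start: 0, end: 40 });
    assert_eq!(right, Vertex { change: 0, start: 40, end: 100 });
    // Three integers, however long the interval is.
    assert_eq!(std::mem::size_of::<Vertex>(), 24);
}
```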
This means in particular that the splitting of content into lines is done by the diff algorithm and is encoded in the changes, instead of being set in stone for all repositories. With a different diff algorithm, we could imagine splitting contents according to programming language structure.
Once we know how to split vertices, deletions are handled almost as before, as shown in the following diagram: we first apply the same change $c_1$ as in the previous example, and then apply change $c_2$, which deletes the end of vertex $c_0:[0, i[$, from some $j<i$ to $i$, and the beginning of vertex $c_1:[0, m[$, from $0$ to some $k<m$.
One important difference with before is that our previous edges played two different roles, which were not clearly distinguishable from one another until now: one was to order the lines, and the other was to carry the status. However, now that vertices can be split, the “status” role of edges becomes ambiguous: for example, a deleted edge pointing to some vertex $c:[i, j[$ means that bytes $[i, j[$ of change $c$ are deleted, but what if we split that vertex into $c:[i, k[$ and $c:[k, j[$? Should we add an extra deleted edge, and if so, where?
There is a simple solution: by introducing a new kind of edge label (named BLOCK in the source code), we can distinguish between “internal” or “implicit” edges that are only there to order the blocks, and “status” edges informing about the status of their target vertex. I’ll probably explain more about this in a future blog post, or in the manual, or in a paper.
Certainly the most breaking change of them all is the new patch format:
After discussions on our discourse, “patches” have been renamed to “changes”. Not too breaking, alright.
Everything is stored differently, because vertices do not even mean the same thing as before.
Changes are now compressed with a variant of zstd, allowing subsets of a file to be decompressed without decompressing the whole thing.
Changes are now divided into two main sections: one contains a header with metadata, the dependencies, known patches, edits, and a cryptographic hash of the contents. The other section contains the contents of the change, which is the concatenation of the contents of all new vertices in the change. In particular, changes do not store deleted contents (they actually never did, even in previous versions of Pijul). This division allows a server and a client to skip contents that are no longer alive, while allowing the client to retrieve and verify the contents at a later time, should they need to.
Changes now have a text representation, aiming at a one-to-one conversion with binary changes. This is a new thing, and is not totally trivial to get right, in particular because one cycle of serialization and deserialization must yield the exact same binary content.
When creating a new change, authors are presented with a draft of the change in the text format (unless some recorded files are detected to be binary files), in which they can (1) edit the headers, (2) delete the changes they don’t want, and (3) add dependencies.
One can also try to edit the changes, but the result is not guaranteed to work. It is, however, one way to test the examples presented in this post, which is especially useful when paired with pijul debug and graphviz.
Another new idea is that we now have version identifiers. This has always been a difficulty in Pijul, because any two patches $c_0$ and $c_1$ that could be produced independently commute, meaning that applying $c_1$ after $c_0$ yields the same repository as applying $c_0$ after $c_1$.
This means that meaningful version identifiers must be independent from the order. This could be achieved easily with any linear function of the hashes, for example taking the bitwise XOR of hashes. However, such naive schemes would let servers trick clients into accepting that versions are synchronised when they are not. The bitwise XOR described above has this problem, and so do other linear functions: since the hashes are random, any $n$ hashes form, with high probability, a linearly independent family in the vector space of hashes, so forging a version identity is easy: it is just a matter of solving a linear equation.
Our new scheme for version identifiers comes from the discrete log problem. The version identifier of an empty repository is always the identity element $1$, and applying a change with hash $h$ on top of version $V = e^{v}$ is $V^h = e^{v\cdot h}$.
Now, because getting from $V$ to the exponent $v$ is hard (at least for a classical computer), forging a version identity is hard, since we would need to generate hashes whose multiplication is $v$, without even knowing what $v$ is.
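The commutativity this relies on can be checked with a toy sketch (my own parameters: a small prime modulus and a fixed base standing in for the real group; actual hashes and groups are far larger). Applying hash $h$ on top of $V$ yields $V^h$, and since $(g^{h_1})^{h_2} = g^{h_1 h_2} = (g^{h_2})^{h_1}$, any order of the same changes yields the same identifier:

```rust
const P: u64 = 1_000_000_007; // a small prime, for illustration only

/// Modular exponentiation by squaring. P < 2^30, so products fit in u64.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1u64;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % p;
        }
        base = base * base % p;
        exp >>= 1;
    }
    acc
}

/// Apply a change with hash `hash` on top of version `version`: V^h mod P.
fn apply(version: u64, hash: u64) -> u64 {
    pow_mod(version, hash, P)
}

fn main() {
    let start = 3; // a fixed base, standing in for the real starting element
    let (h1, h2) = (0xdead_beef_u64, 0xcafe_babe_u64);
    let a = apply(apply(start, h1), h2);
    let b = apply(apply(start, h2), h1);
    assert_eq!(a, b); // order-independent, as required
    assert_ne!(a, apply(start, h1)); // but sensitive to the set of changes
}
```

Recovering the accumulated exponent from the identifier is a discrete logarithm, which is what makes forging a colliding set of hashes hard (classically).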
I don’t think this scheme is robust to quantum algorithms, though: if you can find $v$ easily, it seems you can solve linear equations on a vector space based on the multiplicative group over $\mathbb N$, and $\mathbb Z/2\mathbb Z$ as the scalars.
The last thing I want to talk about today is the new protocol. First, Pijul can work without a central server, and our central hosting service does not need to be open source for that (thanks for asking, and even more thanks for reading my countless previous answers before asking).
One particular thing worth mentioning is how the comparison between remote and local versions is done. In previous versions, we would download text files representing a list of changes over SSH or HTTP, and compare the whole sets.
The new way is with a binary search over the network, where the client first asks the server what their last version identifier is, and then performs a binary search to determine the latest version the client knows about. Then, new changes are fully or partially downloaded (depending on their size). Finally, we go through the new vertices that are still alive after applying all the new changes, and download (if necessary) the full changes that introduced these vertices.
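The search step can be sketched as follows (my own toy model, not the actual wire protocol): assuming that knowing the server’s version after $n$ changes implies knowing all earlier ones, “known” is monotone along the server’s log, so the latest common version can be found by bisection, in $O(\log n)$ probes instead of transferring the whole list:

```rust
/// Length of the longest prefix of `server_log` made of versions the
/// client already knows. Assumption: knowing version n implies knowing
/// all earlier ones, so `knows` is monotone along the log.
fn known_prefix_len(server_log: &[u64], knows: impl Fn(u64) -> bool) -> usize {
    let (mut lo, mut hi) = (0, server_log.len());
    while lo < hi {
        let mid = (lo + hi) / 2;
        if knows(server_log[mid]) {
            lo = mid + 1; // everything up to mid is known
        } else {
            hi = mid; // mid is unknown, search below it
        }
    }
    lo
}

fn main() {
    // Toy version identifiers; in the real protocol, each probe is a
    // round-trip to the server, so only O(log n) round-trips are made.
    let server = [10, 20, 30, 40, 50, 60, 70];
    let client = [10, 20, 30, 40];
    let n = known_prefix_len(&server, |v| client.contains(&v));
    assert_eq!(n, 4); // changes from index 4 onwards must be downloaded
}
```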
One common criticism we’ve heard since we started Pijul a few years ago was about the name. I came up with that name, but to be honest, I was more interested in getting stuff to work (which was challenging enough) than in thinking about names at that time.
One suggestion I’ve commonly heard is that maybe we should translate the name into another language. The translation of that word in English is Ani, but the relevant domain names are not available, and the googlability is terrible. Then, Anu is the translation in Portuguese, and also a word in many other languages; it is even the name of an ancient god in Mesopotamia, which is actually the first result to show up on Wikipedia, along with a nice logo in cuneiform that looks like a messed-up commutative diagram.
Anyway, it seems this new name has offended some people. I should have asked more people about it, but in times of lockdown I don’t have many around me. After running a Twitter poll, I’m now convinced that neither name is terrible, and the previous name has the advantage of being almost uniquely googleable, so I’m reverting that change.
The old Nest will remain available until the end of 2020 at https://old-nest.pijul.com. The HTTPS certificates are broken at the moment, but SSH will continue to work.