Structural and semantic deficiencies in the systemd architecture for real-world service management, a technical treatise

by V.R.

Preface and disclaimer (!)

You’re probably wide-eyed and gnashing your teeth already.

I was finally tempted into writing this from a Hacker News discussion on “Debian Dropping the Linux Standard Base,” where some interest was expressed in reading an architectural critique of systemd.

To the best of my knowledge, this article (though it ultimately ended up more of a paper in article format) is the first of its kind. This is startling. It’s been over 5 years of systemd, and countless instances of religious warfare have been perpetrated over it, but even as it has become the dominant system in its area, there really hasn’t been a solid technical critique of it which actually dissects its low-level architecture and draws remarks from it.

In fact, much more worthwhile material has been written on the systemd debate than on systemd itself. Among these are Jude Nelson’s “systemd: The Biggest Fallacies” and my own later “Why pro-systemd and anti-systemd people will never get along”.

I am very wary of publishing this, due to being keenly aware of how these discussions descend into an inferno with the same dead horse talking points, even in cases where the author makes relatively salient or previously unexplored points. The tribal instinct to show where one stands kicks in and people begin slinging mud about init systems, even if it’s tangential to whatever they’re supposed to be commenting on.

I have previously raised various points similar to the ones in this article, but usually scattered across comment sections on news sites, and never in any centrally compiled resource. Nevertheless, I will be analyzing systemd’s internal semantics to a reasonable degree, monumentally more so than is the norm. In the process, I hope to somewhat “tilt” the already horrifically spoiled debate in a more concrete direction, and hopefully to bring to light the lesser-known side of systemd: how it operates, rather than mere musings on its public interfaces.

Indeed, my analysis will culminate in some rather heterodox conclusions, as you shall read and hopefully enjoy.

Yet before we head on to this long overdue (perhaps even a bit too late at this point) treatise, I must make some reservations.

Firstly, this article assumes prior familiarity with systemd. I’m sorry to disappoint you if you just wanted a straightforward “Why systemd sucks - this, this, that” article for dummies, but the conclusions are much more nuanced, and the analysis not strictly linear. I would argue that to read a critique of an architecture, one must already have some awareness of it.

Secondly, I do not claim to make any absolute case for the inadequacy or unfitness of systemd given a particular requirement. I only present a critique of it in the general case. The former simply isn’t practical even if I desired to achieve it, since much of the gen on Unix process management, supervision and init is not formally specified or defined to begin with, and relies on heuristics and a priori statements to some degree.

Thirdly, read the article in full, especially before responding. We will be qualifying lots of statements, progressively introducing and elucidating concepts, applying some a priori reasoning at times, and backtracking to derive conclusions or reiterate prior stated knowledge.

Fourthly, I will only be dealing with systemd the service manager (of which the init is an intracomponent subset, and which also contains several other internal subsystems and characteristics that will prove of paramount importance to the analysis), and to a lesser extent journald. Some consideration will also be given to the general systemd interfaces, public and private, not strictly related to the service manager, including generators. Most auxiliaries will not be covered. They are outside the scope of this article.

At the very least, no one will be able to claim that criticism of systemd is technically baseless.

With these caveats in mind, let us begin.

Everything is a Unit (but it doesn’t mean a lot)

The conventional understanding of systemd is usually related to its features as a service manager, or as an init system, and it is most often promoted and defended in relation to its capacity in that problem domain. Perks like journald are part of that, as reliable logging has long been established as an important part of a process supervision toolkit. The auxiliary components, such as logind and nspawn, are also frequently given praise, but they largely exist as services and utilities built on top of and consuming the capabilities of the systemd service manager, thus they are seldom advertised in isolation.

The thrust of my argument, and the positions espoused in this paper, is that this interpretation is wrong, and that in fact almost all commonly given existing definitions of systemd lead to thinking of it in the wrong mental model. It is improper to interpret systemd as an “init system”, as a “service manager”, or as a “software suite for central management and configuration of the GNU/Linux operating system,” (as defined by Wikipedia at present), or even as a low-level userspace middleware.

Instead, these all emerge or are designed on top of what systemd fundamentally provides and is: an object system for encapsulating OS resources alongside a transactional job scheduling engine (itself consisting of such objects) with the intention of providing a uniform interface for controlling and partitioning the units of CPU time, as well as static names and entities, in a GNU/Linux system.

Many of my positions will relate to the complexity, inflexibility, inconsistencies and excessive indirection of the aforementioned object system, and to why the model it presents is dubious for its stated goal of being a standard suite of building blocks on which to base OS distributions (GNU/Linux, specifically).

It is worth noting that the systemd developers themselves have owned up to this definition, though have largely not publicized it. In a wiki article titled “The New Control Group Interface”, they write, in the context of cgroups being described as objects:

Well, as mentioned above, a dependency network between objects, usable for propagation, combined with a powerful execution engine is basically what systemd is.

This is more-or-less equivalent to the interpretation presented herein, except that the execution engine is actually one of the less prominent components, largely relegated to being a context inside the job and unit queues that are driven by the Manager object, which is the primary data structure defining the high-level logic of a systemd “instance”, w.r.t. whether it functions as a system-wide or per-session manager. Another area of dispute will be the primacy of the aforementioned “dependency network,” in that it is left uncertain whether it is truly essential to the design of the object system, or whether it is a constraint to handle conflicts in the job and primitive unit scheduling semantics. Further complicating things is reconciling the way the systemd authors seem to encourage the use of the service manager in a lazy fashion, though it is overwhelmingly understood by its users as an eager one instead. This has implications for how important parallelism is perceived to be as a design aspect: it is less of one in a lazy service manager, but in an eager one the dependency information is important for synchronizing and ordering services per a directed graph, in order to resolve intrinsic non-determinism and race issues in parallel startup.

Let us digress.

Plenty of systems revolve around the “grand abstraction”, or the “overriding metaphor”. In Plan 9, this is the file. But, not in the sense of an on-disk file or really any assumption about instances of a file - the file has context only insofar as it is defined by the byte-stream protocol 9P, and so any arbitrary data structure can be represented as a hierarchical tree of files. In systemd, this metaphor is the Unit.

The Unit (often confused with the unit file, a textual representation of configuration that is serialized into a Unit) is the overriding metaphor that systemd uses to model the world. Units are objects, and the design of systemd is thoroughly object-oriented in nature. A Unit is associated with a Manager, which as mentioned previously is the object responsible for being a systemd “driver”, system-wide or per-session. A distinction is made between an abstract Unit, an instance of a Unit and a reference to a Unit (which is refcounted for purposes of being able to merge them). Unit instances are associated with a vtable, a mechanism for dynamically dispatching a polymorphic function; in systemd’s case, this vtable includes operations for generic process management (starting, stopping, killing), bookkeeping operations like releasing resources, and other fields as they pertain to the runtime state of a Unit, rather than static data such as ID, description, dependency set, booleans and so forth. Individual unit types register their own specific handlers corresponding to the generic vtable members. A Unit may encapsulate or queue a job (only one job per Unit), which is a generic unit of execution dispatched by a Manager object’s private run queue. Units themselves have a Manager-local load queue where their configuration is serialized into an internal format from a unit file or programmatically. Two more queues are the cleanup queue for invalidating stale Units (ones holding no jobs and in an inactive or failed state), and the GC queue for Units pending entry into the cleanup queue (a basic mark-and-sweep method is used).
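
To make the dispatch arrangement concrete, below is a minimal, self-contained C sketch in the spirit of systemd’s UnitVTable. All names and fields here are illustrative simplifications of the shape of that structure, not the actual definitions.

    #include <stdio.h>

    typedef struct Unit Unit;

    /* Generic operations every unit type implements (loosely modeled on
     * the shape of systemd's UnitVTable; fields are illustrative). */
    typedef struct UnitVTable {
            int (*load)(Unit *u);             /* serialize configuration */
            int (*start)(Unit *u);            /* generic process control */
            int (*stop)(Unit *u);
            int (*kill)(Unit *u, int signo);
    } UnitVTable;

    struct Unit {
            const char *id;                   /* static data on the Unit */
            const UnitVTable *vtable;         /* runtime behavior, dispatched */
    };

    /* A concrete unit type registers its own handlers: */
    static int service_load(Unit *u)  { (void)u; return 0; }
    static int service_start(Unit *u) { printf("starting %s\n", u->id); return 0; }
    static int service_stop(Unit *u)  { printf("stopping %s\n", u->id); return 0; }
    static int service_kill(Unit *u, int signo) {
            printf("killing %s with signal %d\n", u->id, signo);
            return 0;
    }

    static const UnitVTable service_vtable = {
            .load  = service_load,
            .start = service_start,
            .stop  = service_stop,
            .kill  = service_kill,
    };

    int main(void) {
            Unit u = { .id = "example.service", .vtable = &service_vtable };
            u.vtable->start(&u);              /* dynamic dispatch via vtable */
            u.vtable->stop(&u);
            return 0;
    }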

This arrangement, which is virtually what defines systemd’s architecture, while likely useful for providing a uniform internal interface, is much less adept at providing a uniform external interface, and generally does not reap a lot of benefits from object-orientation.

For instance, there are several disciplines for units, as I shall define them henceforth (as opposed to the types defined in systemd.unit(5)), each having important semantic differentiations:

Permanent units

Permanent units, though not officially christened as such (the term is instead derived as the antonym of “transient unit”, which is official terminology), are the standard ones you know, which have their configuration loaded from disk in the form of unit files.

These have the full breadth of options corresponding to their type, and are subject to the unit file installation logic for managing their configuration in the file system ([Install] directives).
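
A hypothetical minimal unit file illustrating the shape, including the [Install] section consumed by the installation logic (systemctl enable would symlink it into multi-user.target.wants/):

    # /etc/systemd/system/example.service (hypothetical)
    [Unit]
    Description=Example permanent unit
    After=network.target

    [Service]
    ExecStart=/usr/bin/example-daemon

    [Install]
    WantedBy=multi-user.target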

Transient (executable) units

Transient units are not specified via unit files, but created programmatically. Here we talk about those that manage some form of executable data or unit of work. There is at least one executable unit type which is solely transient, that of scopes – which are a form of logical binding to associate multiple non-forking tasks in a cgroup for resource management purposes, not unlike how svchost does grouping, but with different mechanics. Otherwise, timers (themselves units queued as jobs for a Manager object, meant to trigger other units) and services may be run transiently through the systemd-run(1) tool.

Transient units may or may not have unit files existing as nodes on the file system (such as in the case of device units, though those are non-executable), but largely do not; in any case, the main signature is that the on-disk unit file is not the mechanism of their configuration.

The obvious benefit to this, of course, is to run arbitrary processes within the systemd framework, e.g. for one-off tasks, testing or scripting, and adjust their execution environment dynamically.
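
For illustration, hypothetical systemd-run(1) invocations creating such transient units (unit names, the property value and command paths are made up for the example):

    # One-off task as a transient service unit, with a resource property set:
    systemd-run --unit=adhoc-backup --property=MemoryLimit=512M /usr/local/bin/backup.sh

    # Wrap an interactive command's processes into a transient scope unit:
    systemd-run --scope --unit=adhoc-shell bash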

The issue is that, in fact, systemd is not well amenable to dynamic modification of the execution environment (whether that dynamism be pure runtime replacement, or quick reexecution with checkpointing and calculation of unit execution environment deltas), in spite of the object system. There is a feature called unit drop-ins, whereby one can write snippets of unit file INI pairs to override a vendor configuration. However, this is merely a static load-time measure, and even then it is not capable of adjusting dependency information beyond addition.
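
A hypothetical drop-in, following the documented naming convention; note that it only takes effect after a daemon-reload, underlining its static load-time nature:

    # /etc/systemd/system/example.service.d/override.conf (hypothetical)
    [Service]
    Environment=LOG_LEVEL=debug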

The alternate method is the set-property option of systemctl(1), which does actual runtime modification of the execution environment, but is mostly limited to options in systemd.resource-control(5), those pertaining to cgroup controllers.
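
In practice this amounts to something along the following lines (the unit and values being hypothetical), adjusting cgroup controller knobs on a running service:

    systemctl set-property example.service CPUShares=512 BlockIOWeight=200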

Transient (non-executable) units

Just like their executable counterparts, these are programmatically generated. The two types falling under this discipline are device units and snapshot units. The former has a file system representation (of course with no options), the latter does not.

Snapshot units are peculiar, in that they do not exist as unit files, but are purely in-memory, yet simultaneously can be referenced as though they were filenames. They are generated through systemctl snapshot and are implemented as units which hold dependencies on any enabled active units at the point of creation. Their level of granularity is low, being able only to preserve information about uptime, i.e. what is running or has been stopped.

The in-memory constraint of snapshots slashes their utility considerably, as it makes it infeasible to create e.g. known-good checkpoints of the dependency graph from which to deterministically restore system state. Even then, they appear to preserve too little information to perform this reliably, but it does raise the question of systemd and non-deterministic boot order, which we will be revisiting.

Device units do exist on the file system, solely for ordering purposes. They are generated only for devices tagged “systemd” for udev to interpret in its rulesets. They have no options, and their primary use is to implement a so-called “device-based execution”, which is conditional execution of units based on udev device availability – this is, in fact, done with the cooperation of udev itself, which understands rules like SYSTEMD_WANTS=. Therefore, the semantics of this unit type are distinct, as it is being used to encapsulate the public API of a specific device node manager.
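
A hypothetical udev rule showing the coupling: the “systemd” tag makes the device appear as a .device unit, and SYSTEMD_WANTS= pulls in a service when it shows up (the vendor ID and service name are invented for the example):

    SUBSYSTEM=="usb", ATTRS{idVendor}=="1234", TAG+="systemd", ENV{SYSTEMD_WANTS}="example-sync.service"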

Permanent task-based units unassociated with jobs

The name is a mouthful, but these are worth outlining. As mentioned previously, the standard way systemd handles units of execution is through an internal unit type called a job, which itself can be queued by other units. However, there are some exceptions, and the most interesting ones are mount and swap units.

As they are permanent, they’re configured through unit files. Mount units, in fact, serve a similar purpose to a job or oneshot service, and even try to resemble jobs in emitting job status messages, but are internally implemented on top of the generic Unit interfaces as what boils down to a driver for the mount(8) binary in util-linux. The same is true of swap units for swapon(8)/swapoff(8), which along the way also use libudev interfaces to register a device node, and actually do seem to queue jobs. This makes these two types somewhat of a duplicating and overlapping case, on top of hiding inconsistency.

Automount units then are a further variation, as they drive autofs, though do explicitly queue jobs. In practice, they’re used to extend regular mount units by having mount points be attached lazily, i.e. deferring only until they are accessed.

Job queuing

Jobs are an internal unit type which are associated with other units (only one job per unit), nominally providing a generic internal interface for controlling any unit of execution time. Jobs are scheduled as part of transactions (more on that later), the semantics of how they’re treated there depending on what type they’re assigned - JOB_START, JOB_STOP, JOB_VERIFY_ACTIVE, JOB_NOP, etc. Job types are internal. Job modes, however, of which there are seven, can be controlled via systemctl(1). Modes specify how a job is to be queued relative to other jobs, i.e. how its presence affects the failure actions, compliance of dependency information and end result of other jobs. Job results include JOB_TIMEOUT, JOB_DONE, JOB_CANCELED, JOB_ASSERT (assertion failures in service condition prerequisites can trigger hard job failures if need be), etc. Jobs are handled inside a Manager object’s private run queue, usually checked for consistency first by being added into a transaction (where they are tracked using a refcount marker and a generation count).
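
The internal vocabulary looks roughly as follows, paraphrased from the job.h of approximately this era (the comments are mine, and the exact member set has shifted across versions):

    /* Job types (internal): */
    enum JobType {
            JOB_START,
            JOB_VERIFY_ACTIVE,      /* fail unless the unit is already active */
            JOB_STOP,
            JOB_RELOAD,             /* reload if the unit is active */
            JOB_RELOAD_OR_START,
            JOB_RESTART,
            JOB_TRY_RESTART,        /* restart only if already active */
            JOB_NOP,
    };

    /* The seven job modes exposed through systemctl's --job-mode=: */
    enum JobMode {
            JOB_FAIL,
            JOB_REPLACE,
            JOB_REPLACE_IRREVERSIBLY,
            JOB_ISOLATE,
            JOB_FLUSH,
            JOB_IGNORE_DEPENDENCIES,
            JOB_IGNORE_REQUIREMENTS,
    };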

Job types can be collapsed. That is to say, a job type can be adjusted based on a discrepancy with a Unit’s active state (active, reloading, inactive, failed…). Jobs have their own private dependencies separate from the publicly configurable Unit dependencies. Jobs can be merged, i.e. coalesced into one transaction, provided they satisfy some heuristic properties and do not conflict. The strategy is described in the source as:

/* Merging is commutative, so imagine the matrix as symmetric. We store only
 * its lower triangle to avoid duplication. We don't store the main diagonal,
 * because A merged with A is simply A.
 *
 * If the resulting type is collapsed immediately afterwards (to get rid of
 * the JOB_RELOAD_OR_START, which lies outside the lookup function's domain),
 * the following properties hold: ... */

(Merges can be done to jobs already running, as well, with the exception of those of type JOB_RELOAD so as to avoid race conditions involving configuration file updates being missed.)

The situation is somewhat of a leaky abstraction (or a failure of information hiding in an OO context), in that this is an internal unit type which the user is partially made aware of – job modes and job listings being done through systemctl(1), job scheduling announcements being made on bootup – yet also largely expected not to use or care about. Moreover, as we saw above, jobs are not used consistently even in situations where they would make sense.

The transaction manager

Jobs usually aren’t scheduled in the raw, but run as transactions by a Manager object and subjected to a lot of heuristics therein.

Transactions hold a hash table of jobs, a boolean flag of whether they are irreversible, and a pointer to the “anchor job” (the original job being queued before undergoing merging, collapsing and other tuning). Jobs inside transactions are garbage collected (deleted from the hash table bucket) when they no longer have requirement dependencies on other jobs.

A new single-job transaction involves either retrieving an existing job from a hash table, or allocating a new one, setting its generation and refcount marker, then prepending it to a transaction list.

Adding a job with unit dependencies is more involved. First, the unit is checked: that its configuration is loaded, whether it has an actively running state, that it is not masked, and that the job type being queued corresponds to the unit’s properties. After that, all requirement, positive, requisite and negative dependencies (elaborated in the next section) are recursively added. If the job type is JOB_RELOAD, then relationships of PropagatesReloadTo= and ReloadPropagatedFrom= are also added. JOB_RESTART types are usually translated to JOB_TRY_RESTART so as not to eagerly start dependencies that might not be present.

Examples of bookkeeping tasks performed by the transaction manager are dropping redundant jobs (i.e. ones whose type overlaps with a present active Unit state, or jobs that are NOPs), dropping unmergeable jobs to resolve conflicts (provided they do not directly relate to the anchor job being queued), and breaking ordering cycles (by dropping jobs) based on the presence of non-NULL job refcount markers and whether the generation integer matches the one from the last time the graph was traversed.
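
The cycle-breaking machinery is easier to picture with a generic sketch. The following is not systemd’s code, merely an illustration of the generation-counter idiom: marks from a previous traversal are invalidated by bumping a counter instead of clearing every node.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_DEPS 4

    typedef struct Node {
            const char  *id;
            struct Node *deps[MAX_DEPS];   /* ordering ("after") edges */
            struct Node *marker;           /* non-NULL while on the DFS path */
            unsigned     generation;       /* last traversal that saw us */
    } Node;

    static unsigned current_generation;

    static bool has_cycle(Node *n, Node *from) {
            if (n->generation == current_generation)
                    return n->marker != NULL;  /* back edge iff still on path */
            n->generation = current_generation;
            n->marker = from ? from : n;       /* records the path for diagnostics */
            for (size_t i = 0; i < MAX_DEPS && n->deps[i]; i++)
                    if (has_cycle(n->deps[i], n))
                            return true;
            n->marker = NULL;                  /* done with this subtree */
            return false;
    }

    int main(void) {
            Node a = { .id = "a" }, b = { .id = "b" };
            a.deps[0] = &b;
            b.deps[0] = &a;                    /* a -> b -> a: a cycle */
            current_generation++;              /* fresh marks, no clearing pass */
            printf("cycle detected: %s\n", has_cycle(&a, NULL) ? "yes" : "no");
            return 0;
    }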

To live is to depend

Dependency-based init is a much more delicate problem than it is frequently given credit for. The term “dependency” within the context of process managers usually refers to a node in a topologically sorted directed graph, and moreover the goal is usually ordering: ensure the sequence is fulfilled without error. It is not identical to, e.g., a library dependency, which needs a shared object exporting the same public symbols in order to be satisfied. Indeed, in init systems, one doesn’t care about the services so much as about the resources exported by the services. The service dependency is a reasonable proxy for it, though still implying some design trade-offs (including a rigid eager execution discipline, unless made multi-paradigm with laziness like systemd or nosh+UCSPI).

Nearly all dependency problems can therefore be reduced to ordering problems. This raises the question of why a dependency system reaching a certain level of complexity is necessary, or, if a dependency system is just a weak preprocessor for scanning directives and could as easily be replaced by manual policy, why is it included? The primary motivation for dependency information is to calculate safe sequences for parallel startup, which by definition in the context of OS processes would be prone to synchronization and resource failures without enforcing ordering guarantees. Parallelism, then, is largely about startup speed. An alternative method of optimizing boot would be using a checkpoint/restore solution like DMTCP or CRIU to overlay a process image from a point where its initialization is known to be done, so it jumps right to the main loop. Dependency-based init must therefore make sure its solution is neither too weak to be mere sugar, nor too complicated to involve complex processing costs and excessive input paths that make it difficult to mentally estimate the execution model.

Dependencies are not quite the same as relationships, which are more fine-grained and pertain to the service’s broader context of interaction with other services, rather than strictly the order it should be put into the startup sequence. systemd generally refers to the whole umbrella as “dependencies”, but draws some categories specific to it.

These categories are: positive, inverse-positive, negative, ordering, reload propagators. OnFailure= and JoinsNamespaceOf= are considered their own dependency categories in the systemd source’s UnitDependency enum, as well. There also exist some internal dependency categories like triggers and references (latter used in GC queues).

Positive dependencies include Requires=, Wants= and PartOf=.

Inverse-positive dependencies include RequiredBy= and WantedBy= (for unit file installation).

The negative dependency is Conflicts=.

Ordering dependencies are Before= and After=.

Reload propagators are PropagatesReloadTo= and ReloadPropagatedFrom=.

A notable distinction, which systemd itself draws, is that between requirement dependencies (like Requires=) and ordering dependencies. The former do not influence the latter; absent a particular ordering, any requirement dependencies will be started in parallel with the service requesting them. Requires= implies a hard binding dependency whereby the failure of a required unit will deactivate the requirer, as opposed to the softer Wants=.
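
A common pitfall follows directly from this (unit names and path hypothetical): declaring only the requirement yields parallel startup, so the ordering must be stated separately.

    # b.service: the pulled-in a.service starts in parallel with us
    # unless an explicit After=a.service is also given.
    [Unit]
    Requires=a.service
    After=a.service

    [Service]
    ExecStart=/usr/bin/b-daemon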

It must be highlighted that, unlike in other dependency-based init and rc systems such as BSD init+rc.d and OpenRC, ordering dependencies are relative suggestions and not absolute fiat decreed by the user. The processing pipeline of the job and transaction scheduler is free to rearrange and reorder per its heuristics, and so the final graph may be quite different from what the configuration might imply. These heuristics frequently prove either too liberal or too restrictive (and often allowances from the former make reversing failure cases of the latter more involved).

Besides the high degree of implicitness and the dependency graph being obscured as a machine resource, the dependency options are frequently overlapping or only subtly differentiated, making it easy for a user to pick subtly wrong relationships out of ignorance (and ReloadPropagatedFrom= humorously bearing resemblance to the infamous spooky-action-at-a-distance COMEFROM statement). In addition, all units by default have DefaultDependencies= set to true, which brings in at least basic.target, shutdown.target, sysinit.target and umount.target as implicitly assumed synchronization points.

The question of laziness versus eagerness and the link to service relationships is discussed in a later section.

Every problem can be solved by a layer of indirection

Though many users would perceive the long processing pipeline to increase reliability and be more “correct” than the simpler case, there is little evidence to support this. For one thing, none of jobs, transactions, unit semantics or systemd-style dependencies map to the Unix process model; rather, they are necessary complications to address issues in systemd being structured as an encapsulating object system for resources and processes (as opposed to a more well-defined process supervisor), and one accommodating massive parallelism. Reliability gains would be difficult to measure, and the fact that more primal toolkits like those of the daemontools family have been used in large-scale deployments for years serves as a counterexample demanding consideration. Nonetheless, the setup includes some relatively unique failure points:

Ordering cycles

Ordering cycles, also called cyclic transactions, are a manifestation of a classic problem where the dependency graph is not acyclic, and there exist circular dependencies that lead to a situation where evaluation order is impossible to calculate. The way they are traditionally addressed, and how systemd does it, is to drop nodes/jobs, which may stall boots depending on what was selected to break the cycle. Notably, they also cause dependency loops.

Ordering cycles have been known to be difficult to debug with no generic methodology, largely due to the overt error hiding and obscurity of dependency information, since it is entirely transient. They are sometimes the result of the implicit state brought by DefaultDependencies=.

Dependency loops

A result of ordering cycles whereby a circular dependency leads to a continuous dropping and requeuing of a job, often stalling boot. These have been reported with NFS and rpcbind across multiple distros, with ZoL, and with Xen. Likely many other cases are reported simply as ordering cycles.

Destructive transactions

A transaction is deemed destructive when an irreversible job type other than JOB_NOP is queued as part of a transaction relative to another job type besides JOB_NOP, but is not mergeable per the rules above. This usually implies that an integrity violation of an existing job would have been performed. This has been known to affect reboot and poweroff operations, with very little context given.

Stuck jobs

Where single or multiple start jobs take excessively long times (30s, >1m, etc.) to synchronously complete (likely due to being an anchor) before moving to the next. Generally resolved by shotgun debugging via disabling the culprit, as the information in systemd-analyze(1) is not granular enough to make estimations about the internal scheduling behavior.

Non-deterministic boot order

bootup(7) states:

The boot-up process is highly parallelized so that the order in which specific target units are reached is not deterministic, but still adheres to a limited amount of ordering structure.

It is known that in parallelism, the order of operations is not necessarily deterministic, but the result is (or should be). Yet, in the case of dependency resolution, your order is your result, more or less. Issues have been reported in QEMU environments where dependency information is interpreted inconsistently on boot. This is one of the few cases where the reporter explicitly identifies it as “non-determinism”; it is likely that many cases are ignored in favor of just rebooting twice.


The aforementioned problems are especially pernicious given that they are caused by the indirection incurred from systemd’s execution engine (or, where they would have happened regardless, are made more difficult to troubleshoot by the indirection overhead), and as such require a sophisticated mental model of it to be debugged, due to the sparse default information, and due to solutions like systemd-analyze(1) being either too approximate, or quickly enumerating dependency chains not easily manageable by humans. Indeed, the highly spartan snapshot units, and the general inability to checkpoint the dependency graph to persistent storage for reproducibility and deterministic recovery, reflect the dependency graph as being a machine resource, constantly recalculated in memory, that the user should not care about. Yet the solution is not robust enough to truly afford this luxury. Examples of past and present systems that operate on explicit compiled graphs are serel and s6-rc.

Bus APIs, connections and object interface duplication

Criticism of D-Bus is outside the scope of this article, but it must nonetheless be acknowledged that a heavyweight mechanism like D-Bus, involving service discovery, an object system and an authentication API, does have consequences for a low-level userspace component compared to more primitive IPC. Though speaking the D-Bus protocol and marshalling/unmarshalling structured Unit data to/from its wire format, systemd actually communicates through a private socket in /run, one for system instances (which includes any systemd process run as init(8)) and a separate one for user instances. Failing to obtain a connection to the private bus has been a common error, with “Failed to get D-Bus connection” showing plenty of results, in addition to references to /run/systemd/private. The fact that systemd hosts and runs its own de facto reference implementation of a kdbus (with dbus1 fallback) client library, entitled sd-bus, does not raise confidence.

More pertinently, PID1 exports several documented D-Bus APIs. Most notable is the Manager object interface, with systemctl(1) internally being just a client for it. It is also the only interface (save for a single one for scope units) to emit signals to bus subscribers, but they’re quite limited: when units are loaded or unloaded from memory (regardless of discipline as described above), likewise for jobs, when startup is finished, when unit files on disk are enabled/masked, and when a service reload occurs. Within the scope of possible systemd events, this is a small piece of the pie. Most other bus APIs simply deliver properties, usually just unit configuration information that could be sent through any mechanism, D-Bus offering no special benefits.
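
For a feel of what talking to the Manager interface involves, here is a sketch using the sd-bus client library that performs the equivalent of systemctl start (the unit name is hypothetical, and error handling is abbreviated):

    #include <stdio.h>
    #include <string.h>
    #include <systemd/sd-bus.h>

    int main(void) {
            sd_bus *bus = NULL;
            sd_bus_error error = SD_BUS_ERROR_NULL;
            sd_bus_message *reply = NULL;
            const char *job_path = NULL;
            int r;

            r = sd_bus_open_system(&bus);
            if (r < 0) {
                    fprintf(stderr, "bus: %s\n", strerror(-r));
                    return 1;
            }

            /* Equivalent of `systemctl start example.service`: */
            r = sd_bus_call_method(bus,
                                   "org.freedesktop.systemd1",
                                   "/org/freedesktop/systemd1",
                                   "org.freedesktop.systemd1.Manager",
                                   "StartUnit",
                                   &error, &reply,
                                   "ss", "example.service", "replace");
            if (r < 0)
                    fprintf(stderr, "StartUnit: %s\n", error.message);
            else {
                    /* The reply carries the object path of the queued job. */
                    sd_bus_message_read(reply, "o", &job_path);
                    printf("queued job: %s\n", job_path);
            }

            sd_bus_error_free(&error);
            sd_bus_message_unref(reply);
            sd_bus_unref(bus);
            return r < 0;
    }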

The ultimate irony, then, is that one must communicate with an object system (D-Bus) in order to launch commands or query data which must be deserialized into another object system, all the while the real object system is obscured, non-uniform and hidden from user manipulation, while the proxy object system (D-Bus) nonetheless increases the overhead and failure paths of communication.

cgroup writing

Widely considered to enable “reliable process tracking,” the reality is more nuanced: opting in to cgroups as the main method of process tracking creates an unfavorable coupling with the main gist of cgroups, that of resource control and partitioning. A lighter-weight alternative could be listening for events from the Netlink proc connector (cn_proc).
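
For comparison, a minimal sketch of the cn_proc approach (it requires CAP_NET_ADMIN; error handling and buffer alignment details are elided): the kernel multicasts fork/exec/exit events over a netlink socket, with no cgroup coupling involved.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>
    #include <linux/connector.h>
    #include <linux/cn_proc.h>

    int main(void) {
            int sock = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
            struct sockaddr_nl sa = {
                    .nl_family = AF_NETLINK,
                    .nl_groups = CN_IDX_PROC,
                    .nl_pid    = getpid(),
            };
            bind(sock, (struct sockaddr *)&sa, sizeof sa);

            /* Subscribe: ask the connector to multicast process events. */
            char req[NLMSG_SPACE(sizeof(struct cn_msg) +
                                 sizeof(enum proc_cn_mcast_op))];
            memset(req, 0, sizeof req);
            struct nlmsghdr *nlh = (struct nlmsghdr *)req;
            struct cn_msg *cn = NLMSG_DATA(nlh);
            nlh->nlmsg_len  = NLMSG_LENGTH(sizeof *cn +
                                           sizeof(enum proc_cn_mcast_op));
            nlh->nlmsg_pid  = getpid();
            nlh->nlmsg_type = NLMSG_DONE;
            cn->id.idx = CN_IDX_PROC;
            cn->id.val = CN_VAL_PROC;
            cn->len    = sizeof(enum proc_cn_mcast_op);
            *(enum proc_cn_mcast_op *)cn->data = PROC_CN_MCAST_LISTEN;
            send(sock, nlh, nlh->nlmsg_len, 0);

            for (;;) {  /* each datagram wraps a struct proc_event */
                    char buf[4096];
                    if (recv(sock, buf, sizeof buf, 0) <= 0)
                            break;
                    struct cn_msg *m = NLMSG_DATA((struct nlmsghdr *)buf);
                    struct proc_event *ev = (struct proc_event *)m->data;
                    if (ev->what == PROC_EVENT_EXIT)
                            printf("pid %d exited\n",
                                   ev->event_data.exit.process_pid);
            }
            return 0;
    }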

Additionally, systemd does not handle incorrectly behaved (Type=forking) services much better than most traditional service managers, though it is commonly believed to. Such services are inherently flawed in that they daemonize themselves instead of delegating daemonization, as is customary in a supervision suite. There are two methods of recourse in systemd: PIDFile=, with all the TOCTTOU risk that comes with it, or the GuessMainPID= heuristic, which fails if there is more than one daemonized PID inside a cgroup.
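
The two escape hatches amount to configuration like the following (service name and paths hypothetical):

    [Service]
    Type=forking
    # Subject to the TOCTTOU hazards noted above:
    PIDFile=/run/legacyd.pid
    ExecStart=/usr/sbin/legacyd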

Internally, the cgroup object interface leaks across Manager and Unit module boundaries; however, primitive cgroupfs manipulation is relatively well separated. Nonetheless, the design as such makes it hard to have a dedicated cgroup writer service with well-defined communication, and there is also zero interest from the developers in such an endeavor, due to a duplication of efforts they allege would occur. In light of the unified cgroup hierarchy mandating a single writer, the de facto standard being a poorly specified and internally coupled object interface will increasingly be an issue. systemd devotes two entire unit types to named logical partitioning of processes via cgroups: slices (which are dummy units, not unlike targets, but just for grouping and not even synchronization) and scopes (grouping arbitrary system processes not directly configured through systemd’s Unit framework). The public cgroup API exposed by systemd as it stands is also quite spartan, largely limited to the creation of transient units.

Parsing in critical paths

Quoting from djb’s “The qmail security guarantee”:

Don’t parse.

I have discovered that there are two types of command interfaces in the world of computing: good interfaces and user interfaces.

The essence of user interfaces is parsing: converting an unstructured sequence of commands, in a format usually determined more by psychology than by solid engineering, into structured data.

When another programmer wants to talk to a user interface, he has to quote: convert his structured data into an unstructured sequence of commands that the parser will, he hopes, convert back into the original structured data.

This situation is a recipe for disaster. The parser often has bugs: it fails to handle some inputs according to the documented interface. The quoter often has bugs: it produces outputs that do not have the right meaning. Only on rare joyous occasions does it happen that the parser and the quoter both misinterpret the interface in the same way.

When the original data is controlled by a malicious user, many of these bugs translate into security holes. Some examples: the Linux login -froot security hole; the classic find | xargs rm security hole; the Majordomo injection security hole. Even a simple parser like getopt is complicated enough for people to screw up the quoting.

In qmail, all the internal file structures are incredibly simple: text0 lines beginning with single-character commands. (text0 format means that lines are separated by a 0 byte instead of line feed.) The program-level interfaces don’t take options.

All the complexity of parsing RFC 822 address lists and rewriting headers is in the qmail-inject program, which runs without privileges and is essentially part of the UA.

Even launchd, systemd’s main influence, parses plists outside of PID1 (my earlier article about FreeBSD and launchd made its one erroneous claim to the contrary).

A relatively recent research community has emerged around “language-theoretic security,” or LANGSEC, which makes the same case as djb and others on parsing, in great detail. As such, systemd’s decision to handle all unit file configuration parsing tasks (including globbing) in a critical process appears shortsighted.

Non-generic fd-holding and socket preopening

The so-called “socket activation” facility (actually a great misnomer, as elucidated by Laurent Bercot) is one of systemd’s main claims to fame. It is amusing to note that the “socket units” which serve as lazy load points are in fact deceptively named – socket units also handle FIFOs, POSIX message queues and “special” files (i.e. character devices and things you ioctl(2)).
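
A hypothetical socket unit illustrating the misnomer: the same unit type could just as well declare ListenFIFO= or ListenMessageQueue= instead of a network socket.

    # example.socket (hypothetical)
    [Socket]
    ListenStream=8080
    # ListenFIFO=/run/example.fifo
    # ListenMessageQueue=/example-mq

    [Install]
    WantedBy=sockets.target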

In “Systemd for Administrators, Part XI”, Lennart Poettering claims:

Socket activation of any kind requires support in the services themselves.

This is incorrect. There has been such a mechanism for nearly two decades known as UCSPI (UNIX Client-Server Programming Interface), which is given an overview by JdeBP here. However, this returns to the fact that daemontools-like service managers explicitly compose execution state through chain loading tools inside runscripts, whereas the systemd object model is generally very self-contained and not amenable to much extension.

Using systemd “socket activation,” then, besides muddying and equivocating different semantics as stated by Bercot, requires integration with libsystemd to exploit effectively, via sd_listen_fds(3). The same goes for its recently introduced fd-holding via sd_pid_notify_with_fds(3).
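
The integration in question looks roughly like this (a sketch; the daemon body is elided):

    #include <stdio.h>
    #include <systemd/sd-daemon.h>

    int main(void) {
            /* Returns the count of descriptors passed by the manager;
             * the 0 argument keeps the LISTEN_* env variables intact. */
            int n = sd_listen_fds(0);
            if (n < 1) {
                    fprintf(stderr, "not socket-activated, no fds passed\n");
                    return 1;
            }
            int fd = SD_LISTEN_FDS_START;  /* first passed fd, i.e. fd 3 */
            /* ... accept(2) on fd and serve ... */
            (void)fd;
            return 0;
    }

Contrast with UCSPI, where e.g. ucspi-tcp’s tcpserver hands the program its connection on the standard descriptors (tcpserver 0 8080 ./mydaemon), so the daemon needs no manager-specific library at all.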

It should be noted that dynamic tracing is another potential way to achieve deferred execution conditional upon resource availability.

Inexpressive unit file options

The unit file syntax is one of systemd’s prime selling points, a subject of much praise due to its perceived ease of use (often wrongly conflated with simplicity). A more pertinent defense is that systemd guarantees a clean process state for every service, compared to the traditional hodgepodge of shell scripts which intrinsically result in mutable state run amok.

In fact, this is a misconception. Clean process state simply means being able to compose the execution environment from an explicitly defined manifest, and has nothing to do with foregoing a shell interpreter for writing your service configurations. For instance, one can reliably create arbitrarily complex execution environments by chain loading: each tool adjusts one facet of process state, then overlays its own image with the next program in the sequence via exec(2), mutatis mutandis, until the chain closes on the final daemon. This is, in fact, not at all dissimilar to function composition, except the “functions” are at a higher level of granularity, being OS processes. In systemd, however, the execution state is not based on explicit chain loading, but on serializing unit file options to a private ExecContext structure overlaid into Unit objects and set for service unit types.
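
As a sketch of that style, a hypothetical run script using the classic daemontools chain loaders (arguments and paths invented for the example):

    #!/bin/sh
    # Each tool mutates one facet of process state, then exec()s the next;
    # nothing is inherited implicitly, so the final state is the manifest.
    exec 2>&1
    exec envdir ./env \
         softlimit -m 64000000 \
         setuidgid mydaemon \
         /usr/local/bin/mydaemon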

Examples of implicit state include the above-mentioned DefaultDependencies=, which has on occasion been linked as a trigger for ordering cycles that lead to dependency loops. Other examples are the Restart= options, whose definitions are either too rigid or too lenient, such as “on-abnormal” and “on-abort”. Though in fairness this can sometimes be worked around through options such as RestartPreventExitStatus=, you will notice systemd provides no generic facility for delegated restarters. There is not even anything like an ExecRestart=. One could hack some behaviors by e.g. launching scripts on postconditions, but there is no fine-grained way to say “I want to call my own service-specific restarter and have the service manager cooperate,” unlike other new-school systems such as Solaris SMF, or most lightweight init systems. An even greater possibility would be to write delegated restarters with the systemd APIs so as to integrate with its object framework, but the available D-Bus ones are not adequate in stock form.

Execution environment modifiers such as PrivateTmp=, PrivateDevices=, PrivateNetwork=, ProtectSystem= and ProtectHome= are then not small composable options, but more akin to large subroutines which make predetermined policy decisions. In fact, for an object system, there is a general lack of delegation or ability to rebind and extend options. Indeed, systemd, unlike prior systems such as pinit, initng and eINIT, plus later ones like finit, does not have anything resembling a plugin or extension system, despite its high surface and potential need for customizing its options.

These concerns are not only theoretical. Misuses of systemd unit options are frequent in the wild, and some of the more egregious cases (all involving Java applications, which clearly would benefit from a delegated restarter, it appears) documented here.

Imbalance between promoting laziness and eagerness

It appears that much of the complexity behind the dependency system comes from an uncertainty over whether systemd intends to be used in an eager or lazy fashion for managing resources. It has most often been promoted as the latter, and indeed, it is publicly touted for its simplification of dependency logic (at least so far as pertaining to ordering dependencies – relationships such as Conflicts= and BindsTo= would still be around and nominally considered unit dependencies, though theoretically they could also be implemented as pre/postconditions) and thus its parallelization benefits.

Yet, in practice, the lazy loading is used far less often than the traditional eager service management with explicit ordering and requirement dependencies from which a dependency graph is then derived, units allocated, jobs scheduled and queued in a transaction by a Manager object.

Where something like launchd, which was also a major inspiration for systemd, operates purely on laziness, and expects any potential resource dependency requirements to be settled cooperatively via IPC, systemd sits in a multi-paradigm middle ground where an eager, dependency-networked object model and lazy resource deferral coexist, stepping on each other’s toes.

Eager service managers work particularly well with a “let it crash” approach, module boundary separations (either via toolkits, e.g. daemontools and derivatives thereof, or communicating RPC services, e.g. SystemXVI), execution state composition through chain loading, and dependencies limited to ordering, or otherwise such that the graphs are comprehensible by humans and easily observable. Lazy managers generally eschew dependency information and focus on watching resources exported by services, asynchronously and dynamically launching services based on conditional factors, valuing responsiveness over raw throughput. In systemd, one must endure both the non-determinism of aggressive parallelism via dependency information and the object network necessitating the transaction and job metaphors, all while having its lazy features be thoroughly restricted to its internal Manager rather than a flexible toolkit such as UCSPI.

With the object system, no true benefits are reaped: no late binding (only dynamic dispatch through vtables), no introspection (only indirectly and with limitations via D-Bus), no monkey patching of Unit semantics (e.g. to allow overloading options), nor does the Unit properly enforce a uniform grand abstraction. The object system is mostly a mock, and amounts to little more than a smoke screen, only complicating the execution model through excessive indirection without tangible benefit.

Targets over milestones for synchronization

Targets are a dummy unit type which intrinsically have no meaning, beyond the one derived from the dependencies they accumulate. They are used as synchronization points to create named system states or checkpoints, and order units relative to these states, taking the place of runlevels, profiles and milestones on other systems.

It is worth noting that some information is lost in the process of treating a target as a mere graph. For instance, in Solaris SMF, milestones are configured like full manifests, and as such can be used to set properties for individual services or globally on a system-wide basis. Thus a milestone becomes more than just a synchronization point: it is an actual different system state to transition to. Furthermore, because it is still bound by the configuration constraints enforced by the service manager, it can retain a near-arbitrary power akin to setting a shell script for a runlevel in inittab(5), while still having a sane environment and proper consistency checking. Another benefit is that the services inside the milestone are explicit, whereas the dependency associations in a target are not.

See the SmartOS milestones for examples.

The (system-specific) problem of readiness notification

The Unix process model doesn’t intrinsically have a way of signaling “readiness”, i.e. that initialization is complete and one is ready to serve requests in the main loop.

This is seldom perceived as a problem, though it is a real one. Generally, eager service managers will autorestart a dependent service until its conditions are satisfied within whatever limits are set, but this is sometimes criticized for leading to excessive thrashing. Whether it presents a concrete issue is relative to the workload being performed.

There are two common ways to get around this potential inefficiency: laziness (usually via preopening sockets, though possibly deferring over conditional availability of any resource) and service manager-specific readiness notification.

systemd devotes an entire service type to the latter case, that of Type=notify. The underlying sd_notify functions operate on a simple socket-based rendezvous, but the standard way nonetheless involves integrating one’s daemon with libsystemd to call the exported function(s).
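
The programmatic accommodation in question is small but unavoidable (a sketch; initialization details elided):

    #include <systemd/sd-daemon.h>

    int main(void) {
            /* ... bind sockets, load configuration, etc. ... */

            /* Tell the manager initialization is done; until this point,
             * units ordered After= this Type=notify service keep waiting. */
            sd_notify(0, "READY=1");

            /* ... enter the main loop ... */
            return 0;
    }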

A supervisor insisting on programmatic accommodations from service writers is not the most desirable state of affairs. A rarely discussed alternative to the two common approaches is, again as with startup speed optimization minus parallelism, checkpoint/restore: checkpoint a process image from a point where initialization is known to be complete and overlay it on startup, using a tool such as DMTCP or CRIU.

Intertwining of global system and service state

Of course we also have the issue of circular dependencies in the systemd architecture itself. We have the init, process manager, process supervisor, cgroup writer, local service tools, the Unit object structure (which might benefit from being made a protocol), timers, mounts, automounts and swaps all in the same module with ill-defined boundaries.

These make live upgrades much riskier, relegate lots of potentially extensible interfaces (e.g. to write replacement components, or delegated components) to internal details and prevent potentially useful features, like running a system instance of a systemd Manager so as to use it without init and shutdown, or even a user instance so it may be used as a session manager separately from a system manager and init. FWIW, I have demonstrated this to be possible (circa systemd-208 with backported patches, i.e. uselessd) with relatively minimal modifications to unveil a state where it is outwardly functional though not properly usable for purposes beyond testing. No doubt things have gotten more complicated since then, but it could probably be done if the developers had interest in it.

Circular designs also hinder composability due to being all-or-nothing.

journald, central I/O bottleneck

The question of journald’s binary logging will not be addressed.

There certainly are problems with traditional syslogd, however it is quite interesting in that systemd’s solution actually goes even further off the edge into being a central bottleneck. It coalesces early boot, kernel ring buffer, service, and other logs (as well as coredumps) all in one location, applying indexing and postprocessing rules that are largely implicit.

This is in stark contrast to prior art such as multilog, s6-log and co. These have a clear separation of phases between collection, rotation and processing. There is no global snarfing of logs; rather, collection happens at the per-process level, each service having its own dedicated logging service. Simple selection is available via arguments using POSIX regexps, but more complicated postprocessing may be delegated to a script. This means that the rulesets are explicit and adjustable on the per-process level. Logs are rotated automatically based on a simple naming convention, with configurable limits.
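
As a sketch of that style, a hypothetical daemontools log/run script: the logger reads the service’s output on its own stdin, timestamps it, and rotates automatically within the ./main directory.

    #!/bin/sh
    # t: TAI64N timestamp; s1000000/n10: rotate at ~1 MB, keep 10 files.
    exec multilog t s1000000 n10 ./main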

The gen of this approach is described as such:

Every program, without exception, should send its logs (be it error messages, warning messages, or anything) to its standard error descriptor, i.e. fd 2. This is what it’s for.

When process 1 starts, the logging chain is rooted to the machine console: anything process 1 sends to its stderr appears, without modification, on the machine console, which should at any time remain the last resort place where logs are sent.

Process 1 should spawn and supervise a catch-all logging mechanism that handles logs from every service that does not take care of its own logging. Error messages from this logging mechanism naturally go to the machine console.

Process 1’s own error messages can go to the machine console, or dirty tricks can be used so they go to the catch-all logging mechanism.

Services that are spawned by process 1 should come with their own logger service; the supervision mechanism offered by s6-svscan makes it easy. Error messages from the loggers themselves naturally go to the catch-all mechanism.

User login mechanisms such as getty, xdm or sshd are services: they should be started with their own loggers. Of course, when a user gets a terminal and a shell, the shell’s stderr should be redirected to the terminal: interactive programs break the automatic logging chain and delegate responsibility to the user.

A syslogd service may exist, to catch logs sent via syslog() by legacy programs. But it is a normal service, and logs caught by this syslogd service are not part of the logging chain. It is probably overkill to provide the syslogd service with its own logger; error messages from syslogd can default to the catch-all logger. The s6 package, including the ucspilogd program, combined with the s6-networking package, provides enough tools to easily implement a complete syslogd system, for a small fraction of the resource needs and the complexity of native syslogd implementations.

The job of indexing these logs is then totally orthogonal to collecting them, allowing for much greater flexibility. It also makes interoperability with foreign services much easier, and requires less babysitting compared to things like setting limits in journald.conf(5) to mitigate journal fragmentation.

In conclusion

The typical interpretations of systemd all suffer from incompleteness and conceptual mismatch. The definition of systemd proposed herein is a superior model for examining and drawing remarks from the systemd architecture.

Despite its overarching abstractions, it is semantically non-uniform, and its complicated transaction and job scheduling heuristics, ordered around a dependently networked object system, create pathological failure cases with little debugging context that would not necessarily occur on systems with fewer layers of indirection. The use of bus APIs complicates communication with the service manager and leads to duplication of the object model for little gain.

Further, the unit file options often carry implicit state or are not sufficiently expressive. There is an imbalance between the features of an eager service manager and those of a lazy loading service manager, with systemd exhibiting rusty edge cases of both through non-generic, manager-specific facilities. The approach to logging and the circularly dependent architecture seem to imply that lots of prior art has been ignored or understudied.