As before, I made an interactive ellipse workbench to experiment with the problem. I got something working, but I have questions…
What I wanted to do is swing a curve around a corner without cutting it. (I could solve the problem with Bézier curves, but they create other issues.) The purple ellipse in the following picture illustrates what I want to avoid: it snips off the corner of the inner rectangle.
I managed to solve my problem by joining two ellipses so that they meet at the corner, with the constraint that (like the four-point egg) their tangents are 45°.
In the workbench you can drag around the big circles to see how the ellipses join.
The top right and bottom left circles control the horizontal and vertical radii of the purple ellipse. The purple ellipse isn’t directly part of the solution; it’s mostly for comparison.
The workbench draws a load of extra scaffolding lines. The lines joining at the corner of the inner rectangle show that the tangents and normals are 45°.
The lines joining the top right and bottom left circles are my eyeballed guesstimates of the region in which a solution can be found.
Is there a formula?
I tried to work out how to directly calculate the radii of the solution ellipses. I thought I had something plausible based on a parametric ellipse using the tangent as the parameter, but it didn’t work when I tried it out.
In reaction to that failure, I made the workbench to try to get a better intuition of what is going on. Eventually I tried bodging it with a brute-force search for a solution. This worked better than expected.
Why does a simple iteration converge so nicely?
The search isn’t really a search as such:
I was surprised that this does in fact converge, and does so rapidly. I didn’t even have to break out Newton-Raphson or quadtrees! Accidentally logarithmic?
What are the constraints on a solution?
The workbench draws lines with gradient 1/2 radiating from the top right and bottom left corners. The two lines pointing inside the rectangle are empirically a fairly accurate match for the boundary between possible and impossible solutions.
I am surprised these are simple straight lines, not curves.
It’s even possible to solve if you drag the corner outside the rectangle, though the ellipses make interesting cusps if you do that. However my eyeballed guesstimate at the outer boundary lines is not accurate.
Freyja’s construction uses a straight edge and compasses in classical style, so it lacks dimensions. Below is my version, with the numbers needed to draw it on a computer.
At first this construction seems fairly rigid, but some of its choices are more arbitrary than they might appear to be. I have made an interactive four-point egg so you can drag the points around and observe how its shape changes.
In the following, I will measure angles in fractions of a turn 𝜏, clockwise from noon because the egg’s pointy end is upwards.
x: 0 y: -1 radius: 2 from: 𝜏*3/8 to: 𝜏*5/8
x: -1 y: 0 radius: 2+√2 from: +𝜏*2/8 to: +𝜏*3/8
x: +1 y: 0 radius: 2+√2 from: -𝜏*3/8 to: -𝜏*2/8
The egg’s north centre is where this scaffolding circle meets the Y axis at +1+√2.
Imagine two new diagonal scaffolding lines: one through the egg’s west point and north centre; and another through its east point and north centre.
x: -1-√2 y: 0 radius: 2+2√2 from: +𝜏*1/8 to: +𝜏*2/8
x: +1+√2 y: 0 radius: 2+2√2 from: -𝜏*2/8 to: -𝜏*1/8
x: 0 y: 1+√2 radius: √2 from: -𝜏*1/8 to: +𝜏*1/8
The six coloured points marked in the diagram above are the centres of the six arcs that make the egg. These comprise the four points the egg is named after, plus their mirror images.
Moss’s egg is a three-point egg. It has the same top half as this four-point egg, but its bottom half is a simple semi-circle.
]]>I think the instigation was a YouTube food video which led me to try making popcorn at home from scratch with Nico. It was enormous fun! And several weeks later it’s still really entertaining to make (especially when a stray kernel pops after I take the lid off the pan, catapulting a few pieces in random directions!)
Turn on the captions while watching the YouTube vid!
There is a standard measure of popcorn quality called the Metric Weight Volume Test, which gives the volume (ml) of popcorn produced per weight (g) of kernels. Typical figures are around 40 ml/g, which was useful to know when I was working out how much corn to put in a pan when I was not yet familiar enough to eyeball it.
The MWVT is weird:
There are a couple of kinds of popcorn:
Mushroom popcorn is good for heavy coatings such as caramel, which is more faff than I care for. Butterfly popcorn has a better crunch and texture for lighter seasoning.
I did some experiments with different fats, trying to get a good flavour.
Butter is nice but it burns at popcorn temperatures. I tried doing a quick-and-dirty clarified butter, by melting it in the microwave and pouring the fat off the solids, but it was kind of faffy and didn’t help much. Some recipes suggest adding the butter after the corn has been popped, but I thought that would make it too greasy. (I have not actually tried it, tho.)
Eventually I settled on sunflower oil, which has a nice subtle nutty flavour.
Our favourite is based on an idea from another random YouTube video. I combine roughly equal parts by weight of:
The latter has a nice yellow-ish colour and a delicious umami flavour.
I whiz them together in a blender to make a fine powder, which I keep in the cupboard in an old spice jar. A jar of 75g of seasoning is enough for many batches of popcorn.
We have a steel pan with a thick base that’s good for making about 50g (2 litres) of popcorn.
I have also used our larger cast iron casserole for bigger batches. For some reason it works a bit slower than the steel pan, so some of the popcorn can get a bit Cricket St Thomas (near Chard). But the cast iron gets nicely seasoned in the process.
We have glass lids for both pans, which is great, it’s so fun to be able to watch the corn popping as well as listen to it!
Put a generous splash of sunflower oil in the pan with the popcorn kernels. The eyeball quantity is at most one layer of kernels in the bottom of the pan.
Set it on a medium heat with a lid on.
Wait a few minutes to get up to temperature, then wait for the popping to proceed until all is quiet for 5 seconds or so.
Pour the popcorn into a big bowl in thirds or quarters, with a sprinkle of seasoning in between each one. Put a plate on top and give it a good shake to distribute the seasoning evenly.
Enjoy!
]]>The Novelkeys Kailh Big Switch is a working MX-style mechanical keyboard switch, but 4x larger in every dimension.
I realised at the weekend that the Big Switch should fit nicely in a simple Lego enclosure. Because an MX-style switch is usually mounted in a 14x14 mm square plate cutout, at 4x larger the Big Switch would need a 56x56 mm mounting hole. Lego aficionados know that studs are arranged on an 8x8 mm grid; this means the Big Switch hole is exactly 7x7 studs. A plan was hatched and a prototype was made.
Apart from the Lego enclosure (scrounged from a decade or two of accumulated miscellaneous bricks), I needed a Big Switch, a microcontroller, and some way to wire them together.
The easiest way, no solder needed, is to go to The Pi Hut and get:
a Pimoroni Tiny 2040 with pre-soldered headers
Leftpondians can get a similar package from Adafruit:
Adafruit’s jumpers are not as convenient as the Pi Hut’s: you’ll need to chop off the JST connectors and solder to the microcontroller.
You don’t need to get a super small dev board like I did: a Raspberry Pi Pico is 51mm long so it will fit inside a 7x7 Lego stud compartment with room to spare, tho the USB plug will stick out.
My plan was to use my existing collection of jumper wires to hook up the Tiny 2040 to the Big Switch. I decided to solder wires to all eight GPIO pins, in case I want to do something more complicated with it in the future, and because my jumper wires came in a rainbow ribbon.
Unfortunately, soldering wires to the pins of the Big Switch was more difficult than I expected.
The pins are about 4mm wide and 13mm long. One of them is copper-coloured and relatively thin, and I soldered that connection without trouble.
The other is steel-coloured and thicker. The solder would not form a meniscus, it would just ball up and slide off. Eventually I tried roughing up the surface of the pin with a file, and the scratches were enough for the solder to grab hold.
Later on I found an Adafruit Big Switch build guide which uses spade connectors, so that’s what I recommend above.
I slapped together a Big Switch QMK configuration which maps the Big Switch to ENTER.
The info.json file says the Big Switch is wired to GP0. You might need to change this if you are using a different dev board; for instance, GPIO0 is not broken out on the Adafruit QT Py RP2040.
So now I have a comically huge button on my desk that I can whack dramatically when a command needs to be fired off with an extra flourish.
]]>Rough quantities per person:
100g pasta
Spaghetti is traditional but I’ll use any shape.
50g streaky bacon
The traditional ingredient is guanciale; maybe I’ll try that one day for a special occasion. I use 4 rashers of the thin-sliced bacon that we get, which is 60g.
one large egg
Typically about 60g
40g grated parmesan
Again, for a special occasion I might try the traditional pecorino romano. My rule of thumb is that the cheese should weigh about half as much as the egg, but I usually round it up so there’s 100g of mixture.
lots and lots of ground black pepper
Get the kettle on the boil and measure out the pasta.
While waiting for the kettle, shred the bacon into a pan.
I use kitchen scissors. (I ought to get our knives sharpened.)
The pan needs to be big enough to stir everything together at the end.
Get the pasta cooking in another pan.
Don’t salt the water, there’s plenty in the bacon and cheese.
Use relatively little water so that it becomes starchy while cooking. The pasta water will loosen and stabilize the sauce.
Fry the bacon until it has taken on some nice colour.
I bash it about with a wooden spoon to make sure the bits have separated. It will probably be done before the pasta, which is fine. Turn off the heat and let it rest.
When the cooking is under control, break the egg(s) into a bowl, and grate the cheese into the eggs.
I do this on top of the weighing scales.
Grind lots of pepper onto the cheese and egg and mix them all together. It will make a thick sludge.
When the pasta is done to your liking, use a slotted spoon to transfer it to the pan with the bacon.
I find a slotted spoon carries a nice quantity of water with the pasta. Many of the recipes I have seen say that the pasta should be slightly under-done at this point, because it will finish cooking in the sauce, but that doesn’t work for me.
Mix the bacon and pasta and deglaze the pan.
It should be cool enough after this point that the egg will not curdle immediately when you add it.
Add the egg and cheese and mix over a gentle heat.
As you stir, the cheese will melt and the sauce will become smooth and creamy. If it’s too thick, add a tablespoon of pasta water. If it’s too runny, boost the heat to help the egg thicken up.
Dish up and serve.
Best eaten immediately: it’s nicest hot but it cools relatively fast.
To avoid all the difficult parts of the PCB design, I used the Waveshare RP2040-Tiny. This is, basically, a Raspberry Pi Pico that has been split in two. There are some other differences: USB-C instead of micro-USB, a reset button, fewer GPIO pins broken out to the board edge.
The RP2040-Tiny microcontroller board is about the size of an ALT keycap, so it easily fits under the space bar. The USB-C board is smaller, tho I would have preferred it to be wider and shallower.
Apart from the controller, all that is needed is a switch and a diode for each key.
Because this keyboard is experimental, I wanted the switches to be socketed, using Kailh MX hotswap sockets. This makes it possible to change switches without desoldering them (I am trying out different switches for my modifier keys) but it also makes it possible to separate the switch mounting plate from the PCB to fettle the stabilizers or add/remove sound damping foam.
My plan was to get the PCBs manufactured with diodes and sockets already soldered on, but that did not quite work out - about which more below.
A keyboard matrix has some row wires and some column wires; each switch connects a row and a column, via a diode that allows the controller to disambiguate when multiple keys are pressed at the same time. The row and column wires are connected to the microcontroller GPIO pins.
Normally the row and column wires correspond fairly directly to the visible layout of the keyboard, but for some reason I decided to use ai03’s duplex matrix design where one row of keys has two row wires, and two columns of keys use the same column wire.
So the Keybird69 matrix is 8 x 9, using 17 GPIO pins, where a more normal layout would be 16 x 5 using 21 GPIO pins. (The RP2040-Tiny has more than enough GPIOs for a normal matrix, so there was no need for me to use the fancy duplex matrix.)
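For readers who have not met matrix scanning before, here is a minimal sketch of the idea in C. The GPIO helpers and pin arrays are hypothetical stand-ins for the microcontroller SDK, and QMK’s real scanner is more elaborate (debouncing, configurable diode direction, and so on); this is just the core loop.

#include <stdbool.h>

#define NROWS 8   /* sizes for illustration only; see the matrix dimensions above */
#define NCOLS 9

/* hypothetical GPIO helpers and pin maps, standing in for the MCU SDK */
extern void gpio_write(int pin, bool level);
extern bool gpio_read(int pin);
extern const int col_pin[NCOLS], row_pin[NROWS];

void matrix_scan(bool pressed[NROWS][NCOLS]) {
    for (int col = 0; col < NCOLS; col++) {
        gpio_write(col_pin[col], true);   /* energise one column at a time */
        for (int row = 0; row < NROWS; row++) {
            /* a pressed switch connects this column to its row; the diode in
               series stops current sneaking backwards through other pressed
               keys and faking extra keypresses */
            pressed[row][col] = gpio_read(row_pin[row]);
        }
        gpio_write(col_pin[col], false);
    }
}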
There are a few options for keyboard firmware:
I got QMK working on the Waveshare RP2040-Tiny without too much difficulty.
I can configure QMK with just its data-driven JSON configuration, without needing any C – the differences from the GENERIC_RP_RP2040 platform are minimal.
I also got QMK working on my Pimoroni Keybow 2040.
Amusingly (in retrospect) I found it more difficult to get the LEDs working than the keyboard matrix scanner – I expected board bringup to involve blinking an LED first, before moving on to more complicated things. But it turns out that there’s a lot more complexity around keyboard LEDs than scanning a keyboard matrix, both from the hardware and firmware points of view. (See the Keybow 2040 pull request for some of my frustrations.)
To test the firmware, I plugged the bare microcontroller board into my computer and used a wire to connect pairs of GPIO pins. This made it generate the right USB HID events, so I could use it like a really super awkward keyboard.
This picture comes out a lot paler and redder than it appears in the flesh: in reality the PCB has a deeper imperial purple lustre.
I generally followed Ruiqi Mao’s keyboard PCB design guide and ai03’s keyboard PCB design guide.
They are both helpful guides to KiCad’s preferred workflow of drawing up the schematic first, then turning the abstract diagram into a physical PCB layout.
I like being able to reconcile the schematic and the PCB in both directions, so when I found the layout would work better with a different assignment of row and column wires to GPIO pins, KiCad would warn me when things were inconsistent. I felt reassured that it was helping me to avoid mistakes.
I had a bit of fun with the duplex matrix: Each row has a pair of PCB tracks running across the bottom of the board; I left out the ground plane fill between the row tracks, so they appear as much darker stripes on the PCB. (This is probably horrible for EMI?) Each switch’s diode bridges over one of the row tracks, so the diode’s cathode can connect to either of the row tracks without extra faff.
The bottom-right modifier keys are rotated to make extra space for the bus (labelled “to or for by with or from everybody”) that connects the microcontroller to the rows.
I also used KiCad to design the switch mounting plate and the base plate for the case.
I made some custom footprints for the LEGO Technic beams, which were mostly outlines on the “user comments” layer. Along with some scaffolding to fix the geometry of the corners, the beams indicated where the plate outlines and fastening holes needed to be.
This made it easier to get everything into the right place than it was for the PCB. The constraints imposed by using LEGO were helpful.
I had some trouble with the cutouts for the stabilizers that keep the long keys (space, return, etc.) level. The Cherry MX data sheet, ai03’s plate tool, the swillkb plate tool, and measurements of actual stabilizer housings all disagreed with each other. It does not help that the measurements use a ridiculous mixture of inches and millimetres. I chose a cutout size that I thought should fit, using round numbers of millimetres (Cherry is a German company).
I was not sure what to do about kerf, the difference between the tooling path and the edge of the cut. The tolerances specified in the Cherry MX data sheet are much finer than JLCPCB’s manufacturing tolerances, so I simply drew the switch cutouts as 14mm squares and hoped that would be good enough.
I got the boards made by JLCPCB, who are impressively cheap and fast. (Though a keyboard PCB is larger than the 100mm x 100mm limit for their super-cheap tier.)
When you submit an order, the JLCPCB web site shows you a picture of the board so you can do some final verification. I was worried by a bug: it does not render holes that are described in the board edge cuts layer, so my switch mounting plate was drawn without the holes for the switches. I found some forum conversations discussing this issue, saying that their boards were made correctly despite the rendering bug.
There were questions from the JLCPCB engineering staff about whether the mounting holes for the keyswitches were supposed to be plated or not, which caused a few days of delay. Once we were past that, I was impressed by JLCPCB’s live manufacturing progress tracker.
Another thing I found difficult about submitting the order to JLCPCB was understanding what is part of their “Economic PCBA” offer.
One aspect is choice of components: JLCPCB has a huge parts library which has three important levels of classification:
“basic parts” are pre-loaded onto the pick-and-place machines, so you only pay the per-part cost;
“extended parts” have to be manually loaded, so there’s a fee for each different extended part you use;
there are also “standard only” parts which are not available for “Economic PCBA”
For my PCB, the Kailh hotswap sockets are “standard only” which meant it was better for me to obtain them elsewhere and solder them on myself. So the only soldering JLCPCB did for me was the diodes.
For a while I was also worried about the design-for-manufacturing requirements for panelization (how multiple PCBs are assembled as part of a single large sheet), fiducials (optical markers used by pick-and-place machines for precise positioning), and tooling holes (for mounting the board in various machines). Eventually I worked out that for Economic PCBA, JLCPCB will take care of all of these details.
(Their documentation is excellent, but not always clear about which flavour of PCB assembly is relevant for a particular requirement.)
An amusing thing I found when browsing the parts library is the term “brick nogging” which is apparently a mistranslation of an abbreviation for “vertical surface mount”. This came up when I was investigating options for USB-C sockets that occupy less vertical space. One of the options is “mid-mount” where the socket is recessed into the PCB; the JLCPCB parts library gives this the evocative name “sink board”.
When the boards arrived, I started by checking that mechanical assembly would work. Happily the stabilizers and switches fit very nicely into their cutouts in the top plate.
There was a worrying moment when I expected the switches to fit into the PCB easily, but they didn’t. It turned out that the mounting holes are very snug, and the switches need a little force to make them fit. (I thought the insertion force in my HHKBeeb came from the hotswap sockets; turns out it isn’t just the sockets.)
After I got past that, everything fitted together perfectly. I was hugely relieved.
Then I started attaching the microcontroller. After some fumbling attempts with my old plug-in-get-hot soldering iron, I decided I would need something with a finer tip.
I had heard that the Pine64 Pinecil is a reasonably effective and inexpensive soldering iron; bonus points for being open hardware designed to be hackable. Amazingly there is open source firmware for soldering irons called IronOS that works on the Pinecil and a number of other irons. Much credit is due to Miniware for making the TS100 soldering iron open enough to establish this niche. Soldering irons are now my second example (after mechanical keyboards) of how amazing open hardware and open firmware can be.
After soldering on the controller, I checked the switch matrix worked by plugging it into my computer and using tweezers to connect the pads for each hotswap socket. Everything worked!
Then I soldered on the keyswitch sockets. A meditative process. Except that lead-free solder is vexingly difficult to use.
To help my middle-aged eyes I used a 5x magnifying glass, 9cm diameter, mounted with a flexible neck on a desk stand like a lamp. I learned recently that magnifiers have an odd damping effect on the hand-eye coordination feedback loop that makes your hands more steady.
I also used some smaller more powerful lenses for visual inspection. There were a few badly soldered joints, and one where I was surprised that one of the metal connectors had pinged out of its socket without me noticing. Fortunately it was easy enough to find and solder it back where it should be.
After that, another test, and everything still worked! Amazing!
]]>Two years ago I planned to make a typical acrylic sandwich case for HHKBeeb, in the style of the BBC Micro’s black and yellowish beige case. But that never happened because it was too hard to choose a place to get the acrylic cut to my spec.
My idea for using LEGO in a keyboard case was originally inspired by James Munns, who uses LEGO for mounting PCBs, including at least one keyboard.
However, I could not work out how to make a case that is nice and slender and into which the parts would fit. It is possible – the KBDcraft Adam solves the problem nicely, and by all reports it’s pretty good as a keyboard, not just a gimmick.
To make the PCB design easier, I am using a Waveshare RP2040-Tiny. It’s more flexible than the usual dev boards used in custom keyboards because it has a separate daughterboard for the USB socket, but I had the devil of a time working out how to make it fit with LEGO.
Instead of using LEGO for the base, use FR-4, same as the switch mounting plate;
There isn’t enough space for SNOT so I can’t use LEGO studs to attach both the top and bottom of the case; why not use non-LEGO fasteners instead?
That will need through-holes, so maybe LEGO Technic beams will work?
Maybe the fasteners I got for the HHKBeeb case will work?
I wanted the fasteners for the HHKBeeb case to be as flat as possible; but acrylic does not work well with countersunk screws. Instead I looked for fasteners that protrude as little as possible.
For machine screws, I found the magic phrase is “ultra thin super flat wafer head”. These typically protrude 1mm or less, whereas the more common button head or pan head protrude about 2mm or more.
I also discovered rivet nuts. They are designed to be inserted into a sheet metal panel and squashed so that they grip the panel firmly. But I just wanted them for their thin flange, less than 1mm.
The usual fasteners for a sandwich case are machine screws inserted top and bottom, with a standoff in between. But Keybird69 uses machine screws in the top and rivet nuts in the bottom.
I’m using M3 rivet nuts and machine screws. The outer diameter of the rivet nuts is 5mm; the inner diameter of the Technic holes is 4.8mm. Fortunately the beams are made from flexible ABS, so the rivet nuts can be squeezed in and make a firm press fit. They can be pushed out again with a LEGO Brick Separator.
Many dimensions of the keyboard are determined by the Cherry MX keyswitch de-facto standard.
The switch mounting plate must be about 1.5mm thick – the standard PCB thickness of 1.6mm works fine.
The top of the PCB is 5mm below the top of the plate. The bottom of the PCB is also 5mm below the bottom of the plate because they are the same thickness. (Usually.)
The electronics are soldered to the bottom of the PCB.
A LEGO Technic beam is 8mm high (along the length of its holes).
The bodies of the switches and the PCB use 5mm of the beam height, leaving 3mm for the electronics. Plenty of space!
The height of the enclosure is 8 + 1.6 + 1.6 = 11.2 mm, which is pretty slender.
HHKBeeb’s generic case uses 10mm acrylic so it’s 2mm thicker, and the NightFox is about the same.
The Waveshare RP2040-Tiny daughterboard is problematic: its PCB is 1mm thick, and the USB-C receptacle is about 3.5mm high. It also has a couple of buttons for resetting or reflashing the RP2040, and they are a similar height.
I could not find a comfortable way to make space for it by cutting away part of the PCB to give it enough headroom. Then I had another brainwave!
I am not constrained by LEGO’s rectilinear grid, so I could make space by angling the back of the case outwards slightly. The middle of the back of the case has the extra few millimetres needed for the USB daughterboard.
If you look closely at the picture above, behind the USB-C receptacle and the M2 nuts, you can see the whiteish top of one of the buttons, and behind that is the beige textured edge of the PCB.
(Also, I need to turn the beams round so that the injection moulding warts are not visible!)
LEGO studs use an 8mm grid. Keys are on a 3/4 in grid, or 19.05mm.
Keybird69 is 5 keys deep, which is slightly less than 12 LEGO studs.
It is 16 keys wide, which is slightly more than 38 LEGO studs. Three LEGO Technic 13 hole beams are 39 studs long.
The front and sides of Keybird69 are enclosed with 5 beams of 13 holes each, which stick out half a stud past the block of keys. They meet at the corners so that the tangent is 45° where the rounded ends of the beams are closest.
This arrangement leaves about 1mm clearance around the PCB. Spacious.
Technic beams are not as square in cross-section as you might expect. Their height (through the holes) is 8mm, whereas their width (across the holes) is 7.2mm. In Keybird69 I left 0.4mm gap between them – I could have cut that down to 0.2mm without problems.
I used a 10mm radius of curvature for the corners. Apart from where the beams meet, the switch plate and base plate are very nicely flush with the beams.
I tried using a Sharpie permanent marker to blacken the edges of my Keybow 2040, but the ink did not stick very well. On Keybird69 I used an acrylic paint marker pen, which worked much better. Compare the raw fibreglass beige of the edges in the picture above to the black edges below.
One thing that probably isn’t clear from the pictures is that the FR-4 plates have an unexpectedly lovely matte reflective quality. I think it might be because the black solder mask is not completely opaque so the layer of copper underneath gives it a bit of shine.
I am also getting some black 13 hole Technic beams to replace the dark grey ones, gotta make sure the dust shows up clearly on every possible surface!
]]>I found out that the remaining stock of Matteo Spinelli’s NightFox keyboards were being sold off cheap because of a manufacturing defect. I grabbed one to find out what it’s like, because its “True Fox” layout is very similar to the unix69 layout I wanted.
My NightFox turned out to have about three or five unreliable keyswitches, which meant it was just about usable – tho the double Ts and unexpected Js rapidly became annoying.
But it was usable enough that I was able to learn some useful things from it.
The black-on-grey keycaps look cool, but they are basically illegible. (The picture above exaggerates their albedo and contrast.) This is a problem for me, especially while I was switching from the HHKBeeb ECMA-23-ish layout to an ANSI-ish TrueFox-ish unix68 layout.
Fortunately I learned this before making the mistake of buying some fancy black-on-grey keycaps.
I had seen a few keycap sets with extra up arrows, which puzzled me. (For example.) The NightFox came with an extra up arrow, and eventually I twigged that it makes the profile of the arrow cluster a bit nicer.
Usually, in a sculpted keycap profile (where each row of keycaps has a differently angled top surface) the bottom two rows have the same angle, sloping away from the typist. This means the up arrow key slopes away from the other arrows on the row below.
The extra up arrow keys typically match the tab row, which is flat or angled slightly towards the typist. This gives the arrow cluster a more dishy feeling.
Unfortunately the keycaps I ordered do not have extra up arrow keys with tab row angle as an option. I did not realise until after I ordered them that I could have got a similar effect by using a reversed down arrow as an up arrow – it makes a sharper angle, but still feels nicer. So I’m using a reversed arrow key for Keybird69’s up button and my up/down legends both point the same way.
Some keycap sets have multiple page up / page down / home / end keys with different row profiles so that people with 65% and 75% keyboards can rearrange the right column of keys. (For example.)
Instead of the superfluous navigation keys, I used the NightFox novelty keycaps on my keyboard. (You can see the ANY KEY, the cute fox logo, etc. in the picture above.) These all had a top row profile, and at first I thought this was an ugly compromise.
But it turns out that the difference in height between the right column and the main block of keys is really useful for helping to keep my fingers in the right places. It makes me less likely to accidentally warp a window when I intend to delete a character.
The mismatched angle of the up arrow key is similarly helpful. Matt3o added a gap next to the arrow keys in his True Fox design to make the arrow keys easier to locate, but I think that isn’t necessary with an out-of-profile up arrow (which is also one of Matt3o’s favourite features).
I previously thought I wanted a uniform keycap profile (e.g. DSA like the keycaps that came with my Keybow 2040) but these discoveries taught me a sculpted profile is more practical for the keyboard I was making.
Another research purchase was a grab bag of random surplus keycaps, which is about as useless as you might expect: hundreds of keycaps, but too many duplicates to populate a keyboard. (My Keybow 2040 now has a very colourful mixture of miscellaneous keycaps.) The grab bag I got was mostly SA profile, which is tall and steeply angled on the near and far rows. In their normal orientation, SA function keys would probably not work so well on the right column of my keyboard, making a shape like a warehouse roof. Maybe they would work when rotated 90°? Dunno.
One of my old beliefs remained: I still prefer spherical indentations in the tops of my keycaps. They are more cuddly around my fingertips than the more common cylindrical indentations.
Annoyingly, many of the newer sculpted spherical keycap sets are hard to get hold of: often the only places that have them in stock will not ship to the UK at an affordable price. (For example.) Also annoyingly, the cheaper keycap sets almost never have the extras needed for unix69 compatibility. Bah.
The black-on-grey NightFox keycaps are Cherry profile (cylindrical indentations, sculpted rows, very short), and the keycaps that WASD printed for my HHKBeeb are OEM profile (like Cherry profile but taller). The HHKBeeb doesn’t have spherical keycaps because I don’t know anywhere that will do affordable one-off prints other than OEM profile. I also have a set of TEX ADA keycaps (uniform rows, short) which have lovely deeply scooped spherical tops, tho I am not a fan of their Helvetica legends.
So instead of a set of DSA keycaps (DIN height, spherical top, uniform) as I originally planned, I got DSS keycaps (DIN height, spherical top, sculpted). I love the Gorton Modified legends on Signature Plastics keycaps: as a business they descend from Comptec who made most BBC Micro keycaps.
I think Matt3o’s MTNU Susu keycaps are closer to my ideal, but I missed the group buy period and they have not been manufactured yet. And I wish they had an option for icons on the special keys instead of WORDS. I suspect the MTNU profile will become very popular, like Matt3o’s previous MT3 profile, so there will be chances to get some in the future.
]]>Compared to the usual ANSI layout, backquote is displaced from its common position next to 1. But a proper Unix keyboard should cover the entire ASCII repertoire, 94 printing characters on 47 keys, plus space, in the main block of keys.
To make a place for backquote, we can move delete down a row so it is above return, and put backslash and backquote where delete was.
(Aside: the delete key emits the delete character, ASCII 127, and the return key emits the carriage return character, ASCII 13. That is why I don’t call them backspace and enter.)
This produces a layout similar to the main key block of Sun Type 3, Happy Hacking, and True Fox keyboard layouts.
Personally, I prefer compact keyboards so I don’t have to reach too far for the mouse, but I can’t do without arrow keys. So a 65% keyboard size (5 rows, 16 keys wide) is ideal.
If you apply the Unix layout requirements to a typical ANSI 68-key 65% layout, you get a 69-key layout. I call it unix69. (1969 was also the year Unix started.)
http://www.keyboard-layout-editor.com/#/gists/2848ea7a272aa571d140694ff6bbe04c
I have arranged the bottom row modifiers for Emacs: there are left and right meta keys and a right ctrl key for one-handed navigation. Meta is what the USB HID spec calls the “GUI” key; it sometimes has a diamond icon legend. Like the HHKB, and like Unix workstations made by Apple and Sun, the meta keys are either side of the space bar.
There are left and right fn keys for things that don’t have dedicated keys, e.g. fn+arrows for page up/page down, home, end. The rightmost column has user-programmable macro keys, which I use for window management.
http://www.keyboard-layout-editor.com/#/gists/6610c45b1c12f962e6cf564dc66f220b
ANSI 65% keyboards have caps lock where control should be.
They have an ugly oversized backslash and lack a good place for backquote.
The right column is usually wasted on fixed-function keys.
It’s common for 65% keyboards to have 67 or 68 keys, the missing key making a gap between the modifiers and arrow keys on the bottom row. I prefer to have more rather than fewer modifier keys.
http://www.keyboard-layout-editor.com/#/gists/f1742e8e1384449ddbb7635d8c2a91a5
Matteo Spinelli’s Whitefox / Nightfox “True Fox” layout has top 2 rows similar to unix69. It sometimes has backslash and backquote swapped.
Unfortunately it has caps lock where control should be. Its right column is wasted on fixed-function keys (though the keyboards are reprogrammable so it’s mainly a keycap problem).
On the bottom row, True Fox has two modifiers and a gap between space and arrows, whereas unix69 has three modifiers and no gap.
http://www.keyboard-layout-editor.com/#/gists/c654dc6b4c7e30411cad8626302e309f
The Happy Hacking keyboard layout is OK for a 60% Unix layout. However it lacks a left fn key, and lacks space for full-size arrow keys, so I prefer a 65% layout.
https://dotat.at/graphics/keybird69.jpg
Owing to the difficulty of getting keycaps with exactly the legends I would like, the meta keys on my keybird69 are labelled super and the delete key is labelled backspace. I used F1 to F4 keycaps for the macro keys, tho they are programmed to generate F13 to F16 which are set up as Hammerspoon hot keys.
But otherwise keybird69 is a proper unix69 keyboard.
]]>A couple of years ago I made a BBC Micro tribute keyboard in the runup to the beeb’s 40th anniversary. I called it HHKBeeb:
The HHKBeeb is made from:
I planned to make a beeb-style acrylic sandwich case, but it was too hard to choose a place to get the acrylic cut, so that never happened.
In practice I find 60% keyboards (like the Happy Hacking Keyboard) too small – I need an arrow cluster. So I used the HHKBeeb with a Keybow 2040 macro pad to give me arrows and a few function keys for moving windows around.
My new keyboard is for a Finch and it has 69 keys, so it’s called Keybird69. (I was surprised that this feeble pun has not already been used by any of the keyboards known to QMK or VIA!)
It is made from:
A combination of reasons:
I have been mildly obsessed with compact keyboards practically forever, but back in the 1990s there were no good options available to buy, so I made do without.
The first small keyboard I liked was the (now discontinued) HHKB Lite 2, which has an arrow cluster unlike the pure HHKB. I have a couple of these lurking in the Boxes Of Stuff in the corner. But I’m not a huge fan of the limited modifiers, or the Topre HHKB Lite 2 key switches (they’re a bit mushy), or the styling of the HHKB case.
Correction: the HHKB Lite 2 did not actually use Topre switches.
I gradually used Macs more, and switched to using the Apple Aluminium keyboard - the model A1242 compact wired version, and the model A1314 wireless version. I also switched from a Kensington Expert Mouse trackball to an Apple Magic Trackpad.
But then Apple lost the plot with its input devices, so I thought I should plan to wean myself off. And in the mean time, the custom keyboard scene had flourished into a vibrant ecosystem of open source hardware and software.
So instead of relying on someone else to make a keyboard I like, I could make one myself! My own PCB and switch plate, designed for just the layout I want.
And with QMK open source firmware, I can make good use of the fn key that was so disappointingly unconfigurable on the HHKB and Apple keyboards.
I’m planning to write some more notes about various details of the design:
]]>Here’s an addendum about an alternative model of uniformity.
There are 2^62 double precision floats between 0.0 and 1.0, but as I described before under “the problem”, they are not distributed uniformly: the smaller ones are much denser. Because of this, there are two ways to model a uniform distribution using floating point numbers.
Both algorithms in my previous note use a discrete model: the functions return one of 2^52 or 2^53 evenly spaced numbers.
You can also use a continuous model, where you imagine a uniformly random real number with unbounded precision, and return the closest floating point result. This can have better behaviour if you go on to transform the result to model different distributions (normal, poisson, exponential, etc.)
Taylor Campbell explains how to generate uniform random double-precision floating point numbers with source code. Allen Downey has an older description of generating pseudo-random floating-point values.
In practice, the probability of entering the arbitrary-precision loop in Campbell’s code is vanishingly tiny, so with some small adjustments it can be omitted entirely. Marc Reynolds explains how to generate higher density uniform floats this way, and Pete Cawley has terse implementations that use one or two random integers per double. (Reynolds also has a note about adjusting the range and fenceposts of discrete random floating point numbers.)
]]>0.0 <= n < 1.0 using an unbiased random bit generator and IEEE 754 double precision arithmetic. Both of them depend on details of how floating point numbers work, so before getting into the algorithms I’ll review IEEE 754.
The first algorithm uses bit hacking and type punning. The second uses a hexadecimal floating point literal. They are both fun and elegant in their own ways, but these days the second one is better.
I wrote some follow-up notes on more random floating point numbers discussing another model of uniformity, and making better use of floating point’s dynamic range.
A floating point number has three parts: a sign bit, 11 exponent bits (written eeeeeeeeeee below), and 52 fraction bits (written fff...fff below).
It works like a binary version of decimal scientific notation, so its value is
+/- 1.fff...fff * 2ᴱ
Instead of using two’s complement, the exponent is stored as a biased integer,
E = eeeeeeeeeee - 1023
The maximum value of the exponent bits is 2047, and the bias is half that, so 2⁰ is represented as 1023 in the exponent bits.
(A consequence of using biased integers instead of two’s complement is that IEEE 754 floating point numbers sort correctly if you treat them as sign-magnitude integers.)
In decimal scientific notation, a number is normalized by adjusting its exponent so its leftmost non-zero digit (most significant digit) is immediately to the left of the decimal point.
Similarly, a floating point number has its leftmost non-zero bit immediately to the left of its binary point. Because a non-zero bit is always 1, there is no need to store it explicitly. So double-precision numbers actually have 53 significant bits, of which 52 appear in their representation.
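To make the layout concrete, here is a small sketch in C that pulls the three fields out of a double; it uses the same memmove() type-punning trick that appears later in the bithack algorithm, and the function name is mine, not from any library.

#include <stdint.h>
#include <string.h>

void double_parts(double d, int *sign, int *exponent, uint64_t *fraction) {
    uint64_t u;
    memmove(&u, &d, sizeof(u));                  /* reinterpret the bits */
    *sign = (int)(u >> 63);                      /* 1 sign bit */
    *exponent = (int)((u >> 52) & 0x7FF) - 1023; /* 11 biased exponent bits */
    *fraction = u & ((1ULL << 52) - 1);          /* 52 explicit fraction bits */
}

For normal (not subnormal or zero) values, E is the stored exponent field minus the bias of 1023, matching the formula above.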
To get a uniform distribution of random floating point numbers 0.0 <= n < 1.0:
half of the results should be 0.5 <= n < 1.0, with E = -1;
a quarter should be 0.25 <= n < 0.5, with E = -2;
an eighth should be 0.125 <= n < 0.25, with E = -3;
and so on.
To solve this problem we need to take control of integer to floating point conversion.
The algorithm implemented by the FPU to convert a nonzero unsigned integer to double precision floating point is roughly: shift the number so that its most significant set bit lands just to the left of the binary point, adjust the exponent to account for the shift, then round the result to 53 significant bits.
Observe that floating point numbers 1.0 <= n < 2.0 span a range of the same size as the one we want, but all the numbers have the same exponent E = 0. This allows us to avoid all the exponent shenanigans I described in the previous sections.
So the idea is to use bitwise integer operations to put together a number with:
eeeeeeeeeee = 1023 = 0x3FF so that E = 0;
fff...fff filled with unbiased random bits.
However, C’s strict aliasing rules make it difficult to do this kind of reinterpretation. For example, it is often done using a union, but the C standard says that is undefined behaviour. But there is a loophole in the rules.
double pcg64_bithack_double(pcg64_t *rng) {
    double d;
    uint64_t u = pcg64_random(rng);
    /* 52 bits of randomness in the little end */
    u = u >> 12;
    /* set exponent to 0 + bias = 1023 */
    u = u | (1023ULL << 52);
    /* reinterpret as floating point */
    memmove(&d, &u, sizeof(d));
    return(d - 1.0);
}
Modern compilers will inline the memmove() so the function ends up being compiled correctly and efficiently.
In practice, using a union is OK because too much code would break if compilers were as strict as they could be, and a union will be more efficient with older compilers.
C99 added support for hexadecimal floating point literals, which look like
0xHHHH.HHHHp+dddd
Slightly weirdly, while the fraction is written in hex, the exponent (p for power of 2) is written in decimal.
Hex floating literals are useful when it’s important to get exact control over the bit pattern in the number, which is much harder if you have to go via decimal.
The idea is to get a random N-bit integer,
/* 53 bits of randomness in the little end */
uint64_t u = pcg64_random(rng) >> 11;
convert it to a floating point number, 0.0 <= d < 2ᴺ,
double d = (double)u;
then divide it by the maximum range of the integer, 2ᴺ, so the result is between 0.0 and 1.0. Or get the same effect by multiplying by 2⁻ᴺ.
d *= 0x1.0p-53;
Note that this solution returns 53 bits of randomness because it is able to make use of the implicit most-significant bit, whereas the bithack solution only has 52 bits of randomness. The 53rd bit ends up being encoded in the exponent when the integer is converted to floating point.
As well as using a hex floating literal, this solution assumes that double precision multiplication is fast. Which is true, because it has been the subject of important benchmarks for decades.
Putting it together gives us this beautifully lucid one-liner.
double pcg64_random_double(pcg64_t *rng) {
    return((double)(pcg64_random(rng) >> 11) * 0x1.0p-53);
}
]]>math/rand/v2 for Golang’s standard library. It mentioned a new-ish flavour of PCG random number generator which I had not previously encountered, called PCG64 DXSM. This blog post collects what I have learned about it. (I have not found a good summary elsewhere.)
At the end there is source code for PCG64 DXSM that you can freely copy and use.
Here’s the bit where the author writes their life story, and all the readers skip several screens full of text to get to the recipe.
Occasionally I write randomized code that needs a pseudorandom number generator that is:
The various libc random number generators can satisfy one or two of my requirements, if I am lucky.
So I grab a copy of PCG and use that. It’s only like 10 lines of code, and PCG’s creator isn’t an arsehole.
PCG random number generators are constructed from a collection of linear congruential generators, and a collection of output permutations.
A linear congruential random number generator looks like:
state = state * mul + inc;
The multiplier mul is usually fixed; the increment inc can be fixed, but PCG implementations usually allow it to be chosen when the RNG is seeded.
A bare LCG is a bad RNG. PCG turns an LCG into a good RNG by using an LCG with more state bits than output bits, and scrambling the state into the output with a permutation that hides the LCG’s weaknesses.
PCG’s output permutations have abbreviated names like XSH (xor-shift), RR (random rotate), RXS (random xor-shift), XSL (xor shift right [sic]), etc.
The reference implementation of PCG in C++ allows you to mix and match LCGs and output permutations at a variety of integer sizes. There is a bewildering number of options, and the PCG paper explains the desiderata at length. It is all very informative if you are a researcher interested in exploring the parameter space of a family of random number generators.
But it’s all a bit too much when all I want is a replacement for rand() and srand().
For 32-bit random numbers, PCG has a straightforward solution in the form of pcg_basic.c. In C++ PCG this standard 32-bit variant is called pcg_engines::setseq_xsh_rr_64_32 or simply pcg32 for short.
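As a concrete illustration of the LCG-plus-output-permutation structure, here is roughly what the pcg32 generator looks like. This is a paraphrase for illustration, with my own names; use the real pcg_basic.c rather than copying this sketch.

#include <stdint.h>

typedef struct pcg32 { uint64_t state, inc; } pcg32_t;

uint32_t pcg32_random(pcg32_t *rng) {
    /* 64-bit linear congruential generator advances the state */
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    /* XSH RR output permutation: xor-shift the high bits down, then
       rotate by an amount chosen from the topmost bits of the old state */
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18) ^ oldstate) >> 27);
    uint32_t rot = (uint32_t)(oldstate >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

The 64 bits of state feeding a 32-bit result is what gives the output permutation room to hide the LCG’s flaws.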
(There is a caveat, tho: pcg32_boundedrand_r() would be faster if it used Daniel Lemire’s nearly-divisionless unbiased rejection sampling algorithm for bounded random numbers.)
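For the curious, here is a sketch of Lemire’s technique. It assumes the pcg32_random() helper sketched above (or any unbiased 32-bit generator, with a nonzero bound), and it is my paraphrase of the published algorithm rather than drop-in pcg code.

uint32_t pcg32_bounded(pcg32_t *rng, uint32_t bound) {
    /* multiply a random 32-bit value by the bound; the high half of the
       64-bit product is an almost-unbiased result in [0, bound) */
    uint64_t m = (uint64_t)pcg32_random(rng) * (uint64_t)bound;
    uint32_t low = (uint32_t)m;
    if (low < bound) {
        /* reject the few low products that would introduce bias; the
           threshold is 2^32 mod bound, computed as (-bound) % bound
           in 32-bit unsigned arithmetic */
        uint32_t threshold = -bound % bound;
        while (low < threshold) {
            m = (uint64_t)pcg32_random(rng) * (uint64_t)bound;
            low = (uint32_t)m;
        }
    }
    return (uint32_t)(m >> 32);
}

The slow % only runs when the first product lands in the small biased region, which is rare when the bound is much smaller than 2³².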
For 64-bit random numbers it is not so simple.
There is no 64-bit equivalent of pcg_basic.c. The reference implementations have a blessed flavour called pcg64, but it isn’t trivial to unpick the source code’s experimental indirections to get a 10 line implementation.
And even if you do that, you won’t get the best 64-bit flavour, which is PCG64 DXSM.
PCG64 DXSM is used by NumPy. It is a relatively new flavour of PCG, which addresses a minor shortcoming of the original pcg64 that arose in the discussion when NumPy originally adopted PCG.
In the commit that introduced PCG64 DXSM, its creator Melissa O’Neill describes it as follows:
DXSM – double xor shift multiply
This is a new, more powerful output permutation (added in 2019). It’s a more comprehensive scrambling than RXS M, but runs faster on 128-bit types. Although primarily intended for use at large sizes, also works at smaller sizes as well.
As well as the DXSM output permutation, pcg64_dxsm() uses a “cheap multiplier”, i.e. a 64-bit value half the width of the state, instead of a 128-bit value the same width as the state. The same multiplier is used for the LCG and the output permutation. The cheap multiplier improves performance: pcg64_dxsm() has fewer full-size 128 bit calculations.
O’Neill wrote a longer description of the design of PCG64 DXSM, and the NumPy documentation discusses how PCG64DXSM improves on the old PCG64.
In C++ PCG, PCG64 DXSM’s full name is pcg_engines::cm_setseq_dxsm_128_64. As far as I can tell it doesn’t have a more friendly alias. (The C++ PCG typedef pcg64 still refers to the previously preferred xsl_rr variant.)
In the Rust rand_pcg crate PCG64 DXSM is called Lcg128CmDxsm64, i.e. a linear congruential generator with 128 bits of state and a cheap multiplier, using the DXSM permutation with 64 bits of output.
That should be enough search keywords and links, I think.
OK, at last, here’s the recipe that you were looking for:
// SPDX-License-Identifier: 0BSD OR MIT-0

#include <stdint.h>

/* see below: clang and gcc provide __uint128_t on 64-bit targets */
typedef __uint128_t uint128_t;

typedef struct pcg64 {
    uint128_t state, inc;
} pcg64_t;

uint64_t pcg64_dxsm(pcg64_t *rng) {
    /* cheap (half-width) multiplier */
    const uint64_t mul = 15750249268501108917ULL;
    /* linear congruential generator */
    uint128_t state = rng->state;
    rng->state = state * mul + rng->inc;
    /* DXSM (double xor shift multiply) permuted output */
    uint64_t hi = (uint64_t)(state >> 64);
    uint64_t lo = (uint64_t)(state | 1);
    hi ^= hi >> 32;
    hi *= mul;
    hi ^= hi >> 48;
    hi *= lo;
    return(hi);
}
The algorithm for seeding PCG takes some raw seed values and conditions them to make a new RNG state that is “not ludicrous” (in the words of Simon Tatham). It is the same for all flavours of PCG:
pcg64_t pcg64_seed(pcg64_t rng) {
    /* must ensure rng.inc is odd */
    rng.inc = (rng.inc << 1) | 1;
    rng.state += rng.inc;
    pcg64_dxsm(&rng);
    return(rng);
}
You can pass pcg64_seed() a structure literal containing the raw seed values, or use it like:
pcg64_t pcg64_getentropy(void) {
    pcg64_t rng;
    if (getentropy(&rng, sizeof(rng)) < 0)
        err(1, "getentropy");
    return (pcg64_seed(rng));
}
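To show how the pieces fit together, here is a hypothetical usage sketch; the includes and main() are mine, not part of the recipe.

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    pcg64_t rng = pcg64_getentropy();
    /* print a few 64-bit random values in hex */
    for (int i = 0; i < 3; i++)
        printf("%016" PRIx64 "\n", pcg64_dxsm(&rng));
    return 0;
}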
The code above works on 64-bit systems with clang and gcc, which have __uint128_t built in.
If you need a C implementation that works on systems without a handy 128-bit integer type, then the PCG64 implementation in NumPy has its own support for 128-bit arithmetic.
Or if you are using C++ then the reference implementation of PCG has another 128-bit arithmetic class.
]]>There were a couple of things that I thought would make a fun talk:
A curious detail I have not found an answer to is why Louis Essen teamed up with William Markowitz to calibrate his clock, not someone more local such as the Astronomer Royal. It’s clear that Markowitz had a good deal of enthusiasm for the project, but I wonder if there were logistical reasons too: at that time the Royal Greenwich Observatory was moving to Herstmonceux.
I did not mention any satellite navigation systems other than GPS. I don’t know much about Galileo’s ground segment (its equivalent to Schriever SFB and the USNO) - I should really read more of Bert Hubert’s writing on the topic!
The layers I was peeling back are all below the complications of time zones; I couldn’t fit them neatly into the narrative. But there’s an interesting parallel (as it were) with the International Meridian Conference, which is the basis of the standard time zones.
Years ago I read the proceedings of the meridian conference, and it struck me that the French were quite opinionated, and grumpy that Greenwich had the lead. The Americans, who were the conveners, preferred Greenwich as the status quo. The British were relatively quiet - perhaps they lobbied off the record, or were just smug?
In the end, Greenwich won largely because it was already the meridian used by the majority of nautical charts. But it’s another example of international science being led by the French and Americans.
I think I was a bit unfair to the Paris Observatory, which had a more important role than my talk suggested.
Before responsibility for international time was split in the 1980s between the IERS and the BIPM, there was the Bureau International de l’Heure, which was responsible for both earth rotation and atomic time. It was based at the Paris Observatory.
Relatedly, I skipped over the awkward history of atomic time between the 1950s and 1970s (and today). I previously wrote about that in my update on leap seconds last year.
Finally, the last slide has the punch line that the Royal Greenwich Observatory did not show up in the story at all. I could have delivered the joke better: I think it needed a bit more context to land effectively.
I think it’s funny to deliver this story as an English person who lives almost on top of the Greenwich meridian; in fact for the last few years of its life, the rump of the RGO was based in Cambridge, and after it finally closed, my spouse Rachel worked in Greenwich House.
]]>I wrote a follow-up note, “Where does ‘where does my computer get the time from?’ come from?” about some things I left out of the talk.
Where does my computer get the time from?
from NTP - here’s a picture of an NTP packet
and here’s a picture of David Mills who invented NTP
simple question, easy answer, end of talk? No!
let’s peel off some layers…
stratum 3 NTP servers get the time from stratum 2 NTP servers,
stratum 2 NTP servers get the time from stratum 1 NTP servers,
stratum 1 NTP servers get the time from some reference clock
maybe a radio signal such as MSF in Britain or DCF77 in Germany
but in most cases the reference clock is probably a GPS receiver
here’s a GPS timing receiver
and here’s a GPS satellite
where does GPS get the time from?
Schriever Space Force Base in Colorado
they look after a lot of different top secret satellites and other stuff at Schriever, as you can see from all the mission logos
so you can’t get close enough to take a nice photo
Where does Schriever SFB get the time from?
the US Naval Observatory Alternate Master Clock is on site at Schriever in Colorado
the US Naval Observatory Alternate Master Clock gets the time from the US Naval Observatory in Washington DC
there are three answers
the first answer is atomic clocks, lots of atomic clocks
in the background there are dozens of rack mounted caesium beam clocks
in the foreground the black boxes house hydrogen masers
these shiny cylinders are rubidium fountains
the USNO has so many atomic clocks they have entire buildings dedicated to them
When I was preparing this talk I noticed on Apple Maps that there’s a huge building site in the middle of the USNO campus. It turns out they are building a fancy new clock house; the main limit on the accuracy of their clocks is environmental stability: temperature, humidity, etc. so the new building will have serious air handling.
the second answer is that UTC is a horrible compromise between time from atomic clocks and time from earth rotation
so the USNO gets the time from the international earth rotation service, which is based at the Paris Observatory
twice a year the IERS sends out Bulletin C, which says whether or not there will be a leap second in six months time; leap seconds are added (or maybe removed) from UTC to keep it in sync with earth rotation
the IERS is spread across several organizations which contribute to its scientific work
for example, you can subscribe to IERS Bulletin A, which is a weekly notice with precise details of the earth orientation parameters
Bulletin A is sent out by the US Naval observatory
they need to know the exact orientation of the earth under the GPS satellites, so they can provide precise positioning
the third answer is, how does the USNO know its atomic clocks are working well?
that information comes from the international bureau of weights and measures in Paris, who maintain the global standard UTC
how does the BIPM determine what UTC is?
the BIPM collects time measurements from national timing laboratories around the world, and uses those measurements to determine official UTC
periodically they send out Circular T which has information about the discrepancies between official UTC and UTC from the various national time labs
the BIPM is responsible for maintaining the international system of units, which is defined by the general conference on weights and measures
the CGPM is an international treaty organization established by the metre convention of 1875
UTC is an implementation of the SI unit of time, based on quantum measurements of caesium atoms
where did this magic number, about 9.2 GHz, come from?
in 1955, Louis Essen (on the right) and Jack Parry (left) built the first caesium atomic clock
the current definition of the second came from the calibration of this clock
before atomic clocks, the definition of the second was based on astronomy, so Essen and Parry needed help from astronomers to find out how fast their clock ticks according to the existing standard of time
they got help from the astronomers at the US Naval Observatory
the way it worked was William Markowitz measured time by looking at the skies, and Louis Essen measured time by looking at his atomic clock, and to correlate their measurements, they both listened to the WWV radio time signal broadcast by the national bureau of standards in Washington DC
this project took 3 years, 1955 - 1958
Markowitz was measuring the “ephemeris second”
in 1952 the international astronomical union changed the definition of time so that instead of being based on the rotation of the earth about its axis, it was based on the orbit of the earth around the sun
in the 1930s they had discovered that the earth’s rotation is not perfectly even: it slows down and speeds up slightly
clocks were now more precise than the rotation of the earth, so the ephemeris second was a new more precise standard of time
the ephemeris second is based on an astronomical ephemeris, which is a mathematical model of the solar system
the standard ephemeris was produced by Simon Newcomb in the late 1800s
he collected a vast amount of historical astronomical data to create his mathematical model
it remained the standard until the mid 1980s
here’s a picture of Simon Newcomb
he is a fine-looking Victorian gentleman
where did he work?
at the US naval observatory!
(and the US nautical almanac office)
I have now run out of layers: before this point, clocks were set more straightforwardly by watching stars cross the sky
so, to summarise my talk, where does my computer get the time from?
it does not get it from the Royal Greenwich Observatory!
About 50 people gathered with several ideas for potential projects: things like easier DNSSEC provisioning, monitoring DNS activity in the network, what is the environmental cost of the DNS, …
At the start of the weekend we were asked to introduce ourselves and say what our goals were. My goal was to do something different from my day job working on BIND. I was successful, tho I did help some others out with advice on a few of BIND’s obscurities.
The team I joined was very successful at producing a working prototype and a cool demo.
The project that was the second most interesting to me was “DNS OOPS”, out-of-protocol signalling. The idea there was to find out things like when a zone has been loaded and is ready to serve, so that it can be added to a BGP route advertisement.
I talked about it with Willem Toorop, and I said OOPS sounded close to what I had done with nsnotifyd, but OOPS would need a little more information added to the NOTIFY messages, for instance if you want to monitor key rollovers. The OOPS team ended up using nsnotifyd as part of their demo, and Lars-Johann Liman reported a usability bug to me, which I fixed.
I joined a team to work on Emile Aben’s idea for a scripting language that could run on RIPE Atlas measurement probes. Instead of collecting voluminous data that later needs to be reduced in a complicated manner, a script on the probe could strip out unwanted data before collection.
I suggested that Starlark might be a good language to use. It’s a simplified dialect of Python designed for sandboxed scripts embedded in applications. It was originally created for the Bazel build system, but it’s a general purpose alternative to Tcl, Lua, Guile, etc.
Annika Hennig suggested WASM as an alternative, which sounded cool to me. But it was too difficult to get a WASM runtime up and running with functions plugged in for DNS resolution. So we switched to the Golang implementation of Starlark.
We got Starlark to invoke dig or the RIPE Atlas tool evdig. I suggested using dig +yaml, but we ended up using jc --dig, which was new to me.
Annika Hennig got us started with Starlark-go, and made an awesome front end based on Codepen that could dynamically deploy scripts to probes. Raffaele Sommese wrote a bunch of Starlark scripts, and got it working with evdig. Jonas Andersson set up a RIPE Atlas probe onto which we could install our prototype software, and sketched out some security considerations. Emile gathered use-cases and wrote the presentation. I hooked more functions up to Starlark and examined the security of the Starlark implementation and our prototype code.

The demo was deploying a Starlark script from the Codepen front-end to a couple of probes; the script monitored changes to the .com SOA serial number across gtld-servers.net.
Starlark turned out to be nice to work with, and Golang is a good hackathon language. I have not used Golang much before so I didn’t write much code; I was much more successful at coming up with ideas and suggestions.
For the context of this project, it looks like Starlark will be more accessible for data scientists than the possible alternatives, as I hoped.
Since I started work at ISC my main project has been to adapt the NSD prototype into a qp-trie for use in BIND. The ultimate aim is to replace BIND's red-black tree database, its in-memory store of DNS records.
Yesterday I merged the core qp-trie implementation into BIND so it’s a good time to write some blog notes about it.
The core of the design is still close to what I sketched in 2021 and implemented in NSD, so these notes are mostly about what’s different, and the mistakes I made along the way…
A qp-trie is a kind of radix tree, derived from Dan Bernstein’s crit-bit tree (one two) and Phil Bagwell’s HAMT (one two).
A qp-trie supports lexically ordered lookups on string keys, such as longest match (e.g., to find which zone can answer a DNS query), predecessor (e.g., to find an NSEC record for a DNSSEC proof of nonexistence), etc.
The version in BIND supports lock-free reads and strictly serialized transactional writes. It has multi-version concurrency with snapshots and rollbacks.
It typically uses about 20 bytes of memory (often less) per object stored in the trie. When keys are normal DNS names, an interior branch in the trie corresponds to one character in the name. A trie containing a million items can sustain millions of lookups per second per core.
NSD was a good place to implement the prototype, because it avoids a lot of the complexity of BIND.
NSD doesn’t use its tree data structures for as many different things as BIND does, so it didn’t need much support for polymorphism;
NSD is authoritative-only, so it doesn’t have to support the kind of high-frequency fine-grain updates that a DNS cache is subjected to;
NSD has a multi-process architecture, so I could handwave the details of multithreading and pretend to myself that it was enough to get the copy-on-write logistics working.
So the goals of the project were to fill in these gaps. Simples!
In BIND, when you create a qp-trie, you give it a few methods which tell the trie how to work with the leaf objects you are going to store in it. (See dns_qpmethods_t.)

attach() and detach() for reference counting:
When an interior leaf node is duplicated for copy-on-write, it may just be a sibling of the node where the modification is happening. The leaf objects hanging off the trie are much bigger than interior nodes, so we don't want to duplicate them needlessly. Hence, we keep track of them using reference counting, like memory management in the rest of BIND.
makekey() to get the qpkey corresponding to the leaf object:
This turned out to be unexpectedly nice. In almost all cases, a qp-trie key will be derived from a DNS name stored somewhere in the leaf object. But instead of just getting the name, makekey() is responsible for turning the name into a key, usually by calling dns_qpkey_fromname(). So there's no memory management difficulty if the method needs to temporarily rehydrate the name, and the key could even be derived from some non-DNS-name string.
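To make that concrete, here's a hedged sketch of a makekey-style method in C. The types and signature are placeholders of my own, not the real dns_qpmethods_t declarations; the point is just that the method fills in a caller-provided key buffer, so there is nothing for the caller to allocate or free.

#include <stddef.h>
#include <string.h>

typedef unsigned char my_qpkey[512];    /* stand-in for dns_qpkey_t */

struct my_leaf {
        char name[256];                 /* the DNS name owned by the leaf object */
        /* ... DNS records, etc ... */
};

static size_t
my_makekey(my_qpkey key, void *uctx, void *pval)
{
        (void)uctx;
        struct my_leaf *leaf = pval;
        /* in BIND this would call dns_qpkey_fromname() on the leaf's
         * name; here we just copy bytes as a placeholder */
        size_t len = strlen(leaf->name);
        memcpy(key, leaf->name, len);
        return len;
}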
For cases where multithreading isn't needed, BIND has a bare dns_qp_t type which saves a bit of space. I'm not sure it's going to get much use, but I have used it in tests to deduplicate randomly generated names.
All multithreaded access is done in the context of a transaction. When you start a modify transaction, you get a pointer to the internal dns_qp_t which you can use just like the single-threaded version. For a read-only transaction, you get a pointer to a specific type depending on the kind of transaction.
All the read-only operations on a qp-trie can take any of these types, because the types share a common prefix. There's a neat GNU C extension to make this kind of polymorphism safer: you can declare a transparent union of the compatible pointer types, and any type in the union can be passed to the functions without casting. Unlike a void * argument, passing other types makes the compiler report an error.
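Here's a minimal, self-contained illustration of the transparent union trick, using made-up handle types rather than BIND's real ones:

#include <stdio.h>

/* The read-only handle types share a common prefix, so a function that
 * only reads the trie can accept any of them. */
typedef struct qp_query    { void *root; /* ...more shared prefix... */ } qp_query_t;
typedef struct qp_snapshot { void *root; /* ...more shared prefix... */ } qp_snapshot_t;

typedef union {
        qp_query_t    *query;
        qp_snapshot_t *snapshot;
} qp_readable_t __attribute__((__transparent_union__));

/* Callers pass a qp_query_t * or qp_snapshot_t * directly, no cast
 * needed; any other pointer type is a compile-time error, unlike a
 * void * parameter. */
static void *
qp_getroot(qp_readable_t handle)
{
        return handle.query->root;      /* same initial fields either way */
}

int
main(void)
{
        qp_query_t q = { .root = "query root" };
        qp_snapshot_t s = { .root = "snapshot root" };
        printf("%s\n%s\n", (char *)qp_getroot(&q), (char *)qp_getroot(&s));
        return 0;
}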
There are 4 kinds of transactions:
query transactions are lightweight and read-only
snapshots are heavyweight and read-only
update transactions are for heavyweight modifications
write transactions are for lightweight modifications
The query transaction is the workhorse. It is lightweight because it is completely lock-free, relying on QSBR.
For years I was under the misapprehension that two versions of each trie would be enough: one stable shared read-only one, and an exclusive mutable one. However, two is not enough for lock-free read-only access and decent write performance, because there can be many modify transactions during a QSBR “grace period”, and each one creates a new version of the trie.
I wanted to avoid having to keep track of a fresh memory allocation for each version. The solution I came up with is to allocate read-only versions as a special type of node, so they can be cleaned up by the existing machinery for interior nodes.
A read-only version consists of a reference to the root of the trie, a pointer to a table of memory chunks, and a pointer to the trie object it belongs to. The root reference is unique per version; it identifies which cell within which chunk contains the root node. The chunk table is typically shared by many versions of the trie, so it has a reference count so we know when it is no longer needed. All we need the trie object for is the method table and user context pointer; it’s also used for some safety checks.
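As a rough sketch (my guess at the shape, based on the description above, not BIND's actual layout), a read-only version might look something like this:

#include <stdint.h>

struct qp_chunktable;   /* reference-counted table of memory chunks, shared between versions */
struct qp_trie;         /* for the method table, user context pointer, and safety checks */

typedef struct qp_version {
        uint32_t              root_ref; /* which cell of which chunk holds the root node */
        struct qp_chunktable *chunks;   /* shared by many versions; refcounted */
        struct qp_trie       *trie;
} qp_version_t;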
The way I integrated QSBR into BIND means that qp-trie queries can only live as long as one uv_loop callback. This is fine for most DNS requests: the main exception is zone transfers, which might need to keep a stable view of one version of a zone for several seconds. For that they need a snapshot.
Snapshots are heavyweight, because to create one we need to briefly get the qp-trie’s mutex in order to copy some of its metadata into a new allocation.
A qp-trie snapshot has its own memory allocation, so it can live as many loops as necessary for a zone transfer to complete. Originally, I simply suppressed memory reclamation while a trie had any snapshots, until Paul Khuong pointed out that this would lead to trouble if (in DNS terms) a zone was subject to a heavy load of outgoing zone transfers (i.e. snapshots exist all the time) and concurrent UPDATEs or incoming zone transfers. To cope with that, snapshots are now aware of exactly which chunks they use, so the qp-trie knows which chunks become free when a snapshot is destroyed.
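The chunk tracking could be as simple as a bitmap per snapshot; this is an illustrative guess at the mechanism, not the actual data structure:

#include <stdbool.h>
#include <stdint.h>

/* One bit per chunk: a chunk is reclaimable only when neither the live
 * trie nor any remaining snapshot has its bit set. */
#define MAX_CHUNKS 1024

typedef struct qp_snapshot_chunks {
        uint8_t used[MAX_CHUNKS / 8];
} qp_snapshot_chunks_t;

static void
snapshot_mark(qp_snapshot_chunks_t *snap, unsigned chunk)
{
        snap->used[chunk / 8] |= (uint8_t)(1u << (chunk % 8));
}

static bool
snapshot_uses(const qp_snapshot_chunks_t *snap, unsigned chunk)
{
        return (snap->used[chunk / 8] >> (chunk % 8)) & 1u;
}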
Update transactions are named after the DNS UPDATE opcode for making dynamic updates to zones. They are heavyweight in a couple of ways.
It is fairly common for authoritative DNS servers to have huge numbers of zones, most of which have only a couple of names (e.g. example.com and www.example.com). So it would be enormously wasteful to allocate for every zone a whole chunk of 1024 nodes and use only 0.5% of them.
So, when an update transaction commits, it compacts the trie to use as little space as possible, and uses realloc() to shrink the last chunk so it is only just as big as necessary.
The other thing is that update transactions can be rolled back. More than once I made the mistake of underestimating how much of the qp-trie allocator metadata needs to be restored in order to roll back correctly, which made me quite irritated with myself. In the qp-trie code that was merged this week, rollback support is both more correct and more straightforward than my previous attempts: when an update transaction is opened, it allocates a fresh copy of the trie’s metadata; when the transaction commits, the rollback copy is discarded; when the transaction is rolled back, the trie metadata is overwritten with its rollback copy.
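The rollback scheme is easy to sketch. The names and the single metadata struct below are hypothetical, and the real metadata is richer, but the open/commit/rollback life cycle is as described above:

#include <stdlib.h>

typedef struct qp_meta {
        size_t chunk_count;
        size_t bump;            /* allocation point in the latest chunk */
        /* ... free lists, usage counters, ... */
} qp_meta_t;

typedef struct qp_update {
        qp_meta_t  live;        /* metadata the transaction mutates */
        qp_meta_t *rollback;    /* copy taken when the transaction opened */
} qp_update_t;

static int
update_open(qp_update_t *tx)
{
        tx->rollback = malloc(sizeof(*tx->rollback));
        if (tx->rollback == NULL)
                return -1;
        *tx->rollback = tx->live;       /* fresh copy of the trie's metadata */
        return 0;
}

static void
update_commit(qp_update_t *tx)
{
        free(tx->rollback);             /* the rollback copy is discarded */
        tx->rollback = NULL;
}

static void
update_rollback(qp_update_t *tx)
{
        tx->live = *tx->rollback;       /* overwrite metadata with its rollback copy */
        free(tx->rollback);
        tx->rollback = NULL;
}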
Write transactions are lightweight because they avoid allocation (so no rollback support) and they do not eagerly compact the trie. They are intended for the DNS cache, which (compared to authoritative zones) has a high rate of small changes.
Allocating interior nodes for a qp-trie is very cheap with a chunk-based memory layout: just increment an index into the latest chunk and check that it has not run out of space. An update transaction has to allocate a fresh allocation chunk each time, because the previous one got shrunk; instead, write transactions re-use the same chunk until it fills up.
Another nice aspect of the chunk-based memory layout is that the metadata is per-chunk, not per node, saving a lot of overhead. One of the bits of metadata is a flag indicating that a chunk contains nodes from previous transactions and is therefore immutable. (Modifications need to use copy-on-write, not overwrite-in-place.) But one bit is not enough when write transactions are re-using chunks!
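A bump allocator over fixed-size chunks is only a few lines. This sketch uses made-up sizes and field names, but shows the cheap allocation path and the per-chunk immutable flag:

#include <stdbool.h>
#include <stdint.h>

#define CHUNK_NODES 1024

typedef struct qp_node { uint64_t word[2]; } qp_node_t;   /* placeholder node layout */

typedef struct qp_chunkmeta {
        uint32_t used;          /* bump pointer: next free cell in the chunk */
        bool     immutable;     /* written by earlier transactions: copy-on-write only */
} qp_chunkmeta_t;

/* Allocate `count` nodes from the latest chunk; returns the cell index,
 * or UINT32_MAX if a new chunk must be started. */
static uint32_t
chunk_alloc(qp_chunkmeta_t *meta, uint32_t count)
{
        if (meta->immutable || meta->used + count > CHUNK_NODES)
                return UINT32_MAX;
        uint32_t cell = meta->used;
        meta->used += count;
        return cell;
}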
One of my more hilariously frustrating bugs was to do with an extra-strict debug-mode qp-trie. Instead of immutability being just a flag, it can be enforced in hardware using mprotect(). This is a very effective way to make my computer point out to me that I have got some of the fine details of write transactions wrong.
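The hardware-enforced version of the immutable flag is a straightforward use of mprotect(), assuming page-aligned chunks; this sketch is illustrative rather than BIND's actual code:

#include <stdlib.h>
#include <sys/mman.h>

/* Once a chunk is sealed, a stray write faults immediately. Chunks must
 * be page-aligned and a whole number of pages; 4096-byte pages assumed. */
#define CHUNK_BYTES (16 * 4096)

static void *
chunk_new(void)
{
        return aligned_alloc(4096, CHUNK_BYTES);
}

static void
chunk_seal(void *chunk)
{
        mprotect(chunk, CHUNK_BYTES, PROT_READ);                /* reads OK, writes fault */
}

static void
chunk_unseal(void *chunk)
{
        mprotect(chunk, CHUNK_BYTES, PROT_READ | PROT_WRITE);   /* before recycling or freeing */
}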
A big disadvantage of cheap sequential memory allocation is that it usually makes it more expensive to reclaim memory. This happens in a couple of ways in BIND’s qp-trie:
compaction, which is necessary to evacuate chunks that are nearly empty, so the trie isn’t holding on to memory that is mostly unused.
scanning chunks before they are free()d, to decrement refcounts.
Both of these can take ages, which I am not very happy about, but both of them are designed not to happen in performance-sensitive situations. I guess we’ll see if they turn out to be slow enough to be painful.
The core qp-trie and QSBR code (in libdns and libisc) consists of:
There is some design and overview documentation for the qp-trie and QSBR:
And in the tests,
I’m reasonably happy with the way the code has turned out. It has taken about twice as long as I hoped to get to this point, for two main reasons:
liburcu, so I designed and wrote a QSBR implementation and redesigned the qp-trie for good performance with safe memory reclamation.

The last few weeks I have started to put this new qp-trie to work as a replacement for BIND's zone table. Every query has to go through the zone table, so it is crucial for good performance, but it is also relatively simple. The main thing I hope to learn from this exercise is how to get all this new machinery to shut itself down gracefully.
I am looking forward to seeing some practical results!
__builtin_longjmp! There's a small aside in this article about the signal mask, which skates past another horrible abyss: one that might even make it sensible to DIY longjmp.
Some of the nastiness can be seen in the POSIX rationale for sigsetjmp, which says that on BSD-like systems, setjmp and _setjmp correspond to sigsetjmp and setjmp on System V Unixes. The effect is that setjmp might or might not involve a system call to adjust the signal mask. The syscall overhead might be OK for exceptional error recovery, such as Chris's arena out of memory example, but it's likely to be more troublesome if you are implementing coroutines.
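The portable way to avoid the ambiguity is to say what you mean with sigsetjmp(), whose savemask argument makes the signal-mask (and syscall) cost explicit; a minimal sketch:

#include <setjmp.h>
#include <stdio.h>

static sigjmp_buf env;

int
main(void)
{
        /* savemask = 0: don't save/restore the signal mask, so there's
         * no syscall overhead; pass 1 if you longjmp out of a signal
         * handler and need the mask restored. */
        if (sigsetjmp(env, 0) != 0) {
                puts("returned via siglongjmp");
                return 0;
        }
        siglongjmp(env, 1);
}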
But why would they need to mess with the signal mask? Well, if you are using BSD-style signals or you are using sigaction correctly, a signal handler will run with its signal masked. If you decide to longjmp out of the handler, you also need to take care to unmask the signal. On BSD-like systems, longjmp does that for you.
The problem is that longjmp out of a signal handler is basically impossible to do correctly. (There's a whole flamewar in the WG14 committee documents on this subject.) So this is another example of libc being optimized for the unusual, broken case at the cost of the typical case.
McKenney goes on to propose an RCU classification system based on the API an implementation provides to its users. (I am curious that the criteria do not involve how RCU works.)
Here’s how I would answer the questions for QSBR in BIND:
Are there explicit RCU read-side markers?
No, it relies on libuv callbacks to bound the lifetime of a read-side critical section.

Are grace periods computed automatically?
Yes. There is an internal isc__qsbr_quiescent_state() function, but that mainly exists to separate the QSBR code from the event loop manager, and for testing purposes, not for use by higher-level code.

Are there synchronous grace-period-wait APIs?
No. (Because they led me astray when designing a data structure to use RCU.)

Are there asynchronous grace-period-wait APIs?
Yes, but instead of one-shot call_rcu(), a subsystem (such as the qp-trie code) registers a permanent callback (isc_qsbr_register()), and notifies the QSBR when there is work for the callback to do (isc_qsbr_activate()). This avoids having to allocate a thunk on every modification, and it automatically coalesces reclamation work. (There's a sketch of this pattern after the list.)

If so, are there callback-wait APIs?
No. At the moment, final cleanup work is tied to event loop teardown.

Are there polled grace-period-wait APIs?
No.

Are there multiple grace-period domains?
One per event loop manager, and there's only one loopmgr.
My qp-trie code supports concurrent readers and at most one writer at a time. Writes are strictly serialized.
The old code reclaimed memory synchronously, immediately after committing a modify transaction. It could be built to use one of two different concurrency control mechanisms:
A reader/writer lock
This has poor read-side scalability, because every thread is hammering on the same shared location. Writers perform reasonably well (BIND’s rwlock prefers writers) but a writer blocks all readers when committing, just long enough to swap the trie’s root pointer.
liburcu, userland read-copy-update
RCU has a fast and scalable read side, nice! But on the write side I used synchronize_rcu(), which is rather slow, so write performance is terrible. In effect, readers block the writer for a limited but relatively long period of time. The old write path is sketched below.
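For comparison, this is roughly the shape of a synchronize_rcu()-based write path (a generic sketch, not BIND's actual code); the blocking grace-period wait on every commit is what kills write throughput:

#include <urcu.h>       /* liburcu; header/flavour selection varies by version */

struct node;

static void
free_nodes(struct node *old)    /* hypothetical reclamation helper */
{
        (void)old;
}

struct trie {
        struct node *root;
};

static void
trie_commit(struct trie *t, struct node *new_root, struct node *old_nodes)
{
        rcu_assign_pointer(t->root, new_root);  /* readers start seeing the new version */
        synchronize_rcu();                      /* block until all pre-existing readers finish */
        free_nodes(old_nodes);                  /* nothing can still reference the old nodes */
}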
The new code reclaims memory asynchronously, using a callback instead of blocking like synchronize_rcu(), so it should be able to get much better performance with liburcu. However, it can only be built one way:
QSBR has a fast scalable read side. Reads and writes do not interfere with each other at all, except for contention on the trie’s root pointer, which only occurs on commit.
The aim of the refactored qp-trie is to have one good concurrency mode instead of two bad ones.
I needed to re-do my multithreaded qp-trie benchmark so that it uses isc_qsbr in a reasonably realistic manner. This means that each CPU is running a libuv event loop; libuv callbacks drive both the QSBR phase transitions and the benchmark runs.
The old benchmark used unvarnished pthreads, so the raw numbers from the old and new benchmarks are not directly comparable. I have cooked most of the numbers from the new benchmark to discount time spent in libuv. (The benchmark does print the time taken including libuv, and some of the benchmark series explore how much overhead it adds, but I'm not concerned with that aspect today.)
The old and new benchmarks have the same overall plan:
Each benchmark run exercises a qp-trie for 250 ms
Each run is part of a series where most of the parameters are fixed (e.g. number of threads, size of transactions), and one parameter is varied
There is a collection of series each of which examines some aspect of performance that I am concerned about.
My main aim is to confirm that the code behaves as expected, not so much to get the most flattering absolute numbers.
I ran the benchmarks on my MacBook Pro M1 Pro, without making any effort to reduce background activity, so the numbers are rather noisy.
How does read performance vary with the number of concurrent threads? The number of ops per microsecond per core should be flat.
            rwlock           urcu            qsbr
readers    Kops  ops/us    Kops  ops/us    Kops  ops/us
      1    1530    6.00    1348    5.31    1251    5.32
      2    3003    5.89    3030    6.05    2746    5.87
      3    4626    6.05    4521    5.91    3964    5.66
      4    5568    5.46    5721    5.61    5600    6.03
      5    7045    5.53    7102    5.57    7047    6.10
      6    8582    5.61    7988    5.32    8387    6.09
      7    9628    5.39    9653    5.41    9448    5.89
      8   11190    5.48   10997    5.39   10998    5.97
      9   11364    4.95   11578    5.13   11352    5.68
     10   12840    4.80   12511    4.91   11716    5.31
OK, the urcu and qsbr numbers are very noisy, but generally flat; there's a perceptible downward trend in the rwlock performance. (The lower absolute count of ops for qsbr is due to the libuv loop overhead, which is discounted from the ops/us/core numbers.)
What if we use one of the cores to add some write activity?
            rwlock           urcu            qsbr
readers    Kops  ops/us    Kops  ops/us    Kops  ops/us
      1     627    2.46    1406    5.52    1134    4.83
      2    1253    2.46    2832    5.65    2447    5.22
      3    1666    2.17    4256    5.56    3717    5.32
      4    2002    1.96    5831    5.80    4749    5.36
      5    2313    1.81    6933    5.54    6147    5.43
      6    2688    1.75    8389    5.48    7512    5.50
      7    2703    1.52    9535    5.44    8928    5.59
      8    2643    1.29   11769    5.77    9284    5.26
      9    3153    1.37   11686    5.09   10344    5.28
The rwlock read performance takes a dive, and scales even worse with the number of threads. The urcu and qsbr performance remains flat and is not much slower. (I think it could be better if I get rid of some false sharing?)
But while that reading is going on, what is the write thread doing?
            rwlock           urcu            qsbr
readers     ops  ops/us     ops  ops/us     ops  ops/us
      1   65440    0.03     320    0.00   72208    0.03
      2   54704    0.03     304    0.00   75200    0.04
      3   50384    0.03     304    0.00   77184    0.05
      4   44672    0.03     240    0.00   70400    0.06
      5   45504    0.04     240    0.00   67024    0.06
      6   40944    0.04     192    0.00   65632    0.07
      7   42256    0.05     208    0.00   61056    0.09
      8   40320    0.08     176    0.00   60800    0.16
      9   41424    0.16     144    0.00   42400    0.25
As I said, the urcu write performance is terrible (note that the ops columns are not divided by 1000 in this table!). Pleasingly, the qsbr code gets better write throughput than the rwlock code as well as decent read throughput.
That’s possibly even better than what I wanted, yay!
In the benchmark runs above, I did 32 read operations per transaction (and in the qsbr case, 32 transactions before returning to the libuv event loop). TBH 32 ops per read transaction is unrealistically large. To increase write contention I did a more realistic 4 ops per write transaction (and 4 tx per loop with qsbr).
To get some idea of the read-side overhead, I did a series of runs that keep the constant 32*32 == 1024 ops per loop, but vary the number of ops per transaction.
A lightweight read side should have performance that does not degrade so much towards the top of the table, where there are only a few ops per transaction.
                    rwlock           urcu            qsbr
tx/loop  ops/tx    Kops  ops/us    Kops  ops/us    Kops  ops/us
   1024       1     822    0.36    9828    4.28    8962    4.54
    512       2    1230    0.54   10566    4.60    9645    4.89
    256       4    1938    0.84   10752    4.68    9864    5.03
    128       8    2525    1.10   11595    5.04   10148    5.23
     64      16    2658    1.16   11756    5.11   10198    5.24
     32      32    3176    1.36   12118    5.28   10432    5.34
     16      64    3206    1.40   12170    5.30   10256    5.27
      8     128    3886    1.70   12141    5.29   10345    5.30
      4     256    5065    2.21   12008    5.23   10267    5.29
      2     512    5899    2.57   11555    5.03   10384    5.30
      1    1024    6406    2.84   11885    5.17   10246    5.25
All of these benchmark runs are working on a qp-trie containing a million random string keys. An operation chooses a key uniformly at random. This thrashes the CPU’s caches fairly hard: there’s almost one GByte of leaf objects (mostly consisting of oversized buffers containing the keys) plus 15 MBytes for the trie interior nodes.
It goes quite a lot faster with a smaller trie, and the space used by the trie structure remains reasonably modest. (I think more realistic / less random keys would need more like 20 bytes of interior nodes per item, because they would have a deeper narrower trie, needing more interior branch nodes. Each item in the trie needs at least 12 bytes for its interior leaf node.)
                    rwlock           urcu            qsbr
B/item    items    Kops  ops/us    Kops  ops/us    Kops  ops/us
 64.00       10   37474   14.70   53225   20.90   46567   27.73
 17.08      100   35880   13.88   45519   17.86   39366   22.31
 14.49     1000   31671   12.43   37312   14.64   37173   20.46
 16.01    10000   27270   10.70   29958   11.75   29821   15.80
 14.85   100000   19915    7.81   19228    7.61   18874    9.31
 15.14  1000000   12125    4.79   12686    4.97   11578    5.30
There isn’t much to put these numbers into context.
Previously, when I prototyped an earlier version of my qp-trie in NLnetLabs NSD, I was able to compare it fairly directly with NSD’s existing red-black tree and radix tree: the interface between the data structures and the rest of NSD is straightforward.
In BIND, things are more complicated, because BIND is multi-threaded (NSD uses multiple processes) and makes changes to its working data structures all the time (because it is a resolver, and because its authoritative zones support dynamic updates and automatic DNSSEC signing).
BIND’s existing red-black tree is single-threaded. On top of the rbt
is the rbtdb
, which is an in-memory database of DNS records that
uses the rbt
as an index. The rbtdb
is deeply intertwingled with
the rbt
in order to add multithreading support, and with the
rdataslab
storage for DNS records. So it is difficult (and I have
not tried) to set up a simple benchmark that can compare a qp-trie
with BIND’s rbtdb
.
I hope that the qp-trie will give BIND some cleaner layering. The data structure is general-purpose, and has built-in support for concurrency and versioning. (Versioning allows zone transfers and updates to happen concurrently without interference, and it supports ixfr-from-differences.) Its API enforces a clean separation between its interior nodes and the leaf objects stored in the trie, which doesn't exist in BIND's rbt.
As the qp-trie gets used for real work inside BIND, I expect there will be some more meaningful benchmarks, so we can show there are performance improvements, or at least no regressions.