As before, I made an interactive ellipse workbench to experiment with the problem. I got something working, but I have questions…
What I wanted to do is swing a curve around a corner without cutting it. (I could solve the problem with Bézier curves, but they create other issues.) The purple ellipse in the following picture illustrates what I want to avoid: it snips off the corner of the inner rectangle.
I managed to solve my problem by joining two ellipses so that they meet at the corner, with the constraint that (like the four-point egg) their tangents are 45°.
In the workbench you can drag around the big circles to see how the ellipses join.
The top right and bottom left circles control the horizontal and vertical radii of the purple ellipse. The purple ellipse isn’t directly part of the solution; it’s mostly for comparison.
The workbench draws a load of extra scaffolding lines. The lines joining at the corner of the inner rectangle show that the tangents and normals are 45°.
The lines joining the top right and bottom left circles are my eyeballed guesstimates of the region in which a solution can be found.
Is there a formula?
I tried to work out how to directly calculate the radii of the solution ellipses. I thought I had something plausible based on a parametric ellipse using the tangent as the parameter, but it didn’t work when I tried it out.
In reaction to that failure, I made the workbench to try to get a better intuition of what is going on. Eventually I tried bodging it with a brute-force search for a solution. This worked better than expected.
Why does a simple iteration converge so nicely?
The search isn’t really a search as such:
I was surprised that this does in fact converge, and does so rapidly. I didn’t even have to break out Newton-Raphson or quadtrees! Accidentally logarithmic?
What are the constraints on a solution?
The workbench draws lines with gradient 1/2 radiating from the top right and bottom left corners. The two lines pointing inside the rectangle are empirically a fairly accurate match for the boundary between possible and impossible solutions.
I am surprised these are simple straight lines, not curves.
It’s even possible to solve if you drag the corner outside the rectangle, though the ellipses make interesting cusps if you do that. However my eyeballed guesstimate at the outer boundary lines is not accurate.
Freyja’s construction uses a straight edge and compasses in classical style, so it lacks dimensions. Below is my version, with the numbers needed to draw it on a computer.
At first this construction seems fairly rigid, but some of its choices are more arbitrary than they might appear to be. I have made an interactive four-point egg so you can drag the points around and observe how its shape changes.
In the following, I will measure angles in fractions of a turn 𝜏, clockwise from noon because the egg’s pointy end is upwards.
x: 0 y: -1 radius: 2 from: 𝜏*3/8 to: 𝜏*5/8
x: -1 y: 0 radius: 2+√2 from: +𝜏*2/8 to: +𝜏*3/8
x: +1 y: 0 radius: 2+√2 from: -𝜏*3/8 to: -𝜏*2/8
The egg’s north centre is where this scaffolding circle meets the Y axis at +1+√2.
Imagine two new diagonal scaffolding lines: one through the egg’s west point and north centre; and another through its east point and north centre.
x: -1-√2 y: 0 radius: 2+2√2 from: +𝜏*1/8 to: +𝜏*2/8
x: +1+√2 y: 0 radius: 2+2√2 from: -𝜏*2/8 to: -𝜏*1/8
x: 0 y: 1+√2 radius: √2 from: -𝜏*1/8 to: +𝜏*1/8
The six coloured points marked in the diagram above are the centres of the six arcs that make the egg. These comprise the four points the egg is named after, plus their mirror images.
Moss’s egg is a three-point egg. It has the same top half as this four-point egg, but its bottom half is a simple semi-circle.
]]>I think the instigation was a YouTube food video which led me to try making popcorn at home from scratch with Nico. It was enormous fun! And several weeks later it’s still really entertaining to make (especially when a stray kernel pops after I take the lid off the pan, catapulting a few pieces in random directions!)
Turn on the captions while watching the YouTube vid!
There is a standard measure of popcorn quality called the Metric Weight Volume Test, which gives the volume (ml) of popcorn produced per weight (g) of kernels. Typical figures are around 40 ml/g, which was useful to know when I was working out how much corn to put in a pan when I was not yet familiar enough to eyeball it.
The MWVT is weird:
There are a couple of kinds of popcorn:
Mushroom popcorn is good for heavy coatings such as caramel, which is more faff than I care for. Butterfly popcorn has a better crunch and texture for lighter seasoning.
I did some experiments with different fats, trying to get a good flavour.
Butter is nice but it burns at popcorn temperatures. I tried doing a quick-and-dirty clarified butter, by melting it in the microwave and pouring the fat off the solids, but it was kind of faffy and didn’t help much. Some recipes suggest adding the butter after the corn has been popped, but I thought that would make it too greasy. (I have not actually tried it, tho.)
Eventually I settled on sunflower oil, which has a nice subtle nutty flavour.
Our favourite is based on an idea from another random YouTube video. I combine roughly equal parts by weight of:
The latter has a nice yellow-ish colour and a delicious umami flavour.
I whiz them together in a blender to make a fine powder, which I keep in the cupboard in an old spice jar. A jar of 75g of seasoning is enough for many batches of popcorn.
We have a steel pan with a thick base that’s good for making about 50g (2 litres) of popcorn.
I have also used our larger cast iron casserole for bigger batches. For some reason it works a bit slower than the steel pan, so some of the popcorn can get a bit Cricket St Thomas (near Chard). But the cast iron gets nicely seasoned in the process.
We have glass lids for both pans, which is great, it’s so fun to be able to watch the corn popping as well as listen to it!
Put a generous splash of sunflower oil in the pan with the popcorn kernels. The eyeball quantity is at most one layer of kernels in the bottom of the pan.
Set it on a medium heat with a lid on.
Wait a few minutes to get up to temperature, then wait for the popping to proceed until all is quiet for 5 seconds or so.
Pour the popcorn into a big bowl in thirds or quarters, with a sprinkle of seasoning in between each one. Put a plate on top and give it a good shake to distribute the seasoning evenly.
Enjoy!
]]>The Novelkeys Kailh Big Switch is a working MX-style mechanical keyboard switch, but 4x larger in every dimension.
I realised at the weekend that the Big Switch should fit nicely in a simple Lego enclosure. Because an MX-style switch is usually mounted in a 14x14 mm square plate cutout, at 4x larger the Big Switch would need a 56x56 mm mounting hole. Lego aficionados know that studs are arranged on an 8x8 mm grid; this means the Big Switch hole is exactly 7x7 studs. A plan was hatched and a prototype was made.
Apart from the Lego enclosure (scrounged from a decade or two of accumulated miscellaneous bricks), I needed a Big Switch, a microcontroller, and some way to wire them together.
The easiest way, no solder needed, is to go to The Pi Hut and get:
a Pimoroni Tiny 2040 with pre-soldered headers
Leftpondians can get a similar package from Adafruit:
Adafruit’s jumpers are not as convenient as the Pi Hut’s: you’ll need to chop off the JST connectors and solder to the microcontroller.
You don’t need to get a super small dev board like I did: a Raspberry Pi Pico is 51mm long so it will fit inside a 7x7 Lego stud compartment with room to spare, tho the USB plug will stick out.
My plan was to use my existing collection of jumper wires to hook up the Tiny 2040 to the Big Switch. I decided to solder wires to all eight GPIO pins, in case I want to do something more complicated with it in the future, and because my jumper wires came in a rainbow ribbon.
Unfortunately, soldering wires to the pins of the Big Switch was more difficult than I expected.
The pins are about 4mm wide and 13mm long. One of them is copper-coloured and relatively thin, and I soldered that connection without trouble.
The other is steel-coloured and thicker. The solder would not form a meniscus, it would just ball up and slide off. Eventually I tried roughing up the surface of the pin with a file, and the scratches were enough for the solder to grab hold.
Later on I found an Adafruit Big Switch build guide which uses spade connectors, so that’s what I recommend above.
I slapped together a Big Switch QMK configuration which maps the Big Switch to ENTER.
The info.json file says the Big Switch is wired to GP0. You might need to change this if you are using a different dev board; for instance, GPIO0 is not broken out on the Adafruit QT Py RP2040.
So now I have a comically huge button on my desk that I can whack dramatically when a command needs to be fired off with an extra flourish.
]]>Rough quantities per person:
100g pasta
Spaghetti is traditional but I’ll use any shape.
50g streaky bacon
The traditional ingredient is guanciale; maybe I’ll try that one day for a special occasion. I use 4 rashers of the thin-sliced bacon that we get, which is 60g.
one large egg
Typically about 60g
40g grated parmesan
Again, for a special occasion I might try the traditional pecorino romano. My rule of thumb is that the cheese should weigh about half as much as the egg, but I usually round it up so there’s 100g of mixture.
lots and lots of ground black pepper
Get the kettle on the boil and measure out the pasta.
While waiting for the kettle, shred the bacon into a pan.
I use kitchen scissors. (I ought to get our knives sharpened.)
The pan needs to be big enough to stir everything together at the end.
Get the pasta cooking in another pan.
Don’t salt the water, there’s plenty in the bacon and cheese.
Use relatively little water so that it becomes starchy while cooking. The pasta water will loosen and stabilize the sauce.
Fry the bacon until it has taken on some nice colour.
I bash it about with a wooden spoon to make sure the bits have separated. It will probably be done before the pasta, which is fine. Turn off the heat and let it rest.
When the cooking is under control, break the egg(s) into a bowl, and grate the cheese into the eggs.
I do this on top of the weighing scales.
Grind lots of pepper onto the cheese and egg and mix them all together. It will make a thick sludge.
When the pasta is done to your liking, use a slotted spoon to transfer it to the pan with the bacon.
I find a slotted spoon carries a nice quantity of water with the pasta. Many of the recipes I have seen say that the pasta should be slightly under-done at this point, because it will finish cooking in the sauce, but that doesn’t work for me.
Mix the bacon and pasta and deglaze the pan.
It should be cool enough after this point that the egg will not curdle immediately when you add it.
Add the egg and cheese and mix over a gentle heat.
As you stir, the cheese will melt and the sauce will become smooth and creamy. If it’s too thick, add a tablespoon of pasta water. If it’s too runny, boost the heat to help the egg thicken up.
Dish up and serve.
Best eaten immediately: it’s nicest hot but it cools relatively fast.
To avoid all the difficult parts of the PCB design, I used the Waveshare RP2040-Tiny. This is, basically, a Raspberry Pi Pico that has been split in two. There are some other differences: USB-C instead of micro-USB, a reset button, fewer GPIO pins broken out to the board edge.
The RP2040-Tiny microcontroller board is about the size of an ALT keycap, so it easily fits under the space bar. The USB-C board is smaller, tho I would have preferred it to be wider and shallower.
Apart from the controller, all that is needed is a switch and a diode for each key.
Because this keyboard is experimental, I wanted the switches to be socketed, using Kailh MX hotswap sockets. This makes it possible to change switches without desoldering them (I am trying out different switches for my modifier keys) but it also makes it possible to separate the switch mounting plate from the PCB to fettle the stabilizers or add/remove sound damping foam.
My plan was to get the PCBs manufactured with diodes and sockets already soldered on, but that did not quite work out - about which more below.
A keyboard matrix has some row wires and some column wires; each switch connects a row and a column, via a diode that allows the controller to disambiguate when multiple keys are pressed at the same time. The row and column wires are connected to the microcontroller GPIO pins.
Normally the row and column wires correspond fairly directly to the visible layout of the keyboard, but for some reason I decided to use ai03’s duplex matrix design where one row of keys has two row wires, and two columns of keys use the same column wire.
So the Keybird69 matrix is 8 x 9, using 17 GPIO pins, where a more normal layout would be 16 x 5 using 21 GPIO pins. (The RP2040-Tiny has more than enough GPIOs for a normal matrix, so there was no need for me to use the fancy duplex matrix.)
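For readers who have not met matrix scanning before, here is a minimal sketch of the idea in C. The GPIO helpers and pin arrays are hypothetical stand-ins for the microcontroller SDK, and QMK’s real scanner is more elaborate (debouncing, configurable diode direction, and so on); this is just the core loop.

#include <stdbool.h>

#define NROWS 8   /* sizes for illustration only; see the matrix dimensions above */
#define NCOLS 9

/* hypothetical GPIO helpers and pin maps, standing in for the MCU SDK */
extern void gpio_write(int pin, bool level);
extern bool gpio_read(int pin);
extern const int col_pin[NCOLS], row_pin[NROWS];

void matrix_scan(bool pressed[NROWS][NCOLS]) {
    for (int col = 0; col < NCOLS; col++) {
        gpio_write(col_pin[col], true);   /* energise one column at a time */
        for (int row = 0; row < NROWS; row++) {
            /* a pressed switch connects this column to its row; the diode in
               series stops current sneaking backwards through other pressed
               keys and faking extra keypresses */
            pressed[row][col] = gpio_read(row_pin[row]);
        }
        gpio_write(col_pin[col], false);
    }
}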
There are a few options for keyboard firmware:
I got QMK working on the Waveshare RP2040-Tiny without too much difficulty.
I can configure QMK with just its data-driven JSON configuration, without needing any C – the differences from the GENERIC_RP_RP2040 platform are minimal.
I also got QMK working on my Pimoroni Keybow 2040.
Amusingly (in retrospect) I found it more difficult to get the LEDs working than the keyboard matrix scanner – I expected board bringup to involve blinking an LED first, before moving on to more complicated things. But it turns out that there’s a lot more complexity around keyboard LEDs than scanning a keyboard matrix, both from the hardware and firmware points of view. (See the Keybow 2040 pull request for some of my frustrations.)
To test the firmware, I plugged the bare microcontroller board into my computer and used a wire to connect pairs of GPIO pins. This made it generate the right USB HID events, so I could use it like a really super awkward keyboard.
This picture comes out a lot paler and redder than it appears in the flesh: in reality the PCB has a deeper imperial purple lustre.
I generally followed Ruiqi Mao’s keyboard PCB design guide and ai03’s keyboard PCB design guide.
They are both helpful guides to KiCad’s preferred workflow of drawing up the schematic first, then turning the abstract diagram into a physical PCB layout.
I like being able to reconcile the schematic and the PCB in both directions, so when I found the layout would work better with a different assignment of row and column wires to GPIO pins, KiCad would warn me when things were inconsistent. I felt reassured that it was helping me to avoid mistakes.
I had a bit of fun with the duplex matrix: Each row has a pair of PCB tracks running across the bottom of the board; I left out the ground plane fill between the row tracks, so they appear as much darker stripes on the PCB. (This is probably horrible for EMI?) Each switch’s diode bridges over one of the row tracks, so the diode’s cathode can connect to either of the row tracks without extra faff.
The bottom-right modifier keys are rotated to make extra space for the bus (labelled “to or for by with or from everybody”) that connects the microcontroller to the rows.
I also used KiCad to design the switch mounting plate and the base plate for the case.
I made some custom footprints for the LEGO Technic beams, which were mostly outlines on the “user comments” layer. Along with some scaffolding to fix the geometry of the corners, the beams indicated where the plate outlines and fastening holes needed to be.
This made it easier to get everything into the right place than it was for the PCB. The constraints imposed by using LEGO were helpful.
I had some trouble with the cutouts for the stabilizers that keep the long keys (space, return, etc.) level. The Cherry MX data sheet, ai03’s plate tool, the swillkb plate tool, and measurements of actual stabilizer housings all disagreed with each other. It does not help that the measurements use a ridiculous mixture of inches and millimetres. I chose a cutout size that I thought should fit, using round numbers of millimetres (Cherry is a German company).
I was not sure what to do about kerf, the difference between the tooling path and the edge of the cut. The tolerances specified in the Cherry MX data sheet are much finer than JLCPCB’s manufacturing tolerances, so I simply drew the switch cutouts as 14mm squares and hoped that would be good enough.
I got the boards made by JLCPCB, who are impressively cheap and fast. (Though a keyboard PCB is larger than the 100mm x 100mm limit for their super-cheap tier.)
When you submit an order, the JLCPCB web site shows you a picture of the board so you can do some final verification. I was worried by a bug: it does not render holes that are described in the board edge cuts layer, so my switch mounting plate was drawn without the holes for the switches. I found some forum conversations discussing this issue, saying that their boards were made correctly despite the rendering bug.
There were questions from the JLCPCB engineering staff about whether the mounting holes for the keyswitches were supposed to be plated or not, which caused a few days of delay. Once we were past that, I was impressed by JLCPCB’s live manufacturing progress tracker.
Another thing I found difficult about submitting the order to JLCPCB was understanding what is part of their “Economic PCBA” offer.
One aspect is choice of components: JLCPCB has a huge parts library which has three important levels of classification:
“basic parts” are pre-loaded onto the pick-and-place machines, so you only pay the per-part cost;
“extended parts” have to be manually loaded, so there’s a fee for each different extended part you use;
there are also “standard only” parts which are not available for “Economic PCBA”
For my PCB, the Kailh hotswap sockets are “standard only” which meant it was better for me to obtain them elsewhere and solder them on myself. So the only soldering JLCPCB did for me was the diodes.
For a while I was also worried about the design-for-manufacturing requirements for panelization (how multiple PCBs are assembled as part of a single large sheet), fiducials (optical markers used by pick-and-place machines for precise positioning), and tooling holes (for mounting the board in various machines). Eventually I worked out that for Economic PCBA, JLCPCB will take care of all of these details.
(Their documentation is excellent, but not always clear about which flavour of PCB assembly is relevant for a particular requirement.)
An amusing thing I found when browsing the parts library is the term “brick nogging” which is apparently a mistranslation of an abbreviation for “vertical surface mount”. This came up when I was investigating options for USB-C sockets that occupy less vertical space. One of the options is “mid-mount” where the socket is recessed into the PCB; the JLCPCB parts library gives this the evocative name “sink board”.
When the boards arrived, I started by checking that mechanical assembly would work. Happily the stabilizers and switches fit very nicely into their cutouts in the top plate.
There was a worrying moment when I expected the switches to fit into the PCB easily, but they didn’t. It turned out that the mounting holes are very snug, and the switches need a little force to make them fit. (I thought the insertion force in my HHKBeeb came from the hotswap sockets; turns out it isn’t just the sockets.)
After I got past that, everything fitted together perfectly. I was hugely relieved.
Then I started attaching the microcontroller. After some fumbling attempts with my old plug-in-get-hot soldering iron, I decided I would need something with a finer tip.
I had heard that the Pine64 Pinecil is a reasonably effective and inexpensive soldering iron; bonus points for being open hardware designed to be hackable. Amazingly there is open source firmware for soldering irons called IronOS that works on the Pinecil and a number of other irons. Much credit is due to Miniware for making the TS100 soldering iron open enough to establish this niche. Soldering irons are now my second example (after mechanical keyboards) of how amazing open hardware and open firmware can be.
After soldering on the controller, I checked the switch matrix worked by plugging it into my computer and using tweezers to connect the pads for each hotswap socket. Everything worked!
Then I soldered on the keyswitch sockets. A meditative process. Except that lead-free solder is vexingly difficult to use.
To help my middle-aged eyes I used a 5x magnifying glass, 9cm diameter, mounted with a flexible neck on a desk stand like a lamp. I learned recently that magnifiers have an odd damping effect on the hand-eye coordination feedback loop that makes your hands more steady.
I also used some smaller more powerful lenses for visual inspection. There were a few badly soldered joints, and one where I was surprised that one of the metal connectors had pinged out of its socket without me noticing. Fortunately it was easy enough to find and solder it back where it should be.
After that, another test, and everything still worked! Amazing!
]]>Two years ago I planned to make a typical acrylic sandwich case for HHKBeeb, in the style of the BBC Micro’s black and yellowish beige case. But that never happened because it was too hard to choose a place to get the acrylic cut to my spec.
My idea for using LEGO in a keyboard case was originally inspired by James Munns, who uses LEGO for mounting PCBs, including at least one keyboard.
However, I could not work out how to make a case that is nice and slender and into which the parts would fit. It is possible – the KBDcraft Adam solves the problem nicely, and by all reports it’s pretty good as a keyboard, not just a gimmick.
To make the PCB design easier, I am using a Waveshare RP2040-Tiny. It’s more flexible than the usual dev boards used in custom keyboards because it has a separate daughterboard for the USB socket, but I had the devil of a time working out how to make it fit with LEGO.
Instead of using LEGO for the base, use FR-4, same as the switch mounting plate;
There isn’t enough space for SNOT so I can’t use LEGO studs to attach both the top and bottom of the case; why not use non-LEGO fasteners instead?
That will need through-holes, so maybe LEGO Technic beams will work?
Maybe the fasteners I got for the HHKBeeb case will work?
I wanted the fasteners for the HHKBeeb case to be as flat as possible; but acrylic does not work well with countersunk screws. Instead I looked for fasteners that protrude as little as possible.
For machine screws, I found the magic phrase is “ultra thin super flat wafer head”. These typically protrude 1mm or less, whereas the more common button head or pan head protrude about 2mm or more.
I also discovered rivet nuts. They are designed to be inserted into a sheet metal panel and squashed so that they grip the panel firmly. But I just wanted them for their thin flange, less than 1mm.
The usual fasteners for a sandwich case are machine screws inserted top and bottom, with a standoff in between. But Keybird69 uses machine screws in the top and rivet nuts in the bottom.
I’m using M3 rivet nuts and machine screws. The outer diameter of the rivet nuts is 5mm; the inner diameter of the Technic holes is 4.8mm. Fortunately the beams are made from flexible ABS, so the rivet nuts can be squeezed in and make a firm press fit. They can be pushed out again with a LEGO Brick Separator.
Many dimensions of the keyboard are determined by the Cherry MX keyswitch de-facto standard.
The switch mounting plate must be about 1.5mm thick – the standard PCB thickness of 1.6mm works fine.
The top of the PCB is 5mm below the top of the plate. The bottom of the PCB is also 5mm below the bottom of the plate because they are the same thickness. (Usually.)
The electronics are soldered to the bottom of the PCB.
A LEGO Technic beam is 8mm high (along the length of its holes).
The bodies of the switches and the PCB use 5mm of the beam height, leaving 3mm for the electronics. Plenty of space!
The height of the enclosure is 8 + 1.6 + 1.6 = 11.2 mm, which is pretty slender.
HHKBeeb’s generic case uses 10mm acrylic so it’s 2mm thicker, and the NightFox is about the same.
The Waveshare RP2040-Tiny daughterboard is problematic: its PCB is 1mm thick, and the USB-C receptacle is about 3.5mm high. It also has a couple of buttons for resetting or reflashing the RP2040, and they are a similar height.
I could not find a comfortable way to make space for it by cutting away part of the PCB to give it enough headroom. Then I had another brainwave!
I am not constrained by LEGO’s rectilinear grid, so I could make space by angling the back of the case outwards slightly. The middle of the back of the case has the extra few millimetres needed for the USB daughterboard.
If you look closely at the picture above, behind the USB-C receptacle and the M2 nuts, you can see the whiteish top of one of the buttons, and behind that is the beige textured edge of the PCB.
(Also, I need to turn the beams round so that the injection moulding warts are not visible!)
LEGO studs use an 8mm grid. Keys are on a 3/4 in grid, or 19.05mm.
Keybird69 is 5 keys deep, which is slightly less than 12 LEGO studs.
It is 16 keys wide, which is slightly more than 38 LEGO studs. Three LEGO Technic 13 hole beams are 39 studs long.
The front and sides of Keybird69 are enclosed with 5 beams of 13 holes each, which stick out half a stud past the block of keys. They meet at the corners so that the tangent is 45° where the rounded ends of the beams are closest.
This arrangement leaves about 1mm clearance around the PCB. Spacious.
Technic beams are not as square in cross-section as you might expect. Their height (through the holes) is 8mm, whereas their width (across the holes) is 7.2mm. In Keybird69 I left 0.4mm gap between them – I could have cut that down to 0.2mm without problems.
I used a 10mm radius of curvature for the corners. Apart from where the beams meet, the switch plate and base plate are very nicely flush with the beams.
I tried using a Sharpie permanent marker to blacken the edges of my Keybow 2040, but the ink did not stick very well. On Keybird69 I used an acrylic paint marker pen, which worked much better. Compare the raw fibreglass beige of the edges in the picture above to the black edges below.
One thing that probably isn’t clear from the pictures is that the FR-4 plates have an unexpectedly lovely matte reflective quality. I think it might be because the black solder mask is not completely opaque so the layer of copper underneath gives it a bit of shine.
I am also getting some black 13 hole Technic beams to replace the dark grey ones, gotta make sure the dust shows up clearly on every possible surface!
]]>I found out that the remaining stock of Matteo Spinelli’s NightFox keyboards were being sold off cheap because of a manufacturing defect. I grabbed one to find out what it’s like, because its “True Fox” layout is very similar to the unix69 layout I wanted.
My NightFox turned out to have about three or five unreliable keyswitches, which meant it was just about usable – tho the double Ts and unexpected Js rapidly became annoying.
But it was usable enough that I was able to learn some useful things from it.
The black-on-grey keycaps look cool, but they are basically illegible. (The picture above exaggerates their albedo and contrast.) This is a problem for me, especially while I was switching from the HHKBeeb ECMA-23-ish layout to an ANSI-ish TrueFox-ish unix68 layout.
Fortunately I learned this before making the mistake of buying some fancy black-on-grey keycaps.
I had seen a few keycap sets with extra up arrows, which puzzled me. (For example.) The NightFox came with an extra up arrow, and eventually I twigged that it makes the profile of the arrow cluster a bit nicer.
Usually, in a sculpted keycap profile (where each row of keycaps has a differently angled top surface) the bottom two rows have the same angle, sloping away from the typist. This means the up arrow key slopes away from the other arrows on the row below.
The extra up arrow keys typically match the tab row, which is flat or angled slightly towards the typist. This gives the arrow cluster a more dishy feeling.
Unfortunately the keycaps I ordered do not have extra up arrow keys with tab row angle as an option. I did not realise until after I ordered them that I could have got a similar effect by using a reversed down arrow as an up arrow – it makes a sharper angle, but still feels nicer. So I’m using a reversed arrow key for Keybird69’s up button and my up/down legends both point the same way.
Some keycap sets have multiple page up / page down / home / end keys with different row profiles so that people with 65% and 75% keyboards can rearrange the right column of keys. (For example.)
Instead of the superfluous navigation keys, I used the NightFox novelty keycaps on my keyboard. (You can see the ANY KEY, the cute fox logo, etc. in the picture above.) These all had a top row profile, and at first I thought this was an ugly compromise.
But it turns out that the difference in height between the right column and the main block of keys is really useful for helping to keep my fingers in the right places. It makes me less likely to accidentally warp a window when I intend to delete a character.
The mismatched angle of the up arrow key is similarly helpful. Matt3o added a gap next to the arrow keys in his True Fox design to make the arrow keys easier to locate, but I think that isn’t necessary with an out-of-profile up arrow (which is also one of Matt3o’s favourite features).
I previously thought I wanted a uniform keycap profile (e.g. DSA like the keycaps that came with my Keybow 2040) but these discoveries taught me a sculpted profile is more practical for the keyboard I was making.
Another research purchase was a grab bag of random surplus keycaps, which is about as useless as you might expect: hundreds of keycaps, but too many duplicates to populate a keyboard. (My Keybow 2040 now has a very colourful mixture of miscellaneous keycaps.) The grab bag I got was mostly SA profile, which is tall and steeply angled on the near and far rows. In their normal orientation, SA function keys would probably not work so well on the right column of my keyboard, making a shape like a warehouse roof. Maybe they would work when rotated 90°? Dunno.
One of my old beliefs remained: I still prefer spherical indentations in the tops of my keycaps. They are more cuddly around my fingertips than the more common cylindrical indentations.
Annoyingly, many of the newer sculpted spherical keycap sets are hard to get hold of: often the only places that have them in stock will not ship to the UK at an affordable price. (For example.) Also annoyingly, the cheaper keycap sets almost never have the extras needed for unix69 compatibility. Bah.
The black-on-grey NightFox keycaps are Cherry profile (cylindrical indentations, sculpted rows, very short), and the keycaps that WASD printed for my HHKBeeb are OEM profile (like Cherry profile but taller). The HHKBeeb doesn’t have spherical keycaps because I don’t know anywhere that will do affordable one-off prints other than OEM profile. I also have a set of TEX ADA keycaps (uniform rows, short) which have lovely deeply scooped spherical tops, tho I am not a fan of their Helvetica legends.
So instead of a set of DSA keycaps (DIN height, spherical top, uniform) as I originally planned, I got DSS keycaps (DIN height, spherical top, sculpted). I love the Gorton Modified legends on Signature Plastics keycaps: as a business they descend from Comptec who made most BBC Micro keycaps.
I think Matt3o’s MTNU Susu keycaps are closer to my ideal, but I missed the group buy period and they have not been manufactured yet. And I wish they had an option for icons on the special keys instead of WORDS. I suspect the MTNU profile will become very popular, like Matt3o’s previous MT3 profile, so there will be chances to get some in the future.
]]>Compared to the usual ANSI layout, backquote is displaced from its common position next to 1. But a proper Unix keyboard should cover the entire ASCII repertoire, 94 printing characters on 47 keys, plus space, in the main block of keys.
To make a place for backquote, we can move delete down a row so it is above return, and put backslash and backquote where delete was.
(Aside: the delete key emits the delete character, ASCII 127, and the return key emits the carriage return character, ASCII 13. That is why I don’t call them backspace and enter.)
This produces a layout similar to the main key block of Sun Type 3, Happy Hacking, and True Fox keyboard layouts.
Personally, I prefer compact keyboards so I don’t have to reach too far for the mouse, but I can’t do without arrow keys. So a 65% keyboard size (5 rows, 16 keys wide) is ideal.
If you apply the Unix layout requirements to a typical ANSI 68-key 65% layout, you get a 69-key layout. I call it unix69. (1969 was also the year Unix started.)
http://www.keyboard-layout-editor.com/#/gists/2848ea7a272aa571d140694ff6bbe04c
I have arranged the bottom row modifiers for Emacs: there are left and right meta keys and a right ctrl key for one-handed navigation. Meta is what the USB HID spec calls the “GUI” key; it sometimes has a diamond icon legend. Like the HHKB, and like Unix workstations made by Apple and Sun, the meta keys are either side of the space bar.
There are left and right fn keys for things that don’t have dedicated keys, e.g. fn+arrows for page up/page down, home, end. The rightmost column has user-programmable macro keys, which I use for window management.
http://www.keyboard-layout-editor.com/#/gists/6610c45b1c12f962e6cf564dc66f220b
ANSI 65% keyboards have caps lock where control should be.
They have an ugly oversized backslash and lack a good place for backquote.
The right column is usually wasted on fixed-function keys.
It’s common for 65% keyboards to have 67 or 68 keys, the missing key making a gap between the modifiers and arrow keys on the bottom row. I prefer to have more rather than fewer modifier keys.
http://www.keyboard-layout-editor.com/#/gists/f1742e8e1384449ddbb7635d8c2a91a5
Matteo Spinelli’s Whitefox / Nightfox “True Fox” layout has top 2 rows similar to unix69. It sometimes has backslash and backquote swapped.
Unfortunately it has caps lock where control should be. Its right column is wasted on fixed-function keys (though the keyboards are reprogrammable so it’s mainly a keycap problem).
On the bottom row, True Fox has two modifiers and a gap between space and arrows, whereas unix69 has three modifiers and no gap.
http://www.keyboard-layout-editor.com/#/gists/c654dc6b4c7e30411cad8626302e309f
The Happy Hacking keyboard layout is OK for a 60% Unix layout. However it lacks a left fn key, and lacks space for full-size arrow keys, so I prefer a 65% layout.
https://dotat.at/graphics/keybird69.jpg
Owing to the difficulty of getting keycaps with exactly the legends I would like, the meta keys on my keybird69 are labelled super and the delete key is labelled backspace. I used F1 to F4 keycaps for the macro keys, tho they are programmed to generate F13 to F16 which are set up as Hammerspoon hot keys.
But otherwise keybird69 is a proper unix69 keyboard.
]]>A couple of years ago I made a BBC Micro tribute keyboard in the runup to the beeb’s 40th anniversary. I called it HHKBeeb:
The HHKBeeb is made from:
I planned to make a beeb-style acrylic sandwich case, but it was too hard to choose a place to get the acrylic cut, so that never happened.
In practice I find 60% keyboards (like the Happy Hacking Keyboard) too small – I need an arrow cluster. So I used the HHKBeeb with a Keybow 2040 macro pad to give me arrows and a few function keys for moving windows around.
My new keyboard is for a Finch and it has 69 keys, so it’s called Keybird69. (I was surprised that this feeble pun has not already been used by any of the keyboards known to QMK or VIA!)
It is made from:
A combination of reasons:
I have been mildly obsessed with compact keyboards practically forever, but back in the 1990s there were no good options available to buy, so I made do without.
The first small keyboard I liked was the (now discontinued) HHKB Lite 2, which has an arrow cluster unlike the pure HHKB. I have a couple of these lurking in the Boxes Of Stuff in the corner. But I’m not a huge fan of the limited modifiers, or the Topre HHKB Lite 2 key switches (they’re a bit mushy), or the styling of the HHKB case.
Correction: the HHKB Lite 2 did not actually use Topre switches.
I gradually used Macs more, and switched to using the Apple Aluminium keyboard - the model A1242 compact wired version, and the model A1314 wireless version. I also switched from a Kensington Expert Mouse trackball to an Apple Magic Trackpad.
But then Apple lost the plot with its input devices, so I thought I should plan to wean myself off. And in the mean time, the custom keyboard scene had flourished into a vibrant ecosystem of open source hardware and software.
So instead of relying on someone else to make a keyboard I like, I could make one myself! My own PCB and switch plate, designed for just the layout I want.
And with QMK open source firmware, I can make good use of the fn key that was so disappointingly unconfigurable on the HHKB and Apple keyboards.
I’m planning to write some more notes about various details of the design:
]]>Here’s an addendum about an alternative model of uniformity.
There are 2^62 double precision floats between 0.0 and 1.0, but as I described before under “the problem”, they are not distributed uniformly: the smaller ones are much denser. Because of this, there are two ways to model a uniform distribution using floating point numbers.
Both algorithms in my previous note use a discrete model: the functions return one of 2^52 or 2^53 evenly spaced numbers.
You can also use a continuous model, where you imagine a uniformly random real number with unbounded precision, and return the closest floating point result. This can have better behaviour if you go on to transform the result to model different distributions (normal, poisson, exponential, etc.)
Taylor Campbell explains how to generate uniform random double-precision floating point numbers with source code. Allen Downey has an older description of generating pseudo-random floating-point values.
In practice, the probability of entering the arbitrary-precision loop in Campbell’s code is vanishingly tiny, so with some small adjustments it can be omitted entirely. Marc Reynolds explains how to generate higher density uniform floats this way, and Pete Cawley has terse implementations that use one or two random integers per double. (Reynolds also has a note about adjusting the range and fenceposts of discrete random floating point numbers.)
]]>0.0 <= n < 1.0 using an unbiased random bit generator and IEEE 754 double precision arithmetic. Both of them depend on details of how floating point numbers work, so before getting into the algorithms I’ll review IEEE 754.
The first algorithm uses bit hacking and type punning. The second uses a hexadecimal floating point literal. They are both fun and elegant in their own ways, but these days the second one is better.
I wrote some follow-up notes on more random floating point numbers discussing another model of uniformity, and making better use of floating point’s dynamic range.
A floating point number has three parts: a sign bit, 11 exponent bits (written eeeeeeeeeee below), and 52 fraction bits (written fff...fff below).
It works like a binary version of decimal scientific notation, so its value is
+/- 1.fff...fff * 2ᴱ
Instead of using two’s complement, the exponent is stored as a biased integer,
E = eeeeeeeeeee - 1023
The maximum value of the exponent bits is 2047, and the bias is half that, so 2⁰ is represented as 1023 in the exponent bits.
(A consequence of using biased integers instead of two’s complement is that IEEE 754 floating point numbers sort correctly if you treat them as sign-magnitude integers.)
In decimal scientific notation, a number is normalized by adjusting its exponent so its leftmost non-zero digit (most significant digit) is immediately to the left of the decimal point.
Similarly, a floating point number has its leftmost non-zero bit immediately to the left of its binary point. Because a non-zero bit is always 1, there is no need to store it explicitly. So double-precision numbers actually have 53 significant bits, of which 52 appear in their representation.
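To make the layout concrete, here is a small sketch in C that pulls the three fields out of a double; it uses the same memmove() type-punning trick that appears later in the bithack algorithm, and the function name is mine, not from any library.

#include <stdint.h>
#include <string.h>

void double_parts(double d, int *sign, int *exponent, uint64_t *fraction) {
    uint64_t u;
    memmove(&u, &d, sizeof(u));                  /* reinterpret the bits */
    *sign = (int)(u >> 63);                      /* 1 sign bit */
    *exponent = (int)((u >> 52) & 0x7FF) - 1023; /* 11 biased exponent bits */
    *fraction = u & ((1ULL << 52) - 1);          /* 52 explicit fraction bits */
}

For normal (not subnormal or zero) values, E is the stored exponent field minus the bias of 1023, matching the formula above.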
To get a uniform distribution of random floating point numbers 0.0 <= n < 1.0:
half of the results should be 0.5 <= n < 1.0, with E = -1;
a quarter should be 0.25 <= n < 0.5, with E = -2;
an eighth should be 0.125 <= n < 0.25, with E = -3;
and so on.
To solve this problem we need to take control of integer to floating point conversion.
The algorithm implemented by the FPU to convert a nonzero unsigned integer to double precision floating point is roughly: shift the number so that its most significant set bit lands just to the left of the binary point, adjust the exponent to account for the shift, then round the result to 53 significant bits.
Observe that floating point numbers 1.0 <= n < 2.0 span a range of the same size as the one we want, but all the numbers have the same exponent E = 0. This allows us to avoid all the exponent shenanigans I described in the previous sections.
So the idea is to use bitwise integer operations to put together a number with:
eeeeeeeeeee = 1023 = 0x3FF so that E = 0;
fff...fff filled with unbiased random bits.
However, C’s strict aliasing rules make it difficult to do this kind of reinterpretation. For example, it is often done using a union, but the C standard says that is undefined behaviour. But there is a loophole in the rules.
double pcg64_bithack_double(pcg64_t *rng) {
    double d;
    uint64_t u = pcg64_random(rng);
    /* 52 bits of randomness in the little end */
    u = u >> 12;
    /* set exponent to 0 + bias = 1023 */
    u = u | (1023ULL << 52);
    /* reinterpret as floating point */
    memmove(&d, &u, sizeof(d));
    return(d - 1.0);
}
Modern compilers will inline the memmove() so the function ends up being compiled correctly and efficiently.
In practice, using a union is OK because too much code would break if compilers were as strict as they could be, and a union will be more efficient with older compilers.
C99 added support for hexadecimal floating point literals, which look like
0xHHHH.HHHHp+dddd
Slightly weirdly, while the fraction is written in hex, the exponent (p for power of 2) is written in decimal.
Hex floating literals are useful when it’s important to get exact control over the bit pattern in the number, which is much harder if you have to go via decimal.
The idea is to get a random N-bit integer,
/* 53 bits of randomness in the little end */
uint64_t u = pcg64_random(rng) >> 11;
convert it to a floating point number, 0.0 <= d < 2ᴺ,
double d = (double)u;
then divide it by the maximum range of the integer, 2ᴺ, so the result is between 0.0 and 1.0. Or get the same effect by multiplying by 2⁻ᴺ.
d *= 0x1.0p-53;
Note that this solution returns 53 bits of randomness because it is able to make use of the implicit most-significant bit, whereas the bithack solution only has 52 bits of randomness. The 53rd bit ends up being encoded in the exponent when the integer is converted to floating point.
As well as using a hex floating literal, this solution assumes that double precision multiplication is fast. Which is true, because it has been the subject of important benchmarks for decades.
Putting it together gives us this beautifully lucid one-liner.
double pcg64_random_double(pcg64_t *rng) {
    return((double)(pcg64_random(rng) >> 11) * 0x1.0p-53);
}
]]>math/rand/v2 for Golang’s standard library. It mentioned a new-ish flavour of PCG random number generator which I had not previously encountered, called PCG64 DXSM. This blog post collects what I have learned about it. (I have not found a good summary elsewhere.)
At the end there is source code for PCG64 DXSM that you can freely copy and use.
Here’s the bit where the author writes their life story, and all the readers skip several screens full of text to get to the recipe.
Occasionally I write randomized code that needs a pseudorandom number generator that is:
The various libc random number generators can satisfy one or two of my requirements, if I am lucky.
So I grab a copy of PCG and use that. It’s only like 10 lines of code, and PCG’s creator isn’t an arsehole.
PCG random number generators are constructed from a collection of linear congruential generators, and a collection of output permutations.
A linear congruential random number generator looks like:
state = state * mul + inc;
The multiplier mul is usually fixed; the increment inc can be fixed, but PCG implementations usually allow it to be chosen when the RNG is seeded.
A bare LCG is a bad RNG. PCG turns an LCG into a good RNG by using an LCG with more state bits than output bits, and scrambling the state into the output with a permutation that hides the LCG’s weaknesses.
PCG’s output permutations have abbreviated names like XSH (xor-shift), RR (random rotate), RXS (random xor-shift), XSL (xor shift right [sic]), etc.
The reference implementation of PCG in C++ allows you to mix and match LCGs and output permutations at a variety of integer sizes. There is a bewildering number of options, and the PCG paper explains the desiderata at length. It is all very informative if you are a researcher interested in exploring the parameter space of a family of random number generators.
But it’s all a bit too much when all I want is a replacement for rand() and srand().
For 32-bit random numbers, PCG has a straightforward solution in the form of pcg_basic.c. In C++ PCG this standard 32-bit variant is called pcg_engines::setseq_xsh_rr_64_32 or simply pcg32 for short.
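As a concrete illustration of the LCG-plus-output-permutation structure, here is roughly what the pcg32 generator looks like. This is a paraphrase for illustration, with my own names; use the real pcg_basic.c rather than copying this sketch.

#include <stdint.h>

typedef struct pcg32 { uint64_t state, inc; } pcg32_t;

uint32_t pcg32_random(pcg32_t *rng) {
    /* 64-bit linear congruential generator advances the state */
    uint64_t oldstate = rng->state;
    rng->state = oldstate * 6364136223846793005ULL + rng->inc;
    /* XSH RR output permutation: xor-shift the high bits down, then
       rotate by an amount chosen from the topmost bits of the old state */
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18) ^ oldstate) >> 27);
    uint32_t rot = (uint32_t)(oldstate >> 59);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

The 64 bits of state feeding a 32-bit result is what gives the output permutation room to hide the LCG’s flaws.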
(There is a caveat, tho: pcg32_boundedrand_r() would be faster if it used Daniel Lemire’s nearly-divisionless unbiased rejection sampling algorithm for bounded random numbers.)
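For the curious, here is a sketch of Lemire’s technique. It assumes the pcg32_random() helper sketched above (or any unbiased 32-bit generator, with a nonzero bound), and it is my paraphrase of the published algorithm rather than drop-in pcg code.

uint32_t pcg32_bounded(pcg32_t *rng, uint32_t bound) {
    /* multiply a random 32-bit value by the bound; the high half of the
       64-bit product is an almost-unbiased result in [0, bound) */
    uint64_t m = (uint64_t)pcg32_random(rng) * (uint64_t)bound;
    uint32_t low = (uint32_t)m;
    if (low < bound) {
        /* reject the few low products that would introduce bias; the
           threshold is 2^32 mod bound, computed as (-bound) % bound
           in 32-bit unsigned arithmetic */
        uint32_t threshold = -bound % bound;
        while (low < threshold) {
            m = (uint64_t)pcg32_random(rng) * (uint64_t)bound;
            low = (uint32_t)m;
        }
    }
    return (uint32_t)(m >> 32);
}

The slow % only runs when the first product lands in the small biased region, which is rare when the bound is much smaller than 2³².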
For 64-bit random numbers it is not so simple.
There is no 64-bit equivalent of pcg_basic.c. The reference implementations have a blessed flavour called pcg64, but it isn’t trivial to unpick the source code’s experimental indirections to get a 10 line implementation.
And even if you do that, you won’t get the best 64-bit flavour, which is PCG64 DXSM.
PCG64 DXSM is used by NumPy. It is a relatively new flavour of PCG, which addresses a minor shortcoming of the original pcg64 that arose in the discussion when NumPy originally adopted PCG.
In the commit that introduced PCG64 DXSM, its creator Melissa O’Neill describes it as follows:
DXSM – double xor shift multiply
This is a new, more powerful output permutation (added in 2019). It’s a more comprehensive scrambling than RXS M, but runs faster on 128-bit types. Although primarily intended for use at large sizes, also works at smaller sizes as well.
As well as the DXSM output permutation, pcg64_dxsm() uses a “cheap multiplier”, i.e. a 64-bit value half the width of the state, instead of a 128-bit value the same width as the state. The same multiplier is used for the LCG and the output permutation. The cheap multiplier improves performance: pcg64_dxsm() has fewer full-size 128 bit calculations.
O’Neill wrote a longer description of the design of PCG64 DXSM, and the NumPy documentation discusses how PCG64DXSM improves on the old PCG64.
In C++ PCG, PCG64 DXSM’s full name is pcg_engines::cm_setseq_dxsm_128_64. As far as I can tell it doesn’t have a more friendly alias. (The C++ PCG typedef pcg64 still refers to the previously preferred xsl_rr variant.)
In the Rust rand_pcg crate PCG64 DXSM is called Lcg128CmDxsm64, i.e. a linear congruential generator with 128 bits of state and a cheap multiplier, using the DXSM permutation with 64 bits of output.
That should be enough search keywords and links, I think.
OK, at last, here’s the recipe that you were looking for:
// SPDX-License-Identifier: 0BSD OR MIT-0

#include <stdint.h>

/* see below: clang and gcc provide __uint128_t on 64-bit targets */
typedef __uint128_t uint128_t;

typedef struct pcg64 {
    uint128_t state, inc;
} pcg64_t;

uint64_t pcg64_dxsm(pcg64_t *rng) {
    /* cheap (half-width) multiplier */
    const uint64_t mul = 15750249268501108917ULL;
    /* linear congruential generator */
    uint128_t state = rng->state;
    rng->state = state * mul + rng->inc;
    /* DXSM (double xor shift multiply) permuted output */
    uint64_t hi = (uint64_t)(state >> 64);
    uint64_t lo = (uint64_t)(state | 1);
    hi ^= hi >> 32;
    hi *= mul;
    hi ^= hi >> 48;
    hi *= lo;
    return(hi);
}
The algorithm for seeding PCG takes some raw seed values and conditions them to make a new RNG state that is “not ludicrous” (in the words of Simon Tatham). It is the same for all flavours of PCG:
pcg64_t pcg64_seed(pcg64_t rng) {
    /* must ensure rng.inc is odd */
    rng.inc = (rng.inc << 1) | 1;
    rng.state += rng.inc;
    pcg64_dxsm(&rng);
    return(rng);
}
You can pass pcg64_seed() a structure literal containing the raw seed values, or use it like:
pcg64_t pcg64_getentropy(void) {
    pcg64_t rng;
    if (getentropy(&rng, sizeof(rng)) < 0)
        err(1, "getentropy");
    return (pcg64_seed(rng));
}
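To show how the pieces fit together, here is a hypothetical usage sketch; the includes and main() are mine, not part of the recipe.

#include <inttypes.h>
#include <stdio.h>

int main(void) {
    pcg64_t rng = pcg64_getentropy();
    /* print a few 64-bit random values in hex */
    for (int i = 0; i < 3; i++)
        printf("%016" PRIx64 "\n", pcg64_dxsm(&rng));
    return 0;
}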
The code above works on 64-bit systems with clang and gcc, which have __uint128_t built in.
If you need a C implementation that works on systems without a handy 128-bit integer type, then the PCG64 implementation in NumPy has its own support for 128-bit arithmetic.
Or if you are using C++ then the reference implementation of PCG has another 128-bit arithmetic class.
]]>There were a couple of things that I thought would make a fun talk:
A curious detail I have not found an answer to is why Louis Essen teamed up with William Markowitz to calibrate his clock, not someone more local such as the Astronomer Royal. It’s clear that Markowitz had a good deal of enthusiasm for the project, but I wonder if there were logistical reasons too: at that time the Royal Greenwich Observatory was moving to Herstmonceux.
I did not mention any satellite navigation systems other than GPS. I don’t know much about Galileo’s ground segment (its equivalent to Schriever SFB and the USNO) - I should really read more of Bert Hubert’s writing on the topic!
The layers I was peeling back are all below the complications of time zones; I couldn’t fit them neatly into the narrative. But there’s an interesting parallel (as it were) with the International Meridian Conference, which is the basis of the standard time zones.
Years ago I read the proceedings of the meridian conference, and it struck me that the French were quite opinionated, and grumpy that Greenwich had the lead. The Americans, who were the conveners, preferred Greenwich as the status quo. The British were relatively quiet - perhaps they lobbied off the record, or were just smug?
In the end, Greenwich won largely because it was already the meridian used by the majority of nautical charts. But it’s another example of international science being led by the French and Americans.
I think I was a bit unfair to the Paris Observatory, which had a more important role than my talk suggested.
Before responsibility for international time was split in the 1980s between the IERS and the BIPM, there was the Bureau International de l’Heure, which was responsible for both earth rotation and atomic time. It was based at the Paris Observatory.
Relatedly, I skipped over the awkward history of atomic time between the 1950s and 1970s (and today). I previously wrote about that in my update on leap seconds last year.
Finally, the last slide has the punch line that the Royal Greenwich Observatory did not show up in the story at all. I could have delivered the joke better: I think it needed a bit more context to land effectively.
I think it’s funny to deliver this story as an English person who lives almost on top of the Greenwich meridian; in fact for the last few years of its life, the rump of the RGO was based in Cambridge, and after it finally closed, my spouse Rachel worked in Greenwich House.
]]>I wrote a follow-up note, “Where does ‘where does my computer get the time from?’ come from?” about some things I left out of the talk.
Where does my computer get the time from?
from NTP - here’s a picture of an NTP packet
and here’s a picture of David Mills who invented NTP
simple question, easy answer, end of talk? No!
let’s peel off some layers…
stratum 3 NTP servers get the time from stratum 2 NTP servers,
stratum 2 NTP servers get the time from stratum 1 NTP servers,
stratum 1 NTP servers get the time from some reference clock
maybe a radio signal such as MSF in Britain or DCF77 in Germany
but in most cases the reference clock is probably a GPS receiver
here’s a GPS timing receiver
and here’s a GPS satellite
where does GPS get the time from?
Schriever Space Force Base in Colorado
they look after a lot of different top secret satellites and other stuff at Schriever, as you can see from all the mission logos
so you can’t get close enough to take a nice photo
Where does Schriever SFB get the time from?
the US Naval Observatory Alternate Master Clock is on site at Schriever in Colorado
the US Naval Observatory Alternate Master Clock gets the time from the US Naval Observatory in Washington DC
there are three answers
the first answer is atomic clocks, lots of atomic clocks
in the background there are dozens of rack mounted caesium beam clocks
in the foreground the black boxes house hydrogen masers
these shiny cylinders are rubidium fountains
the USNO has so many atomic clocks they have entire buildings dedicated to them
When I was preparing this talk I noticed on Apple Maps that there’s a huge building site in the middle of the USNO campus. It turns out they are building a fancy new clock house; the main limit on the accuracy of their clocks is environmental stability: temperature, humidity, etc. so the new building will have serious air handling.
the second answer is that UTC is a horrible compromise between time from atomic clocks and time from earth rotation
so the USNO gets the time from the international earth rotation service, which is based at the Paris Observatory
twice a year the IERS sends out Bulletin C, which says whether or not there will be a leap second in six months time; leap seconds are added (or maybe removed) from UTC to keep it in sync with earth rotation
the IERS is spread across several organizations which contribute to its scientific work
for example, you can subscribe to IERS Bulletin A, which is a weekly notice with precise details of the earth orientation parameters
Bulletin A is sent out by the US Naval observatory
they need to know the exact orientation of the earth under the GPS satellites, so they can provide precise positioning
the third answer is, how does the USNO know its atomic clocks are working well?
that information comes from the international bureau of weights and measures in Paris, who maintain the global standard UTC
how does the BIPM determine what UTC is?
the BIPM collects time measurements from national timing laboratories around the world, and uses those measurements to determine official UTC
periodically they send out Circular T which has information about the discrepancies between official UTC and UTC from the various national time labs
the BIPM is responsible for maintaining the international system of units, which is defined by the general conference on weights and measures
the CGPM is an international treaty organization established by the metre convention of 1875
UTC is an implementation of the SI unit of time, based on quantum measurements of caesium atoms
where did this magic number, about 9.2 GHz, come from?
in 1955, Louis Essen (on the right) and Jack Parry (left) built the first caesium atomic clock
the current definition of the second came from the calibration of this clock
before atomic clocks, the definition of the second was based on astronomy, so Essen and Parry needed help from astronomers to find out how fast their clock ticks according to the existing standard of time
they got help from the astronomers at the US Naval Observatory
the way it worked was William Markowitz measured time by looking at the skies, and Louis Essen measured time by looking at his atomic clock, and to correlate their measurements, they both listened to the WWV radio time signal broadcast by the national bureau of standards in Washington DC
this project took 3 years, 1955 - 1958
Markowitz was measuring the “ephemeris second”
in 1952 the international astronomical union changed the definition of time so that instead of being based on the rotation of the earth about its axis, it was based on the orbit of the earth around the sun
in the 1930s they had discovered that the earth’s rotation is not perfectly even: it slows down and speeds up slightly
clocks were now more precise than the rotation of the earth, so the ephemeris second was a new more precise standard of time
the ephemeris second is based on an astronomical ephemeris, which is a mathematical model of the solar system
the standard ephemeris was produced by Simon Newcomb in the late 1800s
he collected a vast amount of historical astronomical data to create his mathematical model
it remained the standard until the mid 1980s
here’s a picture of Simon Newcomb
he is a fine-looking Victorian gentleman
where did he work?
at the US naval observatory!
(and the US nautical almanac office)
I have now run out of layers: before this point, clocks were set more straightforwardly by watching stars cross the sky
so, to summarise my talk, where does my computer get the time from?
it does not get it from the Royal Greenwich Observatory!
About 50 people gathered with several ideas for potential projects: things like easier DNSSEC provisioning, monitoring DNS activity in the network, what is the environmental cost of the DNS, …
At the start of the weekend we were asked to introduce ourselves and say what our goals were. My goal was to do something different from my day job working on BIND. I was successful, tho I did help some others out with advice on a few of BIND’s obscurities.
The team I joined was very successful at producing a working prototype and a cool demo.
The project that was the second most interesting to me was “DNS OOPS”, out-of-protocol signalling. The idea there was to find out things like when a zone has been loaded and is ready to serve, so that it can be added to a BGP route advertisement.
I talked about it with Willem Toorop, and I said OOPS sounded close to what I had done with nsnotifyd, but OOPS would need a little more information added to the NOTIFY messages, for instance if you want to monitor key rollovers. The OOPS team ended up using nsnotifyd as part of their demo, and Lars-Johann Liman reported a usability bug to me, which I fixed.
I joined a team to work on Emile Aben’s idea for a scripting language that could run on RIPE Atlas measurement probes. Instead of collecting voluminous data that later needs to be reduced in a complicated manner, a script on the probe could strip out unwanted data before collection.
I suggested that Starlark might be a good language to use. It’s a simplified dialect of Python designed for sandboxed scripts embedded in applications. It was originally created for the Bazel build system, but it’s a general purpose alternative to Tcl, Lua, Guile, etc.
Annika Hennig suggested WASM as an alternative, which sounded cool to me. But it was too difficult to get a WASM runtime up and running with functions plugged in for DNS resolution. So we switched to the Golang implementation of Starlark.
We got Starlark to invoke dig or the RIPE Atlas tool evdig. I suggested using dig +yaml, but we ended up using jc --dig, which was new to me.
Annika Hennig got us started with Starlark-go, and made an awesome front end based on Codepen that could dynamically deploy scripts to probes. Raffaele Sommese wrote a bunch of Starlark scripts, and got it working with evdig. Jonas Andersson set up a RIPE Atlas probe onto which we could install our prototype software, and sketched out some security considerations. Emile gathered use-cases and wrote the presentation. I hooked more functions up to Starlark and examined the security of the Starlark implementation and our prototype code.

The demo was deploying a Starlark script from the Codepen front-end to a couple of probes; the script monitored changes to the .com SOA serial number across gtld-servers.net.
Starlark turned out to be nice to work with, and Golang is a good hackathon language. I have not used Golang much before so I didn’t write much code; I was much more successful at coming up with ideas and suggestions.
For the context of this project, it looks like Starlark will be more accessible for data scientists than the possible alternatives, as I hoped.
Since I started work at ISC my main project has been to adapt the NSD prototype into a qp-trie for use in BIND. The ultimate aim is to replace BIND's red-black tree database, its in-memory store of DNS records.
Yesterday I merged the core qp-trie implementation into BIND so it’s a good time to write some blog notes about it.
The core of the design is still close to what I sketched in 2021 and implemented in NSD, so these notes are mostly about what’s different, and the mistakes I made along the way…
A qp-trie is a kind of radix tree, derived from Dan Bernstein’s crit-bit tree (one two) and Phil Bagwell’s HAMT (one two).
A qp-trie supports lexically ordered lookups on string keys, such as longest match (e.g., to find which zone can answer a DNS query), predecessor (e.g., to find an NSEC record for a DNSSEC proof of nonexistence), etc.
The version in BIND supports lock-free reads and strictly serialized transactional writes. It has multi-version concurrency with snapshots and rollbacks.
It typically uses about 20 bytes of memory (often less) per object stored in the trie. When keys are normal DNS names, an interior branch in the trie corresponds to one character in the name. A trie containing a million items can sustain millions of lookups per second per core.
NSD was a good place to implement the prototype, because it avoids a lot of the complexity of BIND.
NSD doesn’t use its tree data structures for as many different things as BIND does, so it didn’t need much support for polymorphism;
NSD is authoritative-only, so it doesn’t have to support the kind of high-frequency fine-grain updates that a DNS cache is subjected to;
NSD has a multi-process architecture, so I could handwave the details of multithreading and pretend to myself that it was enough to get the copy-on-write logistics working.
So the goals of the project were to fill in these gaps. Simples!
In BIND, when you create a qp-trie, you give it a few methods which tell the trie how to work with the leaf objects you are going to store in it. (See dns_qpmethods_t.)

attach() and detach() for reference counting:
When an interior leaf node is duplicated for copy-on-write, it may just be a sibling of the node where the modification is happening. The leaf objects hanging off the trie are much bigger than interior nodes, so we don't want to duplicate them needlessly. Hence, we keep track of them using reference counting, like memory management in the rest of BIND.
makekey() to get the qpkey corresponding to the leaf object:
This turned out to be unexpectedly nice. In almost all cases, a qp-trie key will be derived from a DNS name stored somewhere in the leaf object. But instead of just getting the name, makekey() is responsible for turning the name into a key, usually by calling dns_qpkey_fromname(). So there's no memory management difficulty if the method needs to temporarily rehydrate the name, and the key could even be derived from some non-DNS-name string.
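To make that concrete, here's a hedged sketch of a makekey-style method in C. The types and signature are placeholders of my own, not the real dns_qpmethods_t declarations; the point is just that the method fills in a caller-provided key buffer, so there is nothing for the caller to allocate or free.

#include <stddef.h>
#include <string.h>

typedef unsigned char my_qpkey[512];    /* stand-in for dns_qpkey_t */

struct my_leaf {
        char name[256];                 /* the DNS name owned by the leaf object */
        /* ... DNS records, etc ... */
};

static size_t
my_makekey(my_qpkey key, void *uctx, void *pval)
{
        (void)uctx;
        struct my_leaf *leaf = pval;
        /* in BIND this would call dns_qpkey_fromname() on the leaf's
         * name; here we just copy bytes as a placeholder */
        size_t len = strlen(leaf->name);
        memcpy(key, leaf->name, len);
        return len;
}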
For cases where multithreading isn't needed, BIND has a bare dns_qp_t type which saves a bit of space. I'm not sure it's going to get much use, but I have used it in tests to deduplicate randomly generated names.
All multithreaded access is done in the context of a transaction. When you start a modify transaction, you get a pointer to the internal dns_qp_t which you can use just like the single-threaded version. For a read-only transaction, you get a pointer to a specific type depending on the kind of transaction.
All the read-only operations on a qp-trie can take any of these types, because the types share a common prefix. There's a neat GNU C extension to make this kind of polymorphism safer: you can declare a transparent union of the compatible pointer types, and any type in the union can be passed to the functions without casting. Unlike a void * argument, passing other types makes the compiler report an error.
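Here's a minimal, self-contained illustration of the transparent union trick, using made-up handle types rather than BIND's real ones:

#include <stdio.h>

/* The read-only handle types share a common prefix, so a function that
 * only reads the trie can accept any of them. */
typedef struct qp_query    { void *root; /* ...more shared prefix... */ } qp_query_t;
typedef struct qp_snapshot { void *root; /* ...more shared prefix... */ } qp_snapshot_t;

typedef union {
        qp_query_t    *query;
        qp_snapshot_t *snapshot;
} qp_readable_t __attribute__((__transparent_union__));

/* Callers pass a qp_query_t * or qp_snapshot_t * directly, no cast
 * needed; any other pointer type is a compile-time error, unlike a
 * void * parameter. */
static void *
qp_getroot(qp_readable_t handle)
{
        return handle.query->root;      /* same initial fields either way */
}

int
main(void)
{
        qp_query_t q = { .root = "query root" };
        qp_snapshot_t s = { .root = "snapshot root" };
        printf("%s\n%s\n", (char *)qp_getroot(&q), (char *)qp_getroot(&s));
        return 0;
}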
There are 4 kinds of transactions:
query transactions are lightweight and read-only
snapshots are heavyweight and read-only
update transactions are for heavyweight modifications
write transactions are for lightweight modifications
The query transaction is the workhorse. It is lightweight because it is completely lock-free, relying on QSBR.
For years I was under the misapprehension that two versions of each trie would be enough: one stable shared read-only one, and an exclusive mutable one. However, two is not enough for lock-free read-only access and decent write performance, because there can be many modify transactions during a QSBR “grace period”, and each one creates a new version of the trie.
I wanted to avoid having to keep track of a fresh memory allocation for each version. The solution I came up with is to allocate read-only versions as a special type of node, so they can be cleaned up by the existing machinery for interior nodes.
A read-only version consists of a reference to the root of the trie, a pointer to a table of memory chunks, and a pointer to the trie object it belongs to. The root reference is unique per version; it identifies which cell within which chunk contains the root node. The chunk table is typically shared by many versions of the trie, so it has a reference count so we know when it is no longer needed. All we need the trie object for is the method table and user context pointer; it’s also used for some safety checks.
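As a rough sketch (my guess at the shape, based on the description above, not BIND's actual layout), a read-only version might look something like this:

#include <stdint.h>

struct qp_chunktable;   /* reference-counted table of memory chunks, shared between versions */
struct qp_trie;         /* for the method table, user context pointer, and safety checks */

typedef struct qp_version {
        uint32_t              root_ref; /* which cell of which chunk holds the root node */
        struct qp_chunktable *chunks;   /* shared by many versions; refcounted */
        struct qp_trie       *trie;
} qp_version_t;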
The way I integrated QSBR into BIND means that qp-trie queries can only live as long as one uv_loop callback. This is fine for most DNS requests: the main exception is zone transfers, which might need to keep a stable view of one version of a zone for several seconds. For that they need a snapshot.
Snapshots are heavyweight, because to create one we need to briefly get the qp-trie’s mutex in order to copy some of its metadata into a new allocation.
A qp-trie snapshot has its own memory allocation, so it can live as many loops as necessary for a zone transfer to complete. Originally, I simply suppressed memory reclamation while a trie had any snapshots, until Paul Khuong pointed out that this would lead to trouble if (in DNS terms) a zone was subject to a heavy load of outgoing zone transfers (i.e. snapshots exist all the time) and concurrent UPDATEs or incoming zone transfers. To cope with that, snapshots are now aware of exactly which chunks they use, so the qp-trie knows which chunks become free when a snapshot is destroyed.
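The chunk tracking could be as simple as a bitmap per snapshot; this is an illustrative guess at the mechanism, not the actual data structure:

#include <stdbool.h>
#include <stdint.h>

/* One bit per chunk: a chunk is reclaimable only when neither the live
 * trie nor any remaining snapshot has its bit set. */
#define MAX_CHUNKS 1024

typedef struct qp_snapshot_chunks {
        uint8_t used[MAX_CHUNKS / 8];
} qp_snapshot_chunks_t;

static void
snapshot_mark(qp_snapshot_chunks_t *snap, unsigned chunk)
{
        snap->used[chunk / 8] |= (uint8_t)(1u << (chunk % 8));
}

static bool
snapshot_uses(const qp_snapshot_chunks_t *snap, unsigned chunk)
{
        return (snap->used[chunk / 8] >> (chunk % 8)) & 1u;
}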
Update transactions are named after the DNS UPDATE opcode for making dynamic updates to zones. They are heavyweight in a couple of ways.
It is fairly common for authoritative DNS servers to have huge numbers of zones, most of which have only a couple of names (e.g. example.com and www.example.com). So it would be enormously wasteful to allocate for every zone a whole chunk of 1024 nodes and use only 0.5% of them.
So, when an update transaction commits, it compacts the trie to use as little space as possible, and uses realloc() to shrink the last chunk so it is only just as big as necessary.
The other thing is that update transactions can be rolled back. More than once I made the mistake of underestimating how much of the qp-trie allocator metadata needs to be restored in order to roll back correctly, which made me quite irritated with myself. In the qp-trie code that was merged this week, rollback support is both more correct and more straightforward than my previous attempts: when an update transaction is opened, it allocates a fresh copy of the trie’s metadata; when the transaction commits, the rollback copy is discarded; when the transaction is rolled back, the trie metadata is overwritten with its rollback copy.
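The rollback scheme is easy to sketch. The names and the single metadata struct below are hypothetical, and the real metadata is richer, but the open/commit/rollback life cycle is as described above:

#include <stdlib.h>

typedef struct qp_meta {
        size_t chunk_count;
        size_t bump;            /* allocation point in the latest chunk */
        /* ... free lists, usage counters, ... */
} qp_meta_t;

typedef struct qp_update {
        qp_meta_t  live;        /* metadata the transaction mutates */
        qp_meta_t *rollback;    /* copy taken when the transaction opened */
} qp_update_t;

static int
update_open(qp_update_t *tx)
{
        tx->rollback = malloc(sizeof(*tx->rollback));
        if (tx->rollback == NULL)
                return -1;
        *tx->rollback = tx->live;       /* fresh copy of the trie's metadata */
        return 0;
}

static void
update_commit(qp_update_t *tx)
{
        free(tx->rollback);             /* the rollback copy is discarded */
        tx->rollback = NULL;
}

static void
update_rollback(qp_update_t *tx)
{
        tx->live = *tx->rollback;       /* overwrite metadata with its rollback copy */
        free(tx->rollback);
        tx->rollback = NULL;
}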
Write transactions are lightweight because they avoid allocation (so no rollback support) and they do not eagerly compact the trie. They are intended for the DNS cache, which (compared to authoritative zones) has a high rate of small changes.
Allocating interior nodes for a qp-trie is very cheap with a chunk-based memory layout: just increment an index into the latest chunk and check that it has not run out of space. An update transaction has to allocate a fresh allocation chunk each time, because the previous one got shrunk; instead, write transactions re-use the same chunk until it fills up.
Another nice aspect of the chunk-based memory layout is that the metadata is per-chunk, not per node, saving a lot of overhead. One of the bits of metadata is a flag indicating that a chunk contains nodes from previous transactions and is therefore immutable. (Modifications need to use copy-on-write, not overwrite-in-place.) But one bit is not enough when write transactions are re-using chunks!
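A bump allocator over fixed-size chunks is only a few lines. This sketch uses made-up sizes and field names, but shows the cheap allocation path and the per-chunk immutable flag:

#include <stdbool.h>
#include <stdint.h>

#define CHUNK_NODES 1024

typedef struct qp_node { uint64_t word[2]; } qp_node_t;   /* placeholder node layout */

typedef struct qp_chunkmeta {
        uint32_t used;          /* bump pointer: next free cell in the chunk */
        bool     immutable;     /* written by earlier transactions: copy-on-write only */
} qp_chunkmeta_t;

/* Allocate `count` nodes from the latest chunk; returns the cell index,
 * or UINT32_MAX if a new chunk must be started. */
static uint32_t
chunk_alloc(qp_chunkmeta_t *meta, uint32_t count)
{
        if (meta->immutable || meta->used + count > CHUNK_NODES)
                return UINT32_MAX;
        uint32_t cell = meta->used;
        meta->used += count;
        return cell;
}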
One of my more hilariously frustrating bugs was to do with an extra-strict debug-mode qp-trie. Instead of immutability being just a flag, it can be enforced in hardware using mprotect(). This is a very effective way to make my computer point out to me that I have got some of the fine details of write transactions wrong.
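The hardware-enforced version of the immutable flag is a straightforward use of mprotect(), assuming page-aligned chunks; this sketch is illustrative rather than BIND's actual code:

#include <stdlib.h>
#include <sys/mman.h>

/* Once a chunk is sealed, a stray write faults immediately. Chunks must
 * be page-aligned and a whole number of pages; 4096-byte pages assumed. */
#define CHUNK_BYTES (16 * 4096)

static void *
chunk_new(void)
{
        return aligned_alloc(4096, CHUNK_BYTES);
}

static void
chunk_seal(void *chunk)
{
        mprotect(chunk, CHUNK_BYTES, PROT_READ);                /* reads OK, writes fault */
}

static void
chunk_unseal(void *chunk)
{
        mprotect(chunk, CHUNK_BYTES, PROT_READ | PROT_WRITE);   /* before recycling or freeing */
}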
A big disadvantage of cheap sequential memory allocation is that it usually makes it more expensive to reclaim memory. This happens in a couple of ways in BIND’s qp-trie:
compaction, which is necessary to evacuate chunks that are nearly empty, so the trie isn’t holding on to memory that is mostly unused.
scanning chunks before they are free()d, to decrement refcounts.
Both of these can take ages, which I am not very happy about, but both of them are designed not to happen in performance-sensitive situations. I guess we’ll see if they turn out to be slow enough to be painful.
The core qp-trie and QSBR code (in libdns and libisc) consists of:
There is some design and overview documentation for the qp-trie and QSBR:
And in the tests,
I’m reasonably happy with the way the code has turned out. It has taken about twice as long as I hoped to get to this point, for two main reasons:
liburcu, so I designed and wrote a QSBR implementation and redesigned the qp-trie for good performance with safe memory reclamation.

The last few weeks I have started to put this new qp-trie to work as a replacement for BIND's zone table. Every query has to go through the zone table, so it is crucial for good performance, but it is also relatively simple. The main thing I hope to learn from this exercise is how to get all this new machinery to shut itself down gracefully.
I am looking forward to seeing some practical results!
__builtin_longjmp! There's a small aside in this article about the signal mask, which skates past another horrible abyss: one that might even make it sensible to DIY longjmp.
Some of the nastiness can be seen in the POSIX rationale for sigsetjmp, which says that on BSD-like systems, setjmp and _setjmp correspond to sigsetjmp and setjmp on System V Unixes. The effect is that setjmp might or might not involve a system call to adjust the signal mask. The syscall overhead might be OK for exceptional error recovery, such as Chris's arena out of memory example, but it's likely to be more troublesome if you are implementing coroutines.
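The portable way to avoid the ambiguity is to say what you mean with sigsetjmp(), whose savemask argument makes the signal-mask (and syscall) cost explicit; a minimal sketch:

#include <setjmp.h>
#include <stdio.h>

static sigjmp_buf env;

int
main(void)
{
        /* savemask = 0: don't save/restore the signal mask, so there's
         * no syscall overhead; pass 1 if you longjmp out of a signal
         * handler and need the mask restored. */
        if (sigsetjmp(env, 0) != 0) {
                puts("returned via siglongjmp");
                return 0;
        }
        siglongjmp(env, 1);
}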
But why would they need to mess with the signal mask? Well, if you are using BSD-style signals or you are using sigaction correctly, a signal handler will run with its signal masked. If you decide to longjmp out of the handler, you also need to take care to unmask the signal. On BSD-like systems, longjmp does that for you.
The problem is that longjmp out of a signal handler is basically impossible to do correctly. (There's a whole flamewar in the WG14 committee documents on this subject.) So this is another example of libc being optimized for the unusual, broken case at the cost of the typical case.
McKenney goes on to propose an RCU classification system based on the API an implementation provides to its users. (I am curious that the criteria do not involve how RCU works.)
Here’s how I would answer the questions for QSBR in BIND:
Are there explicit RCU read-side markers?
No, it relies on libuv callbacks to bound the lifetime of a read-side critical section.

Are grace periods computed automatically?
Yes. There is an internal isc__qsbr_quiescent_state() function, but that mainly exists to separate the QSBR code from the event loop manager, and for testing purposes, not for use by higher-level code.

Are there synchronous grace-period-wait APIs?
No. (Because they led me astray when designing a data structure to use RCU.)

Are there asynchronous grace-period-wait APIs?
Yes, but instead of one-shot call_rcu(), a subsystem (such as the qp-trie code) registers a permanent callback (isc_qsbr_register()), and notifies the QSBR when there is work for the callback to do (isc_qsbr_activate()). This avoids having to allocate a thunk on every modification, and it automatically coalesces reclamation work. (There's a sketch of this pattern after the list.)

If so, are there callback-wait APIs?
No. At the moment, final cleanup work is tied to event loop teardown.

Are there polled grace-period-wait APIs?
No.

Are there multiple grace-period domains?
One per event loop manager, and there's only one loopmgr.
My qp-trie code supports concurrent readers and at most one writer at a time. Writes are strictly serialized.
The old code reclaimed memory synchronously, immediately after committing a modify transaction. It could be built to use one of two different concurrency control mechanisms:
A reader/writer lock
This has poor read-side scalability, because every thread is hammering on the same shared location. Writers perform reasonably well (BIND’s rwlock prefers writers) but a writer blocks all readers when committing, just long enough to swap the trie’s root pointer.
liburcu, userland read-copy-update
RCU has a fast and scalable read side, nice! But on the write side I used synchronize_rcu(), which is rather slow, so write performance is terrible. In effect, readers block the writer for a limited but relatively long period of time. The old write path is sketched below.
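For comparison, this is roughly the shape of a synchronize_rcu()-based write path (a generic sketch, not BIND's actual code); the blocking grace-period wait on every commit is what kills write throughput:

#include <urcu.h>       /* liburcu; header/flavour selection varies by version */

struct node;

static void
free_nodes(struct node *old)    /* hypothetical reclamation helper */
{
        (void)old;
}

struct trie {
        struct node *root;
};

static void
trie_commit(struct trie *t, struct node *new_root, struct node *old_nodes)
{
        rcu_assign_pointer(t->root, new_root);  /* readers start seeing the new version */
        synchronize_rcu();                      /* block until all pre-existing readers finish */
        free_nodes(old_nodes);                  /* nothing can still reference the old nodes */
}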
The new code reclaims memory asynchronously, using a callback instead of blocking like synchronize_rcu(), so it should be able to get much better performance with liburcu. However, it can only be built one way:
QSBR has a fast scalable read side. Reads and writes do not interfere with each other at all, except for contention on the trie’s root pointer, which only occurs on commit.
The aim of the refactored qp-trie is to have one good concurrency mode instead of two bad ones.
I needed to re-do my multithreaded qp-trie benchmark so that it uses isc_qsbr in a reasonably realistic manner. This means that each CPU is running a libuv event loop; libuv callbacks drive both the QSBR phase transitions and the benchmark runs.
The old benchmark used unvarnished pthreads, so the raw numbers from the old and new benchmarks are not directly comparable. I have cooked most of the numbers from the new benchmark to discount time spent in libuv. (The benchmark does print the time taken including libuv, and some of the benchmark series explore how much overhead it adds, but I'm not concerned with that aspect today.)
The old and new benchmarks have the same overall plan:
Each benchmark run exercises a qp-trie for 250 ms
Each run is part of a series where most of the parameters are fixed (e.g. number of threads, size of transactions), and one parameter is varied
There is a collection of series each of which examines some aspect of performance that I am concerned about.
My main aim is to confirm that the code behaves as expected, not so much to get the most flattering absolute numbers.
I ran the benchmarks on my MacBook Pro M1 Pro, without making any effort to reduce background activity, so the numbers are rather noisy.
How does read performance vary with the number of concurrent threads? The number of ops per microsecond per core should be flat.
            rwlock           urcu            qsbr
readers    Kops  ops/us    Kops  ops/us    Kops  ops/us
      1    1530    6.00    1348    5.31    1251    5.32
      2    3003    5.89    3030    6.05    2746    5.87
      3    4626    6.05    4521    5.91    3964    5.66
      4    5568    5.46    5721    5.61    5600    6.03
      5    7045    5.53    7102    5.57    7047    6.10
      6    8582    5.61    7988    5.32    8387    6.09
      7    9628    5.39    9653    5.41    9448    5.89
      8   11190    5.48   10997    5.39   10998    5.97
      9   11364    4.95   11578    5.13   11352    5.68
     10   12840    4.80   12511    4.91   11716    5.31
OK, the urcu and qsbr numbers are very noisy, but generally flat; there's a perceptible downward trend in the rwlock performance. (The lower absolute count of ops for qsbr is due to the libuv loop overhead, which is discounted from the ops/us/core numbers.)
What if we use one of the cores to add some write activity?
            rwlock           urcu            qsbr
readers    Kops  ops/us    Kops  ops/us    Kops  ops/us
      1     627    2.46    1406    5.52    1134    4.83
      2    1253    2.46    2832    5.65    2447    5.22
      3    1666    2.17    4256    5.56    3717    5.32
      4    2002    1.96    5831    5.80    4749    5.36
      5    2313    1.81    6933    5.54    6147    5.43
      6    2688    1.75    8389    5.48    7512    5.50
      7    2703    1.52    9535    5.44    8928    5.59
      8    2643    1.29   11769    5.77    9284    5.26
      9    3153    1.37   11686    5.09   10344    5.28
The rwlock read performance takes a dive, and scales even worse with the number of threads. The urcu and qsbr performance remains flat and is not much slower. (I think it could be better if I get rid of some false sharing?)
But while that reading is going on, what is the write thread doing?
            rwlock           urcu            qsbr
readers     ops  ops/us     ops  ops/us     ops  ops/us
      1   65440    0.03     320    0.00   72208    0.03
      2   54704    0.03     304    0.00   75200    0.04
      3   50384    0.03     304    0.00   77184    0.05
      4   44672    0.03     240    0.00   70400    0.06
      5   45504    0.04     240    0.00   67024    0.06
      6   40944    0.04     192    0.00   65632    0.07
      7   42256    0.05     208    0.00   61056    0.09
      8   40320    0.08     176    0.00   60800    0.16
      9   41424    0.16     144    0.00   42400    0.25
As I said, the urcu write performance is terrible (note that the ops columns are not divided by 1000 in this table!). Pleasingly, the qsbr code gets better write throughput than the rwlock code as well as decent read throughput.
That’s possibly even better than what I wanted, yay!
In the benchmark runs above, I did 32 read operations per transaction (and in the qsbr case, 32 transactions before returning to the libuv event loop). TBH 32 ops per read transaction is unrealistically large. To increase write contention I did a more realistic 4 ops per write transaction (and 4 tx per loop with qsbr).
To get some idea of the read-side overhead, I did a series of runs that keep the constant 32*32 == 1024 ops per loop, but vary the number of ops per transaction.
A lightweight read side should have performance that does not degrade so much towards the top of the table, where there are only a few ops per transaction.
                    rwlock           urcu            qsbr
tx/loop  ops/tx    Kops  ops/us    Kops  ops/us    Kops  ops/us
   1024       1     822    0.36    9828    4.28    8962    4.54
    512       2    1230    0.54   10566    4.60    9645    4.89
    256       4    1938    0.84   10752    4.68    9864    5.03
    128       8    2525    1.10   11595    5.04   10148    5.23
     64      16    2658    1.16   11756    5.11   10198    5.24
     32      32    3176    1.36   12118    5.28   10432    5.34
     16      64    3206    1.40   12170    5.30   10256    5.27
      8     128    3886    1.70   12141    5.29   10345    5.30
      4     256    5065    2.21   12008    5.23   10267    5.29
      2     512    5899    2.57   11555    5.03   10384    5.30
      1    1024    6406    2.84   11885    5.17   10246    5.25
All of these benchmark runs are working on a qp-trie containing a million random string keys. An operation chooses a key uniformly at random. This thrashes the CPU’s caches fairly hard: there’s almost one GByte of leaf objects (mostly consisting of oversized buffers containing the keys) plus 15 MBytes for the trie interior nodes.
It goes quite a lot faster with a smaller trie, and the space used by the trie structure remains reasonably modest. (I think more realistic / less random keys would need more like 20 bytes of interior nodes per item, because they would have a deeper narrower trie, needing more interior branch nodes. Each item in the trie needs at least 12 bytes for its interior leaf node.)
                    rwlock           urcu            qsbr
B/item    items    Kops  ops/us    Kops  ops/us    Kops  ops/us
 64.00       10   37474   14.70   53225   20.90   46567   27.73
 17.08      100   35880   13.88   45519   17.86   39366   22.31
 14.49     1000   31671   12.43   37312   14.64   37173   20.46
 16.01    10000   27270   10.70   29958   11.75   29821   15.80
 14.85   100000   19915    7.81   19228    7.61   18874    9.31
 15.14  1000000   12125    4.79   12686    4.97   11578    5.30
There isn’t much to put these numbers into context.
Previously, when I prototyped an earlier version of my qp-trie in NLnetLabs NSD, I was able to compare it fairly directly with NSD’s existing red-black tree and radix tree: the interface between the data structures and the rest of NSD is straightforward.
In BIND, things are more complicated, because BIND is multi-threaded (NSD uses multiple processes) and makes changes to its working data structures all the time (because it is a resolver, and because its authoritative zones support dynamic updates and automatic DNSSEC signing).
BIND’s existing red-black tree is single-threaded. On top of the rbt
is the rbtdb
, which is an in-memory database of DNS records that
uses the rbt
as an index. The rbtdb
is deeply intertwingled with
the rbt
in order to add multithreading support, and with the
rdataslab
storage for DNS records. So it is difficult (and I have
not tried) to set up a simple benchmark that can compare a qp-trie
with BIND’s rbtdb
.
I hope that the qp-trie will give BIND some cleaner layering. The data structure is general-purpose, and has built-in support for concurrency and versioning. (Versioning allows zone transfers and updates to happen concurrently without interference, and it supports ixfr-from-differences.) Its API enforces a clean separation between its interior nodes and the leaf objects stored in the trie, which doesn't exist in BIND's rbt.
As the qp-trie gets used for real work inside BIND, I expect there will be some more meaningful benchmarks, so we can show there are performance improvements, or at least no regressions.