.@ Tony Finch – blog

There's a category of bit-twiddling hacks called "SWAR": SIMD within a register. The essence is to treat a register not as a single number but as a packed array of bit fields. The x86 vector processing extensions (MMX / 3Dnow / SSE) have a SWARish flavour, but you can use the idea with just basic bitwise operations. The ultimate reference for this is the book Hacker's Delight, and there are several useful sites on the web (grep for "hack" in my URL log).

A few years ago I wrote a fairly neat implementation of Conway's Game of Life that did a lot of SWAR. (I described the evolution of the code soon after I finished it.) The key SWAR-related idea is to use four 64-bit registers to represent 64 four-bit numbers: one register has bit 0 times 64, one has bit 1 times 64, one has bit 2 times 64, etc. Then you can use bitwise operations to implement parallel addition in exactly the way that is taught in basic digital electronics. The following macros take 2 or 3 bit vectors of equal value and produce sum and carry result vectors.

The half adder:
	#define ADD2(s0,s1,a0,a1) do {		\
			s1 = (a0) & (a1);	\
			s0 = (a0) ^ (a1);	\
		} while(0)
The full adder:
	#define ADD3(s0,s1,a0,a1,a2) do {	\
			uint64_t c0, c1;	\
			ADD2(s0,c0,a0,a1);	\
			ADD2(s0,c1,s0,a2);	\
			s1 = c0 | c1;		\
		} while(0)

More recently I have been thinking about Bill Gosper's Hashlife algorithm. I'm not going to describe its mind-bending brilliance here, because I want to concentrate on the SWAR tangent that spun off from it. Hashlife's core algorithm is strategically very powerful, but it is not very efficient at low-level tactical bit banging. For example, Golly uses table lookup for calculating small parts of the universe instead of relying on Hashlife all the way down.

My previous SWAR Life code worked on 66x3 cell bitmaps (i.e. a 64 bit register plus its surrounding cells) which arose from the way it raster-scanned its way across the universe. However Hashlife works on square subdivisions of the universe that are a power of two cells on a side. An 8x8 square fits perfectly into a 64 bit register, which is ideal for SWAR. The real beauty (and what makes this Life in a register and not around a register) is that Hashlife deals with the splicing of adjoining squares at its strategic level, so we don't have to look beyond the register file to produce a useful result.

LIAR represents each row of the 8x8 square as an octet in the register. We don't have to worry if the big byte is the top row or the bottom row, or if the big bit of a byte is at the left or right end of the row, because that's handled at the strategic level. By the time the square gets to us symmetry has eliminated worries of endianness. I'll use the big endian convention in the code since it produces more consistent nomenclature.

In order to SWARishly add the eight 1-bit numbers corresponding to a cell's neighbours, they must be shifted to equivalent places in their bit vectors. (In fact we also add the cell itself into the count, since this gives us a neat short-cut.) Start off by getting the left and right cells into the proper place. We need to mask so that rows don't bleed into each other.

	uint64_t liar(uint64_t bmp) {
		uint64_t left  = bmp << 1 & 0xFEFEFEFEFEFEFEFE;
		uint64_t right = bmp >> 1 & 0x7F7F7F7F7F7F7F7F;

Now we can add each cell to its left and right neighbours. (The numeric variable suffix is the value of the bits in the vector.) This does a 3-to-2 horizontal compression of the whole bitmap.

		uint64_t mid1, mid2;
		ADD3(mid1, mid2, left, bmp, right);

By shifting again we can line up the counts of the upper and lower neighbours with each cell. This is where it is helpful to count the cell as well as its neighbours, so all rows are treated the same.

		uint64_t up1   = mid1 << 8;
		uint64_t up2   = mid2 << 8;
		uint64_t down1 = mid1 >> 8;
		uint64_t down2 = mid2 >> 8;

Now we can add these together to compress vertically. We're lazy about propagating the carries in the value-2 bits since it doesn't buy us anything. We end up with bits valued 1, 2, 2, and 4, which makes a maximum total value of 9.

		uint64_t sum1, sum2a, sum2b, sum4;
		ADD3(sum1, sum2a, up1, mid1, down1);
		ADD3(sum2b, sum4, up2, mid2, down2);

The usual way of expressing the Life rule is: if the cell is dead and it has 3 neighbours, it is born, otherwise it stays dead; if a cell is live and it has 2 or 3 neighbours, it survives, otherwise it dies. We need to adjust this because we count the cell too, so if a cell is live and its count is 3 or 4, it survives, otherwise it dies. Notice that when the count is 4 the cell's state is the same in the previous and next generations. Translated into boolean arithmetic, this logic becomes:

		return(bmp & ~sum1 & (~sum2a & ~sum2b  &  sum4 |
				       sum2a &  sum2b  & ~sum4)
			    | sum1 & ( sum2a ^  sum2b) & ~sum4);

This code uses just 39 ALU operations, most of which can run concurrently on a superscalar chip. It needs (I think) a maximum of 8 live values (which is slightly obscured by the coding style) so it shouldn't even need to hit the stack. By contrast, Golly's equivalent code (leafres() near the start of its lifealgo.cpp file) uses 52 ALU ops and 9 random reads of a 64KB table (plus overhead owing to a sub-optimal 8x8 bitmap layout). Hashlife likes to calculate two generations of an 8x8 square at a time, and the second one is easier because you don't need to worry about the edges of the square, so it takes Golly 30 ALU ops and 4 random reads. The two generation total is more than 82 ALU ops and 13 reads for Golly, and just 78 ALU ops for LIAR.

A success, I think :-) With a bit more cunning it might be possible to squeeze LIAR even more...

(After thinking some more...) Actually, propagating carries does simplify it. We don't need to propagate them all the way, because we can treat counts of 8 and 9 identically to 0 and 1. This means we can replace the return statement above with the following. This saves 4 ALU ops, reducing the cost from 39 to 35. I've also worked out that the maximum number of registers needed is six, not eight, and it fits into an in-order triple-issue instruction pipeline so it can run in 15 clock cycles.

		uint64_t sum2, sum4b;
		ADD2(sum2, sum4b, sum2a, sum2b);
		sum4 = sum4 ^ sum4b;
		return(bmp & ~sum1 & ~sum2 &  sum4 |
			      sum1 &  sum2 & ~sum4);

The final code is available on my web site