Faster LIAR (Life in a register)

This week I have been discussing the Game of Life with the amazingly enthusiastic Brice Due, after he commented on my original post about LIAR. He put me in touch with Tom Rokicki who, as well as writing probably the best implementation of Life, recently featured in New Scientist for his work on God's solution to Rubik's Cube.

Tom pointed me at a better bit-twiddling implementation of Life which comes in at 26 operations compared to my code's 34 - a significant improvement. However his code is ultra-compressed and really hard to understand. So I wrote a comprehensible version. This code has some rather clever features.

	left = bmp << 1;
	right = bmp >> 1;
	side1 = left ^ right;
	side2 = left & right;
	mid1 = side1 ^ bmp;
	mid2 = side1 & bmp;
	mid2 = side2 | mid2;

In the first stage he uses a normal full adder to do the horizontal 3-to-2 compression which counts the number of cells in each row (bits mid1 and mid2). The first two ops of the full adder form a half adder which he uses to count the cells to either side of the middle cell (bits side1 and side2). Thus he gets four useful outputs from just five ops!

	upper1 = mid1 << 8;
	lower1 = mid1 >> 8;
	upper2 = mid2 << 8;
	lower2 = mid2 >> 8;

The side bits are kept in the middle row, whereas the mid bits are shifted to become the sums for the upper and lower rows.

#define REDUCE(g,f,a,b,c) do {	\
		uint64_t d, e;	\
		d = a ^ b;	\
		e = b ^ c;	\
		f = c ^ d;	\
		g = d | e;	\
	} while(0)

	REDUCE(sum12, sum13, upper1, side1, lower1);
	REDUCE(sum24, sum26, upper2, side2, lower2);

The second stage is to compress the three rows into separate sums of the low bits and the high bits. Tom does not use a normal adder to do this. It turns out that there are 24 adder-like boolean functions of three arguments, which produce two values that together encode the number of input bits. Only one of them requires four operations - the others require five or more. The encoding of the result is a bit strange:

count	result
0	f=0 g=0
1	f=1 g=1
2	f=0 g=1
3	f=1 g=0

As well as saving two ops compared to using a full adder, this function makes the final twiddling particularly brief.

	tmp = sum12 ^ sum13;
	bmp = (bmp | sum13) & (tmp ^ sum24) & (tmp ^ sum26);
	return(bmp & 0x007E7E7E7E7E7E00);

I wondered if it's possible to use the 4-op adder-alike in the first stage too. The problem with adding together all nine cells is overflowing the three bit result. This is OK when using a full adder because the overflow happens to confuse a very large count (therefore a dead result) with a very small count (again a dead result) so it makes no difference. The modified adder confuses a middling small count (live result) with a middling big count (dead result) so it fails. The dual-purpose half+full-adder trick neatly avoids overflow problems.

Despite successfully unpicking this code I still find it rather unintuitive, especially the final twiddle. I am in awe of people who can invent such cunning tricks.