$dotat: life/liphe.txt,v 1.3 2008/09/04 16:58:46 fanf2 Exp $
Some notes on parallel hashlife.
================================
* High level plan
The general idea is to use lock-free implementation techniques for the
highly-contended data structures, which means that when a thread
cannot make progress it does not block but instead saves the current
computation's state and finds something else to work on. This means we
need no more than one thread per CPU. We do use traditional locks when
passing work between CPUs.
* Nodes
A node can be in one of three states, as follows:
- no result (node->result == NULL)
- working on result (node->result == &dummy)
- result available (node->result has some other value)
The dummy node is a global variable whose contents don't matter.
The "working" state encompasses both active and suspended work.
These are distinguished by a thread's possession of the node or its
suspension record (active) or an un-owned suspension record on a work
queue (fully suspended).
* Suspensions
The calculation for a node proceeds in two stages, either of which can
block waiting for a sub-node's calculation to complete. Blocking is
implemented by saving the state in a suspension record and going to
work on something else.
struct susp1 {
// pointer to the suspended node
node *sq;
// the nine sub-squares whose results we are waiting for
node *nw, *nn, *ne,
*ww, *cc, *ee,
*sw, *ss, *se;
};
struct susp2 {
// pointer to the suspended node
node *sq;
// the four sub-squares whose results we are waiting for
node *nw, *ne,
*sw, *se;
};
The suspension records need to be traced by the garbage collector.
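A sketch of what suspendN() might do with one of these records, using a
placeholder malloc-based allocator and an out-parameter standing in for
the per-thread queue (the real records would live in GC-traced blocks,
and the real function would enqueue by level):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct node node;
struct node { node *result; };

struct susp1 {
	// pointer to the suspended node
	node *sq;
	// the nine sub-squares whose results we are waiting for
	node *nw, *nn, *ne,
	     *ww, *cc, *ee,
	     *sw, *ss, *se;
};

/* Placeholder allocator; the real records live in GC-traced blocks. */
static struct susp1 *alloc_susp1(void) {
	return malloc(sizeof(struct susp1));
}

/* Save the partially-completed calculation and return NULL so the
 * caller suspends in turn. Enqueueing is reduced to an out-parameter. */
static node *suspend1(struct susp1 **out, node *sq, int lev,
		      node *nw, node *nn, node *ne,
		      node *ww, node *cc, node *ee,
		      node *sw, node *ss, node *se) {
	struct susp1 *s = alloc_susp1();
	s->sq = sq;
	s->nw = nw; s->nn = nn; s->ne = ne;
	s->ww = ww; s->cc = cc; s->ee = ee;
	s->sw = sw; s->ss = ss; s->se = se;
	(void)lev; /* the real code picks a queue by level */
	*out = s;  /* stand-in for pushing onto the per-thread queue */
	return NULL;
}
```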
* Getting a result
In normal hashlife, the result function either returns a node's cached
result immediately, or calculates the result and caches it before
returning. That is, its result is always useful to the caller.
In this code there is a third possibility that the node is being
worked on, in which case the result function returns NULL. This causes
the parent node's calculation to be suspended.
In order to take possession of a node atomically, we use the
compare-and-swap instruction to replace a NULL result with a dummy
result. If this fails (because of a collision with another thread) we
try again.
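One possible spelling of the CAS() primitive, using C11 atomics; the
node layout here is a simplified assumption and the function name is
illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node node;
struct node {
	node *_Atomic result;
	/* ...quadrant pointers etc. elided... */
};

static node dummy; /* contents never examined; only its address matters */

/* CAS(sq->result, NULL, &dummy): replace *p with new only if it still
 * holds old; returns true on success. On failure another thread got
 * there first and the caller retries or gives up. */
static bool cas_result(node *_Atomic *p, node *old, node *new) {
	return atomic_compare_exchange_strong(p, &old, new);
}
```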
The following is a small adjustment from normal hashlife; there's a
more radical version further below.
node *calc(node *sq, int lev) {
do {
if(sq->result == &dummy)
return(NULL);
if(sq->result != NULL)
return(sq->result);
} while(!CAS(sq->result, NULL, &dummy));
// now we own the node and can work on it
// XXX see below
int down = lev - 1;
// calculate stage one
node
*nw1 = find_nw(sq), *nn1 = find_nn(sq), *ne1 = find_ne(sq),
*ww1 = find_ww(sq), *cc1 = find_cc(sq), *ee1 = find_ee(sq),
*sw1 = find_sw(sq), *ss1 = find_ss(sq), *se1 = find_se(sq),
*nw2 = calc(nw1, down), *nn2 = calc(nn1, down), *ne2 = calc(ne1, down),
*ww2 = calc(ww1, down), *cc2 = calc(cc1, down), *ee2 = calc(ee1, down),
*sw2 = calc(sw1, down), *ss2 = calc(ss1, down), *se2 = calc(se1, down);
// if any of the above collided with another thread,
// suspend this node and return NULL
if(!nw2 || !nn2 || !ne2 ||
!ww2 || !cc2 || !ee2 ||
!sw2 || !ss2 || !se2)
return suspend1(sq, lev, nw1, nn1, ne1,
ww1, cc1, ee1,
sw1, ss1, se1);
// calculate stage two
node
*nw3 = find(nw2, nn2, ww2, cc2), *nw4 = calc(nw3, down),
*ne3 = find(nn2, ne2, cc2, ee2), *ne4 = calc(ne3, down),
*sw3 = find(ww2, cc2, sw2, ss2), *sw4 = calc(sw3, down),
*se3 = find(cc2, ee2, ss2, se2), *se4 = calc(se3, down);
// if any of the above collided with another thread,
// suspend this node and return NULL
if(!nw4 || !ne4 || !sw4 || !se4)
return suspend2(sq, lev, nw3, ne3, sw3, se3);
// set the final result and release ownership of the node
return sq->result = find(nw4, ne4, sw4, se4);
}
I'm not sure if we need a write barrier at the end of the function.
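If result is declared atomic, C11 gives a concrete answer: the write
barrier is a release store when publishing the result, paired with an
acquire load on the reading side, so everything written to the result
node before publication is visible to any thread that observes the
non-NULL pointer. A sketch with a simplified node layout and
illustrative helper names:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node node;
struct node { node *_Atomic result; };

/* Publish a computed result: the release store orders all earlier
 * writes (the result node's contents) before the store that makes
 * them reachable from sq. */
static node *publish_result(node *sq, node *r) {
	atomic_store_explicit(&sq->result, r, memory_order_release);
	return r;
}

/* Observe a result: the acquire load pairs with the release store
 * above, so the node's contents are safe to read afterwards. */
static node *read_result(node *sq) {
	return atomic_load_explicit(&sq->result, memory_order_acquire);
}
```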
* Parallelism
The above doesn't have any way of introducing parallelism because the
calc() function is too eager. To split the calculation across CPUs we
should make it lazier, by suspending before doing any recursion. This
should probably be done only for high-level nodes, so that we don't
waste too much time on the overhead of suspending and resuming
calculations.
struct susp0 {
// just a pointer to the suspended node
node *sq;
};
// in the above sketch of calc() at the XXX comment
// high_lev is some external or class member variable
if(lev > high_lev)
return suspend0(sq, lev);
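A minimal sketch of suspend0(), with the per-thread queue passed
explicitly rather than found through thread-local state, and a
fixed-size array standing in for the real queue structure:

```c
#include <stddef.h>

typedef struct node node;
struct node { node *result; };

struct susp0 {
	// just a pointer to the suspended node
	node *sq;
};

/* Hypothetical per-thread queue; the real code has one per level
 * and spills full blocks to the global queues. */
struct susp0_queue { struct susp0 rec[64]; int n; };

/* Record the node for later resumption and return NULL so the
 * caller suspends in turn. */
static node *suspend0(struct susp0_queue *q, node *sq, int lev) {
	(void)lev; /* the real code indexes a queue per level */
	q->rec[q->n++].sq = sq;
	return NULL;
}
```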
* Suspending and resuming computations
This part is crucial to the efficiency of distributing the work across
CPUs as well as the efficiency of each individual thread. There's lots
of scope for experimentation here.
The most important thing is to avoid contention on the suspension
record queues. Thus there should be per-thread queues for efficiency
(no locking or memory barriers to access them) plus global queues to
distribute work (which are locked with conventional mutexes). If we
get the balance right then the global queues should not be too
highly-contended, so a simple locking strategy should be OK. This
means we can avoid complicated lock-free data structures in the
scheduler, which is likely to be complicated enough for other reasons.
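A sketch of a global queue under a conventional mutex, using pthreads;
the block payload is elided since only the linkage matters here:

```c
#include <pthread.h>
#include <stddef.h>

/* A block of suspension records; the record array is elided. */
struct block { struct block *next; };

static struct block *global_list;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Donate a full block of suspended work to other threads. */
static void global_push(struct block *blk) {
	pthread_mutex_lock(&global_lock);
	blk->next = global_list;
	global_list = blk;
	pthread_mutex_unlock(&global_lock);
}

/* Take a block of work, or NULL if there is none. */
static struct block *global_pop(void) {
	pthread_mutex_lock(&global_lock);
	struct block *blk = global_list;
	if (blk != NULL)
		global_list = blk->next;
	pthread_mutex_unlock(&global_lock);
	return blk;
}
```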
Perhaps a starting point would be:
- allocate queues in blocks of (some number of) suspension records
- each level has three queues of susp0, susp1, and susp2 records
- when suspendN() fills a block, it puts the block on a global list of
blocks and obtains an empty block for future suspensions
(or some other heuristic for moving work to other threads)
- when resuming we scan a block at a time; if any of the records in
the block get re-suspended the new record is put in the current
allocation block (different from the block we are resuming)
// for N in 0,1,2
struct suspN_block {
suspN_block *next;
suspN rec[SUSPN_BLOCK_SIZE];
};
When we call the block_resumeN() functions, the current thread owns
the block, therefore it owns all the suspension records, therefore it
owns all the suspended nodes. Therefore no locking is required.
void block_resume0(susp0_block *blk, int lev) {
for(int i = 0; i < SUSP0_BLOCK_SIZE; i++)
resume0(blk->rec[i].sq, lev);
free_block0(blk);
}
void block_resume1(susp1_block *blk, int lev) {
for(int i = 0; i < SUSP1_BLOCK_SIZE; i++)
resume1(blk->rec[i].sq, lev,
blk->rec[i].nw, blk->rec[i].nn, blk->rec[i].ne,
blk->rec[i].ww, blk->rec[i].cc, blk->rec[i].ee,
blk->rec[i].sw, blk->rec[i].ss, blk->rec[i].se);
free_block1(blk);
}
void block_resume2(susp2_block *blk, int lev) {
for(int i = 0; i < SUSP2_BLOCK_SIZE; i++)
resume2(blk->rec[i].sq, lev,
blk->rec[i].nw, blk->rec[i].ne,
blk->rec[i].sw, blk->rec[i].se);
free_block2(blk);
}
The resume functions are basically suffixes of the calc() function
above. It probably makes sense to split it up as follows. The function
boundaries are the suspend/resume points. It's a bit like event-driven
networking code.
node *calc(node *sq, int lev) {
do {
if(sq->result == &dummy)
return(NULL);
if(sq->result != NULL)
return(sq->result);
} while(!CAS(sq->result, NULL, &dummy));
// now we own the node and can work on it
// should we be eager or lazy?
if(lev > high_lev)
return suspend0(sq, lev);
else
return resume0(sq, lev);
}
node *resume0(node *sq, int lev) {
return resume1(sq, lev,
find_nw(sq), find_nn(sq), find_ne(sq),
find_ww(sq), find_cc(sq), find_ee(sq),
find_sw(sq), find_ss(sq), find_se(sq));
}
node *resume1(node *sq, int lev, node *nw1, node *nn1, node *ne1,
node *ww1, node *cc1, node *ee1,
node *sw1, node *ss1, node *se1) {
int down = lev - 1;
// calculate stage one
node
*nw2 = calc(nw1, down), *nn2 = calc(nn1, down), *ne2 = calc(ne1, down),
*ww2 = calc(ww1, down), *cc2 = calc(cc1, down), *ee2 = calc(ee1, down),
*sw2 = calc(sw1, down), *ss2 = calc(ss1, down), *se2 = calc(se1, down);
// if any of the above collided with another thread,
// suspend this node and return NULL
if(!nw2 || !nn2 || !ne2 ||
!ww2 || !cc2 || !ee2 ||
!sw2 || !ss2 || !se2)
return suspend1(sq, lev, nw1, nn1, ne1,
ww1, cc1, ee1,
sw1, ss1, se1);
else
return resume2(sq, lev,
find(nw2, nn2, ww2, cc2),
find(nn2, ne2, cc2, ee2),
find(ww2, cc2, sw2, ss2),
find(cc2, ee2, ss2, se2));
}
node *resume2(node *sq, int lev, node *nw1, node *ne1,
node *sw1, node *se1) {
int down = lev - 1;
// calculate stage two
node
*nw2 = calc(nw1, down), *ne2 = calc(ne1, down),
*sw2 = calc(sw1, down), *se2 = calc(se1, down);
// if any of the above collided with another thread,
// suspend this node and return NULL
if(!nw2 || !ne2 || !sw2 || !se2)
return suspend2(sq, lev, nw1, ne1, sw1, se1);
// set the final result and release ownership of the node
return sq->result = find(nw2, ne2, sw2, se2);
}
Note that after scanning a block, we may have newly suspended
computations at the current level and at the level below, plus lower
levels if we are low enough to be eager. This means there can be
partially-filled blocks at multiple levels, which would not normally
be shifted out to the global queues. This can lead to starvation of
other threads if some of the partially filled blocks are high level,
but the current thread has lots of work to do at low levels, so never
fills the high level blocks. This implies we should be able to shift
partial blocks onto the global work queues.
There's a more general question: what order should we work on the
blocks to minimize the chance of re-examining a suspension record too
soon?
* Hash consing
The main hash table will be subject to seriously heavy contention.
Perhaps we could use techniques like these:
http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-639.html
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf
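For example, the core insert-if-absent step of a lock-free
open-addressing table can be done with one CAS per slot. This is a
simplified sketch only: a fixed-size table with no resizing, a
placeholder hash function, and a single integer key standing in for
the four quadrant pointers:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 1024 /* power of two; a real table would resize */

typedef struct node node;
struct node { uintptr_t key; };

static node *_Atomic table[TABLE_SIZE];

/* Placeholder hash; the real one would mix the quadrant pointers. */
static size_t hash(uintptr_t key) { return key * 2654435761u; }

/* Insert sq if no node with an equal key is present; returns the
 * node that ends up in the table (sq, or an earlier equal node). */
static node *intern(node *sq) {
	size_t i = hash(sq->key);
	for (;;) {
		i &= TABLE_SIZE - 1;
		node *cur = atomic_load(&table[i]);
		if (cur == NULL) {
			node *expect = NULL;
			if (atomic_compare_exchange_strong(&table[i],
							   &expect, sq))
				return sq;
			cur = expect; /* lost the race; check the winner */
		}
		if (cur->key == sq->key)
			return cur;
		i++; /* slot holds a different key: linear probing */
	}
}
```

The interesting property is that a collision between two threads
interning equal nodes is resolved by the CAS itself: the loser sees the
winner's node in the slot and returns it, which is exactly the hash
consing guarantee.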
-- eof --