$dotat: life/liphe.txt,v 1.3 2008/09/04 16:58:46 fanf2 Exp $

Some notes on parallel hashlife.
================================

* High level plan

The general idea is to use lock-free implementation techniques for the
highly-contended data structures, which means that when a thread
cannot make progress it does not block but instead saves the current
computation's state and finds something else to work on. This means we
need no more than one thread per CPU. We do use traditional locks when
passing work between CPUs.

* Nodes

A node can be in one of three states, as follows:

 - no result (node->result == NULL)
 - working on result (node->result == &dummy)
 - result available (node->result has some other value)

The dummy node is a global variable whose contents don't matter.

The "working" state encompasses both active and suspended work. These
are distinguished by a thread's possession of the node or its
suspension record (active) or an un-owned suspension record on a work
queue (fully suspended).

* Suspensions

The calculation for a node proceeds in two stages, either of which can
block waiting for a sub-node's calculation to complete. Blocking is
implemented by saving the state in a suspension record and going to
work on something else.

    struct susp1 {
        // pointer to the suspended node
        node *sq;
        // the nine sub-squares whose results we are waiting for
        node *nw, *nn, *ne,
             *ww, *cc, *ee,
             *sw, *ss, *se;
    };

    struct susp2 {
        // pointer to the suspended node
        node *sq;
        // the four sub-squares whose results we are waiting for
        node *nw, *ne,
             *sw, *se;
    };

The suspension records need to be traced by the garbage collector.

* Getting a result

In normal hashlife, the result function either returns a node's cached
result immediately, or calculates the result and caches it before
returning. That is, its result is always useful to the caller. In this
code there is a third possibility, that the node is being worked on,
in which case the result function returns NULL.
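The three states might be modelled with a single atomic pointer whose
value is inspected without taking ownership. A minimal sketch using
C11 atomics; the node and dummy names follow the notes above, the
state enum and node_state() helper are just illustrative:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node node;
struct node {
    _Atomic(node *) result;     // NULL, &dummy, or the cached result
};

node dummy;     // contents don't matter; only the address is used

enum state { NO_RESULT, WORKING, DONE };

// classify a node's result slot without taking ownership of it
enum state node_state(node *sq) {
    node *r = atomic_load(&sq->result);
    if (r == NULL) return NO_RESULT;
    if (r == &dummy) return WORKING;
    return DONE;
}
```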
A NULL return causes the parent node's calculation to be suspended.

In order to take possession of a node atomically, we use the
compare-and-swap instruction to replace a NULL result with a dummy
result. If this fails (because of a collision with another thread) we
try again.

The following is a small adjustment from normal hashlife; there's a
more radical version further below.

    node *calc(node *sq, int lev) {
        do {
            if(sq->result == &dummy)
                return(NULL);
            if(sq->result != NULL)
                return(sq->result);
        } while(!CAS(sq->result, NULL, &dummy));
        // now we own the node and can work on it
        // XXX see below
        int down = lev - 1;
        // calculate stage one
        node *nw1 = find_nw(sq), *nn1 = find_nn(sq), *ne1 = find_ne(sq),
             *ww1 = find_ww(sq), *cc1 = find_cc(sq), *ee1 = find_ee(sq),
             *sw1 = find_sw(sq), *ss1 = find_ss(sq), *se1 = find_se(sq),
             *nw2 = calc(nw1, down), *nn2 = calc(nn1, down), *ne2 = calc(ne1, down),
             *ww2 = calc(ww1, down), *cc2 = calc(cc1, down), *ee2 = calc(ee1, down),
             *sw2 = calc(sw1, down), *ss2 = calc(ss1, down), *se2 = calc(se1, down);
        // if any of the above collided with another thread,
        // suspend this node and return NULL
        if(!nw2 || !nn2 || !ne2 ||
           !ww2 || !cc2 || !ee2 ||
           !sw2 || !ss2 || !se2)
            return suspend1(sq, lev, nw1, nn1, ne1,
                            ww1, cc1, ee1, sw1, ss1, se1);
        // calculate stage two
        node *nw3 = find(nw2, nn2, ww2, cc2), *nw4 = calc(nw3, down),
             *ne3 = find(nn2, ne2, cc2, ee2), *ne4 = calc(ne3, down),
             *sw3 = find(ww2, cc2, sw2, ss2), *sw4 = calc(sw3, down),
             *se3 = find(cc2, ee2, ss2, se2), *se4 = calc(se3, down);
        // if any of the above collided with another thread,
        // suspend this node and return NULL
        if(!nw4 || !ne4 || !sw4 || !se4)
            return suspend2(sq, lev, nw3, ne3, sw3, se3);
        // set the final result and release ownership of the node
        return sq->result = find(nw4, ne4, sw4, se4);
    }

I'm not sure if we need a write barrier at the end of the function.

* Parallelism

The above doesn't have any way of introducing parallelism because the
calc() function is too eager.
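On the write-barrier question above: in C11 terms the final store does
need release semantics, so that the contents of the result node are
visible to other threads before its pointer is. A sketch of what the
ownership protocol could look like with explicit orderings; the
take_ownership/publish names are mine, and the choice of orderings is
an assumption, not something the notes specify:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct node node;
struct node { _Atomic(node *) result; };
node dummy;     // sentinel marking a node that is being worked on

// try to take ownership: atomically replace NULL with &dummy;
// returns non-zero on success, zero if someone else got there first
int take_ownership(node *sq) {
    node *expected = NULL;
    return atomic_compare_exchange_strong_explicit(
        &sq->result, &expected, &dummy,
        memory_order_acquire, memory_order_acquire);
}

// publish a finished result and release ownership; the release
// ordering makes the result's contents visible before the pointer
void publish(node *sq, node *res) {
    atomic_store_explicit(&sq->result, res, memory_order_release);
}
```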
To split the calculation across CPUs we should make it lazier, by
suspending before doing any recursion. This should probably be done
only for high-level nodes, so that we don't waste too much time on the
overhead of suspending and resuming calculations.

    struct susp0 {
        // just a pointer to the suspended node
        node *back;
    };

    // in the above sketch of calc() at the XXX comment
    // high_lev is some external or class member variable
    if(lev > high_lev)
        return suspend0(sq, lev);

* Suspending and resuming computations

This part is crucial to the efficiency of distributing the work across
CPUs as well as the efficiency of each individual thread. There's lots
of scope for experimentation here.

The most important thing is to avoid contention on the suspension
record queues. Thus there should be per-thread queues for efficiency
(no locking or memory barriers to access them) plus global queues to
distribute work (which are locked with conventional mutexes). If we
get the balance right then the global queues should not be too
highly contended, so a simple locking strategy should be OK. This
means we can avoid complicated lock-free data structures in the
scheduler, which is likely to be complicated enough for other reasons.
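The two-tier queue arrangement could be sketched as follows: the
global queue is a plain mutex-guarded list of blocks, while the
per-thread queue is thread-local and needs no synchronisation at all.
All the names here are illustrative, not from the notes:

```c
#include <pthread.h>
#include <stddef.h>

typedef struct block block;
struct block { block *next; /* suspension records ... */ };

// global queue: shared between threads, guarded by a plain mutex
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static block *global_head;

void global_push(block *blk) {
    pthread_mutex_lock(&global_lock);
    blk->next = global_head;
    global_head = blk;
    pthread_mutex_unlock(&global_lock);
}

block *global_pop(void) {
    pthread_mutex_lock(&global_lock);
    block *blk = global_head;
    if (blk != NULL)
        global_head = blk->next;
    pthread_mutex_unlock(&global_lock);
    return blk;
}

// per-thread queue: thread-local, so no locking or barriers needed
static _Thread_local block *local_head;

void local_push(block *blk) {
    blk->next = local_head;
    local_head = blk;
}

block *local_pop(void) {
    block *blk = local_head;
    if (blk != NULL)
        local_head = blk->next;
    return blk;
}
```

A thread would drain its local queue first and fall back to
global_pop() only when it runs out of work.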
Perhaps a starting point would be:

 - allocate queues in blocks of (some number of) suspension records

 - each level has three queues, of susp0, susp1, and susp2 records

 - when suspendN() fills a block, it puts the block on a global list
   of blocks and obtains an empty block for future suspensions (or
   some other heuristic for moving work to other threads)

 - when resuming we scan a block at a time; if any of the records in
   the block get re-suspended the new record is put in the current
   allocation block (different from the block we are resuming)

    // for N in 0,1,2
    struct suspN_block {
        suspN_block *next;
        suspN rec[SUSPN_BLOCK_SIZE];
    };

When we call the block_resumeN() functions, the current thread owns
the block, therefore it owns all the suspension records, therefore it
owns all the suspended nodes. Therefore no locking is required.

    void block_resume0(susp0_block *blk, int lev) {
        for(int i = 0; i < SUSP0_BLOCK_SIZE; i++)
            resume0(blk->rec[i].sq, lev);
        free_block0(blk);
    }

    void block_resume1(susp1_block *blk, int lev) {
        for(int i = 0; i < SUSP1_BLOCK_SIZE; i++)
            resume1(blk->rec[i].sq, lev,
                blk->rec[i].nw, blk->rec[i].nn, blk->rec[i].ne,
                blk->rec[i].ww, blk->rec[i].cc, blk->rec[i].ee,
                blk->rec[i].sw, blk->rec[i].ss, blk->rec[i].se);
        free_block1(blk);
    }

    void block_resume2(susp2_block *blk, int lev) {
        for(int i = 0; i < SUSP2_BLOCK_SIZE; i++)
            resume2(blk->rec[i].sq, lev,
                blk->rec[i].nw, blk->rec[i].ne,
                blk->rec[i].sw, blk->rec[i].se);
        free_block2(blk);
    }

The resume functions are basically suffixes of the calc() function
above. It probably makes sense to split it up as follows. The function
boundaries are the suspend/resume points. It's a bit like event-driven
networking code.

    node *calc(node *sq, int lev) {
        do {
            if(sq->result == &dummy)
                return(NULL);
            if(sq->result != NULL)
                return(sq->result);
        } while(!CAS(sq->result, NULL, &dummy));
        // now we own the node and can work on it
        // should we be eager or lazy?
        if(lev > high_lev)
            return suspend0(sq, lev);
        else
            return resume0(sq, lev);
    }

    node *resume0(node *sq, int lev) {
        return resume1(sq, lev,
            find_nw(sq), find_nn(sq), find_ne(sq),
            find_ww(sq), find_cc(sq), find_ee(sq),
            find_sw(sq), find_ss(sq), find_se(sq));
    }

    node *resume1(node *sq, int lev,
            node *nw1, node *nn1, node *ne1,
            node *ww1, node *cc1, node *ee1,
            node *sw1, node *ss1, node *se1) {
        int down = lev - 1;
        // calculate stage one
        node *nw2 = calc(nw1, down), *nn2 = calc(nn1, down), *ne2 = calc(ne1, down),
             *ww2 = calc(ww1, down), *cc2 = calc(cc1, down), *ee2 = calc(ee1, down),
             *sw2 = calc(sw1, down), *ss2 = calc(ss1, down), *se2 = calc(se1, down);
        // if any of the above collided with another thread,
        // suspend this node and return NULL
        if(!nw2 || !nn2 || !ne2 ||
           !ww2 || !cc2 || !ee2 ||
           !sw2 || !ss2 || !se2)
            return suspend1(sq, lev, nw1, nn1, ne1,
                            ww1, cc1, ee1, sw1, ss1, se1);
        else
            return resume2(sq, lev,
                find(nw2, nn2, ww2, cc2), find(nn2, ne2, cc2, ee2),
                find(ww2, cc2, sw2, ss2), find(cc2, ee2, ss2, se2));
    }

    node *resume2(node *sq, int lev,
            node *nw1, node *ne1, node *sw1, node *se1) {
        int down = lev - 1;
        // calculate stage two
        node *nw2 = calc(nw1, down), *ne2 = calc(ne1, down),
             *sw2 = calc(sw1, down), *se2 = calc(se1, down);
        // if any of the above collided with another thread,
        // suspend this node and return NULL
        if(!nw2 || !ne2 || !sw2 || !se2)
            return suspend2(sq, lev, nw1, ne1, sw1, se1);
        // set the final result and release ownership of the node
        return sq->result = find(nw2, ne2, sw2, se2);
    }

Note that after scanning a block, we may have newly suspended
computations at the current level and at the level below, plus lower
levels if we are low enough to be eager. This means there can be
partially-filled blocks at multiple levels, which would not normally
be shifted out to the global queues.
This can lead to starvation of other threads if some of the partially
filled blocks are high level, but the current thread has lots of work
to do at low levels, so never fills the high level blocks. This
implies we should be able to shift partial blocks onto the global
work queues.

There's a more general question: in what order should we work on the
blocks, to minimize the chance of re-examining a suspension record
too soon?

* Hash consing

The main hash table will be subject to seriously heavy contention.
Perhaps we could use techniques like:

http://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-639.html
http://www.azulsystems.com/events/javaone_2007/2007_LockFreeHash.pdf

-- eof --