It turns out the code was broken, first a simple typo, and then a deeper problem: I hadn't noticed that the loop which multiplies out the rows depends on earlier results. Confusingly of course this worked just fine using a CPU driver since the work items within a work-group are actually executed in serial.
So I had to come up with another solution.
First I wrote a bit of code that just printed out the calculations actually being performed.
For column 0 this amounted to a no-op.
For column 1, it was some edge cases, then something like:
for i : 0 to n
col 1 [i] -= col 0 [i] * = col 1 [ 0 ]
For column 2, edge cases plus:
for i : 0 to n
col 2 [ i ] -= col 0 [ i ] * col 2 [ 0 ] + col 1 [ i ] * col 2 [ 1 ]
And so on.
As the emboldened calculations depend on a previous iteration of the same loop (for n=1 in case 1, and n=2 in case 2), each column result cannot be calculated independently.
Since the amount of calculation is small, using the shared memory to propagate the partial results didn't seem viable, so instead I simply calculate the required previous values manually for each column for all threads. I fiddled with the code I had which printed out the calculations and turned it into a code generator to expand all loops: it's only a 6x6 matrix so it isn't very much code. For the edge cases I use some logic that zeros out the multiplicands so the entire algorithm is completely branchless.
For example, column 2 is calculated using:
tmp0 = getlu(lu, 2);
tmp1 = getlu(lu, 8);
tmp1 -= getlu(lu, 6) * tmp0;
tmp0 = cid >= 1 ? tmp0 : 0;
tmp1 = cid >= 2 ? tmp1 : 0;
v = getlu(lu, cid * 6 + 2);
v -= getlu(lu, cid * 6 + 0) * tmp0;
v -= getlu(lu, cid * 6 + 1) * tmp1;
putlu(lu, cid * 6 + 2, v);
pivotexchange(lu, piv, 2, cid);
cidis the column id (get_local_id(0) % 6). I use macros to access the array since I am also flattening it out to avoid (or at least reduce) bank conflicts. The barriers are needed since some of the rows span wave-fronts: but in any event only add 3% or so to the processing time.
I've had to write code more than once to understand the data pattern of an algorithm: multi-dimensional array access is bad enough to visualise without adding multi-dimensional processing elements as well. In some cases I've managed to remove huge amounts of redundant calculations this way - matlab seems to encourage code that has very poor data flow for example.
I otherwise stuck to the same thread topology: 6 threads work on each 6x6 matrix together, and 192 threads in the local work group for a total of 32 matrices.
For 4096 solutions of the same 6x6 matrix (i.e. best-case branch divergence), it takes 26uS on the GTX 480. That seems reasonable enough to me. Although I still have a small number of bank conflicts I think this result is good enough for an early minute for today.