It turns out the code was broken: first a simple typo, and then a deeper problem: I hadn't noticed that the loop which multiplies out the rows depends on results from earlier iterations. Confusingly, this worked just fine using a CPU driver, since there the work-items within a work-group are actually executed serially.

So I had to come up with another solution.

First I wrote a bit of code that just printed out the calculations actually being performed.

For column 0 this amounted to a no-op.

For column 1, it was some edge cases, then something like:

for i : 0 to n
    col1[i] -= col0[i] * **col1[0]**

For column 2, edge cases plus:

for i : 0 to n
    col2[i] -= col0[i] * **col2[0]** + col1[i] * **col2[1]**

And so on.

As the emboldened values depend on a previous iteration of the same loop (the first iteration in case 1, and the first two in case 2), the rows of each column cannot be calculated independently.

Since the amount of calculation is small, using shared memory to propagate the partial results didn't seem viable, so instead I simply calculate the required previous values redundantly in every thread for each column. I fiddled with the code I had which printed out the calculations and turned it into a code generator that expands all the loops: it's only a 6x6 matrix, so it isn't very much code. For the edge cases I use some logic that zeros out the multiplicands, so the entire algorithm is completely branchless.

For example, column 2 is calculated using:

```c
/* recompute rows 0 and 1 of column 2 redundantly in every thread */
tmp0 = getlu(lu, 2);
tmp1 = getlu(lu, 8);
tmp1 -= getlu(lu, 6) * tmp0;
/* zero the multiplicands for the edge rows: branchless edge cases */
tmp0 = cid >= 1 ? tmp0 : 0;
tmp1 = cid >= 2 ? tmp1 : 0;
v = getlu(lu, cid * 6 + 2);
v -= getlu(lu, cid * 6 + 0) * tmp0;
v -= getlu(lu, cid * 6 + 1) * tmp1;
barrier(CLK_LOCAL_MEM_FENCE);
putlu(lu, cid * 6 + 2, v);
barrier(CLK_LOCAL_MEM_FENCE);
pivotexchange(lu, piv, 2, cid);
barrier(CLK_LOCAL_MEM_FENCE);
```

Where `cid` is the column id (`get_local_id(0) % 6`). I use macros to access the array since I am also flattening it out to avoid (or at least reduce) bank conflicts. The barriers are needed since some of the rows span wave-fronts; in any event they only add 3% or so to the processing time.

I've had to write code more than once just to understand the data pattern of an algorithm: multi-dimensional array access is hard enough to visualise without adding multi-dimensional processing elements as well. In some cases I've managed to remove huge amounts of redundant calculation this way - Matlab seems to encourage code with very poor data flow, for example.

I otherwise stuck to the same thread topology: 6 threads work on each 6x6 matrix together, and 192 threads in the local work group for a total of 32 matrices.

For 4096 solutions of the same 6x6 matrix (i.e. best-case branch divergence), it takes 26µs on the GTX 480. That seems reasonable enough to me. Although I still have a small number of bank conflicts, I think this result is good enough to take an early minute for today.