What was a fairly small routine with a simple loop wanted to use 63 registers/work item, no matter how I tried to unroll the loop using #pragma unroll, re-arrange the work to use vectors or not, and so on.
Looking at the intermediate code, it had about a thousand(!) redundant register moves to/from other registers. For the small problem I had, it was taking about 100uS, which probably wouldn't have bothered me apart from the weird compiler output.
// local group size = 16, 16, 1
kernel void
somefunc(..., constant float *f0b, ...) {
    local float *localdata[];
    ... load local data ...
    for (int i=0;i<9;i++) {
        float a0 = localdata[i*2];
        float a1 = localdata[i*2+1];
        ...
        v0 += f0a[i*2] * a0 + f1a[i*2] * a1;
        v1 += f0b[i*2] * b0 + f1b[i*2] * b1;
        v2 += f0a[i*2+1] * a0 + f1a[i*2] * a1;
        v3 += f0b[i*2+1] * a0 + f1b[i*2] * b1;
    }
}
So I removed the loop entirely by hand, using C macros to implement each step.
Result: 73uS & 21 registers.
And the intermediate code was much smaller and more compact.
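The macro approach looks roughly like the plain C sketch below. This is only an illustration, not the actual kernel: the names STEP, accumulate_loop and accumulate_unrolled, the two accumulators and the argument layout are all made up for the example. It shows the shape of the transformation only and says nothing about what any particular compiler does with registers.

```c
/* Hypothetical plain-C sketch of unrolling a 9-step loop by hand with a macro.
   None of these names come from the original kernel. */
#define TAPS 9

/* Reference version: the ordinary loop. */
static void accumulate_loop(const float *data, const float *f0, const float *f1,
                            float *v0, float *v1)
{
    float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < TAPS; i++) {
        float a0 = data[i * 2];
        float a1 = data[i * 2 + 1];
        s0 += f0[i * 2] * a0 + f1[i * 2] * a1;
        s1 += f0[i * 2 + 1] * a0 + f1[i * 2 + 1] * a1;
    }
    *v0 = s0;
    *v1 = s1;
}

/* One loop iteration as a macro. It is always invoked with a literal,
   so every array index is a compile-time constant. */
#define STEP(i) do { \
        float a0 = data[(i) * 2]; \
        float a1 = data[(i) * 2 + 1]; \
        s0 += f0[(i) * 2] * a0 + f1[(i) * 2] * a1; \
        s1 += f0[(i) * 2 + 1] * a0 + f1[(i) * 2 + 1] * a1; \
    } while (0)

/* Hand-unrolled version: no loop counter, just nine expanded steps. */
static void accumulate_unrolled(const float *data, const float *f0, const float *f1,
                                float *v0, float *v1)
{
    float s0 = 0.0f, s1 = 0.0f;
    STEP(0); STEP(1); STEP(2);
    STEP(3); STEP(4); STEP(5);
    STEP(6); STEP(7); STEP(8);
    *v0 = s0;
    *v1 = s1;
}
```

Both functions perform the same operations in the same order, so they should give the same results; the only difference is that the unrolled one hands the compiler straight-line code with constant indices.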
NVidia's compiler seems to do a pretty crappy job with vectors in any event: the vector version was even worse, half the speed of the scalar version at around 200uS. It's not normally this extreme, but it seems it's almost always faster not to use vector code. The compiler would also (only sometimes!) hang for 20 seconds or more whilst compiling this file, and these changes fixed that too.
 
 
2 comments:
But isn't this expected on NV HW? The CUs are scalar, not super scalar like AMD's CUs. All device.getPreferredVectorWidth() methods return 1 for me (old GTX295 GPU).
Yeah, but for the routines I have, the use of vectors amounts to little more than fancy names for simple values.
There's no reason it couldn't compile to reasonable scalar code: a trivial hand re-arrangement of exactly the same code just compiles to faster code, i.e. exactly the same algorithm on the same hardware.
Half the time I only use vectors to save some typing; I didn't expect it to have a measurable cost.