## Friday, 27 December 2013

### not extending bitonic sort to non-powers of 2

Update: Turns out I was a bit over-optimistic here - this stuff below doesn't work, it just happened to work for a couple of cases with the data I had. Bummer. Maybe it can be made to work but maybe not, but I didn't have much luck so I gave up for now.

Original post ...

This morning I started having a look at combining the OpenCL code I have into a GA solver. Which mostly involved lots of bug fixing in the classifier code and extending it to process all classifiers and images in a single run.

All went fine till I hit the sort routine for the population fitness when I found out that the bitonic sort I wrote doesn't handle totals which are non-powers of two.

e.g. with 11 elements:

```input : b a 9 8 7 6 5 4 3 2 1
output: 1 2 3 7 8 9 a b 4 5 6
--------------- +++++
```
Each sub-list is sorted, there is just a problem with the largest-stride merge. I noticed that this merge processes indices:
```  0 -> 8
1 -> 9
2 -> 10

data  : 1 2 3 7 8 9 a b 4 5 6
a a a           b b b
```

Which misses out on comparing 3-7, which are already going to be greater than 0-2.

So a simple solution is just to offset the first comparison index so that it compares indices 5-7 to indices 8-10 instead. The calculation is simply:

```  off = max(0, (1 << (b + 1)) - N); // (here, b=3, N=11, poff = 5)
```

Which results in the correct merge indices for the largest offset:

```data  : 1 2 3 7 8 9 a b 4 5 6
a a a b b b
```

So an updated version of the sort becomes:

```  lid = local work item id;
array = base address of array;
for (bit=0; (1<<bit) < N; bit++) {
for (b=bit; b >= 0; b--) {
// insertion masks for bits
int upper = ~0 << (b+1);
int lower = (1 << b)-1;
// extract 'flip' indicator - 1 means use opposite sense
int flip = (lid >> bit) & 1;
// insert a 0 bit at bit position 'b' into the index
int p0 = ((lid << 1) & upper) | (lid & lower);
// insert a 1 bit at the same place
int p1 = p0 | (1 << b);

p0 += max(0, (1 << (b+1)) - N);

swap_if(array, p0, p1, compare(p0, p1) ^ flip);
}
}
```

I haven't extensively tested it but I think it should work. It doesn't!

I'm probably not going to investigate it but I also realised that a serial-cpu version of a bitonic sort wouldn't need to frob around with bit insertions in the inner loop; it could just nest a couple of loops instead. Actually I may well end up investigating it because a similar approach would work on epiphany.