Tuesday 31 December 2013

CL+GA ML

Got the OpenCL version of the genetic algorithm based classifier working this morning. Actually I think it already worked, I just had a problem with the way I was testing the output. Because I don't have a better sort just yet I forced the image set to be a power of two. I left the breeding code in Java for now so I don't have to deal with random number generation.

With a population of 512 individuals, 4096 total training images (each 32x16), and using 4x4x4 classifiers it executes a little over 100 generates per second. I was hoping for a bit better but I guess a 30x speedup over single-thread cpu is good enough for now - certainly getting to a solution in 10 minutes is better than 5 hours (just a bit).

The largest single processing element is classifying all the images. However; I am re-executing all classifiers, even those which haven't changed. So I can pretty much double the performance with my current generational settings which keeps half each iteration. The only stage which needs to run over all the whole population is the fitness ranking.

Interestingly on one of the first test runs I managed to create a perfect classifier for the training set after about 40K generations (~8 minutes). Apart from data setup i'm not sure how to tweak this to be more general - as there isn't really 'training' involved, just searching for a good solution to fit the data - i can't really cross-verify with a verify set without it just becoming the training set. A multi-stage classifier could use a verification set for adjusting the thresholds though. The quality of the training set is the most critical factor.

I also tried the three different fitness tests again (approximate area under roc, number of true-positives-before-any-false-positive and number of true-negatives before any true-positivies) and it seems the area under the ROC is by far the best measure. I guess it does work after all, and I think I was just looking at bugs before.

I also upped the mutation rate to 5 bits per off-spring. It gives a bit more 'jiggle' as the improvement rate levels off. It still requires multiple runs because each solution is different, but it seems to help it stave off stagnation for a bit longer. I thought I had a problem with breeding making too many clones of the best solution after a while; but after putting in some checking code I was only ending up with 0-10 clones in any given generation which isn't enough to care about.

I guess I will keep working toward some sort of object detector to see if everything holds up in practice, but i'd like to experiment a bit too - finding time is always the problem.

Monday 30 December 2013

multi-stage

Cause I was stuck a bit with the OpenCL conversion I tried manually creating a multi-stage classifier to verify it would work.

Given a set of training data I trained a classifier, and then used that to partition the training data - very good true positives or negatives were removed. I then repeated this 3 more times. The final negative data set consists of just 52 images that are very eye-like in the LBP domain: bits of mouth and under the nose. I started with about 4000 negative images and 1000 positive images.

For a fitness score I was maximising the number of false positives with a score better than the worst true positive. This created an ROC curve that rises relatively slowly to start with but keeps rising to a full-pass rate at a fairly low FPR (false positive rate). This isn't normally the ROC curve one seeks because it tends to include plenty of false positives but since i'm seeking to partition the data for a separate stage i'm after the lowest threshold that achieves a TPR of 1 and rejects as much as possible, not the highest TPR vs FPR.

Anyway - it was quite successful, and by 4 stages it had essentially created a perfect classifier to partition that data-set. Each stage tests 4 groups of 4x4 pixels so it only requires a fairly modest amount of work and a very small amount of memory to describe the entire classifier (9*4*4*4 = 576 bytes total). Being a cascade it only processes all stages a small fraction of the time as well.

Of course this is a little too specific to be useful and the training data-sets also need some tweaking - pose and minor scale variation for the positive set and more problematic images for the negative set. So more setup work and experimentation is still required but i'm fairly sure it'll still work on a more general problem.

I got a bit messy with the code and added some big bugs which caused no end of frustration. One I think had been there from the start in certain code-paths - I was showing the LBP image in a window and as part of that it was converting the code to 8 bits not only for display but also for the stored codes. Lets just say this breaks the algorithm although not nearly as much as I would have expected. As part of the OpenCL version I was ordering the scores differently - based on integers and using bitfields to keep track of other info, and I inadvertently added back the bug which interferes with the sort order. Both of these combined to causing the reported success rate to keep climing even though the classifier was just getting worse - what was more confusing is that over time it tended to increase in quality and then start to decline - but only after about 1/2 an hour of processing.

Friday 27 December 2013

not extending bitonic sort to non-powers of 2

Update: Turns out I was a bit over-optimistic here - this stuff below doesn't work, it just happened to work for a couple of cases with the data I had. Bummer. Maybe it can be made to work but maybe not, but I didn't have much luck so I gave up for now.

Original post ...

This morning I started having a look at combining the OpenCL code I have into a GA solver. Which mostly involved lots of bug fixing in the classifier code and extending it to process all classifiers and images in a single run.

All went fine till I hit the sort routine for the population fitness when I found out that the bitonic sort I wrote doesn't handle totals which are non-powers of two.

e.g. with 11 elements:

input : b a 9 8 7 6 5 4 3 2 1
output: 1 2 3 7 8 9 a b 4 5 6 
        --------------- +++++
Each sub-list is sorted, there is just a problem with the largest-stride merge. I noticed that this merge processes indices:
  0 -> 8
  1 -> 9
  2 -> 10

data  : 1 2 3 7 8 9 a b 4 5 6 
        a a a           b b b

Which misses out on comparing 3-7, which are already going to be greater than 0-2.

So a simple solution is just to offset the first comparison index so that it compares indices 5-7 to indices 8-10 instead. The calculation is simply:

  off = max(0, (1 << (b + 1)) - N); // (here, b=3, N=11, poff = 5)

Which results in the correct merge indices for the largest offset:

data  : 1 2 3 7 8 9 a b 4 5 6 
                  a a a b b b

So an updated version of the sort becomes:

  lid = local work item id;
  array = base address of array;
  for (bit=0; (1<<bit) < N; bit++) {
    for (b=bit; b >= 0; b--) {
      // insertion masks for bits
      int upper = ~0 << (b+1);
      int lower = (1 << b)-1;
      // extract 'flip' indicator - 1 means use opposite sense
      int flip = (lid >> bit) & 1;
      // insert a 0 bit at bit position 'b' into the index
      int p0 = ((lid << 1) & upper) | (lid & lower);
      // insert a 1 bit at the same place
      int p1 = p0 | (1 << b);

      p0 += max(0, (1 << (b+1)) - N);

      swap_if(array, p0, p1, compare(p0, p1) ^ flip);
    }
  }

I haven't extensively tested it but I think it should work. It doesn't!

I'm probably not going to investigate it but I also realised that a serial-cpu version of a bitonic sort wouldn't need to frob around with bit insertions in the inner loop; it could just nest a couple of loops instead. Actually I may well end up investigating it because a similar approach would work on epiphany.

Thursday 26 December 2013

parallel prefix sum 2

This morning I had a quick look at the next stage of the OpenCLification of the GA code. This requires trying to create a fitness measure from the raw score values.

I'm using a sort of approximation of a sum of the area under the ROC curve:

  for each sorted result
    sum += true positive rate so far - false positive rate so far
  sum /= result.length
This gives a value between -0.5 and +0.5, equating to perfectly incorrect ROC and perfectly correct ROC curve respectively. Like the ROC curve itself this calculation is independent of the actual score values and only their order and class is important.

At first I didn't think I would be able to parallise it but on further inspection I realised I could implement it using a parallel prefix sum followed by a parallel sum reduction. The first prefix sum calculates the running totals of the occurance of each class (positive training examples, negative training examples). These values are then used to calculate the two rates above and from those partial summands which are then parallel summed. Because I don't need to keep the intermediate results I process this in batches of work-group sized blocks sequentially across each result array and it is a simple matter to accumulate results across blocks internally.

Then I realised I'd forgotten how to implement a parallel prefix sum properly ... it's been nearly 2 years ... as I found out when using a search engine and finding my own rather useless post on the matter comes up on the first page of results. I had a look at the code generator in socles and distilled it's essence out.

Mostly for my own future reference, the simplest version boils down to this:

  lx = get_local_id(0);
  buffer[lx] = sum;
  barrier(local);
  for (int i=1; i < N; i <&kt; 1) {
     if (lx - i >= 0)
       sum += buffer[lx - i];
     barrier(local);
     buffer[lx] = sum;
     barrier(local);
  }

This of course assumes that the problem is only as wide as the work-group or can be processed in work-group sized chunks; which happens often enough to be useful. There is a very simple trick to remove the internal branch but it's probably not very important on current GPU designs. There are also some fairly simple modifications to widen this to some integer multiple of the work-group size and reduce the processing by the same factor but I don't think I need that optimisation here.

I can also use basic parallel reduction techniques to calculate the lowest-positive-rank and highest-positive-rank values, should I want to experiment with them as fitness values.

This is now most of the work required in evaluating each individual in the population which is the main processor intensitve part of the problem. If I had an APU i'd just drop back to the CPU to do the breeding but since I don't i'll look at doing the breeding on the GPU too, so long as it doesn't get too involved.

Wednesday 25 December 2013

parallel batch sorting

Had what was supposed to be a 'quick look' at converting the GA code to OpenCL from the last post. Got stuck at the first hurdle - sorting the classifier scores.

I wasn't going to be suckered into working on a 'good' solution, just one that worked. But the code I had just didn't seem fast enough compared to single-threaded Java Arrays.sort() (only 16x) so I ended up digging deeper.

For starters I was using Batcher's sort, which although it is parallel it turns out it doesn't utilise the stream units very well in all passes and particuarly badly in the naive way I had extended it to sorting larger numbers of items than the number of work items in the work-group. I think I just used Batchers because I got it to work at some point, and IIRC took it from Knuth.

So I looked into bitonic sort - unfortunately the "pseudocode" on the wikipedia page is pretty much useless because it has some recursive python - which is about as useful for demonstrating massive concurrency as a used bit of toilet paper. The pictures helped though.

So I started from first principles and wrote down the comparison pairs as binary and worked out how to reasonably cheaply create compare-and-swap index pairs from sequential work item id's.

For an 8-element sort, the compare-and-swap pairs are:

 pass,step 0,1
  pair  binary   flip comparison
   0-1  000 001  0
   2-3  010 011  1
   4-5  100 101  0
   6-7  110 111  1

 pass 1,2
  pair  binary   flip comparison
   0-2  000 010  0
   1-3  001 011  0
   4-6  100 110  1
   5-7  101 111  1
 (pass 1,1 is same as pass 0,1)

 pass 2,4
  pair  binary   flip comparison
   0-4  000 100  0
   1-5  001 101  0
   2-6  010 110  0
   3-7  011 111  0
 (pass 2,2 is same as pass 1,2)
 (pass 1,1 is same as pass 0,1)
One notices that bit 'pass' is all 0 for one element of comparison, and all 1 for the other one. The other bits simply increment naturally and are just the work item index with a bit inserted. The flip rule is always the first bit after the inserted bit.

In short:

  lid = local work item id;
  array = base address of array;
  for (bit=0; (1<<bit) < N; bit++) {
    for (b=bit; b >= 0; b--) {
      // insertion masks for bits
      int upper = ~0 << (b+1);
      int lower = (1 << b)-1;
      // extract 'flip' indicator - 1 means use opposite sense
      int flip = (lid >> bit) & 1;
      // insert a 0 bit at bit position 'b' into the index
      int p0 = ((lid << 1) & upper) | (lid & lower);
      // insert a 1 bit at the same place
      int p1 = p0 | (1 << b);

      swap_if(array, p0, p1, compare(p0, p1) ^ flip);
    }
  }

The nice thing about this is that all accesses are sequential within the given block, and the sort sizes decrement within each sub-group. Which leads to the next little optimisation ...

Basically once the block size (1<<b) is the same size as the local work group instead of going wide the work topology changes to running in blocks of work-size across. This allows one to finish all the lower stages entirerly via LDS memory. Although this approach is also much more 'cache friendly' than full passes at each offset, LDS just has a much higher bandwidth than cache. Each block is acessed through the LDS via a workgroup-wide coalesced read/write and an added bonus is that for the larger sized blocks all memory accesses are fully sequential across the whole work-group - i.e. are also fully coalesced. It might even be a reasonable merge sort algorithm for a serial processor: although I think the address arithmetic will be too expensive in that case unless the given cpu has an insert-bit instruction that makes room at the same time.

On closer inspection there may be another minor optimisation for the first LDS based stage - it could perform the previous stage in two parts whilst still retaining full-width reads using a register.

I'm testing by sorting 200 arrays, each with 8192 integers (which is what i need for the task at hand). I'm using a single workgroup per array. Radeon HD 7970 original version.

  algorithm         time

  batchers          8.1ms
  bitonic           7.6ms
  bitonic + lds     2.7ms

  java sort        80.0ms  (1 thread, initialised with random values)

Nice!

I think? Sorting isn't something i've looked into very deeply on a GPU so the numbers might be completely mediocre. In any event it's interesting that the bitonic sort didn't really make much difference until I utilised the LDS because although it requires fewer comparisions it still needs the same number of passes over the data. Apart from that I couldn't see a way to utilise LDS effectively with the Batchers' sort algorithm I was using whereas with bitonic it just falls out automagically.

Got a roast lamb in the oven and cracked a bottle of red, so i'm off outside to enjoy the perfect xmas weather - it's sunny, 30-something, with a very light breeze. I had enough of family this year so i'm going it alone - i'll catch up with friends another day and family in a couple of weeks.

Merry X-mas!

Update: See also a follow-up where I explore extension to non-powers of two elements.

Tuesday 24 December 2013

genetic algorithm experiments

Last couple of days I was playing with some genetic algorithm code to implement an object detector. This is the first time i've ever played with genetic algorithms so is simply exploration to become aquainted with the technique.

First I started with a local binary pattern histogram feature test based on the paper Genetic Based LBP Feature Extraction and Selection for Facial Recognition, but these were just too slow to evaluate to work out whether I was making any progress or not. It wasn't quite the problem I wanted to solve but I did have the paper and am familiar with the algorithm itself.

So I changed to using a variation of my own 'single-bit-histogram' algorithm which is much faster to evaluate both because the mechanics of it's implementation and the mathematics of the fitness determination. Because it generates a classifier and not a distance measure evaluation of fitness is O(N) rather than O(N^2). Initially my results seemed too good - and they were, a simple sort-order problem meant that a completely zero detector came out as perfect. After fixing that I did still have some promising results but my feature tests were just too small so the discriminative ability wasn't very high.

When I left that for the day I didn't think it was really working that well because the feature tests weren't descriptive enough on their own - only a few single probes. But I was looking at the output a bit wrong - it doesn't need to be 'good', it just needs to have a low false-negative rate. The next morning I tried creating a more complex descriptor based on a 4x4 feature test - and this worked much better.

The GA

After reading a few light tutorials on the subject I basically came up with my own sexual and asexual (i.e. copy with mutate) breeding rules and mucked about with the heurstics a bit. Throwing out years of research no doubt but without some understanding of how it works I wont understand it properly. So they are probably not ideal but they seem to work here.

The basic asexual breeding rule is:

  1. Randomly grow or shrink the number of genes by 1 (within a specific range);
  2. Copy genes, potentially mutating each bit separately;
  3. If the chromosome grew, add a new random gene.
And sexual breeding:
  1. Concatenate all genes of both parents;
  2. Randomly permute the whole set;
  3. Choose a new size which is half the combined gene length +/- 1 (within a specific range);
  4. Truncate to the new size;
  5. Randomly mutate every bit.
Genes are treated 'whole', and it obviously supports different numbers of genes for each individual chromsome. I found that a small number of tests seems to work better than a larger number; at least for the single-pixel classifier. As such i'm focusing at present on fixing the gene count in advance and running multiple tests to see what difference the gene count makes.

I tried a few different fitness tests:

  • Order by area under ROC curve (or for simplicity, some approximation thereof);
  • Order by rank of first incorrect result;
  • Order by rank of last correct result.

These all result in slightly different classifiers - although the first tends to be the best. Combining a couple by simple concatenation also appears to create a better classifier.

Each generation i discard the worst-N results, and then randomly sexually or asexually reproduce random pairs or individuals to replace them. For simplicity i'm just using uniform randomness rather than basing reproduction rates on fitness. I presume at worst this just reduces the rate of advancement.

I need to read up a bit more on the nomclature and jargon used to describe the behaviour properly but with very small number of genes I tend to end up with poor genetic diversity at least in the top few results (I'm dumping the top 5 every 100 generations). I think my random mutation isn't random enough during breeding - it should always mutate at least some bits rather than applying a low randomisation rate across all of them.

Still, even with these shortcomings and the fact that after a rapid improvement progress slows - it doesn't stall completely. I ran some very long experiments and even after 10 hours I was still getting an occasional improvement.

The Classifier

I need to write up the classifier a bit better than a blog post so I wont go into the detail here. I think it's a novel use of local binary patterns.

But the current 4x4 'gene' consists of:

struct gene4x4 {
  unsigned int x:5;
  unsigned int y:5;
  unsigned int pad:22;
  unsigned int tests[8];
};

(x,y) is the location of the test, and the 16x16-bit integers stored in the tests array are the parameters. For the descriptive power it seems to have, this quite a compact descriptor. With a chromosome of as little as 2 of these genes i'm starting to get some interesting results, and at 4 genes I think i'm starting to get usable results. That's only 1024 bits; this is something that can easily fit on an epiphany core even with a good few stages.

Data

As with any machine learning algorithm dataset size and quality is another issue, but for now i'm just using what I have which is stuff i've extracted from Color FERET. The training set is the left (from front, i.e. right) eye at 32x16, and the negative training set is random samples from the same faces excluding those very close to the eye. I'm just using the same set to test against at the moment.

I think though that because of the similarity of the eyes and the fact that you can easily tell them apart using simple geometry I will look at combining left and right eye detectors into a simple 'eye detector'. This saves the learning algorithm the hassles of trying to distinguish between left and right eyes as well as non-eye data. If I really needed to distinguish between the two I could train another detector for sub-classification - to be honest, I can't really see why you would need to.

Next

One thing I want to do is translate the algorithm to OpenCL so I can accelerate the generation rate. I already have the classifier written and that can score images insanely fast - so fast that I think I need to write the whole algorithm in OpenCL now, which wasn't my original intention. Being able to run a few hours worth of testing in minutes really accelerates experimentation with the heurstics and other gene variations so it's worth the day or so of effort it needs to get working.

Then I want to investigate whether I can turn these classifiers into a reliable cascade of classifiers - which is a critical step for both runtime and quality performance. My first thoughts are to use the ROC curve to choose a threshold such that some of the false positives are culled (but no false negatives), remove those from the data-set and then train a new classifier; and repeat until one reaches a satisfactory result. A variation may be to also cull at the true-positive end of the classifier if there is a particularly high first-false-positive rank. My gut feeling so far is that because each classifier is not too weak it shouldn't need too many stages to be very strong; but whether this pans out in practice or not is another matter.

Then there are many things to try, from data improvements / different objects, to adding extra dimensions to the input data (e.g. multiple planes different lbp codes which test for different characterestics; somewhat analogous to multiple gabor wavelets). It should also be possible to directly create a multi-stage algorithm soley using GA - which may be worth investigating.

The ultimate aim is a strong classifier which is also computationally efficient (ahem, yeah, kinda holy grail territory here, i'm not ambitious at all). I know already that it can be implemented in ARM NEON assembly language very efficiently - under 1 clock cycle per pixel tested(!). It's also simple enough to put in hardware. I'm just not sure yet whether I can make it strong enough to compare to other algorithms, which is of course the big question.

On a side note it's kind of cool having the computer just work away by itself coming up with a solution given very simple goals. For once. It normally feels like it's the one driving me.

Of course, being xmas, and being on leave, ... I may just drink wine instead! And it's about time i put down another homebrew. And cleaned the house a bit. Time for another #3 all over too.

Saturday 21 December 2013

e-port revisited

I can't remember if i posted this here or just discussed it on the forums, but here's an update on the 'eport' code. eport is a lightweight 1:1 synchronisation point designed to assign owneship of slots in a cyclic buffer with minimum overhead. This directly allows the utilisation of some of the specific features in the epiphany architecture.

Since I can't find the post (probably just a bad subject) here's an overview of the api:

int eport_reserve(eport *)
Reserves a single slot in the port for the writer. The returned index is the slot number. Once this returns the writer has exclusive access to the given slot.
void eport_post(eport *)
Indicates that a slot previously reserved is now owned by the receiver with the implication that it contains valid data. Slots are automatically cycled in the same order they are reserved.
int eport_wait(eport *)
Waits for data to be ready at the receiving end. The return value indicates the slot number.
void eport_done(eport *)
Marks the next slot free to be re-used by the caller.

Ok so my initial implementation used power of two sizes for the indices for efficiency reasons of the cyclic buffer index calculation (modulo) and it only required 2 'pointer' values. However, because of repeating of the cycles one of the slots always sits unused, and furthermore the forcing of a power-of-two size can easily lead to needing to allocate more space than is necessary to implement an algorithm. There just isn't enough space to waste like this. An additional problem I hit when working on a real application was that the implementation required that every reserve be paired directly with a post, and every wait be paired directly with a done - which precluded the ability to use multiple slots in the buffer at once - which was kind of the whole point of it!

So I had a fiddle and came up with a new implementation which manages to allow all slots to be in use and also allows for arbitrary sizes without an expensive modulo operation. It only needs two extra local 'pointers' which are used to track multiple outstanding reserves/waits properly. If memory is tight the receiver can limit the working-set to exactly the amount it requires, or allow one extra free slot to allow for interleaving of work.

I've only just banged it up so it might not be correct or handle all the edge cases, but here it is anyway.

data structure

// Code Copyright (C) 2013 Michael Zucchi
// Licensed via GNU GPL version 3 or later.
struct eport {
        volatile unsigned int head;
        volatile unsigned int tail;
        unsigned int index;
        unsigned int next;
        unsigned int size;
        struct eport *other;
};

Each of these is allocated in pairs - one local to the sender and one local to the receiver. other pointers to the other one of the pair and everything else apart from size is initialised to 0. The only cross-communication is that the sender updates head in the receiver, and the receiver updates tail in the sender - obviously an atomic operation with no round-trip required.

index is local to each end and tracks the next slot, modulo the size of the cyclic buffer. Likewise next indicates the next 'actual' slot being requested at each end.

reserve

unsigned int eport_reserve(struct eport *port) {
        unsigned int next = port->next;

        // Check for a free slot
        while (next >= port->tail + port->size) {
                EPORT_SLEEP();
        }

        port->next = next + 1;

        // Increment index modulo size
        unsigned int r = port->index;
        port->index = (r+1) == port->size ? 0 : r+1;
        
        return r;
}

Because the "pointers" run without modulo I can simplify the capacity test. I then manually keep track of the index modulo the size for use by the caller. By using next to track the allocation it detaches the allocation from the publishing so it properly handles multiple outstanding reserves.

EPORT_SLEEP() does nothing on epiphany but might call usleep(x) on GNU/Linux.

post

void eport_post(struct eport *port) {
        unsigned int h = (port->head + 1);

        port->other->head = h;
        port->head = h;
}

Post is basically the same as it was before, except now it doesn't need to modulo the pointer. Since the slot is known to be owned by the sender all it has to do is update the pointers at both ends. The sender is the only thread that can write to head so it needs no arbitration.

reserve

unsigned int eport_wait(struct eport *port) {
        unsigned int next = port->next;

        while (port->head == next)
                EPORT_SLEEP();

        port->next = next + 1;

        // Track tail % size
        unsigned int r = port->index;
        port->index = (r+1) == port->size ? 0 : (r+1);

        return r;
}

There is some obvious (and nice) symmetry with the reserve function here in the receiver. Again next is used to detatch acceptance from recycling.

done

void eport_done(struct eport *port) {
        unsigned int t = (port->tail + 1);

        port->other->tail = t;
        port->tail = t;
}

Done is similarly symmetric to the post function. And here the receiver is the only one who can ever write to tail, so it needs to also needs no arbitration.

summary

So in short ... eport

  • Allows for lock-free, ordered synchronisation of a fixed number of buffers between two cooperating cores;
  • Allows for fully sychronous or n-buffered operation;
  • Lightweight - only a single memory write is required at each post or done call.
When coupled with epiphany's blocking-free writes and limited memory it allows one to write streaming processors which write directly to the target core without needing any additional synchronisation and potentially asynchronously.

One drawback of the implementation above is that a given port is limited to 2^32-1 operations at a time before the maths breaks down, although that could be addressed by using 64-bit integers or some more complex pointer arithmetic.

I'm pretty much done for today but this is just another small step toward getting the 1-pass image scaler going. I think it should also end up a reasonable basis for writing a 1-pass 2-d separable convolution, wavelets, and so on. For some reason although i'm just a little bit of work away from finishing it I keep putting it off - today it's that the wind kept me awake all night and i'm too knackered to think straight. e.g. althogh i haven't compiled or debugged it I have the whole 2-d bicubic scaler written, I just need to slot in this updated eport code. Another day.

On another note i'm finally on xmas leave - hopefully for a few months like last year but I have a feeling i'm going to get roped in to doing a bit of other work which might cut it short. At this point i have a head full of ideas to code on but i think for sanity's sake I will need to take a break at some point too.

Thursday 19 December 2013

Wondered why the cartoons were on telly so late ...

Oops, it's early not late. Somehow it got to 5am, and now it's nearly 6.

I was looking up Welsh accent videos after watching 'Utopia' on SBS along with a bottle of red and a couple of rum on ices and somehow 6 hours vanished before I had realised. Probably didn't help that I only had cheese, olives, and other pickled condiments for dinner. It all seemed like a good idea at the time.

The last few weeks i've been working extra hours to try to burn out my contract before the xmas break so I was already a bit wiped out and I think I got a bit over-eager from wanting it to be over. Insomnia, poor sleep in general and even some annoying 'dreams' about bit reversal permute and vertical fft algorithms just compounded this. Got a meeting in a few hours - but it's only really a hand-off of pre-prepared work and i'm sure I can burn an iso or two.

I guess i'm just really looking forward to a couple months off and started a bit early. Just 16 hours left and all my main tasks are done for the contract and i'm pretty much over it at this point. But a big component is a few of those 'never ending' research-oriented tasks that have no defined completion milestones so I still have plenty to poke at.

Hmm, forcecast is 43 today - garden will be incinerated. It was forecast 41 yesterday and from the look of the plants I can't tell if i'm waterlogging them or letting them dry out too much - I dumped a few hours worth of rainwater tank on the garden trying to help it survive. Need some spot shades to stop the exteremeties burning off since it's a lot hotter than that in the full sun on the bare earth or on the black polyethelyne picking buckets I'm using as pots for the herbs. Last year I measured over 70 on a wheelie bin and that wasn't even in full sun all day. Even the birds are barely singing well after sun-up - must know what's coming in a few hours.

Better try to get a couple of zed's - can always sleep the arvo off. Damn eyes are killing me too as they have been for few weeks; bloody hayfever and i can't be far away from needing glasses.

Wednesday 18 December 2013

Android icon lists

I was hitting some performance / usability issues with a list of icons (sigh, again) ... and looking at the code I came up last time I just don't know what I was thinking. In part it stemmed from initially using an adapter without customising the view; so i was forced to use a standard ImageView.

Anyway, I think ... i finally worked out a relatively clean solution to a listview which contains images which are loaded asynchronously.

In short:

  • Create a sub-class of ImageView and use that wherever an icon in a list is used.
  • Implement a separate loading/cache which loads images at the direction of this new ImageView via a thread. It needs to be able to throw away requests if they're not needed and generally serialise loading of images.
  • Just set the image uri/url/file on the ImageView and let it handle aborting load of an image if it didn't use it.
  • And this is the important bit which gets around an annoying problem with android: treat item 0 separately and just load it's icon directly w/o caching.

Without the last item you can end up loading the the image into the wrong imageview so end up with a missing icon - I recall having a lot of problems trying to workaround this last time, but this is a much more reliable and simple solution.

Previously I was trying to use notifyDataSetChanged() each time an image loaded as well as other nasty redirections to try and update properly whilst still coping with android (ab)using item 0 for it's own nefarious purposes (i presume it's for layout).

Originally i needed to load remote url's so I couldn't just use setBitmapURI() (otherwise i probably would've just used that at the time - it was a very short-on-time prototype), and although i currently don't need that functionality this manual code is smoother for the user and I can do other stuff like animations. The android code will delay drawing the screen until the image is actually loaded which adds a lot of judder particulalry if you're images might be large sizes.

Still not as nice as the JavaFX stuff I did, but it'll suffice. I will have to do something similar in the internode radio app - the solution there is different again but not very good either.

Saturday 14 December 2013

FFT bit-reverse, NEON

Hacked up a NEON version of the FFT permute I mentioned in the last post. Because it worked out much simpler to code up I went with the blocked-8 8-streaming version. Fortunately ARM has a bit-reverse instruction too.

Actually it ended up rather elegant; although i need to use 9 separate addresses within the inner loop, I only need to explicitly calculate one of them. The other 8 are calculated implicitly by a post-increment load thanks to being able to fit all 8 into separate registers. The destination address calculation still needs a bit reversal but 3 instructions are enough to calculate the target address and index to the next one.

This leads to a nice compact loop:

1:      rbit    r12,r11
        add     r11,r3
        add     r12,r1

        vld1.64 { d16 },[r0]!
        vld1.64 { d17 },[r4]!
        vld1.64 { d18 },[r5]!
        vld1.64 { d19 },[r6]!
        vld1.64 { d20 },[r7]!
        vld1.64 { d21 },[r8]!
        vld1.64 { d22 },[r9]!
        vld1.64 { d23 },[r10]!

        subs    r2,#1
        vstmia  r12, {d16-d23}
        bne     1b

Despite this elegance and simplicity, compared to a straightforward C version the performance is only a bit better until the cache gets overflown.

At least on this machine (parallella-16 - the ARM side) the benefits of the cache-friendly code are measuable and start at 32K elements.

Here i'm runnning so many cycles of the given permute, where the number of cycles x N = 226 - i.e. always doing the same number of total elements. This equates to 1M iterations at 64-elements. All elements are complex float. This means the smaller sizes do more loops than the larger ones - so the small up-tick on the ends is first from extra outer loop overheads, and then from extra inner loop overheads.

Also hit a weird thing with the parallella compiler: by default gcc compiles in thumb mode rather than arm.

Update: So I found out why my profiling attempts were so broken on parallella - the cpu gouverner.

# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Fixes that. It also makes emacs a lot smoother (I was wondering why text-mode emacs across a gig-e connection was lagging so much). I was running some timings on a C implementation of Cooley-Tukey and noticed some strange results - it turned out the scheduler was only kicking in once the problem got big enough so time-per-element turned into a saw-tooth.

Actually I re-ran the above plot and although the numbers did change a little bit they were only minor. Oh, I also added alignment specifiers to the vld1 instructions - which made a small but significant difference. I then tried doubling-up on the above function - forming 16 consecutive output cfloat results. But this was consistently slower - I guess hitting 2x the width on output is causing more cache thrashing (although i'm not sure why it's slower when running in-cache).

Some noise woke me up around 2am and I ended up getting up to have something to eat and then hacked until sunrise due to insomnia (pity Sunday is the mowing day around here - so now i'm grumpy and tired and need to do my own noise-making in the back yard soon). I did the timing stuff above and then looked into adding some extra processing to the permute function - i.e. the first couple of stages of the Cooley-Tukey algorithm. NEON has plenty of registers to manage 3 stages for 8 elements but it becomes a pain to shuffle the registers around to be able to take advantage of the wide instructions. So although i've nearly got a first cut done i'm losing interest (although simply being 'tired' makes it easy to become 'tired of' anything) - parallella will remove the need to fuck around with SIMD entirely, and there isn't much point working on a fast NEON FFT when ffts already exists (apart from the for-the-fun-of-it exploration). Well I do have a possible idea for a NEON implementation of a "vertical" FFT which I will eventually explore - the ability to perform the vertical portion of a 2D fft without a transpose could be a win. Because of the data-arrangement this should be easier to implement whilst still gaining the FLOP multiplier of SIMD but there are some other complications involved too.

Friday 13 December 2013

FFT jiggery pokery

I've been meaning to poke at FFT calculation for a while and I finally got around to having a proper look.

First I just needed to calculate a single value so I did it by hand - I got some strange profiling results from Java vs jtransforms but it turned out that due to it's complexity it takes many many cycles before the JIT fully optimises it.

Then i started looking from a trivial Cooley-Tukey implementation that allocated all it's output arrays ... and i was surprised it was only about 1/2 the speed of jtransforms for it's simplicity. I tried a few variations and delved a bit into understanding the memory access patterns - although after an initial bit-reversed premute the memory patterns are trivial. It was still stuck at about 1/2 the speed of jtransforms.

I had a bit more of a think about the bit-reversing memory permute stage and figured it wouldn't work terribly well with cpu caches. First I just worked on blocking the output by 8 elements (=1 cache line) toward re-arranging the reads so they represent 8 sequential streams. Strangely enough this second case is a bit slower on small workloads that fit in the cache although is a bit faster on larger ones. It's not really much of a difference either way.

 D:d                               |S:s                               |   0
 D: d                              |S:                s               |  16
 D:  d                             |S:        s                       |   8
 D:   d                            |S:                        s       |  24
 D:    d                           |S:    s                           |   4
 D:     d                          |S:                    s           |  20
 D:      d                         |S:            s                   |  12
 D:       d                        |S:                            s   |  28
 D:        d                       |S:  s                             |   2
 D:         d                      |S:                  s             |  18
 D:          d                     |S:          s                     |  10
 D:           d                    |S:                          s     |  26
 D:            d                   |S:      s                         |   6
 D:             d                  |S:                      s         |  22
 D:              d                 |S:              s                 |  14
 D:               d                |S:                              s |  30
 D:                d               |S: s                              |   1
 D:                 d              |S:                 s              |  17
 D:                  d             |S:         s                      |   9
 D:                   d            |S:                         s      |  25
 D:                    d           |S:     s                          |   5
 D:                     d          |S:                     s          |  21
 D:                      d         |S:             s                  |  13
 D:                       d        |S:                             s  |  29
 D:                        d       |S:   s                            |   3
 D:                         d      |S:                   s            |  19
 D:                          d     |S:           s                    |  11
 D:                           d    |S:                           s    |  27
 D:                            d   |S:       s                        |   7
 D:                             d  |S:                       s        |  23
 D:                              d |S:               s                |  15
 D:                               d|S:                               s|  31
This shows the source-destination read/write pattern for a bit-reversal permute. Obviously with all those sparse reads the cache coherency isn't going to be great. FWIW moving from s to d (sequential writes) proved to have better performance than from d to s (sequential reads).
 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|
This is the same, but blocked to 8 elements in the output. Unfortunately there is still a large spread of reads.
 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|
This is an attempt at improving the previous one - although the reads are still in a broad spread, they are always confined to 8 sequential streams. It does improve the performance once you get a large enough problem, but on an x86 machine N needs to be over about 8M before it shows a minor improvement. I've yet to try it on ARM.

Then I realised that Integer.reverse() wasn't mapping to a single instruction on x86 - and subsequently over half of the processing time was spent just calculating the source index ... so I put in a bit of bit-fiddling to the 8-blocked implementations which precalculated the 8 fixed element offsets relative to base-offset. I also tried a lookup table for the non-blocked implementation which made a big difference. Strangely using a lookup-table to get the base offset actually slowed it down - suggesting that the jvm is pre-calculating reverse(i) in the outer-loop if it can.

Once this permute is done the memory access pattern is just in adjacent and sequential blocks - either depth or breadth first. The 8-blocked permute is about 130%-150% of the speed of a straight System.arraycopy() for 4-1K elements.

Actually another reason I worked toward an 8-blocked implementation was in order to perform the first 8-way fft stage at the same time as the fiddling - with that in place I got to within shouting distance (under 200% execution time) of jtransforms for some small fft sizes. I was curious what I had to do to get better than that so I tried hard-coding the last stage of a size-16 fft and managed to beat it - not that this has any real practical purpose mind you.

This gives me a starting point to tackle the problem on the parallella which has it's own challenges.

PS yes I found a book or two and many papers about both fft implementation and the permute step but I had time to experiment on my own.

Wednesday 11 December 2013

Java, C, SSE, poor mans lambdas

So after being able to avoid them for decades I got sucked in to having to do some matrix maths this week. I'm mostly just using a library but there were some memory and performance problems so I had to investigate my own matrix multiply routine.

Java vs gcc

Mostly just because of curiosity I tried comparing C to Java ... and the performance difference was negligble, actually sometimes the java cpu time is neglibly less. And i'm timing executing the programme from the command line - so that includes jvm startup and just-in-time compilation. I expected more of a difference in the obvious direction given the jvm overheads so that was a nice surprise. I suppose I shouldn't really be surprised by the performance anymore ...

And then further curiosity led me to attempting a vector based implementation - just using the gcc vector types. This was only about 2.4x faster - it would be worth it worth it, but I just used threads and it works fast enough anyway.

The vector implementation in gcc is simple, one just defines a vector type and then it's much the same as OpenCL, although one must ensure the data is aligned properly otherwise performance is pants.

TBH i'm a little dissapointed hotspot isn't doing any SSE optimisations here automatically (or maybe it is, but the execution time compared to C is too close for it to be a coincidence).

WorkPool

To implement the multi-threaded code I started using a poor-mans implementation of lambdas, or at least the parallel foreach part of it which is imho the main point of interest. I'm waiting for the jdk8 ga before pushing any java8 requirement onto my customer. It doesn't work as well as the lambda code in jdk8 but it isn't too far off and the syntax is fine as far as i'm concerned.

For simplicity I just have a static class which manages one thread per cpu with a simple static foreach call:

  public class WorkPool {
     ... threads stuff ...

  public interface WorkItem {
    public void accept(int i);
  }

  public static synchronized void foreach(int start, int end, WorkItem job) {
     ... statically partition work across threads ...
     ... pass job to threads ...
     ... await completion ...
  }

With obvious usage:

    WorkPool.foreach(0, N, new WorkItem() {
       public void accept(int i) {
           array[i] = blah ...;
       }
    });
It's a bit clumsy without the 'effectively final' of Java8, and arguably clumsy due to the syntax but if each WorkItem does a sizable amount of work the overheads are acceptable. More importantly it just gets me thinking about how to solve problems in ways that will map immediately to Java8 when I start using it.

There's some other interesting stuff about the parallel execution model that I want to talk about but i'll leave that for another post. It's about mapping non-rectilinear work loads to a linear index and job execution order.

On another note, I keep running into problems with the thread job management on Java and keep end up having to write custom solutions. Executors seem like a great idea ... and they probably are for enterprise workloads but they basically suck for interactive desktop programmes - where you may be getting many many more update requests than you have time to process but you only need to keep (and must keep) the last one. Managing this with Future handles gets clumsy very fast. JavaFX comes with a new set of 'worker thread' primitives which I haven't really looked into much - they seemed to be light wrappers over existing functionality anyway and only seemed to muddy the waters last time I had a look.

One current implementation of WorkPool uses a ThreadPool executor but I will look at custom thread code too (curiosity again). Currently i'm also implementing static scheduling but it will be worth investigating something more dynamic.

Monday 9 December 2013

fpu mode, compiler options

Poked a bit more at the 2d scaler on the parallella yesterday. I started just working out all the edge cases for the X scaler, but then I ended up delving into compiler options and assembler optimisations.

Because the floating point unit has some behaviour defined by the CONFIG register the compiler needs to twiddle bits quite a bit - and by default it seems to do it more often than you'd expect. And because it supports writing interrupt handlers in C it also requires any of these bit twiddles to occur within an interrupt disable block. Fun.

To cut a long story short I found that fiddling with the compiler flags makes a pretty big difference to performance.

The flags which seemed to produce the best result here were:

  -std=gnu99 -O2 -ffast-math -mfp-mode=truncate -funroll-loops

Actually the option that has the biggest effect was -mfp-mode=truncate as that removes many of the (redundant) mode switches.

What I didn't expect though is that the CONFIG register bits also seem to have a big effect on the assembly code. By adding this to the preamble of the linear interpolator function I got a significant performance boost. Without it it's taking about 5.5Mcycles per core, but with it it's about 4.8Mcycles!?

        mov     r17,#0xfff0
        movt    r17,#0xfff1
        mov     r16,#1
        movfs   r12,CONFIG
        and     r12,r12,r17     ; set fpumode = float, turn off exceptions
        orr     r12,r12,r16     ; truncate rounding
        movts   CONFIG,r12      

It didn't make any difference to the results whether I did this or not.

Not sure what's going on here.

I have a very simple routine that resamples a single line of float data using linear interpolation. I was trying to determine if such a simple routine would compile ok or would need me to resort to assembler language for decent performance. At first it looked like it was needed until I used the compiler flags above (although later I noticed I'd left an option to disable inlining of functions that I was using to investigate compiler output - which may have contributed).

The sampler i'm using is just (see: here for a nice overview):

static inline float sample_linear(float * __restrict__ src, float sxf) {
                int sx = (int)sxf;
                float r = sxf - sx;
                float y1 = src[sx];
                float y2 = src[sx+1];
                
                return (y1*(1-r)+y2*r);
}
Called from:
static void scale_linex(float * __restrict__ src, float sxf, float * __restrict__ dst, int dlen, float factor) {
        int x;

        for (x=0;x<dlen;x++) {
                dst[x] = sample(src, sxf);

                sxf += factor;
        }
}

A straight asm implementation is reasonably simple but there are a lot of dependency-stalls.

        mov     r19,#0  ; 1.0f
        movt    r19,#0x3f80

        ;; linear interpolation
        fix     r16,r1          ; sx = (int)sxf

        lsl     r18,r16,#2
        float   r17,r16         ; (float)sx
        add     r18,r18,r0
        fsub    r17,r1,r17      ; r = sxf - sx

        ldr     r21,[r18,#1]    ; y2 = src[sx+1]
        ldr     r20,[r18,#0]    ; y1 = src[sx]

        fsub    r22,r19,r17     ; 1-r
        fmul    r21,r21,r17     ; y2 = y2 * r
        fmadd   r21,r20,r22     ; res = y2 * r + y1 * (1-r)

(I actually implement the whole resample-row routine, not just the sampler).

This simple loop is much faster than the default -O2 optimisation, but slower than the C version with better optimisation flags. I can beat the C compiler with an implementation which processes 4 output pixels per loop - thus allowing for better scheduling with a reduction in stalls, and dword writes to the next core in the pipeline. But the gain is only modest for the amount of effort required.

Overview of

    Routine         Mcycles per core

    C -O2                10.3
    C flags as above      4.2

    asm 1x                5.3
    asm 1x force CONFIG   4.7
    asm 4x                3.9
    asm 4x force CONFIG   3.8
I'm timing the total instruction cycles on the core which includes the synchronisation work. Image is 512x512 scaled by 1.7x,1.0.

On a semi-relted note I was playing with the VJ detector code and noticed the performance scalability isn't quite so good on PAL-res images because in practice image being searched is very small. i.e. parallelism isn't so hot. I hit this problem with the OpenCL version too and probably the solution is the same as I used there: go wider. Basically generate all probe scales at once and then process them all at once.