Tuesday, 13 August 2013

Progress on object detection

I spent more hours than I really wanted to trying to crack how to fit a haarcascade onto the epiphany in some sort of efficient way, and I've just managed to get my first statistics. I think they look ok but I guess i will only know once I load it onto the device.

First the mechanism i'm using.

Data structures

Anyone who has ever used the haarcascades from opencv will notice something immediately - their size. This is mostly due to the XML format used, but even the relatively compact representation used in socles is bulky in terms of epiphany LDS.

struct stage {
    int stageid;
    int firstfeature;
    int featurecount;
    float threshold;
    int nextSuccess;
    int nextFail;

struct feature {
    int featureid;
    int firstregion;
    int regioncount;
    float threshold;
    float valueSuccess;
    float valueFail;

struct region {
    int regionid;
    int left;
    int top;
    int width;
    int height;
    int weight;

For the simple face detector i'm experimenting with there are 22 stages, 2135 features and 4630 regions, totalling ~150K in text form.

So the first problem was compressing the data - whilst still retaining ease/efficiency of use. After a few iterations I ended up with a format which needs only 8 bytes per region whilst still including a pre-calculated offset. I took advantage of the fact that each epiphany core will be processing fixed-and-limited-sized local window, so the offsets can be compile-time calculated as well as fit within less than 16 bits. I also did away with the addressing/indexing and took advantage of some characterstics of the data to do away with the region counts. And I assumed things like a linear cascade with no branching/etc.

I ended up with something like the following - also note that it is accessed as a stream with a single pointer, so i've shown it in assembly (although I used an unsigned int array in C).

    .int    22                           ; stages

    ;; stage 0
    .short  2,3                          ; 3 features each with 2 regions
    ;; one feature with 2 regions
    .short  weightid << 12 | a, b, c, d  ; region 0, pre-calculated region offsets with weight
    .short  weightid << 12 | a, b, c, d  ; region 1, pre-calculated region offsets with weight
    .float  threshold, succ, fail        ; feature threshold / accumulants
    ;; ... plus 2 more for all 3 features

    .short  0,0                          ; no more features
    .float  threshold                    ; if it fails, finish

    ;; stage 1
    .short 2,15                          ; 15 features, each with 2 regions

This allows the whole cascade to fit into under 64K in a readonly memory in a mostly directly usable form - only some minor bit manipulation is required.

Given enough time I will probably try a bigger version that uses 32-bit values and 64-bit loads throughout, to see if the smaller code required for the inner loop outweights the higher memory requirements.

Overlays ...

Well I guess that's what they are really. This data structure also lets me break the 22-stage cascade into complete blocks of some given size, or arbitrary move some to permanent local memory.

Although 8k is an obvious size, as i want to double-buffer and also keep some in permanent local memory and still have room to move, I decided on the size of the largest single stage - about 6.5K, and moved 1.2K to permanent LDS. But this is just a first cut and a tunable.

The LDS requirements are thus modest:

    |             |
    |local stages |
    |    1k2      |
    |             |
    |  buffer 0   |
    |    6k5      |
    |             |
    |  buffer 1   |
    |    6k5      |

With these constraints I came up with 13 'groups' of stages, 3 stages in the local group, and the other 19 spread across the remaining 12 groups.

This allows for a very simple double-buffering logic, and hopefully leaves enough processing at each buffer to load the next one. All the other stages are grouped into under 6k5 blocks which are referenced by DMA descriptors built by the compiler.

This also provides a tunable for experimentation.


So the other really big bandwith drain is that these 6000 features are tested at each origin location at each scale of image ...

If you have a 20x20 probe window and a 512x512 image, this equates to 242064 window locations, which for each you will need to process at least stage 0 - which is 6 regions at 4 memory accesses per region, which is 5 million 32-bit accesses or 23 megabytes. If you just access this directly into image data it doesn't cache well on a more sophisticated chip like an ARM (512 floats is only 4 lines at 16K cache), and obviously direct access is out of the question for epiphany.

So one is pretty much forced to copy the memory to the LDS, and fortunately we have enough room to copy a few window positions, thus reducing global memory bandwidth about 100 fold.

For simplicity my first cut will access 32x32 windows of data, and with a cascade window size of 20x20 this allows 12x12 = 144 sub-window probes per loaded block to be executed with no external memory accesses. 2D DMA can be used to load these blocks, and what's more at only 4K there's still just enough room to double buffer these loads to hide any memory access latency.

LDS for image data buffers:

    |    SAT 0    |
    | 32x32xfloat |
    |     4k      |
    |    SAT 1    |
    | 32x32xfloat |
    |     4k      |

A variance value is also required for each window, but that is only 144 floats and is calculated on the ARM.


So the basic algorithm is:

    Load in 32x32 window of summed-area-table data (the image data)
    prepare load group 0 into buffer 0 if not resident
    for each location x=0..11, y=0..11
        process all local stages
        if passed add (x,y) to hits     

    gid = 0
    while gid < group count AND hits not empty
        wait dma idle
        if gid < group count
            prepare load group gid into dma buffer gid & 1
              if not resident
        for each location (x,y) in hits
           process all stages in current buffer
           if passed add (x,y) to newhits
       hits = newhits

    return hits

The window will be loaded double-buffered on the other dma channel.


So i have this working on my laptop stand-alone to the point that I could run some efficiency checks on bandwidth usage. I'm using the basic lenna image - which means it only finds false positives - so it gives a general lowerish bound one might expect for a typical image.

Anyway these are the results (I may have some boundary conditions out):

Total window load         : 1600
Total window bytes        : 6553600
Total cascade DMA required: 4551
Total cascade DMA bytes   : 24694984
Total cascade access bytes: 371975940
Total SAT access bytes    : 424684912

So first, the actual global memory requirements. Remember that the image is 512x512 pixels.

window load, window bytes

How many windows were loaded, and how many bytes of global memory this equates to. Each window is 32x32xfloats.

This can be possibly be reduced as there is nearly 1/3 overlap between locations at the expense of more complex addressing logic. Another alternative is to eschew the double-buffering and trade lower bandwidth for a small amount of idle time while it re-arranges the window data (even a 64x32 window is nearly 5 times more global-bandwidth efficient, but i can't fit 2 in for double buffering).

cacade DMA count, cascade DMA bytes

How many individual DMA requests were required, and their total bytes. One will immediately notice that the cascade is indeed the majority of the bandwidth requirement ...

If one is lucky, tuning the size of each of the cascade chunks loaded into the double buffers may make a measureable difference - one can only assume the bigger the better as DMA can be avoided entirely if the cascade aborts within 11 stages with this design.

Next we look at the "effective" memory requirements - how much data the algorithm is actually accessing. Even for this relatively small image without any true positives, it's quite significant.

It requires something approaching a gigabyte of memory bandwidth just for this single scale image!

Total SAT bytes

The total image (SAT) data accessed is about 420MB, but since this code "only" needs to load ~6.5MB it represents a 65x reduction in bandwidth requirements. I am curious now as to whether a simple technique like this would help with the cache hit ratio on a Cortex-A8 now ...

Total cascade bytes

The cascade data that must be 'executed' for this algorithm is nearly as bad - a good 370MB or so. Here the software cache isn't quite as effective - it still needs to load 24MB or so, but a bandwidth reduction of 15x isn't to be sneezed at either. Some of that will be the in-core stages. Because of the double buffering there's a possibility the latency of accessing this memory it can be completely hidden too - assuming the external bus and mesh fabric isn't saturated, and/or the cascade processing takes longer than the transfer.

LDS vs cache can be a bit of a pain to use but it can lead to very fast algorithms and no-latency memory access - with no cache thrashing or other hard-to-predict behaviour.

Of course, the above is assuming i didn't fuck-up somewhere, but i'll find that out once I try to get it working on-device. If all goes well I will then have a baseline from which to experiment with all the tunables and see what the real-world performance is like.


Alexander Gorban said...

Actually face detection was implemented for Parallella board by CVisionLab team http://www.adapteva.com/white-papers/face-detection-using-the-epiphany-multicore-processor/ source code is available.

NotZed said...

Yeah I know, i've looked at it. This is a differnet algorithm implemented in a differnet way.

To learn one needs to write code, not just read it, and it's just something i'm familair with.