Just some notes on optimising the assembly language version of the
Viola-Jones cascade walker I've been working on for the Epiphany chip.
I'm still working toward a tech demo but I got bogged down with the
details of the resampling code - I'm using the opportunity to finally
grok how upfirdn works.
Excuse the typos, I did this in a bit of a rush and don't feel like
fully proof-reading it.
The algorithm
First the data structure. This allows the cascade to be encoded
using dword (64-bit) alignment. It's broken into 64-bit elements for
C compatibility.
union drecord {
        unsigned long long v;
        struct {
                unsigned int flags;
                float sthreshold;
        } head0;
        struct {
                unsigned int count2;
                unsigned int count3;
        } head1;
        struct {
                unsigned short a,b,c,d;
        } rect;
        struct {
                float weight1;
                float fthreshold;
        } f0;
        struct {
                float fsucc,ffail;
        } f1;
};
And then a C implementation using this data structure. The
summed-area-table (sat) sum calculates the average of all pixels
within that rectangle. The sat table size is hard-coded to a
specific width and encoded into the compiled cascade. Because it is
only ever processed as part of a window of a known size this doesn't
limit its generality.
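Since everything downstream hinges on four corner lookups into that table, here is a small self-contained sketch of how a summed-area table is built and queried in plain C. The helper names (`build_sat`, `rect_sum`) are mine, not from the cascade code; they just show the usual inclusive d + a - b - c corner identity that the packed a,b,c,d indices rely on:

```c
/* Build an inclusive summed-area table over a w x h image:
   sat[y*w+x] holds the sum of img over the rectangle (0,0)..(x,y). */
static void build_sat(const float *img, float *sat, int w, int h)
{
        for (int y = 0; y < h; y++) {
                float row = 0.0f;

                for (int x = 0; x < w; x++) {
                        row += img[y * w + x];
                        sat[y * w + x] = row + (y > 0 ? sat[(y - 1) * w + x] : 0.0f);
                }
        }
}

/* Sum of pixels in the inclusive rectangle (x0,y0)..(x1,y1):
   four corner lookups, d + a - b - c, whatever the rectangle size. */
static float rect_sum(const float *sat, int w, int x0, int y0, int x1, int y1)
{
        float d = sat[y1 * w + x1];
        float b = y0 > 0 ? sat[(y0 - 1) * w + x1] : 0.0f;
        float c = x0 > 0 ? sat[y1 * w + (x0 - 1)] : 0.0f;
        float a = y0 > 0 && x0 > 0 ? sat[(y0 - 1) * w + (x0 - 1)] : 0.0f;

        return d + a - b - c;
}
```

In the cascade format the four indices are pre-computed per rect, so the runtime only pays for the loads and three adds.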
It performs a feature test on a 2-region feature, which equates to
either a "this half is brighter than the other half" test in all 4
directions, or a "the middle half is brighter than the two quarter
sides" test in both directions and senses.
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
int test_cascade(float *sat, float var, const union drecord *p, float *ssump) {
        union drecord h0;
        union drecord h1;
        float ssum = *ssump;

        do {
                h0 = (*p++);
                h1 = (*p++);

                while (h1.head1.count2) {
                        union drecord r0, r1, f0, f1;
                        float rsum;

                        r0 = (*p++);
                        r1 = (*p++);
                        f0 = (*p++);
                        f1 = (*p++);

                        rsum = (sat[r0.rect.a] + sat[r0.rect.d]
                                - sat[r0.rect.b] - sat[r0.rect.c]) * -0.0025f;
                        rsum += (sat[r1.rect.a] + sat[r1.rect.d]
                                - sat[r1.rect.b] - sat[r1.rect.c]) * f0.f0.weight1;

                        ssum += rsum < f0.f0.fthreshold * var ? f1.f1.fsucc : f1.f1.ffail;
                        h1.head1.count2--;
                }

                /* ... 3-feature test is much the same ... */

                if (h0.head0.flags & 1) {
                        if (ssum < h0.head0.sthreshold) {
                                return 0;
                        }
                        ssum = 0;
                }
        } while ((h0.head0.flags & 2) == 0);

        *ssump = ssum;
        // keep on going
        return 1;
}
As one can see the actual algorithm is really very simple. The
problem with making it run fast is dealing with the amount of data
that it can chew through, as I've mentioned and detailed in previous
posts.
I don't have any timings but this should be a particularly fast
implementation on a desktop CPU too - most of the heavy lifting fits
in the L1 cache for example, and it pre-compiles as much as
possible.
Hardware specific optimisations
This covers a couple of optimisations made to take advantage of the
instruction set.
First issue is that there is no comparison operation - all one can do
is subtract and compare flags. Furthermore there are only limited
comparison operators available - equal, less-than and
less-than-or-equal. So in general a compare is at least 2
instructions (and more if you want to be IEEE compliant, but that isn't needed here).
On the other hand there are fmadd and fmsub instructions - AND these
set the flags. So it is possible to perform all three operations in
one instruction given that we don't need to know the precise value.
Another feature of the epu is that the floating point and integer
flags are separate so this can be utilised to fill instruction slots
and also perform control flow without affecting the flags.
The epu is most efficient when performing dword loads. It's the same
speed as a word load, and faster than a short or byte load. So the
format is designed to support all dword loads.
Another general optimisation is in pre-compiling the cascade for the
problem. So far I'm only using it to pre-calculate the array offsets,
but it could also be used to alter the sign of calculations to suit
the available fpu flags.
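The offset pre-calculation amounts to nothing more than baking the index arithmetic into the compiled cascade. A sketch of that compile-time step - the function name is mine and SAT_WIDTH of 32 is just an illustrative value, not the real stride:

```c
#include <stdint.h>

/* Illustrative compile-time step: turn a rect corner (x,y) into a SAT
   index pre-multiplied by sizeof(float), so the inner loop can use it
   directly as a byte offset with no shift.  SAT_WIDTH models the
   hard-coded row stride mentioned earlier; 32 is an example value. */
#define SAT_WIDTH 32

static uint16_t sat_byte_offset(int x, int y)
{
        return (uint16_t)((y * SAT_WIDTH + x) * sizeof(float));
}
```

The result still fits the unsigned short a,b,c,d fields of the drecord rect for any sensible window size.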
Update: Because the Epiphany LDS is so small, another optimisation was to make the cascade streamable. Although the single biggest stage in the test cascade fits in 8k it is pretty tight, and limits the code flexibility and tuning options (e.g. trading off space and time). It also limits generality - other cascades may not have the same topology. So the cascade format is designed so it can be broken at completely arbitrary boundary points with very little overhead - this is probably the single most important bit of engineering in the whole exercise and determines everything else. The difficulty isn't so much in designing the format as in recognising the need for it and its requirements in the first place. Having a streamable cascade adds a great deal of flexibility for dealing with large structures - they can be cached easily and implementing read-ahead is trivial.
There were some other basic optimisation techniques which became available after studying the actual data:
- 2-region features use only two variations of weights, so they can be encoded in 1 bit or in a single float (the first weight is always the same).
- 3-region features all use the same weights, therefore all 3 floats can be thrown away.
- The original cascade format had 2 and 3 region features scattered amongst the cascade randomly, which means any inner loop has to deal with the different number of elements (and branch!). Once one realises the only result is the sum, they can be processed in any order (summation algebra ftw ... again), meaning I could group them and optimise each loop separately.
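The grouping in the last point is a trivial offline pass. A sketch with illustrative names - the real cascade compiler works on the packed drecord stream, not a struct like this:

```c
#include <string.h>

struct feature {
        int nregions;   /* 2 or 3 */
        /* ... rects, weights, thresholds ... */
};

/* Stable-partition a stage's features so all 2-region features come
   first, then all 3-region ones.  The stage sum is order-independent,
   so each group can then run in its own branch-free inner loop. */
static void group_features(struct feature *f, int n, struct feature *tmp)
{
        int j = 0;

        for (int i = 0; i < n; i++)
                if (f[i].nregions == 2)
                        tmp[j++] = f[i];
        for (int i = 0; i < n; i++)
                if (f[i].nregions == 3)
                        tmp[j++] = f[i];
        memcpy(f, tmp, n * sizeof *f);
}
```

Keeping the partition stable preserves the original order within each group, which makes verifying the reordered cascade against the original easy.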
Some of these seem to lose the generality of the routine - but actually the weights always stand in the same relationship; they are just scaled to the size of the native cascade window. So making the algorithm general would not take much effort.
These are things I missed when I worked on my OpenCL version so I think I could improve that further too. But trying to utilise the concurrency and dealing with the cascade size is what kills the GPU performance so it might not help much as it isn't ALU constrained at all. If I ever get a GCN APU I will definitely revisit it though.
Unscheduled ASM
After a (good) few days' worth of hacking blind and lots of swearing I
finally came up with the basic code below. I was dreaming in register
loads ...
Actually this was de-scheduled in order to try to follow it and
re-schedule it more efficiently. This is the top part of the C code
and the entire 2-region loop.
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
0: ldrd r18,[r7,#1] ; count2, count3
ldr r16,[r7],#4 ; flags
and r0,r18,r5 ; check zero count
beq 1f
2: ldrd r0,[r7],#4 ; 0: load a,b,c,d
ldrd r2,[r7,#-3] ; 1: load a,b,c,d
lsr r4,r0,#16 ; 0:b index
ldr r21,[r6,r4] ; 0:load b
and r0,r0,r5 ; 0:a index
ldr r20,[r6,r0] ; 0:load a
lsr r4,r1,#16 ; 0: d index
ldr r23,[r6,r4] ; 0: load d
and r1,r1,r5 ; 0: c index
ldr r22,[r6,r1] ; 0: load c
lsr r4,r2,#16 ; 1: b index
ldr r25,[r6,r4] ; 1: load b
and r2,r2,r5 ; 1: a index
ldr r24,[r6,r2] ; 1: load a
lsr r4,r3,#16 ; 1: d iindex
ldr r27,[r6,r4] ; 1: load d
and r3,r3,r5 ; 1: c index
ldr r26,[r6,r3] ; 1: load c
ldrd r50,[r7,#-2] ; load w1, rthreshold
fsub r44,r20,r21 ; 0: a-b
fsub r45,r23,r22 ; 0: d-c
fsub r46,r24,r25 ; 1: a-b
fsub r47,r27,r26 ; 1: d-c
fmul r48,r51,r60 ; rthreshold *= var
fadd r44,r44,r45 ; 0[-1]: a+d-b-c
fadd r45,r46,r47 ; 1[-1]: a+d-b-c
fmsub r48,r44,r63 ; [-1]: var * thr -= (a+d-b-c) * w0
ldrd r52,[r7,#-1] ; [-1] load fsucc, ffail
fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
movblte r52,r53
fsub r17,r17,r52 ; [-2]: ssum -= var * thr > (rsum) ? fsucc: ffail
sub r18,r18,#1
bne 2b
1:
Apart from the trick with the implicit 'free' comparison operations,
it pretty much ended up a direct translation of the C code (much of
the effort was in the format design and getting the code to run). But even in this
state it will execute much faster than what the compiler generates for
the very simple loop above. Things the C compiler is missing:
- It doesn't use dword loads - more instructions are needed
- It does use hword loads - these cause fixed stalls
- It is using an ieee comparison function (compiler flags may change this)
- It doesn't use fmsub as much, certainly not for comparison
- It needs to multiply the array references by 4
Because there are no datatypes in asm, this can take advantage of the
fact that the array lookups are by the byte, and pre-calculate the
shift (multiply by sizeof(float)) in the cascade. In the C version I
do not, as it adds a shift for a float array reference - I do have a
way to remove that in C but it's a bit ugly.
Otherwise - it's all very straightforward in the inner loop.
First it loads all the rect definitions and then looks them up in the sat table (r6).
Then it starts the calculations, first calculating the average and
then using fmsub to perform the multiply by the weight and comparison
operation in one.
At the very end of the loop the last flop is to perform a subtraction
on the ssum - this sets the status flags to the final comparison (if
(ssum < h0.head0.sthreshold) in C). This actually requires some
negation in code that uses it which could be improved - the
threshold could be negated in the cascade for example.
If one looks closely one will see that the registers keep going up
even though many are out of scope and can be re-used. This is done on
purpose and allows for the next trick ...
I don't have the full profiling info for this version, but I have a note
that it includes 15 RA stalls, and I think from memory only dual-issues 2 of the 10 flops.
Scheduling
A typical optimisation technique is to unroll a loop, either manually
or by letting the compiler do it. Apart from reducing the relative
overhead of any loop support constructs it provides modern processors
with more flexibility to schedule instructions.
The code already has some loop unrolling anyway - the two regions are
tested using in-line code rather than in a loop.
But unrolling gets messy when you don't know the loop bounds or
don't have some other hard detail such as that there is always an even
number of loops. I didn't really want to try to look at pages of code
and try to schedule by hand either ...
So instead I interleaved the same loop - as one progresses through the
loop calculating the addresses needed for "this" result, the fpu is
performing the calculations for the "last" result. You still need a
prologue which sets up the first loop for whatever the result+1 code
is expecting, and also an epilogue for the final result - and if only
1 value is processed the guts is completely bypassed. I'll only show the guts here ...
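Before the asm itself, the shape of that interleave can be sketched in C. This is a toy example of mine, not the real loop: each iteration does "this" element's load while finishing "last" element's arithmetic, with an explicit prologue and epilogue, and with n == 1 the guts never runs.

```c
/* Toy software-pipelined reduction: the loop body overlaps "this"
   iteration's load with "last" iteration's arithmetic. */
static float pipelined_sum(const float *in, int n)
{
        float sum = 0.0f;
        float pending = in[0];          /* prologue: start result 0 */

        for (int i = 1; i < n; i++) {   /* guts: bypassed when n == 1 */
                float next = in[i];     /* "this" result's load */
                sum += pending;         /* "last" result's flop */
                pending = next;
        }
        sum += pending;                 /* epilogue: finish final result */
        return sum;
}
```

The asm below applies the same pattern, with the ialu doing the address work for result [0] while the fpu finishes result [-1].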
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
.balign 8
2:
[ 0] fsub r46,r24,r25 ; [-1] 1: a-b
[ 0] ldrd r0,[r7],#4 ; [ 0] 0: load a,b,c,d
[ 1] fsub r47,r27,r26 ; [-1] 1: d-c
[ 1] ldrd r2,[r7,#-3] ; [ 0] 1: load a,b,c,d
[ 2] fmul r48,r51,r60 ; [-1] rthreshold *= var
[ 2] lsr r4,r0,#16 ; [ 0] 0:b index
[ 3] fadd r44,r44,r45 ; [-1] 0: a+d-b-c
[ 3] ldr r21,[r6,r4] ; [ 0] 0:load b
[ 4] and r0,r0,r5 ; [ 0] 0:a index
[ 5] ldr r20,[r6,r0] ; [ 0] 0:load a
[ 6] lsr r4,r1,#16 ; [ 0] 0: d index
[ 6] fadd r45,r46,r47 ; [-1] 1: a+d-b-c
[ 7] ldr r23,[r6,r4] ; [ 0] 0: load d
[ 8] and r1,r1,r5 ; [ 0] 0: c index
[ 8] fmsub r48,r44,r63 ; [-1] var * thr -= (a+d-b-c) * w0
[ 9] ldr r22,[r6,r1] ; [ 0] 0: load c
[ 10] lsr r4,r2,#16 ; [ 0] 1: b index
[ 11] ldr r25,[r6,r4] ; [ 0] 1: load b
[ 12] and r2,r2,r5 ; [ 0] 1: a index
[ 13] ldr r24,[r6,r2] ; [ 0] 1: load a
[ 13] fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
[ 14] ldrd r52,[r7,#-5] ; [-1] load fsucc, ffail
[ 15] lsr r4,r3,#16 ; [ 0] 1: d index
[ 16] and r3,r3,r5 ; [ 0] 1: c index
[ 17] ldr r27,[r6,r4] ; [ 0] 1: load d
[ 18] movblte r52,r53 ; [-1] val = var * thr < rsum ? fsucc : ffail
[ 19] fsub r44,r20,r21 ; [ 0] 0: a-b
[ 19] ldr r26,[r6,r3] ; [ 0] 1: load c
[ 20] fsub r45,r23,r22 ; [ 0] 0: d-c
[ 20] sub r18,r18,#1
[ 21] ldrd r50,[r7,#-2] ; [-1] load w1, rthreshold
[ 21] fsub r17,r17,r52 ; [-1] ssum -= var * thr > (rsum) ? fsucc: ffail
[ 22] bne 2b
[ 26] ; if back to the start of the loop
Update: I tried to improve and fix the annotations in the comments. The [xx] value is the index of the result this instruction is working on, the next x: value is the index of the region being worked on (where it is needed).
I've attempted to show the clock cycles the instructions start on (+ 4
for the branch), but it's only rough. I know from the hardware
profiling that every flop dual-issues and there are no register
stalls. The loop start alignment is also critical to the lack of
stalls. And it took a lot of guess-work to remove the final stall
which lingered in the last 5 instructions (someone will probably tell
me now that the SDK has a cycle timer, but no matter if they do).
It fell out almost completely symmetrically - that is, having
all ialu ops in loop 0 and all flops in loop 1 - but by rotating
the flops around a bit I managed to make the final flop the ssum
"subtraction + comparison" operation, with no stalls ...
The movblte instruction which performs the ternary is the one that
uses the implicit comparison result from the fmsub earlier. Not only
does this save one instruction, it also saves the 5 clock cycle
latency it would add - and this loop has no cycles to spare that I could find.
There is some more timing info for this one on the previous post. The version
that this is 30% faster than is not the unscheduled one above but an earlier scheduling attempt.
Oh, I should probably mention that I found the bugs, and the timings in the previous post did change a bit for the worse, but not significantly.