This is an almost perfect fit for a SIMD processor, and in NEON it is no exception.

The simplest mechanism is successive averaging of a 2x2 blocks, and a simple C implementation which works ok is something like this:

void resample_1_2(uchar *src, uchar *dst, int sw, int sh) { int x,y; int dw = sw >> 1; for (y=0;y<sh;y+=2) { for (x=0;x<sw;x+=2) { uint v; v = (src[x+y*sw] + src[x+1+y*sw] + src[x+(y+1)*sw] + src[x+1+(y+1)*sw]) >> 2; dst[x/2+y/2*dw] = (uchar)v; } } }

This can then just be called successively on the result from the previous run, until one reaches the desired depth.

This is easy to convert to SIMD, but some of the operations cannot be expressed in C or parallel variants, so it's easier just to show it in NEON. The only 'issue' to deal with is that NEON uses 8-byte memory accesses, but that isn't too difficult to cope with - we just produce at least 8 pixels at a time, i.e. 16 bytes input.

So the basic 1/2 downsample (actually it's 1/4) step is:

@ Load 16x2 pixels vld1.u8 { d0, d1 },[r0],r1 vld1.u8 { d2, d3 },[r0],r1 @ Add horizontal adjacent pixels vpaddl.u8 q0,q0 vpaddl.u8 q1,q1 @ Add vertical adjacent pixels vadd.u16 q0,q1 @ Average (/4) vshrn.i16 d0,q0,#2 @ Output 8x1 vst1.u8 { d0 }, [r2,:64],r3

The 'trick' here is the vpaddl instruction, which adds adjacent items in the vector and produces a result twice as wide, preserving all bits of precision. And the other trick is the vshrn instruction - shift right with narrow - which performs a shift and a cast to a half-sized result in one go.

The only problem with this is there just isn't enough work to do so the processor will be stalled some of the time. It also vastly underutilises the available resources. One obvious solution is just to go wider - take 32x2 pixels, and produce 16x1 results. A better solution is to go wider and deeper.

Since the data for the level above is already present in registers, one can save the memory load next time and just do the processing now. This can be extended as far as one has registers or as far as one wants to write specific custom code, but for me I just added another level.

So this is the final routine which takes a 32x32 pixel region, and generates two outputs: 16x16 1/2 downsampled, and 8x8 1/4 downsampled. Each loop generates 16x2 1/2 scaled results and 18x1 1/4 scaled ones - at full precision. (Actually again - these are 1/4 and 1/16 downsampled respectively since it does X and Y at 1/2 and 1/4 each, but you get the idea).

@ @ Rescale 32x32 bytes by 1/2 and 1/4 @ @ r0 = src @ r1 = width (stride) @ r2 = dst 1/2 @ r3 = width/2 (stride) @ stack0 r4 = dst 1/4 @ stack1 r5 = width/4 (stride) .global resample_12_14_neon resample_12_14_neon: stmdb sp!, {r4-r6, lr} ldr r4,[sp,#4*4] ldr r5,[sp,#4*4+4] mov r6,#8 @ for 32 input rows 1: @ first 2 samples vld1.u8 { d0, d1, d2, d3 },[r0],r1 vld1.u8 { d4, d5, d6, d7 },[r0],r1 @ sum across vpaddl.u8 q0,q0 vpaddl.u8 q1,q1 vpaddl.u8 q2,q2 vpaddl.u8 q3,q3 @ sum down into 1/4 accumulation registers vadd.u16 q14,q0,q2 vadd.u16 q15,q1,q3 @ output 1/2 scale vshrn.i16 d0,q14,#2 vshrn.i16 d1,q15,#2 vst1.u8 { d0, d1 }, [r2,:64],r3 @ Repeat for row 1 vld1.u8 { d0, d1, d2, d3 },[r0],r1 vld1.u8 { d4, d5, d6, d7 },[r0],r1 vpaddl.u8 q0,q0 vpaddl.u8 q1,q1 vpaddl.u8 q2,q2 vpaddl.u8 q3,q3 vadd.u16 q0,q2 vadd.u16 q1,q3 @ Accumulate 1/4 scale vadd.u16 q14,q0 vadd.u16 q15,q1 vshrn.i16 d0,q0,#2 vshrn.i16 d1,q1,#2 vst1.u8 { d0, d1 }, [r2,:64],r3 @ Generate 1/4 (1/16) scale vpaddl.u16 q14,q14 vpaddl.u16 q15,q15 vshrn.i32 d28,q14,#4 vshrn.i32 d29,q15,#4 vmovn.i16 d28,q14 subs r6,#1 vst1.u8 { d28 }, [r4,:64],r5 bhi 1b ldmfd sp!, {r4-r6, pc}

Of note perhaps is that the 1/4 result is generated in the high registers rather than use q4-q5. This is to honour the C ABI and avoid the need for saving these registers, only q4-q7 need to be saved.

It may be that the 32x32 access pattern is not ideal, and it requires appropriate padding/alignment of the mipmap levels, but I had more interesting things to look at so I didn't spend a lot of time on it.

One added benefit of doing the two levels at once is improved precision of the lower level since it doesn't use truncated intermediate results. For this reason it may be worth extending it to 3 levels, or 4 if registers allowed and the image stride restrictions weren't onerous. For VGA signals 3 levels of sub-sampling is adequate for what I need so it probably isn't worth going further than that.

Actually for this reason of precision, and the fact that I am usually down-sampling the scaled result from the mipmap, I decided the '2 row' version of the SIMD scaling routine would better suit my purposes. And the fact I had it already working helped ...

### Classifier

So with all this I put together a timing routine for my classifier (I don't have the classifier itself setup yet/or debugged). The timing as usual depends a great deal on the scales one searches but, but when going from a scale of 0.207 to 0.333 in 6 steps (equating to an object width of 51 to 85 pixels for my 17x17 classifier), loading the mipmap, scaling to all the target scales and generating the LBPu2 codes takes about 5ms on the beagleboard-xm (image is 512x512). Which is pretty ok by me - even if I could squeeze a bit more out of it it's not going to be much.

The problem is now the classifier - when I turn that on it blows out to about 200ms (and most of that is at the smallest scale), or a pretty paltry 5fps (but of course, one must be-ware of timing on unverified code, it could be totally bogus). But I had a thought on that just before going to bed last night and I'm off to go see if i can't improve on that.

## No comments:

Post a Comment