Friday, 17 August 2012

mipmap generation with SIMD/NEON

One part of the code I've been working on is building mipmaps, so I can quickly generate non-aliased images at multiple scales.

This is an almost perfect fit for a SIMD processor, and NEON is no exception.

The simplest mechanism is successive averaging of 2x2 blocks, and a simple C implementation which works ok is something like this:

typedef unsigned char uchar;
typedef unsigned int uint;

/* Average each 2x2 block of src (sw x sh) into dst (sw/2 x sh/2) */
void resample_1_2(uchar *src, uchar *dst, int sw, int sh) {
  int x,y;
  int dw = sw >> 1;

  for (y=0;y<sh;y+=2) {
    for (x=0;x<sw;x+=2) {
      uint v;

      v = (src[x+y*sw]
           + src[x+1+y*sw]
           + src[x+(y+1)*sw]
           + src[x+1+(y+1)*sw]) >> 2;

      dst[x/2+y/2*dw] = (uchar)v;
    }
  }
}

This can then just be called successively on the result from the previous run, until one reaches the desired depth.
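
Just as a sketch (the build_mipmaps name and the levels array are mine, purely for illustration), the driver amounts to nothing more than:

/* Hypothetical driver: build n mipmap levels from a w x h base image.
 * levels[0] is the source image; each successive level halves both
 * dimensions, so w and h are assumed to stay even for as deep as we go. */
void build_mipmaps(uchar **levels, int n, int w, int h) {
  int i;

  for (i=1;i<n;i++) {
    resample_1_2(levels[i-1], levels[i], w, h);
    w >>= 1;
    h >>= 1;
  }
}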

This is easy to convert to SIMD, but some of the operations cannot be expressed directly in C or its parallel variants, so it's easier just to show it in NEON. The only 'issue' to deal with is that NEON loads and stores work in 8-byte (D register) chunks, but that isn't too difficult to cope with - we just produce at least 8 pixels at a time, i.e. 16 bytes of input.

So the basic 1/2 downsample step (actually 1/4, since both X and Y are halved) is:

        @ Load 16x2 pixels
        vld1.u8         { d0, d1 },[r0],r1
        vld1.u8         { d2, d3 },[r0],r1

        @ Add horizontal adjacent pixels
        vpaddl.u8       q0,q0
        vpaddl.u8       q1,q1

        @ Add vertical adjacent pixels
        vadd.u16        q0,q1
        @ Average (/4)
        vshrn.i16       d0,q0,#2
        @ Output 8x1
        vst1.u8         { d0 }, [r2,:64],r3

The 'trick' here is the vpaddl instruction, which adds adjacent items in the vector and produces a result twice as wide, preserving all bits of precision. And the other trick is the vshrn instruction - shift right with narrow - which performs a shift and a cast to a half-sized result in one go.
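
The same step can also be sketched with intrinsics from arm_neon.h, if assembly isn't your thing (the function name and arguments here are mine, assuming the same 16-pixels-in, 8-pixels-out shape):

#include <arm_neon.h>
#include <stdint.h>

/* 16x2 input pixels -> 8x1 averaged output pixels */
static void average_2x2_x8(const uint8_t *row0, const uint8_t *row1, uint8_t *dst) {
  uint8x16_t a = vld1q_u8(row0);         /* 16 pixels of row 0 */
  uint8x16_t b = vld1q_u8(row1);         /* 16 pixels of row 1 */
  uint16x8_t ha = vpaddlq_u8(a);         /* add horizontal pairs, widen to 16 bits */
  uint16x8_t hb = vpaddlq_u8(b);
  uint16x8_t sum = vaddq_u16(ha, hb);    /* add the two rows */
  uint8x8_t avg = vshrn_n_u16(sum, 2);   /* /4 and narrow back to 8 bits */

  vst1_u8(dst, avg);                     /* 8 output pixels */
}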

The only problem with this is that there just isn't enough work to do, so the processor will be stalled some of the time. It also vastly under-utilises the available resources. One obvious solution is just to go wider - take 32x2 pixels and produce 16x1 results. A better solution is to go wider and deeper.

Since the data for the level above is already present in registers, one can save the memory load next time and just do the processing now. This can be extended as far as one has registers or as far as one wants to write specific custom code, but for me I just added another level.

So this is the final routine, which takes a 32x32 pixel region and generates two outputs: 16x16 1/2 downsampled and 8x8 1/4 downsampled. Each loop iteration generates 16x2 1/2 scaled results and 8x1 1/4 scaled ones - at full precision. (Again, these are really 1/4 and 1/16 downsampled respectively, since X and Y are each scaled by 1/2 and 1/4, but you get the idea.)

        @
        @ Rescale 32x32 bytes by 1/2 and 1/4
        @

        @ r0 = src
        @ r1 = width (stride)
        @ r2 = dst 1/2
        @ r3 = width/2 (stride)
        @ stack0 r4 = dst 1/4
        @ stack1 r5 = width/4 (stride)

        .global resample_12_14_neon

resample_12_14_neon:
        stmdb sp!, {r4-r6, lr}

        ldr     r4,[sp,#4*4]
        ldr     r5,[sp,#4*4+4]
        
        mov     r6,#8   @ for 32 input rows
        
1:      @ first 2 samples
        vld1.u8         { d0, d1, d2, d3 },[r0],r1
        vld1.u8         { d4, d5, d6, d7 },[r0],r1

        @ sum across
        vpaddl.u8       q0,q0
        vpaddl.u8       q1,q1
        vpaddl.u8       q2,q2
        vpaddl.u8       q3,q3

        @ sum down into 1/4 accumulation registers
        vadd.u16        q14,q0,q2
        vadd.u16        q15,q1,q3

        @ output 1/2 scale
        vshrn.i16       d0,q14,#2
        vshrn.i16       d1,q15,#2
        vst1.u8         { d0, d1 }, [r2,:64],r3

        @ Repeat for row 1
        vld1.u8         { d0, d1, d2, d3 },[r0],r1
        vld1.u8         { d4, d5, d6, d7 },[r0],r1
        
        vpaddl.u8       q0,q0
        vpaddl.u8       q1,q1
        vpaddl.u8       q2,q2
        vpaddl.u8       q3,q3
        
        vadd.u16        q0,q2
        vadd.u16        q1,q3

        @ Accumulate 1/4 scale
        vadd.u16        q14,q0
        vadd.u16        q15,q1

        vshrn.i16       d0,q0,#2
        vshrn.i16       d1,q1,#2
        vst1.u8         { d0, d1 }, [r2,:64],r3
        
        @ Generate 1/4 (1/16) scale
        vpaddl.u16      q14,q14
        vpaddl.u16      q15,q15

        vshrn.i32       d28,q14,#4
        vshrn.i32       d29,q15,#4

        vmovn.i16       d28,q14

        subs            r6,#1

        vst1.u8         { d28 }, [r4,:64],r5

        bhi             1b
        
        ldmfd sp!, {r4-r6, pc}

Of note perhaps is that the 1/4 result is accumulated in the high registers (q14/q15) rather than in q4-q5. This is to honour the C ABI and avoid having to save any registers: only q4-q7 need to be preserved across a call.
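
For reference, the register and stack comments above imply a C-side interface something like this (a sketch only - the pointer names and the 512-wide example strides are mine):

/* Prototype implied by the register/stack comments in the listing above */
void resample_12_14_neon(unsigned char *src, int stride,
                         unsigned char *dst2, int stride2,
                         unsigned char *dst4, int stride4);

/* e.g. downscale one 32x32 tile of a 512-wide image into the next two levels */
void scale_tile(unsigned char *src, unsigned char *dst2, unsigned char *dst4) {
  resample_12_14_neon(src, 512, dst2, 256, dst4, 128);
}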

It may be that the 32x32 access pattern is not ideal, and it requires appropriate padding/alignment of the mipmap levels, but I had more interesting things to look at so I didn't spend a lot of time on it.

One added benefit of doing the two levels at once is improved precision of the lower level, since it doesn't use truncated intermediate results: each >>2 truncation can discard up to 3/4 of a grey level, and those errors compound when the levels are cascaded. For this reason it may be worth extending it to 3 levels, or 4 if registers allowed and the image stride restrictions weren't onerous. For VGA signals 3 levels of sub-sampling is adequate for what I need, so it probably isn't worth going further than that.

Actually, for this reason of precision, and because I am usually down-sampling the scaled result from the mipmap anyway, I decided the '2 row' version of the SIMD scaling routine would better suit my purposes. The fact I already had it working helped too ...

Classifier

So with all this I put together a timing routine for my classifier (I don't have the classifier itself set up or debugged yet). The timing as usual depends a great deal on the scales one searches, but when going from a scale of 0.207 to 0.333 in 6 steps (equating to an object width of 51 to 85 pixels for my 17x17 classifier), loading the mipmap, scaling to all the target scales and generating the LBPu2 codes takes about 5ms on the beagleboard-xm (the image is 512x512). Which is pretty ok by me - even if I could squeeze a bit more out of it, it's not going to be much.

The problem now is the classifier - when I turn that on it blows out to about 200ms (and most of that is at the smallest scale), or a pretty paltry 5fps (but of course, one must beware of timings on unverified code; they could be totally bogus). But I had a thought on that just before going to bed last night, and I'm off to see if I can't improve on it.
