One interesting solution along the way was code that took two 2-channel float sequences (i.e. two complex-number arrays) and wound them back into 4-channel bytes, including scaling and clamping.
I utilised the fixed-point variant of the VCVT instruction, which performs the scaling to 8 bits and the clamping below 0 in one step. For the upper clamp I used VQMOVN, the saturating variant of move-with-narrow.
I haven't run it through the cycle counter (or looked the details up), so it could probably do with some jiggling - or widening to 32 bytes/iteration - but the current main loop is below.
	vld1.32		{ d0[], d1[] }, [sp]
	vld1.32		{ d16-d19 },[r0]!
	vld1.32		{ d20-d23 },[r1]!
1:	vmul.f32	q12,q8,q0		@ scale
	vmul.f32	q13,q9,q0
	vmul.f32	q14,q10,q0
	vmul.f32	q15,q11,q0
	vld1.32		{ d16-d19 },[r0]!	@ pre-load next iteration
	vld1.32		{ d20-d23 },[r1]!
	vcvt.u32.f32	q12,q12,#8		@ to int + clamp lower in one step
	vcvt.u32.f32	q13,q13,#8
	vcvt.u32.f32	q14,q14,#8
	vcvt.u32.f32	q15,q15,#8
	vqmovn.u32	d24,q12			@ to short, clamp upper
	vqmovn.u32	d25,q13
	vqmovn.u32	d26,q14
	vqmovn.u32	d27,q15
	vqmovn.u16	d24,q12			@ to byte, clamp upper
	vqmovn.u16	d25,q13
	vst2.16		{ d24,d25 },[r3]!
	subs	r12,#1
	bhi	1b
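Roughly the same chain spelled out with NEON intrinsics, for reference - this is only a sketch, not the actual routine; pack8, its arguments, and the layout assumptions are mine, using the standard arm_neon.h names:

#include <arm_neon.h>
#include <stdint.h>

/* Convert 8 complex values per channel into 8 ABCD byte groups.
   ab points at the AB-channel floats (A in the real slots, B in the
   imaginary slots), cd likewise for C/D. */
static void pack8(const float *ab, const float *cd,
		  float32x4_t scale, uint8_t *dst)
{
	/* scale, then fixed-point convert: negatives clamp to 0 */
	uint32x4_t a0 = vcvtq_n_u32_f32(vmulq_f32(vld1q_f32(ab + 0), scale), 8);
	uint32x4_t a1 = vcvtq_n_u32_f32(vmulq_f32(vld1q_f32(ab + 4), scale), 8);
	uint32x4_t c0 = vcvtq_n_u32_f32(vmulq_f32(vld1q_f32(cd + 0), scale), 8);
	uint32x4_t c1 = vcvtq_n_u32_f32(vmulq_f32(vld1q_f32(cd + 4), scale), 8);

	/* narrow 32->16 then 16->8, saturating at the top */
	uint8x8_t a8 = vqmovn_u16(vcombine_u16(vqmovn_u32(a0), vqmovn_u32(a1)));
	uint8x8_t c8 = vqmovn_u16(vcombine_u16(vqmovn_u32(c0), vqmovn_u32(c1)));

	/* interleave as 16-bit units: AB CD AB CD ... = ABCD ABCD bytes */
	uint16x4x2_t out = { { vreinterpret_u16_u8(a8), vreinterpret_u16_u8(c8) } };
	vst2_u16((uint16_t *)dst, out);
}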
The loading of all elements of q0 from the stack was the first time I've done this:
vld1.32 { d0[], d1[] }, [sp]
Last time I did this I think I did a load to a single floating-point register or an ARM register and then moved it across, and I thought that was unnecessarily clumsy. It isn't terribly obvious from the manual how the various versions of VLD1 differentiate themselves unless you look closely at the register lists. d0[],d1[] loads a single 32-bit value to every lane of the two registers, or all lanes of q0.
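The intrinsics spelling of the same dup-load, for what it's worth (assuming arm_neon.h; scale_value is just an illustrative variable):

#include <arm_neon.h>

/* load one float and broadcast it to all four lanes of a q register,
   equivalent to vld1.32 { d0[], d1[] }, [sp] */
float32x4_t scale = vld1q_dup_f32(&scale_value);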
The VST2 line:
vst2.16	{ d24,d25 },[r3]!

This performs a neat trick of shuffling the 8-bit values back into the correct order - although it relies on the machine operating in little-endian mode.
The data flow is something like this:
input bytes:       ABCD ABCD ABCD
float AB channel:  AAAA BBBB AAAA BBBB
float CD channel:  CCCC DDDD CCCC DDDD
output bytes:      ABCD ABCD ABCD
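A scalar C sketch of why the little-endian part matters (names here are illustrative only): after the narrowing, each d register holds the bytes as packed (A,B) or (C,D) pairs, and storing them as alternating 16-bit units puts the low byte - A or C - first:

#include <stdint.h>
#include <string.h>

/* ab16[i] packs the byte pair (A,B) into one uint16_t, A in the low
   byte; cd16[i] packs (C,D). Storing them alternately reproduces
   ABCD ABCD... because little-endian writes the low byte first. */
void interleave16(const uint16_t *ab16, const uint16_t *cd16,
		  uint8_t *out, int n)
{
	for (int i = 0; i < n; i++) {
		memcpy(out + i*4 + 0, &ab16[i], 2);	/* A then B */
		memcpy(out + i*4 + 2, &cd16[i], 2);	/* C then D */
	}
}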
As performing a forward then inverse FFT ends up scaling the result by the number of elements (i.e. by width*height), the output stage requires scaling by 1/(width*height) anyway. This routine requires a further scaling by 1/255 so that the fixed-point 8-bit conversion works, and that is performed 'for free' using the same multiplies.
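So the single constant loaded into q0 ends up being something like this (taking the 1/255 figure above at face value; width and height are as in the C loop further down):

/* one multiply covers the FFT normalisation and the range reduction
   needed for the #8 fixed-point conversion */
float scale_value = 1.0f / ((float)(width * height) * 255.0f);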
This is the kind of stuff that is much faster in NEON than C, and compilers are a long way from doing it automatically.
The loop in C would be something like:
#include <complex.h>
#include <stdint.h>

float clampf(float v, float l, float u) {
	return v < l ? l : (v < u ? v : u);
}

complex float *a;
complex float *b;
uint8_t *d;
float scale = 1.0f / (width * height);

for (int i = 0; i < width; i++) {
	complex float A = a[i] * scale;
	complex float B = b[i] * scale;

	float are = clampf(crealf(A), 0, 255);
	float aim = clampf(cimagf(A), 0, 255);
	float bre = clampf(crealf(B), 0, 255);
	float bim = clampf(cimagf(B), 0, 255);

	d[i*4+0] = (uint8_t)are;
	d[i*4+1] = (uint8_t)aim;
	d[i*4+2] = (uint8_t)bre;
	d[i*4+3] = (uint8_t)bim;
}
And it's interesting to me that the NEON isn't much bulkier than the C - despite performing 4x the amount of work per loop.
I set up a github account today - which was a bit of a pain as it doesn't work properly with my main browser machine - but I haven't put anything there yet. I want to bed down the basic data flow and user-interaction first.
2 comments:
Nice post!
Just wondering why you don't use the PLD instruction, which would greatly increase the overall performance.
Isn't that instruction implemented on the OMAP on the Beagleboard? I thought PLD was mandatory from Cortex on - at least for NEON.
To be honest I don't know exactly how to use it, e.g. how far ahead one should pre-fetch.
I've tried it a few times but it never made any difference with the code in question so I said 'fuck it' and never bothered trying again.