Tuesday, 23 October 2012

NEON complex multiply

In the last post I mentioned writing a complex multiply for NEON.

It's actually a good demonstration of a NEON feature, data manipulation on loads, and since it's quite trivial I'll post it here.

Complex Multiply

As one might recall, a complex multiply:

C = A * B

is implemented as the expansion:

C = A * B
  = (A.re + A.im j) * (B.re + B.im j)
  = (A.re * B.re - A.im * B.im) + (A.re * B.im + A.im * B.re) j

Where of course j*j = -1.
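
As a point of reference, the expansion above can be written as a plain scalar loop in C. This is only a sketch for checking results against: the function name cmul_interleaved is made up for illustration, and it assumes the usual layout of interleaved (real, imag) float pairs.

```c
#include <stddef.h>

/* Hypothetical reference routine (the name is not from the original post).
 * Multiplies n complex numbers stored as interleaved (real, imag) float
 * pairs, so c, a and b each point at 2*n floats. */
void cmul_interleaved(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];

        c[2 * i]     = ar * br - ai * bi;   /* real part      */
        c[2 * i + 1] = ar * bi + ai * br;   /* imaginary part */
    }
}
```

For example, (1 + 2j) * (3 + 4j) comes out as -5 + 10j, which is handy as a quick sanity check for any SIMD version.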

If the real and imaginary parts are stored in separate planes, this translates trivially to a set of SIMD instructions, but normally they are stored as (real, imag) pairs.


Here is where VLD2 comes to the aid of the weary programmer. It will automatically unpack 2-element fields into separate registers and simply allow you to write the code as if the data were stored as planes to start with.

It wasn't quite clear from the documentation how it handled more than 4x2 elements, but a quick experiment showed that it does what you'd expect, which allows the use of quad-word ops.


$00000000: a.real a.imag b.real b.imag
$00000010: c.real c.imag d.real d.imag

 LDR     r0,=0
 VLD2.32 { d0-d3 }, [r0]

Registers (as float2)

  d0  a.real b.real
  d1  c.real d.real
  d2  a.imag b.imag
  d3  c.imag d.imag

Registers (as float4)

  q0  a.real b.real c.real d.real
  q1  a.imag b.imag c.imag d.imag
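
The effect of the load can be mimicked in portable C, which may help when checking the register layout on paper. This is only an emulation of the de-interleave, not how the hardware does it, and the name deinterleave2 is made up for illustration.

```c
#include <stddef.h>

/* Hypothetical helper emulating the 2-element de-interleaving load:
 * src holds n (real, imag) pairs; even-indexed floats land in re[],
 * odd-indexed floats land in im[]. With n = 4 this matches the
 * q0 / q1 layout shown above. */
void deinterleave2(float *re, float *im, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        re[i] = src[2 * i];
        im[i] = src[2 * i + 1];
    }
}
```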


By unrolling the loop 4x in SIMD and 2x in instructions one can perform 8 complex multiplies per loop:

    @ r0 is address of C
    @ r1 is address of A
    @ r2 is address of B
    @ q8, q10 = A[0-7].real
    @ q9, q11 = A[0-7].imag
    @ q12, q14 = B[0-7].real
    @ q13, q15 = B[0-7].imag

    vld2.32  { d16-d19 },[r1]!
    vld2.32  { d24-d27 },[r2]!
    vld2.32  { d20-d23 },[r1]!
    vld2.32  { d28-d31 },[r2]!

    vmul.f32 q0,q8,q12    @ a.r * b.r [ 0-3 ]
    vmul.f32 q1,q9,q12    @ a.i * b.r
    vmul.f32 q2,q10,q14   @ a.r * b.r [ 4-7 ]
    vmul.f32 q3,q11,q14   @ a.i * b.r

    vmls.f32 q0,q9,q13    @ - a.i * b.i [ 0-3 ]
    vmla.f32 q1,q8,q13    @ + a.r * b.i
    vmls.f32 q2,q11,q15   @ - a.i * b.i [ 4-7 ]
    vmla.f32 q3,q10,q15   @ + a.r * b.i

    vst2.32  { d0-d3 },[r0]!
    vst2.32  { d4-d7 },[r0]!

    mov      pc,lr

q4-q7 are the callee-saved registers, so I simply avoid having to save them by using the others.

There is a stall of a few cycles for the stores at the end, but in a loop one can load the next 8 complex values before the store to avoid it.
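
For comparison, the same sequence of operations can be sketched in C with NEON intrinsics. This is not the code from this post: the function name cmul_f32 is made up, it handles 4 complex values per iteration rather than 8, and it leaves scheduling to the compiler. The NEON path is only compiled when the compiler defines __ARM_NEON (or __ARM_NEON__); otherwise the scalar loop does all the work.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical routine: c[i] = a[i] * b[i] for n complex floats stored
 * as interleaved (real, imag) pairs. */
void cmul_f32(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        /* De-interleaving loads: val[0] = real parts, val[1] = imag parts */
        float32x4x2_t va = vld2q_f32(a + 2 * i);
        float32x4x2_t vb = vld2q_f32(b + 2 * i);
        float32x4x2_t vc;

        vc.val[0] = vmulq_f32(va.val[0], vb.val[0]);             /*   a.r * b.r */
        vc.val[1] = vmulq_f32(va.val[1], vb.val[0]);             /*   a.i * b.r */
        vc.val[0] = vmlsq_f32(vc.val[0], va.val[1], vb.val[1]);  /* - a.i * b.i */
        vc.val[1] = vmlaq_f32(vc.val[1], va.val[0], vb.val[1]);  /* + a.r * b.i */

        /* Interleaving store writes (real, imag) pairs back out */
        vst2q_f32(c + 2 * i, vc);
    }
#endif

    /* Scalar tail, and the whole job when NEON is not available */
    for (; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        c[2 * i]     = ar * br - ai * bi;
        c[2 * i + 1] = ar * bi + ai * br;
    }
}
```

Whether a compiler turns this into something as tight as the hand-written version depends on the compiler and flags, so it is worth checking the generated assembly.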


I started pulling some of my experiments together into a prototype today and started to hit some annoying issues: pretty much anything to do with large arrays of floats in C is 3-4x slower than doing it in NEON.

I can feel a lot of NEON coming on ...
