Sunday 6 July 2014

asm v c

For a bit of 'fun' i thought i'd see how translating some of the fft code to assembly would fare.

I started with this:

void
build_wtable(complex float *wtable, int logN, int logStride, int logStep, int w0) {
        int wcount = 1<<(logN - logStep - 2);
        int wshift = logStride;

        for (int i=0;i<wcount;i++) {
                int j0 = w0 + (i<<wshift);
                int f2 = j0 << (CEXPI_LOG2_PI - logN + logStep - logStride + 1);
                int f0 = f2 * 2;

                wtable[i*2+0] = ez_cexpii(f0);
                wtable[i*2+1] = ez_cexpii(f2);
        }
}

It generates the twiddle factors required for a given fft size relative to the size of the data. ez_cexpii(x) calculates e^(I * x * PI / (2^20)), as previously discussed.

This is called for each radix-4 pass - 5 times per 1024 elements, and generates 1, 4, 16, 64, or 256 complex value pairs per call. I'm timing all of these together below so this is the total time required for twiddle factor calculation for a 1024-element FFT in terms of the timer counter register.

The results:

                what  clock    ialu    fpu   dual  e1 st  ra st  loc fe  ex ld  size
                noop     23       4      0      0      9     16      8      6
        build_wtable  55109   36990   9548   6478      9   7858      1      6    372
     build_wtable_il  62057   25332   9548   4092      9  27977      8      6    618
      e_build_wtable  31252   21999   9548   8866      9   7518     13      6    464
   e_build_wtable_hw  29627   21337   9548   8866     24   7563     16     36    472

  1 * e_build_wtable    148      99     28     26      9     38      1      6

$ e-gcc --version
e-gcc (Epiphany toolchain (built 20130910)) 4.8.2 20130729 (prerelease)
(30 000 cycles is 50uS for a clock of 600Mhz)
noop
This just calls ez_timer0_start(timerid, ~0); ez_ctimer0_stop(); and then tallies up the counters showing a baseline for a single time operation.
build_wtable
This is the code as in the fft sample.
build_wtable_il
This is the same as above but the ez_cexpii() call has been inlined by the compiler.
e_build_wtable
The assembly version. This is after a few hours of poking at the code trying to optimise it further.
e_build_wtable_hw
The same, but utilising the hardware loop function of the epiphany.

Discussion

Well what can i say, compiler loses again.

I'm mostly surprised that inlining the cexpii() function doesn't help at all - it just hinders. I confirmed the code is being inlined from the assembler listing but it just seems to be more or less copying the code in-lined twice rather than trying to interleave the functions to provide additional scheduling opportunities. As can be seen the ialu ops are significantly reduced because the constant loads are being taken outside of the loop but that is offset by a decrease in dual-issue and a big jump in register stalls.

Whilst the compiler generated a branch in the cexpii function I managed to go branchless in assembly by using movCC a couple of times. I am loading all the constants from a memory table using ldr which reduces the ialu op count a good bit, although that is outside the inner loop.

The hardware loop feature knocks a little bit off the time but the loop is fairly large so it isn't really much. I had to tweak the code a little to ensure the code lined up correctly in memory otherwise the loop got slower. i.e. such that .balignw 8,0x01a2 did nothing although it is still present to ensure correct operation.

Try as I might I was unable to get every flop to dual issue with an in ialu op. Although I got 26 out of 28 to dual-issue which seems ok(?). I don't fully understand the interaction of the dual-issue with things like instruction order and alignment so it was mostly just trial and error. Sometimes moving an instruction which had no other dependency could change the timing by a bigger jump than it should have, or e.g. add one ialu instruction and now the total clock time is 3 clock cycles longer?

I'm not sure I understand the register stall column completely either. For example calling the function such that a single result pair is calculated results in the last row of the table above. 38 stalls, where? The noop case is easier to analyse in detail: where do those 16 stalls come from? AFAICT the instruction sequence executed is literally:

  32:   3feb 0ff2       mov r1,0xffff
  36:   3feb 1ff2       movt r1,0xffff
  3a:   10e2            mov r0,r4
  3c:   1d5f 0402       jalr r15

-> calls 00000908 <_ez_ctimer0_start>:

 908:   6112            movfs r3,config
 90a:   41eb 0ff2       mov r2,0xff0f
 90e:   5feb 1ff2       movt r2,0xffff
 912:   4d5a            and r2,r3,r2
 914:   4102            movts config,r2
 916:   390f 0402       movts ctimer0,r1
 91a:   0096            lsl r0,r0,0x4
 91c:   487a            orr r2,r2,r0
 91e:   4102            movts config,r2

-> clock starts counting NOW!

 920:   194f 0402       rts

-> returns to timing loop:

  40:   0112            movfs r0,config
  42:   02da            and r0,r0,r5
  44:   0102            movts config,r0

-> clock stops counting NOW!

  46:   391f 0402       movfs r1,ctimer0
  4a:   1113            add r0,r4,2
  4c:   270a            eor r1,r1,r6
  4e:   0056            lsl r0,r0,0x2

etc.
Attempting to analyse this:
4 ialu - integer alu operations
Yep, 4 instructions are executed.
6 ex ld - external load stalls
This is from reading the config register.
8 loc fe - local fetch stalls
From the rts switching control flow back to the caller?
16 ra st - register dependency stalls
Considering no registers are being directly accessed in this section it's a little weird. rts will access lr (link register) but that shouldn't stall. The instruction fetch will access the PC, not sure why this would stall.

Still puzzled. 23 clock cycles for { rts, movfs, and, movts }, apparently.

No comments: