## Saturday, 12 July 2014

### Optimising

I spent a good chunk of yesterday trying to optimise the cexpi implementation to get of as many register access penalties as possible and dual-issue as many instructions as possible. There still seems to be a bit of magic there but i'm getting there.
```                what  clock   ialu    fpu   dual  e1 st  ra st  loc fe  ex ld

e_build_wtable  29637  21347   9548   8866     24   7563      21     36
e_build_wtable2  20149  17656   8184   6138     24    402      28     36
```

build_wtable_2 is based on the 5+4 term sincos from the previous post so requires fewer flops to start with but the main point of interest is the greatly reduced register stalls.

Primarily I just wasn't putting enough code between the flop instructions, but that can be easier said than done.

I used a technique I've talked about previously to be able to add more ialu instructions within the flops. Basically the loop is unrolled "in-place", whereby some results required for the next loop are calculated in the current one and otherwise the loop conditions remain the same.

A fairly detailed summation of the steps follows. This calculates two tables of sin/cos pairs together. The polynomial approximation is only valid over (0..pi/4) and so to calculate sin(pi/4+i) it calcualtes cos(pi/4-i) and so on.

```build_wtable(complex float *wtable, int logN, int logStride, int logStep, int w0)

calculate loop count
loop:
all integer ops:
// 1: calculate frequency of this coefficient
f1 = w0 << shift
f0 = f1 * 2

// 2: calculate octant
j0 = f0 >> shift2
j1 = f1 >> shift2

// 3: calculate fractional part of frequency and mirror if required
x0 = f0 & mask;
x0 = j0 & 1 ? (scale - x0) ? x0;
x1 = f1 & mask;
x1 = j1 & 1 ? (scale - x1) ? x1;

// 4: calculate sign of result
negcos0 = ((j0+2) >> 2) << 31;
negsin0 = ((j0) >> 2) << 31;
negcos1 = ((j1+2) >> 2) << 31;
negsin1 = ((j1) >> 2) << 31;

// 5: determine if cos/sine needs swapping
swap0 = (j0+1) & 2;
swap1 = (j1+1) & 2;

All float ops:
// 6: scale input
a0 = to_float(x0) * 1.0 / scale;
a1 = to_float(x1) * 1.0 / scale;

// 7: calculate sin/cos over pi/4
b0 = a0 * a0;
b1 = a1 * a1;

... calculate sin'(a0/a1) and cos'(a0/a1) using fmadd+fmul

// 8: swap if necessary
cos0 = swap0 ? sin'0 : cos'0;
sin0 = swap0 ? cos'0 : sin'0;
cos1 = swap1 ? sin'0 : cos'0;
sin1 = swap1 ? cos'0 : sin'0;

// 9: fix sign
All Integer ops:
// twiddles bit-31 of ieee float
as_int(cos0) ^= negcos0;
as_int(sin0) ^= negsin0;
as_int(cos1) ^= negcos1;
as_int(sin1) ^= negsin1;

// a: output
*wtable++ = cos0 + I * sin0;
*wtable++ = cos1 + I * sin1;

// b: next step
w0 += wstep;

endloop
```
I wont show the detail of the calculation itself but I tried using Estrin's Algorithm for the polynomial expansion to try to reduce the latency involved. Actually it ended up a bit interesting because it shifted where the constants needed to be loaded for the 3-register fmadd (since it modifies it's first result, the constants need loading there) - and meant there were fewer ways to dual-issue instructions, more below. I might end up trying the other way too - but this stuff just sucks so much time it's crazy.

So basically there are a bunch of integer ops followed by float ops followed by a few int ops to finish off. Now comes the tricky bit where 'optimisation' comes into play. Firstly, float operations all have a latency which means that if you just use them in the order shown you wont get good performance.

Even just listing it the way I have is a simple optimisation.

For example:

``` a0 = to_float(x0) * 1.0 / scale;
b0 = a0 * a0;
a1 = to_float(x1) * 1.1 / scale;
b1 = a1 * a1;
```
Would take approximately 2x longer than:
``` a0 = to_float(x0) * 1.0 / scale;
a1 = to_float(x1) * 1.1 / scale;
b0 = a0 * a0;
b1 = a1 * a1;
```
If one looks at the basic instructions annotated with the delay slots it's pretty obvious which is better (i might be a off with the delay slots)
```inline                 interleaved

float r8,r0           float r8,r0
-                     float r10,r1
-                     -
-                     -
-                     -
fmul  r8,r8,r32       fmul  r8,r8,r32
-                     fmul  r10,r10,r32
-                     -
-                     -
-                     -
fmul  r9,r8,r8        fmul  r9,r8,r8
float r10,r1          fmul  r11,r10,r10
-
-
-
-
fmul  r10,r10,r32
-
-
-
-
fmul  r11,r10,r10
```
But one can only do this optimisation if you have enough registers to fully calculate each separately (this is an important point, and why the cpu has so many registers).

The other factor is that the epiphany cpu can dual-issue certain sets of instructions in certain circumstances. Fortunately the circumstances are fairly simple (for the most part, some details still elude me) and it's basically a float + ialu or load op. The precise behaviour seems to depend on instruction order/alignment but i'm not really sure on all the details yet.

So this means for example that any of the integer operations can be interleaved with the float operations and effectively become 'zero cost' - assuming there is some 'fixed cost' based on the number of float ops and their scheduling requirements. Cool but it complicates things and dual-issuing instructions means you don't fill any of those latency slots so need to find more instructions!

Examining the algorithm there are some hard dependencies which force a particular calculation order but a bunch of the integer calculations aren't needed till the end.

Steps 1-3
These are required before any float ops can be performed. These cannot be dual-issued obviously.
Steps 4-5
These can be performed anywhere and they are not needed until the end. These are prime candidates for dual issue.
Steps 6-7
These are hard orderings that add float latencies which cannot be avoided. They also require some register moves to setup the 3-register fmadd instructions which can make use of some of those slots and dual-issue.
Steps 8-a
These are pretty much hard-dependencies on the output. Fortunately there are no ra penalties for integer operations but it does mean there is an integer/fpu op mismatch; as odd as it sounds it may actually help to convert these operations into floating point ops.
Step b
Can be put almost anywhere.

Actually ... I lied. All of those steps can be interleaved with others ... by calculating the value you will need for the next loop (or next next) instead. This requires a ton of registers - I used every 'free' register up and had to save a few to the stack besides. I think the epiphany ABI could use some tuning for size/performance but I don't have enough data to back up any changes; and it isn't simple because any change has side effects.

The algorithm then becomes changed to:

```  setup first loop
loop:
calculate result interleaved with setup next loop
```

The tricky bit comes in tracking the state of all the registers across the algorithm and ensuring the right value is calculated and in the correct register when it is needed. The simplest way to do this is to basically use a new register for every calculation that might be needed later and then bits of code can basically be re-arranged 'at will' without becoming too much of a complicated mess for a head to manage. And hope you don't run out of registers.

Still it's a little hard on the head and eyes so I'm thinking of writing a tool which can help with some of the work. I started with a tool that parsed the assembly and dumped a table of register liveliness; but then i realised to go any further with that i'd basically need to write a whole assembler so I started work on an instruction decoder that can pass over the compiled binary which should be easier. Epiphany's instruction set is fairly tiny so at least that shouldn't be too much work. Well i'll see on that one, ... it's still a lot of work.

### Estrin's Method and 3-register FMA

So whilst doing a pass over this post I noticed the anomaly with the greatly reduced ialu count of the new implementation. I did notice when I wrote the initial implementation it seemed rather small compared to the previous one - I put it down to the reduced term count but on reflection is is more than that.

So the basic polynomial expansion using Horner's Rule for sin+cos() in terms of 3-register fmadd instructions becomes:

``` ; sin
; a (A + a^2 (B + a^2 (C + a^2 D)

; r0 = a
; r1 = a^2

mov   r2,rC      ; t0 = C
1 fmadd r2,r1,rD   ; t0 += a^2 D
mov   r3,rB      ; t1 = B
2 fmadd r3,r2,r1   ; t1 += a^2 (C + a^2 D)
mov   r2,rA      ; t0 = A
3 fmadd r2,r3,r1   ; t0 += a^2 (B + a^2 (C + a^2 D))
4 fmul  r2,r2,r0   ; t0 *= a

; cos
; A + a^2 (B + a^2 (C + a^2 (D + a^2 E)

; r0 = a
; r1 = a^2

mov   r2,rD      ; t0 = D
1 fmadd r2,r1,rE   ; t0 += a^2 D
mov   r3,rC      ; t1 = C
2 fmadd r3,r2,r1   ; t1 += a^2 (D + a^2 E)
mov   r2,rB      ; t0 = B
3 fmadd r2,r3,r1   ; t0 += a^2 (C + a^2 (D + a^2 E))
mov   r3,rB      ; t1 = A
4 fmadd r3,r2,r1   ; t1 += a^2 (B + (C + a^2 (D + a^2 E)))
```

Note that the cos 'A' is not the same as the sin 'A'. The digit infront of the fpu ops is the 'stage' of the calculation in terms of register dependencies.

The same using Estrin's Aglorithm:

``` ; sin
; a ((A + a^2 B) + a^4 (C + a^2 D))

; r0 = a
; r1 = a^2

1 fmul  r2,r1,r1   ; a^4 = a^2 . a^2
mov   r3,rA      ; t0  = A
mov   r4,rC      ; t1  = C
1 fmadd r3,r1,rB   ; t0  = A + a^2 B
1 fmadd r4,r1,rD   ; t1  = C + a^2 D
2 fmadd r3,r2,r4   ; sin'= (A + a^2 B) + a^4 (C + a^2 D)
3 fmul  r3,r3,r0   ; sin = a sin'

; cos
; A + a^2 ((B + a^2 C) + a^4 (D + a^2 E))

; r0 = a
; r1 = a^2

1 fmul  r2,r1,r1   ; a^4 = a^2 . a^2
mov   r3,rB      ; t0  = B
mov   r4,rD      ; t1  = D
1 fmadd r3,r1,rB   ; t0  = B + a^2 C
1 fmadd r4,r1,rD   ; t1  = D + a^2 E
2 fmadd r3,r2,r4   ; sin'= (A + a^2 B) + a^4 (C + a^2 D)
mov   r4,rA      ; t1  = A
3 fmadd r4,r3,r1   ; sin = A + a^2 ((A + a^2 B) + a^4 (C + a^2 D))
```
So 15 instructions vs 15 instructions, but 7 constant loads vs 5. Well that's interesting. With dual issue with this snippet the cost is hidden but if a routine has more ialu ops than flops then the latter leave more opportunities for scheduling.

Even if the instruction set had a 4-register fma instruction the second might be better due to needing one less stage of calculation.

### ABI?

I wish the abi was tad bit more asm-hacker friendly. A compiler doesn't care where the spare registers are but keep track of the sparse ranges is a pain. I think it would benefit from having all 8 'zero page' registers available as scratch anyway.

I wouldn't mind something closer to this:

```  r0-r3   argument / saved by caller
r4-r7   gp / saved by caller
r8-r23  saved by callee / hardware (lr only) / constants
r24-r63 gp / saved by caller
```
Rather than this:
```  r0-r3   argument / saved by caller
r8-r11  saved by callee
r12     scratch / saved by caller
r13-r15 saved by callee / hardware
r16-r27 gp /saved by caller
r28-r31 constants
r32-r43 saved by callee
r44-r63 gp / saved by caller
```
I don't really see the need for any of the arm-legacy register assignments such as r12 == ip. This would leave 48 registers free for leaf functions (rather than the current 37) to use without having to resort to saving registers and importantly leave 0-7 which can have a huge impact on code-size. And it would still leave enough registers for outer loops and so on. The abi is designed to work with a cut-down cpu which i think has 16 registers: but the proposed would work there as well.

But yeah, who knows, any change isn't cost-free and if your code is calling lots of small functions rather than calling big processing leaf functions then it might not work so well (but yeah, it wont anyway). I previously tried patching gcc to modify the abi but I didn't fully complete it.