I poked about trying to turn each pass into a single loop - this could allow it, for example, to be implemented using Epiphany hardware loops. But with a radix-2 kernel there just aren't enough flops to hide the overhead of the in-loop indexing calculations, which are more complicated than those of simple nested loops.
For convolution it doesn't matter if the intermediate data is bit-reversed, so you can get away without reordering it - but then you need to use a decimation-in-frequency (DIF) algorithm for the forward transform and a decimation-in-time (DIT) algorithm for the inverse. So I gave that a go as well, but the DIF algorithm was slower than the in-order algorithm, at least for the data and kernel sizes I was playing with. I think I have too much loop arithmetic.
So after poking about with that stuff for a bit I tried a radix-4 kernel. I did get something working, but I'm not convinced I've done it the right way, because I'm looking up the twiddle factors for each calculation as if I were doing two stages of a radix-2 locally.
The compiler refuses to use double-word loads or stores for the complex float accesses though, which is a bit of a bummer. So I tried coding up the first pass of the in-order routine - it does a bit-reversing permute and a radix-4 step together - in assembly language for comparison. Using the hardware loops and some pre-calculation, the inner loop only requires 3 instructions for the address calculations, which is pretty tidy.
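For reference, the reordering half of that pass - the bit-reversing permute on its own, without the fused radix-4 step - looks like this in portable C (my sketch; the assembly folds this into the butterfly's address calculations rather than doing a separate pass):

```c
/* Reverse the low 'bits' bits of index i. */
static unsigned bitrev(unsigned i, unsigned bits)
{
	unsigned r = 0;

	for (unsigned b = 0; b < bits; b++) {
		r = (r << 1) | (i & 1);
		i >>= 1;
	}
	return r;
}

/* In-place bit-reversal permute of split real/imaginary arrays of
 * length n = 1 << bits.  Swapping only when i < r visits each pair
 * exactly once. */
static void bitrev_permute(float *re, float *im, unsigned n, unsigned bits)
{
	for (unsigned i = 0; i < n; i++) {
		unsigned r = bitrev(i, bits);

		if (i < r) {
			float t;

			t = re[i]; re[i] = re[r]; re[r] = t;
			t = im[i]; im[i] = im[r]; im[r] = t;
		}
	}
}
```

Doing this as loads into the first radix-4 pass rather than a separate sweep is what lets the pre-calculated addressing pay off.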
Probably not going to break any speed records but the code isn't too big.
I also looked at the inner loop - just the calculation part anyway, as the indexing arithmetic is pretty messy. It looks like you need a radix-4 kernel to remove all the stalls in the code; otherwise there isn't enough work to hide them (one could just unroll the loop once too, but since that doubling is part of the calculation anyway, you may as well make it a radix-4 step). If I ever finish it I'll try putting it in a hardware loop as well.
To confirm some of the details about the timing on the hardware, I created a small program which ran a small routine under each of the timing modes of CTIMER0. This records things like stalls and so on. It's quite interesting.
So, for example, a worst case for a dependency stall on an FPU op might be:
stalls:
	fmadd	r0,r0,r0
	fmadd	r0,r0,r0
	rts

This code adds 4 stalls to the running time. "Not much", perhaps - but with dual issue, the following code executes in the same time:
stalls:
	fmadd	r0,r0,r0
	add	r16,r16,r16
	fmadd	r1,r1,r1
	add	r16,r16,r16
	fmadd	r2,r2,r2
	add	r16,r16,r16
	fmadd	r3,r3,r3
	add	r16,r16,r16
	fmadd	r12,r12,r12
	add	r16,r16,r16
	fmadd	r0,r0,r0
	rts
This dual issue should mean I can hide all of the addressing mathematics within the execution time of the floating-point operations.
This scheduling stuff is handled by the compiler when you don't write assembly - and it typically does a reasonable job of it.