First I just needed to calculate a single value so I did it by hand - I got some strange profiling results from Java vs jtransforms but it turned out that due to it's complexity it takes many many cycles before the JIT fully optimises it.
Then i started looking from a trivial Cooley-Tukey implementation that allocated all it's output arrays ... and i was surprised it was only about 1/2 the speed of jtransforms for it's simplicity. I tried a few variations and delved a bit into understanding the memory access patterns - although after an initial bit-reversed premute the memory patterns are trivial. It was still stuck at about 1/2 the speed of jtransforms.
I had a bit more of a think about the bit-reversing memory permute stage and figured it wouldn't work terribly well with cpu caches. First I just worked on blocking the output by 8 elements (=1 cache line) toward re-arranging the reads so they represent 8 sequential streams. Strangely enough this second case is a bit slower on small workloads that fit in the cache although is a bit faster on larger ones. It's not really much of a difference either way.
D:d |S:s | 0 D: d |S: s | 16 D: d |S: s | 8 D: d |S: s | 24 D: d |S: s | 4 D: d |S: s | 20 D: d |S: s | 12 D: d |S: s | 28 D: d |S: s | 2 D: d |S: s | 18 D: d |S: s | 10 D: d |S: s | 26 D: d |S: s | 6 D: d |S: s | 22 D: d |S: s | 14 D: d |S: s | 30 D: d |S: s | 1 D: d |S: s | 17 D: d |S: s | 9 D: d |S: s | 25 D: d |S: s | 5 D: d |S: s | 21 D: d |S: s | 13 D: d |S: s | 29 D: d |S: s | 3 D: d |S: s | 19 D: d |S: s | 11 D: d |S: s | 27 D: d |S: s | 7 D: d |S: s | 23 D: d |S: s | 15 D: d|S: s| 31This shows the source-destination read/write pattern for a bit-reversal permute. Obviously with all those sparse reads the cache coherency isn't going to be great. FWIW moving from s to d (sequential writes) proved to have better performance than from d to s (sequential reads).
D:dddddddd |S:0 4 2 6 1 5 3 7 | D: dddddddd |S: 0 4 2 6 1 5 3 7 | D: dddddddd |S: 0 4 2 6 1 5 3 7 | D: dddddddd|S: 0 4 2 6 1 5 3 7|This is the same, but blocked to 8 elements in the output. Unfortunately there is still a large spread of reads.
D:dddddddd |S:0 4 2 6 1 5 3 7 | D: dddddddd |S: 0 4 2 6 1 5 3 7 | D: dddddddd |S: 0 4 2 6 1 5 3 7 | D: dddddddd|S: 0 4 2 6 1 5 3 7|This is an attempt at improving the previous one - although the reads are still in a broad spread, they are always confined to 8 sequential streams. It does improve the performance once you get a large enough problem, but on an x86 machine N needs to be over about 8M before it shows a minor improvement. I've yet to try it on ARM.
Then I realised that Integer.reverse() wasn't mapping to a single instruction on x86 - and subsequently over half of the processing time was spent just calculating the source index ... so I put in a bit of bit-fiddling to the 8-blocked implementations which precalculated the 8 fixed element offsets relative to base-offset. I also tried a lookup table for the non-blocked implementation which made a big difference. Strangely using a lookup-table to get the base offset actually slowed it down - suggesting that the jvm is pre-calculating reverse(i) in the outer-loop if it can.
Once this permute is done the memory access pattern is just in adjacent and sequential blocks - either depth or breadth first. The 8-blocked permute is about 130%-150% of the speed of a straight System.arraycopy() for 4-1K elements.
Actually another reason I worked toward an 8-blocked implementation was in order to perform the first 8-way fft stage at the same time as the fiddling - with that in place I got to within shouting distance (under 200% execution time) of jtransforms for some small fft sizes. I was curious what I had to do to get better than that so I tried hard-coding the last stage of a size-16 fft and managed to beat it - not that this has any real practical purpose mind you.
This gives me a starting point to tackle the problem on the parallella which has it's own challenges.
PS yes I found a book or two and many papers about both fft implementation and the permute step but I had time to experiment on my own.