Friday, 13 December 2013

FFT jiggery pokery

I've been meaning to poke at FFT calculation for a while and I finally got around to having a proper look.

First I just needed to calculate a single value so I did it by hand - I got some strange profiling results from Java vs jtransforms but it turned out that due to it's complexity it takes many many cycles before the JIT fully optimises it.

Then i started looking from a trivial Cooley-Tukey implementation that allocated all it's output arrays ... and i was surprised it was only about 1/2 the speed of jtransforms for it's simplicity. I tried a few variations and delved a bit into understanding the memory access patterns - although after an initial bit-reversed premute the memory patterns are trivial. It was still stuck at about 1/2 the speed of jtransforms.

I had a bit more of a think about the bit-reversing memory permute stage and figured it wouldn't work terribly well with cpu caches. First I just worked on blocking the output by 8 elements (=1 cache line) toward re-arranging the reads so they represent 8 sequential streams. Strangely enough this second case is a bit slower on small workloads that fit in the cache although is a bit faster on larger ones. It's not really much of a difference either way.

 D:d                               |S:s                               |   0
 D: d                              |S:                s               |  16
 D:  d                             |S:        s                       |   8
 D:   d                            |S:                        s       |  24
 D:    d                           |S:    s                           |   4
 D:     d                          |S:                    s           |  20
 D:      d                         |S:            s                   |  12
 D:       d                        |S:                            s   |  28
 D:        d                       |S:  s                             |   2
 D:         d                      |S:                  s             |  18
 D:          d                     |S:          s                     |  10
 D:           d                    |S:                          s     |  26
 D:            d                   |S:      s                         |   6
 D:             d                  |S:                      s         |  22
 D:              d                 |S:              s                 |  14
 D:               d                |S:                              s |  30
 D:                d               |S: s                              |   1
 D:                 d              |S:                 s              |  17
 D:                  d             |S:         s                      |   9
 D:                   d            |S:                         s      |  25
 D:                    d           |S:     s                          |   5
 D:                     d          |S:                     s          |  21
 D:                      d         |S:             s                  |  13
 D:                       d        |S:                             s  |  29
 D:                        d       |S:   s                            |   3
 D:                         d      |S:                   s            |  19
 D:                          d     |S:           s                    |  11
 D:                           d    |S:                           s    |  27
 D:                            d   |S:       s                        |   7
 D:                             d  |S:                       s        |  23
 D:                              d |S:               s                |  15
 D:                               d|S:                               s|  31
This shows the source-destination read/write pattern for a bit-reversal permute. Obviously with all those sparse reads the cache coherency isn't going to be great. FWIW moving from s to d (sequential writes) proved to have better performance than from d to s (sequential reads).
 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|
This is the same, but blocked to 8 elements in the output. Unfortunately there is still a large spread of reads.
 D:dddddddd                        |S:0   4   2   6   1   5   3   7   |
 D:                dddddddd        |S: 0   4   2   6   1   5   3   7  |
 D:        dddddddd                |S:  0   4   2   6   1   5   3   7 |
 D:                        dddddddd|S:   0   4   2   6   1   5   3   7|
This is an attempt at improving the previous one - although the reads are still in a broad spread, they are always confined to 8 sequential streams. It does improve the performance once you get a large enough problem, but on an x86 machine N needs to be over about 8M before it shows a minor improvement. I've yet to try it on ARM.

Then I realised that Integer.reverse() wasn't mapping to a single instruction on x86 - and subsequently over half of the processing time was spent just calculating the source index ... so I put in a bit of bit-fiddling to the 8-blocked implementations which precalculated the 8 fixed element offsets relative to base-offset. I also tried a lookup table for the non-blocked implementation which made a big difference. Strangely using a lookup-table to get the base offset actually slowed it down - suggesting that the jvm is pre-calculating reverse(i) in the outer-loop if it can.

Once this permute is done the memory access pattern is just in adjacent and sequential blocks - either depth or breadth first. The 8-blocked permute is about 130%-150% of the speed of a straight System.arraycopy() for 4-1K elements.

Actually another reason I worked toward an 8-blocked implementation was in order to perform the first 8-way fft stage at the same time as the fiddling - with that in place I got to within shouting distance (under 200% execution time) of jtransforms for some small fft sizes. I was curious what I had to do to get better than that so I tried hard-coding the last stage of a size-16 fft and managed to beat it - not that this has any real practical purpose mind you.

This gives me a starting point to tackle the problem on the parallella which has it's own challenges.

PS yes I found a book or two and many papers about both fft implementation and the permute step but I had time to experiment on my own.

No comments: