Tuesday, 10 May 2016

radix-4 stuff

Well I did some more mucking about with the fft code. Here's some quick results for the radix-4 code.

First the runtime per-transform, in microseconds. I ran approximately 1s worth of a single transfer in a tight loop and took the last of 3 runs. All algorithms executed sequentially in the same run.

                16          64         256         1024         4096       16384        65536       262144
                                                                                                
jtransforms  0.156       1.146       6.058       27.832      135.844     511.098    3 037.140   14 802.328
dif4         0.160       0.980       5.632       27.503      138.077     681.006    3 759.005   20 044.719
dif4b        0.136       0.797       4.713       22.994      120.915     615.623    3 175.115   17 875.563
dif4b_col    0.143       0.797       4.454       21.835      117.659     593.314    3 020.144   22 341.453
dif4c        0.087       0.675       4.255       21.720      115.550     576.798    2 775.360   15 248.578
dif4bc       0.083       0.616       3.760       19.596      108.028     547.334    2 810.118   16 308.047
dif4bc_col   0.137       0.622       3.699       19.629      107.954     550.483    2 820.234   16 323.797

And the same information presented as a percentage of jtransforms' execution time with my best implementation highlighted.

              16          64          256         1024        4096        16384       65536       262144
                                                                                                
jtransforms   100.0       100.0       100.0       100.0       100.0       100.0       100.0       100.0
dif4          102.4       85.5         93.0        98.8       101.6       133.2       123.8       135.4
dif4b          86.7       69.6         77.8        82.6        89.0       120.5       104.5       120.8
dif4b_col      91.3       69.6         73.5        78.5        86.6       116.1        99.4       150.9
dif4c          55.9       58.9         70.2        78.0        85.1       112.9        91.4       103.0
dif4bc         53.3       53.8         62.1        70.4        79.5       107.1        92.5       110.2
dif4bc_col     87.7       54.3         61.1        70.5        79.5       107.7        92.9       110.3

Executed with the default java options.

$ java -version
java version "1.8.0_92"
Java(TM) SE Runtime Environment (build 1.8.0_92-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.92-b14, mixed mode)

CPU is kaveri clocked at minimum (1.7Ghz) with only a single DIMM, it is quite slow as you can see.

A summary of the algorithms follow.

jtransforms
jtransforms 2.2, complexForward() routine.
dif4
A radix-4 implemented as two nested radix-2 operations. 42 flops.
dif4b
A radix-4 implemented directly with common subexpressions explicitly encoded. 36 flops.
dif4b_col
The first approximately half passes are the same as dif4b, but the second half swap the order of the loop in order to exploit locality of reference, a longer inner loop, and re-using of twiddle factors.
difc
The first N-2 passes attempt to re-use twiddle factors across pairs of results situated at (i,N-i). The N-1 pass is hand-coded for the 4 cases which only require 3 constants for the twiddle factors.
dif4bc
The first N-2 passes are the same as dif4b. The N-1 pass is hand coded (==dif4c).
dif4bc_col
A combination of dif4bc and dif4b_col.

The final (trivial) pass is hand-coded in all cases. dif4 requireds N/2 complex twiddle factors and the rest require N/2+N/4 due to the factorisation used.

In all cases only a forward complex transform which leaves the results unordered (bit-reversed indexed) is implemented.

Thoughts

  • All the implementations do full scans at each pass and so start to slow down above 2^12 elements due to cache thrashing (together with the twiddle table). Depth-first ordering should help in most cases.

  • Despite requiring fewer twiddle lookups in dif4c the extra complexity of the code leads to register spills and slower execution. That it takes a lead at very large numbers is also consistent with this and the point above.

  • Twiddle factor lookups are costly in general but become relatively costlier at larger sizes. This needs to be addressed separately to the general cache problem.

  • Whilst the re-ordering for the "col" variants was a cheap and very simple trick to gain some performance, it can't keep up with the hand-coded variants. Further tuning may help.

  • For addressing calculations sometimes it's better to fully spell out the calculation each time and let the compiler optimise it, but sometimes it isn't.

  • It is almost always a good idea to keep loops as simple as possible. I think this is one reason such a trivial / direct implementation of the FFT beats out a more sophisticated one.

  • There's a pretty clear winner here!

In most cases the calculations are expanded in full inside inner loops, apart from the W^0_N case. In all other cases the compiler generated slower execution if it was modularised, sometimes significantly slow (10%+). I suspect that even the radix-4 kernel is starting to exhaust available registers and scheduling slots together with the java memory model. The code also includes numerous possibly questionable and sometimes a little baffling micro-optimisations. I haven't validated the results yet apart from a test case which "looks wrong" when things go awry but i have reasonable confidence it is functioning correctly.

After I found a decent reference I did start on a radix-8 kernel but i realised why there is so little about writing software to do it; it just doesn't fit on the cpu in a high level language. Even the radix-4 kernel seems to be a very tight fit for java on this cpu.

Still playing ...

No comments: