Sunday, 29 June 2014

cexpi, power of 2 pi

I did a bit more playing with the sincos stuff and ended up rolling my own using a taylor series.

I was having issues with using single floating point to calculate the sincos values, the small rounding errors add up over a 1M element FFT. I tried creating a fixed-point implementation using 8.24 fixed point but that was slightly worse than a float version. It was quite close in accuracy though, but as it requires a 16x32-bit multiply with a 48-bit result, it isn't of any use on epiphany. For the integer algorithm I scaled the input by 0.5/PI which allows the taylor series coefficients to be represented in 8.24 fixed-point without much 'bit wastage'.

My final implementation uses an integer argument rather than a floating point argument which makes it easier to calculate the quadrant the value falls into and also reduces rounding error across the full input range. It calculates the value slightly differently and mirrors the angle in integer space before converting to float to ensure no extra error is added - this and the forming of the input value using an integer is where the additional accuracy comes from.

I tried a bunch of different ways to handle the swapping and ordering of the code (which surprisingly affects the compiler output) but the following was the best case with the epiphany compiler.

/*
 Calculates cexp(I * x * M_PI / (2^20))
   = cos(y) + I * sin(y)  |y = (x * M_PI / (2^20))

  where x is a positive integer.

  The constant terms are encoded into the taylor series coefficients.
*/
complex float
ez_cexpii(int ai) {
    // PI = 1
    const float As = 3.1415926535897931159979634685441851615906e+00f;
    const float Bs = -5.1677127800499702559022807690780609846115e+00f;
    const float Cs = 2.5501640398773455231662410369608551263809e+00f;
    const float Ds = -5.9926452932079210533800051052821800112724e-01f;
    const float Es = 8.2145886611128232646095170821354258805513e-02f;
    const float Fs = -7.3704309457143504444309733969475928461179e-03f;

    const float Bc = -4.9348022005446789961524700629524886608124e+00f;
    const float Cc = 4.0587121264167684842050221050158143043518e+00f;
    const float Dc = -1.3352627688545894990568285720655694603920e+00f;
    const float Ec = 2.3533063035889320580018591044790809974074e-01f;
    const float Fc = -2.5806891390014061182789362192124826833606e-02f;

    int j = ai >> (20-2);
    int x = ai & ((1<<(20-2))-1);

    // odd quadrant, angle = PI/4 - angle
    if (j & 1)
        x = (1<<(20-2)) - x;

    float a = zfloat(x) * (1.0f / (1<<20));
    float b = a * a;

    float ssin = a * ( As + b * ( Bs + b * (Cs + b * (Ds + b * (Es + b * Fs)))));
    float scos = 1.0f + b * (Bc + b * (Cc + b * (Dc + b * (Ec + b * Fc))));

    int swap = (j + 1) & 2;

    if (swap)
        return bnegate(ssin, j+2, 2) + I * bnegate(scos, j, 2);
    else
        return bnegate(scos, j+2, 2) + I * bnegate(ssin, j, 2);
}

This takes about 90 cycles per call and provides slightly better accuracy than libc's sinf and cosf (compared on amd64) across the full 2PI range. The function is 242 bytes long. If the taylor series only used 5 terms is about as accurate as libc's sinf and cosf and executes 6 cycles faster and fits in only 218 bytes.

Apart from accuracy the other reason for having a power of 2 being PI is that this what is required for the FFT anyway so it simplifies the angle calculation for that.

Whether this is worth it depends on how long it would take to load the values from a 1M entry table

As per the previous post a radix-1024 fft pass (using radix-4 fft stages) requires 341 x 2 = 682 sincos pairs for one calculation. For the first pass this can be re-used for each batch but for the second values must be calculated (or loaded) each time. This is approximately 61K cycles.

The calculation of the radix-1024 pass using radix-4 stages itself needs 1024 / 4 x log4(1024) x 8 complex multiply+add operations, or at least 40K cycles (complex multiply+add using 4xfma) plus some overheads.

Going on the memory profiling i've done so far, it's probably going to be faster calculating the coefficients on-core rather than loading them from memory and saves all that memory too.

No comments: