Monday, 9 December 2013

fpu mode, compiler options

Poked a bit more at the 2d scaler on the parallella yesterday. I started just working out all the edge cases for the X scaler, but then I ended up delving into compiler options and assembler optimisations.

Because some of the floating point unit's behaviour is defined by the CONFIG register, the compiler needs to twiddle bits quite a bit, and by default it seems to do it more often than you'd expect. And because the compiler supports writing interrupt handlers in C, each of these bit twiddles also has to occur within an interrupt-disable block. Fun.

To cut a long story short I found that fiddling with the compiler flags makes a pretty big difference to performance.

The flags which seemed to produce the best result here were:

  -std=gnu99 -O2 -ffast-math -mfp-mode=truncate -funroll-loops

Actually the option that had the biggest effect was -mfp-mode=truncate, as that removes many of the (redundant) mode switches.

What I didn't expect though is that the CONFIG register bits also seem to have a big effect on the hand-written assembly code. By adding this to the preamble of the linear interpolator function I got a significant performance boost. Without it the routine takes about 5.5Mcycles per core, but with it it's about 4.8Mcycles!?

        mov     r17,#0xfff0
        movt    r17,#0xfff1
        mov     r16,#1
        movfs   r12,CONFIG
        and     r12,r12,r17     ; set fpumode = float, turn off exceptions
        orr     r12,r12,r16     ; truncate rounding
        movts   CONFIG,r12      

It didn't make any difference to the results whether I did this or not.

Not sure what's going on here.
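
For reference, the mask-and-set done by the preamble can be written out in C. This is only a sketch of the bit arithmetic: `force_config` and the two constant names are hypothetical, and the meaning of the cleared bits is taken from the comments in the assembly above rather than checked against the architecture manual. It works on a plain integer, not on the real CONFIG register.

```c
#include <stdint.h>

/* Same constants as the assembly preamble (names are made up here). */
#define CONFIG_CLEAR_MASK 0xfff1fff0u  /* clears bits 0-3 and bits 17-19 */
#define CONFIG_TRUNCATE   0x00000001u  /* bit 0: truncate rounding */

/* Mirrors: and r12,r12,r17 ; orr r12,r12,r16 */
static uint32_t force_config(uint32_t config) {
        return (config & CONFIG_CLEAR_MASK) | CONFIG_TRUNCATE;
}
```

All other bits of the register pass through untouched, so whatever else is configured there is preserved.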

I have a very simple routine that resamples a single line of float data using linear interpolation. I was trying to determine whether such a simple routine would compile ok or whether I would need to resort to assembly language for decent performance. At first it looked like assembly was needed, until I used the compiler flags above (although later I noticed I'd left in an option to disable inlining of functions, which I was using to investigate the compiler output, and which may have contributed).

The sampler I'm using is just (see: here for a nice overview):

static inline float sample_linear(float * __restrict__ src, float sxf) {
        int sx = (int)sxf;
        float r = sxf - sx;
        float y1 = src[sx];
        float y2 = src[sx+1];

        return (y1*(1-r)+y2*r);
}
Called from:
static void scale_linex(float * __restrict__ src, float sxf, float * __restrict__ dst, int dlen, float factor) {
        int x;

        for (x=0;x<dlen;x++) {
                dst[x] = sample_linear(src, sxf);

                sxf += factor;
        }
}
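
One of the edge cases with the sampler above is that `src[sx+1]` reads one element past the end of the row when `sx` lands on the last sample. A minimal sketch of a clamped variant follows; `sample_linear_clamped` and its `slen` parameter are hypothetical names, not part of the scaler code, and clamping is just one possible edge policy.

```c
/* Linear interpolation that never reads outside src[0..slen-1].
 * Positions at or beyond either end return the nearest end sample. */
static inline float sample_linear_clamped(const float *src, int slen, float sxf) {
        if (sxf <= 0.0f)
                return src[0];

        int sx = (int)sxf;

        if (sx >= slen - 1)
                return src[slen - 1];

        float r = sxf - sx;

        return src[sx] * (1 - r) + src[sx + 1] * r;
}
```

The extra compares cost cycles in the inner loop, so an alternative would be to pad the source row by one element and keep the unclamped sampler.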

A straight asm implementation is reasonably simple but there are a lot of dependency-stalls.

        mov     r19,#0  ; 1.0f
        movt    r19,#0x3f80

        ;; linear interpolation
        fix     r16,r1          ; sx = (int)sxf

        lsl     r18,r16,#2
        float   r17,r16         ; (float)sx
        add     r18,r18,r0
        fsub    r17,r1,r17      ; r = sxf - sx

        ldr     r21,[r18,#1]    ; y2 = src[sx+1]
        ldr     r20,[r18,#0]    ; y1 = src[sx]

        fsub    r22,r19,r17     ; 1-r
        fmul    r21,r21,r17     ; y2 = y2 * r
        fmadd   r21,r20,r22     ; res = y2 * r + y1 * (1-r)

(I actually implement the whole resample-row routine, not just the sampler).

This simple loop is much faster than the default -O2 optimisation, but slower than the C version with better optimisation flags. I can beat the C compiler with an implementation which processes 4 output pixels per loop - thus allowing for better scheduling with a reduction in stalls, and dword writes to the next core in the pipeline. But the gain is only modest for the amount of effort required.
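
To illustrate the idea of processing 4 output pixels per loop, here is a rough C sketch. It is not the assembly implementation: `scale_linex4` and `lerp_at` are hypothetical names, and it only shows how four independent samples can be set up so that their conversions, loads and multiplies have no dependencies on each other and can be interleaved. The paired dword writes are not modelled.

```c
/* Plain one-sample linear interpolation, used for the leftover pixels. */
static inline float lerp_at(const float *src, float sxf) {
        int sx = (int)sxf;
        float r = sxf - sx;

        return src[sx] * (1 - r) + src[sx + 1] * r;
}

/* Resample one row, producing 4 outputs per iteration of the main loop. */
static void scale_linex4(const float * __restrict__ src, float sxf,
                         float * __restrict__ dst, int dlen, float factor) {
        int x = 0;

        for (; x + 4 <= dlen; x += 4) {
                /* Four sample positions, accumulated as in the scalar loop. */
                float s0 = sxf;
                float s1 = s0 + factor;
                float s2 = s1 + factor;
                float s3 = s2 + factor;

                /* Independent integer and fractional parts. */
                int i0 = (int)s0, i1 = (int)s1, i2 = (int)s2, i3 = (int)s3;
                float r0 = s0 - i0, r1 = s1 - i1, r2 = s2 - i2, r3 = s3 - i3;

                dst[x + 0] = src[i0] * (1 - r0) + src[i0 + 1] * r0;
                dst[x + 1] = src[i1] * (1 - r1) + src[i1 + 1] * r1;
                dst[x + 2] = src[i2] * (1 - r2) + src[i2 + 1] * r2;
                dst[x + 3] = src[i3] * (1 - r3) + src[i3 + 1] * r3;

                sxf = s3 + factor;
        }

        /* Remainder when dlen is not a multiple of 4. */
        for (; x < dlen; x++) {
                dst[x] = lerp_at(src, sxf);
                sxf += factor;
        }
}
```

Whether the compiler schedules this any better than the plain loop with -funroll-loops is something that would have to be measured.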

Overview of the results:

    Routine         Mcycles per core

    C -O2                10.3
    C flags as above      4.2

    asm 1x                5.3
    asm 1x force CONFIG   4.7
    asm 4x                3.9
    asm 4x force CONFIG   3.8

I'm timing the total instruction cycles on the core, which includes the synchronisation work. The image is 512x512, scaled by 1.7x in X and 1.0x in Y.

On a semi-related note, I was playing with the VJ detector code and noticed the performance scalability isn't quite so good on PAL-res images, because in practice the image being searched is very small, i.e. the parallelism isn't so hot. I hit this problem with the OpenCL version too, and the solution is probably the same as the one I used there: go wider. Basically, generate all probe scales at once and then process them all at once.
