Thursday 18 March 2010

SSE, gcc, AMD

I'm a bit ahead of schedule on my work and am awaiting some feedback, so I thought I'd look at what SSE2 could do to speed up small parts of it. I might not need to use it, but it'd be good to have some sort of feel for what it can do - e.g. on Cell it can make an order of magnitude of difference. The results were interesting and somewhat surprising.

As a quick aside, while searching for why an SSE intrinsic kept complaining about assignment to the wrong data-type (it turned out to be just a typo, and gcc didn't complain about the undefined `function') I came across this interesting comparison of SSE optimisations on common compilers. I was quite surprised that gcc does so much common subexpression elimination on vector quantities - I had thought intrinsics were pretty much in-line assembly with register aliases. Anyway, good to see it is well ahead of the pack in all departments.

First, I tried implementing the following function:
X = abs(Z)*exp(Y*i)
Z, X complex, Y real
The C code is all pretty straightforward:
Xrow[x] = cabsf(Zrow[x]) * cexpf(Yrow[x]*I);
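
In full, the plain C version is nothing more than a pair of loops along these lines (a sketch only - the function wrapper, names and dimensions are my assumptions, not the actual code):

#include <complex.h>

#define WIDTH  64
#define HEIGHT 64

/* Sketch of the scalar version: X = abs(Z) * exp(Y*i), per element.
   The name calc_c and the fixed dimensions are illustrative assumptions. */
static void calc_c(complex float * restrict X,
                   const complex float * restrict Z,
                   const float * restrict Y)
{
    for (int y = 0; y < HEIGHT; y++) {
        complex float *Xrow = &X[y * WIDTH];
        const complex float *Zrow = &Z[y * WIDTH];
        const float *Yrow = &Y[y * WIDTH];

        for (int x = 0; x < WIDTH; x++)
            Xrow[x] = cabsf(Zrow[x]) * cexpf(Yrow[x] * I);
    }
}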

The SSE code was a bit messier, but for such a simple function still quite manageable. I found a nice library of SSE-optimised transcendental functions, which I needed in order to implement a vector cexpf(), and the Cephes maths library was an excellent reference for the internal workings of functions like cexp - e.g. so I could remove redundant work.
v4sf d0, d1;
v4sf r, i;
v4sf absz;
v4sf rr, ri;
v4sf ci, si;

/* load 4 complex values (8 floats) from Z */
d0 = Zrow[x];
d1 = Zrow[x+1];

/* de-interleave into real (even) and imaginary (odd) parts */
r = _mm_shuffle_ps(d0, d1, 0x88);
i = _mm_shuffle_ps(d0, d1, 0xDD);

/* |Z| */
absz = _mm_sqrt_ps(r*r + i*i);

/* 4 angles from Y, sine and cosine in one call */
i = Yrow[x/2];
sincos_ps(i, &si, &ci);

/* |Z| * (cos(Y) + i*sin(Y)) */
rr = absz*ci;
ri = absz*si;

/* re-interleave and store 4 complex results */
d0 = _mm_unpacklo_ps(rr, ri);
d1 = _mm_unpackhi_ps(rr, ri);

Xrow[x] = d0;
Xrow[x+1] = d1;

I'm sure it's not the prettiest code, and it may not even work (I'm not certain my shuffles are correct), but assuming it is otherwise right, it'll do. I also tried a plain C double version, and the above SSE version with the loop unrolled once.
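Incidentally, the element-wise r*r + i*i above isn't an intrinsic at all - it relies on gcc's vector extensions, which allow the usual arithmetic operators on vector types like v4sf. A tiny self-contained example of just that feature (the names here are mine):

#include <stdio.h>

/* gcc vector extension: 4 packed floats; ordinary arithmetic operators
   work element-wise on this type (this is how v4sf is typically defined). */
typedef float v4sf __attribute__ ((vector_size (16)));

int main(void)
{
    v4sf a = { 1.0f, 2.0f, 3.0f, 4.0f };
    v4sf b = { 4.0f, 3.0f, 2.0f, 1.0f };
    union { v4sf v; float f[4]; } c;

    c.v = a*a + b*b;            /* element-wise multiply and add */

    printf("%g %g %g %g\n", c.f[0], c.f[1], c.f[2], c.f[3]);
    return 0;
}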

I have a number of machines to try it on: a Pentium-M Core Duo (or whatever it's called), an Athlon 64 X2, and a Phenom II 945. I'm using gcc 4.4.3, running the loop 10 000 times over a 64x64 array with constant loop counts. I'm using the aggressive optimisation flags suggested by AMD, including -march=amdfam10 and -funroll-all-loops, and also C99 features like restrict.
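The timing itself is nothing fancy - roughly this sort of harness (a sketch; the use of gettimeofday and the stub kernel() are my assumptions, not what the real code necessarily does):

#include <stdio.h>
#include <sys/time.h>

/* Stand-in for the kernel being timed; the real code runs the abs/exp
   loop over the 64x64 arrays here. */
static void kernel(void)
{
    /* ... work under test ... */
}

int main(void)
{
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < 10000; i++)     /* 10 000 runs, as described above */
        kernel();
    gettimeofday(&t1, NULL);

    printf("%.3f s\n", (t1.tv_sec - t0.tv_sec)
           + (t1.tv_usec - t0.tv_usec) * 1e-6);
    return 0;
}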

Anyway, I'll keep this brief. I'm not so much interested in comparing the various CPUs as in the relative differences within each of them.
Code           A64-X2   Phenom II   Pentium-M
C single       1.16     0.84        1.17
SSE single     0.69     0.31        0.37
SSE unrolled   0.66     0.31        0.37
C double       1.36     0.90        1.25

Well, it's certainly an improvement, but nothing like on Cell. The Athlon 64 X2 really suffers from its 64-bit internal SSE data-path, so the SSE code is not even 2x faster, even though it's calculating 4 values at once for most operations. The Intel shows the biggest improvement from SSE at just over 3x, and the Phenom is about the same. gcc has improved a lot since I tried similar things on the SPU, so perhaps the non-SSE code is 'just that good', but I can't help imagining there is also more silicon dedicated to making non-SSE code run faster. Oh, and it is interesting that doubles are pretty much on par with singles, at least in this case. I was hoping that moving to singles would make a bigger difference - I'm too lazy at this point to try a double SSE3 version (I'd have to write a sincos_pd, for example).

I tried this function first because I thought a little complexity (yet not too time-consuming to write) would give the compiler more scope for optimisation, and it is complex enough that it can't be memory constrained. I couldn't get gcc to inline sincos_ps, but I doubt that would have made much difference.

Then I tried something much simpler: just adding two 2D arrays. I didn't use SSE intrinsics for this, just the gcc vector types, and looped over every element of every row - using a stride calculation rather than treating the array as completely linear. Hmm, perhaps I should have used non-constant dimensions to see what the compiler does with more general-purpose code. The per-element operation is just:
a[x] += b[x];
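
Spelled out, the two versions look roughly like this (a sketch - the function names, dimensions and stride handling are my assumptions; the real code uses its own layout and constant loop counts as mentioned above):

/* gcc vector type: 4 floats, arithmetic operators work element-wise */
typedef float v4sf __attribute__ ((vector_size (16)));

#define WIDTH  64
#define HEIGHT 64

/* plain C version - gcc auto-vectorises this loop */
static void add_c(float * restrict a, const float * restrict b)
{
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            a[y * WIDTH + x] += b[y * WIDTH + x];
}

/* vector-type version - each element is 4 floats, so a row is WIDTH/4 vectors */
static void add_vec(v4sf * restrict a, const v4sf * restrict b)
{
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH / 4; x++)
            a[y * (WIDTH / 4) + x] += b[y * (WIDTH / 4) + x];
}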

gcc unrolled the whole row loop, so things are about as simple as they get. gcc actually vectorised the version using plain C types as well (if it couldn't vectorise this loop it would be a worry), and it generated two versions: one vectorised for when the addresses are suitably aligned (which they are), and one using the FPU otherwise. Looking at x86 assembly gives me a headache, so I didn't look too closely though. Since this loop is so simple I had to run it 1 000 000 times to get a reasonable timing.

Code              A64-X2   Phenom II   Pentium-M
C single          8.5      1.1         2.1
C vector single   6.0      0.98        1.2

Ok, so it's tuned for the Phenom, but wtf is going on with the Athlon 64? Memory-speed constrained? I presume they are all constrained by the L2 cache, or I might even be getting cache thrashing or something (this is the sort of thing you could guarantee on Cell ... a huge pity IBM aren't progressing it for HPC). Don't ask me why the Pentium-M results are so different either - afaict from the source they're both using SSE2 code, since the memory should be aligned properly. *shrug*.
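
For what it's worth, one way to make that alignment a guarantee rather than an accident is to declare the storage explicitly aligned - a sketch only, not necessarily how the test arrays are actually allocated:

/* 16-byte aligned static storage, so SSE loads/stores can use the
   aligned forms. Purely illustrative - the benchmark may allocate
   its arrays differently. */
static float a[64 * 64] __attribute__ ((aligned (16)));
static float b[64 * 64] __attribute__ ((aligned (16)));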

So after all that I'm not sure I'm much wiser. The question being: is it worth the effort to manually vectorise the code, or even to use SSE intrinsics, on an x86 CPU, at least for the algorithms I'm using? Hmm ... probably ... even a 2x speed-up on a minor operation is better than nothing, though it'll only be an incremental improvement overall - but at least in every case it was an improvement. These are all 32-bit operating systems too; I wonder if I should try a 64-bit one. Well, I'll see how much time I have to play with it.

I should really try GPGPU and/or OpenCL too, simply to get a good 'feel' for what to expect.
