After a short search I came across this nice post which pointed me in the right direction. My system is a bit out of date so his approach didn't work but it wasn't much effort to drop in binutils from a GNU mirror. I had to remove -Werror in a couple of makefiles for it to compile (why this is ever used in released software i don't know, too many things change in system libs for it to be portable).
I've only had a quick look but it's quite interesting. Pity about the horrid x86 ISA though.
It will do several iterations of a compilation before settling down - gradually changing (improving?) the code at each step.
Eventually it does all the things you'd expect: registerising locals, unrolling loops, using registers as pointers with fixed array offsets where possible. It will also move array bounds checks to outside of inner loops so that the result looks pretty much like compiled C and sometimes better as it can potentially inline anything and not just macros or stuff in includes.
In one test case it appeared to unroll a simple loop almost identically to the optimisation of a manually unrolled loop; but it ran quite a bit slower. Not sure on that one, might be register dependency stalls or perhaps I was looking at the wrong code-path as it was a fairly large function. I will have to try on simpler loops and mathematically they weren't strictly identical either.
Unfortunately it wont employ SIMD even when it's pretty close; I guess that's alignment related as much as anything due to intel's strict alignment requirements. I did notice recently that bytebuffers seem to be 16-byte aligned now though.
dot product
To start with something simple this is the loop i'll look closer at, it's the one I was looking at above.
double v=0; for (int i=0; i<end;i++) v += a[i] * b[i];
And the manually unrolled version. This is not identical due to the peculiarities of floating point despite being mathematically the same.
double v, v0=0, v1=0, v2=0, v3=0; int i=0; for (int bend=end & ~3; i<bend;i+=4) { v0 += a[i] * b[i]; v1 += a[i+1] * b[i+1]; v2 += a[i+2] * b[i+2]; v3 += a[i+3] * b[i+3]; } for (; i<end;i++) v0 += a[i] * b[i]; v = (v0+v1+v2+v3);
I will look at the compiler output in another post. If I get keen i might see if i can build an ARM version for the nicer ISA.
No comments:
Post a Comment