Monday, 8 February 2010

vectors and bits

Updated, see the end of the post

Yesterday I started poking around with the SIMD unit. Wow, is that a way to eat up time or what.

Wasn't quite sure what to do with it, so played at first with writing an RGB888 to RGB565 converter. Didn't get to testing it, but it brought back memories of the SPU hacking I did before - the instruction set has a lot of similarities, although NEON is filled out more. And like with the SPU, there's so many ways to do the same thing it can be a bit overwhelming trying to find a good way of solving a problem. Particularly if you don't really know which instructions are there, or what they do. There seems to be some interesting ones though, like vrsi which lets you insert the upper-bits of each element into the lower-bits of each element in another register (without clobbering it's contents). I still seem to be wedded to the vtbl instruction as I was with the shuffleb instruction on SPU, although I think it's not always the best route. I really missed the spu_timing tool though - although the issue rules and latencies are simpler.

That idea didn't seem to be going anywhere in particular, so I thought i'd look at some specific stuff I need, and for which I have very slow implementations - font rendering and rect fill, although I only got around to looking at rect fill, and that still doesn't work 100%. I just did it using ARM code though. For such an old architecture i'm was a little surprised at the lack of info available for such tasks - at least as it applies to searching using google. Maybe it's too old, and the new stuff is hidden away in proprietary and embedded systems, and nobody does software rendering anymore.

And then I totally lost track of the time reading about the DSP ... at 4am I thought it was time to `call it a night' - that's what I get for having coffee and chips for dinner (and in short; there's no free tools to use it, and the Linux driver uses binary blobs - of course).

Today I filled out the rect fill code a bit and tried various implementations, including some NEON variants. Oh, I also `discovered' the performance counting unit - wow, you can track a lot of stuff, from branches taken to cache and memory stats to stalls. Very nice.

Oh NEON. Fucking hell. Spent about 4 hours tracking down why the NEON instructions just threw an undefined instruction exception. After a couple of hours of digging I came across a reference to the Coprocessor Access Control Register, but that didn't really help (oh and a thread on the beagleboard group where people just say to turn CONFIG_NEON to y ... sigh). So here I was trying to turn on clocks and power and other PRCM registers ... and then I remembered something about a bit in a status register to enable/disable the whole shebang. A bit more tracking down (i've got about nearly 10K pages of documentation to search now) and I discovered the FPEXC register and VMSR/VMRS instructions (my memory was wrong, but it was a lucky guess). Although the binutils i'm using doesn't support them ... sigh. Finally found a workaround using MRC/MCR from Linux - about the only thing i've managed to find in there when tracking things down (a lot of stuff is so abstracted it it's very hard to follow). Gee that was frustrating.

Anyway, so I came up with some total cycle counts for various implementations of a 'rectangular block colour fill for RGB565'.

These are all with *NO CACHE* or write buffers, so they don't really mean anything other than relative to each other. You have to turn the MMU on to turn on data caches and write buffers, perhaps that is the next thing to try.

Code                   Total    Slowest Fastest
C short 36308222 1.00 5.25
C long 18307488 0.50 2.64
ARM asm 15877960 0.43 2.29 - uses 4x strd (writes 8 bytes/instruction)
NEON 9735680 0.26 1.40 - uses 2x writes of 2xD regs, 64 bit aligned
NEON2 9134690 0.25 1.32 - uses 1x write of 4xD regs, 64 bit aligned
NEON3 9311284 0.25 1.34 - uses 2x writes of 4xD regs, 128 bit aligned
NEON4 9191652 0.25 1.33 - uses vstm of 8xD regs
sDMA 6910682 0.19 1.00

The NEON implementations use ARM code for the non-aligned 'edges', and none of them are particularly fantastic code.

Hrm, I thought the ARM asm one was ok when I was running it by itself, i guess twice as fast as something is quite noticeable, but obviously it's kind of slow.

Looking in more detail at a couple of them:

drawRect() C long
total cycles=18307488
dwrite intns=169668
ext writes =169671
iexec =701230
istall =1201453

drawRect() ARM asm
total cycles=15877960
dwrite intns=168963
ext writes =168965
iexec =310508
istall =182922

The C version executes 2x as many instructions but the execution time isn't much different - everything is waiting on memory (although I wonder if it uses less power). At first I thought the total cycle count was a mistake, but of course, it's taking about 100x longer than the number of instructions executing, so memory accesses must be around 100 cycles -- which sounds about right. Be interesting to see if any cache/write buffers make a noticeable difference here, although it is just a flood of writes.

Update: Should've tested more, the long version was still just a 'short' version, it just wrote half the width ... so all bogus. Will revisit in a newer post.

The code in question is all in puppy bits:

No comments: