Wednesday, 10 February 2010

Vectors and Bits again

Well, I fixed the `C long' version of the rect-fill from the update mentioned a couple of posts ago ... and a bit more besides.

After sleeping in a bit I worked on some MMU code so I can start using the CPU cache. Most of that was just gaining a deeper understanding of the permission and memory type bits, which are a little confusing in places. It looks like the format has been extended a couple of times whilst keeping compatibility, so there are multiple combinations that appear to do the same thing but with different nomenclature. Hmm, I have it more or less worked out ... I think. So once I got the MMU code working, it allowed me to enable caches and play a bit with various options. I used only section and super-section pages - 1MB or 16MB - so I'm probably only using a couple of TLB entries to run everything (= no page table walks).

I was assuming the caches were on when I enabled the MMU ... oh but they weren't, of course ... stupid me. Wow does that make a difference ... Wow.
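For anyone hitting the same thing: on ARMv7-A the MMU enable and the cache enables are separate bits in the system control register (SCTLR), so turning on the MMU alone leaves the caches off. The following is only a minimal sketch of that idea, not the code used for these timings; the helper names are made up for illustration, and real start-up code also needs the caches and TLB invalidated before the bits are set.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-A SCTLR enable bits. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data/unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Hypothetical helper: the SCTLR value with MMU and both caches enabled. */
static uint32_t sctlr_with_mmu_and_caches(uint32_t sctlr)
{
    return sctlr | SCTLR_M | SCTLR_C | SCTLR_I;
}

/* Hypothetical helper: non-zero only if both cache enable bits are set. */
static int sctlr_caches_enabled(uint32_t sctlr)
{
    return (sctlr & (SCTLR_C | SCTLR_I)) == (SCTLR_C | SCTLR_I);
}

#if defined(__arm__) && defined(__GNUC__) && defined(__ARM_ARCH) && __ARM_ARCH >= 7
/* Privileged-mode accessors; only meaningful on bare metal or in a kernel. */
static inline uint32_t sctlr_read(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}

static inline void sctlr_write(uint32_t v)
{
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0\n\tisb" : : "r"(v) : "memory");
}
#endif
```

On the target this would be used roughly as sctlr_write(sctlr_with_mmu_and_caches(sctlr_read())), once the page tables are in place.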

Ok, pause to run a few more timings. ... Here goes.

Code     Total     Slowest  Fastest
C short  36097442  0.89     5.22
C long   40526536  1.00     5.86
ARM asm  15801430  0.38     2.28
NEON     9654736   0.23     1.39
NEON2    9982542   0.24     1.44
NEON3    9421366   0.23     1.36
NEON4    9467262   0.23     1.37
sDMA     6904794   0.17     1.00

(see 2 posts ago, or render-rect.c for what they mean)
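Going by the numbers, the Slowest and Fastest columns appear to be each Total divided by the largest and the smallest Total in the same table, truncated (not rounded) to two decimals. A small sketch of that arithmetic, with made-up function names and integer hundredths to avoid floating-point surprises:

```c
#include <assert.h>
#include <stdint.h>

/* Largest total in a table: the reference for the Slowest column. */
static uint64_t slowest_total(const uint64_t *totals, int count)
{
    uint64_t max = totals[0];
    for (int i = 1; i < count; i++)
        if (totals[i] > max)
            max = totals[i];
    return max;
}

/* Smallest total in a table: the reference for the Fastest column. */
static uint64_t fastest_total(const uint64_t *totals, int count)
{
    uint64_t min = totals[0];
    for (int i = 1; i < count; i++)
        if (totals[i] < min)
            min = totals[i];
    return min;
}

/* total / reference in hundredths, truncated: 89 means 0.89. */
static uint32_t ratio_hundredths(uint64_t total, uint64_t reference)
{
    assert(reference != 0);
    return (uint32_t)((total * 100u) / reference);
}
```

For the first table, `C short' against the slowest entry gives 89 (0.89) and against the fastest gives 522 (5.22), matching the printed columns.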

This is the original scenario from a previous post, but with a 'fixed' C long version. Strangely, it runs slower than the short version. A cursory look at the assembly suggests it's doing the right thing - but it's not worth looking deeper. My guess is that the extra logic required for the unaligned edges is throwing it out, or the pointer aliasing is making the compiler angry. Oddly, the performance monitor is registering the same number of data writes too.
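To show what "extra logic for the unaligned edges" means in practice, here is a rough sketch of a long-word rectangle fill. It is not the code from render-rect.c: it assumes 16-bit pixels and a stride given in pixels, and the function name is invented. The memcpy is one way to keep the 32-bit store within C's aliasing rules; compilers typically reduce it to a single store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill of a w x h rectangle at (x, y), two pixels per store. */
static void rect_fill_long(uint16_t *fb, size_t stride,
                           size_t x, size_t y, size_t w, size_t h,
                           uint16_t colour)
{
    const uint32_t pair = ((uint32_t)colour << 16) | colour;

    assert(x + w <= stride);

    for (size_t row = 0; row < h; row++) {
        uint16_t *p = fb + (y + row) * stride + x;
        size_t left = w;

        /* Leading edge: a single 16-bit store if the row does not start
         * on a 32-bit boundary. */
        if (left > 0 && ((uintptr_t)p & 3u) != 0) {
            *p++ = colour;
            left--;
        }

        /* Body: 32-bit stores, two pixels at a time. */
        while (left >= 2) {
            memcpy(p, &pair, sizeof pair);
            p += 2;
            left -= 2;
        }

        /* Trailing edge: one pixel left over on odd widths. */
        if (left > 0)
            *p = colour;
    }
}
```

The per-row edge tests are the sort of overhead a plain 16-bit loop avoids, which may be part of why the long version does not win when uncached memory bandwidth is the limit.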

Anyway, who cares. Let's turn the MMU on, set the memory regions up properly and see what happens. Even with the caches off things happen, although not much.

With MMU on, graphics = wt                    With MMU on, graphics = wb

Code     Total     Slowest  Fastest           Code     Total     Slowest  Fastest
C short  36058684  0.89     7.33              C short  36233408  0.89     5.23
C long   40496404  1.00     8.23              C long   40584664  1.00     5.86
ARM asm  9367578   0.23     1.90              ARM asm  15811204  0.38     2.28
NEON     5332580   0.13     1.08              NEON     9653676   0.23     1.39
NEON2    4917308   0.12     1.00              NEON2    10057086  0.24     1.45
NEON3    5598968   0.13     1.13              NEON3    9555816   0.23     1.38
NEON4    5685246   0.14     1.15              NEON4    9431842   0.23     1.36
sDMA     6908602   0.17     1.40              sDMA     6917612   0.17     1.00

We're starting to beat the system DMA - I presume that even with the cache off this enables some sort of write-combining/write-buffering. It's interesting that the NEON2 code speeds up the most (nearly 2x) - probably because it has the smallest loop, so the CPU isn't contending for memory bandwidth as much. You'd never use a write-back cache for video memory, but I timed it anyway. I really have no idea how or why using it is making any difference whatsoever though, since the global cache bits are all off!

Ok, so ... der, let's turn on the caches properly.

The way I set the MMU up is to have the first bank of memory - where all code and data resides - as write-back write-allocate (writes also read a cache-line), and the second - where the frame-buffer resides - as write-through no-write-allocate. For the `graphics = wb' case, I also set write-back write-allocate on the second bank of memory (in a separate run). All the IO devices are using shared-device mode.
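As a rough illustration of that setup, the sketch below builds ARMv7-A short-descriptor 1MB section entries for the three memory types mentioned, with TEX remap disabled. It is not the code used here: the names are invented, the addresses in the usage note are placeholders, and it picks domain 0 with full read/write access for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Memory types used in the post (TEX remap assumed off). */
enum mem_type {
    MEM_SHARED_DEVICE,  /* TEX=000 C=0 B=1 */
    MEM_WT_NO_WA,       /* TEX=000 C=1 B=0: write-through, no write-allocate */
    MEM_WB_WA           /* TEX=001 C=1 B=1: write-back, write-allocate */
};

#define DESC_SECTION  (2u << 0)    /* bits[1:0] = 0b10: section entry */
#define DESC_B        (1u << 2)
#define DESC_C        (1u << 3)
#define DESC_AP_RW    (3u << 10)   /* AP[1:0] = 0b11, AP[2] = 0: full access */
#define DESC_TEX(n)   ((uint32_t)(n) << 12)

/* TEX/C/B bits for one of the memory types above. */
static uint32_t mem_type_bits(enum mem_type type)
{
    switch (type) {
    case MEM_SHARED_DEVICE: return DESC_TEX(0) | DESC_B;
    case MEM_WT_NO_WA:      return DESC_TEX(0) | DESC_C;
    case MEM_WB_WA:         return DESC_TEX(1) | DESC_C | DESC_B;
    }
    return 0; /* strongly-ordered if given an unknown type */
}

/* First-level descriptor mapping one 1MB section at phys (domain 0). */
static uint32_t section_desc(uint32_t phys, enum mem_type type)
{
    assert((phys & 0x000FFFFFu) == 0);  /* must be 1MB aligned */
    return phys | mem_type_bits(type) | DESC_AP_RW | DESC_SECTION;
}

/* Identity-map count sections starting at phys into a 4096-entry table. */
static void map_sections(uint32_t *table, uint32_t phys,
                         uint32_t count, enum mem_type type)
{
    for (uint32_t i = 0; i < count; i++, phys += 0x00100000u)
        table[phys >> 20] = section_desc(phys, type);
}
```

With placeholder addresses, map_sections(table, 0x80000000, 128, MEM_WB_WA) would cover a first bank for code and data, and a second call with MEM_WT_NO_WA the bank holding the frame-buffer.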

First, with unrolled loops.

MMU on, graphics = wt, -O3 -funroll-loops     MMU on, graphics = wb, -O3 -funroll-loops
                                              -- lots of artifacts

Code     Total     Slowest  Fastest           Code     Total     Slowest  Fastest
C short  957743    0.14     1.02              C short  1816546   0.28     1.00
C long   956818    0.14     1.02              C long   1992627   0.30     1.09
ARM asm  933198    0.14     1.00              ARM asm  1871829   0.28     1.03
NEON     930448    0.14     1.00              NEON     1857085   0.28     1.02
NEON2    945969    0.14     1.01              NEON2    1862711   0.28     1.02
NEON3    946522    0.14     1.01              NEON3    1848473   0.28     1.01
NEON4    945739    0.14     1.01              NEON4    1861538   0.28     1.02
sDMA     6456313   1.00     6.93              sDMA     6455228   1.00     3.55

Ahh, now this is more like it. Getting over 800MB/s (if my timing calculations are right).
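The throughput figure comes down to bytes written divided by elapsed time. The sketch below shows only the formula; it assumes the timing is a cycle count at a known clock rate, which this post does not actually state, and the function name and any inputs fed to it are placeholders rather than measurements from these runs.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MB/s (1 MB = 10^6 bytes) for bytes written in the given
 * number of cycles at clock_hz. All three inputs are assumptions here. */
static uint64_t mb_per_second(uint64_t bytes, uint64_t cycles, uint64_t clock_hz)
{
    assert(cycles != 0);
    /* seconds = cycles / clock_hz, so bytes/s = bytes * clock_hz / cycles.
     * Multiplying first keeps integer precision, but it can overflow
     * 64 bits for very large inputs. */
    return bytes * clock_hz / cycles / 1000000u;
}
```

As a made-up example, 1,000,000 bytes written in 1,000,000 cycles at 500MHz works out to 500MB/s.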

Even the basic crappy C code is within a whisker of everything else - even though it executes about 3.5x as many instructions to get the same work done. The system DMA has fallen right off; but run asynchronously it would probably still be worth using since it is basically `free', and the CPU can do a lot more than just write memory. This code also polls the DMA status in a tight loop; I don't know if that is having any bandwidth effects.

The write-back timing is all out of whack - the C short version is the first to run, so it gets a benefit of having an empty cache and nothing to write-back. You also get to see the CPU write stuff back to the screen when it feels the need - lots of weird visual artifacts. And the explicit cache flushing required would only make it slower on top of that. In short - useless for a framebuffer. Any performance issues you might expect a write-back cache to address are handled much better by using proper algorithms. I saw it mentioned on the beagleboard list, so it seemed worthy of comment ...

And lastly, just with -O3, a typical compile flag (-funroll-loops generates much bigger code so might not always be desirable). I also added in a `hyper-optimised' memset implementation for good measure.

MMU on, graphics = wt, -O3

Code     Total     Slowest  Fastest
C short  1372096   0.21     1.47
C long   1038868   0.16     1.11
ARM asm  948600    0.14     1.02
NEON     929968    0.14     1.00
NEON2    939165    0.14     1.00
NEON3    946102    0.14     1.01
NEON4    945702    0.14     1.01
msNEON   1309313   0.20     1.40     (see memset_armneon())
sDMA     6462071   1.00     6.94

The C is still ok, if a bit slower, but barely worth `optimising' in this trivial case.

The msNEON code is from the link indicated ... interesting that a more complex C loop beats it somewhat; the msNEON code is only writing the same amount of memory linearly, not as a rectangle, and with severe alignment restrictions.

The NEON2 code has such a simple inner loop, yet is the most consistently top performer. Good to see that KISS sometimes still works.

        // write out 32-byte chunks
2:      subs    r6,#1
        vst1.64 { d0, d1, d2, d3 }, [r5, :64]!  // ARM syntax is `r5 @ 64'
        bgt     2b

The ARM code is quite a mess by comparison:

        // write out 32-byte chunks
2:      strd    r2,[r5]
        strd    r2,[r5, #8]
        strd    r2,[r5, #16]
        subs    r6,#1
        strd    r2,[r5, #24]
        add     r5,r5,#32
        bgt     2b

(FWIW I tried a similar trivial loop in ARM, a direct translation of the `C long' code, and that wasn't terribly fast).
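For comparison, the same 32-byte-chunk inner loop might be written in C roughly as below. This is only a sketch with an invented function name, not code from render-rect.c, and what the compiler turns it into will vary. Note that the assembly loops above are do-while style and so assume at least one chunk, whereas this version also accepts a count of zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write chunks x 32 bytes of pattern using four 64-bit stores per
 * iteration. dst must be 8-byte aligned. Returns the next address. */
static uint64_t *fill_chunks32(uint64_t *dst, uint64_t pattern, size_t chunks)
{
    assert(((uintptr_t)dst & 7u) == 0);

    while (chunks-- > 0) {
        dst[0] = pattern;
        dst[1] = pattern;
        dst[2] = pattern;
        dst[3] = pattern;
        dst += 4;
    }
    return dst;
}
```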

Anyway, I think I've done memory fill/rect fill to bloody death (and beyond!) now. It's just not a terribly interesting problem - particularly for a SIMD unit - apart from evaluating raw memory performance. Actually it is kind of handy for that, since it will easily show if things aren't configured properly.

PS Code changes not committed yet.
