Thursday, 11 April 2013

Beware the malloc ...

So one of the earliest performance lessons I learnt was to try to avoid allocating and freeing memory during a processing step: from bss sections, to stack pointer manipulation, to memory pools.
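
For what it's worth, the memory pool flavour of the trick is nothing fancy - a free list threaded through one slab allocated up-front, so each allocation and free is just a couple of pointer moves. This is only a minimal sketch with made-up names, not anything from the code mentioned below:

/*
 * Minimal fixed-size block pool: one malloc up-front, then a free list
 * threaded through the unused blocks. Illustrative sketch only.
 */
#include <stdlib.h>

struct pool {
    void *slab;        /* the single up-front allocation */
    void *free_list;   /* singly-linked list threaded through free blocks */
    size_t block_size;
};

static int pool_init(struct pool *p, size_t block_size, size_t count) {
    char *mem;
    size_t i;

    /* blocks must be big enough and aligned enough to hold the list link */
    if (block_size < sizeof(void *))
        block_size = sizeof(void *);
    block_size = (block_size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);

    mem = malloc(block_size * count);
    if (!mem)
        return -1;

    p->slab = mem;
    p->block_size = block_size;
    p->free_list = NULL;
    for (i = 0; i < count; i++) {
        void **block = (void **)(mem + i * block_size);
        *block = p->free_list;
        p->free_list = block;
    }
    return 0;
}

/* allocation and free never touch malloc/free in the hot path */
static void *pool_alloc(struct pool *p) {
    void **block = p->free_list;
    if (block)
        p->free_list = *block;
    return block;
}

static void pool_free(struct pool *p, void *block) {
    *(void **)block = p->free_list;
    p->free_list = block;
}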

This was particularly important with Amiga code, with its single-threaded first-fit allocator, but also with SunOS and Solaris - it wasn't until a few versions of glibc in, together with faster hardware, that it became less of an issue on GNU systems. And with the JVM, many of the allocation scenarios that were still a problem with libc (e.g. many small or short-lived objects) simply vanished (there are others to worry about, but they are easier to deal with ... usually).

It was something I utilised when I wrote zvt to make it quick; unfortunately, whoever took over maintaining it after I started at Ximian (I wasn't allowed time to work on zvt anymore, and Ximian kept me a bit too busy to keep it going as a hobby) didn't understand why the code did that, and it was one of the first things to go ... although by then, on a GNU system, it wasn't so much of an issue.

But even with a super-computer on the desk it's still a fairly major issue with GPU code. Knowing this, I always pre-allocate buffers for a given pipeline and let OpenCL virtualise them if required (or, more likely, just run out of memory), but today I had a graphic reinforcement of just why this is such a good idea.
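
Roughly what that looks like - a hedged sketch rather than anything from my actual code, with the context/queue/kernel setup, error checks, sizes and kernel argument layout all assumed or illustrative:

#include <CL/cl.h>

struct pipeline {
    cl_mem input;
    cl_mem work;      /* temporary work-space shared by the kernels */
    cl_mem output;
    size_t max_elems; /* worst case the pipeline was built for */
};

/* allocate everything once, sized for the worst case */
static int pipeline_init(struct pipeline *p, cl_context ctx, size_t max_elems)
{
    cl_int err = CL_SUCCESS;  /* real code would check each call */

    p->max_elems = max_elems;
    p->input  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  max_elems * sizeof(cl_float), NULL, &err);
    p->work   = clCreateBuffer(ctx, CL_MEM_READ_WRITE, max_elems * sizeof(cl_float), NULL, &err);
    p->output = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, max_elems * sizeof(cl_float), NULL, &err);
    return err == CL_SUCCESS ? 0 : -1;
}

/* per-frame path: only enqueue work against the already-allocated buffers */
static int pipeline_run(struct pipeline *p, cl_command_queue q, cl_kernel k,
                        const float *src, float *dst, size_t n)
{
    cl_uint count = (cl_uint)n;

    if (n > p->max_elems)
        return -1;  /* sized up-front; don't silently reallocate here */

    clEnqueueWriteBuffer(q, p->input, CL_FALSE, 0, n * sizeof(float), src, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &p->input);
    clSetKernelArg(k, 1, sizeof(cl_mem), &p->work);
    clSetKernelArg(k, 2, sizeof(cl_mem), &p->output);
    clSetKernelArg(k, 3, sizeof(cl_uint), &count);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    return clEnqueueReadBuffer(q, p->output, CL_TRUE, 0, n * sizeof(float), dst,
                               0, NULL, NULL) == CL_SUCCESS ? 0 : -1;
}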

After hacking all week I managed to improve the 'kernel time' of a specific high-level algorithm from 50ms to about 6ms. I was pretty damn chuffed, particularly as it also works better at what it's doing.

However, when I finally hooked it up to a working tech demo, the performance improvement plummeted to only about 3x. One expects quite a lot of overhead with a first-cut synchronous implementation going c-java-opencl-java-javafx, but that just seemed unreasonable - it looked like every kernel invocation carried nearly 1ms of overhead.

Without any way to use the sprofile output at that level (nanosecond timestamps aren't visually rich ...), I added some manual timing and tracked it down to one routine. It turned out that my port of the Apple FFT code was re-allocating its temporary work-space whenever the batch-size changed (rather than only when it grew). Those 2 frees and 2 allocations alone were taking 20ms, completely swamping the processing and the other overheads.
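
The fix is simply to make the work-space grow-only. The real port lives on the Java side, so this is just a sketch of the shape of it, with made-up names:

/* keep the work-space around and only reallocate it when the batch
 * actually grows, not whenever the batch-size merely changes */
#include <CL/cl.h>

struct fft_plan {
    cl_context ctx;
    cl_mem work0, work1;   /* temporary work-space buffers */
    size_t work_batch;     /* batch size the work-space is currently sized for */
    size_t batch_bytes;    /* work-space bytes needed per batch element */
};

static cl_int fft_ensure_workspace(struct fft_plan *p, size_t batch)
{
    cl_int err = CL_SUCCESS;

    /* before: this path ran on any change of batch, and those 2 releases
     * plus 2 allocations per call were what was costing ~20ms */
    if (batch <= p->work_batch)
        return CL_SUCCESS;

    if (p->work0) clReleaseMemObject(p->work0);
    if (p->work1) clReleaseMemObject(p->work1);

    p->work0 = clCreateBuffer(p->ctx, CL_MEM_READ_WRITE, batch * p->batch_bytes, NULL, &err);
    p->work1 = clCreateBuffer(p->ctx, CL_MEM_READ_WRITE, batch * p->batch_bytes, NULL, &err);
    p->work_batch = (err == CL_SUCCESS) ? batch : 0;
    return err;
}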

Whilst 15ms - with about 60% of the time spent in setup, data transfer, and invocation overheads - is still pretty poor, it is acceptable enough for this particular application as a first-cut single-queue synchronous implementation. Actually, apart from running much faster, the new routine barely warms up the GPU and the rest of the system remains more responsive. I must try a newer driver to see if it improves anything; I'm still on 12.x.

Bring on HSA ... I really want to see how OpenCL will work with the better architecture of HSA. Maybe the GCN-equipped APUs will have enough capability to show where it's headed, if not enough to show where it's going to end.

Why is the interesting hardware always 6-12 months away?
