Thursday, 18 August 2011

GEGL/OpenCL

So apparently a lad's been working on getting some OpenCL code into GEGL. What surprises me is just how slow the result is - and just how slow GEGL is at the super-simple operation of brightness/contrast even on the CPU.

Of course, I'm not sure exactly what is being timed here, so perhaps it's timing a lot more than just the mathematics. Obviously it has to be: my ageing Pentium-M laptop can do a 1024x1024 RGBA/FLOAT brightness/contrast in about 70ms with simple single-threaded Java code, so 500ms for the same operation using 'optimised SSE2' must include a hell of a lot of extra stuff beyond the maths. Curiously, the profiler screenshot shows 840 'tiles' being processed; if they are 128x64 as suggested, that's 840 x 128 x 64 = 6,881,280 pixels - about 6.9MP, not the 1MP stated in the post - in which case 500ms isn't so bad (it isn't great either, but at least it's in the right order of magnitude).
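
For scale, here's the kind of trivial loop I mean - a minimal sketch in C rather than Java, and the linear mapping is my assumption rather than whatever formula GEGL actually uses, but the per-pixel work is of the same order:

    /* Sketch only: linear brightness/contrast over an RGBA float image. */
    static void
    brightness_contrast(float *pixels, int width, int height,
                        float brightness, float contrast)
    {
        int n = width * height;
        for (int i = 0; i < n; i++) {
            float *p = pixels + i * 4;
            for (int c = 0; c < 3; c++)   /* RGB only, alpha untouched */
                p[c] = (p[c] - 0.5f) * contrast + 0.5f + brightness;
        }
    }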

I tried posting this to the forum linked from this Phoronix post but for whatever reason it refused to take the post, so I'll post it here instead.


This result is really slow - about 100x off, if I have the relative performance of that GPU right. Even the CPU timings look suspect - is GEGL really that slow?

A list of potential bottlenecks:
  • the locking stuff sounds overly complex, but maybe that's a GEGL requirement
  • are you timing one-off allocations which skew the results?
  • moving single tiles back and forth and processing them separately (this is a big one)
  • processing only a single tile per kernel call (this is a really big no-no)
  • you might want to specify the local work-size to get the best memory access pattern on the OpenCL side; 16x16 usually works well for per-pixel image operations on a GPU (see the sketch after this list)
  • PCI latency, related to working with small blobs of data at a time. This can be hidden almost completely, and fairly easily, by queueing up more jobs before a synchronisation point (either a clFinish or a blocking clEnqueueReadBuffer). You also need to call clFlush if you want the work to start while the CPU is still doing something (e.g. queueing up more work).
  • GEGL design. I know nothing about it, but if you need to go back to the CPU to synchronise between each composed operation you may never achieve very good performance. Ideally you upload the data once to the GPU, then do all the processing without any CPU synchronisation until the final result is ready. By default an OpenCL command queue is in-order (and no implementation supports out-of-order execution anyway), so you can leverage that as well. If GEGL can't already use threads to do a similar parallelisation, it might not be ready for OpenCL either.
  • GEGL itself. Since the GEGL CPU timings are so slow (I mean, really, really slow), GEGL must be doing so much behind the scenes and adding so much overhead that the actual calculations are completely swamped. If this overhead is a fixed cost then no matter what you do such processing will always be relatively slow, although as the complexity of the algorithm increases the fixed overhead will matter less.
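
To illustrate the last few points together, here is roughly the shape I'd expect the host code to take. This is a sketch under my own assumptions - bc_kernel and levels_kernel are invented pre-built kernels, their scalar arguments and all error checking are omitted - but it shows one upload, whole-image kernel calls with an explicit 16x16 local size, two operations chained on the one in-order queue, a clFlush to start the GPU early, and a single blocking read at the end:

    size_t bytes     = 1024 * 1024 * 4 * sizeof(float);
    size_t global[2] = { 1024, 1024 };   /* the whole image, not one tile */
    size_t local[2]  = { 16, 16 };       /* explicit work-group size */

    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, NULL);

    /* one upload ... */
    clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, bytes, src, 0, NULL, NULL);

    /* ... then chain the operations with no CPU synchronisation in between */
    clSetKernelArg(bc_kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, bc_kernel, 2, NULL, global, local, 0, NULL, NULL);
    clSetKernelArg(levels_kernel, 0, sizeof(cl_mem), &buf);
    clEnqueueNDRangeKernel(queue, levels_kernel, 2, NULL, global, local, 0, NULL, NULL);

    /* kick the GPU off while the CPU keeps working ... */
    clFlush(queue);

    /* ... and synchronise exactly once, at the end */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, bytes, dst, 0, NULL, NULL);

The details don't matter much; what matters is that there is exactly one synchronisation point for the whole pipeline.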

A list of things which can't be bottlenecks:
  • PCI bandwidth. It's just not enough data to matter.
  • the OpenCL kernel - maybe it could be improved with a better work-group size, but it's so simple it can't really be wrong (a minimal kernel of this sort is sketched below).
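
For reference, a per-pixel kernel of this sort is about as simple as GPU code gets. This is my own minimal sketch with an assumed linear formula, not the code from the branch:

    /* One work-item per pixel; alpha is passed through untouched. */
    __kernel void
    brightness_contrast(__global float4 *pixels,
                        const float brightness,
                        const float contrast,
                        const int width)
    {
        int i = get_global_id(1) * width + get_global_id(0);

        float4 v = pixels[i];
        float  a = v.w;

        v = (v - 0.5f) * contrast + 0.5f + brightness;
        v.w = a;
        pixels[i] = v;
    }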

Suggestions
  • My gut feeling is that you should ignore tiles completely on the OpenCL backend. Even manual CPU-side composition of tiles into an aggregate buffer will be fairly cheap compared to synchronous per-tile transfers and operations (see the copy loop sketched after this list). Composing operations complicates matters though ...
  • Don't try to hide too much detail with abstractions. It usually just makes it harder to know what's really going on (particularly for another coder).
  • Don't worry too much about comparing such a simple operation with the CPU. The CPU should already be able to do it at about memory speed, and you're adding PCI copies in-between. It's the more interesting stuff like convolution or FFT-based algorithms where the GPU will blow it completely out of the water.
  • Think of the GPU as a 'stream' processor. You want to load it up with a pipeline of operations and keep the pipe stuffed with work. Waiting for the pipeline to empty before adding more work will kill performance faster than anything else. This applies at every level - individual threads, SMs, and whole blocks of data.
  • You might need to profile the CPU GEGL brightness/contrast implementation; something other than the actual calculations is taking most of the time.
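
On the first suggestion: gathering tiles into one contiguous image before upload is only a copy loop, and a cheap one next to a PCI round trip per tile. A sketch, assuming 128x64 RGBA-float tiles and a made-up get_tile() accessor standing in for GEGL's real tile API:

    #include <string.h>

    #define TILE_W 128
    #define TILE_H 64

    const float *get_tile(int tx, int ty);   /* hypothetical accessor */

    /* Flatten tiles into one contiguous RGBA-float image; edge tiles
       and error handling are ignored for brevity. */
    void
    gather_tiles(float *image, int img_w, int img_h)
    {
        for (int ty = 0; ty < img_h / TILE_H; ty++)
            for (int tx = 0; tx < img_w / TILE_W; tx++) {
                const float *tile = get_tile(tx, ty);
                for (int row = 0; row < TILE_H; row++)
                    memcpy(image + ((ty * TILE_H + row) * img_w + tx * TILE_W) * 4,
                           tile + row * TILE_W * 4,
                           TILE_W * 4 * sizeof(float));
            }
    }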

In the NVIDIA profiler, look at the 'gpu time width plot' to see when the GPU is actually doing work. You'll probably see that the individual jobs (and memory transfers) take almost no time, and that the GPU is mostly sitting idle waiting for work from the CPU. That idle time is going to be 99% of the elapsed time, and it's where all the gains are to be found at this point.

Don't even bother looking at the graph you posted - the memory transfer time has to be greater than the processing time, since the processing is so simple and GPU memory bandwidth is so much higher than PCI speed. All the graph does is confirm that fact. The memory transfer time can mostly be hidden using asynchronous programming techniques anyway, so it is basically irrelevant.
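
To put illustrative numbers on that (assumed figures, not measurements): a 1024x1024 RGBA-float image is 16MB. At, say, 4GB/s over PCIe that's about 4ms each way, while at, say, 100GB/s of on-card memory bandwidth the kernel's one read and one write of those 16MB take around 0.3ms combined. The transfers dominate by an order of magnitude no matter what the kernel does, which is exactly why hiding them asynchronously beats measuring them.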
