Friday, 2 October 2015


Here's something I missed being out of the loop for the last good while: OpenVX.

Although it appears so did the rest of the industry? Another standard sent to rot?

I'll have to have a closer look though; if it is well thought out it will be worth borrowing a few ideas at the least as it's a problem i've come up with multiple solutions for myself and it's always useful to get a fresh perspective.

Or maybe it'll give me something to play with along with ZCL either as a binding and/or a pure Java (re)implementation for prototyping.

I wonder how far away Vulkan is for the rest of us.

Update: Some movement has occurred on the HSA front recently. I suppose just some more announcements or partnerships or something. There's a lot of technology and software that needs to come together to get anywhere and it's taking its time. I wonder if it'll ever get traction though. I guess it solves one of the problems with OpenCL - you actually have to program parallel software, which is a skill few have the time to learn and fewer companies have the budget to fund. At least it has the potential to stay in the compiler layer, hidden from your typical code-monkey.

OpenCL notes

As a tangentially related observation, I've been working on some OpenCL stuff of late and I came across a problem I just couldn't get to run very fast on the dev GPU. I ended up resorting to using an OpenCL/CPU kernel and another queue, but along the way i tried a couple of other interesting things.

One was using a native kernel written in Java - but this still requires a CPU driver / queue since the GPU driver i'm using doesn't support native kernels. ZCL has a fairly simple interface to this:

public interface CLNativeKernel {
    /**
     * Native kernel entry point.
     * @param args arguments, with any CLMemory objects replaced with
     * ByteBuffer.
     */
    public void invoke(Object[] args);
}

public class CLCommandQueue {
    // ...
    public native void enqueueNativeKernel(
        CLNativeKernel kernel,
        CLEventList waiters,
        CLEventList events,
        Object... args) throws CLException;
}

The JNI hides the details but it behaves the same way as the C code whereby any memory objects in the argument list are replaced by pointers (ByteBuffer here). Not sure if i'll keep the varargs prototype because it is inconsistent with every other enqueue function and only saves a little bit of typing. I'll review it when i look at the functional stuff i mentioned in the last post.

The interface can be used efficiently and conveniently from Java 8:

  CLBuffer points = cl.createBuffer(0, 1024 * 4);
  q.enqueueNativeKernel((Object[] args) -> {
    // args[0] = ByteBuffer = points contents
    // args[1] = Integer = 12
    // args[2] = Integer = 24
  }, null, null,
  points, 12, 24);
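To make the argument substitution concrete, here is a minimal pure-Java sketch of the idea with no OpenCL involved. FakeBuffer, NativeKernel and runNativeKernel are hypothetical stand-ins invented for this illustration, not ZCL classes; the real work happens in the JNI and on a CPU queue.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for a device memory object; not a ZCL class.
final class FakeBuffer {
    final ByteBuffer data;

    FakeBuffer(int size) {
        data = ByteBuffer.allocateDirect(size).order(ByteOrder.nativeOrder());
    }
}

// Same shape as the CLNativeKernel interface shown earlier.
interface NativeKernel {
    void invoke(Object[] args);
}

final class NativeKernelSketch {
    // Emulates the argument handling only: memory objects are replaced by
    // their ByteBuffer, everything else is passed through untouched.
    static void runNativeKernel(NativeKernel kernel, Object... args) {
        Object[] resolved = new Object[args.length];
        for (int i = 0; i < args.length; i++) {
            resolved[i] = (args[i] instanceof FakeBuffer)
                ? ((FakeBuffer) args[i]).data
                : args[i];
        }
        kernel.invoke(resolved);
    }

    // Example kernel: fill the buffer with start, start + step, ...
    static float[] demo() {
        FakeBuffer points = new FakeBuffer(4 * 4);
        runNativeKernel((Object[] a) -> {
            ByteBuffer bb = (ByteBuffer) a[0];
            int start = (Integer) a[1];
            int step = (Integer) a[2];
            for (int i = 0; i < bb.capacity() / 4; i++) {
                bb.putFloat(i * 4, start + i * step);
            }
        }, points, 12, 24);

        float[] out = new float[4];
        for (int i = 0; i < out.length; i++) {
            out[i] = points.data.getFloat(i * 4);
        }
        return out;
    }

    public static void main(String[] argv) {
        System.out.println(java.util.Arrays.toString(demo()));
    }
}
```

The casts inside the lambda are the price of an Object[] entry point; the varargs call site is what keeps the invocation short.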

Since my prototype harness didn't have a CPU queue until I subsequently added it, my first approach was to emulate a native kernel using event callbacks and user events. It actually worked pretty well and was about the same running time, although it took a bit more (fairly straightforward) code to set up.

One approach I took was to have two queues - the primary 'gpu' queue where the work is done, and another one used for the memory transfer and rendezvous point.

  // setup CLEventLists and the user event

  // gpu part of work
  gpu.enqueueXXKernel(..., gpudone);
  // prepare for cpu component
  work.enqueueReadBuffer(..., gpudone, readdone);
  // prepare for gpu again
  work.enqueueWriteBuffer(..., cpudone, writedone);
  // make sure q is ready
  gpu.enqueueMarkerWithWaitList(writedone, null);

  memdone.setEventCallback(CL_COMPLETE, (CLEvent e, int status) -> {
    // do the work, then signal the cpudone user event
  });

In this case all the enqueue operations are performed at once and events are used to synchronise. This simplifies the callback code a little bit. Now i'm looking at it there's probably no need for the separate queue if the gpu queue is synchronised with it anyway. (Like with most of these examples it is a summary of what i came up with, but not the full sequence of how i got there, which explains some of the decisions.)

This is a trivial approach to ensuring the 'gpu' queue behaves as the caller expects: that is, as if the work was performed in sequence on the queue and without having to pass explicit events. I'm using the read/write interfaces rather than map/unmap or otherwise mostly out of habit, but the data in question is quite small so it shouldn't make much difference either way.
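The dependency chain can be tried out without any OpenCL at all. The following is only an analogy, under the assumption that a CompletableFuture is a fair stand-in for an event and a single-threaded executor for an in-order queue; none of it is ZCL or OpenCL API. The event names are borrowed from the outline above, and cpudone plays the user event that the callback completes by hand.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class EventChainSketch {
    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());
        // Single-threaded executors stand in for two in-order queues.
        ExecutorService gpu = Executors.newSingleThreadExecutor();
        ExecutorService work = Executors.newSingleThreadExecutor();
        try {
            // Stand-in for the user event: nothing completes it automatically.
            CompletableFuture<Void> cpudone = new CompletableFuture<>();

            // Everything is "enqueued" up front; the futures do the ordering.
            CompletableFuture<Void> gpudone =
                CompletableFuture.runAsync(() -> log.add("gpu kernel"), gpu);
            CompletableFuture<Void> readdone =
                gpudone.thenRunAsync(() -> log.add("read buffer"), work);
            CompletableFuture<Void> writedone =
                cpudone.thenRunAsync(() -> log.add("write buffer"), work);
            // Marker: later gpu work would wait behind the write.
            CompletableFuture<Void> marker =
                writedone.thenRunAsync(() -> log.add("gpu marker"), gpu);

            // Callback on completion of the read: do the CPU part, then
            // release everything that waits on the user event.
            readdone.thenRun(() -> {
                log.add("cpu work");
                cpudone.complete(null);
            });

            marker.join();
            return new ArrayList<>(log);
        } finally {
            gpu.shutdown();
            work.shutdown();
        }
    }

    public static void main(String[] argv) {
        run().forEach(System.out::println);
    }
}
```

As in the OpenCL version, the caller never blocks between steps: the whole pipeline is described once and the callback is the only place where host code runs in the middle of it.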

And FWIW for this problem ... this approach or the java NativeKernel one actually runs a tiny bit quicker than using the OpenCL/CPU device let alone the GPU (all wall-clock time on the opencl q).

I had to make some small tweaks to the CLEventList code to make this all work and to tie properly into the garbage collection system. Mostly this was adding a CLEvent array rather than just using the pointer array and fixing the implementation. I kept the pointer array to simplify the jni lookup. I also had to have construction go through the same mechanism as the other CLObjects so they retain global reference uniqueness. This greatly simplifies (i.e. completely removes) reference tracking which is always nice with async callbacks. I think it should all "just work"; it does from the Java side - but i need to check from the OpenCL side of things whether actions like setUserEvent() add an explicit reference bump.
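For readers unfamiliar with the "reference uniqueness" idea, one common way to get it is a weak-valued table keyed on the native pointer, so the same pointer always resolves to the same Java object for as long as anything still holds it. This is a generic sketch and an assumption about the general technique, not ZCL's actual mechanism; Handle and HandleRegistry are hypothetical names.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around a native pointer.
final class Handle {
    final long pointer;

    Handle(long pointer) {
        this.pointer = pointer;
    }
}

final class HandleRegistry {
    // Weak values: the table never keeps a wrapper alive by itself.
    private final Map<Long, WeakReference<Handle>> live = new HashMap<>();

    // Return the one live wrapper for this pointer, creating it if needed.
    synchronized Handle resolve(long pointer) {
        WeakReference<Handle> ref = live.get(pointer);
        Handle h = (ref != null) ? ref.get() : null;
        if (h == null) {
            h = new Handle(pointer);
            live.put(pointer, new WeakReference<>(h));
        }
        return h;
    }

    public static void main(String[] argv) {
        HandleRegistry registry = new HandleRegistry();
        Handle a = registry.resolve(42L);
        Handle b = registry.resolve(42L);
        System.out.println(a == b);
    }
}
```

A fuller version would also drain a ReferenceQueue to drop stale entries and release the native object; that part is left out here.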

This is a prime example of what HSA should solve well, but for now this is what i've got to work with.

I've been so busy with details i haven't had a chance to even look at any OpenCL 2.0, let alone OpenVX, HSA, or much else. And frankly the spirit is just not that willing of late. Spring is just the latest of a litany of excuses for that.
