Sunday 2 March 2014

Native kernels too?

I kept poking at the code from the previous post and ended up getting native kernels going as well. I'm not really sure how useful they are but it's nice to come up with a neat solution.

It took me a while to grok the interface to clEnqueueNativeKernel but it seems to make sense.

This is the result I managed:

  public interface CLNativeKernel {
    public void invoke(Object[] args);
  }

  class CLCommandQueue {
      public native void enqueueNativeKernel(
          CLNativeKernel kernel,
          CLEventList waiters,
          CLEventList events,
          Object... args) throws CLException;
  }

Which leads to a relatively clean usage:

  CLBuffer mem = cl.createBuffer(0, 1024 * 4, null);

  q.enqueueNativeKernel((Object[] args) -> {
      System.out.printf("native kernel invoked %s\n", Thread.currentThread());
      for (Object o : args) {
          System.out.printf(" %s = %s\n", o.getClass().getName(), o);
      }
  }, null, null, mem, 10, mem, 10L);

Produces:
native kernel invoked Thread[Thread-0,5,main]
 java.nio.DirectByteBuffer = java.nio.DirectByteBuffer[pos=0 lim=4096 cap=4096]
 java.lang.Integer = 10
 java.nio.DirectByteBuffer = java.nio.DirectByteBuffer[pos=0 lim=4096 cap=4096]
 java.lang.Long = 10

The tricky bit is getting the memory handled. clEnqueueNativeKernel takes cl_mem arguments as input but then replaces them with host (virtual) memory pointers when invoking the kernel. The only equivalent of a pointer in Java is a ByteBuffer ... but that also needs a length.

But basically I just copy over the jobject references from the jobject array and change any CLMemory instances to be the cl_mem they point to. In the native kernel hook I then have to remap the provided pointers of any CLMemory instances to direct ByteBuffers, and I obtain the actual memory size using clGetMemObjectInfo() with CL_MEM_SIZE. Because the native kernel hook can only take one set of arguments I fudge it by internally using argument 0 as a structure which contains all the copies of stuff I need and then free it afterwards. It does force the Java code to deal with some of the ByteBuffer details, but the only alternatives I can think of get pretty messy, and actually doing lots of processing on memory buffers isn't something you should be doing from any native kernel to start with. They only work on CPU targets (APU?) anyway.
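
To give an idea of what those ByteBuffer details look like from the Java side, here is a minimal sketch of a kernel body that treats its first argument as an array of ints. It redeclares the CLNativeKernel interface so it compiles on its own, and it uses a plain ByteBuffer.allocateDirect() as a stand-in for the buffer the JNI hook would wrap around the remapped pointer. The class name, the FILL kernel and its (buffer, count, value) argument layout are all made up for illustration, not part of the binding.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// Same shape as the CLNativeKernel interface in the post, redeclared so this compiles alone.
interface CLNativeKernel {
    void invoke(Object[] args);
}

class NativeKernelBuffers {
    // Hypothetical kernel: args[0] = remapped buffer, args[1] = int count, args[2] = int value.
    static final CLNativeKernel FILL = (Object[] args) -> {
        ByteBuffer bytes = (ByteBuffer) args[0];
        int count = (Integer) args[1];
        int value = (Integer) args[2];

        // ByteBuffers default to big-endian; switch to native order so the
        // ints match what device code would see in the same memory.
        IntBuffer ints = bytes.order(ByteOrder.nativeOrder()).asIntBuffer();

        // The buffer carries its own length, so clamp rather than trust the count.
        int n = Math.min(count, ints.limit());
        for (int i = 0; i < n; i++) {
            ints.put(i, value);
        }
    };

    public static void main(String[] argv) {
        // Stand-in for the direct buffer created by the native hook (assumption).
        ByteBuffer mem = ByteBuffer.allocateDirect(1024 * 4);
        FILL.invoke(new Object[] { mem, 10, 42 });

        IntBuffer view = mem.order(ByteOrder.nativeOrder()).asIntBuffer();
        System.out.printf("ints[0]=%d ints[9]=%d ints[10]=%d%n",
                view.get(0), view.get(9), view.get(10));
    }
}
```

The byte order is the main thing to watch: a freshly created ByteBuffer is always big-endian regardless of the host, so any typed view needs nativeOrder() set first.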

I did hit an issue in that AttachCurrentThread() was attaching to another native thread this time; so I tried using AttachCurrentThreadAsDaemon() instead. That may actually not be a good idea but it depends on whether a given OpenCL implementation is using thread pools or not. I guess?

Anyway, I'm fairly pleased with the result here.
