Tuesday, 25 February 2014

Simple Java binding for OpenCL 1.2

Well I have it to the point of working - it still needs some functions filled out plus helper functions and a bit of tweaking but it's mostly there. So far under 2KLOC of C and less of Java. I went with the 'every pointer is 64-bits' implementation, using non-static methods, and passing objects around to the JNI rather than the pointers (except for a couple of apis). This allows me to implement the raw interface fully in C with just an 'interface' in Java - and thus write a lot less code.

Currently i'm mapping a bit closer to the C api than JOCL does. I'm only using ByteBuffers to transfer memory asynchronously; for any other array arguments i'm just using arrays.

This example is with the raw api with no helpers coming in to play - there are some obvious simple ones to add which will make it a bit more comfortable to use.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;

// For all the CL_* constants
import static au.notzed.zcl.CL.*;

...

  CLPlatform[] platforms = CLPlatform.getPlatforms();
  CLPlatform plat = platforms[0];
  CLDevice dev = plat.getDevices(CL_DEVICE_TYPE_CPU)[0];
  CLContext cl = plat.createContext(dev);
  CLCommandQueue q = cl.createCommandQueue(dev, 0);

  CLBuffer mem = cl.createBuffer(0, 1024 * 4, null);

  CLProgram prog = cl.createProgramWithSource(
    new String[] {
      "kernel void testa(global int *buffer, int4 n, float f) {" +
      " buffer[get_global_id(0)] = n.s1 + get_global_id(0);" +
      "}"
    });

  prog.buildProgram(new CLDevice[]{dev}, null, null);

  CLKernel k = prog.createKernel("testa");

  ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 4).order(ByteOrder.nativeOrder());
   
  k.set(0, mem);
  k.set(1, 12, 13, 14, 15);
  k.set(2, 1.3f);
   
  q.enqueueWriteBuffer(mem, CL_FALSE, 0, 1024 * 4, buffer, 0, null, null);
  q.enqueueNDRangeKernel(k, 1, new long[] { 0 }, new long[] { 16 }, new long[] { 1 }, null, null);
  q.enqueueReadBuffer(mem, CL_TRUE, 0, 1024 * 4, buffer, 0, null, null);
  q.finish();

  IntBuffer ib = buffer.asIntBuffer();
   
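  // Only 16 work-items were enqueued, so entries 16..31 still hold
  // the zeros that were written from the freshly allocated host buffer.
  for (int i=0;i<32;i++) {
    System.out.printf(" %3d = %3d\n", i, ib.get());
  }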

Currently CLBuffer (and CLImage) is just a handle to the cl_mem - it has no local meta-data or a fixed Buffer backing. The way JOCL handles this is reasonably convenient but i'm still yet to decide whether I will do something similar. Whilst it may be handy to have local copies of data like 'width' and 'format', I'm inclined to just have accessors which invoke the GetImageInfo call instead - it might be a bit more expensive but redundant copies of data aren't free either.
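
To make the trade-off concrete, here is a rough sketch of the two options. None of these class names exist in the binding, and the IntToLongFunction is only a stand-in for the native GetImageInfo query so the sketch runs without any OpenCL at all (it also assumes Java 8 for the functional interface).

```java
import java.util.function.IntToLongFunction;

// Hypothetical illustration only: these classes are not part of the binding.
final class ImageInfoSketch {
  // Same value as CL_IMAGE_WIDTH in cl.h.
  static final int CL_IMAGE_WIDTH = 0x1114;

  // Option 1: the object is just a handle; every accessor asks the runtime.
  static final class QueryingImage {
    private final IntToLongFunction getInfo; // stand-in for the native query

    QueryingImage(IntToLongFunction getInfo) {
      this.getInfo = getInfo;
    }

    long getWidth() {
      return getInfo.applyAsLong(CL_IMAGE_WIDTH);
    }
  }

  // Option 2: query once at creation and keep a local copy.
  static final class CachingImage {
    private final long width;

    CachingImage(IntToLongFunction getInfo) {
      this.width = getInfo.applyAsLong(CL_IMAGE_WIDTH);
    }

    long getWidth() {
      return width;
    }
  }

  public static void main(String[] args) {
    int[] calls = {0};
    // Fake query which counts how often it is invoked.
    IntToLongFunction fake = param -> { calls[0]++; return 640L; };

    QueryingImage q = new QueryingImage(fake);
    q.getWidth();
    q.getWidth();
    System.out.println("querying accessor, query calls: " + calls[0]);

    calls[0] = 0;
    CachingImage c = new CachingImage(fake);
    c.getWidth();
    c.getWidth();
    System.out.println("cached copy, query calls: " + calls[0]);
  }
}
```

The first form pays for a query on every access but carries no extra state; the second pays once and then has a field per value to keep around and maintain.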

I'm not really all that fond of the way JOCL honours the position() of Buffers - it kind of seems useful but usually it's just a pita. And manipulating that from C is also a pain. So at the moment I treat them as one would treat malloc() although I allow an offset to be used where appropriate.
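
A small illustration of the difference using nothing but java.nio. The class and the intAt helper are made up for this example and are not in the binding; the point is only that an explicit offset is always relative to the start of the buffer, like a pointer plus an offset in C, whereas relative access depends on (and changes) position().

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration only: not part of the binding.
final class OffsetVsPosition {
  // Read an int at an absolute byte offset from the start of the buffer.
  // position() is neither consulted nor modified.
  static int intAt(ByteBuffer buffer, int byteOffset) {
    return buffer.getInt(byteOffset);
  }

  public static void main(String[] args) {
    ByteBuffer b = ByteBuffer.allocateDirect(16).order(ByteOrder.nativeOrder());
    b.putInt(0, 10).putInt(4, 20).putInt(8, 30).putInt(12, 40);

    b.position(8);

    // Relative read: starts at position() and advances it by 4.
    System.out.println("relative read: " + b.getInt());
    System.out.println("position now:  " + b.position());

    // Absolute read: the caller says where, position() stays put.
    System.out.println("absolute read: " + intAt(b, 4));
    System.out.println("position now:  " + b.position());
  }
}
```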

Such as ...

public class CLCommandQueue {
  ...
  public native void enqueueWriteBuffer(CLBuffer mem, boolean blocking,
      long mem_offset, long size,
      Buffer buffer, long buf_offset,
      CLEventList wait,
      CLEventList event) throws CLException;
  ...
}

Compare to the C api:

extern CL_API_ENTRY cl_int CL_API_CALL
clEnqueueWriteBuffer(cl_command_queue   /* command_queue */, 
                     cl_mem             /* buffer */, 
                     cl_bool            /* blocking_write */, 
                     size_t             /* offset */, 
                     size_t             /* size */, 
                     const void *       /* ptr */, 
                     cl_uint            /* num_events_in_wait_list */, 
                     const cl_event *   /* event_wait_list */, 
                     cl_event *         /* event */) CL_API_SUFFIX__VERSION_1_0;

In C "ptr" can just be adjusted before you use it but in Java I need to pass buf_offset to allow the same flexibility. It would have been nice to be able to pass array types here too ... but then I realised that these calls can run asynchronously, which doesn't work from JNI (or doesn't work well).
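
The underlying issue, as far as I understand it, is that native code generally only gets a temporary (and possibly copied) view of a Java array, whereas the memory behind a direct buffer stays at a fixed address that a non-blocking transfer can keep using after the call returns. One way a Java-side guard could look is sketched below; requireDirect is a hypothetical helper and not something the binding currently has.

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

// Hypothetical illustration only: not part of the binding.
final class DirectCheck {
  // Reject buffers which are backed by a Java array (or are otherwise
  // not direct) before handing them to a non-blocking transfer.
  static <T extends Buffer> T requireDirect(T buffer) {
    if (!buffer.isDirect()) {
      throw new IllegalArgumentException(
          "asynchronous transfers need a direct buffer");
    }
    return buffer;
  }

  public static void main(String[] args) {
    ByteBuffer direct = ByteBuffer.allocateDirect(1024 * 4);
    ByteBuffer heap = ByteBuffer.wrap(new byte[1024 * 4]);

    System.out.println("direct ok: " + requireDirect(direct).isDirect());

    try {
      requireDirect(heap);
    } catch (IllegalArgumentException e) {
      System.out.println("heap rejected: " + e.getMessage());
    }
  }
}
```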

I'm still not sure whether the query interface should be based only on the type-specific queries implemented in C or whether I should have helpers for every value on the objects themselves. The latter makes the code size and maintenance burden a lot bigger for questionable benefit. Maybe just do it for the more useful types.

Haven't yet done the callback stuff or native kernels (i don't quite understand those yet) but most of that is fairly easy apart from some resource tracking issues that come in to play.

Of course now i've done 90% of the work i'm not sure i can be fagged to do the last 10% ...
