First I thought I'd fix the array getter methods. The OpenCL get methods are built in a way where, if you don't know the size of a field, you have to call them twice - once to get the size and again to get the content. There didn't seem to be much point exposing this on the Java side as I had initially done, so I went with doing it directly in the C code and returning a newly allocated, correctly sized result. My original thought was that buffers could perhaps be re-used across multiple gets, but in reality it just isn't that useful: for small buffers it doesn't matter, for large ones you need to find out the size anyway, and any management overhead of reusable buffers (thread-specific storage, and simply having memory sitting around doing nothing) is going to swamp allocation and GC.
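The size-then-content dance looks roughly like this; a minimal sketch with a mock standing in for the real clGetDeviceInfo-style call (all names here are invented for illustration - the actual binding does this in C):

```java
import java.nio.charset.StandardCharsets;

public class TwoCallGet {
    // Mock of an OpenCL-style getter: writes into buf if non-null,
    // and always reports the full value size via the return value.
    static int mockGetInfo(byte[] buf) {
        byte[] value = "OpenCL 1.2 Mock".getBytes(StandardCharsets.UTF_8);
        if (buf != null)
            System.arraycopy(value, 0, buf, 0, Math.min(buf.length, value.length));
        return value.length;
    }

    // The consolidated version: call once for the size, allocate
    // a correctly sized result, call again for the content.
    static byte[] getInfo() {
        int size = mockGetInfo(null);   // first call: size only
        byte[] buf = new byte[size];
        mockGetInfo(buf);               // second call: content
        return buf;
    }

    public static void main(String[] args) {
        System.out.println(new String(getInfo(), StandardCharsets.UTF_8));
    }
}
```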
I also realised I could fix one of the last bastions of the exposed native pointers and change the long getInfoP() methods to return an object directly:

native public <T extends CLObject> T getInfoP(int param, Class<T> klass);

Which is kind of nice. Actually getInfoP() was hidden by type-specific getters, but doing it this way (and particularly for the array types) saved even more code on the Java side for a minimal cost on the C side (actually I ended up saving code by reorganising the array getters).
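A rough Java-level sketch of the idea, with a plain map standing in for the native lookup (everything besides the getInfoP signature is invented for illustration; the real method is native and wraps a raw cl_* pointer):

```java
import java.util.HashMap;
import java.util.Map;

public class InfoGetter {
    static class CLObject {}
    static class CLDevice extends CLObject {}
    static class CLContext extends CLObject {}

    // Stand-in for the native side: parameter id -> wrapped object.
    static final Map<Integer, CLObject> registry = new HashMap<>();
    static {
        registry.put(0x1001, new CLDevice());
        registry.put(0x1002, new CLContext());
    }

    // One generic getter replaces a whole family of type-specific
    // getters; Class.cast() gives the caller a typed result.
    static <T extends CLObject> T getInfoP(int param, Class<T> klass) {
        return klass.cast(registry.get(param));
    }

    public static void main(String[] args) {
        CLDevice dev = getInfoP(0x1001, CLDevice.class);
        System.out.println(dev.getClass().getSimpleName());
    }
}
```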
Then I thought about whether I could add native array types to the CLCommandQueue interfaces, e.g.

native public void enqueueReadBuffer(CLBuffer mem, boolean blocking, long mem_offset, long size, byte[] buffer, long buf_offset, CLEventList wait, CLEventList event) throws CLException;

in addition to the interface that uses nio buffers.
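As a hedged sketch of what the array variant amounts to, with trivial stand-in classes (the real method is native and can run asynchronously on the command queue; this mock just does the copy inline):

```java
public class EnqueueSketch {
    // Minimal stand-in for the binding's buffer object.
    static class CLBuffer {
        byte[] backing = new byte[] { 1, 2, 3, 4 };
    }

    // Array variant: reads `size` bytes from `mem` at `memOffset`
    // into `buffer` starting at `bufOffset`.
    static void enqueueReadBuffer(CLBuffer mem, boolean blocking,
            long memOffset, long size, byte[] buffer, long bufOffset) {
        System.arraycopy(mem.backing, (int) memOffset,
                buffer, (int) bufOffset, (int) size);
    }

    public static void main(String[] args) {
        CLBuffer mem = new CLBuffer();
        byte[] dst = new byte[4];
        enqueueReadBuffer(mem, true, 0, 4, dst, 0);
        System.out.println(dst[3]);
    }
}
```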
The tricky bit is that these can run asynchronously, so you can't use the GetPrimitiveArrayCritical() calls, and you're basically left with either manually copying the data using Get*ArrayRegion() or using Get*ArrayElements(), which just seems to copy it on hotspot anyway.
As an experiment I tried the latter. Actually it ends up copying both ways, which is a bit of a waste.
When called without blocking, I use an event callback to await completion and then release the array back to Java. Strictly speaking I should do the same for the Buffer versions, so that the Buffer doesn't get GC'd while the operation is running, but that's something I think can be left to the programmer to keep track of.
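One way that keep-it-alive bookkeeping could look on the Java side - a sketch only, with an invented Event class and map (the real binding would do the release from a cl_event callback in C):

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentHashMap;

public class InFlight {
    // Hypothetical event handle; a real CLEvent wraps a cl_event.
    static class Event {}

    // Hold a strong reference to each buffer while its transfer is in
    // flight so the GC can't reclaim it before the device is done.
    static final ConcurrentHashMap<Event, ByteBuffer> inFlight =
            new ConcurrentHashMap<>();

    static Event enqueue(ByteBuffer buf) {
        Event e = new Event();
        inFlight.put(e, buf);   // pin until completion
        return e;
    }

    static void onComplete(Event e) {
        inFlight.remove(e);     // release: buffer may now be collected
    }

    public static void main(String[] args) {
        Event e = enqueue(ByteBuffer.allocateDirect(16));
        System.out.println(inFlight.size());
        onComplete(e);
        System.out.println(inFlight.size());
    }
}
```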
I tried a test program which just did many calls followed by a flush each time, and actually performance wasn't too bad relative to the Buffer version - maybe 10-20% slower (which is ok, since accessing arrays is faster and simpler than Buffers in Java). But then I tried a silly example of moving the flush outside of the loop. Ok, now it's 4x slower, and god knows how much memory it ends up swallowing whilst executing.
So I followed up by trying the GetArrayRegion interface. This is a little bit faster but nothing to write home about.
At this point I think I'll just keep the binding and API smaller and leave it using a ByteBuffer (sigh, which I still need to fix the endianness of), but I'll save the code for maybe later.
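The endianness fix itself is small: ByteBuffer.allocateDirect() hands back a big-endian buffer by default regardless of platform, so buffers shared with native code need the native order set explicitly:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NativeOrderFix {
    public static void main(String[] args) {
        // Default order is BIG_ENDIAN even on little-endian hardware;
        // data written by the OpenCL side is in native order, so fix it.
        ByteBuffer buf = ByteBuffer.allocateDirect(64)
                .order(ByteOrder.nativeOrder());
        System.out.println(buf.order() == ByteOrder.nativeOrder());
    }
}
```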
Actually, probably the most surprising thing is just how slow the OpenCL stuff is here. This is only using the CPU driver, so there are no weird memory busses to go over (even if this wasn't an APU). It's about 100x slower than copying a ByteBuffer to a byte array the same number of times. I thought it might be because the calls are non-blocking, but making them blocking only makes it worse. I tested the JNI overhead too by simply nooping out the clEnqueueReadBuffer call on the ArrayRegion version and that is only about 2x slower than