Monday, 19 March 2012

JNI overheads

Update: Also see the next post which has a slightly more real-world example.

So, I've been poking around with some JNI binding stuff, and this time I was experimenting with different types of interfaces: I was thinking of doing more work on the C side of things so I can just directly use the native interfaces rather than having to go through wrappers all the time.

So I was curious to see how much overhead a single field access would be.

I'll go straight to the results.

This is the type of stuff I'm testing:
class HObject {
    long handle;

    public void call1() { invokes static native void call(this.handle); }
    public void call2() { invokes native void call(this.handle); }
    native public void call3();
}

class BObject {
    ByteBuffer handle;

    public void call1() { invokes static native void call(this.handle); }
}

All each native function does is resolve the 'handle' to the desired pointer, and assign it to a static location. e.g. for a ByteBuffer it must call JNIEnv.GetDirectBufferAddress(), for a long it can just use the parameter directly, and for an object reference it must call back into the JVM to get the field before doing either of those.

The timings ... are for 10x10^6 invocations, spread over 10050 objects (some attempt to avoid unrealistic optimisations), repeated 5 times: the last result is shown.
#  What                                      Time (s)
0  static native void call(HObject o)        0.15
1  HObject.call1()                           0.10
2  static native void call(long handle)      0.10
3  static native void call(HObject.handle)   0.10
4  HObject.call2()                           0.12
5  HObject.call3()                           0.14
6  static native void call(BObject o)        0.59 (!)
7  BObject.call1()                           0.36 (!)

The timings varied a bit so I just showed them to 2 significant figures.
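
For reference, the shape of such a timing loop might look roughly like the sketch below. It is not the code used for the numbers above: the native call is replaced by a plain Java stand-in that stores the handle in a static field, so it runs without any native library, and the names CallTimer, timeOnce and timeLast are made up for illustration.

```java
public class CallTimer {
    // Stand-in for the "static location" the native functions assign to.
    static volatile long sink;

    static final class HObject {
        final long handle;
        HObject(long handle) { this.handle = handle; }
        // Hypothetical replacement for a wrapper around a native call.
        void call1() { sink = handle; }
    }

    // Times `calls` invocations spread round-robin over the objects;
    // returns elapsed seconds.
    static double timeOnce(HObject[] objects, int calls) {
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            objects[i % objects.length].call1();
        }
        return (System.nanoTime() - start) / 1e9;
    }

    // Repeats the run and keeps only the last result, so the earlier
    // runs act as warm-up for the JIT.
    static double timeLast(HObject[] objects, int calls, int repeats) {
        double last = 0;
        for (int r = 0; r < repeats; r++) {
            last = timeOnce(objects, calls);
        }
        return last;
    }

    public static void main(String[] args) {
        HObject[] objects = new HObject[10050];
        for (int i = 0; i < objects.length; i++) {
            objects[i] = new HObject(i + 1);
        }
        System.out.printf("%.2f s%n", timeLast(objects, 10_000_000, 5));
    }
}
```

Spreading the calls over many objects and discarding all but the last run are the two things the description above relies on; everything else here is an assumption.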

Whilst case 2 isn't useful, cases 1 and 3 show that hotspot has effectively optimised out the field dereferences after a few runs. Obviously this is the fastest way to do it. Although it's interesting that the static native method (1) is measurably different to the object native method (4).

The last case is basically how I implemented most of the bindings I've used so far, so I guess I should have a re-think here. There are historical reasons I implemented jjmpeg this way - I was going to write java-side struct accessors. But since I dropped that idea as impractical, it may make sense for a rethink here. PDFZ does have some java-side struct builders, but these are only for simple arrays which are already typed appropriately.

I didn't test the more complex double-level used in jjmpeg which allows its native objects to be garbage collected more easily.


So I was thinking I could implement code using case 0 or 5: this way the native calls can just be used directly without any extra glue-code.

There are overheads compared to cases 1 and 4, but it's less than 50%, and relatively speaking it will be far less than that. And most of this is due to the fact that hotspot can remove the field access entirely (which is of course: very cool).

Although it is interesting that a static native call from a local method is faster than a local native call from a local method. Whereas a static native call with an object parameter is slower than a local native call with an object parameter.

That said, 10x10^6 calls is a lot of calls, so the absolute overhead is pretty insignificant even for the worst case. Even if it's 5x slower, it's still only 59 vs 10 ns per call.
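
The per-call figures are just the table times divided by the number of invocations. A small sketch of that arithmetic, assuming the table times are in seconds and cover 10x10^6 calls (the class and method names are only for illustration):

```java
public class PerCallCost {
    // 10x10^6 invocations per timed run, as stated for the table.
    static final long CALLS = 10L * 1_000_000L;

    // Converts a total run time in seconds to nanoseconds per call.
    static double nsPerCall(double totalSeconds) {
        return totalSeconds * 1e9 / CALLS;
    }

    public static void main(String[] args) {
        System.out.printf("HObject.call1():   %.0f ns%n", nsPerCall(0.10));
        System.out.printf("call(BObject o):   %.0f ns%n", nsPerCall(0.59));
    }
}
```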

Small Arrays

This has me curious now: I wonder what the overhead for small arrays is, versus using a ByteBuffer/IntBuffer, etc.

I ran some tests with ByteBuffer/IntBuffer vs int[], using Get/ReleaseArrayElements vs using alloca(), and Get(Set)ArrayRegion. The test passes from 0 to 60 integers to the C code, which just adds up the elements.

Types used:
IntBuffer ib = ByteBuffer.allocateDirect(nels * 4).asIntBuffer();
ByteBuffer bb = ByteBuffer.allocateDirect(nels * 4);
int[] ia = new int[nels];

  • Using GetArrayElements() + ReleaseArrayElements() is basically the same as GetArrayRegion/SetArrayRegion up to 32 array elements; beyond that the second is faster, which is most counter-intuitive.
  • I thought that using a ByteBuffer was slower than using an IntBuffer (which is derived from a ByteBuffer using .asIntBuffer()), but it turns out that GetDirectBufferCapacity returns the capacity in elements of the buffer's type, not the number of bytes (i.e. as the java is documented, but different to the JNI method docs I found). Actually a ByteBuffer is a tiny bit faster.
  • If one is only reading the data, then calling GetArrayRegion to a copy on the stack is always faster than anything else for these sizes.
  • For read/write the direct byte buffer is the fastest.
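
The element-vs-byte capacity difference is easy to see from the Java side, which is the behaviour GetDirectBufferCapacity was found to match above. A minimal sketch (the class and method names are just for illustration):

```java
import java.nio.ByteBuffer;

public class BufferCapacity {
    // Capacity of the raw direct buffer: counted in bytes.
    static int byteCapacity(int nels) {
        return ByteBuffer.allocateDirect(nels * 4).capacity();
    }

    // Capacity of the IntBuffer view over the same amount of memory:
    // counted in ints.
    static int intCapacity(int nels) {
        return ByteBuffer.allocateDirect(nels * 4).asIntBuffer().capacity();
    }

    public static void main(String[] args) {
        int nels = 60;
        System.out.println("ByteBuffer capacity: " + byteCapacity(nels));
        System.out.println("IntBuffer capacity:  " + intCapacity(nels));
    }
}
```

So native code that treats the capacity of a ByteBuffer as an int count would process four times as many elements as intended.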

But this was just using the same array. What about if I update the content each time? Here I am using the same object, but setting its content before invocation.
  • Until 16 elements, the order is IntBuffer, Get/SetIntArrayRegion, Get/ReleaseIntArray, ByteBuffer
  • 16-24 elements, Get/SetIntArrayRegion, IntBuffer, Get/ReleaseIntArray, ByteBuffer
  • Beyond that, Get/SetIntArrayRegion, Get/ReleaseIntArray, IntBuffer, ByteBuffer

Obviously the ByteBuffer suffers from calling putInt(), but all the values are within 30% of each other so it's much of a muchness really.
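
To make the three update paths concrete, here is a rough sketch of what "setting the content before invocation" could look like for each type. This is not the benchmark code; the class name, the fill helpers and the seed values are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class UpdateContent {
    // Plain array: direct stores.
    static void fill(int[] ia, int seed) {
        for (int i = 0; i < ia.length; i++) ia[i] = seed + i;
    }

    // IntBuffer view: absolute put, indexed by element.
    static void fill(IntBuffer ib, int seed) {
        for (int i = 0; i < ib.capacity(); i++) ib.put(i, seed + i);
    }

    // ByteBuffer: absolute putInt, indexed by byte offset.
    static void fill(ByteBuffer bb, int seed) {
        for (int i = 0; i < bb.capacity() / 4; i++) bb.putInt(i * 4, seed + i);
    }

    public static void main(String[] args) {
        int nels = 16;
        int[] ia = new int[nels];
        IntBuffer ib = ByteBuffer.allocateDirect(nels * 4).asIntBuffer();
        ByteBuffer bb = ByteBuffer.allocateDirect(nels * 4);

        fill(ia, 100);
        fill(ib, 100);
        fill(bb, 100);
        System.out.println(ia[3] + " " + ib.get(3) + " " + bb.getInt(3 * 4));
    }
}
```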

And finally, what if the object is created every time as well?
  • Here, any of the direct buffer methods fall down - several times slower than the arrays - 6-10x slower than the fastest array version.
  • Here, using Get/SetIntArrayRegion is much faster than anything else, it was consistently at least twice as fast as the Get/ReleaseIntArray version.
So this contains a few curious results.

Firstly (well perhaps not so curious), only if you know the direct Buffer has been allocated beforehand is it always going to win. Dynamic allocation will be a killer; a cache might even it up, but I'm doubtful it would put any Buffer back to a winning spot.

Secondly - again not so curious: small array allocation is pretty damn fast. The timings hint that these small loops might be optimising away the allocation completely which cannot be done for the direct buffers.

And finally the strangest result: copying the whole array to the stack is usually faster than trying to access it directly. Possibly the latter either has to copy the memory out of the heap first, and so is effectively just doing the same thing, or it needs to lock the region or perform other GC-related work which slows it down.

Small arrays aren't the only thing needed for a JNI binding, but they do come up often enough. Now I know they're just fine to use, I will look at using them more: they will be easier to use on the Java side too.

Update: So I realised I'd forgotten Get/ReleasePrimitiveArrayCritical: for the final test cases, this was always a bit faster even than Get/SetArrayRegion. I don't know if it has other detrimental effects in an MT application though.

However, it does seem to work fine for very large arrays too, so it might be the simple one-stop shop, as at least on Oracle's JVM it always seems to be the fastest solution.

I tried some runs of 1024 and 10240 elements, and oddly enough the same relative results hold in all cases. Direct buffers only win when pre-allocated, GetIntArrayRegion is as fast as or faster than GetIntArrayElements, and GetPrimitiveArrayCritical is the fastest.


mbien said...

direct buffers are like static memory. Always preallocate them in performance relevant code. GC of them is difficult too. If you have many small buffers, slice a big one to save overhead further. (take a look at the cached buffer factory as used in jocl)
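
A minimal sketch of the slicing idea, for illustration only. This is not the jocl cached buffer factory; the class name SlicedBuffers and its take() method are hypothetical.

```java
import java.nio.ByteBuffer;

public class SlicedBuffers {
    // One big direct allocation that the small buffers are carved from.
    private final ByteBuffer pool;

    SlicedBuffers(int totalBytes) {
        pool = ByteBuffer.allocateDirect(totalBytes);
    }

    // Hands out the next `bytes` of the pool as an independent view
    // over the same native memory.
    ByteBuffer take(int bytes) {
        if (bytes > pool.remaining()) {
            throw new IllegalStateException("pool exhausted");
        }
        ByteBuffer dup = pool.duplicate();
        dup.limit(dup.position() + bytes);
        ByteBuffer slice = dup.slice();
        pool.position(pool.position() + bytes);
        return slice;
    }

    public static void main(String[] args) {
        SlicedBuffers buffers = new SlicedBuffers(1024);
        ByteBuffer a = buffers.take(16);
        ByteBuffer b = buffers.take(16);
        a.putInt(0, 7);
        System.out.println(a.capacity() + " " + b.getInt(0) + " " + a.isDirect());
    }
}
```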

NotZed said...

Yes, of course. However, all this management overhead is a pain to code too!

And even when the buffer doesn't need allocating, accessing it from java is quite slow: my second set of tests show that even with a pre-allocated buffer, updating the array in a direct buffer is a hit - it's not much, but for all the pain of using them from java, and all the extra stuff required to make them efficient, there's not much gain to be had even when there is one.

In my tests on my hardware/jvm (updating all elements of an existing object with new data, invoking a native function on this data, the native function only reads all values), the break-even is about 1024 ints for bytebuffer vs get+setintarrayregion, and by 10240 ints bytebuffer is edging it out by 19% (and in-turn getprimitivearraycritical is 16% faster than the direct buffer).

I just tested with openjdk 1.6 as on the system, and the results are even worse: using a bytebuffer is much slower than using an intbuffer (reverse of the oracle jdk), getintarrayregion is still faster at 10240 integers, and criticalarray is quite a bit faster.

openjdk: intbuffer: 17s, bytebuffer: 23s, get/releaseintarray: 14s, get/setarrayregion: 13s, critical array: 10s.
jdk1.7: 18, 12, 15.5, 15.7, 12.

On those numbers: any solution wouldn't really matter in a real application, but given a choice the simplest (from java) seems the one to go for. i.e. arrays.

Also in my experience, on the java side - if you have to access all elements of the data it's often faster to access the bytebuffer (in chunks) via an array. i.e. a copy, and possibly messy logic.
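
Roughly what that looks like, as a sketch: the buffer is read through bulk get() calls into a small int[] and the work is done on the array. The class name, the chunk size and the summing are only assumptions for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.IntBuffer;

public class ChunkedSum {
    // Sums every int in the buffer by copying it out chunk by chunk.
    static long sum(IntBuffer ib, int chunkSize) {
        int[] chunk = new int[chunkSize];
        // Work on a duplicate so the caller's position/limit are untouched.
        IntBuffer view = ib.duplicate();
        view.clear(); // position = 0, limit = capacity
        long total = 0;
        while (view.hasRemaining()) {
            int n = Math.min(chunk.length, view.remaining());
            view.get(chunk, 0, n); // bulk copy into the array
            for (int i = 0; i < n; i++) {
                total += chunk[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        IntBuffer ib = ByteBuffer.allocateDirect(100 * 4).asIntBuffer();
        for (int i = 0; i < ib.capacity(); i++) {
            ib.put(i, i);
        }
        System.out.println(sum(ib, 16));
    }
}
```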

So strangely enough, it's only pre-allocated, very small (<= 4 integers) buffers that always win (and only by a small margin) when using direct buffers: even pushing it out to 128K integers - 512KB of memory (which is clearly not practical for the alloca stuff) on the oracle jvm: criticalarray, byte buffer, get+setregion, intbuffer, get/release int array. openjdk: criticalarray, get+setregion, intbuffer, with bytebuffer/getreleasearray a dead heat for last.

Clearly if java isn't accessing all elements, this may not hold, otoh get/setarrayregion doesn't need to either. And no matter what, GetArrayElements/ReleaseArrayElements is the slowest of the lot.

Obviously what they were intended for: i/o, is a different matter as extra copies may not be able to lie around, the critical stuff can't be used, and they're a convenient way to encapsulate a malloc and for encoding binary streams. And the only real way to do something like concurrent access from a c thread - but that's a pretty limited case.

If you know of other benchmarks I'd be interested: I did a search but didn't find much. This was interesting though and of course I'm basically familiar with how jogamp does it.