Tuesday, 20 March 2012

More JNI'ing about

Well it looks like I brought my 'hacking day' a few days early, and I ended up poking around with JNI most of the day ... Just one of those annoying things that once it was in my head I couldn't get it out (and the stuff I have to do for work isn't inspiring me at the moment - gui re-arrangements/database design and so on).

I took the stuff I discovered yesterday and tweaked the OpenAL binding I hacked up a week ago. I converted all array + length arguments into Java array types, and all factory methods now create the object directly.

Then I ran a very crappy timing test ... most of the time in this is spent in alcOpenDevice(), but audio will have a fairly low call requirement anyway. I could spend forever trying to get a properly application-representative benchmark, but this will show something.

First, the C version.
#include <AL/al.h>
#include <AL/alc.h>

#define buflen 5
static ALuint buffers[buflen];
void testC() {
    ALCdevice *dev;
    ALCcontext *ctx;
    int i;

    dev = alcOpenDevice(NULL);
    ctx = alcCreateContext(dev, NULL);

    alcMakeContextCurrent(ctx);

    alGenSources(buflen, buffers);
    for (i=0;i<buflen;i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, 1);
    }
    for (i=0;i<buflen;i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, 0);
    }
    alDeleteSources(buflen, buffers);

    alcMakeContextCurrent(NULL);
    alcDestroyContext(ctx);
    alcCloseDevice(dev);
}
Not much to say about it.

Then the JOAL version.
AL al;
ALC alc;
IntBuffer buffers;
int bufcount = 5;
void testJOAL() {
    ALCdevice dev;
    ALCcontext ctx;

    dev = alc.alcOpenDevice(null);
    ctx = alc.alcCreateContext(dev, null);
    alc.alcMakeContextCurrent(ctx);

    al.alGenSources(bufcount, buffers);
    for (int i = 0; i < bufcount; i++) {
        al.alSourcei(buffers.get(i), AL.AL_SOURCE_RELATIVE, AL.AL_TRUE);
    }
    for (int i = 0; i < bufcount; i++) {
        al.alSourcei(buffers.get(i), AL.AL_SOURCE_RELATIVE, AL.AL_FALSE);
    }
    al.alDeleteSources(bufcount, buffers);

    alc.alcMakeContextCurrent(null);
    alc.alcDestroyContext(ctx);
    alc.alcCloseDevice(dev);
}
...
buffers = com.jogamp.common.nio.Buffers.newDirectIntBuffer(bufcount);
al = ALFactory.getAL();
alc = ALFactory.getALC();
...
I timed the initial factory calls which load the library: this isn't timed in the C version. Note that because it's passing around handles you need to go through an interface, and not directly through the native methods.

And then this new version, let's call it 'sal'.
import static au.notzed.al.ALC.*;
import static au.notzed.al.AL.*;

int bufcount = 5;
int[] buffers;
void testSAL() {
    ALCdevice dev;
    ALCcontext ctx;

    dev = alcOpenDevice(null);

    ctx = alcCreateContext(dev, null);
    alcMakeContextCurrent(ctx);

    alGenSources(buffers);
    for (int i = 0; i < buffers.length; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, AL_TRUE);
    }
    for (int i = 0; i < buffers.length; i++) {
        alSourcei(buffers[i], AL_SOURCE_RELATIVE, AL_FALSE);
    }
    alDeleteSources(buffers);

    alcMakeContextCurrent(null);
    alcDestroyContext(ctx);
    alcCloseDevice(dev);
}
...
buffers = new int[bufcount];
...
Again the library load and symbol resolution are included - in this case they happen implicitly. Notice that when using the static import it's almost identical to the C version. Only here the array length isn't required as it's determined by the array itself.

Also this implementation only needs a small number of very trivial classes to be hand-coded, and everything else is done in the C code; although I also looked into wrapping the whole lot (buffers and sources included) in a high-level api as well. The openal headers are used completely untouched, although I have some mucky scripts which call gcc/cproto/grep to extract the information I need.

Apart from the code itself, I tried two array binding approaches: one which uses GetIntArrayRegion(), and the other which uses Get/ReleasePrimitiveArrayCritical(). Note that for the case of alSourcei() the binding JNI code only needs to read the array and it doesn't copy it back afterwards.

Timings

The results: the above routine is run 1000 times for each run. The runs are a loop within the process, so only the first run has any library load overheads. I used the Oracle JDK 1.7.
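For reference, a harness of that shape might look roughly like the following. This is only a sketch under assumptions: TimingHarness and timeRuns are made-up names, and the workload here is a stand-in counter rather than the real testSAL() or testJOAL() calls.

```java
import java.util.Locale;

// Hypothetical harness: the class and method names are illustrative only.
final class TimingHarness {

    // Runs `body` `iterations` times per run, for `runs` runs in the same
    // process, and returns the elapsed wall-clock seconds of each run.
    static double[] timeRuns(int runs, int iterations, Runnable body) {
        double[] seconds = new double[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                body.run();
            }
            seconds[r] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    public static void main(String[] args) {
        // Stand-in workload; the real test would call testSAL() etc. here.
        int[] counter = new int[1];
        double[] t = timeRuns(10, 1000, () -> counter[0]++);
        for (int r = 0; r < t.length; r++) {
            System.out.printf(Locale.ROOT, "%d %.3f%n", r, t[r]);
        }
    }
}
```

Because every run shares one process, anything loaded lazily (the native library, resolved symbols, JIT-compiled code) is only paid for in the first run.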

run     c       joal    range   critical
0       3.3     3.678   3.518   3.554
1       3.267   3.622   3.446   3.405
2       3.297   3.513   3.493   3.552
3       3.264   3.482   3.448   3.494
4       3.243   3.575   3.553   3.542
5       3.297   3.472   3.395   3.352
6       3.308   3.527   3.376   3.359
7       3.284   3.52    3.354   3.363
8       3.253   3.419   3.363   3.349
9       3.266   3.42    3.429   3.413

ave     3.2779  3.5228  3.4375  3.4383
min     3.243   3.419   3.354   3.349
max     3.308   3.678   3.553   3.554
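
The summary rows can be recomputed from the per-run figures. A small sketch, with the numbers copied from the table and RunStats being a made-up helper name:

```java
import java.util.Arrays;
import java.util.Locale;

// Illustrative helper; not part of any binding.
final class RunStats {
    // Per-run timings copied from the table, one array per column.
    static final double[] C =
        {3.3, 3.267, 3.297, 3.264, 3.243, 3.297, 3.308, 3.284, 3.253, 3.266};
    static final double[] JOAL =
        {3.678, 3.622, 3.513, 3.482, 3.575, 3.472, 3.527, 3.52, 3.419, 3.42};
    static final double[] RANGE =
        {3.518, 3.446, 3.493, 3.448, 3.553, 3.395, 3.376, 3.354, 3.363, 3.429};
    static final double[] CRITICAL =
        {3.554, 3.405, 3.552, 3.494, 3.542, 3.352, 3.359, 3.363, 3.349, 3.413};

    static double average(double[] v) {
        return Arrays.stream(v).average().getAsDouble();
    }

    static double min(double[] v) {
        return Arrays.stream(v).min().getAsDouble();
    }

    static double max(double[] v) {
        return Arrays.stream(v).max().getAsDouble();
    }

    // One formatted summary line per column.
    static String summary(String name, double[] v) {
        return String.format(Locale.ROOT, "%-8s ave %.4f  min %.3f  max %.3f",
                name, average(v), min(v), max(v));
    }

    public static void main(String[] args) {
        System.out.println(summary("c", C));
        System.out.println(summary("joal", JOAL));
        System.out.println(summary("range", RANGE));
        System.out.println(summary("critical", CRITICAL));
    }
}
```

For the C column this gives an average of 3.2779, a minimum of 3.243 and a maximum of 3.308.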

As you'd expect, C wins out - it just calls all the functions directly, and even the temporary storage is allocated in the BSS.

Both of the 'sal' versions are much the same, and joal isn't too far behind either - but it is behind.

Ok, so it's not a very good benchmark. I'm not going to re-write all the above, but I also tried changing it to 100 iterations while repeating the inner 2 loops (between gen/deleteSources) 100 times as well: C was about 0.9s, joal averaged about 4s, and sal averaged 2s. But that case probably goes too far in over-representing the method call overheads relative to what you might expect for an application at run-time. JNI has overheads, but as long as you're not implementing java.lang.Math with it, it's barely going to be measurable once you add in i/o overheads, mmu costs, system scheduling and even cache misses.

At any rate, it validates the approach taken against another long-standing implementation (if not a particularly heavily developed one). Assuming, that is, that I don't have glaring errors in the code and that it's actually doing all the work I ask of it.

SAL binding

Note also that the 'sal' binding hasn't skimped on safety where possible just to try to get a more favourable result (well, the al*v() methods have an implied data length which I am not checking ...), e.g. the symbols are looked up at run-time and suitable exceptions thrown on error.

An example bound method:
// manually written glue code
int checkfunc(JNIEnv *env, void **ptr, const char *name) {
    // returns true if *ptr != null
    // opens library if not opened
    //   sets exception and returns false if it can't open
    // looks up method and field id's if not already done
    // sets *ptr to dlsym() lookup
    //   sets exception and returns false if it can't find it
    // returns true
}
// auto-generated binding
jobject Java_au_notzed_al_ALCAbstract_alcCreateContext(
        JNIEnv *env, jclass jc, jobject jdevice, jintArray jattrlist) {
    static LPALCCREATECONTEXT dalcCreateContext;
    if (!dalcCreateContext
        && !checkfunc(env, (void **)&dalcCreateContext, "alcCreateContext")) {
        return (jobject)0;
    }
    // unwrap the native handle stored in the java object
    ALCdevice * device = (jdevice ?
        (void *)(intptr_t)(*env)->GetLongField(env, jdevice, ALCdevice_p) : NULL);
    // copy the java array into temporary stack storage
    jsize attrlist_len = jattrlist ?
        (*env)->GetArrayLength(env, jattrlist) : 0;
    ALCint * attrlist = jattrlist ?
        alloca(attrlist_len * sizeof(attrlist[0])) : NULL;
    if (jattrlist)
        (*env)->GetIntArrayRegion(env, jattrlist, 0, attrlist_len, attrlist);

    ALCcontext * res;
    res = (*dalcCreateContext)(device, attrlist);
    // the constructor takes a java long, so pass a jlong through the varargs
    jobject jres = res ?
        (*env)->NewObject(env, ALCcontext_jc, ALCcontext_init_p, (jlong)(intptr_t)res) : NULL;
    return jres;
}
// auto-generated java side
public class ALCAbstract extends ALNative implements ALCConstants {
    ...
    public static native ALCcontext alcCreateContext
        (ALCdevice device, int[] attrlist);
    ...
}
// manually written classes (could obviously be auto-generated,
// but this isn't worth it if i want to add object api here)
public class ALCcontext extends ALObject {

    long p;

    ALCcontext(long p) {
        this.p = p;
    }
}
// and another
public class ALCdevice extends ALObject {

    long p;

    ALCdevice(long p) {
        this.p = p;
    }
}
So, 32-bit cpu's amongst you will notice that the handle is a long ... but that's only because I haven't bothered worrying about creating a version for 32-bit machines. Actually, because the JNI code is the only one which creates, accesses, or uses 'p' directly, it's actually easier to do this than if I was passing the handle to all of the native methods.

i.e. all I need is a different ALCdevice concrete implementation for each pointer size, and have the C code instantiate each instance itself. Neither the java native method declarations nor any java-side code needs to know the difference. If I wanted a high level ALCdevice object, that could just be abstract and it also needn't know about the type of 'p'.
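
A rough sketch of how that split might look on the Java side. The ALCdevice32/ALCdevice64 names and the wrap() helper are hypothetical; in the real binding the choice of class would be made by the C code when it calls NewObject(), not by Java.

```java
// Hypothetical sketch: ALCdevice32, ALCdevice64 and wrap() are illustrative names.
abstract class ALObject {
}

// What the rest of the Java code sees: there is no handle field here,
// so nothing on the Java side can depend on the pointer size.
abstract class ALCdevice extends ALObject {
}

// Would be instantiated by native code built for 64-bit pointers.
final class ALCdevice64 extends ALCdevice {
    long p; // only ever touched by the JNI code

    ALCdevice64(long p) {
        this.p = p;
    }
}

// Would be instantiated by native code built for 32-bit pointers.
final class ALCdevice32 extends ALCdevice {
    int p; // only ever touched by the JNI code

    ALCdevice32(int p) {
        this.p = p;
    }
}

final class HandleDemo {
    // Java stand-in for the decision the C side would make.
    static ALCdevice wrap(long handle, int pointerBits) {
        if (pointerBits == 32) {
            return new ALCdevice32((int) handle);
        }
        if (pointerBits == 64) {
            return new ALCdevice64(handle);
        }
        throw new IllegalArgumentException("unsupported pointer size: " + pointerBits);
    }

    public static void main(String[] args) {
        ALCdevice dev = wrap(0x1234L, 64);
        System.out.println(dev.getClass().getSimpleName());
    }
}
```

Callers only ever hold an ALCdevice reference, so the native method declarations stay the same for both pointer sizes.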

Other stuff

So one thing I've noticed when doing these binding generators is that every library does things a bit differently.

The earlier versions of FFmpeg were pretty clean, and although the function names were all over the place most of the calls took simple arguments with obvious wrappings. It's also a huge api ... which one would not want to have to hand-code.

For openal, it requires passing a lot of arrays+length around, so to implement an array based interface requires special-case code to remove the length specifiers out of the api (of course, it may be that one actually wants these: e.g. to use less than the full array content, or for indexing within an array, but these cases I have intentionally ignored at this point). The api is fairly small too and changes slower than a wet week! It also has a separate symbol resolution function for extensions - which I haven't implemented yet.
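
If the ignored cases ever mattered, one option would be an extra overload that takes an offset and count, with the whole-array form delegating to it. This is just a sketch: genSources and the fake name counter are made up for illustration, and a pure-Java stand-in takes the place of the native call.

```java
// Hypothetical sketch: pure-Java stand-in, not the actual binding.
final class ArrayApiSketch {
    // Stand-in for the names the native library would hand out.
    private static int nextName = 1;

    // Sub-range form: fills sources[offset .. offset+count-1].
    // Mirrors the C-style (n, pointer) call with explicit bounds checking.
    static void genSources(int[] sources, int offset, int count) {
        if (offset < 0 || count < 0 || offset > sources.length - count) {
            throw new IndexOutOfBoundsException(
                    "offset=" + offset + " count=" + count + " length=" + sources.length);
        }
        for (int i = 0; i < count; i++) {
            sources[offset + i] = nextName++;
        }
    }

    // Whole-array form: the length comes from the array itself.
    static void genSources(int[] sources) {
        genSources(sources, 0, sources.length);
    }

    public static void main(String[] args) {
        int[] all = new int[5];
        genSources(all);           // fills every element
        int[] part = new int[5];
        genSources(part, 1, 2);    // fills only part[1] and part[2]
        System.out.println(java.util.Arrays.toString(all));
        System.out.println(java.util.Arrays.toString(part));
    }
}
```

The common case stays as short as it is now, and the length still never has to be passed separately when the whole array is used.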

I also looked at OpenCL - and the binding for that requires special-case handling for the event arrays for it to work practically from Java. It is also more of an object based api rather than a 'name' (i.e. an int reference id) based one.

(BTW I'm only experimenting with these apis because I've been looking at them recently and they provide examples of reasonably small and well-defined public interfaces. I am DEFINITELY NOT planning on a whole jogamp replacement here: e.g. opencl and openal are simple stand-alone interfaces that can work mostly independently of the rest of the system. OpenGL OTOH has a lot of weird system dependencies which are a pain to work with - glx, wgl, etc. - before you even start to talk about toolkit integration).

So ... what was my point here. I think it's that trying to create a be-all-and-end-all binding generator is too much work for little pay off. Each library will need special casing - and if one tries to include every possible api convention (many of which cannot be determined by examining the header files - e.g. reference counting or passing, null terminated arrays, arrays vs pass by reference, etc! etc! etc!) the cost becomes overwhelming.

For an interface like openal - which is fairly small, mostly repetitive, and changes very slowly - all the time spent on getting a code generator to work would probably be better spent on doing it by hand: with a few judiciously designed macros it would probably only be half a day to a day's work once you nutted out the mechanisms. Although once you have a generator working it's easier to experiment with those mechanisms. In either case once it's done it's pretty much done too - openal isn't going to be changing very fast.

Although a generator just seems like 'the right way to do it(tm)' - you know, the interface is already described in a machine-readable form, so why not use it? But a 680 line bit of write-only perl is probably going to be more work to maintain than the 1400 lines of simple, much repeating and never changing C that it generates.
