Thursday, 1 October 2015

OpenCL garbage

I was working on some higher level containers for managing OpenCL stuff and came to the conclusion that I wanted to add automatic resource reclamation to zcl - it was either that or fill a whole hierarchy of objects with reference counting. But reference counting is slow, error-prone, and a big mess to write, so it isn't at all attractive when there is an alternative.

I'd already done it in jjmpeg but I wasn't really keen on the way I implemented it there and wanted to see if I could come up with a more streamlined solution. As when I did it for jjmpeg, I started with this article about JavaSE finalisation and using weak reference queues.

I think the solution I came up with will work ... and it turned out to be rather simple in the end.

Previously all CLObjects were simple lightweight pointer handles, with all the details passed to the C functions. They all have an init(pointer) constructor which was called directly from the JNI layer. Duplicate objects referencing the same resource were not an issue so I just let it happen. It's easy to break, but if you treat the objects like the C pointers they are and know that dangling references are possible then it isn't an unsolvable problem.

But for GC to work the references need to be unique. This is fairly easy to guarantee as the resources are just memory pointers - which are guaranteed to be unique and unchanging. So rather than have the JNI layer invoke the constructors directly, it now calls a factory method with a type index, which lets me move some of the code into Java - it isn't significantly simpler but it is more flexible.

For the reference queue to work properly I need to store the references in a container anyway, so this conveniently meshes with using a hashtable to uniquify the objects.

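  // Return the unique Java object for a native pointer, creating it on first sight.
  static CLObject toObject(int ctype, long p) {
    CLObjectHandle h = referenceMap.get(p);

    // Note: h.get() returns null if the referent has already been cleared.
    if (h != null)
      return h.get();

    // classTable is indexed by the type index passed in from the JNI layer.
    return classTable[ctype].newInstance(ctype, p);
  }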

My first attempt passed the Class through (this is how I did it in JNI) but I changed it to an integer. It makes the JNI a bit easier and having the type as an integer simplifies the release call (the OpenCL API isn't OO and has per-type release functions). Being able to identify the object fully using primitive types also lets me use them freely without polluting the reference tree, which is critically important when dealing with GC.
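The uniqueness guarantee and the integer type index can be sketched in isolation. This is only a minimal illustration under assumptions: every name in it (Wrapper, ContextWrapper, ProgramWrapper, WrapperFactory, Interner and the CTYPE_ constants) is made up for the example and is not the zcl source, and it uses a plain WeakReference rather than the CLObjectHandle type.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the wrapper classes; not the zcl source.
class Wrapper {
  final int ctype;
  final long p;

  Wrapper(int ctype, long p) {
    this.ctype = ctype;
    this.p = p;
  }
}

class ContextWrapper extends Wrapper {
  ContextWrapper(int ctype, long p) { super(ctype, p); }
}

class ProgramWrapper extends Wrapper {
  ProgramWrapper(int ctype, long p) { super(ctype, p); }
}

// One factory per native type, so the caller only needs an int and a long.
interface WrapperFactory {
  Wrapper create(int ctype, long p);
}

class Interner {
  // Hypothetical type indices; the real values are not given in the post.
  static final int CTYPE_CONTEXT = 0;
  static final int CTYPE_PROGRAM = 1;

  // Table indexed by ctype, in the same order as the constants above.
  private static final WrapperFactory[] table = {
    ContextWrapper::new,
    ProgramWrapper::new,
  };

  // Weak values, so the map itself never keeps a wrapper alive.
  private static final Map<Long, WeakReference<Wrapper>> live = new HashMap<>();

  static synchronized Wrapper toObject(int ctype, long p) {
    WeakReference<Wrapper> ref = live.get(p);
    Wrapper w = (ref != null) ? ref.get() : null;

    if (w == null) {
      // Either never seen, or the old wrapper was already collected.
      // How the second case should interact with the native reference
      // count is not covered here; this sketch just builds a new wrapper.
      w = table[ctype].create(ctype, p);
      live.put(p, new WeakReference<>(w));
    }
    return w;
  }
}
```

Calling Interner.toObject twice with the same pointer returns the same Java object for as long as something still holds it, which is the property the reference queue relies on.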

Now comes the bit which I fucked up in jjmpeg (well, the biggest bit). There, each object is represented by 4(!) classes. An autogenerated native abstract class which includes the static native method prototypes and a hand-written native concrete class which implements any type-specific dispose or construction semantics. Then there is an autogenerated abstract public class which includes all the autogenerated methods again - this time invoking all the methods on the native class after looking up the object pointer. And finally a hand-written public concrete class which includes constructors, helpers, and any other special cases where the details are better hidden.

This is just a lot of code - every public method on the "java" class ends up calling a native method on the "native" class, so every method needs at least two implementations. This was the main driver for ZCL simply using a single JNI implementation and forgoing this redundant juggling of the call stack just to insert the resource pointer into the call. In most cases in ZCL the public API is just the native method and it needs no redundant wrapper.

This time I just added a single general-purpose CLObjectHandle weak reference type which is used by all instances to track the native resource. It just holds the pointer (and the ctype) and implements the release. I just add one of these to each CLObject in one place.

  public abstract class CLNative {
    final long p;

    protected CLNative(long p) {
      this.p = p;
    }
...
  }

  public abstract class CLObject extends CLNative {
    final CLObjectHandle h;

    protected CLObject(int ctype, long p) {
      super(p);
      h = new CLObjectHandle(this, ctype, p);
    }

...
    static class CLObjectHandle extends WeakReference<CLObject> {
      long p;
      int ctype;

      CLObjectHandle(CLObject referent, int ctype, long p) {
        super(referent, referenceQueue);
        this.p = p;
        this.ctype = ctype;
        referenceMap.put(p, this);
      }

      void release() {
        if (p != 0) {
          referenceMap.remove(p);
          CObject.release(ctype, p);
          p = 0;
        }
      }
    }
...

  }

This and a bit of house-keeping is all that is required.
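The house-keeping is mostly a matter of draining the reference queue and releasing whatever turns up on it. Here is a rough sketch of what that could look like, assuming a single shared queue and map; the Handle class, its released list (which stands in for the native release call) and the drain/startCleaner helpers are all invented for this example.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CLObjectHandle; not the zcl source.
class Handle extends WeakReference<Object> {
  static final ReferenceQueue<Object> queue = new ReferenceQueue<>();
  static final Map<Long, Handle> map = new HashMap<>();
  // Records the "native" releases so the sketch can run without OpenCL.
  static final List<Long> released = new ArrayList<>();

  long p;

  Handle(Object referent, long p) {
    super(referent, queue);
    this.p = p;
    synchronized (map) {
      map.put(p, this);
    }
  }

  // Idempotent: a second call finds p == 0 and does nothing.
  void release() {
    synchronized (map) {
      if (p != 0) {
        map.remove(p);
        released.add(p); // the real code would call the native release here
        p = 0;
      }
    }
  }

  // Non-blocking: release every handle the collector has queued so far.
  static int drain() {
    int n = 0;
    Reference<?> r;
    while ((r = queue.poll()) != null) {
      ((Handle) r).release();
      n++;
    }
    return n;
  }

  // Alternative: a daemon thread that blocks on the queue and releases forever.
  static Thread startCleaner() {
    Thread t = new Thread(() -> {
      try {
        while (true)
          ((Handle) queue.remove()).release();
      } catch (InterruptedException e) {
        // interrupted: stop cleaning
      }
    }, "handle-cleaner");
    t.setDaemon(true);
    t.start();
    return t;
  }
}
```

Whether to poll from the factory method or to run a dedicated thread is a design choice; polling keeps every native release on the caller's thread, while the thread approach reclaims resources even when no new objects are being created.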

Having release be idempotent allows explicit release mechanisms to remain - for those cases where you can't afford to let the native resource management be at the whim of the garbage collector. For this reason I may also have to move the native pointer resolution in the JNI from a CLNative.p field lookup to resolving it via the handle. I need to investigate the cost of doing this first, and also whether explicit release like this will actually work in practice (e.g. if you release an object with more than one reference, does it fuck up?). Doing this would also let me use the correct integral type if I felt the need, by just creating two different CLObjectHandle classes (32/64) and resolving sizes in the JNI code.
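To make the idea of resolving the pointer via the handle concrete, here is a small sketch of one way it might look on the Java side. It is an assumption, not what zcl does: NativeHandle and Resource are hypothetical names, the lookup happens in Java rather than in the JNI code, and throwing on a released pointer is just one possible policy.

```java
import java.lang.ref.WeakReference;

// Hypothetical handle which owns the only copy of the native pointer.
class NativeHandle extends WeakReference<Object> {
  private long p;

  NativeHandle(Object referent, long p) {
    super(referent);
    this.p = p;
  }

  // Every native call would resolve the pointer through here.
  synchronized long pointer() {
    if (p == 0)
      throw new IllegalStateException("native resource already released");
    return p;
  }

  // Idempotent; returns true only for the call that actually released.
  synchronized boolean release() {
    if (p == 0)
      return false;
    // the real code would call the per-type native release here
    p = 0;
    return true;
  }
}

// Hypothetical wrapper object with no pointer field of its own.
class Resource {
  private final NativeHandle h;

  Resource(long p) {
    h = new NativeHandle(this, p);
  }

  long pointer() {
    return h.pointer();
  }

  void release() {
    h.release();
  }
}
```

With this arrangement an explicit release invalidates the Java object immediately, instead of leaving a stale pointer in a field that later calls would happily pass to the C functions.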

There are some potential problems where you resolve an object for the first time via a non-referencing API (for example clGetProgramInfo(CL_PROGRAM_CONTEXT) and the like) and then let the reference expire. But this shouldn't normally be a problem since you would have to get the context before creating the program and are going to be keeping it around for the lifetime of the program, and thus only one xxRelease is ever invoked. And this should normally hold for everything else too. If it turns out to be an issue I have mechanisms I can use to address it, from adding an explicit object reference to the given objects (e.g. a CLContext to each CLProgram created) to adding phantom reference bumps on specific APIs.
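The first of those mechanisms, an explicit object reference from child to parent, is simple enough to sketch. The Ctx and Prog classes below are hypothetical stand-ins for CLContext and CLProgram and only show the reference structure, nothing OpenCL-specific.

```java
// Hypothetical stand-in for CLContext.
class Ctx {
  final long p;

  Ctx(long p) {
    this.p = p;
  }
}

// Hypothetical stand-in for CLProgram.
class Prog {
  final long p;
  // Strong reference: while a Prog is reachable its Ctx is reachable too,
  // so the context's handle cannot be queued and released before the program's.
  final Ctx context;

  Prog(Ctx context, long p) {
    this.context = context;
    this.p = p;
  }

  // Can return the stored object without a native query.
  Ctx getContext() {
    return context;
  }
}
```

The cost is one extra field per object, and the context can then be returned directly rather than being resolved again through a non-referencing call.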

It's actually a devilishly difficult thing to test and verify: even once you know the exact reference counting semantics of every OpenCL api the interaction with the JVM will hide faults.

I haven't explored further, but having unique objects and GC lets me freely cache local copies of resource handles for convenience or efficiency and so on. It already simplifies using the library quite a bit as it is.

The next zcl release will include this as well as a couple of bug fixes and some other things which make it easier to use. Dunno when that might be though.
