Friday, 2 October 2015


Here's something I missed being out of the loop for the last good while: OpenVX.

Although it appears so did the rest of the industry? Another standard sent to rot?

I'll have to have a closer look though; if it is well thought out it will be worth borrowing a few ideas at the least as it's a problem i've come up with multiple solutions for myself and it's always useful to get a fresh perspective.

Or maybe it'll give me something to play with along with ZCL either as a binding and/or a pure Java (re)implementation for prototyping.

I wonder how far away vulkan is for the rest of us.

Thursday, 1 October 2015

OpenCL lambda enqueue

Just had a thought on an alternative api for CLCommandQueue in zcl. No this has nothing to do with lambda calculus in OpenCL.

An inconvenience in the current api is that all the enqueue functions take a lot of arguments, many of which are typically default values. This can be addressed using function overloading but this just adds additional inconvenience as there are also simply a lot of functions to overload. A related issue is things like extensions can add additional entry points which are object-orientedly resident on the queue object but placing them there doesn't necessarily fit.

And finally new compound operations need to be placed elsewhere but also fit a similar semantic model of enqueing a task to a specific queue.

So the thought is to instead to use java's lambda expressions to create queueable objects which know how to run themselves, and then at least the waiters/events parameter overload can be handled in one place.

So rather than:

// some compound task
  public void runop(CLCommandQueue q, CLImage src, CLImage dst,
      CLEventList waiters, CLEventList events) {
     ... enqueue one or more jobs ...
  public void runop(CLCommandQueue q, CLImage src, CLImage dst) {
     runop(q, src, dst, null, null);
  public void runop(CLCommandQueue q, CLImage src, CLImage dst,
      CLEventList event) {
     runop(q, src, dst, null, event);

// usages
 runop(q, src, dst, waiters, events);
 runop(q, src, dst, events);
 runop(q, src, dst);

I can do:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);

// the creation (only one required)
 public CLTask of(CLImage src, CLImage dst) {
   return (q, w, e) -> {
     ... enqueue one or more jobs ...

// usages, dst));, dst), events);, dst), waiters, events);

This could extend throughout the rest of the api so that for example a CLBuffer would provide it's own read task factories:

  public CLBuffer {

    public CLTask ofRead(byte[] target) {
      return (q, w, e) -> {
        q.enqueueReadBuffer(this, true, 0, target.length, target, 0, w, e);

// usages;, events);, waiters, events);


// typical usage (without overloading)
  q.enqueueReadBuffer(this, true, 0, target.length, target, 0, null, null);
  q.enqueueReadBuffer(this, true, 0, target.length, target, 0, null, events);
  q.enqueueReadBuffer(this, true, 0, target.length, target, 0, waiters, events);

I think this would provide a way to add the convenience of overloading without a method count explosion. But the real question is whether it would actually improve the api in any meaningful way or merely make it different. Probably at this point it's a tentative yes on that one for many of the same reasons lambdas are convenient such as encapsulation and reuse.

There are some issues of resolving state at point-of-execution and threads but these are already an issue with OpenCL code to some extent and definitely with lambdas in general.

One could keep going:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);

  public default void on(CLCommandQueue q) {
    enqueue(q, null, null);
  public default void on(CLCommandQueue q, CLEventList w, CLEventList e) {
    enqueue(q, w, e);

// usage

Despite this having the benefit of layering in isolation above the base api I think it starts to get a little absurd and turns into "much of a muchness" deckchair shuffling.

Although this addition is probably useful:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);

  public default CLTask andThen(CLTask after) {
     return (q, w, e) -> {
        enqueue(q, w, e);
        after.enqueue(q, w, e);

// usage buffer1.ofRead(target).andThen(buffer2.ofWrite(target)) );

Actually I didn't really intend it as an outcome but this also becomes a lot more usable if the resources in questions are automatically reclaimable via gc as per my last post. Whole state and work spaces can be retained and reused through nothing more than a CLTask reference.

I think i've convinced myself of the utility now but either way it takes very little code to try it.

OpenCL garbage

I was working on some higher level containers for managing OpenCL stuff and came to the conclusion that I wanted to add automatic resource reclaimation to zcl - it was either that or fill a whole hierarchy of objects with reference counting. But reference counting is slow, error-prone, and a big mess to write in so it isn't at all attractive when there is an alternative.

I'd already done it in jjmpeg but i wasn't really keen on the way i implemented it there and wanted to see if i could come with a more streamlined solution. Like when I did it for jjmpeg I started with this article about JavaSE finalisation and using weak reference queues.

I think the solution I came up with will work ... and it turned out to be rather simple in the end.

Previously all CLObjects were a simple lightweight pointer handle with all the details passed to the C functions. They all have an init(pointer) constructor which was called directly from the JNI layer. Duplicate objects referencing the same resource were not an issue so I just let it happen. Well it's easy to break but if you treat objects like the C pointers they are and know that dangling references are possible then it's not unsolvable.

But for GC to work the references need to be unique. This is fairly easy to guarantee as the resources are just memory pointers - which are guaranteed to be unique and unchanging. So rather than the JNI layer invoke the constructors directly I just call a factory method with a type index which lets me move some of the code into Java - it isn't significantly simpler but it is more flexible.

For the reference queue to work properly I need to store them in a container anyway so this conveniently meshes with using a hashtable to uniqify the objects.

  static CLObject toObject(int ctype, long p) {
    CLObjectHandle h = referenceMap.get(p);

    if (h != null)
      return h.get();

    return classTable[ctype].newInstance(ctype, p);

My first attempt passed the Class through (this is how i did it in JNI) but I changed it to an integer. It makes the JNI a bit easier and having the type as an integer simplifies the release call (OpenCL api isn't OO and has per-type release functions). Being able to identify the object fully using primitive types also lets me freely use them without polluting the reference tree; which is critically important when dealing with gc.

Now comes the bit which i fucked up in jjmpeg (well the biggest bit). Each object is represented by 4(!) classes. An autogenerated native abstract class which includes the static native method prototypes and a hand-written native concrete class which implements any type-specific dispose or construction semantics. Then there is an autogenerated abstract public class which includes all the autogenerated methods again - this time invoking all the methods on the native class after looking up the object pointer. And finally a hand-written public concrete class which includes constructors, helpers, and any other special cases where the details are better hidden.

This is just a lot of code - every public method on the "java" class ends up calling a native method on the "native" class so every method needs at least two implementations; . This was the main driver for ZCL simply using a single JNI implementation and foregoing this redundant juggling of the call stack just to insert the resource pointer into the call. In most cases in ZCL the public api is just the native method and it needs no redundant wrapper.

This time I just added a single general-purpose CLObjectHandle weak reference type which is used by all instances to track the native resource. It just holds the pointer (and the ctype) and implements the release. I just add one of these to each CLObject in one place.

  public abstract class CLNative {
    final long p;

    protected CLNative(long p) {
      this.p = p;

  public abstract class CLObject extends CLNative {
    final CLObjectHandle h;

    protected CLObject(int ctype, long p) {
      h = new CLObjectHandle(this, ctype, p);

    static class CLObjectHandle extends WeakReference<CLObject> {
      long p;
      int ctype;

      CLObjectHandle(CLObject referent, int ctype, long p) {
        super(referent, referenceQueue);
        this.p = p;
        this.ctype = ctype;
        referenceMap.put(p, this);

      void release() {
        if (p != 0) {
          CObject.release(ctype, p);
          p = 0;


This and a bit of house-keeping is all that is required.

Having release be idempotent allows explicit release mechanisms to remain - for those cases where you can't afford to let the native resource management be at the whim of the garbage collector. For this reason i may also have to move the native pointer resolution in the JNI from a CLNative.p field lookup to resolving it via the handle. I need to investigate the cost of doing this first, and also whether explicit release like this will actually work in practice (e.g. if you release an object with more than one reference, does it fuck up?). Doing this would also let me use the correct integral type if I felt the need by just creating two different CLObjectHandle classes (32/64) and resolving sizes in the JNI code.

There is some potential problems where you resolve an object for the first time via a non-referencing api (for example clGetProgramInfo(CL_PROGRAM_CONTEXT) and the like) and then let the reference expire. But this shouldn't normally be a problem since you would have to get the context before creating the program and are going to be keeping it around for the lifetime of the program and thus only one xxRelease is every invoked. And this should normally hold for everything else too. If it turns out to be an issue I have mechanisms I can use to address it from adding an explicit object reference to the given objects (e.g. a CLContext to each CLProgram created), or adding phantom reference bumps on specific apis.

It's actually a devilishly difficult thing to test and verify: even once you know the exact reference counting semantics of every OpenCL api the interaction with the JVM will hide faults.

I haven't explored further but having unique objects and gc lets me freely cache local copies of resource handles for convenience or efficiency and so on. It really simplifies using the library enough as it is.

The next zcl release will include this as well as a couple of bug fixes and some other things which make it easier to use. Dunno when that might be though.

Tuesday, 29 September 2015

BOOPSOGL time waster

The long story: I finally replaced my AVR (hi-fi amplifier) a couple of weeks ago after blowing it up 1-2 years ago and the new one has some network features. There's a web page to control it and a phone app - but both are pretty shitful. Actually the app isn't all bad but it's a pain for the things I use it for most: volume and mute because the volume knob is clumsy and the way the app handles screen blanking means mute isn't as easily accessed as it should be. I played a bit with the web app and worked out some of it's terrible 'xml-ish' remote control protocol and wrote a little application to perform both - but javafx is way too fat for this. I was recently looking into some opengl stuff and came across a trivial example which uses GLX to setup the screen - it also had some simple X11 Display code so I thought I could just write a super-lightweight Xlib tool for this. But then you need at least a little bit of 'toolkit' to make this doable ...

I'd had a blast of Res0gun and DRIVECLUB earlier but TV was dull so I started poking around some trivial C struct-based object system but then realised how much i'd forgotten since GObject and CamelObject. And then realised all the boilerplate that would be needed to even use such a one, so I went back to my RKRM: Libraries and looked into cloning BOOPSI instead. The only boilerplate that needs is setting a dispatch method, although the dispatch method itself ends up being fat as it fulfills the role a vtable would.

BOOPSI (basic object oriented programming system for intuition) was the AmigaOS 2 solution to general 'objects in C' which was apparently based on SmallTalk (Amiga libraries and devices are also object oriented but are not as general). Everything is implemented using a programmed dispatch call stack rather than vtables. It's not particularly fast but it is very small and flexible and it does have one rather interesting benefit not found in C or C++ - the ability to change any object in the hierarchy without a full recompile whilst still retaining single-instance memory blocks.

The short story: I got a couple of hundred lines into the code which is enough to instantiate objects and define classes together with some core support utilities.

Will I keep poking? I'm slightly curious perhaps but not quite curious enough for that as it gets involved very quickly. Maybe if I use GLX instead of the raw X I was thinking of (BOOPSOGL?). OpenVG? Text rendering is the biggest hassle either way. And layouts, although i've looked at that before.

I guess at least one observation is that back then this stuff looked so fat and cumbersome (albeit a large improvement over base intuition or gadtools), but then yeah, i've seen what else has come since and it really really wasn't.

Friday, 18 September 2015

Nice curves!

Bezeir Curves.

Wow what a page.

Thursday, 17 September 2015

empirically corrected-approximate integer division

Although i haven't been posting about it i've been continuing to poke around in bits and pieces of code. Well, a little bit.

I did a bit of OpenCL last week and that was pretty fun. I had enough time to really dig into optimising a particular routine and was down to inspecting the ISA output from the driver. Good stuff. The GCN isa is pretty foreign to me so I had to use small snippets to isolate operations of interest.

For example one construct that comes up repeatedly when parallelising code is using a divide and modulus operator when splitting up a non-work-sized job into work-group sized blocks.

  int block_size = info.block_size;
  for (int id=get_local_id(0); id < limit; id+=64) {
    int block_no = id / block_size;
    int block_index = id % block_size;
     // do work

Where possible one just chooses a power of 2 so this is a simple shift and mask, or integer divide by a constant isn't too bad as it can usually be optimised by the compiler. But this problem required a dynamic block size that wasn't a power of 2.

The solution? Use floating point multiply the reciprocal which can be calculated efficiently or here off-line. The problem is that this introduces enough rounding error to be worthless without some more work.

I must admit I just found the solution empirically here: i had a limited range of possible values so I just exhaustively tested them all against a couple of guesses. Hey it works, i'm no scientist.

  float block_size_1 = info.block_size_1;
  for (int id=get_local_id(0); id < limit; id+=64) {
    int block_no = (int)(id * block_size_1 + 1.0f / 16384);
    int block_index = id - (block_no * block_size);
     // do work

This replaces the many instruction integer division decomposition with a convert+mad+convert.

On some work-loads this was a 25% improvement to the entire routine and these 2 lines are in an inner loop of about 50 lines of code.

Well it's been fun to play at this level again - its ... mostly ... pointless going to this level but just adds to the toolkit and I enjoy poking. Maybe one day i'll have a job where it's useful.

I gave zcl a go on this as originally I was thinking of trying some OpenCL 2 stuff but I may not bother now. Given the lack of use/testing it was pretty much bug free but I started filling out the API with some more convenient entry points. I also decided to add some more java-array interfaces here and there: they're just too convenient and it hides the mess in the C even if they might not be the most efficient in all cases.

This is the sort of thing i'm talking about:

  float[] data = new float[] { 1, 2, 3, 4, 5 };
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data);
  float[] data = new float[] { 1, 2, 3, 4, 5 };
  ByteBuffer bb = ByteBuffer.allocateDirect(data.length * 4).order(ByteOrder.nativeOrder());
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data.length * 4, bb);
It's only two fewer lines of code ... but yeah that shit gets old fast. The first is more efficient too because this is a native method and it avoids the copy. In the first case CL_MEM_USE_HOST_PTR throws an exception though, and in the second it works (library call permitting).

The main downside is adding these convenience calls blows out the method count very quickly if you support all the primitive types - which detracts from the ease of use they're supposed to increase.

Another release? Who knows when.

And this week i've been poking at some OpenGL. My it's grown. I'm experimenting using JOGL for this although i'm not a fan of some of it's binding choices. It's crossed my mind but i'm pretty sure i don't want to create yet another binding as in a 'ZGL'. Hmm, I wonder if vulkan will clean up the cross platform junk from opengl.

Unfortunately my home workstation seems to have developed a faulty DIMM or something (unrelated note of note).

Saturday, 1 August 2015

Losing a chunk.

So i've mentioned it a few times on here - i've been steadily losing weight since February. It's settled now at about 74kg - in part because it just seemed to stop on it's own and its enough off so i'm no longer trying to actively lose weight. It is surprising to me how little food it takes to maintain this so far but i haven't been exercising a lot either.

I'm not sure if the initial trigger was all mental or all physical but in part the shock of getting gout and a full freezer and pantry bereft of 'gout friendly' foods kicked off a period of simply being able to lose weight on whim. I guess I somehow shrank my stomach enough to change how my body detects hunger and once it started I just ran with it - i can't say I tried too hard but I wanted to lose a bit for years. I still get hungry, it just doesn't really bother me like it used to and turn into an overwhelming need to eat. I did a few experiments in the midst to see how little I could get away with and it wasn't hunger that drove me to eat but a sore stomach.

At 92kg I was sadly bang on average for a bloke my age; but that is a lot more than my body can take. Although the belt was the first indication of making progress it wasn't till i broke under about 84kg that I looked (to myself) like I was making any progress; that's probably about when my waist to hip ratio broke under 1.0. After that each kg seemed noticeable one way or another.

I don't feel great or anything, but I would be lying if i didn't say I felt `less shit'.

Looking more like a Clark every year (matriarchal ancestry).

A couple of friends (but thankfully not all!) are already saying I "need to eat" like i'm underweight or something. I'm quite some way from that even if i'm below the average these days. I was skinny until I got a job in the city and started having rich lunches and regular Friday drinks together with a shorter commute cycle and it's slowly accumulated since.

The All Fat Diet

SBS has had a series of diet/health related shows on recently and out of curiosity I looked up some of the stuff i've been typically eating.

Apart from a very low total number of joules around 2/3 of the energy was from fats and oils. Protein was low. I think "diet", "light", "lite" and "low fat" foods are complete nonsense so I certainly wasn't having any of that.

Just bread and butter ended up being a big part of what I was eating (and i don't hold back on the butter). A few spuds and some rice at times. Lots of nuts, mostly almonds. Matured cheddar cheese. Lemons and limes as they are in season. Almost no meat, some but not much grog. Lots of coffee (usually black+none) and tea (green, or white+1/4). But I also had what herbs i could find in the garden mostly in tom-yum-like soup, and plenty of chillies and other random stuff along the way (the chillies were important in one critical way beyond making the bland more palatable). Lots and lots of water.

This isn't what I had all of the time but it was in the majority. It's also not something i'll be continuing but it was certainly effective at losing some weight this time.

One thing I did notice is that big dinner means big morning hunger. I'd already noticed this before but i'm now more convinced of it. If you ignore it it just goes away but it's also easy to eat less at night.

A wretched hive of scum and villainy

This episode of gout also enforced just how useless the internet has become as a source of general information. Almost all the gout 'advice' is useless, being generous. Even the stuff from established medical sources wasn't terribly applicable to me due to being neither 70 nor obese.

For example low-fat is always recommended: but it's never specified whether that has anything (at all) to do with gout or merely just being overweight.

Anyway i'm glad I lost the weight, I wonder if it will stay off this time?

Update: Well 6 weeks later and it's still dropping - 72-73Kg now. I'm eating properly now too - probably better than i have for years and certainly not going hungry.