Saturday, 24 October 2015

yay summer

Just got back from the beach. Should've had a camera - water was literally like glass. It was overcast so the horizon vanished in places which is always a nifty effect. Water was crystal clear. It is also still pretty cold though and once you go out past the sand-bar things get a bit nippy. Had the cold water all to myself: there were a good few people out and about but almost nobody went beyond the shoreline unless they were on some floating craft. The closest beach is a comfortable 30m ride and between jetties so it's quiet, with no posers or jet-skis (so far?) and the sand is really nice (best sand on any beach i've been on). Should be a nice night too if the stillness keeps up, it's absolutely dead-calm again (@6pm) which we've had a fair bit so far this spring.

Nice to be 'skinny' again too. You don't notice when you're putting it on but it's like i just stopped carrying 2x10kg sacks of flour (and a carton of milk!) around with me everywhere, which is definitely noticeable now it's gone. Not that it ever stopped me getting my gear off at the beach, but it feels better losing the dugong. It's like winding the clock back about 15 years, well apart from the mental and physical scars, aches and pains, and now all-grey hair. Although i'm still getting used to looking down at my skinny knees when i'm cycling. I got some new shorts a couple of weeks ago and they already feel loose - but i think they were a pretty generous "Size 32" to start with.

Of course I should be having a beer or nice limey g&t right about now but the gout is already niggling (probably from a couple of beers i had Monday with a visitor, first time in weeks) - my toe was even a bit sore after the first dip. Sigh. More fucking tea I guess. For me it seems the main cause is just not drinking enough water (or actually not weeing enough) - and enough means at least 3L per day just to start with. Which gets to be a bit of a drag and if i'm doing physical things i'm not sure I can even drink enough to offset the sweating.

It makes going out a pretty abysmal experience because i just don't enjoy myself. Actually I haven't been enjoying myself much going out the last few years but a couple of beers at least made it passable for the occasional highlight. I can have a few but might suffer later so usually don't risk it. Not that i've had much opportunity anyway but i've been almost dreading when they do turn up - i'd rather just garden or find some way to pass time by myself lately. Maybe it'll pass but I don't care if it doesn't. Speaking of which someone just called and i turned him down, but i might change my mind in the next hour. I was all settled in after the trip to the beach and don't feel like dressing up to hit the city, if i can even find clothes that still fit. I never like going out on Saturday anyway as it's too obnoxious and busy and the crowd a bit too keen, and i'm a little underslept and too cranky to put up with that shit anymore.

I've been eating a lot of chillies at home - i'm getting pretty steadily through my stash of frozen habaneros from a few years ago so hopefully I don't run out before something else comes online. I might have to buy an advanced seedling as growing from seed here without a greenhouse takes a very long time due to the cold nights. I finally finished a bottle of Blair's "Possible Side Effects" sauce this week - i can't remember when i opened it but it's years past its use-by date although I didn't use it for some time. It's really expensive here in AU so i kept it anyway and after i topped it up with some lime juice it tasted better. It's not the nicest flavour but it's ok, and it's hot. Now onto the "Ultimate Insanity" sauce which has been in the fridge too long ... But due to just not eating much anymore and the palate destruction of lots of chillies often with limes or lemons ... eating out has become super-bland and quite unappealing.

Work has been quite taxing (in a good way, although i'm a little blah on the whole need to work right now) so i haven't been coding much outside of that. I'm not watching much TV. I watch a small amount of twitch/youtube stuff but not that often. I've been playing PS4 a bit more - although it's mostly DRIVECLUB and Res0Gun. Res0Gun is just a brilliant game and although I'm still pretty mediocre at it the more you play the better you get - it's very pinballish. I turn up my new amp enough to feel the explosions. Yum. The other regular is DC where I do some single-events, community challenges, or occasionally tour events. None of my friends have it so I just see what's there (one just got a ps4 but he's not into car games). I've started doing many-lap races to learn some tracks and cars and with sped-up time it makes them pretty interesting once you get into the swing of things. With the new PS communities I played a bit with the photomode and uploaded some for something to do. I got a couple of messages so I guess somebody saw them but it's a little inflexible at the moment - something like a web forum with each 'community' a single topic. I even tried streaming on twitch today, which at least confirmed my router didn't crap-out, but i'll have to be in the right mood to try that again. And check with a GNU box how it looks.

I finally finished the last book i was reading - it really dragged on, and/or i kept falling asleep on the last bit. I dunno some dreary battle against evil by characters i didn't give a shit about (with a weird and entirely unnecessary prologue which explained what happened afterwards). And then straight onto the next of the 12 part(?) epic it is part of. This could take a while ...

I guess it's time to decide on what to do this fine and warm evening.

I can grab my tea and go wander out in the garden in a pair of shorts until the grass gets too cool and then watch some shit on TV or play a game or just go to bed ...

Or get tarted up enough to go to a pub (not much tarting admittedly) and head to the currently trendy part of town to drink water and hang out with my leery mate and his wanker friends (actually only one is obnoxious, if he's even there) while they get drunk and perve on the chicks half their age. And then ride home afterwards probably feeling miserable (it's just happened the last few times).

Yeah I already decided when he called. Time to go see what the bugs have been eating if it hasn't cooled down too much while i've been here. Still dead calm at least.

Friday, 23 October 2015

Catching up

Not much going on so today here's a diary entry ...

I've added a few little things to ZCL - it's coming along quite nicely, I should probably work toward another release. I'm slowly adding the functional-like stuff into it, I decided to go with 'q.offer()' as the enqueue function, and 'ofXX()' as the factory methods. Together with the garbage collector support it does offer some interesting possibilities for code-reuse but i'm still experimenting with it in practice.

I needed to access a webcam so i added another api to jjmpeg (but i might move it) which just wraps v4l2 devices directly from the file descriptor (no library). Actually I did that a while ago but have slowly been filling it out as I needed more functionality. OTOH I started looking into the total snot-show that webcam access is on microsoft platforms and decided to give up - you can't even build the media-framework libraries with mingw-w64 as far as I can tell and you need vs of some form just to get the "system" headers. Ugh.

However I did find the webcam-capture library which has already solved these problems. It's probably what i will look at as a fallback, but on linux efficiency is a bit low. The simplest webcam dump to JavaFX with my library generates almost no garbage (it provides static access directly to the driver buffers) and uses 50% less cpu time compared to using the low-level interfaces it provides despite a pretty expensive YUYV conversion step. The high level swing one is 4x the cpu overhead.

Along the way I found that it was actually using the videoInput "library" but I couldn't get that to link (cross compile at any rate, it should work with the ms sdk) - and in any event that just uses the directshow stuff which I had working ... i dunno, years ago. But the driver no longer works for the webcam i'm using and i'd have to buy it, ... so yeah that can wait.

And that ultimately led me to openimaj which probably would've saved me doing almost all that myself, although i would've needed to understand it anyway for OpenCL translation. And maven, ffs. But I guess I should at least have a look.

Discovery of useful software isn't that easy these days with so much noise - even if you go looking which I can't say I did ...

I've also been using a small amount of OpenGL and interoperating with OpenCL. It's become all a bit ... naff, and JOGL has been necessarily messed up to support all this nafficity. Pity I had to do this now rather than in a couple of months otherwise I would be looking at vulkan instead but with any luck that will be out soon enough to move to it before I need to get too far into GL (with the steammachines in november?). I'm sure microsoft will find some way to totally fuck-up its cross-platform parts again though. My current thinking is that i will write a java binding for it once I have my hands on it (if only just-because) but I haven't looked into it at all so far. But removing the static per-thread state should make it a lot saner.

For now I have some simple classes to do some off-screen rendering, and some OpenCL interop which is enough for what I want. Access to an output texture in JavaFX would be nice but it sounds like this is just not going to happen. Although one would expect a vulkan backend to be done at some point it will probably suffer the same hiding issues (well, with good reason I guess). If I really needed more performance i'd just use another toolkit - which is sad as that doesn't appear to be the intended vision of the javafx designers.

On that interop I had to fill out the extension mechanism in ZCL. I followed the prototype I'd created earlier. Currently each extension is provided by a different CLExtension class. It holds a pointer which in C-land is a function table resolved on the platform, and each platform object manages these. At first I was just going to use this as the mechanism for accessing extensions but it quickly becomes messy - you have to find the platform the object belongs to and in some cases this requires multiple queries (e.g. q.getDevice().getPlatform()). One approach I tried was to hide that by providing the extension methods directly on the object they extend - e.g. new CLContext or CLCommandQueue methods. These then manage looking up the extension and invoking the correct method for the given object. The details still need to be resolved but only once per object and it's all handled java-side.

There's a bit more behind the original mechanism than just code tidiness for the extensions - they could potentially be loaded at runtime, or written separately from the core library. But on reflection how useless is that? The problem with this approach is each extension has its own object - this is good and bad in that eventually you end up with a table required per context, queue, device, or whatever.

I think putting the extension methods on the target object is correct and after that the details don't really matter so much since it's an internal detail. But (on the fly design) I guess i should just maintain a CLPlatform reference on each object which can be extended and handle it that way. The extension objects will still be per-extension which keeps a cleaner namespace but they only need to be set per-platform which doesn't happen often. I'm pretty sure all objects have a 1:1 platform relationship, that would be the only thing to throw a spanner in the works; but the whole extension mechanism wouldn't work if that were the case.
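
A rough sketch of that arrangement, using stand-in classes rather than the real ZCL types. Everything here (SketchPlatform, SketchQueue, GLSharingSketch, acquireGLObjects) is a made-up name for illustration only: each extensible object keeps a platform reference, the platform resolves each extension once and caches it, and the extension method sits on the object it extends.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Stand-in for a per-extension object; in C-land this would wrap a function table.
abstract class SketchExtension {
    final SketchPlatform platform;
    SketchExtension(SketchPlatform platform) { this.platform = platform; }
}

// Hypothetical extension with a single entry point.
class GLSharingSketch extends SketchExtension {
    GLSharingSketch(SketchPlatform platform) { super(platform); }
    String acquire(SketchQueue q) {
        return "acquire on " + q.name + " via " + platform.name;
    }
}

// Each platform resolves an extension once and then reuses it.
class SketchPlatform {
    final String name;
    private final Map<Class<?>, SketchExtension> extensions = new HashMap<>();
    int lookups = 0; // how many times an extension had to be resolved

    SketchPlatform(String name) { this.name = name; }

    <T extends SketchExtension> T getExtension(Class<T> type, Function<SketchPlatform, T> create) {
        SketchExtension e = extensions.get(type);
        if (e == null) {
            lookups++;
            e = create.apply(this);
            extensions.put(type, e);
        }
        return type.cast(e);
    }
}

// An extensible object just holds its platform.
class SketchQueue {
    final String name;
    final SketchPlatform platform;

    SketchQueue(String name, SketchPlatform platform) {
        this.name = name;
        this.platform = platform;
    }

    // The extension method lives on the object it extends.
    String acquireGLObjects() {
        return platform.getExtension(GLSharingSketch.class, GLSharingSketch::new).acquire(this);
    }
}
```

Two queues on the same platform share the one extension object, so the lookup cost is paid per-platform rather than per-object.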

Somewhere along this journey i came across some C++ code for something, I can't remember what it was. It was how I find most C++ code - it's been so over-engineered almost all of the actual lines of code are just boilerplate. The workings are so hidden I gave up trying to find them. It's just as bad in Java land where everyone wants to write a fucking framework before they even get started. Cut the bullshit and get to the point. C++ shows its heritage as being born from a time when "Software Engineering" was going to solve the world's software problems by taking the programmer out of programming; using UML and CASE tools and auto-generating everything. It really shows, it is not a good language. This craziness was at its peak just as I was going through uni and it's done its fair share of damage to the world and clearly continues to if abominations like C++ still exist.

That'll do for now.

Friday, 2 October 2015

OpenVX

Here's something I missed being out of the loop for the last good while: OpenVX.

Although it appears so did the rest of the industry? Another standard sent to rot?

I'll have to have a closer look though; if it is well thought out it will be worth borrowing a few ideas at the least as it's a problem i've come up with multiple solutions for myself and it's always useful to get a fresh perspective.

Or maybe it'll give me something to play with along with ZCL either as a binding and/or a pure Java (re)implementation for prototyping.

I wonder how far away vulkan is for the rest of us.

Update: Some movement has occurred on the HSA front recently. I suppose just some more announcements or partnerships or something. There's a lot of technology and software that needs to come together to get anywhere and it's taking its time. I wonder if it'll ever get traction though. I guess it solves one of the problems with OpenCL - you actually have to program parallel software which is a skill few have the time to learn and fewer companies have the budget to fund. At least it has the potential to stay in the compiler layer, hidden from your typical code-monkey.

OpenCL notes

As a tangentially related observation I've been working on some OpenCL stuff of late and I came across a problem I just couldn't get to run very fast on the dev GPU. I ended up resorting to using an OpenCL/CPU kernel and another queue but along the way i tried a couple of other interesting things.

One was using a native kernel written in Java - but this still requires a CPU driver / queue since the GPU driver i'm using doesn't support native kernels. ZCL has a fairly simple interface to this:

public interface CLNativeKernel {

    /**
     * Native kernel entry point.
     *
     * @param args arguments, with any CLMemory objects replaced with
     * ByteBuffer.
     */
    public void invoke(Object[] args);
}

public class CLCommandQueue {
...
public native void enqueueNativeKernel(
    CLNativeKernel kernel,
    CLEventList waiters,
    CLEventList events,
    Object... args) throws CLException;
}

The JNI hides the details but it behaves the same way as the C code whereby any memory objects in the argument list are replaced by pointers (ByteBuffer here). Not sure if i'll keep the varargs prototype because it is inconsistent with every other enqueue function and only saves a little bit of typing. I'll review it when i look at the functional stuff i mentioned in the last post.

Which can be used efficiently and conveniently in Java 8:

  CLBuffer points = cl.createBuffer(0, 1024 * 4);
  q.enqueueNativeKernel((Object[] args)-> {
    // args[0] = ByteBuffer = points contents
    // args[1] = Integer = 12
    // args[2] = Integer = 24
  }, null, null,
  points, 12, 24);
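
To show what the body of such a kernel might look like, here is a small sketch that can be run without OpenCL. It uses a local copy of the interface and calls invoke() directly in place of the queue and JNI, with a plain direct ByteBuffer standing in for the buffer contents. The FILL kernel and the meaning given to the two integers (a float offset and a count) are assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Local copy of the interface from the post so the sketch compiles on its own.
interface CLNativeKernel {
    void invoke(Object[] args);
}

class NativeKernelSketch {
    // Hypothetical kernel: write 1.0f to `count` floats starting at float index `offset`.
    static final CLNativeKernel FILL = (Object[] args) -> {
        // Memory objects arrive as ByteBuffer, everything else as the boxed value.
        ByteBuffer points = ((ByteBuffer) args[0]).order(ByteOrder.nativeOrder());
        int offset = (Integer) args[1];
        int count = (Integer) args[2];

        for (int i = 0; i < count; i++)
            points.putFloat((offset + i) * 4, 1.0f);
    };

    public static void main(String[] argv) {
        // Stand-in for the contents of a 1024-float buffer.
        ByteBuffer points = ByteBuffer.allocateDirect(1024 * 4);

        // Called directly here; in the real thing the queue would do this.
        FILL.invoke(new Object[] { points, 12, 24 });

        System.out.println(points.order(ByteOrder.nativeOrder()).getFloat(12 * 4));
    }
}
```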

Since my prototype harness didn't have a CPU queue until I subsequently added it my first approach was to emulate a native kernel using event callbacks and user events. It actually worked pretty well and was about the same running time, although it took a bit more (fairly straightforward) code to set up.

One approach I took was to have two queues - the primary 'gpu' queue where the work is done, and another one used for the memory transfer and rendezvous point.

  // setup CLEventLists and the user event

  // gpu part of work
  gpu.enqueueXXKernel(..., gpudone);
  // prepare for cpu component
  work.enqueueReadBuffer(..., gpudone, readdone);
  // prepare for gpu again
  work.enqueueWriteBuffer(..., cpudone, writedone);
  // make sure q is ready
  gpu.enqueueMarkerWithWaitList(writedone, null);

  memdone.setEventCallback(CL_COMPLETE, (CLEvent e, int status) -> {
    // do the work
    cpudoneevent.setUserEventStatus(CL_COMPLETE);
  });

In this case all the enqueue operations are performed at once and events are used to synchronise. This simplifies the callback code a little bit. Now i'm looking at it there's probably no need for the separate queue if the gpu queue is synchronised with it anyway. (like with most of these examples it is a summary of what i came up with, but not the full sequence of how i got there which explains some of the decisions).

This is a trivial approach to ensuring the 'gpu' queue behaves as the caller expects: that is, as if the work was performed in sequence on the queue and without having to pass explicit events. I'm using the read/write interfaces rather than map/unmap or otherwise mostly out of habit, but the data in question is quite small so it shouldn't make much difference either way.
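
The shape of that dependency chain can be tried out without any OpenCL at all. The sketch below is only an analogy: it uses CompletableFuture as a stand-in for events and is not the ZCL api. Every stage is set up before anything runs, the 'user event' is a future completed by hand from the callback, and the write and marker stages only proceed once that has happened.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class EventChainSketch {
    static List<String> run() {
        List<String> log = Collections.synchronizedList(new ArrayList<>());

        // The "user event": completed explicitly by the callback.
        CompletableFuture<Void> cpudone = new CompletableFuture<>();
        // Stands in for the queue actually starting work.
        CompletableFuture<Void> start = new CompletableFuture<>();

        // Everything is "enqueued" up front; the futures only order it.
        CompletableFuture<Void> gpudone = start.thenRun(() -> log.add("gpu kernel"));
        CompletableFuture<Void> readdone = gpudone.thenRun(() -> log.add("read buffer"));
        CompletableFuture<Void> writedone = cpudone.thenRun(() -> log.add("write buffer"));
        CompletableFuture<Void> marker = writedone.thenRun(() -> log.add("marker"));

        // The callback on read completion does the cpu work and fires the user event.
        readdone.thenRun(() -> {
            log.add("cpu work");
            cpudone.complete(null);
        });

        start.complete(null);
        marker.join();
        return log;
    }

    public static void main(String[] argv) {
        System.out.println(run());
    }
}
```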

And FWIW for this problem ... this approach or the java NativeKernel one actually runs a tiny bit quicker than using the OpenCL/CPU device let alone the GPU (all wall-clock time on the opencl q).

I had to make some small tweaks to the CLEventList code to make this all work and to tie properly into the garbage collection system. Mostly this was adding a CLEvent array rather than just using the pointer array and fixing the implementation. I kept the pointer array to simplify the jni lookup. I also had to have construction go through the same mechanism as the other CLObjects so they retain global reference uniqueness. This greatly simplifies (i.e. completely removes) reference tracking which is always nice with async callbacks. I think it should all "just work"; it does from the Java side - but i need to check from the OpenCL side of things whether actions like setUserEvent() add an explicit reference bump.

This is a prime example of what HSA should solve well, but for now this is what i've got to work with.

I've been so busy with details i haven't had a chance to even look at any OpenCL 2.0, let alone OpenVX, HSA, or much else. And frankly the spirit is just not that willing of late. Spring is just the latest of a litany of excuses for that.

Thursday, 1 October 2015

OpenCL lambda enqueue

Just had a thought on an alternative api for CLCommandQueue in zcl. No this has nothing to do with lambda calculus in OpenCL.

An inconvenience in the current api is that all the enqueue functions take a lot of arguments, many of which are typically default values. This can be addressed using function overloading but this just adds additional inconvenience as there are also simply a lot of functions to overload. A related issue is things like extensions can add additional entry points which are object-orientedly resident on the queue object but placing them there doesn't necessarily fit.

And finally new compound operations need to be placed elsewhere but also fit a similar semantic model of enqueueing a task to a specific queue.

So the thought is instead to use java's lambda expressions to create queueable objects which know how to run themselves, and then at least the waiters/events parameter overload can be handled in one place.

So rather than:

// some compound task
  public void runop(CLCommandQueue q, CLImage src, CLImage dst,
      CLEventList waiters, CLEventList events) {
     ... enqueue one or more jobs ...
  }
  public void runop(CLCommandQueue q, CLImage src, CLImage dst) {
     runop(q, src, dst, null, null);
  }
  public void runop(CLCommandQueue q, CLImage src, CLImage dst,
      CLEventList event) {
     runop(q, src, dst, null, event);
  }

// usages
 runop(q, src, dst, waiters, events);
 runop(q, src, dst, events);
 runop(q, src, dst);

I can do:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);
}

// the creation (only one required)
 public CLTask of(CLImage src, CLImage dst) {
   return (q, w, e) -> {
     ... enqueue one or more jobs ...
   };
 }

// usages
 q.run(op.of(src, dst));
 q.run(op.of(src, dst), events);
 q.run(op.of(src, dst), waiters, events);

This could extend throughout the rest of the api so that for example a CLBuffer would provide its own read task factories:

  public class CLBuffer {

    public CLTask ofRead(byte[] target) {
      return (q, w, e) -> {
        q.enqueueReadBuffer(this, true, 0, target.length, target, 0, w, e);
      };
    }
  }

// usages
  q.run(buffer.ofRead(target));
  q.run(buffer.ofRead(target), events);
  q.run(buffer.ofRead(target), waiters, events);

vs

// typical usage (without overloading)
  q.enqueueReadBuffer(buffer, true, 0, target.length, target, 0, null, null);
  q.enqueueReadBuffer(buffer, true, 0, target.length, target, 0, null, events);
  q.enqueueReadBuffer(buffer, true, 0, target.length, target, 0, waiters, events);

I think this would provide a way to add the convenience of overloading without a method count explosion. But the real question is whether it would actually improve the api in any meaningful way or merely make it different. Probably at this point it's a tentative yes on that one for many of the same reasons lambdas are convenient such as encapsulation and reuse.
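
The q.run() side isn't spelled out above, so here is a minimal sketch of how it might look. It uses stand-in types rather than the real zcl classes: QueueSketch, TaskSketch, EventListStub, CopyOp and enqueueCopy are all made up for illustration. The point is just that the waiters/events defaults are written once on the queue, and each operation needs only one factory.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for CLEventList; only carries a name for the log.
class EventListStub {
    final String name;
    EventListStub(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

// Same shape as the CLTask interface in the post.
interface TaskSketch {
    void enqueue(QueueSketch q, EventListStub w, EventListStub e);
}

class QueueSketch {
    final List<String> log = new ArrayList<>();

    // The only place the waiters/events defaults are spelled out.
    void run(TaskSketch task) {
        task.enqueue(this, null, null);
    }

    void run(TaskSketch task, EventListStub events) {
        task.enqueue(this, null, events);
    }

    void run(TaskSketch task, EventListStub waiters, EventListStub events) {
        task.enqueue(this, waiters, events);
    }

    // Stand-in for a real enqueue call; it just records what it was given.
    void enqueueCopy(String src, String dst, EventListStub w, EventListStub e) {
        log.add(src + "->" + dst + " w=" + w + " e=" + e);
    }
}

class CopyOp {
    // One factory per operation, no overloads required.
    static TaskSketch of(String src, String dst) {
        return (q, w, e) -> q.enqueueCopy(src, dst, w, e);
    }
}
```

Any new compound operation then gets all three call forms for free by returning a TaskSketch.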

There are some issues of resolving state at point-of-execution and threads but these are already an issue with OpenCL code to some extent and definitely with lambdas in general.

One could keep going:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);

  public default void on(CLCommandQueue q) {
    enqueue(q, null, null);
  }
  public default void on(CLCommandQueue q, CLEventList w, CLEventList e) {
    enqueue(q, w, e);
  }
}

// usage
 buffer.ofRead(target).on(q);

Despite this having the benefit of layering in isolation above the base api I think it starts to get a little absurd and turns into "much of a muchness" deckchair shuffling.

Although this addition is probably useful:

// the interface
interface CLTask {
  public void enqueue(CLCommandQueue q, CLEventList w, CLEventList e);

  public default CLTask andThen(CLTask after) {
     return (q, w, e) -> {
        enqueue(q, w, e);
        after.enqueue(q, w, e);
     };
  }
}

// usage
 q.run( buffer1.ofRead(target).andThen(buffer2.ofWrite(target)) );
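
The andThen() above hands the same waiters and events to both tasks, which is fine as long as the queue runs things in order. A possible variation, sketched here with stand-in types (ChainTask, LogQueue, EventsStub, andThenWaiting and named are all hypothetical), threads an intermediate event list between the two so the second task explicitly waits on the first. Whether that is worth the extra event is a design choice i haven't settled.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for CLEventList.
class EventsStub {
    final String name;
    EventsStub(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

// Stand-in for CLCommandQueue; just records what was enqueued.
class LogQueue {
    final List<String> log = new ArrayList<>();
}

interface ChainTask {
    void enqueue(LogQueue q, EventsStub w, EventsStub e);

    // As in the post: both tasks share the same waiters and events.
    default ChainTask andThen(ChainTask after) {
        return (q, w, e) -> {
            enqueue(q, w, e);
            after.enqueue(q, w, e);
        };
    }

    // Hypothetical variant: the second task waits on an intermediate event list.
    default ChainTask andThenWaiting(ChainTask after) {
        return (q, w, e) -> {
            EventsStub mid = new EventsStub("mid");
            enqueue(q, w, mid);
            after.enqueue(q, mid, e);
        };
    }

    // Helper for the demo: a task which only logs its arguments.
    static ChainTask named(String name) {
        return (q, w, e) -> q.log.add(name + " w=" + w + " e=" + e);
    }
}
```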

Actually I didn't really intend it as an outcome but this also becomes a lot more usable if the resources in question are automatically reclaimable via gc as per my last post. Whole state and work spaces can be retained and reused through nothing more than a CLTask reference.

I think i've convinced myself of the utility now but either way it takes very little code to try it.

OpenCL garbage

I was working on some higher level containers for managing OpenCL stuff and came to the conclusion that I wanted to add automatic resource reclamation to zcl - it was either that or fill a whole hierarchy of objects with reference counting. But reference counting is slow, error-prone, and a big mess to write in so it isn't at all attractive when there is an alternative.

I'd already done it in jjmpeg but i wasn't really keen on the way i implemented it there and wanted to see if i could come up with a more streamlined solution. Like when I did it for jjmpeg I started with this article about JavaSE finalisation and using weak reference queues.

I think the solution I came up with will work ... and it turned out to be rather simple in the end.

Previously all CLObjects were a simple lightweight pointer handle with all the details passed to the C functions. They all have an init(pointer) constructor which was called directly from the JNI layer. Duplicate objects referencing the same resource were not an issue so I just let it happen. Well it's easy to break but if you treat objects like the C pointers they are and know that dangling references are possible then it's not unsolvable.

But for GC to work the references need to be unique. This is fairly easy to guarantee as the resources are just memory pointers - which are guaranteed to be unique and unchanging. So rather than the JNI layer invoke the constructors directly I just call a factory method with a type index which lets me move some of the code into Java - it isn't significantly simpler but it is more flexible.

For the reference queue to work properly I need to store them in a container anyway so this conveniently meshes with using a hashtable to uniqify the objects.

  static CLObject toObject(int ctype, long p) {
    CLObjectHandle h = referenceMap.get(p);

    if (h != null)
      return h.get();

    return classTable[ctype].newInstance(ctype, p);
  }

My first attempt passed the Class through (this is how i did it in JNI) but I changed it to an integer. It makes the JNI a bit easier and having the type as an integer simplifies the release call (OpenCL api isn't OO and has per-type release functions). Being able to identify the object fully using primitive types also lets me freely use them without polluting the reference tree; which is critically important when dealing with gc.
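
A self-contained sketch of the uniquifying table, using plain java.lang.ref classes and a stand-in NativeObject in place of the real CLObject hierarchy (the names and the single concrete class are assumptions for the example). It also checks for a handle whose referent has already been cleared. What should happen to the stale handle in that case is left open here.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

class NativeObject {
    static final ReferenceQueue<NativeObject> referenceQueue = new ReferenceQueue<>();
    static final Map<Long, Handle> referenceMap = new HashMap<>();

    final int ctype;
    final long p;
    final Handle h;

    private NativeObject(int ctype, long p) {
        this.ctype = ctype;
        this.p = p;
        this.h = new Handle(this, ctype, p);
    }

    // The handle identifies the resource using primitives only, so it
    // never holds a strong reference back to the wrapper.
    static class Handle extends WeakReference<NativeObject> {
        final int ctype;
        final long p;

        Handle(NativeObject referent, int ctype, long p) {
            super(referent, referenceQueue);
            this.ctype = ctype;
            this.p = p;
            referenceMap.put(p, this);
        }
    }

    // Return the existing wrapper for a pointer, or create a new one.
    static synchronized NativeObject toObject(int ctype, long p) {
        Handle h = referenceMap.get(p);
        NativeObject o = (h != null) ? h.get() : null;

        // A missing or cleared handle means there is no live wrapper.
        return (o != null) ? o : new NativeObject(ctype, p);
    }
}
```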

Now comes the bit which i fucked up in jjmpeg (well the biggest bit). Each object is represented by 4(!) classes. An autogenerated native abstract class which includes the static native method prototypes and a hand-written native concrete class which implements any type-specific dispose or construction semantics. Then there is an autogenerated abstract public class which includes all the autogenerated methods again - this time invoking all the methods on the native class after looking up the object pointer. And finally a hand-written public concrete class which includes constructors, helpers, and any other special cases where the details are better hidden.

This is just a lot of code - every public method on the "java" class ends up calling a native method on the "native" class so every method needs at least two implementations. This was the main driver for ZCL simply using a single JNI implementation and foregoing this redundant juggling of the call stack just to insert the resource pointer into the call. In most cases in ZCL the public api is just the native method and it needs no redundant wrapper.

This time I just added a single general-purpose CLObjectHandle weak reference type which is used by all instances to track the native resource. It just holds the pointer (and the ctype) and implements the release. I just add one of these to each CLObject in one place.

  public abstract class CLNative {
    final long p;

    protected CLNative(long p) {
      this.p = p;
    }
...
  }

  public abstract class CLObject extends CLNative {
    final CLObjectHandle h;

    protected CLObject(int ctype, long p) {
      super(p);
      h = new CLObjectHandle(this, ctype, p);
    }

...
    static class CLObjectHandle extends WeakReference<CLObject> {
      long p;
      int ctype;

      CLObjectHandle(CLObject referent, int ctype, long p) {
        super(referent, referenceQueue);
        this.p = p;
        this.ctype = ctype;
        referenceMap.put(p, this);
      }

      void release() {
        if (p != 0) {
          referenceMap.remove(p);
          CLObject.release(ctype, p);
          p = 0;
        }
      }
    }
...

  }

This and a bit of house-keeping is all that is required.

Having release be idempotent allows explicit release mechanisms to remain - for those cases where you can't afford to let the native resource management be at the whim of the garbage collector. For this reason i may also have to move the native pointer resolution in the JNI from a CLNative.p field lookup to resolving it via the handle. I need to investigate the cost of doing this first, and also whether explicit release like this will actually work in practice (e.g. if you release an object with more than one reference, does it fuck up?). Doing this would also let me use the correct integral type if I felt the need by just creating two different CLObjectHandle classes (32/64) and resolving sizes in the JNI code.
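
For the house-keeping part, a minimal sketch assuming a stand-in Resource class and a counter in place of the native release call (none of these names are from zcl). It shows an idempotent release() on the handle, an explicit release on the object, and a drain() which releases whatever the collector has queued. When drain() gets called, e.g. from a daemon thread or whenever a new object is created, is left as an open choice.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class Resource {
    static final ReferenceQueue<Resource> queue = new ReferenceQueue<>();
    static final Map<Long, ResourceHandle> live = new ConcurrentHashMap<>();
    // Counts stand-in native releases; not thread-safe, for the demo only.
    static int released = 0;

    final ResourceHandle h;

    Resource(long p) {
        h = new ResourceHandle(this, p);
    }

    // Explicit release remains available alongside gc.
    void release() {
        h.release();
    }

    static class ResourceHandle extends WeakReference<Resource> {
        long p;

        ResourceHandle(Resource referent, long p) {
            super(referent, queue);
            this.p = p;
            live.put(p, this);
        }

        // Idempotent: a second call does nothing.
        synchronized void release() {
            if (p != 0) {
                live.remove(p);
                released++; // a real binding would call the native release here
                p = 0;
            }
        }
    }

    // House-keeping: release every handle the collector has queued.
    // Returns the number of handles polled.
    static int drain() {
        int n = 0;
        Reference<? extends Resource> r;

        while ((r = queue.poll()) != null) {
            ((ResourceHandle) r).release();
            n++;
        }
        return n;
    }
}
```

Because release() zeroes the pointer, an object that was released explicitly and later collected is only released once.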

There are some potential problems where you resolve an object for the first time via a non-referencing api (for example clGetProgramInfo(CL_PROGRAM_CONTEXT) and the like) and then let the reference expire. But this shouldn't normally be a problem since you would have to get the context before creating the program and are going to be keeping it around for the lifetime of the program and thus only one xxRelease is ever invoked. And this should normally hold for everything else too. If it turns out to be an issue I have mechanisms I can use to address it, from adding an explicit object reference to the given objects (e.g. a CLContext to each CLProgram created), to adding phantom reference bumps on specific apis.

It's actually a devilishly difficult thing to test and verify: even once you know the exact reference counting semantics of every OpenCL api the interaction with the JVM will hide faults.

I haven't explored further but having unique objects and gc lets me freely cache local copies of resource handles for convenience or efficiency and so on. It really simplifies using the library enough as it is.

The next zcl release will include this as well as a couple of bug fixes and some other things which make it easier to use. Dunno when that might be though.