Tuesday, 29 September 2015

BOOPSOGL time waster

The long story: I finally replaced my AVR (hi-fi amplifier) a couple of weeks ago after blowing it up 1-2 years ago and the new one has some network features. There's a web page to control it and a phone app - but both are pretty shitful. Actually the app isn't all bad but it's a pain for the things I use it for most, volume and mute: the volume knob is clumsy and the way the app handles screen blanking means mute isn't as easily accessed as it should be. I played a bit with the web app and worked out some of its terrible 'xml-ish' remote control protocol and wrote a little application to perform both - but javafx is way too fat for this. I was recently looking into some opengl stuff and came across a trivial example which uses GLX to set up the screen - it also had some simple X11 Display code so I thought I could just write a super-lightweight Xlib tool for this. But then you need at least a little bit of 'toolkit' to make this doable ...

I'd had a blast of Resogun and DRIVECLUB earlier but TV was dull so I started poking around some trivial C struct-based object system but then realised how much i'd forgotten since GObject and CamelObject. And then realised all the boilerplate that would be needed to even use such a thing, so I went back to my RKRM: Libraries and looked into cloning BOOPSI instead. The only boilerplate that needs is setting a dispatch method, although the dispatch method itself ends up being fat as it fulfills the role a vtable would.

BOOPSI (basic object oriented programming system for intuition) was the AmigaOS 2 solution to general 'objects in C' which was apparently based on Smalltalk (Amiga libraries and devices are also object oriented but are not as general). Everything is implemented using a programmed dispatch call stack rather than vtables. It's not particularly fast but it is very small and flexible and it does have one rather interesting benefit not found in C or C++ - the ability to change any class in the hierarchy without a full recompile whilst still retaining single-instance memory blocks.
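To make the dispatch idea concrete, here is a minimal sketch in C of a dispatcher-based object system in that spirit. Every name in it (Class, Msg, do_method, do_super_method, inst_data, the Counter and Doubler classes) is made up for illustration; it is not the AmigaOS API and not the code I wrote. Each class supplies one dispatch function, unhandled messages are passed up the chain, and instance data offsets are worked out at run time so an object is still one memory block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative names only; this is not the AmigaOS BOOPSI API. */
typedef struct Class Class;
typedef struct Msg { unsigned method_id; } Msg;   /* every message starts with its id */
typedef long (*Dispatcher)(Class *cl, void *obj, Msg *msg);

struct Class {
    Class     *super;        /* parent class, NULL for the root */
    Dispatcher dispatch;     /* the one function a class must supply */
    size_t     inst_offset;  /* where this class's data starts in the object */
    size_t     inst_size;    /* bytes of instance data this class adds */
};

typedef struct Object { Class *cls; } Object;     /* header of every object block */

enum { M_GET_VALUE = 1, M_ADD = 2 };
typedef struct MsgAdd { Msg base; long amount; } MsgAdd;

/* Round sizes up so each class's data stays suitably aligned. */
static size_t align_up(size_t n) {
    size_t a = sizeof(long double);
    return (n + a - 1) / a * a;
}

/* Offsets are computed at run time from the parent, so growing a parent's
   data does not require recompiling its subclasses. */
static void class_init(Class *cl, Class *super, Dispatcher d, size_t size) {
    cl->super = super;
    cl->dispatch = d;
    cl->inst_offset = super ? super->inst_offset + super->inst_size
                            : align_up(sizeof(Object));
    cl->inst_size = align_up(size);
}

static void *inst_data(Class *cl, void *obj) {
    return (char *)obj + cl->inst_offset;
}

/* Send a message to the object's own class. */
static long do_method(void *obj, Msg *msg) {
    Class *cl = ((Object *)obj)->cls;
    return cl->dispatch(cl, obj, msg);
}

/* Pass a message to the parent of cl; the root returns 0 for unknown ids. */
static long do_super_method(Class *cl, void *obj, Msg *msg) {
    return cl->super ? cl->super->dispatch(cl->super, obj, msg) : 0;
}

/* One zeroed allocation holds the header plus every class's data. */
static void *object_new(Class *cl) {
    Object *o = calloc(1, cl->inst_offset + cl->inst_size);
    if (o) o->cls = cl;
    return o;
}

typedef struct CounterData { long value; } CounterData;

static long counter_dispatch(Class *cl, void *obj, Msg *msg) {
    CounterData *d = inst_data(cl, obj);
    switch (msg->method_id) {
    case M_GET_VALUE: return d->value;
    case M_ADD:       d->value += ((MsgAdd *)msg)->amount; return d->value;
    default:          return do_super_method(cl, obj, msg);
    }
}

typedef struct DoublerData { long calls; } DoublerData;

/* Subclass: doubles the amount of every M_ADD, then defers to its parent. */
static long doubler_dispatch(Class *cl, void *obj, Msg *msg) {
    DoublerData *d = inst_data(cl, obj);
    if (msg->method_id == M_ADD) {
        MsgAdd doubled = *(MsgAdd *)msg;
        doubled.amount *= 2;
        d->calls++;
        return do_super_method(cl, obj, &doubled.base);
    }
    return do_super_method(cl, obj, msg);
}

/* Build both classes, add 5 through the subclass and read the value back. */
long boopsi_like_demo(void) {
    Class counter, doubler;
    class_init(&counter, NULL, counter_dispatch, sizeof(CounterData));
    class_init(&doubler, &counter, doubler_dispatch, sizeof(DoublerData));
    void *obj = object_new(&doubler);
    assert(obj != NULL);
    MsgAdd add = { { M_ADD }, 5 };
    Msg get = { M_GET_VALUE };
    do_method(obj, &add.base);
    long value = do_method(obj, &get);
    free(obj);
    return value;
}
```

The switch inside each dispatch function is where the fat ends up: it does the job a vtable would, one case per method.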

The short story: I got a couple of hundred lines into the code which is enough to instantiate objects and define classes together with some core support utilities.

Will I keep poking? I'm slightly curious perhaps but not quite curious enough for that as it gets involved very quickly. Maybe if I use GLX instead of the raw X I was thinking of (BOOPSOGL?). OpenVG? Text rendering is the biggest hassle either way. And layouts, although i've looked at that before.

I guess at least one observation is that back then this stuff looked so fat and cumbersome (albeit a large improvement over base intuition or gadtools), but then yeah, i've seen what else has come since and it really really wasn't.

Friday, 18 September 2015

Nice curves!

Bézier Curves.

Wow what a page.

Thursday, 17 September 2015

empirically corrected-approximate integer division

Although i haven't been posting about it i've been continuing to poke around in bits and pieces of code. Well, a little bit.

I did a bit of OpenCL last week and that was pretty fun. I had enough time to really dig into optimising a particular routine and was down to inspecting the ISA output from the driver. Good stuff. The GCN isa is pretty foreign to me so I had to use small snippets to isolate operations of interest.

For example one construct that comes up repeatedly when parallelising code is using a divide and modulus operator when splitting up a non-work-sized job into work-group sized blocks.

  int block_size = info.block_size;
  for (int id=get_local_id(0); id < limit; id+=64) {
    int block_no = id / block_size;
    int block_index = id % block_size;
    // do work
  }

Where possible one just chooses a power of 2 so this is a simple shift and mask, or integer divide by a constant isn't too bad as it can usually be optimised by the compiler. But this problem required a dynamic block size that wasn't a power of 2.

The solution? Use a floating point multiply by the reciprocal, which can be calculated efficiently or, as here, off-line. The problem is that this introduces enough rounding error to be worthless without some more work.

I must admit I just found the solution empirically here: i had a limited range of possible values so I just exhaustively tested them all against a couple of guesses. Hey it works, i'm no scientist.

  int block_size = info.block_size;
  float block_size_1 = info.block_size_1;   // 1.0f / block_size, calculated off-line
  for (int id=get_local_id(0); id < limit; id+=64) {
    int block_no = (int)(id * block_size_1 + 1.0f / 16384);
    int block_index = id - (block_no * block_size);
    // do work
  }

This replaces the many-instruction integer division decomposition with a convert+mad+convert.

On some work-loads this was a 25% improvement to the entire routine and these 2 lines are in an inner loop of about 50 lines of code.
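For what it's worth, the exhaustive check mentioned above is easy to sketch as plain host-side C. This is a reconstruction under assumptions, not the test I actually ran: the function names are made up, and the range used in the usage note (ids below 1024, block sizes 1 to 64) is only an illustrative guess at a 'limited range'. The bias of 1.0f / 16384 is the one from the kernel snippet.

```c
#include <assert.h>

/* Count the ids in [0, limit) where the float-reciprocal trick disagrees
   with plain integer division and modulus for one block size. */
int count_mismatches(int limit, int block_size) {
    assert(block_size > 0);
    /* Reciprocal computed once, as the host would do off-line. */
    float block_size_1 = 1.0f / (float)block_size;
    int bad = 0;
    for (int id = 0; id < limit; id++) {
        /* Same expression as the kernel: multiply, add a small bias, truncate. */
        int block_no = (int)((float)id * block_size_1 + 1.0f / 16384);
        int block_index = id - block_no * block_size;
        if (block_no != id / block_size || block_index != id % block_size)
            bad++;
    }
    return bad;
}

/* Total mismatches over every block size in [1, max_block_size]. */
int sweep_mismatches(int limit, int max_block_size) {
    int bad = 0;
    for (int block_size = 1; block_size <= max_block_size; block_size++)
        bad += count_mismatches(limit, block_size);
    return bad;
}
```

Calling sweep_mismatches(1024, 64) should come back with zero for that assumed range. The bias only works while the quotient stays small and the block size stays well under 16384, so any other range needs sweeping again before trusting it.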

Well it's been fun to play at this level again - it's ... mostly ... pointless going to this level but it just adds to the toolkit and I enjoy poking. Maybe one day i'll have a job where it's useful.

I gave zcl a go on this as originally I was thinking of trying some OpenCL 2 stuff but I may not bother now. Given the lack of use/testing it was pretty much bug free but I started filling out the API with some more convenient entry points. I also decided to add some more java-array interfaces here and there: they're just too convenient and it hides the mess in the C even if they might not be the most efficient in all cases.

This is the sort of thing i'm talking about:

  float[] data = new float[] { 1, 2, 3, 4, 5 };
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data);

versus:

  float[] data = new float[] { 1, 2, 3, 4, 5 };
  ByteBuffer bb = ByteBuffer.allocateDirect(data.length * 4).order(ByteOrder.nativeOrder());
  bb.asFloatBuffer().put(data);
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data.length * 4, bb);

It's only two fewer lines of code ... but yeah that shit gets old fast. The first is more efficient too because this is a native method and it avoids the copy. In the first case CL_MEM_USE_HOST_PTR throws an exception though, and in the second it works (library call permitting).

The main downside is adding these convenience calls blows out the method count very quickly if you support all the primitive types - which detracts from the ease of use they're supposed to increase.

Another release? Who knows when.

And this week i've been poking at some OpenGL. My it's grown. I'm experimenting using JOGL for this although i'm not a fan of some of its binding choices. It's crossed my mind but i'm pretty sure i don't want to create yet another binding as in a 'ZGL'. Hmm, I wonder if vulkan will clean up the cross platform junk from opengl.

Unfortunately my home workstation seems to have developed a faulty DIMM or something (unrelated note of note).