Thursday, 22 July 2010

Java2D and Float Images

So much for a day off, well I didn't get too wrapped up in work nor too wrapped in coding but I did dabble a bit. It was one of those crappy cold days - no wind, just no sun and a seeping cold that gets into your bones and turns your toes numb and fingers stiff.

I did finally work out one thing, or maybe re-worked it out: how to create Java BufferedImages backed by a floating point buffer (and I'll get the details out of the way first).
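import java.awt.Transparency;
import java.awt.color.ColorSpace;
import java.awt.image.*;

int width = 1024, height = 768;
// Interleaved RGBA: 4 floats per pixel.
float[] data = new float[width * height * 4];

ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_sRGB);
// Non-premultiplied alpha; the TRANSLUCENT / TYPE_FLOAT arguments are an assumption.
ColorModel cm = new ComponentColorModel(cs, true, false,
        Transparency.TRANSLUCENT, DataBuffer.TYPE_FLOAT);
// Pixel stride 4, scanline stride width * 4, bands in R, G, B, A order.
SampleModel sm = new ComponentSampleModel(DataBuffer.TYPE_FLOAT, width,
        height, 4, width * 4, new int[]{0, 1, 2, 3});
DataBufferFloat db = new DataBufferFloat(data, data.length);
WritableRaster wr = Raster.createWritableRaster(sm, db, null);
BufferedImage bimage = new BufferedImage(cm, wr, false, null);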
Each pixel is then stored in the array data[] in the order R, G, B, A. With the backing array, image ops can work directly on the float data (FFT convolution anyone?), or you can get a Graphics2D from the bimage and work with that.
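To make the 'work on the array or work through Graphics2D' point concrete, here is a small self-contained sketch. The class name FloatImageDemo and the helper wrap() are made up for illustration, and the TRANSLUCENT / TYPE_FLOAT colour model arguments are an assumption; the rest follows the setup above.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Transparency;
import java.awt.color.ColorSpace;
import java.awt.image.*;

final class FloatImageDemo {
    /** Wraps an interleaved RGBA float array in a BufferedImage without copying it. */
    static BufferedImage wrap(float[] data, int width, int height) {
        ColorSpace cs = ColorSpace.getInstance(ColorSpace.CS_sRGB);
        // Assumed arguments: translucent, non-premultiplied, float transfer type.
        ColorModel cm = new ComponentColorModel(cs, true, false,
                Transparency.TRANSLUCENT, DataBuffer.TYPE_FLOAT);
        SampleModel sm = new ComponentSampleModel(DataBuffer.TYPE_FLOAT, width,
                height, 4, width * 4, new int[]{0, 1, 2, 3});
        DataBufferFloat db = new DataBufferFloat(data, data.length);
        WritableRaster wr = Raster.createWritableRaster(sm, db, null);
        return new BufferedImage(cm, wr, false, null);
    }

    public static void main(String[] args) {
        int width = 4, height = 2;
        float[] data = new float[width * height * 4];
        BufferedImage img = wrap(data, width, height);

        // Write pixel (1, 0) straight into the array as opaque red,
        // then read it back through the image.
        int i = (0 * width + 1) * 4;
        data[i] = 1f;
        data[i + 3] = 1f;
        System.out.printf("%08x%n", img.getRGB(1, 0));

        // Go the other way: draw through Java2D, then look at the floats.
        Graphics2D g = img.createGraphics();
        g.setColor(Color.BLUE);
        g.fillRect(0, 1, width, 1);
        g.dispose();
        int j = (1 * width) * 4;
        System.out.println(data[j] + " " + data[j + 1] + " " + data[j + 2] + " " + data[j + 3]);
    }
}
```

Because the DataBufferFloat keeps a reference to the array rather than a copy, changes made on either side are visible on the other.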

I was poking at a layered image system, using floating point buffers in RGBA to store everything. I load the image, convert it to the buffers, and run a really crap, really simple multi-pass compositing system to blend them into the display. So I have an image I can scroll around and set the alpha of, but what about drawing?
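A blend pass along those lines can be sketched in a few lines. This is only a guess at the shape of it, not the code behind the post: it assumes non-premultiplied RGBA floats in [0, 1], a plain 'source over' blend and a per-layer opacity, and the names LayerBlend and over() are invented for the example.

```java
final class LayerBlend {
    /**
     * Blends src over dst in place. Both arrays hold interleaved,
     * non-premultiplied RGBA floats; opacity scales the source alpha.
     */
    static void over(float[] dst, float[] src, float opacity) {
        for (int i = 0; i < dst.length; i += 4) {
            float sa = src[i + 3] * opacity;   // effective source alpha
            float da = dst[i + 3];
            float oa = sa + da * (1f - sa);    // resulting alpha
            if (oa == 0f) {
                // Fully transparent result: colour is undefined, store zeros.
                dst[i] = dst[i + 1] = dst[i + 2] = dst[i + 3] = 0f;
                continue;
            }
            for (int c = 0; c < 3; c++) {
                dst[i + c] = (src[i + c] * sa + dst[i + c] * da * (1f - sa)) / oa;
            }
            dst[i + 3] = oa;
        }
    }
}
```

Running one such pass per layer, bottom to top, into a scratch buffer gives the multi-pass composite; the slider simply feeds the opacity argument.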

I recall trying to get float-backed buffers working before and not having much luck, so I was going to look at using Java2D to write to a byte or short buffer and then just converting that over (good enough for what I want). But I did finally work out the float buffers so I don't need to do that - and that's despite the documentation saying that 'TYPE_FLOAT' is just a placeholder. Actually it's even better since I can just attach a BufferedImage to any arbitrary layer's float array and then use the nice Java2D API to write to it directly - there goes most of a 'drawing application'. It only needs a little bit of damaged-area tracking to get this onto the screen efficiently.
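The damaged-area tracking can be as simple as accumulating one bounding rectangle between repaints. The sketch below is an assumption about how that might look (the DamageTracker class and its methods are hypothetical); it only uses java.awt.Rectangle.

```java
import java.awt.Rectangle;

final class DamageTracker {
    private Rectangle dirty;   // null means nothing needs repainting

    /** Records that the given area was drawn on. */
    void damage(Rectangle r) {
        if (r.isEmpty()) {
            return;
        }
        dirty = (dirty == null) ? new Rectangle(r) : dirty.union(r);
    }

    /** Returns the accumulated dirty area (or null) and resets the tracker. */
    Rectangle take() {
        Rectangle r = dirty;
        dirty = null;
        return r;
    }
}
```

The compositor would then re-blend and convert only the rectangle returned by take(). A single bounding box over-paints when two small edits are far apart, which is the price of keeping it this simple.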

Currently i'm still converting the composited float image to INT_RGBA since that is a bit faster than drawing the float-backed image itself, but it isn't a huge difference.
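The conversion step might look something like the following. It is a sketch under assumptions: the output is packed the way BufferedImage.TYPE_INT_ARGB expects (alpha in the top byte), values are clamped to [0, 1], and the names FloatToInt, toArgb() and to8() are invented here.

```java
final class FloatToInt {
    /** Clamps v to [0, 1] and scales it to an 8-bit value. */
    static int to8(float v) {
        return Math.round(Math.max(0f, Math.min(1f, v)) * 255f);
    }

    /** Converts interleaved RGBA floats to packed 0xAARRGGBB ints. */
    static int[] toArgb(float[] rgba) {
        int[] out = new int[rgba.length / 4];
        for (int p = 0, i = 0; p < out.length; p++, i += 4) {
            int r = to8(rgba[i]);
            int g = to8(rgba[i + 1]);
            int b = to8(rgba[i + 2]);
            int a = to8(rgba[i + 3]);
            out[p] = (a << 24) | (r << 16) | (g << 8) | b;
        }
        return out;
    }
}
```

The resulting array can be handed to BufferedImage.setRGB() or written into the int buffer of a TYPE_INT_ARGB image.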

Tada ... a 2-layer image: the photo is at about 70% opacity (set by the slider on the right) with the 'background' showing through, and the top layer was drawn using Java2D (I forgot to turn on anti-aliasing, but that's trivial to add). Actually Java has its own compositing mechanism so I can probably throw those few lines of code away too. Update: Tried this. Way too slow. Never mind.
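For reference, Java's built-in compositing is AlphaComposite set on a Graphics2D. The sketch below shows the idea on ordinary TYPE_INT_ARGB images rather than the float-backed ones; CompositeDemo, blend() and solid() are hypothetical names, and it makes no claim about speed.

```java
import java.awt.AlphaComposite;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

final class CompositeDemo {
    /** Draws top over background with the given opacity into a new image. */
    static BufferedImage blend(BufferedImage background, BufferedImage top, float opacity) {
        BufferedImage out = new BufferedImage(background.getWidth(),
                background.getHeight(), BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = out.createGraphics();
        g.drawImage(background, 0, 0, null);
        // SRC_OVER with an extra alpha factor acts as the layer opacity.
        g.setComposite(AlphaComposite.getInstance(AlphaComposite.SRC_OVER, opacity));
        g.drawImage(top, 0, 0, null);
        g.dispose();
        return out;
    }

    /** Creates an image filled with a single colour. */
    static BufferedImage solid(int w, int h, Color c) {
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = img.createGraphics();
        g.setColor(c);
        g.fillRect(0, 0, w, h);
        g.dispose();
        return img;
    }
}
```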

It doesn't do much, but then it didn't take a lot of code to do it either.

XBMC beagle, GSOC 2010

Well I 'promised' an update on the beagleboard gsoc 2010 xbmc whatsit, and since we've just had the 'mid-terms' and I have some spare time it seems like a good point to poke it out.

The good news of the day is that Tobias passed the midterms well - although I haven't had a huge amount of time to devote to it, he has thankfully worked very well independently. He's been working well with both the xbmc and beagleboard communities, finding relevant experts to aid the task which has let me off the hook quite a bit. He's had to spend a lot of time just on the beagleboard environment which was an unavoidable pain since the hardware arrived late - and xbmc is a mammoth bit of code that takes an age and a half to compile. But most of the code to this point has been changing the rendering system from a game-like render-all loop to a damage-based system - which could be done on a pc. Still bugs, but it's getting there. The patches look nice, and he keeps the committed code building (just as well - it takes hours to build on the target).

He's started on the video overlay system now, so I'm expecting some big improvements. Some initial timing suggests it's spending nearly 60% of its time in the 'gpu' doing YUV conversion (I'm not sure what resolution he's running it at). The video overlay will do that for free, and more in that it reduces the memory bandwidth requirements significantly.

XBMC basically 'runs' on the beagleboard now, but can only play quite low-resolution video and there's a few issues with missing text, but it does run. With a simpler theme and the video overlay work i'm hoping it will at least be at the SD-video media player level. The XM might even manage 720p for simpler video formats like mpeg2. Although out of scope for this stage of the project, there's also the DSP sitting idle at the moment so the hardware is capable of quite a bit more yet.

Lots o threads

I got a new work machine - hence the previous post. That was a short diversion into ms vista 7, which I thankfully didn't need to keep up - I was having massive problems with the nvidia graphics drivers under fedora 13, and problems with my code. But it turned out that it was just my broken code and it crashed just as badly in ms vista 7. Wow what a horrid system they've designed. Move a window to look at something behind it and suddenly it maximises so you can't see what you wanted, the 'file browser manager' thing which seems confused as to what it's trying to be, and probably the worst item - move a mouse over a list and the scroll wheel keeps scrolling the last list that had focus. Not even clicking on the scroll-bar gives it focus and you need to click in the list (often activating it - which you don't want). Ugh. It's like a hollow shell of a tech demo of slightly wacky ideas from GNOME and KDE all wrapped together with a questionably 'pretty' interface (I found it far too spaced out with poor font choices). It kind of looks ok, but there's no meat under it and lots of things don't work quite right. The OS installs pretty fast at least - but you don't get anything that lets you do any work and it just turns into a laborious hunt for some crap that probably doesn't work very well, install, repeat, until you have a remotely usable system. And it still does product registration? Jesus fucking Christ, that's just offensive.


So I had a few problems when I started moving code from the ATI card i've been using to the Nvidia one in the new machine. The compiler is a bit pickier/different about a few things, although iirc that was mostly not auto-casting scalars to vector types in a variable declaration. A bit of a pain but fortunately I don't have too much code yet and it was mostly a mechanical conversion process.

I suppose the main problem I had - had I known it at the time it would've saved me a very long and wasted day or two - was that the nvidia GPUs are much pickier about the code they'll execute. The ATI card doesn't mind some stray memory accesses but the nvidia one just crashes. That is good really since the code is buggy - but unfortunately you get no indicator of why it crashed, or even when it crashed. At some random point after some code you've queued to execute runs you get a random and meaningless (and undocumented/not to spec I might add) error code which says things have stopped working. I was thrown off because the nvidia drivers were a pain to set up - the 'development drivers' just wouldn't run, and the production drivers ran but were a little touchy - if I log out of the session X won't restart. I was also thrown off since adding some debugging code made the routine run too (and since I had it working on the other machine ...).

Anyway now that I know any of these random errors are actually just segfaults it's much easier to deal with without getting a splitting headache. Actually I think I was getting so stressed (or maybe it's because i've been eating all sorts of crap) I spent most of one day with an anxiety induced dizzy spell and headache (ms vista 7 helped there too).

So anyway, the one main routine on which I've been working for the last few weeks got running again and I cleaned it up and whatnot. It's only about 2x faster than on the ATI card (HD 5770, vs GTX 480 IIRC), but the code was 'tuned' for the ATI. Although using the word 'tuned' is being a bit generous really, I just kept trying things and seeing what was faster, since there are zero tools on Linux to perform any detailed profiling. I guess that isn't so surprising - if I coded it right it should be completely memory constrained anyway. I did make some minor changes since the nvidia GPUs support better datatype conversion than Juniper, e.g. loading floats from bytes in 2 instructions, not dozens (it was much faster to load uints directly and convert manually on Juniper, but the other way around on nvidia). Right now I'm taking data and converting to floats and working with that everywhere, which was the right approach on the Juniper arch but might not be on nvidia since it multiplies the memory bandwidth by 4. But there's just not enough time in every day to try everything - I worked over the wet dreary weekend and ended up over 50 hours by COB Wednesday so I'm having a break now. I was supposed to be dropping to 4 days/week this financial year!

I'm still getting to grips with mapping problems efficiently to the GPUs. I've had some success with a more complex approach which copies data to local memory in coalesced accesses and then works from the local memory - which is fast (and pretty much essential on the ATI with no cache). But for smaller problem sets it gets difficult to find enough threads to work together on the problem or even to work out the addressing arithmetic so the algorithm works. Although I don't think it leads to the ultimate performance, and may not work terribly well on the ATI - a solution that seems to be working somewhat is to just throw as many threads at it as possible - reduce the address arithmetic to very simple operations and then process as little as one result per kernel. And it makes it practical to vary parameters without needing to hand-code every scenario to get usable performance, let alone best performance.
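The 'one result per thread with trivial address arithmetic' idea can be imitated on the CPU to show its shape. The following is a plain Java analogy, not OpenCL and not the code from work: the 3-tap average is a made-up stand-in for the real operation, and a parallel IntStream plays the role of the work items, with gid standing in for the global id.

```java
import java.util.stream.IntStream;

final class OnePerItem {
    /** Each 'work item' gid computes exactly one output from clamped neighbours. */
    static float[] blur3(float[] in) {
        float[] out = new float[in.length];
        IntStream.range(0, in.length).parallel().forEach(gid -> {
            // The only address arithmetic is gid - 1 and gid + 1, clamped at the edges.
            int left = Math.max(gid - 1, 0);
            int right = Math.min(gid + 1, in.length - 1);
            out[gid] = (in[left] + in[gid] + in[right]) / 3f;
        });
        return out;
    }
}
```

Each iteration writes a distinct element, so no synchronisation is needed, which is the same property that makes the simple approach easy to parameterise on the GPU.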

Free as a thing of freeness

If I could think of something to work on i'd also like to write some free software using OpenCL now i'm starting to get the hang of it - well if I can invent a time machine so I can add an extra week to every week so I can fit it in. But the trivial stuff I can think of seems too pointless, or the more complex stuff way too complex.

In the back of my mind i've had the idea of doing a Gimp-ish/ImageJ-ish application in Java (see ImageJ - many big operations work faster than the gimp), and using OpenCL to accelerate (or indeed completely implement) the operations. But ... it's such a big fucking task to get something even useful - and requires a huge amount of work in the UI department, so i'm not sure I want to commit to it. Just the basic window with a zoomable editable layered surface with a couple of drawing tools, selection and filter/effect options is quite a task (ok ok, it's basically the whole app ;-). I guess if I can get over the hurdle of a main editing surface widget I might be able to move forward with this idea.

Another idea is a 'gimp for video'. There's a nice java wrapping for ffmpeg which sorts out the codec end of things (yes there is, although like many java things, it's fucking hard to find non-stale shit on google - xuggler). But here I'm lacking a bit of domain knowledge (and about all I really want to do is create slideshows/splice video together), and I'm not sure OpenCL is a good fit (simple fades and wipes are probably faster on a modern cpu). And working with media containers is entering a world of pain. Let alone the sorry fucked up state of linux sound which is something I don't think I could face sober and wouldn't put up with drunk. Might leave that idea.

I can't really think of anything else I might use that could make use of it to be honest.

Saturday, 17 July 2010


Well, now I know where KDE4 got its fucked up shithouse 'start menu' from. And the original from which it is blatantly copied is also fucked up and shithouse.


How fucked up.

And shithouse.

Tuesday, 13 July 2010


I had just a few limequats left on a rather sickly looking tree I have in a pot so I thought i'd make some cordial from it before I used them up ('syrup' for americans). They have a very nice flavour - much as you'd expect, ripe lime mixed with kumquat, so a little like a tart orange/lemon. I threw in a couple of lemons too since the limequats were a bit small.

I ended up with nearly 2 litres of this nice golden liquid, plus some glace peel I can use in a cake if I remember to save it.

I dropped the sugar a bit from the recipe I found on the ABC and added a bit more citric acid, because I like it less sweet and a bit more tart (and the pot was too full!). I used 1kg sugar and about 1.5tbs of acid.

The tree was looking pretty ill and I think I overdid the treatments and now most of the rest of the leaves have fallen off! It was much like that last year by mid-winter too, so I guess i'll have to wait till spring to see if it will recover - I hope so because I love the flavour. Being in a big pot I didn't keep it watered properly over summer either. I also found that my lime tree has borers in the trunk and given I also failed to water it properly over summer it wasn't in great shape anyway - may well lose that one. If so I might get a native lime (if i ever see one in a shop again - saw one once, 5 years ago), or a more acidic lime (or lemon).

Thursday, 8 July 2010


Had a bit of a victory today - after a kick in the nuts or two. Finally got some of my OpenCL code running with the correct results at a reasonable clip.

I spent most of the day working out why the results were wrong - partially because of a minor bug or three, but mostly because all of the synchronisation primitives don't work when you call a kernel function from another kernel function (at least in the ATI sdk). Wish I had known that to start with ...

I think it's roughly 100x faster than the original java or c code (although I should quantify it), so that's a pretty penny in the bank, and I think there's a bit more I can squeeze out of it - let alone using beefier hardware. One of the keys was to use a native format for most operations - I take the input data which is in a packed byte format and convert it to floats, and then operate on those. The other key is to use local memory as a programmed cache to reduce the load on global memory. And finally to utilise registers as much as possible - once i've loaded data from memory re-use the data repeatedly before needing to go back to memory or running out of registers. The OpenCL api also has some nice queuing and job management which makes it easy to let the CPU do other work whilst the GPU is busy, without having to synchronise every operation - which is the real mind killer. And it goes without saying that the data is loaded once to the graphics card memory and all operations operate there until I get a result out (converted to the format I need).
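The 'convert packed bytes to floats once, then work in floats' step is easy to show in plain Java, with the caveat that this is a CPU-side illustration rather than the kernel itself. The byte order (lowest byte first), the [0, 1] scaling and the names UnpackBytes and unpack() are all assumptions for the example.

```java
final class UnpackBytes {
    /** Expands each packed int (4 bytes, lowest byte first) into four floats in [0, 1]. */
    static float[] unpack(int[] packed) {
        float[] out = new float[packed.length * 4];
        for (int i = 0; i < packed.length; i++) {
            int v = packed[i];
            // Load one 32-bit word, then shift and mask out each byte.
            out[i * 4]     = (v & 0xff) / 255f;
            out[i * 4 + 1] = ((v >>> 8) & 0xff) / 255f;
            out[i * 4 + 2] = ((v >>> 16) & 0xff) / 255f;
            out[i * 4 + 3] = ((v >>> 24) & 0xff) / 255f;
        }
        return out;
    }
}
```

This is the trade-off mentioned above in miniature: every later pass gets simple float loads, at the cost of four times the memory traffic of the packed form.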

I still haven't managed to get the image datatypes to work but I will keep trying as they should fit this problem well (and it's nice to see that the JOCL guys were quick to implement the missing APIs to support them). Using arrays is a bit of a pita tbh - I've had to split my work 'tile' into multiple slices, and keeping track of where each of the work units (threads) within the work group ('process') is gets hairier than a hippie's armpits. Using the texture units should let me remove all of the manual cache code and messy address arithmetic - although whether it executes faster is the real test.