Thursday, 22 July 2010

Lots o threads

I got a new work machine - hence the previous post. That was a short diversion into ms vista 7, which I thankfully didn't need to keep up - I was having massive problems with the nvidia graphics drivers under fedora 13, and problems with my code. But it turned out that it was just my broken code and it crashed just as badly in ms visa 7. Wow what a horrid system they've designed. Move a window to look at something behind it and suddenly it maximises so you can't see what you wanted, the 'file browser manager' thing which seems confused as to what it's trying to be, and probably the worst item - move a mouse over a list and the scroll wheel keeps scrolling the last list that had focus. Not even clicking on the scroll-bar gives it focus and you need to click in the list (often activating it - which you don't want). Ugh. It's like a hollow shell of a tech demo of slightly wacky ideas from GNOME and KDE all wrapped together with a questionably 'pretty' interface (i found it far too spaced out with poor font choices). It kind of looks ok, but there's no meat under it and lots of things don't work quite right. The OS installs pretty fast at least - but you don't get anything that lets you do any work and it just turns into a labourious hunt for some crap that probably doesn't work very well, install, repeat, until you have a remotely usable system. And it still does product registration? Jesus fucking Christ, that's just offensive.


So I had a few problems when I started moving code from the ATI card i've been using to the Nvidia one in the new machine. The compiler is a bit pickier/different about a few things, although iirc that was mostly not auto-casting scalars to vector types in a variable declaration. A bit of a pain but fortunately I don't have too much code yet and it was mostly a mechanical conversion process.

I suppose the main problem I had - had I known it at the time it would've saved me a very long and wasted day or two - was that the CPUs are much pickier about the code they'll execute. The ATI card doesn't mind some stray memory accesses but the nvidia one just crashes. That is good really since the code is buggy - but unfortunately you get no indicator of why it crashed, or even when it crashed. At some random point after some code you've queued to execute runs you get a random and meaningless (and undocumented/not to spec I might add) error code which says things have stopped working. I was thrown out because the nvidia drivers were a pain to set up - the `development drivers' just wouldn't run, and the production drivers ran but were a little touchy - if I log out of the session X wont restart. I was also thrown out since adding some debugging code made the routine run too (and since I had it working on the other machine ...).

Anyway now that I know any of these random errors are actually just segfaults it's much easier to deal with without getting a splitting headache. Actually I think I was getting so stressed (or maybe it's because i've been eating all sorts of crap) I spent most of one day with an anxiety induced dizzy spell and headache (ms vista 7 helped there too).

So anyway, the one main routine on which i've been working for the last few weeks got running again and I cleaned it up and whatnot. It's only about 2x faster than the ATI card (HD 5770, vs GTX 480 IIRC), but the code was 'tuned' for the ATI. Although using the word 'tuned' is being a bit generous really, I just kept trying things and seeing what was faster, since there are zero tools on Linux to perform any detailed profiling. I guess that isn't so surprising - if I coded it right it should be completely memory constrained anyway. I did make some minor changes since the nvidia cpu's support better datatype conversion than juniper, e.g. loading floats from bytes in 2 instructions, not dozens (it was much faster to load uint's directly and convert manually on Juniper, but the other way around on nvidia). Right now i'm taking data and converting to floats and working with that everywhere which was the right approach on the Juniper arch but might not be on nvidia since it multiplies the memory bandwidth by 4. But there's just not enough time in every day to try everything - I worked over the wet dreary weekend and ended up over 50 hours by COB Wednesday so i'm having a break now. I was supposed to be dropping to 4 days/week this financial year!

I'm still getting to grips with mapping problems efficiently to the GPUs. I've had some success with a more complex approach which copies data to local memory in coalesced accesses and then works from the local memory - which is fast (and pretty much essential on the ATI with no cache). But for smaller problem sets it gets difficult to find enough threads to work together on the problem or even to work out the addressing arithmetic so the algorithm works. Although I don't think it leads to the ultimate performance, and may not work terribly well on the ATI - a solution that seems to be working somewhat is to just throw as many threads at it as possible - reduce the address arithmetic to very simple operations and then process as little as one result per kernel. And it makes it practical to vary parameters without needing to hand-code every scenario to get usable performance, let alone best performance.

Free as a thing of freeness

If I could think of something to work on i'd also like to write some free software using OpenCL now i'm starting to get the hang of it - well if I can invent a time machine so I can add an extra week to every week so I can fit it in. But the trivial stuff I can think of seems too pointless, or the more complex stuff way too complex.

In the back of my mind i've had the idea of doing a Gimp-ish/ImageJ-ish application in Java (see ImageJ - many big operations work faster than the gimp), and using OpenCL to accelerate (or indeed completely implement) the operations. But ... it's such a big fucking task to get something even useful - and requires a huge amount of work in the UI department, so i'm not sure I want to commit to it. Just the basic window with a zoomable editable layered surface with a couple of drawing tools, selection and filter/effect options is quite a task (ok ok, it's basically the whole app ;-). I guess if I can get over the hurdle of a main editing surface widget I might be able to move forward with this idea.

Another idea is a 'gimp for video'. There's a nice java wrapping for ffmpeg which sorts out the codec end of things (yes there is, although like many java things, its fucking hard to find non-stale shit on google - xuggler). But here i'm lacking a bit of domain knowledge (and about all I really want to do is create slideshows/splice video together), and i'm not sure OpenCL is a good fit (simple fades and wipes are probably faster on a modern cpu). And working with media containers is entering a world of pain. Let alone the sorry fucked up state of linux sound which is something I don't think I could face sober and wouldn't put up with drunk. Might leave that idea.

I can't really think of anything else I might use that could make use of it to be honest.

No comments: