Thursday, 20 October 2011

Ho hum.

Have a new AMD card - an HD 6950 - for my workstation, and need the Catalyst driver for the OpenCL stuff. I use XFCE, so the GNOME 3 incompatibilities are of no interest to me.

Couldn't get the driver built for FC13 (all sorts of bugs/problems with the rpm, and I really just couldn't be bothered with it all late at night), so `upgraded' to FC15 ...

It kind of works, but is really slow in really weird ways - when changing virtual desktops, one window refreshes at 'cpu speed'. glxgears runs at 6000fps, which is really way too slow: I'm getting 10KFPS on my rather older 5770 card in my other, older/slower machine - although fgl_glxgears is twice as fast on this new card. Using the AMD CPU backend for OpenCL causes more interference with graphics updates than using the GPU backend(!). The other machine is using Catalyst 10.12 on Fedora 14; the new one is 11.9 on Fedora 15 ...

I've blacklisted the kernel radeon module and whatnot. I'm using Xinerama - I tried without it and it was even slower.
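(The blacklisting itself is just the stock modprobe configuration - the filename is only a convention, nothing Catalyst-specific:)

    # /etc/modprobe.d/blacklist-fglrx.conf
    # keep the in-kernel driver off the card so fglrx can bind to it
    blacklist radeon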

I think there's just something wrong with the whole system, as everything feels rather sluggish - or is that just the price of 'progress'? I'm trying a yum update (all 1GB's worth), and if that doesn't work I might have to try something more drastic. Obviously the upgrade was a risky choice, but one would hope that having the right kernel and X driver would be enough for the video driver ...

Only 1000 packages to go now ...

Later ...

Well, it's still really slow. I tried an older driver release (on Windows - it's hard to find old releases for Fedora) but it wouldn't support the card. On Windows the wall-clock time of part of my application is about 2x better vs Linux: which is pretty significant, since much of the time is just waiting around for the video frame to arrive, so the speed-up is presumably more than that. Needless to say, the desktop is smoother too.

I also tried the Viola-Jones detector from socles. Ouch, this really, really struggles - about 100x slower than running on NVidia hardware. I tried a few things that didn't make any noticeable difference, apart from removing the single rarely-used atomic_inc, which made it jump to about 30x faster - but even with that huge increase it was still well behind the GTX 480.
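For the record, the hot spot was this sort of pattern - a sketch only, not the actual socles kernel (names made up) - where every work-item that finds a candidate bumps a single global counter:

    __kernel void compact_hits(
        __global const uchar *passed,  /* per-location pass/fail flags */
        __global int *count,           /* one global counter: the bottleneck */
        __global int *hits)
    {
        int i = get_global_id(0);

        if (passed[i])
            hits[atomic_inc(count)] = i;
    }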

I think I will probably have to try some other ideas to deal with this:
  • Scale the images so that each sliding scan reads adjacent locations (i.e. coalesced reads), and go back to 1-thread-per-test/cascade.
  • Pre-calculate the scaled weights/regions on the CPU so they can be stored in constant memory.
  • Cache the region/weight information in LS.
  • Unpack the region/weight info into a flat structure so it is read sequentially rather than walking a tree stored in an array (see the sketch after this list).
  • Perhaps separate the sum calculations from the weight calculations. By doing less work there might be more locality of reference/more chance for any cache to function. This is just another way to try the first point, I guess.
  • Use atomic counters if available, since global atomics are obviously a huge no-no on Cayman.
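For the constant-memory and flat-layout points, something along these lines is what I have in mind - a minimal sketch only, with a made-up record layout, assuming the host pre-scales the regions/weights for each cascade scale and that a stage's worth of data fits in the constant buffer:

    /* hypothetical flattened record: pre-scaled on the CPU, packed
     * sequentially so the kernel never chases tree indices */
    typedef struct {
        int4 rect;     /* pre-scaled region: x, y, w, h */
        float weight;  /* pre-scaled weight */
    } flat_region_t;

    __constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE
        | CLK_ADDRESS_CLAMP_TO_EDGE
        | CLK_FILTER_NEAREST;

    /* sum over one rect of the summed-area table */
    float rect_sum(image2d_t sat, int2 p, int4 r) {
        float a = read_imagef(sat, smp, p + (int2)(r.x, r.y)).x;
        float b = read_imagef(sat, smp, p + (int2)(r.x + r.z, r.y)).x;
        float c = read_imagef(sat, smp, p + (int2)(r.x, r.y + r.w)).x;
        float d = read_imagef(sat, smp, p + (int2)(r.x + r.z, r.y + r.w)).x;
        return d - b - c + a;
    }

    /* one work-item per test location, reading the stage data linearly */
    __kernel void stage_pass(
        __read_only image2d_t sat,
        __constant flat_region_t *regions,
        const int nregions,
        const float stage_threshold,
        __global uchar *pass)
    {
        int2 pos = (int2)(get_global_id(0), get_global_id(1));
        float sum = 0;

        for (int i = 0; i < nregions; i++)
            sum += rect_sum(sat, pos, regions[i].rect) * regions[i].weight;

        pass[pos.y * get_global_size(0) + pos.x] = sum >= stage_threshold;
    }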

I had also better check it on my HD 5770, which runs the FC14 desktop very snappily and runs OpenCL ok, to verify it isn't all just down to a shoddy driver (hmm, now that I think about it, I haven't tried OpenCL on it since 'upgrading' to FC14 from a hacked-up, ancient gNewSense).

glxgears does start to slow down on the 5770 vs the 6950 as you make the window bigger - so the hardware itself is somewhat faster. The problems must be in the overhead of the OS/drivers. No question that ATI aren't doing a great job here, but on the other hand the xorg, fdo, and Linux guys seem to change their minds about driver/graphics architecture every 6 months too ...

I was looking forward to playing with some new hardware, but apart from the sluggish GUI and having to `upgrade' the system, most of the application I work on no longer functions, as critical routines are returning broken results. Not fun. Some of these are going to turn out to be my bugs, but I've already found problems with the compiler (e.g. commenting out all of the #pragma unroll directives fixed a bunch of stuff).
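i.e. turning this sort of thing back into a plain loop (a made-up fragment, not one of the real kernels):

    float dot8(const float *row, const float *weight)
    {
        float sum = 0;
        /* #pragma unroll 8 -- commented out: miscompiles on this driver */
        for (int i = 0; i < 8; i++)
            sum += row[i] * weight[i];
        return sum;
    }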

Well, as the boss said, these things are so cheap it probably isn't worth my time (or his money!) trying to fix these issues ...

Later Still ...

Well, I seem to have most of the code working again. Apart from the #pragma unroll error, the problems seem to be my own fault.

First, a bunch of queue synchronisation problems: data being over-written before it was fully processed, for example. NVidia's libraries are more aggressive about starting work without an explicit clFlush(), and apart from that I just made some mistakes along the way which weren't exposed until now.
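The fix is the obvious one - hang on to the event and wait on it before re-using the buffer, rather than hoping the driver schedules things the way NVidia's does. A minimal sketch (the helper name and arguments are made up):

    #include <CL/cl.h>

    /* Re-fill a buffer only once the previous read of it has completed. */
    void refill(cl_command_queue queue, cl_mem buf, size_t size,
                void *result, const void *next)
    {
        cl_event done;

        clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, size, result,
                            0, NULL, &done);
        clFlush(queue);            /* actually submit the work */
        clWaitForEvents(1, &done); /* data fully processed ... */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, size, next,
                             0, NULL, NULL); /* ... now safe to overwrite */
        clReleaseEvent(done);
    }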

And one odd one which took a while to track down: passing the same image as both a read_only image and a write_only one. I knew this was suss when I did it, but 'it worked', so I left it there: I had it in the back of my mind that this was the sort of thing I should check, but I couldn't remember where I'd done it.
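This is undefined in OpenCL 1.x - a kernel can legally read or write a given image, but not both - so binding one image to both arguments only ever 'worked' by luck. Roughly (kernel and variable names made up):

    /* undefined: the same image bound as both source and destination */
    clSetKernelArg(filter, 0, sizeof(cl_mem), &img);  /* __read_only arg  */
    clSetKernelArg(filter, 1, sizeof(cl_mem), &img);  /* __write_only arg */

    /* the fix: ping-pong between two separate images */
    clSetKernelArg(filter, 0, sizeof(cl_mem), &img_a);
    clSetKernelArg(filter, 1, sizeof(cl_mem), &img_b);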

I still have some newly-added stability issues - the dreaded and meaningless 'error 134' - but in the past these have usually been bugs too. Although not always.

So perhaps the drivers aren't so bad after all; although they are still too slow from Linux.

I guess I should've stuck to one of my rules of thumb of late: if you think you're getting the wrong result from the compiler, you just haven't checked your code closely enough yet.
