Friday, 28 October 2011

Fucked up Fridays

What is it about Fridays lately ...

Well the latest little thing to ruin my day has been the inability of Firefox 7 to function correctly with the primary selection. It seems to want to ignore middlemouse.contentLoadURL for some reason. Given that it's a fairly recent and fully documented setting, I presume it's just a bug, but what a pain.

It's not something I use constantly but discovering it doesn't work is pretty annoying.

Update: So now it decides it's going to work. Well what can I say ... except maybe that I need to get AFK more often.

I'm totally sick of the upgrade treadmill and feel somewhat annoyed by being forced to install a newer version of Fedora just to get my graphics card working. I had everything working just nicely and was familiar enough with any of the warts left to not notice them. And now I have to go through all that crap again. The thought that firefox will become 'versionless' horrifies me, as does the love-fest that is HTML5+JavaScript, where I will no longer be able to ignore CO2-belching crap like I can now by just disabling flash.

Thursday, 27 October 2011

socles demos

I finally got off my fat arse - or is that sat on it further enlargening[sic] it - and tidied up some of the test driver code I have for socles into a set of demos.

I also implemented the colour mode for the DCT denoising algorithm. Over-all it's a little slow still - i.e. not fast enough for real-time video. One of these days i'll get around to the complex wavelet version, that should be a lot faster and can also sharpen. I haven't been able to suss out DCT sharpening and so far my attempts add too many artefacts to be useful (i.e. pixel-level chess pattern).

The demos so far are:
AdaptiveBlur
An interactive window that shows an experimental algorithm I came up with some time ago for de-noising. It uses a Sobel filter to detect edges, then uses that to progressively blend between a blurred and non-blurred image (there's a rough sketch of the blend idea just after this list). Works ok sometimes.
ConvolveNonSeparable
Simple non-separable convolution that blurs an image.
ConvolveSeparable
Separable convolution to do the same thing (the demo was broken at the time of writing, but has since been fixed).
DCT8x8Mono, DCT8x8Colour
Interactive DCT based denoise demo for mono/colour images.
WebcamFX
Another old interactive demo I wrote which uses Video4Linux to access a webcam and apply a bunch of effects including KLT motion detection and viola-jones face detect. It also shows the first half of a low-overhead video display path: the GPU does the colour conversion from raw frames. Well as low as possible with v4l4j anyway.
They're in the soclesdemo sub-module in socles' cvs.
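
Since I mention it in the list above, the blend at the heart of the AdaptiveBlur demo boils down to something like this - a plain-Java, single-channel sketch of the idea rather than the actual OpenCL kernel, with a made-up edgeScale parameter controlling how quickly edges revert to the original image:

// Blend between a blurred copy and the original image based on local edge
// strength: flat areas take the blurred value, strong edges keep the original.
static float[] adaptiveBlur(float[] src, float[] blurred,
        int width, int height, float edgeScale) {
    float[] out = src.clone();          // border pixels keep the original value
    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int i = y * width + x;
            // Sobel gradients
            float gx = src[i - width + 1] + 2 * src[i + 1] + src[i + width + 1]
                     - src[i - width - 1] - 2 * src[i - 1] - src[i + width - 1];
            float gy = src[i + width - 1] + 2 * src[i + width] + src[i + width + 1]
                     - src[i - width - 1] - 2 * src[i - width] - src[i - width + 1];
            // Map edge strength to a 0..1 blend factor.
            float edge = Math.min(1f, (float) Math.hypot(gx, gy) * edgeScale);
            out[i] = edge * src[i] + (1 - edge) * blurred[i];
        }
    }
    return out;
}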

Hmm, another week nearly down. I've been reading lots of papers and trying to suss out some fiddly crap for work, so this stuff has been a nice distraction. That's finally going somewhere so might keep me busy for a bit.

Wednesday, 26 October 2011

GC, finalisers

So I was doing some memory profiling the other day (using netbeans' excellent profiler - boy I could've used this 10 years ago) to try to track down some resource leakages, and I noticed that xuggle was really exercising the system heavily.

So it seems I might look at moving to use jjmpeg in my client's application fairly soon. There are some other reasons as well: i.e. not being able to run in a 64-bit JVM on microsoft windows is starting to become a problem, and the bundled ffmpeg is just a bit out of date.

Since I haven't implemented memory handling completely in jjmpeg I went about looking at how to do it 'properly'. I was just going to try to use finalisers, but then I came across this article on java finalisers which said it probably wasn't a good idea.

I was going to have a short look this morning but suddenly it was 4 hours later, and although I have something which works I'm not sure yet that I like it. It seems to be the cleanest way to implement the article's suggestion of using weak references whilst still mixing the auto-generated and hand-crafted code I want, so I will probably end up running with it. The public api didn't need to change.

Previously, the binding worked with an object class hierarchy something like this
 AVNative [
ByteBuffer p (points to allocated/mapped native memory)
]
+- AVFormatContextAbstract [
Generated field accessors and native methods
Most methods are object methods
]
+- AVFormatContext [
Public factory methods/constructors
Hand-coded specific methods
Hand-coded helper native methods
Hand-coded finalise/dispose methods
]

The new structure:
WeakReference<AVObject>
+- AVNative [
ByteBuffer p pointing to native memory
internal dispose() method
weak reference queue/cleanup as from article above
Weak reference is AVObject
]
+- AVFormatContextNativeAbstract [
Generated field accessors and native methods
All methods and field accessors are static
]
+- AVFormatContextNative [
Hand-coded helper native methods
Implements native resource dispose
]

Together with
AVObject [
AVNative n (the pointer to the native wrapper object)
public dispose method
]
+- AVFormatContextAbstract [
Generated public access methods which use AVFormatContextNative(Abstract) methods.
]
+- AVFormatContext [
Public factory methods/constructors
Hand-coded specific methods
]
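
In code, the mechanism is roughly as follows - this is only a minimal sketch of the weak-reference/queue idea from that article, not the actual jjmpeg classes, and the names are made up for illustration (the real thing splits this across the generated and hand-coded layers described above).

import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.nio.ByteBuffer;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// The public-facing object the client holds; once it becomes unreachable
// the native side can be cleaned up automatically.
abstract class NativeObject {
    final NativeHandle n;

    NativeObject(NativeHandle n) {
        this.n = n;
    }

    // Explicit release for clients that want deterministic cleanup.
    public void dispose() {
        n.dispose();
    }
}

// The hidden wrapper which actually owns the native memory.  It is also a
// WeakReference to the public object, so it can sit on a reference queue.
abstract class NativeHandle extends WeakReference<NativeObject> {
    static final ReferenceQueue<NativeObject> queue = new ReferenceQueue<NativeObject>();
    // Handles must stay strongly reachable until disposed, otherwise the
    // handle itself could be collected before it is ever enqueued.
    static final Set<NativeHandle> live = Collections.synchronizedSet(new HashSet<NativeHandle>());

    ByteBuffer p;               // points to the allocated/mapped native memory
    private boolean disposed;

    NativeHandle(NativeObject referent, ByteBuffer p) {
        super(referent, queue);
        this.p = p;
        live.add(this);
    }

    // Subclasses free the actual native resource here.
    protected abstract void disposeNative(ByteBuffer p);

    synchronized void dispose() {
        if (!disposed) {
            disposed = true;
            live.remove(this);
            disposeNative(p);
            p = null;
        }
    }

    // Called from factory methods (or a background thread) to reap anything
    // whose public object has already been garbage collected.
    static void cleanup() {
        NativeHandle h;
        while ((h = (NativeHandle) queue.poll()) != null)
            h.dispose();
    }
}

A factory method creates the public object and its handle together, and calls cleanup() opportunistically, so anything the client forgot to dispose() is eventually freed anyway.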

So yeah - a bit more complicated, and it requires 2 objects for each instance (and often 3 including the C-side instance it's wrapping), as well as the overhead of the WeakReference instance data and the list entry for tracking the references. The extra layer of indirection also adds another method invocation/stack frame to every method call.

On the other hand, it lets the client code use dispose() when it wants to, or if it forgets then dispose will automatically be called eventually. And makes it obvious in the code where dispose needs to sit.

As usual it's a question of trade-offs. If the article is correct then presumably these trade-offs are worth it.

In this case the whole point of using jjmpeg is to avoid numerous allocations every frame anyway: I can allocate working and output buffers once and just use them directly. In this case the actual number of objects is quite small and doesn't happen very often, so I suspect that either mechanism would work about as well as the other.

Well this distraction has blown my morning away; I'd better leave it for now so I can clock up some work hours after lunch.

Update: I figured I'd gone too far down this route to do anything other than keep it. I've checked this in now, as well as a bunch of other stuff described on the project page. Update 2: Oracle keeps breaking links, but I've updated the pointer. I'm looking at this again (September 2012) because of some issues in jjmpeg.

Monday, 24 October 2011

OpenCL DCT Denoise

I've just checked in an OpenCL implementation of the DCT de-noising algorithm I mentioned previously. I've only done the mono version so far.

It's not terribly fast - 10ms wall-clock for a 512x512 mono image, and given that it requires 64 DCT's per 8x8 block and needs to accumulate the results, it probably never will be.

The kernel source.

Update: Colour version implemented now.

Saturday, 22 October 2011

It's beaten me. For now.

I should've stayed outside in the sun today gardening - but curiosity got the better of me. I hope the (absolutely stunning) weather continues tomorrow, otherwise i've blown it on nothing ...

I tried working on the AMD performance of the Viola & Jones detector in socles: I tried a whole bunch of stuff, from copying the image tiles pre-scaled (as summed area table) to local memory, to completely re-arranging the data structures so they are workgroup aligned, to even trying the cpu single-thread-per-location version.

I got some minor improvements, the biggest being from copying the tile to local store and removing some of the calculations (since it doesn't need to scale the rects): but that only took a simple test case from about 25ms to 20ms. Barely noticeable in my webcam test harness.

I think the problem is with the fact it has to read so much data for each single test. It requires 3-4 uint4's just to describe the test, and 8-12 uint texture lookups for the summed area table lookups. The cascade I have has ~6 400 regions to test grouped in ~3 000 features, and although most aren't tested it's just a lot of data. It's too much for constant memory for example.

With a fix to use the atomic counters AMD hardware provides at least it's now in the same order of magnitude as the nvidia hardware, but still 2-4x slower.

Maybe ... if the stages were broken up into smaller parts it could work more efficiently, but it does seem a pretty long shot to me as the problem remains with the sheer amount of stuff that needs to be loaded for each test.

Time probably better spent on something else.

Thursday, 20 October 2011

Ho hum.

Have a new AMD card - HD 6950 - for my workstation, and need the catalyst driver for the OpenCL stuff. I use XFCE so the gnome3 incompatibilities are of no interest to me.

Couldn't get the driver built for FC13 (all sorts of bugs/problems with the rpm and I really just couldn't be fagged with it all late at night), so `upgraded' to FC15 ...

It kind of works, but is really slow in really weird ways - when changing virtual desktops one window refreshes at 'cpu speed'. glxgears runs at 6000fps, which is really way too slow: I'm getting 10KFPS on my rather older 5770 card in my other older/slower machine, although fgl_glxgears is twice as fast on this new card. Using the AMD CPU backend for OpenCL causes more interference with graphics update than using the GPU backend(!) The other machine is using catalyst 10.12 on fedora 14, the new one 11.9 on fedora 15 ...

I've blacklisted the kernel radeon module and whatnot. I'm using xinerama - i tried without it and it was even slower.

I think there's just something wrong with the whole system as everything feels rather sluggish - or is that just the price of 'progress'? I'm trying a yum update (all 1G's worth) and if that doesn't work I might have to try something more drastic. Obviously the upgrade was a risky choice, but one would hope having the right kernel and X driver would be enough for the video driver ...

Only 1000 packages to go now ...

Later ...

Well it's still really slow. I tried an older driver release (on windows - hard to find them for fedora) but it wouldn't support the card. On windows the wall-clock time of part of my application is about 2x better than on linux: which is pretty significant since much of the time is just waiting around for the video frame to arrive, so the real speed-up is presumably more than that. Needless to say the desktop is smoother too.

I also tried the viola-jones detector from socles. Ouch, this really really struggles - about 100x slower than running on nvidia hardware. I tried a few things that didn't make any noticeable difference apart from removing the single rarely-used atomic_inc, which made it jump to about 30x faster - but even with that huge increase it was still well behind the GTX 480.

I think probably I will have to try some other possible ideas to deal with this:
  • Scale the images so that each sliding scan reads adjacent locations (i.e. coalesced reads), and go back to 1-thread-per-test/cascade.
  • Pre-calculate the scaled weights/regions on the cpu so they can be stored in constant memory.
  • Cache the region/weight information in LS.
  • Unpack the region/weight info into a flat structure so it is read sequentially rather than walking a tree stored in an array.
  • Perhaps separate the sum calculations from the weight calculations. By doing less work there might be more locality of reference/chance for any cache to function. This is just another way to try the first point I guess.
  • Use atomic counters if available since global atomics are obviously a huge no-no on cayman.

I had also better check it on my HD 5770 which runs the fc14 desktop very snappy and runs OpenCL ok to verify it isn't just all down to a shoddy driver (Hmm, now I think about it, I haven't tried OpenCL on it since 'upgrading' to fc14 from a hacked up ancient gnewsense).

glxgears does start to slow down on the 5770 vs the 6950 as you make the window bigger - so the hardware itself is somewhat faster. The problems must be in the overhead of the os/drivers. No question that ATI aren't doing a great job here but on the other hand, the xorg, fdo, and linux guys seem to change their minds about driver/graphics architecture every 6 months too ...

I was looking forward to playing with some new hardware, but apart from the sluggish GUI and having to `upgrade' the system, most of the application I work on no longer functions as critical routines are returning broken results. Not fun. Some of these are going to turn out to be bugs but i've already found problems with the compiler (e.g. commenting out all of the #pragma unroll directives fixed a bunch of stuff).

Well as the boss said, these things are so cheap it probably isn't worth my time (or his money!) for me trying to fix these issues ...

Later Still ...

Well I seem to have most of the code working again. Apart from the #pragma unroll error, the problems seem to have been my own fault.

First, a bunch of queue synchronisation problems: data being over-written before it was fully processed, for example. NVidia's libraries are more aggressive about starting work without an explicit clFlush(). And apart from that I just made some mistakes along the way which weren't exposed until now.

And one odd one which took a while to track down: passing the same image as both a read_only image, and a write_only one. I knew this was suss when I did it, but 'it worked' so i left it there: I had it in the back of my mind that this was the sort of thing I should check, but I couldn't remember where I'd done it.

I still have newly added stability issues - the dreaded and meaningless 'error 134': but in the past these have usually been bugs too. Although not always.

So perhaps the drivers aren't so bad after-all; although they are still too slow from linux.

I guess I should've stuck to one of my rules of thumb of late: if you think you're getting the wrong result from the compiler, you just haven't checked your code closely enough yet.

Tuesday, 18 October 2011

DCT denoising

Ok now the weekend's over, time to calm down and stop ranting ... ;-) Bummer about Australia losing though, apart from some real shockers right from the kick-off they did calm down and start playing fairly well. When they did have a good run - and they had a few - they were let down badly by not enough support at the breakdown. Still, NZ deserved winners ... And channel 9's race-caller sucked the whole way through.

I just found this very well put together site about using the discrete cosine transform (DCT) to do threshold de-noising in a manner similar to the wavelet threshold denoising and sharpening I mentioned before.

DCT Denoising

Very slick, complete with well formatted mathematics that puts most microsoft-word based papers to shame, GPL3 source and on-line demo!

I downloaded the code and modified it not to add the noise and tried it myself on Lenna:

The results are effectively the same as with the complex DTCWT version for moderate settings - visually even the artefacts it introduces are the same.

In the form provided however it is somewhat more computationally intensive - its sliding window is offset by single pixels, and the way the C++ is written isn't the most efficient. I wonder how well it would work with a hanning window and 4 pixel offsets. I wonder if it can also sharpen - from a quick search it looks like it can.

Very interesting, and it also works with colour images in smarter ways than just processing each channel separately.

When I get the time I'll look at coding this up for ImageZ and socles, although I just noticed blogger mucked up something else - looking at images - so the threshold of having to do something about that is ever approaching (I found the option to disable 'lightbox' mode).

Update: Just another advert for Java. It looked simple enough so I coded up a version in Java using an 8x8 DCT and it runs single-threaded over 3x faster than the C++ version, including the JVM startup or over 4x once it's going. Rather than generate all 255 025(!) patches, transform, threshold, inverse, and merge, it fully processes a single patch each time: requiring that much less DCT memory (i.e. rather a lot - over 62MB less). So that's 0.9s vs 3.9s for this 512x512 mono image. Although I can't fathom why my version needs 1/2 the threshold to give a similar result ...
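
For what it's worth, the guts of the per-patch Java version look something like this. It's only a single-threaded sketch using a naive O(N²) DCT rather than anything clever, and the thresholding detail (leaving the DC term alone) is my assumption here rather than a copy of the reference code.

// Sliding-window 8x8 DCT hard-threshold denoise, one patch at a time.
public final class DCT8Denoise {
    static final int N = 8;
    // basis[k][n] = c(k) * cos(pi*(n+0.5)*k/N): the orthonormal DCT-II basis.
    static final double[][] basis = new double[N][N];
    static {
        for (int k = 0; k < N; k++) {
            double c = Math.sqrt((k == 0 ? 1.0 : 2.0) / N);
            for (int n = 0; n < N; n++)
                basis[k][n] = c * Math.cos(Math.PI * (n + 0.5) * k / N);
        }
    }

    // src is a row-major width*height mono image; returns the denoised image.
    public static float[] denoise(float[] src, int width, int height, float threshold) {
        float[] sum = new float[width * height];
        float[] count = new float[width * height];
        double[][] patch = new double[N][N];
        double[][] coef = new double[N][N];

        for (int py = 0; py <= height - N; py++)
            for (int px = 0; px <= width - N; px++) {
                for (int y = 0; y < N; y++)
                    for (int x = 0; x < N; x++)
                        patch[y][x] = src[(py + y) * width + px + x];

                transform(patch, coef, true);
                // Hard threshold: zero small coefficients, leave the DC term alone.
                for (int y = 0; y < N; y++)
                    for (int x = 0; x < N; x++)
                        if ((x | y) != 0 && Math.abs(coef[y][x]) < threshold)
                            coef[y][x] = 0;
                transform(coef, patch, false);

                // Accumulate the overlapping patches and normalise at the end.
                for (int y = 0; y < N; y++)
                    for (int x = 0; x < N; x++) {
                        sum[(py + y) * width + px + x] += patch[y][x];
                        count[(py + y) * width + px + x] += 1;
                    }
            }
        for (int i = 0; i < sum.length; i++)
            sum[i] /= count[i];
        return sum;
    }

    // Separable 8x8 orthonormal DCT: forward = DCT-II, inverse = DCT-III.
    static void transform(double[][] in, double[][] out, boolean forward) {
        double[][] tmp = new double[N][N];
        for (int y = 0; y < N; y++)             // rows
            for (int k = 0; k < N; k++) {
                double s = 0;
                for (int n = 0; n < N; n++)
                    s += in[y][n] * (forward ? basis[k][n] : basis[n][k]);
                tmp[y][k] = s;
            }
        for (int x = 0; x < N; x++)             // columns
            for (int k = 0; k < N; k++) {
                double s = 0;
                for (int n = 0; n < N; n++)
                    s += tmp[n][x] * (forward ? basis[k][n] : basis[n][k]);
                out[k][x] = s;
            }
    }
}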

Update: See follow-on post where i mention implementing it in OpenCL for socles.

Update: I've now added it to ImageZ. DCT8Denoise is the main entry point. I changed it to work with separate colour planes rather than planes stored in a single array, just to make it easier to invoke from ImageZ. It's only single-threaded atm.

Sunday, 16 October 2011

Well ...

Just when you thought it couldn't get any worse, channel 9 - who hardly showed any of the world cup to start with - have what sounds like a horse-race caller doing the commentary on the AU/NZ semi-final. He does know the players at least, but doesn't seem to know the rules or that we too can see the same pictures as he is. So much for a bit of atmosphere, i had to turn the sound right down to be able to focus on the game and not this dickhead.

You don't realise how much the commentators make the game until you get a complete fuck-wit like this.

The one bright spot of the channel 9 coverage of the whole world cup - that they didn't provide their own wanker commentators - eclipsed in a moment.

Australia aren't looking like winners here after the first half, but there isn't much surprise there. Given a bit of bad luck and some very poor execution they're lucky they're still in it. NZ have made too many mistakes too.

Goodbye google news

Well, it's been a weekend for disappointment. Damn Wales were unlucky ... I'm actually not sure who I want to win out of New Zealand and Australia today - the kiwis just demand so much respect it's hard to barrack against them; i'll have a few drinks and go for whom-ever is playing the best I think. If they're both on their game it could be a real cracker of a match. But I digress ...

So, again google has decided to muck about with something which pretty much didn't need fixing. Last time they messed with news.google.com.au I wasn't particularly happy but continued to use it fairly regularly, as the changes were just cosmetic usability issues; but I think these latest changes are going to be too much, on top of a few other reasons I'll detail later.

TBH I can't believe I'm devoting so much time to such a post - it really doesn't mean that much to me on its own - but in the over-all scheme of things these small (and not so small) issues do mount up. It turned into a bit of a mega-rant at the end and the language deteriorates as it goes ...

First, the existing google news as I see on this laptop ... starting with the top of the page:



And then the middle of the page:



First thing: Yes I (very much) like to use Bitstream Vera Sans as my font for everything: coding, and reading documents. And even then, only 1 specific size works the best (not being able to do this is the single specific reason I won't even bother to try Chrome). So all you designers painstakingly choosing your typefaces and font sizes: you're wasting your time, if one can't read the information it is worthless. Most sites actually work fine with this, although a few have some minor formatting issues (mostly text overrunning the bottom of iframes).

And secondly I do have a crappy 1024x768 IBM laptop screen. Although few laptops have resolutions to match anymore, plenty of phones, netbooks and iLandfill slabs don't even get this far.

Ok, now on to the layout. There is still a big wasted load of space on the left that they added in the last major layout update, but basically most of the page is used for information content. Each story has a few alternative links from common (and sometimes not so common) news sources, an email link, and at most a single picture. Mouse-overs (at least today) are restricted to highlighting the link, which is about the most I'd like any browser to do with them.

Now, to my surprise, I was greeted with the following page when I opened google news on my other laptop this morning:



Hmm, something doesn't look right. First, everything is in one column. A huge chunk of wasted space on both the left and the right now. And what's more, the real killer feature of google news - at a glance being able to see the 'feel' of the media reporting of the news story - is conspicuously absent. There is only a single link to a single news source.

Actually I couldn't work out how to find anything more than that: I normally browse with Javascript disabled on that machine - because I don't like my lap burning, nor fæcebook to know where my mouse is whilst i'm reading a news article on an unrelated site - and all you end up with is a single link.

Enabling javascript and reloading, and I discovered a huge pile of annoying mouse-over shit (AMOS).



So, now you actually have to click an ugly button to bring this stuff up. Hoorah, now we have popup-pox infesting web pages too, just loverly[sic].

And on-top of that it's now somewhat more difficult to decipher - it is trying to add extra information to the other links beyond their titles. Do I really care that it is an opinion link? Or why the special notoriety of articles "From the United States [of America]"? Is their opinion somehow more important?

And apart from that, there's rubbish like a fæcebook, twatter, and plus-one button in addition to the email link, and 3 video links in addition to the picture. Clutter.

So ... I did a search and apparently the cog button is the settings icon these days. Who knew ... (actually I thought it was some logo, not a cog for that matter: it looks more to me like a high-contrast themed variation of the xfce main menu button). Of course, none of the buttons function if you have Javascript turned off ...

So to the rather bare settings. 1 or 2 columns, and auto-refresh. Færy-nuff, lets try ...



Oh hang on. That looks broken. Why would anyone possibly want to read the site that way? Not to mention more AMOS to 'enhance the experience', and the same big blank section on the right.

At least the killer-feature reporting-at-a-glance is back, but there's just no way anyone would labour through such a horrible interface for that.

Oddly enough ... if you disable javascript ...



You get the right-hand side-bar back, and thankfully the AMOS disappears as well.

Well almost ...



For some reason the top of the page has this non-functional news selection slider thing stuck to it.

Thoughts

I can only think that google has a particular idea in mind here: if you're not using a 24" widescreen monitor, then you must be using a phone or some iLandfill toy. Although that doesn't completely make sense since the new site would be even more useless on phones so they must have 2 separate stylesheets/designs for each one anyway. So why fuck it up so royally?

More and more of the web requires javascript - whilst usually using it for pointless crap like implementing buttons in a non-recognisable os-agnostic way (those damn designers again, thinking they can redefine 30 years of progress in human-computer interaction on every page). I find this whole idea of javascript everywhere very questionable security wise - a web page can load a 3rd party application which can then send information (e.g. where your mouse is) to any other 4th party without your knowledge. And hence more and more web pages are being turned into 'crapplications'. They're slower, uglier (and certainly not 'theme aware'), and more clumsy than local applications, but they're much heavier cpu and data wise compared to remote ones. It also closes off the avenues for using alternate browsers: having to have a very high performance rendering engine and javascript vm is a massive barrier to entry (e.g. even firefox 3.6 is ruled out of many sites now).

Welcome to the 3rd age of thick-client computing. All the local computing power required to run local applications, combined with the speed, grace, availability and security of remote ones. Oh boy! Hold me back!

No news is good news?

On a personal note I've been trying to avoid reading the news too much anyway, and google news itself. It's always the same old shit. It's mostly depressing, or at best it's just click-bait to rile you up.

And google news's aggregation algorithms are pretty much like watching TV based on the ratings: not the sort of experience I'm really after. For example, apparently `funniest home videos' is the most popular show in Australia? Do I really want the bogans who watch channel 9 deciding what news makes the front page (depressingly the truth is of course that yes, they already do)? With such an ignorant population, no wonder 'no more boats' is an (almost) winning election slogan around these parts, or that the global warming denialists get so much airtime. A timely reminder - and exactly what I thought the first time I saw the advert with the sound not muted (which is how I watch advertising if I'm watching 'live' tv, although my tv mute button wore out ...).

Still, I do like to check at least once every couple of days - lest I become one of the ignorant masses if nothing else. Or to fill a spot to give my brain a rest, or whilst waiting for a routine to run ... Unfortunately now I use Java there's no more waiting around for compilation - the 50KLOC bit of code I work on compiles and launches the application from scratch in about 1/2 a second (ant doesn't include resources properly in the jar without a clean rebuild - and building jars is the single terribly weak fucking reason to justify its utterly shit and astronomically painful fucked up existence - so I have to do it every time when working on opencl code. Fucking adjective!).

I guess I can use fairfax for the little Australian news I'm after (democratic politics died for me when Howard went to war, and without that what is the point of listening to those arseholes - and without the politics there's fuck-all left), The Guardian for Europe, and summaries or links in a few blogs I visit will do me from now on. I gave up on The ABC months ago - which should really now just be called `The Opposition Says Sydney-Siders Gazette'. Even SBS TV news has been shit for ages; since they cut their budget it's little more than a patchwork of cheap stories from other services (many barely trying to hide the happy-story pro-war/pro-usa propaganda they are, like some of the BBC stuff from iraq/afghanistan).

Barely any of the services do any local news at all. Most of it is broadcast/published straight out of Sydney or Melbourne. Not that much of import happens around here, but sometimes you do need to know about local stuff.

One thing google news showed me (until now) is just how much of the news is just the exact same story repeated ad nauseam, so at least I know I won't be 'missing out' on anything by not using it.

Friday, 14 October 2011

Goodbye Mythtv

I knew there was a reason I hadn't updated my system in a while: it wanted to install rubbish I don't want.

Dependencies Resolved

========================================================
Package Arch
========================================================
Removing:
PackageKit i686
Removing for dependencies:
PackageKit-glib i686
PackageKit-gstreamer-plugin i686
PackageKit-yum i686
k3b i686
k3b-common noarch
k3b-libs i686
kdebase-runtime i686
kdebase-runtime-flags noarch
kdebase-runtime-libs i686
kdelibs i686
kdemultimedia-libs i686
kdepimlibs i686
mythtv-common i686
mythtv-frontend i686
mythtv-libs i686
phonon i686
phonon-backend-gstreamer i686
qt-webkit i686

Transaction Summary
========================================================
Remove 19 Package(s)

Installed size: 161 M
Is this ok [y/N]:


All I can say is "What the Deuce?"

I'm pretty sick of fighting with this type of bullshit. Why the fuck is anything depending on that PackageKit crap?

So yes, it is ok to remove that snot - it's only a console that saves me walking into the next room to set what i'm going to record anyway. And it's only tv. wodim is easier to use than k3b for burning isos for that matter.

How poetic ... (just arrived in email):
    Date: Fri, 14 Oct 2011 00:13:22 -0400
    From: "Wordsmith" <wsmith@wordsmith.org>
    Subject: A.Word.A.Day--vituperation

    This week's theme: Negative words

    vituperation (vy-too-puh-RAY-shuhn, -tyoo-, vi-) noun

    Bitter and abusive language; condemnation.

    [From Latin vituperare (to blame), from vitium (fault) + parare (to make or
    prepare). Earliest documented use: 1481.]

Later ...

So this episode got me searching for a blacklist option, and I found the exclude option for yum.

Yay!

    exclude=PackageKit
    exclude=pulseaudio

It seems it had something to do with phonon-backend-gstreamer, and there are alternatives which don't need such rubbish.

Never did like gstreamer ...

Chances are the mythtv guys have changed the database format again, so i might hold off on trying to install it anyway: i've had enough excitement for one day. The secret is N-tier architecture guys ...

Special-Case Code and Multi-Pass Algorithms

Ok, so without going into too much detail I have a function which needs to resample 3 float2 planes of data to another resolution, and then perform very simple arithmetic on it (a few mult, add). The scale factors are powers of two up and down. One complication is that the numbers have to be pre-sampled first at pixel corners before being interpolated.

I implemented it initially using bilinear interpolation for simplicity, and yesterday looked at implementing bicubic filtering.

It wasn't really that bad - the routine took about 1.5x as long as the original, which is ok, and overall this was only a 3% impact.

But I thought I would try a few ideas to speed it up ...

A) I separated the routine into separate implementations, one for each scale. I still used the same sampling routine, but just passed it a fixed value for the scale. In previous micro-benchmarks on the bilinear code I noticed this led to a pretty decent improvement.

But in this case it didn't. It slowed down some scales by a factor of 1-2x, and moreover made other routines in the same source file execute slower(!). I can only assume the growth in code size was a significant factor here. I also noticed the register usage hit 63 again - which probably means all I've done is hit a bug in the compiler again (I should really upgrade the driver: we're moving to AMD hardware RSN anyway).

B) Using two passes. A separate scale pass followed by a calculation pass. Intuitively this should be somewhat slower: the calculation after the scaling is simple and can be done in registers.

But of course it turned out faster. Not a huge amount, about 20% for the routine in question.

I did have to do some work to make it happen though: using local memory and 2d workgroup sizes, and separate code for the scaling-down functions (e.g. it just sums a 2x2 block to go down by 2). In this case using separate functions for each size worked quite well (more evidence of compiler bugs). I was also able to batch the 3 planes separately to get added parallelism - the problem size is quite small so this should help.

... and after writing (C) below I re-arranged the upscaler to use hard-coded sizes as well, and re-did the bicubic interpolator to accept the integer coordinates and fractional offsets separately: the compiler can remove some of the calculations here since I'm always using the same pixel offsets.

... and i also experimented with changing the output type to float8 rather than float2 and writing 4 pixels at once for the 4x upscale. This was 2x faster again for this routine (and uses fewer registers?), although I can't trust this number as the results are now broken (and i really have had about enough of it and don't want to debug it).

C) Doing more at once. e.g. doing 1/2, 1, and 2x at the same time. Actually because the 2x scale uses hard-coded interpolation numbers the bicubic interpolation can be simplified greatly (that just gave me an idea to improve B) above).

I didn't get this incorporated because it required a bit of re-arrangement of the host code, but this could shave off a bit more. I usually need a few scales of the same data in each pass so this would be useful.

Conclusions

Although all these could also be applied to the bilinear code, I now (with the changes in B above) have bicubic interpolation for this routine running much the same speed as the original bilinear did.

But it shows that you sometimes don't want to do too much in a given routine - compiler bugs, register spillage, or just more registers end up being used, which adversely affect parallelism and performance. Although a trip to memory is quite costly, these other factors can greatly outweigh it.

After all this, and a few more changes in this particular routine i'm working on, I only managed about a 9% improvement. TBH i'm not sure it's really worth it ... and I probably only went so far as I had a bit of time between getting this to a working state and heading back to reading papers.

Wednesday, 12 October 2011

Awesome-ease Chicken

Been a while since i shared a recipe, and i've been making some variation of this fairly regularly of late ... This is a sort of kitchen-friendly variation on Portuguese Chicken done in an oven. And it's super-shit-easy to make. I used to make it on a BBQ but this is probably nicer to eat and easier to cook properly.

PS I admit i've had a couple of very lovely glasses of Church Block '07 and came up with the utterly-naff name which i've never used before. It's just a super-tasty roast chicken.

1. Cut chicken

Start by cutting a chicken up the breast-bone.


2. Prepare pan

Place a handful of (freshly picked of course) thyme in the middle of a suitably sized dish/oven-proof frying pan.


3. Mount the fowl

Push down on the back of the chicken to flatten it out - you should hear bones/joints breaking - if you're picky you can also break out the rib-bones at this point to make it easier to eat - and then place it over the thyme. I also poked it over with a fork to help the seasoning in and the fat out.


4. Seasoning, Lemon & Salt

Cover with the juice of one (small) lemon and, if you have it, about 2 teaspoons of Asian 'chicken seasoning' - this is about 1/2 salt, with some flour, MSG, onion and stock powder mixed in. A good teaspoon of Vegeta powdered stock, or simply salt and some pepper, would suffice.


5. Seasoning, Herbs

Cover with broken fresh herbs (e.g. sage) and sliced ripe chillies. I also sometimes add a few thin slices of ripe tomato at this point, but my tomato plants are still growing this early in the season ...



6. Cook It

Being flat, it cooks a bit faster even at the normal 180C. I usually baste it a couple of times as well to bring out some colour, and when it looks cooked it usually is. This small fowl was an hour in a pre-heated oven - about 45-50 minutes/kilo rather than 60. I also upped the temperature for the last 15 minutes, but one has to be careful not to burn the herbs too much.


7. Eat It

Because the chicken is laid down flat it traps the steam inside and cooks from both the inside and outside at the same time (i'm sure the black pan helps). This cooks it faster and keeps it very moist. And with the skin upwards it crisps up nicely and builds up a strong flavour.

It scales in the obvious way to larger fowl - I've cooked up to size 20 chickens this way.

Wavelet Denoise & Sharpen

So I had some luck with a bit of fiddling with the scaling function for wavelet sharpening. And managed to get both sharpening and smoothing working at the same time. I'm fairly happy with the results.

Update: see also a further post on using the DCT in a similar way.
Update: I've now implemented a version of this in ImageZ, see the follow-on post

Ok, first the raw Lenna input image I used - converted to greyscale by Java2D. Just to make comparison easier and to add another pretty face to the page.



Now, with the sharpening ramped right up. As you can see it's pretty much the same as using unsharp-mask with a well-selected radius and a medium weight. And like unsharp mask it tends to emphasise any noise.



Unsharp mask/Wiener deconvolution can still work better if the image is simply de-focussed, as they use a PSF to estimate the amount of defocusing.

Now, with the same settings, and also de-noised very heavily. Despite the obvious and unnatural looking heavy processing the edge sharpness and most of the detail is still retained rather well. Most added artefacts are relatively smooth and natural looking too. If you've ever tried using a median filter or a selective Gaussian blur, you'd know they pretty much suck at retaining any texture detail or clean edges.



And finally, a more natural level of sharpening and de-noising.



Pretty happy with it given how simple the maths is. I've over-emphasised some of the results by using high values, but a smooth variation in results between the original and any of the extreme values is possible.

Two steps are applied to each complex coefficient in turn in a way that can be done whilst the coefficients are in registers. So if you have other processing going on it's essentially free.

Threshold De-noise
C = C * { abs(C) > T ? ( abs(C) - T ) / abs(C) : 0 }

Where:

C the complex transform coefficient;
abs(x) returns the magnitude of the complex number x;
T input threshold from about 0.01 to 0.001.

(see the previous post for a dead link to the source of this)

This zeros out small coefficients - which are apparently likely to be noise - and scales the rest to their original range.

Scale Bands
C = C * { ( exp( (bandcount - nband) * scale) - 1 ) * weight + 1 }

Where:

bandcount is depth of wavelet transform;
nband is number of the band (0 is the highest frequency);
scale input sharpness 'gradient' from 0-1; and
weight input sharpness weight from 0-1.

scale is a general 'sharpening factor' setting, and weight specifies how heavily it is applied.
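
Expressed in code the two steps come out as something like this - a rough Java sketch which assumes each band's complex coefficients are stored as separate real/imaginary arrays (the layout is just an assumption, and in practice this sits inside whatever loop already has the coefficients in registers):

// Apply threshold de-noise then band scaling to one band of complex coefficients.
// re/im hold the coefficients for band 'nband' (0 = highest frequency).
static void denoiseAndSharpenBand(float[] re, float[] im,
        int nband, int bandcount,
        float T, float scale, float weight) {
    // Sharpening factor for this band; the highest frequencies get boosted the most.
    float bandScale = (float) ((Math.exp((bandcount - nband) * scale) - 1) * weight + 1);

    for (int i = 0; i < re.length; i++) {
        float mag = (float) Math.hypot(re[i], im[i]);
        // Threshold de-noise: zero small coefficients, shrink the rest back to range.
        float k = mag > T ? (mag - T) / mag : 0;
        // Band scaling for sharpening.
        k *= bandScale;
        re[i] *= k;
        im[i] *= k;
    }
}

With T = 0 and weight = 0 both factors collapse to 1, so the two effects can be dialled in independently.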

Monday, 10 October 2011

Wavelet Denoise

As a test routine for some low-level code I threw together a little test harness of a complex wavelet de-noise algorithm.

It was based on some papers and demo code from this link (which appears to be dead now ... and has been for some time at that). It's just using a very simple threshold-and-scale of the wavelet coefficients, so apart from the relatively expensive Dual-Tree Complex Wavelet Transform it is simple and cheap to implement. The 1.7ms reported is the time to forward transform, apply the thresholding, the inverse, and download the (float) image to Java and convert it to a greyscale byte image. (I know, the screenshot should have been a png, so it's not entirely clear here ...)


This has nothing to do with what I'm working on but I thought it looked quite interesting. It preserves edge detail much better than techniques like a median filter or a Gaussian blur, and introduces fewer artefacts compared to the adaptive blurs I've seen. According to that now-broken link, using the complex wavelet transform produces subjectively better results compared to the DWT.

Perhaps i could use it as a processing step: if you already have the DTCWT coefficients it's a cheap additional process. Somewhat like doing a convolution in the frequency domain, it's basically free if you're already there.

I also played a bit with working out a sharpening algorithm on the weekend - I couldn't really find any simple papers: they all relied on adaptive processes, and the results reported didn't seem worth all the effort. In the end all I did was linearly scale the coefficients by some made-up numbers. Scale up the highest frequency components and scale each subsequent wavelet band by 1/2 of the one above.

Unsharp Mask vs Wavelet Sharpen by scaling coefficients with approximately (but not a very good approximation) similar adjustment. Unsharp Mask is on the left.

The result is pretty much the same as unsharp-mask, but it only takes 1 tuning parameter instead of 2, and subjectively it appears to me to be a smidgen less noisy. But I need to experiment a bit more; one would expect to be able to reduce the noise compared to unsharp mask, and I think my low frequency scaling factors are out and affecting the tonal quality too much.

Saturday, 8 October 2011

Sharpening ImageZ

I thought it about time to fix a few little bits and pieces with ImageZ that I actually use ... so I tackled some of that. I fixed some of the wiener deconvolution code - so that odd-sized images work for instance. I also tried thoroughly thread-ising it, although I only got a modest performance boost: jtransforms is already using multiple threads for the FFT which is the expensive bit.

Unsharp mask in a feathered mask. I dialed it up to make it obvious.


Unsharp mask is something I always find really handy, so I finally coded that up too. Rather than start with the mess of the Gaussian filter code I already have, I coded another one from scratch. It's a bit simpler, so I will merge and share the code at some point, or at least put it in a common place. It also mirrors the edges rather than clamping, which seems to produce a more natural response on the edges.
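
The guts of it amount to something like the following - a simplified single-channel sketch of the usual blur-and-subtract form with mirrored edges, not the actual ImageZ code (kernel is assumed to be a normalised 1-D gaussian).

// Unsharp mask: out = src + amount * (src - gaussian_blur(src)).
// Edges are mirrored rather than clamped when the kernel runs off the image.
static float[] unsharpMask(float[] src, int width, int height,
        float[] kernel, float amount) {
    int r = kernel.length / 2;
    float[] tmp = new float[src.length];
    float[] blur = new float[src.length];

    // horizontal blur pass
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0;
            for (int k = -r; k <= r; k++)
                sum += kernel[k + r] * src[y * width + mirror(x + k, width)];
            tmp[y * width + x] = sum;
        }
    // vertical blur pass
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0;
            for (int k = -r; k <= r; k++)
                sum += kernel[k + r] * tmp[mirror(y + k, height) * width + x];
            blur[y * width + x] = sum;
        }
    // combine: add back the high-pass component
    float[] out = new float[src.length];
    for (int i = 0; i < src.length; i++)
        out[i] = src[i] + amount * (src[i] - blur[i]);
    return out;
}

// Reflect an out-of-range index back into [0, n) - mirror-without-repeat edges.
static int mirror(int i, int n) {
    if (i < 0)
        i = -i - 1;
    if (i >= n)
        i = 2 * n - i - 1;
    return i;
}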

There are still a couple of things I use the gimp for that i'd rather not have to, but I guess that can wait for another day.

I really need to get out of the house this weekend, but i've pretty much pulled up all the weeds, it's been raining enough to water the garden, and the neighbours were using a chainsaw this morning. So I just found myself stuck at the computer again ... and I might watch the rugby on soon too.

Java v OpenCL/CPU

I've been using the AMD CPU driver a bit for debugging and testing: i never really considered it for performance but for various reasons late tonight I ended up poking around with a simple routine and wondered how it compared.

At first I thought i'd discovered a disaster, but that's because I wasn't initialising the data: too many non-normal floating point operations slowing it down significantly. Oops, glad I checked that before posting. Although it's getting late so who knows what else I may have stuffed up.

I was testing using a simple matrix multiply, a 4096x4096 matrix stored in row-major order, multiplied by a 4096 row column-vector. It isn't something i'm in any need of, but after poking around this site which i've read a few times, and with nothing on TV I decided to play around a bit. Then after exhausting my interest on the GPU I tried the CPU version - I was originally going to see if just doing it locally with the CPU driver would be quicker than a device copy and back, but it isn't, the GPU is still 5-10x faster.

I tested 4 implementations:
  1. OpenCL written for a CPU target using float types, one work-group and one work-item per row, 4096 work groups
  2. OpenCL using float4 types, same
  3. Java, single threaded
  4. Java, using a ThreadPoolExecutor w/ 12 threads, 32 jobs.
Code            Time (s)
Java single     1.5
Java pool       0.39
OpenCL float    0.43
OpenCL float4   0.37

So I had to resort to float4 types to beat the thread pool code, and then only just. It's kind of debatable as to which is easier to write: the Java code must explicitly deal with the range allocation and job launching. But then it's all built-in, and doesn't require a different language, runtime, interface and foreign memory management ... one that's prone to crashing with zero information and is otherwise excruciatingly difficult to debug at that. Ok scratch that: the Java clearly wins here.
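
For reference, the pooled Java version was along these lines - a sketch of the same shape of code rather than the exact benchmark, splitting the 4096 rows into 32 jobs over a 12-thread pool:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MatVec {
    // y = A*x where A is n x n row-major and x has n elements.
    static void multiply(final float[] A, final float[] x, final float[] y,
            final int n, ExecutorService pool, int jobs) throws Exception {
        List<Callable<Void>> tasks = new ArrayList<Callable<Void>>();
        int rowsPerJob = (n + jobs - 1) / jobs;
        for (int j = 0; j < jobs; j++) {
            final int start = j * rowsPerJob;
            final int end = Math.min(n, start + rowsPerJob);
            tasks.add(new Callable<Void>() {
                public Void call() {
                    for (int r = start; r < end; r++) {
                        float sum = 0;
                        int off = r * n;
                        for (int c = 0; c < n; c++)
                            sum += A[off + c] * x[c];
                        y[r] = sum;
                    }
                    return null;
                }
            });
        }
        // invokeAll blocks until every row-range has been computed.
        pool.invokeAll(tasks);
    }

    public static void main(String[] args) throws Exception {
        int n = 4096;
        float[] A = new float[n * n], x = new float[n], y = new float[n];
        java.util.Random rand = new java.util.Random(42);
        for (int i = 0; i < A.length; i++) A[i] = rand.nextFloat();
        for (int i = 0; i < n; i++) x[i] = rand.nextFloat();

        ExecutorService pool = Executors.newFixedThreadPool(12);
        long t = System.nanoTime();
        multiply(A, x, y, n, pool, 32);
        System.out.printf("%.3fs%n", (System.nanoTime() - t) / 1e9);
        pool.shutdown();
    }
}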

One can either conclude that the AMD compiler is a bit below-par to start with (most likely true), and only by using vectorised code was it able to beat the Java. Or perhaps that the hotspot compiler is rather good at this particular problem (again, most likely true), and is possibly using SSE opcodes to implement the loop too. Not that SSEn really seems to add much of a boost in general apart from a few extra registers - it's not like on an SPU where vectorised code can be 10x faster than scalar.

I had until this point thought of the CPU drivers for OpenCL as providing a sort of 'portable assembly language' for higher level languages, but if you have a decent compiler already it doesn't seem worth it - at least for some problems.

I suppose another implementation might do better; but you're still stuck with a pretty hostile debugging environment and if you're after performance you'll be using a GPU anyway. So about all it seems useful for is debugging/verifying code. Given that, perhaps it would be useful to add more checking in the compiled code to help with debugging rather than worrying about performance ... Unlike C, OpenCL has a much simpler memory model for which accurate and full run-time address-range-checking can be ?easily? added.

Thursday, 6 October 2011

Images vs Arrays 4

Update 7/10/11: I uploaded the array convolution generator to socles

And so it goes ...

I've got a fairly convoluted convolution algorithm for performing a complex wavelet transform and I was looking to re-do it. Part of that re-doing is to move to using arrays rather than image types.

I got a bit side-tracked whilst revisiting convolutions again ... I started with the generator from socles for separable convolution and modified it to work with arrays too. Then I tried a couple of ideas and timed a whole bunch of runs.

One idea I wanted to try was using a rolling buffer to reduce the memory load for the Y convolution. I also wanted to see if using more work-items in a local workgroup to simplify the local memory load would help or hinder. Otherwise it was pretty much just getting an array implementation working. As is often the case I haven't fully tested these actually work, but i'm reasonably confident they should as i fixed a few bugs along the way.

The candidates

convolvex_a
This is a simple implementation which uses local memory and a work-group size of 64x4. 128x4 words of data are loaded into the local memory, and then 64x4 results are generated in parallel purely from the local memory.

convolvey_a
This uses no local memory, and just steps through the addresses vertically, producing 64x4 results concurrently. As all memory loads are coalesced it runs quite well.

convolvex_b
This version tries to use extra work-items just to load the memory, afterwards only using 64x4 threads. In some testing I had for small jobs this seemed to be a win, but for larger jobs it is a big hit to concurrency.

convolvey_b
This version uses a 64x4 `rolling buffer' to cache image values for all items in the work-group. For each row of the convolution, the data is loaded once rather than 4x.

imagex, imagey
Is from the socles implementation in ConvolveXYGenerator which uses local memory to cache input data.

simplex, simpley
Is from the socles implementation in ConvolveXYGenerator which relies on the texture cache only.

convolvex_a(limit)
Is a version of convolvex_a which attempts to only load the amount of memory it needs, rather than doing a full work-group width each time.

convolvex_a(vec)
Is a version of convolvex_a which uses simple vector types for the local cache, rather than flattening all access to 32-bits to avoid bank conflicts. It is particularly poor with 4-channel input.

The array code implements CLAMP_TO_EDGE for source reads. The image code uses a 16x16 worksize, the array code 64x4. The image data is FLOAT format, and 1, 2, or 4 channels wide. The array data is float, float2, or float4. Images and arrays represent a 512x512 image. GPU is Nvidia GTX 480.

Results

The timing results - all timings are in micro-seconds as taken from computeprof. Most were invoked for 1, 2, or 4 channels and a batch size of 1 or 4. Image batches are implemented by multiple invocations.
                            batch=1             batch=4
channels                   1    2    4         1    2    4

convolvex_a               42   58  103       151  219  398
convolvey_a               59   70  110       227  270  429

convolvex_b               48   70  121       182  271  475
convolvey_b               85  118  188       327  460  738

imagex                    61   77  110       239  303  433
imagey                    60   75  102       240  301  407

simplex                   87   88  169
simpley                   87   87  169

convolvex_a (limit)       44   60   95       160  220  366
convolvex_a (vec)              58  141

Thoughts

  • The rolling cache for the y convolution is a big loss. The address arithmetic and need for synchronisation seem to kill performance. So much for that idea. I guess there just isn't enough work to do each loop to make it worth it (it only requires a single mad per thread).

  • Using more threads for loading, then dropping back when doing arithmetic is also a loss for larger problems since it limits how many groups of workgroups can execute on an SM.

  • Trying to reduce the memory accesses to only those required slows things down until you hit 4 element vectors. I guess for float and float2 the cached reads are effectively free, whereas the divergent branch is not.

  • Even with the texture cache, images benefit significantly from using a local cache.

  • Even with the local cache, images trail the array implementation - until one processes 4-element vectors, in which case they are even stevens for single images.

  • Arrays can also be batched - processing 'n' separate images concurrently. This adds a slight extra benefit as it can more fully utilise the SM cores, and reduces the need for extra host interaction. For smaller problems this could be important although this problem size is already giving the GPU a good sized workout so the differences are minimal.

  • Using single-channel data is under-utilising the GPU by quite a bit.

When I get time and work out how I want to do it I'll drop the array code into socles.

Saturday, 1 October 2011

Images vs Arrays 3

So i've been working on some code that works with lots of 2d array data: which gives me the option of using arrays or images.

And ... arrays won out: for simple memory access and writing they are somewhat faster than using images. And that's before you add the ability to batch-process: with images you're pretty much stuck with having to pass each one at a time and only pack up to 4 values in each element (3D image writes are not supported on my platform atm). With arrays you can process multiple 2D levels at once, or even flatten them if they are element-by-element - which can allow you to better fit the problem to the available CUs.

In some cases the improvements were dramatic where a lot of writes to different arrays were required (but the writes were otherwise independent).

Anyway, one particular area I thought images would still be a noticeable win is with some interpolation code I had to implement. I need to do fixed power of 2 scaling up and down. Apart from the bi-linear interpolation 'for free', there is also an interesting note in graphics gems 2 about using the bi-linear interpolation of the texture unit to perform bi-cubic interpolation using only 4 texture fetches rather than 16.

So I ran some tests with both an image and array implementation of the following algorithms:
  1. Bi-linear interpolation.
  2. Fast Bi-cubic using the graphics gems algorithm with a 64-element lookup table (I found the lookup-table version significantly faster than the calculated one).
  3. Bi-cubic using 64-element lookup tables generated from the convolution algorithm in wikipedia.
In both cases I was using float data, a 512x512 image, and 4x scaling in X and Y, and the numbers are in uS from the Nvidia profiler. The array implementation is doing CLAMP_TO_EDGE.
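
As an aside, the 64-element table for the 'table bi-cubic' case is just the four cubic convolution weights pre-computed for 64 fractional offsets, along the lines of the sketch below. I've used the standard a = -0.5 kernel here, which may or may not match exactly what the OpenCL version uses.

// Build a 64-entry lookup table of cubic convolution weights.  Entry i holds
// the 4 weights for a fractional offset of t = i/64 between the two centre
// samples; interpolation then reads samples at offsets -1, 0, 1, 2.
static float[][] buildCubicTable() {
    final double a = -0.5;              // standard cubic convolution parameter
    float[][] table = new float[64][4];
    for (int i = 0; i < 64; i++) {
        double t = i / 64.0;
        table[i][0] = (float) cubic(a, 1 + t);
        table[i][1] = (float) cubic(a, t);
        table[i][2] = (float) cubic(a, 1 - t);
        table[i][3] = (float) cubic(a, 2 - t);
    }
    return table;
}

// Cubic convolution kernel as per the wikipedia bicubic interpolation page.
static double cubic(double a, double x) {
    x = Math.abs(x);
    if (x <= 1)
        return (a + 2) * x * x * x - (a + 3) * x * x + 1;
    if (x < 2)
        return a * x * x * x - 5 * a * x * x + 8 * a * x - 4 * a;
    return 0;
}

// 1D usage: pm1..p2 are the samples at offsets -1..2 around the target point.
static float interp1d(float[] w, float pm1, float p0, float p1, float p2) {
    return w[0] * pm1 + w[1] * p0 + w[2] * p1 + w[3] * p2;
}

The 2D case applies it separably - 4 horizontal interpolations then 1 vertical, 16 reads in all - which is exactly where the graphics gems trick of substituting 4 bilinear texture fetches pays off.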

The results were quite interesting.
                     Image    Array
bi-linear               40       36
fast bi-cubic           56       79
table bi-cubic         106       63

With this sort of regular access, the array version of the bi-linear interpolation is actually slightly faster than the image version, although they approach each other as the scale approaches 1. This is a bit surprising.

Images do win out for bi-cubic interpolation, but the array version isn't too far off.

And in either case, the bi-cubic interpolation is really fairly cheap: only about 1.5x the cost of bi-linear which is 'pretty cool' considering how much more work is being done.

I also started to investigate a bi-cubic interpolator that uses local memory to cache the region being processed by the local work-group. Since the actual memory lookups are very regular and the block will always access at most worksize+3 elements of data (for scaling=1) it seemed like a good fit. I just tried a single 64x1 workgroup and managed around 60uS with some slightly-broken code: so perhaps the gap could be closed further.

Actually one problem I have is a little more complicated than this anyway: the samples I need to work on are not the base samples, but offset by half a pixel first to produce N+1 of them. With arrays I can use local memory to cache this calculation without having to either run a separate step or do many more lookups: so in this case it will almost certainly end up faster than the image version and I will have to get that local array version working.

For float4 data the images are only about 1.5x faster for this interpolation stuff: which for my problems is not enough to make up for the slower direct access. And the bicubic resampling is also 2-3x slower than the bi-linear; the amount of extra arithmetic is catching up.

Conclusions

Well, about all I conclude is that Nvidia's OpenCL implementation sucks at texture access. I've looked at some of the generated code and each image lookup generates a large chunk of code that appears to be a switch statement. For very big problems most of this can be hidden with overlapped processing but for smaller problems it can be fairly significant. I'm surprised that they, or OpenCL doesn't have some way of telling the compiler that a given image2d_t is always a specific type: the access could be optimised then. FWIW I'm using a driver from a few months ago.

Also I guess: the global memory cache isn't too bad if you have a good regular memory access pattern. Even optimised code that resulted in 4 simple coalesced global memory accesses per thread vs 16 was only a slight improvement.

Of course the other conclusion is that it's a fairly simple problem and no amount of 'cache' optimisation will hide the fact that at some point you still need to go to main memory, for the same amount of data.

I should really do some timings on AMD HW for comparison ... but the computer is in the next room which is also cold and dark.

Final Thought

If you really are doing image stuff with affine transformations and so on, then images are going to win because the access pattern will be regular but it wont be rectangular. The data-types available also match images.

But for scientific computing where you are accessing arrays, images are not going to give you any magical boost on current hardware, and they can sometimes be more difficult to use. Arrays also allow more flexible memory management (e.g. I can use the same memory buffer for smaller or multiple images) and the ability to batch in the 3rd dimension.