Saturday, 25 April 2015

Streaming tiles

Yesterday I pretty much got lost all day poking around streams and iterators and puzzling over compiler nuances. It was a horrid cold and windy day and my foot's in no state to do anything so I didn't really have much else to do on an rdo. I learnt way more than I really wanted to ...

I was trying to come up with ways to make pixel-programming 'easy' and fit it into the stream framework yet still run at near-optimal efficiency. Streams of primitives are fine for single-channel images but since you can only return a single value the typical 'function' interfaces just don't work.

This means that you have to return multi-element values in some other structure, and given that the data is stored that way anyway, just writing directly to the storage array with an offset works.

My initial approach was to internalise the loop over rows or spans of consecutive pixels and just pass offsets and a length, i.e. all operators are batch/vector operators. This lets one amortise per-row addressing costs and the compiler can do some other optimisations fairly simply.

The problem is that you're still left writing yet-another-fucking-for-loop each time you write a new function ... and apart from that you've lost the positional information which is sometimes required. Ho hum.
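As a rough sketch of what that batch style looks like (the names RowOp and RowBatchSketch are made up for illustration and are not the library's actual interface), assuming a single-channel byte image with a row stride:

```java
import java.util.Arrays;

/** Hypothetical batch operator: processes a run of consecutive pixels in place. */
interface RowOp {
    void apply(byte[] data, int offset, int length);
}

final class RowBatchSketch {
    /** Invokes op once per row, so the row address is only computed once per row. */
    static void forEachRow(byte[] data, int width, int height, int stride, RowOp op) {
        for (int y = 0; y < height; y++) {
            op.apply(data, y * stride, width);
        }
    }

    /** Example operator: invert. Every new operator has to repeat this inner loop. */
    static final RowOp INVERT = (data, offset, length) -> {
        for (int i = offset; i < offset + length; i++) {
            data[i] = (byte) ~data[i];
        }
    };

    public static void main(String[] args) {
        byte[] image = {0, 1, 2, 3, 4, 5}; // 3x2 pixels, stride 3
        forEachRow(image, 3, 2, 3, INVERT);
        System.out.println(Arrays.toString(image));
    }
}
```

Note that the operator only ever sees an offset and a length, so it has no idea which x,y it is working on.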

So then I tried functions which operate on a single pixel but are given the coordinates and the channel number as parameters. This looked fairly promising and the performance is still good but it doesn't really solve the initial problem in that each channel is still being processed separately.
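Roughly, and again with invented names (PixelOp and PixelOpSketch are only for illustration), the per-pixel variant might look like this for interleaved channels. The loops move into the framework, but each callback still only gets one channel value:

```java
/** Hypothetical per-pixel callback: one channel value in, one channel value out. */
interface PixelOp {
    int apply(int x, int y, int channel, int value);
}

final class PixelOpSketch {
    /** The loops live here once; interleaved channels are still visited one at a time. */
    static void apply(byte[] data, int width, int height, int channels, PixelOp op) {
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                for (int c = 0; c < channels; c++, i++) {
                    data[i] = (byte) op.apply(x, y, c, data[i] & 0xff);
                }
            }
        }
    }

    public static void main(String[] args) {
        byte[] rgb = new byte[2 * 2 * 3]; // 2x2 pixels, 3 channels
        // Horizontal ramp in channel 0 only: the position is available,
        // but the other channels of the same pixel are not.
        apply(rgb, 2, 2, 3, (x, y, c, v) -> c == 0 ? x * 255 : v);
        System.out.println(rgb[3] & 0xff);
    }
}
```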

Pixel

Then I tried a Pixel accessor object. This somewhat surprisingly ... was actually very fast.

Well not really. Is it? I thought it was? Things get a little odd here. I had various other cases but the most recent and relevant one is that as soon as the same iterator class is used for more than two callbacks, the performance suddenly plummets by 5-10x and stays there forever more. This would explain the strange results that I encountered in the previous post. It seemed like a compiler bug so I tried some older JDKs and they were identical, but since I don't feel like removing all the lambdas they were all 1.8.x. It may just be a space/speed tradeoff rather than a bug.

So, ... well shit to that.

And I tried everything ... I first noticed that sharing a spliterator class between different target classes seemed to be the problem, but then I found I could use a general spliterator and then use a custom iterator for the inner-most loop, but then I found even that drops in speed once you have too many functions. So I tried per-target iterators and spliterators and forEach() implementations without luck.

Since I had some pretty poor performance from immutable base types before I abused the interface somewhat anyway: each pixel is the same instance and I just change the parameters for each consumer invocation. This means the object can't be passed out of the first map() or forEach() stage. It still works for many cases though.

For example:

// greyscale quantise to 16 levels
src.pixels().forEach((Pixel p) -> {
    int v = p.get(0) & 0xf0;
    // replicate the high nibble into the low nibble
    p.set(v | v>>4);
});
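To make the 'same instance every time' trick concrete, here is a minimal stand-alone sketch. PixelCursor and forEachPixel are hypothetical stand-ins rather than the real Pixel class, and it assumes a single-channel byte image processed serially:

```java
import java.util.function.Consumer;

/** Hypothetical single-channel accessor; one instance is re-aimed at each pixel in turn. */
final class PixelCursor {
    private final byte[] data;
    private int index; // current position in data
    int x, y;          // current coordinates

    PixelCursor(byte[] data) { this.data = data; }

    void moveTo(int x, int y, int index) { this.x = x; this.y = y; this.index = index; }

    int get(int channel) { return data[index + channel] & 0xff; }

    void set(int value) { data[index] = (byte) value; }
}

final class PixelCursorSketch {
    /** Serial forEach with no per-pixel allocation; the cursor must not escape the callback. */
    static void forEachPixel(byte[] data, int width, int height, Consumer<PixelCursor> op) {
        PixelCursor p = new PixelCursor(data);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                p.moveTo(x, y, y * width + x);
                op.accept(p);
            }
        }
    }

    public static void main(String[] args) {
        byte[] grey = {(byte) 0x12, (byte) 0x7f, (byte) 0x80, (byte) 0xff};
        forEachPixel(grey, 2, 2, p -> {
            int v = p.get(0) & 0xf0;
            p.set(v | v >> 4);
        });
        for (byte b : grey) System.out.printf("%02x ", b & 0xff);
        System.out.println();
    }
}
```

If a callback stashed the cursor in a list, every entry would end up pointing at the last pixel visited, which is why it can't be passed on to a later stage.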

Just for kicks I might see how well an immutable accessor works - the pixel data will still be stored in the original image but the 'Pixel' object will track its location so it can be passed through a map function. I can't see much use for chaining per-pixel processing in this way though as you can just add the extra steps in the same function.

If i'm really bored I might try one where the pixel data itself is also copied so you'd need to use a collector to reconstruct an image. Again; I can't see how this would be useful though.

Tile

Because I thought I'd solved the Pixel performance problem, and had at least exhausted all my current ideas, I thought I'd tackle the next big one: tile processing.

So the goal here is to break an image into tiles, process each one independently, and then re-form the result. This ... actually looks a lot like a stream processor and indeed it fits the model quite well assuming you retain the tile home position through the stream (does this make it not really a stream? I don't know, I think it's close enough and the arguments otherwise are a bit pointless).

After a couple of iterations (you can do a lot in 14 hours) I'm getting pretty close to something i'm pretty pleased with. It supports a copy mode which can pass tiles through a processing pipeline and an accessor mode which can be used for readonly access or per-pixel operations. Tiles can be overlapped. I've got a collector which forms the tiles back into a single image based on their coordinates.

A few more ideas to try but this is the sort of thing I've got working already:

// process an image by tiles into a new image
BytePixels1 src = ...;
dst = src.tiles(64, 64, 0, 0)
    .map((Tile t) -> {
            System.out.printf("do stuff @ %d,%d\n", t.getX(), t.getY());
            return t;
        })
    .collect(new TileCollector(src.create()));

Because there are fewer of them there aren't so many concerns with gc blow-out as with the pixel case so these are pipelineable objects.
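For anyone wondering how a collector can put tiles back together, this is a cut-down sketch of the idea using only the standard Collector API. CopyTile, tiles() and into() are invented for the example and are not the classes used above; it assumes a single-channel byte image, no overlap, and a sequential stream:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Stream;

/** Hypothetical copy-mode tile: owns its pixels and remembers its home position. */
final class CopyTile {
    final int x, y, width, height;
    final byte[] data;

    CopyTile(int x, int y, int width, int height, byte[] data) {
        this.x = x; this.y = y; this.width = width; this.height = height; this.data = data;
    }
}

final class TileStreamSketch {
    /** Cuts a single-channel image into tiles of at most tw x th pixels. */
    static Stream<CopyTile> tiles(byte[] src, int width, int height, int tw, int th) {
        List<CopyTile> out = new ArrayList<>();
        for (int ty = 0; ty < height; ty += th) {
            for (int tx = 0; tx < width; tx += tw) {
                int w = Math.min(tw, width - tx);
                int h = Math.min(th, height - ty);
                byte[] data = new byte[w * h];
                for (int r = 0; r < h; r++) {
                    System.arraycopy(src, (ty + r) * width + tx, data, r * w, w);
                }
                out.add(new CopyTile(tx, ty, w, h, data));
            }
        }
        return out.stream();
    }

    /** Collector that copies each tile back to its home position in dst (sequential use only). */
    static Collector<CopyTile, byte[], byte[]> into(byte[] dst, int width) {
        return Collector.of(
            () -> dst,
            (img, t) -> {
                for (int r = 0; r < t.height; r++) {
                    System.arraycopy(t.data, r * t.width, img, (t.y + r) * width + t.x, t.width);
                }
            },
            (a, b) -> a);
    }

    public static void main(String[] args) {
        byte[] src = new byte[6 * 4];
        byte[] dst = tiles(src, 6, 4, 4, 3)
            .map(t -> { for (int i = 0; i < t.data.length; i++) t.data[i]++; return t; })
            .collect(into(new byte[src.length], 6));
        System.out.println(dst[0] + " " + dst[dst.length - 1]);
    }
}
```

Because each tile carries its own x,y the collector doesn't care what order the tiles arrive in.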

The practical api

I'm not going for api perfection but i'm working on practicality with a heavy dose of performance.

So I like the tile streaming bit: this is just a necessary feature and it's a pleasantly good fit for java streams and works well with threads.

I'm still not sure on the approach to pixels yet; so far I've got convenience or performance but not both (at least not in the general case). This latter target seems to be in a bit of a fight with the compiler but there might be a way to put the specialisation in the correct place so the compiler knows how to optimise it or at least not slow down other cases. Another issue is that any temporary storage needs to be allocated in each callback as you don't have access to thread-instance information in order to index pre-allocated information. So anything that needs extra workspace will add even more overhead.

Ideally I would like to support java streams ... but it may be more practical to just use different mechanisms for pixels and leave the stream and multi-thread handling to tiles. i.e. i could just have serial map(), forEach(), and reduce() calls which can be invoked per-tile and leave the combining to the higher levels. Yes that sounds reasonable.
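As a sketch of that split (not code from the library; TileReduceSketch is invented, and it uses bands of rows rather than proper 2D tiles to keep it short), a plain serial loop does the per-tile work and a parallel stream only handles the combining:

```java
import java.util.stream.IntStream;

final class TileReduceSketch {
    /** Serial per-tile reduction: a plain loop over one band of rows [y0, y1). */
    static long sumRows(byte[] data, int width, int y0, int y1) {
        long sum = 0;
        for (int i = y0 * width; i < y1 * width; i++) {
            sum += data[i] & 0xff;
        }
        return sum;
    }

    /** Threads are only used across bands; the stream combines the partial sums. */
    static long sum(byte[] data, int width, int height, int bandHeight) {
        int bands = (height + bandHeight - 1) / bandHeight;
        return IntStream.range(0, bands)
            .parallel()
            .mapToLong(b -> sumRows(data, width,
                                    b * bandHeight,
                                    Math.min(height, (b + 1) * bandHeight)))
            .sum();
    }

    public static void main(String[] args) {
        byte[] image = new byte[100 * 10];
        java.util.Arrays.fill(image, (byte) 200);
        System.out.println(sum(image, 100, 10, 3));
    }
}
```

The inner loop never touches the stream machinery, so whatever the compiler does with shared iterator classes shouldn't affect it.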

And there are also some cases where the stream interface doesn't really fit such as a calculation across the same location of different images (apart from using a spliterator to create a multi-source accessor?).

As an old boss used to say, just gotta suck it and see. Time to suck?

Steaming piles

The syntax for generics is giving me the shits a bit. Nesting gets ugly very quickly and when things get complicated it's a bit of pot-luck working out what is needed. This is obviously something that could be learnt but it's not very useful knowledge and I'd rather be doing more interesting things. So I'm mostly just twiddling strings and seeing what netbeans tells me - when it decides to respond.

Something has 'happened' to netbeans too. It's using way too much memory and running far too slow. I'm constantly getting out of memory errors which then throws the whole system into a confused state. I've tried closing all the projects i'm not working on and regularly close all windows and even disabling all the plugins i'm not immediately using right now. Barely makes any difference. It starts at 500MB and keeps growing until I have to restart it.

And that's besides all of its really annoying behaviours which are apparently not bugs. I've turned off all the tooltips and everything automatic I can find but it still wants to pop up this shitty argument list thing every time I do an expansion via ctrl-space: I'm constantly having to hit escape just to fuck that shit off. And it seems to have given up guessing types altogether now. Type `Foo f = new <ctrl-space>' - No, I do not want or could possibly want to ever create a new AbstractMethodError. Its 'smart' build system is constantly out of date - I never know when I'll need to clean-rebuild so rather than face that hassle I just have to do it every time; and even then it executes some other set of classes so can be out of sync. It seems to randomly decide when it'll let you run the same application twice or not. That's one of those utterly pointless features that just isn't necessary and should be removed as soon as it doesn't work perfectly every single time - there have been enough bug reports about it.

And we're talking fairly small toy projects with few or no dependencies, written by one man. This image library is only 5KLOC of code lines or 12KLOC of source.

Emacs is looking more attractive but that has big problems formatting straightforward C these days (all too often forcing a file close - reopen for a momentary reprieve) and with the size of the java runtime api it's just not practical typing out every function call, package, and type name.

3 comments:

Solerman Kaplon said...

NB creates some kind of cache/index files inside ~/.local or ~/.netbeans, time to clean it up I guess

NotZed said...

version 8 no longer has ".netbeans//var/cache", don't worry as a long-time netbeans user i got used to flushing that regularly when it existed.

So i don't know where or if it's caching classes, but it certainly appears as though it's caching something that can get out of sync with build/classes.

The out of memory errors are likely to break things so it's probably just related to that.

NotZed said...

oh i found the cache, it moved to ~/.cache, which i vaguely recall finding before.

nearly a gb of snot in there, so perhaps that has something to do with it running like a pig too. After blowing that away and restarting with just the project i'm working on it's down to ~60MB but i'm sure it'll grow quickly as i visit more projects.

Not much of a "cache" if it grows unbounded without expiration ...

Some of that looked like it was left over from when i had the utterly stupid idea of building a single maven project. That won't happen again.