Sunday 31 August 2014

egpu mk ii part 2

After a couple of days relaxing break including a nice ride down to the coast yesterday I had another look and the ezegpu today.

First task was just to create a common 'demo' frontend which can be linked to different backends so I can easily test different cases. I then created a backend based on the current mk ii state.

Well, I guess I jumped the gun a bit the other day by testing it with a poor example of large and mostly coincident triangles. Using the star-grid test the implementation is considerably faster than the line based renderer. The test code uses slightly different parameters but a 4x4x4 star test is now hitting 57fps vs 35fps for the line-based version, versus 31fps for single-core arm.

Then I upped the test to 8x8x8 stars (total of 4096 triangles) and zoomed out a bit and now the improved primitive input stage and 2d grouping really starts to show it's paces: 22fps vs 7fps. The single-core ARM code is coping a bit better at 11fps.

Well that was nice to see I guess.

I guess i'll have a look through the points of the last few posts to decide what to look at next.

Thursday 28 August 2014

some notes

Some waking up thoughts to jot down for later. It's too nice to be inside today. I have stuff I should be doing but i'm a little immobile due to hurting my foot again so I might just sit in the sun drinking. I thought it was better and over-tested it last weekend - and I wasn't even drinking :(. I can get around ok - it just doesn't heal at all if i don't rest it enough.
  • Expand the state machine on the controller and make all DMA asynchronous - currently some is not.
  • Use both DMA channels in the controller asynchronously, probably 0 for external reads and 1 for internal writes.
  • A single non-chained DMA request can load up to two distinct objects if they are within 32K (why oh why weren't the strides 32-bit!), halving the dma request rate.
  • Implement an asynchronous queue primitive combining eport + remote queue + async dma, or another primitive if it simplifies execution. For example I did manage to avoid the need for the double-dma yesterday to write the "dma complete" status on a variable-length record: I just wrote the record backwards!
  • A controller for the line-mode rasteriser should significantly address the source bandwidth problem.
  • Investigate synchronising the write stage to take advantage of the improved write performance of the memory interface. This probably needs to run asynchronously via interrupt code.
  • The rasteriser can avoid the need to calculate the edge equations per-pixel if it knows a given region is fully in-range. This reduces the inner loop by 3 flops and 3 iops but is hard to take full advantage of due to flop latency.
  • The C compiler is doing a fair job of the rasteriser loop but is still making a bit of a pigs breakfast of the address calculation. I think the 21 instructions can be reduced to 15 but to be effective I need to convert a large swathe of code to assembly.
  • Use edge equations to accurately index the tiles.
  • Investigate optimised renderers for special cases (for particles?). Flat shading reduces the fragment processor to a simple write (or alpha blend). Flat shading + disabled z buffer + full rectangle reduces to a simple rectangular write. etc.
  • Investigate primitive synthesis on-core. e.g. particles.

I had some further thoughts on the results of yesterday; even though it's half the speed of the line renderer considering the complexity of the interactions and the forced requirement of an additional read/write cycle across another core for each fragment - it's probably actually fairly good. The main bottleneck seems to be the mismatch of rasterisation to fragment rendering time which has nothing to do with the architecture - but the fragment shaders are only trivial 3-term colour interpolations and if they were more complex then shifting the rasterisation to another core would leave more time for them to execute. So I will still hook it up to the gl frontend to test it and other backends which can use the same or similar controller setup.

Although I think due to the possibility of other highly optimised special cases a combined implementation will still be the ultimate target.

Wednesday 27 August 2014

egpu mk ii.5

Well that took a bit longer than I wanted; and all i've done is rejigged all the comms around but that's enough for today.

I made a bunch of changes to address some of the problems; i'm still not sure it will fix the performance but it's some stuff I wanted to look at anyway. The big performance issue remaining is the rasteriser to fragment processor stream; I have a new communication protocol that addresses it as much as possible and have changed the fragment processor to use it but I haven't written the rasteriser to feed it yet. I was going to do a quick-and-dirty but that would just be wasted work and working toward the current target goal ended up ballooning out into a big pile of changes.

  • I've changed the 4xtile geometry to 1x4 = 64x32 rather than 4x1 = 256x8. It made sense at the time. I'm hoping this simplifies the work of generating fragments in the right order but if nothing else it should divvy up the work more evenly.

  • I'm probably going to interleave the 4x64x8 tiles; the whole-row rasteriser showed that this distributes the work-load more effectively than other approaches, and maybe it will simplify the rasterisation loop as it needs to interleave output for effective streaming.

  • I'm creating a 2D index based on 64x64 tiles - this can be tuned a bit but very quickly chews memory depending on what limits i set (then again, that 32MB of shared ram isn't doing anything else atm). I'm just using the bounding box but it is quite simple and efficient to use the edge equations to make it exact to the index resolution.

  • The controller now assigns tiles to rasterisers and dynamically schedules across all of them. This is probably the most interesting change and once I had the 2D index (which is trivial) wasn't much more work than the static assignment.

    It scans across the playfield and assigns tiles to each rasteriser (3 in the current design) in turn. It then tries to keep each full of work by feeding them commands and primitives loaded from main memory using round-robin scheduling and non-blocking writes (i.e. it skips that rasteriser if it would block). If any rasteriser runs out of primitives it immediately flushes that one off and re-assigns it a new tile if there are any left - and then goes back to the stuffing loop.

    I took the opportunity to batch, double-buffer, and dma everything where appropriate, and ensure that nothing can block anything else. So if this is still a performance problem I'm out of ideas. This same controller can obviously be used if the tile topology changes or if i return to unified rasterisers+fragment processors if i am ultimately unable to improve the performance of the split one sufficiently.

    I've got a feeling that's where i'll end up.

  • I decided to create an async dma mechanism that just stores a pointer to the dma record rather than storing dma records in-line in the queue. Only needed two changes to the assembly (shift by 2 rather than 5, change an add to ldr). I haven't tested this yet but once I have confirmed it works it's likely to replace the current async dma implementation because it can add a lot of flexibility (chained, pre-calculated, scatter-gather, etc) while retaining interoperability the current api.

    I also needed a routine I could call in-line otherwise trying to call the current api from inside the rasteriser y-loop was going to spill all the registers I just spent all that code filling up with useful numbers. Since I can pre-calculate much of the DMA header into reusable blocks this should save some runtime overhead as well as I just need to update the addresses and go.

  • I changed the fragment processor protocol to take specific batched blocks rather than trying to batch smaller units across a cyclic buffer. The latter is more space efficient but gets messy when using DMA to copy the blocks in. This new design simplifies some of the logic and since I need to use DMA for any non-trivial copies makes this practical to implement.

    But I'm not entirely happy with the design so far though because to support asynchronous DMA operation I had to add a sentinal word which is written using chained DMA to notify the caller that it is ready. I use an eport to arbitrate the target location but it still needs to poll this ready indicator. This was the initial reason I needed the new async dma capability. Given that eports are probably going to need DMA more than I thought they would I might look at creating a combined "equeue" that hides some of these details and removes the need for the ready indicator (it can just update the remote head value).

Hmm, so what was again going to be a short little poke turned into a whole afternoon and now the sun is rapidly leaving this hemisphere to a crisp but cold evening. This stuff is just too interesting to put down and i've just spent another hour and a half writing this and tweaking a few things I found while writing it. Might keep going now ...

Update: Hacked into the later evening ... did some profiling. It's about half the speed of the combined by-line processor at this point. Whilst this is a very large improvement as to where it was, it's obviously not enough.

From some numbers I think the bottleneck is the rasteriser. The rasteriser routine is very simple and compiles quite well and the dma interface is about as minimal as possible so there is little possibility of improvement. It's probably just the 1:4 fan-out being too much.

Sunday 24 August 2014

then again ...

After the post yesterday I had a bit of a play around with the ideas. There are a couple of details I missed.

Firstly the current rasteriser implicitly maintains a per-primitive index of live pixels for the fragment processor. If I group them all together indexed (implicitly) by the column location then I have to somehow re-group the fragments afterwards so they can all be processed by the same inner loop to amortise the setup costs. After a couple of ideas I think this needs to be implemented by sorting the fragments by shader, then primitive, then X location. Because I want to leave as much processing time as possible for complex fragment shaders I was thinking of putting this onto the REZ cores; as they currently don't have a lot of work to do. This may be tunable depending on the shader vs geometry complexity.

Secondly if blending is not enabled/required then the primitives can be sorted by Z before they are sent to the epiphany; and this implicitly reduces most of the fragment processing to single pixels as well (depending on geometry) due to culling via the zbuffer test. i.e. all the work to split the fragment shaders from the rasterisers might not be much of a pay-off, particularly if it means losing 'free' alpha blending.

I did some testing using more stars (24x24x24) and found that proper z-order (front to back) makes a difference, but it's only something like 50%; but this is with a trivial fragment shader which isn't terribly representative.

Since time is not money here I'll give it a go anyway and see how it ends up. Now I write it down, restoring the primitive index by sorting would mean the same fragment processor could also support blending by just changing how and when the rasteriser outputs fragments; so I might be able to get the best of both.

I might also try changing the way the primitives are loaded in the mk i design: using (and/or dedicating) core 0,0 to load and distribute each band of primitives to the rendering engines to achieve (up to) a 16x bandwidth reduction of external reads should more than outweigh any wasted flops. I will also experiment with splitting the output into tiles instead of whole rows - the pathalogical case of a primitive taking up the whole row should be rare and if core 0,0 is handling the primitive index anyway i can add extra fidelity to the index without needing more memory to store it. I originally did rows because of the better/easier dma output and to reduce redundant setup costs and address calculations for the rasteriser, but its pretty much a wash on that front between the two approaches and 2D tiles might be a better fit.

Update: Had more of a poke today working on the setup and communications. I decided to go with tiles for the rendering off the bat because it allows more flexibility with memory: if i have a whole row in each core it forces a potentially excessively large fixed minimum size for various buffers throughout the pipeline - or an unreasonably narrow rendering resolution. But if I split it into tiles then the height can be adjusted if I need more memory. My first attempt is with tiles of 64x8 pixels which allows for a rendering width of 768 pixels if 12 cores are used for fragment shaders and only requires the same modest 8K for a 4-channel floating point colour buffer as the 512-pixel-width whole-row implementation.

I also decided to drop the fully deferred rendering idea for now - the cost of the sorting required in the rasteriser is putting me off. But It's something I can add later with most work required isolated in the rasteriser code.i

I'm still using the same topology as in the previous post with 3x rasterisers each feeding 4x fragment processors; the main driving factor for that split is the memory requirements of each stage and trying to have as many fragment processors in a round-number of cores as possible. The fact that it should route well though the mesh was mostly just a nice bonus. I'm just hoping at this point that this is also a reasonable work-balance fit as well. Because the rasteriser is going to be a fixed-function unit i'm trying to use as much of it's resources as possible, i'm sitting on around 27K of the RAM used total but I might be able to get that "a lot" higher with a bit of effort+luck.

So as of now I have a simple streaming protocol for the fragment shaders using an ezeport to arbitrate each individual fragment; this has a high(ish) overhead but it could be batched up by rows per processor. The primitive is fully rasterised across the 4 target tiles into a list of active fragments (x, y, w) - 8 bytes each. The w value of all are inverted together and then the fragments are streamed to the fragment processors with a bit of protocol compaction to reduce the transfer size and buffers required ('update y & prim id' message, 'render @ x' message). The work is streamed by row so interleaves across the 4x fragment processors - with enough buffer space (i.e. at least 64 fragments) should allow for some pipelining to hide latency across each 5-core rasteriser+fragment processor sub-system so long as the fragment shaders have enough work to perform.

Well that's where i'm at for the day. I haven't implemented the fragment shaders in the fragment processor or some of the global state broadcast from the controller. But having single messages to core 0,0 being exploded into a whole cascade of work across the mesh which is a pretty big step.

(It didn't quite go as smoothly as that suggests as I hit a bug in libezehost when dealing with heterogeneous workgroups which was a little frustrating till I worked out what was going on).

Update: Another day another bit of progress. Today I hooked up a fragment shader to the rasteriser and got it to render the single triangle test. At this point it's probably a bit slower than the previous code but there is more optimisation to be done.

I had to engineer a bit more of a streaming protocol between the rasteriser and the fragment shader; so I took the opportunity to batch up rows so they can be more efficiently written and read. I added some control codes in there as well for communicating other state and parameterising some of the processing.

I'm still not that happy with the way the rasteriser is forming the fragments: the actual rasterisation process is clean/simple but it has to output the fragments to a combined staging buffer across all tiles which must then be post-processed and broken into chunks for the 4x fragment processors. Having 4x tiles across makes the queue addressing calculation overly complex (in a loop of about 15 instructions almost anything is overly complex). As I am no longer doing deferred rendering without changing the current stream protocol it is possible to remove all the staging buffers from the rasteriser and just write directly to the stream buffers on the target cores; but I don't have a good solution yet (close though). Although i'm not sure what i'm going to do with the massive 16K x 3 this would free up!

Update: Oh damn. I tried rendering more than one triangle ... yeah its not good. Very slow proportional to the number of rendered (non-z-binned) fragments and at least for this workload the load balancing is also very bad - some cores render a ton of pixels and others render none due to the static scheduling. It looks like i miscalculated the rasteriser to fragment processor balance too; that 4x factor adds up very fast.

I went to my timing tester and did some off-core write tests: It seems i misunderstood the overhead of direct off-core writes from the EPUs - they seem to take a fixed (and unaccounted?) 9 cycles even if they "don't block". Yeah that's not going to cut it for this task. DMA seems to be able to get this down to about 1.7 cycles per float but the real benefit is that the epu runs independently and that easily outstrips the data generation rate. But it's going to need some bulky and hairy code to manage across multiple cores which is going to eat into any benefits. This definitely rules out a couple of ideas I had.

Hmm, maybe mk iii is closer than i thought. Perhaps just start with tiles so the output size is flexible and add some dynamic load balancing and a 2D primitive index. Perhaps group 2-4 cores together in terms of the front-end to try to deal with the primitive bandwidth issue; unless that upsets the balancing too much.

Saturday 23 August 2014

ezegpu mk i

Yesterday afternoon I started to clean up the current rasterisation code in order to dump to another point release of ezesdk. After hitting some hardware issues I found a good-enough workaround (for now) and this morning came up with a slightly more taxing/useful example for some more realistic profiling.

(imagine each is rotating on its centroid independently and all 64 are rotating around together, playfield is 512x512x32-bit)

Here's it's running on a single ARM core at about 30fps (but don't read too much into this since it isn't arm optimised). The main visible rendering artefact is a screen tear. The epiphany can only manage 43fps on this one - so as i'm adding more geometry to the scene it's performance over the arm is dropping (it's about 3x with a single star).

The loading of the primitives is becoming a bottleneck I always knew it was: I know this because if i zoom in closer the epiphany drops to 33fps but the arm chugs right down to about 15. So at least that is something I guess. OTOH I'm only uising one arm core. I can have two running with little impact on each other. Actually I had 3 outputs running at once with little impact on each other (one epiphany and two arms) which was starting to get a little bit impressive to me - combining them all together with a bit of NEON would provide a meaningful boost if they had nothing better to do.

But the problem is that currently each core runs the same code. Each row is rendered completely which involves scanning all the primitives in that band and rendering them. The sequence is essentially:

clear colour and zwbuffer
for each primitive
  for bounding box
    interpolate edge functions, z/w, 1/w
    if inside triangle and zbuffer test passes
      save new zbuffer value
      save 1/w and x location
    fi
  rof
  for saved 1/w values
    calculate reciprocal
  rof
  for saved fragments
    render fragment to colour buffer
  rof
rof
for each pixel
  scale/clamp
  output
rof

The primitives include the 3 float values for each of the 3 edge functions, the 1/w interpolator, the z/w interpolator, and the 3 colour channels: and all this data is being loaded each time through each row through each core - i.e. at least N cores per primitive (i'm using 12 to work around some stability issues and its enough to saturate the bus handily anyway) and another multiplying factor for the number of bands their bounding box crosses. With a bounding box and control word this is 136 bytes per primitive and it adds up very fast - to multiple megabytes.

I knew this was a bottleneck but I didn't (and still don't) have a feel yet for how much work a real fragment shader is going to be. But i'm pretty sure you'll be doing interesting stuff and still not hiding this.

Despite everything being on the core there is still plenty of space left, although 512 pixels is a little on the narrow side.

ezegpu mk ii

While waking up this morning I had a few ideas that might be able to address this and hope to implement in the coming days and weeks depending on motivation (i'll have some time due to another fortunate break in work). This is still just the first shot and I haven't tested any of them with real code; so as I discover problems I may need to alter the plans - although i do seem to be approaching the original ideas I had. This whole thing is a journey for me as the last time I did any "serious" 3D was using assembly language on an Amiga and it was pretty shit really. I don't have any expectations or baggage from the last 20 years of gpu progress and have no end-goal in mind (so if you're reading this and shaking your head with all the mistakes i'm making; well yes, i just don't know what i'm doing).

So these are a grab bag of ideas just off the top of my head right now and not all of them are compatible with each other.

  • Use core 0,0 as a management/controller. It is the only one which reads primitives from main memory providing a 'bandwidth multiplier' of 12x.
  • Break the primitives into two parts. The first part is the data required for the rasteriser: bounding box, edge equations and z/w equation. The 1/w equation can be created from the edge equations. This can fit into 64 bytes. The second part is for the fragment shaders and is only needed if the fragment is shaded.
  • Deferred rendering. A (primid,x,w) tuple per pixel is enough to be able to render it later. This drastically reduces how often the fragment shaders are executed - only once per pixel.
  • Deferred rendering allows the floating point "framebuffer" to be moved to registers(!).
  • Deferred rendering also reduces fragment code loading to once-per row, and varying equation loading to once per primitive per row.
  • Split the zwbuffer test and fragment generation from the rendering. This allows multiple rows to be rendered for each primitive which saves some data transfer and setup costs. The zbuffer becomes multi-row but lives in fewer cores freeing up resources elsewhere. Due to the mesh network design output of the currently fragment candidate can be written to the fragment shader cores without the need for any arbitration - at the same speed as a local write (if my understanding of the hardware is correct).

So putting most of that together this the current image forming in my head:

  +------+   +------+    +------+    +------+
  | CTRL |   | FR00 |    | FR10 |    | FR20 |
  +------+   +------+    +------+    +------+
    |||      |           |           |
  +------+   +------+    +------+    +------+
  | REZ0 |--o| FR01 |    | FR11 |    | FR21 |
  +------+   +------+    +------+    +------+
     ||      |           |           |
  +------+   +------+    +------+    +------+
  | REZ1 |-  | FR02 | --o| FR12 |    | FR22 |
  +------+   +------+    +------+    +------+
      |      |           |           |
  +------+   +------+    +------+    +------+
  | REZ2 |-  | FR03 | -- | FR13 | --o| FR23 |
  +------+   +------+    +------+    +------+

This is arranged assuming the mesh goes across rows first (i think it does) so all writes between cores should never block. REZ0 only writes to FR0x, REZ1 only writes to FR1x, etc.

CTRL
Main controller/primitive reader. This isn't actually much work and it leaves room/time for other functional blocks such as caches. It reads the primitives for each band and then copies them to the rasterisers. The bands will be indexed (or populated) in rows of 12. It could also be in charge of writing rendered pixels from the memory of the fragment shader cores to the framebuffer as an easy way to serialise (optimise) the writes.

REZ0-2
Rasteriser - edges and zwbuffer. These rasterise and perform zbuffering on 12 rows at once (4 rows each). It can send the (primid, x,1/w) tuple to the fragment processors using a single 8-byte, non-blocking, non-arbitrated(!) write. This is just splitting the first inner loop into a separate processor.

FRXY
Fragment processors. Whilst the rasteriser is populating the next row of input the fragment processor is rendering the deferred pixels. This doesn't need a floating point framebuffer since each pixel is only rendered once. Also means it doesn't need to clear it (ok; alpha blending would affect both of these but it affects the whole pipeline). The reciprocal pass will probably go here and the fact that it only needs to run once per visible pixel is another bonus from deferred rendering (although some reverse painters algorithm would also help the number of times a pixel makes it to the fragment processor in the mk i design).

The controller and fragment processors can be further pipelined internally to employ scatter-gather DMA to reduce latency effects.

This all looks pretty complicated but it should be a fairly modest amount of code - it has to be otherwise it wont fit! Actually by using deferred rendering and splitting the stuff up I will have big chunks of memory to spare; I could probably up the maximum rasterisation width to 1024 pixels although now i think about it that's too big for the memory speed. Something in the middle is more likely to be useful.

Because there are now different parts doing different things the differences in runtime of each component will start to dictate the total system performance (and hopefully not the read memory bandwidth). I don't know yet what that will be and it will depend on the rendering task and fragment shaders. If for example the fragment shaders are complicated and dominating execution time then scaling/clamping of the output, and/or reciprocal of the input could be moved elsewhere memory permitting.

Monday 18 August 2014

It lives!

Oops, wrong stride.

It lives!

I found the 1/2 hour required to hook up the epiphany rasteriser tonight.

Fun facts for that rotating double-triangular pyramid:

  • On Kaveri single-core 1000 frames takes 1.8s;
  • On Zynq ARM single-core 1000 frames takes 14.0s;
  • On epiphany-16 1000 frames takes 6.0s.

The epiphany should scale much better than the ARM, but I don't feel like poking more tonight.

Gawd i just realised that screenshot looks way too much like the damn windoze logo. Just an unfortunate coincidence as the colours were just the primaries and the background colours are supposed to be Commodore-64 like (the camera isn't picking them up very well).

The lack of any vblank interrupt in the video hardware ... well that's very uninspiring too (not that it should really come in to play, but it's the principal of the thing).

Update: Ok I had a tiny play. If I scale the model transform by 2x the times go to 2.6s, 23.5s, and 7.2s. i.e. much better scalability on the epiphany as expected.

Sunday 17 August 2014

Another rotation and scale invariant feature descriptor?

Just had an idea whilst waking up this morning so I punched out a quick couple hundred lines of code before lunch.

I guess it works?

This is just the first output, without any tuning or much mucking about.

It uses LOG to determine the scale-invariant feature locations and guassian edge detectors to determine the rotation invariance. A local binary pattern is used for the feature descriptor.

Not sure if someone else has taken the same approach (the scale/rotation variance is a standard technique), possibly worth writing this one up if not.

Worth a beer at any rate. Cheers big ears.

Saturday 16 August 2014

A weak face detector?

First, here is the raw output from one of the haar cascades included in OpenCV (frontalface-alt) using the Viola & Jones object detection algorithm. This is a 20x20 face detector which requires 2135 feature tests via 4630 individual rectangles and approximately 200KB of constant data storage at runtime assuming optimal data packing.

It is being executed over a good number of steps over range of scales and tested at every pixel location of the given scale.

To be usable as a robust face detector further post-processing on the raw hit rectangles must be performed and then further work is needed to weed out false positives such as the NASA logo or hands which would still pass this process. Together with a wise choice of search scale.

The following is the raw output from a short training run of the fodolc algorithm over the same scale range. This is a 20x20 face detector which requires exactly 400 tests of a 4 bit local binary pattern (LBP) encoded image and 800 bytes of data storage at runtime. This is using a loose threshold to give comparable results to the haar cascade and to see what sort of features give false positives.

Unlike the haar cascade the output is simply a distance. This can simplify the post-processing to a threshold and basic non-minimal suppression trough detection.

As an example the following the raw output from the same fodolc detector but with a tighter threshold.

There are still some false positives but obviously it is something of an improvement - trivial trough detection would clean it up to a point exceeding the result of the haar cascade.

This is only a weak classifier taken from just the first 15 minutes or so of classifier training on 10 000 training images extracted or synthesised solely from 880 portrait photographs. Longer training always generates a better detector. A higher quality training set should generate a better detector. Training is by a very basic genetic algorithm.

In Java on a beefy workstation the execution time is roughly the same between the two algorithms but the fodolc algorithm can be implemented in efficient SIMD or OpenCL/GPU with very little effort for significant (order of magnitude) gains.

The following is the entire code of the classifier outside of the LBP conversion and of course the classifier table itself.

// Can you copyright something so simple??
public class Classify {
    private final static short[] face = { ... };

    public int score(byte[] lbp, int stride, int xi, int yi) {
        int score = 0;
        for (int y=0,i=0;y<20;y++)
            for (int x=0;x<20;x++,i++)
                score += (face[i] >>> lbp[x+xi+(y+yi)*stride]) & 1;
        return score;
    }
}

Fast Face Detection in One Line of Code has a link to an unpublished paper with brief overview of the algorithm and local binary pattern used.

Please comment on this post if you think this is interesting. Or even if you're just as dumbfounded as I am that something so simple could possibly work.

The face detector was trained using images from the FERET database of facial images collected under the FERET program, sponsored by the DOD Counterdrug Technology Development Program Office (USA).

http://a-hackers-craic.blogspot.com.au/2014/08/a-weak-face-detector.html

Friday 15 August 2014

epiphany gpu and bits

Work was a bit too interesting this week to fit much else into my head so I didn't get much time to play with the softgpu until today.

This morning i spent a few hours just filling out a basic GLES2 style frontend (it is not going to ever be real GLES2 because of the shader compiler thing).

I had most of it for the Java SoftGPU code but I wanted to make some improvements and the translation to C always involves a bit of piss farting about fixing compile errors and runtime bugs. Each little bit isn't terribly big but it adds up to quite a collection of code and faffing about - i've got roughly 4KLOC of C and headers just to get to this point and double that in Java that I used to prototype a few times.

But as of an hour or two ago I have just enough to be able to take this code:

int main(int argc, char **argv) {
        int res;
        struct matrix4 m1, m2;

        res = fb_open("/dev/fb0");
        if (res == -1) {
                perror("Unable to open fb");
                return 1;
        }

        pglInit();
        pglSetTarget(fb_getFrameBuffer(), fb_getWidth(), fb_getHeight());

        glViewport(0, 0, 512, 512);

        glEnableVertexAttribArray(0);
        glEnableVertexAttribArray(1);

        glVertexAttribPointer(0, 4, GL_FLOAT, GL_TRUE, 0, star_vertices);
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_TRUE, 0, star_colours);

        matrix4_setFrustum(&m1, -1, 1, -1, 1, 1, 20);
        matrix4_setIdentity(&m2);
        matrix4_rotate(&m2, 45, 0, 0, 1);
        matrix4_rotate(&m2, 45, 1, 0, 0);
        matrix4_translate(&m2, 0, 0, -5);
        matrix4_multBy(&m1, &m2);
 
        glUniformMatrix4fv(0, 1, 0, m1.M);

        glDrawElements(GL_TRIANGLES, 3*8, GL_UNSIGNED_BYTE, star_indices);

        glFinish();

        fb_close();

        return 0;
}

And turn it into this:

The vertex shader will run on-host and the code for the one above is:

static void vertexShaderRGB(float *attrib[], float *var, int varStride, int count, const float *uniforms) {
        float *pos = attrib[0];
        float *col = attrib[1];

        for (int i=0;i<count;i++) {
                matrix4_transform4(&uniforms[0], pos, var);
                var[4] = col[0];
                var[5] = col[1];
                var[6] = col[2];

                var += varStride;
                pos += 4;
                col += 3;
        }
}

I'm passing the vertex arrays as individual elements in the attrib[] array: i.e. array[0] is vertex array 0 and the size matches that set by the client code. For output, var[0] to var[3] is equivalent of "glPosition" and the rest are "user set" varyings. The vertex arrays are being converted to float1/float2/float3/float4 before it being called (actually only GL_FLOAT is implemented anyway) so they are not just the raw arrays.

I'm doing it this way at present as it allows the draw commands to iterate through the arrays in the presumably long dimension of the number of elements rather than packing across the active arrays. It can also allow for NEON-efficient shaders if the data is set-up in a typical way (i.e. float4 data) and because all vertices are processed in a batch.

For glDrawElements() I implemented the obvious optimisation in that it only processes the vertices indexed by the indices array and only once per unique vertex. The processed vertices are then expanded out using the indices before being passed to the primitive assembler. So for the triangular pyramid i'm using 8 input vertices to generate 24 triangle vertices via the indices which are then passed to the primitive assembler. Even a very simple new-happy prototype of this code in my Java SoftGPU led to a 10% performance boost of the blocks-snake demo.

But I got to the point of outputting a triangle with perspective and thought i'd blog about it. Even though it isn't very much work I haven't hooked up the epiphany backend yet, i'm just a bit too bloody tired and hungry right now. For some reason spring means i'm waking up way too early, not sure why i'm so hungry after a big breakfast, and the next bit has been keeping my head busy all week ...

fodolc

I've been playing quite a bit with my object detector algorithm and I came up with a better genetic algorithm for training it - and it's really working quite well. Mostly because the previous algorithm just wasn't very good and tended to get stuck in a monoculture due to the way it pooled the total population rather than separating the generations. In some cases I'm getting better accuracy and similar robustness as the viola & jones ('haarcascade') detectors, although i haven't tested it widely..

In particular I have a 24x16 object detector which requires only 768 bytes of classifier data (total) and a couple of lines of code to evaluate (sans the local binary pattern setup which isn't much more). It can be trained in a few hours (or much faster with OpenCL/GPU) and whilst not as robust as little as 100 positive images are enough to get a usable result. The equivalent detectors in OpenCV need 300K-400K of tables - at the very least - and that's after a lot of work on packing them down. I'm not employing boosting or validation set feedback yet - mostly because I don't understand it/can't get it to work - so maybe it can be improved.

Unlike other algorithms i'm aware of every stage is parallel-efficient to the instruction level and I have a NEON implementation that classifies more than one pixel per clock cycle. I may port it to parallella, I think across the 16 cores I can beat the per-clock performance of NEON but due to it's simplicity bandwidth will be the limiting factor there (again). At least the classifier data and code can fit entirely on-core and leave a relative-ton of space for image cache. It could probably fit into the FPGA for that matter. I might not either, because I have enough to keep me busy and other unspecified reasons.

Tuesday 12 August 2014

asm vs c II

I dunno, i'm almost lost for words on this one.

typedef float float4 __attribute__((vector_size(16))) __attribute__((aligned(16)));

void mult4(float *mat, float4 * src, float4 * dst) {
 dst[0] = src[0] + mat[0];
}
notzed@minized:src$ make simd.o
arm-linux-gnueabihf-gcc -c -o simd.o simd.c -O3 -mcpu=cortex-a9 -marm -mfpu=neon
notzed@minized:$ arm-linux-gnueabihf-objdump -dr simd.o

simd.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 :
   0:   f4610aef        vld1.64  {d16-d17}, [r1 :128]
   4:   ee103b90        vmov.32  r3, d16[0]
   8:   edd07a00        vldr     s15, [r0]
   c:   e24dd010        sub      sp, sp, #16
  10:   ee063a10        vmov     s12, r3
  14:   ee303b90        vmov.32  r3, d16[1]
  18:   ee063a90        vmov     s13, r3
  1c:   ee113b90        vmov.32  r3, d17[0]
  20:   ee366a27        vadd.f32 s12, s12, s15
  24:   ee073a10        vmov     s14, r3
  28:   ee313b90        vmov.32  r3, d17[1]
  2c:   ee766aa7        vadd.f32 s13, s13, s15
  30:   ee053a90        vmov     s11, r3
  34:   ee377a27        vadd.f32 s14, s14, s15
  38:   ee757aa7        vadd.f32 s15, s11, s15
  3c:   ed8d6a00        vstr     s12, [sp]
  40:   edcd6a01        vstr     s13, [sp, #4]
  44:   ed8d7a02        vstr     s14, [sp, #8]
  48:   edcd7a03        vstr     s15, [sp, #12]
  4c:   f46d0adf        vld1.64  {d16-d17}, [sp :64]
  50:   f4420aef        vst1.64  {d16-d17}, [r2 :128]
  54:   e28dd010        add      sp, sp, #16
  58:   e12fff1e        bx       lr
notzed@minized:/export/notzed/src/raster/gl/src$ 
I thought that the store/load/store via the stack was a particularly cute bit of work, especially given the results were already in the right order and in adequately aligned registers. r3 also seems a little too popular.

I guess the vector extensions to gcc just aren't finished - or just don't work. Maybe I used the wrong flags or my build is broken. It produces similar junk code for the epiphany mind you. I've never really tried using them but after a bunch of OpenCL in the past I thought it might be worth a shot to access SIMD without machine code.

My NEON is very rusty but I think it could be something like this:

notzed@minized:src$ arm-linux-gnueabihf-objdump -dr neon-mat4.o

neon-mat4.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 :
   0:   f4a02caf        vld1.32  {d2[]-d3[]}, [r0]
   4:   f4210a8f        vld1.32  {d0-d1}, [r1]
   8:   f2000d42        vadd.f32 q0, q0, q1
   c:   f4020a8f        vst1.32  {d0-d1}, [r2]
  10:   e12fff1e        bx       lr

As can be seen from the names I started with a "simple" matrix multiply but whittled it down to something I thought the compiler could manage after seeing what it did to it - this is just a meaningless snippet.

After a pretty long day at work I was just half-heartedly poking at filling out the frontend to the epiphany gpu but just got distracted by whining at the compiler again. I should've just started with NEON, after a little poking I remembered how nice it was.

Monday 11 August 2014

And I thought I hated tooltips ...

Found on some m$ site whilst looking for how to turn off tooltips:

I've spent over 10 hours the last two days trying to solve the same problem. Apparently it is impossible to get rid of tooltips in Win7, which makes it absolutely unusable for me. I've wasted my time. I've wasted my money. Just installed Ubuntu which doesn't cost a penny and with one checkmark I get to disable all tooltips. With Microsoft I get to pay $100 for something that seems intentionally designed for maximum annoyance. Goodbye Microsoft. Last dime you ever got from me.

Had a laugh then thought fucking around in regedit wasn't worth it and so left it at that.

But yeah tooltips suck shit. If your GUI needs tooltips to be usable, it just needs bloody fixing. Icons just don't work when you've got more than a dozen or so choices (not much does for a human).

Sunday 10 August 2014

first triangle from epiphany soft-gpu

I was nearly going to leave it for the weekend but after Andreas twattered about the last post I figured i'd fill in the last little bit of work to get it running on-screen. It was a bit less work than I thought.

  • One epiphany core is a bit faster than one Zynq ARM core! 15s vs 18s (but a small amount of neon and a slightly different inner loop would make a huge difference);
  • Scaling is ok but not great at the high end, 4 cores = 5.5s, 8 cores = 4.6s, 16 cores = 3.9s;
  • The output dma isn't interlocked so it's losing about 1/2 the write performance once more than one core is active;
  • All memory and jobs are synchronous (ezesdk's async dma routines aren't working for some reason luser error on that one);
  • Scheduling is static, each core does interleaved rows;
  • Over half of the total processing time for rendering this single triangle is spent on the float4 to uint32 rgba clamping and conversion and it can't be sped up. This cost is fixed per frame, but who would have thought the humble clamp() could be the main bottleneck?
  • Total on-core .text is under 2K (could easily increase the render size to 768 pixels wide?);
  • It's all just C but I don't think significant gains are possible in assembly.

The times are for rotating the triangle around the centre of the screen for 360 degrees, one degree per frame. The active playfield is 512x512 pixels. Z buffer testing is on.

Actually the first triangle was a bit too boring, so it's a few hundred triangles later.

Update: I was just about to call it a night and I spotted a bit of testing code left in: it was always processing 1280 pixels for each triangle rather than the bounding-box. So the times are somewhat out and it's more like arm(-O2)=15.5s, epu 1x=11.5s 4x=3.9s 8x=3.1s 16x=2.4s. I also did some double-buffering and so on but before I spotted this bug but the timing is so shot it turned out to be pointless.

I did confirm that loading the primitive data is a major bottleneck however. But as a baseline the performance is a lot more interesting than it was a few hours ago.

Friday 8 August 2014

epiphany soft-gpu thoughts

I've been feeling a bit off of late so not hacking much of an evening but I did get a spare couple to poke at the soft-gpu and finally write some epiphany code.

Of course I got completely side-tracked on the optimisation side of things so I didn't get terribly far. But I solidified the plan-of-attack and sorted out some way to provide C based shader code in a way which will still get some performance. I have much of the interesting setup code done as well (although there is more uninteresting stuff, maybe I will just use java as the driver).

I've re-settled on the earlier idea of separating the rasterisation from the fragment shading but it will all run on the same core. There will be 3 loops.

  1. Rasteriser which performs in-triangle and Z/W buffer tests and generates the X coordinate and interpolated 1/W value for all to-be-rendered fragments;
  2. Reciprocaliser[sic] which inverts all the 1/W values in a batch;
  3. Fragment processor which interpolates all of the varying values and invokes the fragment shader.

This allows each loop to be optimised separately and reduces register pressure. Due to the visual similarity of some of the setup I thought there would be some duplicated calculations but there actually isn't since each is working with different values.

1 and 2 will be hard-coded as part of the platform but 3 will be compiled separately for each shader so that the shader can be compiled in-line. This is the only way to get any performance out of the C code.

The shaders will be compiled something like this:

/*
 * Shader fragment to call
 */
#define SHADER_INVOKE(colour) solid_gourad(colour, uniform, var0, var1, var2)

/*
 * An example shader - solid (interpolated) colour
 */
static inline void solid_gourad(float *colour, float *uniform, float var0, float var1, float var2) {
    colour[0] = var0;
    colour[1] = var1;
    colour[2] = var2;
    colour[3] = 1.0f;
}

/*
 * Include the actual routine to use
 */
#include "e-fragment-processor.h"
And e-fragment-processor will have a generic inner loop which will be something like:
void draw_row(... arguments) {
 ... setup
    const float var0x = v[VS_X+0];
    const float var1x = v[VS_X+1];
    const float var2x = v[VS_X+2];

    // Set start location for interpolants
    float var0_w = (var0x * fx + v[0 + VS_Y] * fy + v[0 + VS_Z]);
    float var1_w = (var1x * fx + v[1 + VS_Y] * fy + v[1 + VS_Z]);
    float var2_w = (var2x * fx + v[2 + VS_Y] * fy + v[2 + VS_Z]);
    // ... up to whatever limit I have, 16 is probably practical

    for (int i=0;i<count;i++) {
        struct fragment f = fragments[i];

        // divide by w to get interpolated value
        float var0 = (var0_w + f.x * var0x) * f.w;
        float var1 = (var1_w + f.x * var1x) * f.w;
        float var2 = (var2_w + f.x * var2x) * f.w;
        // .. etc

        // shader says how many varX's it uses so compiler automatically
        // removes any redundant calculations: so only one version of this file
        // need be created
        SHADER_INVOKE(colour + f.x * 4);
    }
}

Written this way a simple colour gourad shader is around 500 bytes or so and the inner loop is 20 instructions although not very well scheduled.

The end goal would be to have multiple shaders loaded dynamically at runtime but that sounds like too much work so i'll keep it simple and just link them in.

It's a trade-off between ease of use and performance although from some preliminary benchmarking (well, looking at what the compiler produces) I think this is about as good as the compiler is going to get. Being able to provide a programmable shader at near-optimal performance would be a nice bullet-point.

An alternative is that the shader must just implement draw_row() and the code template above is copied; this might be useful if some other hard-to-calculate value like the reciprocal is required per-pixel and it can separate that pass into a separate loop.

Memory

On memory i've decided to set the rendering size to 512 pixels. I was hoping for 1024 but that's just a bit too big to fit and a bit too much work for the memory bus besides.

  • 8192 float Colour buffer - 4x4x512
  • 2048 Z/W buffer - 4x512
  • 2048 1/W work - 4x512 (could be done in batches)
  • 2048 X work - 5x512 (could be done in batches, or use int16)
  • 2048 Frame buffer colour transfer 4x512
  • 1024 Primitive transfer buffers (at least 2).

That leaves 7K 15K (oops, out by 8k) for code and stack and some other control structures - which should be enough to do some interesting things. I decided the data needs to be transferred using DMA because the final pass only needs to scale and clamp the floating point framebuffer data to bytes: this is not enough work to prevent the output writes stalling the CPU. Having a separate buffer for the DMA allows the rest to run asynchronously. I will need to round-robin the DMA writes for greatest performance or run them via a central framebuffer controller (and/or dedicate a whole core to the job, in which case it would maintain the colour transfer buffers too).

Actually the above design does let me efficiently split the fragment shaders into separate cores too if I want because they only need to transfer (x,1/w) tuples for each fragment to render - this was my original idea. If I did that then I could probably fit a 1024-pixel row in memory too.

The bottlenecks?

The gpu will work most efficiently by processing every triangle in the scene in one pass: this allows the framebuffer to stay on-core (and in the native floating point format) which provides very high bandwidth and blending essentially free. One every primitive on that row has been rendered the local framebuffer row cache is converted to bytes and flushed out to the real framebuffer (multipass rendering would also require loading from the framebuffer first, but lets not get carried away here).

I'm intentionally not worrying about texture maps (as in, not implement anything for them). Yes they could be used but the performance hit is going to be so dire that it is not going to be desirable to use them. If they were to be used I think a separate texture fetch pass will be required before the fragment shader - so that can fire off some scatter-gather DMA and then process the results as they arrive. I think this is not going to be easy or efficient with the current DMA capabilities.

So, ... ignore that. I will need some useful noise functions so that interesting textures can be procedurally generated instead.

The epiphany to framebuffer speed is pretty low, but that's fixed: there's nothing I can do about that, so no use wasting time crying over spilt milk on that one.

So, ... ignore that too.

I think the main bottleneck will be the transfer of the primitives - because they will all have to be loaded for each row. I will add some input indexing mechanism to separate them into bands so the loading of out-of-range primitives is reduced but fully indexing every row would be costly. If I can work out how to get the broadcast DMA to work (if indeed, it does actually work) then that may help alleviate some of the bandwidth requirements although it comes at a cost of forcing all rasterisers to operate in lock-step across the same band of framebuffer - which might be worse.

I may be completely off on this though - I really gotta just code this up and see how it works.

Deferred Rendering

Actually just to get way ahead of myself here; another alternative is a type of deferred rendering. Rather than keep track of the colour buffer it could just keep of (triangle id, x, 1/w) for each visible pixel. Once it's finished it could then just process the visible pixels - at most once per pixel.

This could be implemented by splitting the triangle primitive into two parts - first the bounding box, edge and z/w and 1/w interpolation equations, and the second being the varying equations. Each pass only needs that set of data - so it could reduce bandwidth requirements too.

Blending is more difficult. With it on every visible triangle would need to be rendered immediately and any previously rendered triangles waiting in the deferred buffer would need to be flushed.

Something to defer till later I guess (ho ho).

Starting JavaFX from random Java code

I write a lot of prototype routines - too many to keep track of in separate projects so I end up with a ton of mains(). Best practice? Who gives a shit: its a prototype, played with for a day or a week and then forgotten.

So far for graphical output i've just been using Swing: actually there's probably not much reason not to use it for that because it does the job but once you need to add some interactivity it becomes a bit of a pain if you've been playing with JavaFX. I might add a 'display intermediate image' anywhere in the code and up it comes.

But JavaFX doesn't let you just call Platform.runLater() or new Stage() from anywhere as with Swing: the system needs initialising within an Application context.

Here's a solution. I have no claims it's a good one but it works for me so far.

// This code is placed in the public domain
public class FXUtils {

    static FXApplication app;
    static Semaphore sem = new Semaphore(0);

    public static void startFX(Runnable r) {
        if (app == null) {
            try {
                Thread t = new Thread(() -> {
                    FXApplication.start(r);
                });
                t.start();

                sem.acquire();
            } catch (InterruptedException ex) {
            }
        } else {
            Platform.runLater(r);
        }
    }

    public static class FXApplication extends Application {

        WritableImage image;
        static Runnable run;

        public FXApplication() {
        }

        @Override
        public void start(Stage stage) throws Exception {
            app = this;
            run.run();
            sem.release();
        }

        public static void start(Runnable r) {
            run = r;
            // Application.launch() can only be called from a static
            // method from a class that extends Application
            Application.launch();
            // no windows - no app!
            System.exit(0);
        }
    }
}

Whether start() calls System.exit() or not is up to you - personally when I close a window i'm prototyping stuff on I want everything else to fuck off for good.

And this is how to use it:

    public static void main(String[] args) {
        FXApplication.start(() -> {
            // Now on javafx thread
            Stage s = new Stage();

            s.setScene(new Scene(new VBox(new Label("foobar!"))));
            
            s.show();
        });

        // Will wait for javafx to start, but then continue here
        // exiting will leave the windows open, till they're closed
    }

This uses a thread to launch javafx so that the original main thread can continue; Application.launch() doesn't return until the last window is closed so would otherwise block. The thread could be made a daemon too for some different behaviours.

If you just want to launch a full JavaFX application from multiple mains then none of this is required, just create a static start() method which calls Application.launch().

lambdas & streams

As part of the experiment with the histogram equalisation stuff I started writing a utility library for playing with images for prototyping code.

One thing I was curious about was whether I could use streams to simplify the prototyping by saving having to type and retype the typical processing loop:

for (int y = 0; y < image.height; y++) {
    for (int x = 0; x < image.width; x++) {
        do something;
    }
}

When i'm prototyping stuff I type this in ... a lot.

I had mixed results.

Because I wanted to support arbitrary 2d subregions of an image which might be a mapping of an arbitrary 2d subregion I had to create my own 'spliterator' to do the work. After a couple of aborted attempts I came up with one that just turns the widthxheight range into a linear stream and then maps that to the local (x,y) when retrieving the pixel values (i tried to avoid the divide first, but made a pigs breakfast of the maths).

It lets me write something like this to calculate the histogram over a sub-range of an image:

  Image2D img;
  byte[] hist = new byte[256];

  img.bytes(0, 0, 16, 16).forEach((v) -> > {
    hist[v] += 1;
  });
Ok, so far so good. It's not necessarily the best way to do it - it can't be parallelised for instance, but this is fine, it saves a few keystrokes and it lets one access a whole bunch of stream functionality "for free".

The problem is with images you normally want to write to them or modify them. So you're back to just using a loop, or maybe a custom foreach which supplies coordinates to a lambda function: again this is fine but then you don't get any of the stream functionality for free here (although as in the next section: it's good enough?). You could just use an IntStream, ... but that doesn't really save any typing over a for loop.

Staying within the confines of the existing IntStream type for the sake of argument, the solution is a little clumsy. One first has to create a class which implements the functions required to be used as a collector.

    static class ByteArray {
        byte[] data;

        public void add(int b);
        public void addAll(ByteArray b);
    }
With that in place it can be used to collect the results of the calculation. In this case performing the pixel mapping from one set of intensity values to another.
  byte[] pixels = img.bytes()
                     .map((int v) -> map[operand])
                     .collect(ByteArray::new, ByteArray::add, ByteArray::addAll)
                     .data;

  Image2D dst = new ByteImage(src.width, src.height, pixels);

This can run in parallel: the downside is that each stage needs to allocate its own buffers and then allocate copies of these up to the final result. Probably works but yeah, it's not that pretty or efficient.

Indexed Stream

So I thought about it a little and perhaps a solution is to create another type of stream which indexes over the values. Some of the api usage gets a bit fatter if you want to use some of the basic stream facilities like sums and so on: but that's what .map() is for. I think it can get away without having to allocate the indexing object for each iteration: it is only needed when the stream range is split.

  class IndexedInt {
    int value;
    int x;
    int y;
  }

  dst = new ByteImage(src.width, src.height);
  img.bytes().forEach((ii) -> {
    dst.set(ii.x, ii.y, ii.value);
  });

I dunno, I suppose that's better than a double-for-loop, once the not-insignificant scaffolding is in place.

Actually; why bother even passing the value in this case, it may as well just be calculating indices. It doesn't really make any difference to the code and having a general purpose 2D indexer is somewhat useful.

  class Index2D {
    int x;
    int y;
  }

  dst = new ByteImage(src.width, src.height);
  Index2D.range(0, 0, src.width, src.height)
         .forEach((ii) -> {
           dst.set(ii.x, ii.y, src.get(ii.x, ii.y));
         });

Some of the functionality is a little less concise but the simplicity of the above is probably worth it.

  double average = Index2D.range(0, 0, src.width, src.height)
                      .mapToInt((ii) -> img.get(ii.x, ii.y))
                      .average()
                      .getAsDouble();

Much of that could be hidden in helper functions and the external interface could remain an IntStream, for cases where the pixel locations are not required.

Seems like a lot of work just to get a free parallelisable 'sum' function though? The implementing classes still need a bunch of boilerplate/helpers and they could have just implemented most of that themselves. I don't find the forkJoin() approach to paralellisation (which is used by the streams code) to be very efficient either.

But this is my first real look at it : experiments ongoing.

Parallel histogram

I mentioned earlier that the histogram calculation using a forEach isn't paralleisable as is (one could add a synchronized block inside the loop but one would have to be both naive and stupid to do so).

It can be parallelised using a collector. TBH it's a lot of boilerplate for such a simple function but the algorithm is identical to the one you would use in OpenCL even if it doesn't look the same.

First, one needs the class to hold the results and intermediate results.

class Histogram {

    int[] hist;

    public Histogram() {
        hist = new int[256];
    }

    public void add(int value) {
        hist[value] += 1;
    }

    public void addHistogram(Histogram o) {
        for (int i = 0; i < hist.length; i++)
            hist[i] += o.hist[i];
        }
}

And then the code:

  int[] hist;
  Image2D img;

  hist = img.bytes().parallel()
            .collect(Histogram::new, Histogram::add, Histogram::addHistogram)
            .hist;

*shrug*. I guess it saves some typing?

tunable histogram equalisation

So this is a rather simple but quite effective improvement to the basic histogram equalisation operation for automatic image correction.

I got the main idea from a paper: ``A Modified Histogram Equalization for Contrast Enhancement Preserving the Small Parts in Images''. Bit of a mouthful for what is just a bounded histogram.

I also made made another small modification which makes it tunable. Allowing for a fairly smooth range from what should be the same as 'normalise' in paint programs, up to the fully sun-seared over-exposed normal result from histogram equalisation.

Here's an example of the range of output.

The value is the proportion of the mean to which the histogram input is limited: a value of 0.0 should be equivalent to a contrast stretch or normalise operation, 1.0 matches the paper, and some large number (depending on the input, but approximately greater than 2) will be the same as a basic histogram equalisation.

One has to look closely with this particular image because they are already fairly balanced but a smooth range of 'enhancement' should be apparent.

I also played with a colour version which applies the histogram to the Y channel of a YUV image.

As of now it does tend to upset the colour balance a little bit and tends toward a metallic effect; but it's arguable better than what the gimp's equalise does to colour images.

Yikes. Although the image on the right is arguably more agreeable - the colour is very different from the source. Histogram equalising each component separately is effectively applying a white-balance correction where the colour temperature is some sort of flat grey: this sometimes works ok but it messes with the colour balance by definition.

I have some thoughts on applying the algorithm to floating point values using polynomial curve fitting, but I haven't tried them out yet. This would be to prevent binning from quantising the output.

For such a simple adjustment to the algorithm it's quite a good result - histogram equalisation is usually too harsh to use for much on it's own.

Thursday 7 August 2014

On my god, it's full of local binary patterns!

I recently had need to look into feature descriptors. I've previously played with SURF and looked into others but I wasn't really happy with the complexity of the generator and needed something Java anyway.

A small search turned up FREAK (why the silly 'catchy' acronym names? Maybe it started with S-USANs?) which looked doable so I had a bit of a play. There is code available but it's OpenCV and C++ and the version I saw just wasn't very good code. I wrote up my own because I had some different needs for what I was looking at and porting the C++/OpenCV was going to be a pain.

I guess they work as advertised but i'm not sure they're what I want right now. I tried porting the AGAST detector as well but it wasn't really getting what I was after - i'm after specific features not just 'good features'.

The paper does include this interesting diagram though:

Although the paper doesn't reference them this diagram is pretty much a description of local binary patterns. The FREAK descriptor itself is just a very long local binary pattern with optional orientation normalisation.

Perhaps more interestingly is that this specific diagram could just as well be a description for how my fast object detector works. It is effectively a direct implementation of this diagram.

Assuming the above diagram is representative of human vision I guess one could say that the whole of visual reality is made of local binary patterns.

Sunday 3 August 2014

bummer, that bandwidth thing again

I did some profiling by clearing the framebuffer directly from the epiphany.

Short summary:

  • Using dma takes about 20% longer than using the cpu (this is useful though because it can run asynchronously to the cpu).
  • Using int writes is 1/2 the speed of long writes (but this is a known feature of the design).
  • Writing sequential addresses is about 2x faster than not (again: known).
  • Writing with one core is 2x faster than with 1 cores (due to last point, known again).
  • Trying to get the compiler to do a long write is a pain. volatile seems to do it for this case.
  • Using hardware loops was nearly 10% faster than not, but you need to do 16 instructions (for easier loop count setup) in the loop which makes it too bulky. I don't understand why this is because I can add a couple of nops before it makes any difference to the execution time; must be something to do with the write-to-mesh pipeline mechanism.
  • A simple C loop on the ARM writing to the memory-mapped framebuffer using int is about 5x faster than the epiphany.

My screen is 1280x1024, 24-bit - don't know if that can be configured to fewer bits as i have no serial console and that's just how it starts up (at least it's not widescreen).

I know it's not the case but it appears as if the memory transfers are somehow synchronised with the framebuffer DMA. It's only getting about 60 fps. At any rate, they're running at the same speed.

that sucked a bit

I guess i should've known when i woke up feeling underslept with a headache. It was a dreadfully cold and stormy day so I ... well hacked again.

I had intended to just get a triangle out of an epiphany core, but it just didn't end up happening. I had to muck around getting the rev1 to a 'working' state which took surprisingly long. There are a lot of weird shitty changes to ubuntu (i mean, runlevels, why the fuck would anyone want those???) that took me a while to wade through. I did run the C code I have which outputs to the framebuffer, which worked but is a bit slow. I did seem to have weird issues with USB but a reboot more or less fixed that, with it running through a powered hub. nfs was super-slow till i set it to nfs3 mode. Apparently that's a thing that happens. I also ran one of the ezesdk tests and that worked ... i wasn't sure if that would.

And god, what the fuck have they done to the gtk+ version of emacs? It's like a really ugly version of notepad. I wish I had have found the emacs-lucid packages an hour earlier, would've saved my throat a good deal of violence. But I guess at some point the lucid version wont work anymore, I might have to find another editor (yeah it's that bad).

And what's with all the config/startup shit in debian and ubuntu? Run one tool and it gives a completely obtuse but apparently deep and meaningful message about using another tool, the other tool doesn't know what the fuck's going on and in the end you just edit a file by hand? Why is there even more than one? apt-get/dpkg/whatever else is bad enough. What sort of genious thought that "update-rc.d" was a nice name for a command anyone might ever want to run, ever? Trying to find solutions using a search engine is becoming pointless: it's the blind leading the blind and everything is years out of date. Try finding out how to disable screen blanking on the console for example?

This worked for me:

  echo "setterm -blank 0 -powersave off -powerdown 0" >> /etc/rc.local

Net"work"Manager was still spewing pointless shit to the logs whilst trying to "dynamically manage" a fixed ethernet cable ... so fuck that right off. Although i wish more shit actually logged happenings: it seems almost nothing logs when anything goes wrong. I don't see how dedicating a whole virtual terminal to the almost completely information-free "boot.log" is of use to anyone. The packaging system seems to have turned into a sort of enterprise configuration management tool: and you know what, they suck. They're slow and cumbersome and buggy as all shit, and we know because it's linux some doe-eyed fool will come along in a year or two with a new and even more broken 'fix' for all the brokenness.

I can't believe after 20 years of this shit ... it's now way more broken than how it started. At least back then the problems were restricted to hardware support. Now that's fantastic the software has all been fucked up by people poking their noses into places they have no fucking business being.

And i'm still underslept with a headache, with added fun of cold and hungry.

Rasterisers

After the last post I kind of remembered one reason to split the work across cores: calculating the reciprocal of 1/w is somewhat expensive unless it can be batched up.

So I was up way too late last night just trying different snippets of code to address that. I think I will go the branchless loop thing that performs the z-buffer test and in-triangle tests separately and then outputs a compact set of coordinates. The compiler was doing some funky stuff but I got some hand-rolled code down to like 10 cycles per pixel (and that could include the 1/w interpolation too); the only real problem with that being the memory required for the output :-/

A separate loop can then just calculate 1/(1/w) to a table (at something like 16 cycles per pixel), and the final loop can then interpolate all the varying values without having to many any decisions about which are live pixels. Without this kind of split there isn't enough registers to keep everything in registers within the inner loops.

Because of the memory it may have to do all this in several batches - slivers of 64 pixels each or somesuch.

Hello Triangle

But I kinda gave up after today and just worked on a "simple as possible" Java "gpu" to try and have something positive to hang onto after a miserable day (and i started before I got nfs fixed). I needed something which is the distilled/captured knowledge of what I know "so far' as a training simulator. There's still some stuff I need to work out wrt the 3d maths and it's just easier playing with some simple code to do it.

This for example is the code which generates the typical hello world example from OpenGL:

float[] vertices = {
        -0.75f, -0.75f, 0, 1,
        0.75f, 0, 0, 1,
        -0.75f, 0.75f, 0, 1,};

void helloTriangle() {
        Viewport vp = new Viewport(0, 0, width, height);
        PrimitiveTriangle tt = new PrimitiveTriangle();

        tt.setup(vp, 0, vertices);

        // red, green, blue
        tt.setVarying(0, 1, 0, 0);
        tt.setVarying(1, 0, 1, 0);
        tt.setVarying(2, 0, 0, 1);
 
        float uniformA = 1.0f;

        tt.draw(pbuffer, zbuffer, width, (float[] varying, float[] pixels, int x) -> {
                        pixels[x + 0] = varying[0];
                        pixels[x + 1] = varying[1];
                        pixels[x + 2] = varying[2];
                        pixels[x + 3] = uniformA;
                });
}

This is functionally equivalent to the low-level part of a gpu driver/hardware after the vertex shader (more or less).

Here the lambda expression is the fragment shader. The main problem with using Java as a fragment shader language is how ugly vector/matrix/complex maths ends up being when you need to use flat arrays for efficiency.

Right now this isn't really much more than a training tool and intellectual curiosity, but it's food for thought that computer systems (cpu+gpu+other) and compiler technology is explicitly working toward a point where code such as the above would be the way you "program" graphics drivers. And it would run just as fast as any other driver software. There will probably still be a need for some fixed-function units but these could also be encapsulated as co-processors. The reason this would be possible now when it wasn't previously is due to the technology that makes HSA possible.

A Saturday passes ...

I had a lot of trouble with the matrices and some with the triangle direction - as is usually the case with 3d maths. After playing with some opengl3 tutorials I got it enough worked out to get somewhere. I also played with framebuffer, javafx output, and parallel streams for the tile rendering. Oh and using fixed-point for the triangle edge calculations, which fix some rare edge cases with the edges and might be easier to optimise. And trying to optimise the reciprocal calculation and changing back to the fglrx driver as side-missions (so i could run the gl3 examples - for whatever reason mesa wasn't doing the job, except i forgot which kernel is the one I need to use and the one that mostly works causes some nasty bugs in X). Well you know, a ton of stuff - i lost track of time and suddenly it was 5am.

I should really add some lighting but it's quite mesmerising in full-frame-rate motion all the same. Ok result for a week of late nights piss-farting about.

Still no epiphany code; next perhaps?