Friday, 15 August 2014

epiphany gpu and bits

Work was a bit too interesting this week to fit much else into my head so I didn't get much time to play with the softgpu until today.

This morning i spent a few hours just filling out a basic GLES2 style frontend (it is not going to ever be real GLES2 because of the shader compiler thing).

I had most of it for the Java SoftGPU code but I wanted to make some improvements and the translation to C always involves a bit of piss farting about fixing compile errors and runtime bugs. Each little bit isn't terribly big but it adds up to quite a collection of code and faffing about - i've got roughly 4KLOC of C and headers just to get to this point and double that in Java that I used to prototype a few times.

But as of an hour or two ago I have just enough to be able to take this code:

int main(int argc, char **argv) {
        int res;
        struct matrix4 m1, m2;

        res = fb_open("/dev/fb0");
        if (res == -1) {
                perror("Unable to open fb");
                return 1;
        }

        pglSetTarget(fb_getFrameBuffer(), fb_getWidth(), fb_getHeight());

        glViewport(0, 0, 512, 512);


        glVertexAttribPointer(0, 4, GL_FLOAT, GL_TRUE, 0, star_vertices);
        glVertexAttribPointer(1, 3, GL_FLOAT, GL_TRUE, 0, star_colours);

        matrix4_setFrustum(&m1, -1, 1, -1, 1, 1, 20);
        matrix4_rotate(&m2, 45, 0, 0, 1);
        matrix4_rotate(&m2, 45, 1, 0, 0);
        matrix4_translate(&m2, 0, 0, -5);
        matrix4_multBy(&m1, &m2);
        glUniformMatrix4fv(0, 1, 0, m1.M);

        glDrawElements(GL_TRIANGLES, 3*8, GL_UNSIGNED_BYTE, star_indices);



        return 0;
}

And turn it into this:

The vertex shader will run on-host and the code for the one above is:

static void vertexShaderRGB(float *attrib[], float *var, int varStride, int count, const float *uniforms) {
        float *pos = attrib[0];
        float *col = attrib[1];

        for (int i=0;i<count;i++) {
                matrix4_transform4(&uniforms[0], pos, var);
                var[4] = col[0];
                var[5] = col[1];
                var[6] = col[2];

                var += varStride;
                pos += 4;
                col += 3;
        }
}

I'm passing the vertex arrays as individual elements in the attrib[] array: i.e. attrib[0] is vertex array 0 and the size matches that set by the client code. For output, var[0] to var[3] is the equivalent of "gl_Position" and the rest are "user set" varyings. The vertex arrays are converted to float1/float2/float3/float4 before the shader is called (actually only GL_FLOAT is implemented anyway) so they are not just the raw arrays.
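To make that calling convention concrete, here is a minimal sketch of how the host side might drive such a shader. It is not the actual SoftGPU code: the body of matrix4_transform4() is an assumption (a column-major 4x4 matrix times a float4, the usual GL layout), and run_vertex_stage() is a hypothetical helper added only for illustration.

```c
#include <assert.h>

/* Assumption: m is a column-major 4x4 matrix, v and out are float4. */
static void matrix4_transform4(const float *m, const float *v, float *out) {
        for (int r = 0; r < 4; r++)
                out[r] = m[r] * v[0] + m[4 + r] * v[1] + m[8 + r] * v[2] + m[12 + r] * v[3];
}

/* Same shape as the shader in the post: attrib[0] is float4 positions,
 * attrib[1] is float3 colours, var[0..3] is the position output and
 * var[4..6] are the colour varyings. */
static void vertexShaderRGB(float *attrib[], float *var, int varStride, int count, const float *uniforms) {
        const float *pos = attrib[0];
        const float *col = attrib[1];

        assert(varStride >= 7); /* 4 position floats + 3 colour floats */

        for (int i = 0; i < count; i++) {
                matrix4_transform4(&uniforms[0], pos, var);
                var[4] = col[0];
                var[5] = col[1];
                var[6] = col[2];

                var += varStride;
                pos += 4;
                col += 3;
        }
}

/* Hypothetical driver: gathers the per-attribute arrays into attrib[]
 * and shades the whole batch in one call. */
static void run_vertex_stage(float *positions, float *colours, int count,
                             const float *matrix, float *var, int varStride) {
        float *attrib[2] = { positions, colours };

        vertexShaderRGB(attrib, var, varStride, count, matrix);
}
```

Because each attribute arrives as its own tightly packed float array, the loop walks every array linearly, which is what makes the batch friendly to SIMD.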

I'm doing it this way at present as it allows the draw commands to iterate through the arrays in the presumably long dimension of the number of elements rather than packing across the active arrays. It can also allow for NEON-efficient shaders if the data is set up in a typical way (i.e. float4 data) and because all vertices are processed in a batch.

For glDrawElements() I implemented the obvious optimisation in that it only processes the vertices indexed by the indices array and only once per unique vertex. The processed vertices are then expanded out using the indices before being passed to the primitive assembler. So for the triangular pyramid i'm using 8 input vertices to generate 24 triangle vertices via the indices which are then passed to the primitive assembler. Even a very simple new-happy prototype of this code in my Java SoftGPU led to a 10% performance boost of the blocks-snake demo.
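The shade-once-then-expand idea can be sketched roughly as below. This is only an illustration under assumptions, not the real implementation: the vertex_fn callback type, shade_double2() and draw_elements_sketch() are hypothetical names, and the caller supplies all scratch buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-vertex shader callback used only by this sketch. */
typedef void (*vertex_fn)(const float *in, float *out);

/* Example shader: doubles a 2-component position. */
static void shade_double2(const float *in, float *out) {
        out[0] = in[0] * 2.0f;
        out[1] = in[1] * 2.0f;
}

/* Shade every vertex referenced by indices[] exactly once, then expand the
 * results in index order ready for primitive assembly.
 * used:     nverts bytes of scratch
 * shaded:   nverts * out_size floats of scratch
 * expanded: nindices * out_size floats of output
 * Returns the number of shader invocations. */
static size_t draw_elements_sketch(const unsigned char *indices, size_t nindices,
                                   const float *verts, size_t nverts, size_t in_size,
                                   vertex_fn shader, size_t out_size,
                                   unsigned char *used, float *shaded, float *expanded) {
        size_t shaded_count = 0;

        /* Pass 1: mark which vertices are actually referenced. */
        memset(used, 0, nverts);
        for (size_t i = 0; i < nindices; i++) {
                assert(indices[i] < nverts);
                used[indices[i]] = 1;
        }

        /* Pass 2: run the shader once per referenced vertex. */
        for (size_t v = 0; v < nverts; v++) {
                if (used[v]) {
                        shader(verts + v * in_size, shaded + v * out_size);
                        shaded_count++;
                }
        }

        /* Pass 3: expand the shaded vertices out through the index array. */
        for (size_t i = 0; i < nindices; i++)
                memcpy(expanded + i * out_size, shaded + indices[i] * out_size,
                       out_size * sizeof(float));

        return shaded_count;
}
```

With two triangles sharing an edge (6 indices over 4 distinct vertices) the shader runs 4 times rather than 6, and vertices that are never indexed are skipped entirely.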

But I got to the point of outputting a triangle with perspective and thought i'd blog about it. Even though it isn't very much work I haven't hooked up the epiphany backend yet, i'm just a bit too bloody tired and hungry right now. For some reason spring means i'm waking up way too early, not sure why i'm so hungry after a big breakfast, and the next bit has been keeping my head busy all week ...


I've been playing quite a bit with my object detector algorithm and I came up with a better genetic algorithm for training it - and it's really working quite well. Mostly because the previous algorithm just wasn't very good and tended to get stuck in a monoculture due to the way it pooled the total population rather than separating the generations. In some cases I'm getting better accuracy than, and similar robustness to, the Viola & Jones ('haarcascade') detectors, although i haven't tested it widely.
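For readers unfamiliar with the distinction, a generational scheme builds each new population only from the previous one instead of merging parents and children into one pool. The toy below is a generic sketch of that idea and not the detector trainer: the 32-bit genome, the bit-counting fitness, the tournament selection, the single carried-over elite and every function name are assumptions chosen to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POP_SIZE 32

/* Small deterministic xorshift32 generator so runs are repeatable. */
static uint32_t rng_state = 12345u;
static uint32_t rng_next(void) {
        rng_state ^= rng_state << 13;
        rng_state ^= rng_state >> 17;
        rng_state ^= rng_state << 5;
        return rng_state;
}

/* Toy fitness: number of set bits in the genome. */
static int fitness(uint32_t g) {
        int n = 0;
        while (g) { g &= g - 1; n++; }
        return n;
}

static int best_index(const uint32_t *pop) {
        int best = 0;
        for (int i = 1; i < POP_SIZE; i++)
                if (fitness(pop[i]) > fitness(pop[best]))
                        best = i;
        return best;
}

/* Pick the fitter of two random members of the parent generation. */
static uint32_t tournament(const uint32_t *pop) {
        uint32_t a = pop[rng_next() % POP_SIZE];
        uint32_t b = pop[rng_next() % POP_SIZE];
        return fitness(a) >= fitness(b) ? a : b;
}

/* Build children[] purely from parents[]; apart from one elite, no parent
 * survives into the next generation. */
static void next_generation(const uint32_t *parents, uint32_t *children) {
        children[0] = parents[best_index(parents)];
        for (int i = 1; i < POP_SIZE; i++) {
                uint32_t a = tournament(parents), b = tournament(parents);
                uint32_t mask = rng_next();
                uint32_t child = (a & mask) | (b & ~mask); /* uniform crossover */
                if (rng_next() % 4 == 0)
                        child ^= 1u << (rng_next() % 32);  /* occasional bit flip */
                children[i] = child;
        }
}

/* Run a number of generations in place and return the best fitness found. */
static int evolve(uint32_t *pop, int generations) {
        uint32_t scratch[POP_SIZE];
        uint32_t *cur = pop, *next = scratch;

        for (int g = 0; g < generations; g++) {
                next_generation(cur, next);
                uint32_t *t = cur; cur = next; next = t;
        }
        if (cur != pop)
                memcpy(pop, cur, sizeof scratch);
        return fitness(pop[best_index(pop)]);
}
```

Keeping a single elite means the best score never regresses, while replacing everything else each generation stops a few early winners from dominating a shared pool.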

In particular I have a 24x16 object detector which requires only 768 bytes of classifier data (total) and a couple of lines of code to evaluate (sans the local binary pattern setup which isn't much more). It can be trained in a few hours (or much faster with OpenCL/GPU) and, whilst the result is not as robust, as few as 100 positive images are enough to get a usable result. The equivalent detectors in OpenCV need 300K-400K of tables - at the very least - and that's after a lot of work on packing them down. I'm not employing boosting or validation set feedback yet - mostly because I don't understand it/can't get it to work - so maybe it can be improved.
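As background on the local binary pattern setup, the sketch below computes the basic 8-neighbour LBP code for each interior pixel. The post does not say which LBP variant or bit ordering the detector uses, nor how the 768 bytes (2 bytes per pixel of a 24x16 window) are laid out, so the neighbour order, the >= comparison and the function names here are all assumptions.

```c
#include <assert.h>

/* Basic 3x3 LBP: bit i is set when neighbour i is >= the centre pixel.
 * Neighbours are visited clockwise starting at the top-left (assumed order).
 * The caller must ensure (x, y) is not on the image border. */
static unsigned char lbp8(const unsigned char *img, int stride, int x, int y) {
        static const int dx[8] = {-1, 0, 1, 1, 1, 0, -1, -1};
        static const int dy[8] = {-1, -1, -1, 0, 1, 1, 1, 0};
        const unsigned char centre = img[y * stride + x];
        unsigned char code = 0;

        for (int i = 0; i < 8; i++) {
                if (img[(y + dy[i]) * stride + (x + dx[i])] >= centre)
                        code |= (unsigned char)(1u << i);
        }
        return code;
}

/* Compute LBP codes for every interior pixel; border codes are left at 0. */
static void lbp_image(const unsigned char *img, int width, int height, unsigned char *codes) {
        assert(width >= 3 && height >= 3);

        for (int i = 0; i < width * height; i++)
                codes[i] = 0;
        for (int y = 1; y < height - 1; y++)
                for (int x = 1; x < width - 1; x++)
                        codes[y * width + x] = lbp8(img, width, x, y);
}
```

Each code depends only on a pixel and its immediate neighbours, so the whole pass is independent per pixel and maps naturally onto SIMD or multiple cores.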

Unlike other algorithms i'm aware of, every stage is parallel-efficient to the instruction level and I have a NEON implementation that classifies more than one pixel per clock cycle. I may port it to parallella; I think across the 16 cores I can beat the per-clock performance of NEON but due to its simplicity bandwidth will be the limiting factor there (again). At least the classifier data and code can fit entirely on-core and leave a relative-ton of space for image cache. It could probably fit into the FPGA for that matter. I might not either, because I have enough to keep me busy and other unspecified reasons.
