Friday, 8 August 2014

epiphany soft-gpu thoughts

I've been feeling a bit off of late so not hacking much of an evening but I did get a spare couple to poke at the soft-gpu and finally write some epiphany code.

Of course I got completely side-tracked on the optimisation side of things so I didn't get terribly far. But I solidified the plan-of-attack and sorted out a way to provide C-based shader code which will still get some performance. I have much of the interesting setup code done as well (although there is more uninteresting stuff to go; maybe I will just use Java as the driver).

I've re-settled on the earlier idea of separating the rasterisation from the fragment shading but it will all run on the same core. There will be 3 loops.

  1. Rasteriser which performs in-triangle and Z/W buffer tests and generates the X coordinate and interpolated 1/W value for all to-be-rendered fragments;
  2. Reciprocaliser[sic] which inverts all the 1/W values in a batch;
  3. Fragment processor which interpolates all of the varying values and invokes the fragment shader.

This allows each loop to be optimised separately and reduces register pressure. Because some of the setup looks so similar I thought there would be some duplicated calculations, but there actually aren't any since each loop is working with different values.
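To make the split concrete, here is a minimal sketch of the three loops for a single triangle on a single row. Everything in it is an assumption for illustration: the `struct fragment` layout, the `struct line_eq` helper, the function names, and the simplification that the depth test is done on 1/W alone (larger meaning closer). It interpolates only one varying and writes it straight out rather than invoking a shader.

```c
#include <stddef.h>

/* Hypothetical fragment record: x comes from the rasteriser, w holds 1/W
 * after loop 1 and W after loop 2. */
struct fragment {
    int x;
    float w;
};

/* Hypothetical linear equation a*x + b, already evaluated for this row's y. */
struct line_eq {
    float a, b;
};

/* Loop 1: in-triangle test plus depth test, emitting (x, 1/W) per survivor.
 * Assumption: depth is tested on 1/W alone, larger meaning closer. */
static size_t rasterise_row(const struct line_eq edge[3], struct line_eq iw,
                            float *depth, int width, struct fragment *out)
{
    size_t n = 0;

    for (int x = 0; x < width; x++) {
        float e0 = edge[0].a * x + edge[0].b;
        float e1 = edge[1].a * x + edge[1].b;
        float e2 = edge[2].a * x + edge[2].b;
        float w = iw.a * x + iw.b;

        if (e0 >= 0.0f && e1 >= 0.0f && e2 >= 0.0f && w > depth[x]) {
            depth[x] = w;
            out[n].x = x;
            out[n].w = w;
            n++;
        }
    }
    return n;
}

/* Loop 2: invert every 1/W in the batch, nothing else. */
static void reciprocal_batch(struct fragment *f, size_t n)
{
    for (size_t i = 0; i < n; i++)
        f[i].w = 1.0f / f[i].w;
}

/* Loop 3: interpolate one varying (stored pre-divided by W) and recover
 * its perspective-correct value by multiplying by W. */
static void shade_batch(const struct fragment *f, size_t n,
                        struct line_eq var_w, float *out)
{
    for (size_t i = 0; i < n; i++)
        out[f[i].x] = (var_w.a * f[i].x + var_w.b) * f[i].w;
}
```

Each loop touches a different set of equations (edges and 1/W, then nothing but the reciprocal, then the varyings), which is why the register pressure drops compared with doing it all in one pass.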

1 and 2 will be hard-coded as part of the platform but 3 will be compiled separately for each shader so that the shader can be compiled in-line. This is the only way to get any performance out of the C code.

The shaders will be compiled something like this:

/* Shader fragment to call */
#define SHADER_INVOKE(colour) solid_gourad(colour, uniform, var0, var1, var2)

/* An example shader - solid (interpolated) colour */
static inline void solid_gourad(float *colour, float *uniform, float var0, float var1, float var2) {
    colour[0] = var0;
    colour[1] = var1;
    colour[2] = var2;
    colour[3] = 1.0f;
}

/* Include the actual routine to use */
#include "e-fragment-processor.h"

And e-fragment-processor will have a generic inner loop which will be something like:

void draw_row(... arguments) {
    // ... setup
    const float var0x = v[VS_X+0];
    const float var1x = v[VS_X+1];
    const float var2x = v[VS_X+2];

    // Set start location for interpolants
    float var0_w = (var0x * fx + v[0 + VS_Y] * fy + v[0 + VS_Z]);
    float var1_w = (var1x * fx + v[1 + VS_Y] * fy + v[1 + VS_Z]);
    float var2_w = (var2x * fx + v[2 + VS_Y] * fy + v[2 + VS_Z]);
    // ... up to whatever limit I have, 16 is probably practical

    for (int i=0;i<count;i++) {
        struct fragment f = fragments[i];

        // divide by the interpolated 1/W (i.e. multiply by its
        // reciprocal from loop 2) to get the interpolated value
        float var0 = (var0_w + f.x * var0x) * f.w;
        float var1 = (var1_w + f.x * var1x) * f.w;
        float var2 = (var2_w + f.x * var2x) * f.w;
        // .. etc

        // shader says how many varX's it uses so compiler automatically
        // removes any redundant calculations: so only one version of this file
        // need be created
        SHADER_INVOKE(colour + f.x * 4);
    }
}

Written this way, a simple Gouraud colour shader is around 500 bytes and the inner loop is 20 instructions, although it is not very well scheduled.

The end goal would be to have multiple shaders loaded dynamically at runtime but that sounds like too much work so I'll keep it simple and just link them in.

It's a trade-off between ease of use and performance although from some preliminary benchmarking (well, looking at what the compiler produces) I think this is about as good as the compiler is going to get. Being able to provide a programmable shader at near-optimal performance would be a nice bullet-point.

An alternative is that the shader must just implement draw_row() and the code template above is copied; this might be useful if some other hard-to-calculate value like the reciprocal is required per-pixel and it can separate that pass into a separate loop.


On memory I've decided to set the rendering size to 512 pixels. I was hoping for 1024 but that's just a bit too big to fit and a bit too much work for the memory bus besides.

  • 8192 Colour buffer (float) - 4x4x512
  • 2048 Z/W buffer - 4x512
  • 2048 1/W work - 4x512 (could be done in batches)
  • 2048 X work - 4x512 (could be done in batches, or use int16)
  • 2048 Frame buffer colour transfer - 4x512
  • 1024 Primitive transfer buffers (at least 2).
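The sums in that list can be checked with a few constants. This is only a sketch of the arithmetic: the names are made up, and the 32K of local memory per core is my assumption here (it is the figure the remainder quoted for code and stack implies).

```c
/* Hypothetical names; sizes in bytes, taken from the list above. */
enum {
    ROW_PIXELS      = 512,
    LOCAL_MEM_BYTES = 32 * 1024,          /* assumed per-core local memory */
    COLOUR_BYTES    = 4 * 4 * ROW_PIXELS, /* 4 floats per pixel */
    ZW_BYTES        = 4 * ROW_PIXELS,
    INV_W_BYTES     = 4 * ROW_PIXELS,
    X_BYTES         = 4 * ROW_PIXELS,     /* 2 * ROW_PIXELS if int16 */
    TRANSFER_BYTES  = 4 * ROW_PIXELS,     /* byte RGBA row for DMA out */
    PRIMITIVE_BYTES = 1024
};

/* Total data footprint of the row buffers. */
static int budget_used(void)
{
    return COLOUR_BYTES + ZW_BYTES + INV_W_BYTES + X_BYTES
         + TRANSFER_BYTES + PRIMITIVE_BYTES;
}

/* What is left over for code, stack and control structures. */
static int budget_free(void)
{
    return LOCAL_MEM_BYTES - budget_used();
}
```

With these numbers `budget_used()` is 17408 bytes (17K) and `budget_free()` is 15360 bytes (15K). Doubling `ROW_PIXELS` to 1024 pushes the buffers alone past 32K, which is why the wider row doesn't fit.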

That leaves 15K (not the 7K I first wrote, I was out by 8K) for code and stack and some other control structures - which should be enough to do some interesting things. I decided the data needs to be transferred using DMA because the final pass only needs to scale and clamp the floating point framebuffer data to bytes: this is not enough work to prevent the output writes stalling the CPU. Having a separate buffer for the DMA allows the rest to run asynchronously. I will need to round-robin the DMA writes for greatest performance or run them via a central framebuffer controller (and/or dedicate a whole core to the job, in which case it would maintain the colour transfer buffers too).
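The scale-and-clamp pass itself is tiny, something like the sketch below. The function names, the RGBA byte order and the round-to-nearest are my assumptions; the DMA hand-off is left out entirely since that depends on the platform API.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp to [0,1], scale to [0,255] and round to nearest.
 * The first test also sends NaN to 0. */
static inline uint8_t float_to_byte(float v)
{
    if (!(v > 0.0f))
        return 0;
    if (v > 1.0f)
        v = 1.0f;
    return (uint8_t)(v * 255.0f + 0.5f);
}

/* Convert one row of float RGBA into the byte transfer buffer.
 * Assumption: 4 floats in, 4 bytes out per pixel, same channel order. */
static void convert_row(const float *rgba, uint8_t *out, size_t pixels)
{
    for (size_t i = 0; i < pixels * 4; i++)
        out[i] = float_to_byte(rgba[i]);
}
```

That is a compare or two, a multiply-add and a convert per channel, so the core would spend most of its time waiting on the writes if it pushed the bytes out itself.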

Actually the above design does let me efficiently split the fragment shaders into separate cores too if I want because they only need to transfer (x,1/w) tuples for each fragment to render - this was my original idea. If I did that then I could probably fit a 1024-pixel row in memory too.

The bottlenecks?

The gpu will work most efficiently by processing every triangle in the scene in one pass: this allows the framebuffer to stay on-core (and in the native floating point format) which provides very high bandwidth and makes blending essentially free. Once every primitive on that row has been rendered the local framebuffer row cache is converted to bytes and flushed out to the real framebuffer (multipass rendering would also require loading from the framebuffer first, but let's not get carried away here).

I'm intentionally not worrying about texture maps (as in, not implementing anything for them). Yes they could be used but the performance hit is going to be so dire that it is not going to be desirable to use them. If they were to be used I think a separate texture fetch pass would be required before the fragment shader - so that it can fire off some scatter-gather DMA and then process the results as they arrive. I think this is not going to be easy or efficient with the current DMA capabilities.

So, ... ignore that. I will need some useful noise functions so that interesting textures can be procedurally generated instead.

The epiphany to framebuffer speed is pretty low, but that's fixed: there's nothing I can do about that, so no use wasting time crying over spilt milk on that one.

So, ... ignore that too.

I think the main bottleneck will be the transfer of the primitives - because they will all have to be loaded for each row. I will add some input indexing mechanism to separate them into bands so the loading of out-of-range primitives is reduced but fully indexing every row would be costly. If I can work out how to get the broadcast DMA to work (if indeed, it does actually work) then that may help alleviate some of the bandwidth requirements although it comes at a cost of forcing all rasterisers to operate in lock-step across the same band of framebuffer - which might be worse.
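As a rough idea of what the band index could look like, here is a sketch that bins triangles by their scanline extent. The band height, the table sizes and all of the names are placeholders I made up, and this would presumably run on the host side since the table is far too big for a core.

```c
#define BAND_ROWS    32   /* hypothetical band height in scanlines */
#define MAX_BANDS    32
#define MAX_PER_BAND 64

struct band_index {
    int count[MAX_BANDS];
    int tri[MAX_BANDS][MAX_PER_BAND];
};

/* Bin each triangle into every band its [ymin, ymax] range touches
 * (inclusive scanline bounds). Returns the number of bands in use. */
static int band_index_build(struct band_index *bi, const int *ymin,
                            const int *ymax, int ntri, int height)
{
    if (height > MAX_BANDS * BAND_ROWS)
        height = MAX_BANDS * BAND_ROWS;

    int nbands = (height + BAND_ROWS - 1) / BAND_ROWS;

    for (int b = 0; b < nbands; b++)
        bi->count[b] = 0;

    for (int t = 0; t < ntri; t++) {
        int lo = ymin[t] < 0 ? 0 : ymin[t];
        int hi = ymax[t] >= height ? height - 1 : ymax[t];

        if (lo > hi)
            continue;               /* entirely off screen */

        for (int b = lo / BAND_ROWS; b <= hi / BAND_ROWS; b++) {
            if (bi->count[b] < MAX_PER_BAND)   /* silently drops overflow */
                bi->tri[b][bi->count[b]++] = t;
        }
    }
    return nbands;
}
```

A row at `y` then only needs to load the primitives listed for band `y / BAND_ROWS`. A smaller `BAND_ROWS` cuts the wasted loads further but grows the index, which is the cost trade-off mentioned above.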

I may be completely off on this though - I really gotta just code this up and see how it works.

Deferred Rendering

Actually just to get way ahead of myself here; another alternative is a type of deferred rendering. Rather than keep track of the colour buffer it could just keep track of (triangle id, x, 1/w) for each visible pixel. Once it's finished it could then just process the visible pixels - at most once per pixel.

This could be implemented by splitting the triangle primitive into two parts - first the bounding box, edge and z/w and 1/w interpolation equations, and the second being the varying equations. Each pass only needs that set of data - so it could reduce bandwidth requirements too.
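A sketch of what the first pass would keep per row, under some assumptions of mine: the names are invented, x is implicit as the array index rather than stored, 1/W doubles as the depth value (larger is closer), and 16 bits is enough for a triangle id.

```c
#include <stdint.h>

#define ROW_PIXELS  512
#define NO_TRIANGLE 0xffff   /* hypothetical "nothing drawn here" marker */

/* Per-row visibility record: which triangle won each pixel and its 1/W. */
struct visible_row {
    uint16_t tri[ROW_PIXELS];
    float    iw[ROW_PIXELS];
};

static void visible_clear(struct visible_row *v)
{
    for (int x = 0; x < ROW_PIXELS; x++) {
        v->tri[x] = NO_TRIANGLE;
        v->iw[x] = 0.0f;
    }
}

/* First pass: keep the fragment only if it is closer than what is there.
 * Returns 1 if the pixel was replaced. */
static int visible_test(struct visible_row *v, int x, uint16_t id, float iw)
{
    if (iw > v->iw[x]) {
        v->tri[x] = id;
        v->iw[x] = iw;
        return 1;
    }
    return 0;
}

/* How many pixels the second (shading) pass will actually have to process. */
static int visible_count(const struct visible_row *v)
{
    int n = 0;

    for (int x = 0; x < ROW_PIXELS; x++)
        n += v->tri[x] != NO_TRIANGLE;
    return n;
}
```

At 6 bytes per pixel this is smaller than the 16-byte float colour entry it replaces, and the shading pass only needs the varying equations for triangles whose id survived.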

Blending is more difficult. With it on, every visible triangle would need to be rendered immediately and any previously rendered triangles waiting in the deferred buffer would need to be flushed.

Something to defer till later I guess (ho ho).


Bob H said...

I kind of wonder about blitting as well? I've seen DSPs doing blitting and it is quite useful for basic 2D.

NotZed said...

My guess at the moment is that the ARM with some NEON code would cream the epiphany as a blitter (particularly if it was accessing cacheable ram). The epiphany has some bandwidth restrictions but apart from that it's simply tailored to floating point operations and not bytes.

The fpga might be a better way to off-load rectangular dma work from both but it's not quite so easy to program.

Many blitter ops - apart from 'move/copy/fill rectangle' - are no longer really up to the task of modern 2d. e.g. scalable text, polygons, interpolated scaling and affine transforms.

JAM1 or JAM2 just doesn't cut it.