- Expand the state machine on the controller and make all DMA asynchronous - currently some is not.
- Use both DMA channels in the controller asynchronously, probably 0 for external reads and 1 for internal writes.
- A single non-chained DMA request can load up to two distinct objects if they are within 32K (why oh why weren't the strides 32-bit!), halving the dma request rate.
- Implement an asynchronous queue primitive combining eport + remote queue + async dma, or another primitive if it simplifies execution. For example I did manage to avoid the need for the double-dma yesterday to write the "dma complete" status on a variable-length record: I just wrote the record backwards!
- A controller for the line-mode rasteriser should significantly address the source bandwidth problem.
- Investigate synchronising the write stage to take advantage of the improved write performance of the memory interface. This probably needs to run asynchronously via interrupt code.
- The rasteriser can avoid the need to calculate the edge equations per-pixel if it knows a given region is fully in-range. This reduces the inner loop by 3 flops and 3 iops but is hard to take full advantage of due to flop latency.
- The C compiler is doing a fair job of the rasteriser loop but is still making a bit of a pigs breakfast of the address calculation. I think the 21 instructions can be reduced to 15 but to be effective I need to convert a large swathe of code to assembly.
- Use edge equations to accurately index the tiles.
- Investigate optimised renderers for special cases (for particles?). Flat shading reduces the fragment processor to a simple write (or alpha blend). Flat shading + disabled z buffer + full rectangle reduces to a simple rectangular write. etc.
- Investigate primitive synthesis on-core. e.g. particles.
I had some further thoughts on the results of yesterday; even though it's half the speed of the line renderer considering the complexity of the interactions and the forced requirement of an additional read/write cycle across another core for each fragment - it's probably actually fairly good. The main bottleneck seems to be the mismatch of rasterisation to fragment rendering time which has nothing to do with the architecture - but the fragment shaders are only trivial 3-term colour interpolations and if they were more complex then shifting the rasterisation to another core would leave more time for them to execute. So I will still hook it up to the gl frontend to test it and other backends which can use the same or similar controller setup.
Although I think due to the possibility of other highly optimised special cases a combined implementation will still be the ultimate target.