First I tried creating a tile-based implementation for the ARM/host version, but this runs at about half the speed of the line-oriented one. Not that I really optimised it, but that's a lot to make up and I don't see the point; it's a convenient test-bed for experimenting though.
Then I tried creating tile-accurate indexing rather than using the bounding box. This improves the output a small amount on the purely ARM version but takes a hit on the Epiphany backend, since the extra cost in the ARM-side code exceeds the gains on the Epiphany side. It will depend on the workload and it might be worth it for larger triangles. Then again, maybe the index isn't helping as much as I thought.
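To make the bounding-box versus tile-accurate difference concrete, here is a rough sketch in C. It is not the code from my renderer: `edge_t`, `edge_make`, `edge_max` and `index_tiles` are names made up for the illustration, and it assumes integer pixel coordinates, non-negative vertices and a winding for which the edge functions are non-negative inside the triangle. A tile is kept only if every edge function is non-negative at the tile corner where it is largest.

```c
#include <assert.h>

/* Edge function e(x,y) = a*x + b*y + c, non-negative on the inside
   for the winding assumed in this sketch. */
typedef struct { int a, b, c; } edge_t;

static edge_t edge_make(int x0, int y0, int x1, int y1)
{
    edge_t e = { y0 - y1, x1 - x0, x0 * y1 - x1 * y0 };
    return e;
}

static int min3(int a, int b, int c) { int m = a < b ? a : b; return m < c ? m : c; }
static int max3(int a, int b, int c) { int m = a > b ? a : b; return m > c ? m : c; }

/* Largest value the edge function takes over the tile [x0,x1] x [y0,y1]. */
static int edge_max(edge_t e, int x0, int y0, int x1, int y1)
{
    return e.a * (e.a > 0 ? x1 : x0) + e.b * (e.b > 0 ? y1 : y0) + e.c;
}

/* v = { x0,y0, x1,y1, x2,y2 }.  Returns the number of tiles touched by
   the bounding box and stores in *exact how many of those survive the
   per-tile edge test. */
int index_tiles(const int v[6], int tile, int *exact)
{
    assert(tile > 0);

    edge_t e[3] = {
        edge_make(v[0], v[1], v[2], v[3]),
        edge_make(v[2], v[3], v[4], v[5]),
        edge_make(v[4], v[5], v[0], v[1]),
    };
    int tx0 = min3(v[0], v[2], v[4]) / tile, tx1 = max3(v[0], v[2], v[4]) / tile;
    int ty0 = min3(v[1], v[3], v[5]) / tile, ty1 = max3(v[1], v[3], v[5]) / tile;
    int bbox = 0;

    *exact = 0;
    for (int ty = ty0; ty <= ty1; ty++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            int x0 = tx * tile, y0 = ty * tile;
            int x1 = x0 + tile - 1, y1 = y0 + tile - 1;

            bbox++;
            /* Reject the tile if it lies wholly outside any one edge. */
            if (edge_max(e[0], x0, y0, x1, y1) >= 0 &&
                edge_max(e[1], x0, y0, x1, y1) >= 0 &&
                edge_max(e[2], x0, y0, x1, y1) >= 0)
                (*exact)++;
        }
    }
    return bbox;
}
```

For a right-angled triangle filling half of a 64x64 area with 8x8 tiles this drops close to half of the bounding-box tiles, which is the sort of saving that has to pay for the extra per-tile work on the ARM side.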
I also started (re)reading about some lighting stuff but didn't get very far.
Feeling pretty lazy today too.
Update: But not too lazy to poke a bit more it seems.
I made a "slight improvement" to the ARM-based tile renderer and now it's a bit faster (10%) than the line-based one with a specific test-case. Being lazy the first time, I was just processing the tile row by row rather than performing the rasteriser pass across the whole tile first and then processing the fragments afterwards. This just helps the compiler keep more setup data in registers for each loop and is closer to how I'm doing it on the Epiphany.
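The two-pass arrangement looks roughly like the following. This is only a sketch under some assumptions: an 8x8 tile, integer edge functions that are non-negative inside the triangle, and a flat colour standing in for the real fragment processing (no z-buffer). The names `tile_rasterise`, `tile_shade` and `tile_render` are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TILE 8

/* Edge function e(x,y) = a*x + b*y + c, non-negative on the inside. */
typedef struct { int a, b, c; } edge_t;

/* Pass 1: walk the whole tile and record the offsets of covered pixels.
   Only the edge values are live in this loop. */
static int tile_rasterise(const edge_t e[3], int tx, int ty,
                          uint8_t frags[TILE * TILE])
{
    int n = 0;
    int r0 = e[0].a * tx + e[0].b * ty + e[0].c;
    int r1 = e[1].a * tx + e[1].b * ty + e[1].c;
    int r2 = e[2].a * tx + e[2].b * ty + e[2].c;

    for (int y = 0; y < TILE; y++) {
        int w0 = r0, w1 = r1, w2 = r2;

        for (int x = 0; x < TILE; x++) {
            /* The OR is negative if any edge value is negative
               (two's complement assumed). */
            if ((w0 | w1 | w2) >= 0)
                frags[n++] = (uint8_t)(y * TILE + x);
            w0 += e[0].a; w1 += e[1].a; w2 += e[2].a;
        }
        r0 += e[0].b; r1 += e[1].b; r2 += e[2].b;
    }
    return n;
}

/* Pass 2: process the recorded fragments.  Flat colour only here. */
static void tile_shade(const uint8_t *frags, int n, uint32_t colour,
                       uint32_t pixels[TILE * TILE])
{
    for (int i = 0; i < n; i++)
        pixels[frags[i]] = colour;
}

/* Renders one tile whose top-left pixel is (tx,ty); returns the
   number of fragments produced. */
int tile_render(const edge_t e[3], int tx, int ty, uint32_t colour,
                uint32_t pixels[TILE * TILE])
{
    uint8_t frags[TILE * TILE];
    int n = tile_rasterise(e, tx, ty, frags);

    assert(n <= TILE * TILE);
    tile_shade(frags, n, colour, pixels);
    return n;
}
```

Splitting it this way means each loop only needs its own setup values, which is the register-pressure point above; the row-by-row version had to keep both sets around at once.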
Update: Haven't been able to get into it this last week. I think hayfever season is starting and even before the symptoms hit it just seems to wreck my sleep more than normal. Been really tired/lethargic and not really feeling like doing anything - it just feels like all I'm doing each day is hanging around waiting to escape from it into the unconsciousness of sleep again. Today I even feel like I'm "coming down with something" although I'm pretty sure I'm not and it's just some hayfever-related nonsense. I've done a little gardening at least - preparing some garden beds, putting in a few seeds, and rejuvenating some pots.
But as a bit of a puzzle a few days ago I tried to see if I could get the rasteriser loop any faster. I think I can get the inner loop down to 8 cycles with some unrolling, double load/stores and some constant preloads. The previous best was 10 cycles but I'm not sure this new version is practical.
This came out of playing with the idea of breaking the work up into squares (4x4 or 8x8) rather than rows. This adds overhead, because edge tests are performed per square on top of the per-pixel tests, but it also reduces the overhead of calculating over the whole bounding box. But it's one of those things I need a solid afternoon to try out by coding it up.
These tile tests also allow one to determine full coverage outside of the loop - which removes the need for the edge testing calculations at all for those tiles. So I tried to see if that could save anything in the inner loop, but so far the latency from the z-buffer testing has prevented any gains being made. Even assuming I could pipeline that away, I think I can only save 1 cycle.
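A per-square coverage test of this kind can be sketched as below. It is an illustration rather than the code I'm timing: `tile_classify` and the `TILE_*` constants are made-up names, and it assumes integer edge functions that are non-negative inside the triangle. Each edge function is evaluated at the corner where it is largest and the corner where it is smallest.

```c
#include <assert.h>

/* Edge function e(x,y) = a*x + b*y + c, non-negative on the inside. */
typedef struct { int a, b, c; } edge_t;

enum { TILE_OUTSIDE, TILE_PARTIAL, TILE_FULL };

/* Classify the square whose top-left pixel is (x0,y0).
   TILE_OUTSIDE: skip the square entirely.
   TILE_FULL:    every pixel is covered, no per-pixel edge tests needed.
   TILE_PARTIAL: fall back to the per-pixel loop (which may still find
                 no covered pixels, since the test is conservative). */
int tile_classify(const edge_t e[3], int x0, int y0, int size)
{
    assert(size > 0);

    int x1 = x0 + size - 1, y1 = y0 + size - 1;
    int full = 1;

    for (int i = 0; i < 3; i++) {
        /* Corners where this edge function is largest and smallest. */
        int emax = e[i].a * (e[i].a > 0 ? x1 : x0)
                 + e[i].b * (e[i].b > 0 ? y1 : y0) + e[i].c;
        int emin = e[i].a * (e[i].a > 0 ? x0 : x1)
                 + e[i].b * (e[i].b > 0 ? y0 : y1) + e[i].c;

        if (emax < 0)
            return TILE_OUTSIDE;
        if (emin < 0)
            full = 0;
    }
    return full ? TILE_FULL : TILE_PARTIAL;
}
```

For a TILE_FULL square the inner loop is left with just the z test and the write, which is why the z-buffer latency ends up being the limit rather than the edge arithmetic.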
I also toyed with creating an integer rasteriser that stores the framebuffer internally using bytes. For a flat shaded/z-buffered/non-blended triangle I think I can get that down to 7 cycles per pixel (and that's rendered, not just converted to fragments). Is that even useful? Who knows. But to test that idea out I need to work on a new design which will take another solid afternoon as well.