Sunday, 10 August 2014

first triangle from epiphany soft-gpu

I was nearly going to leave it for the weekend, but after Andreas twattered about the last post I figured I'd fill in the last little bit of work to get it running on-screen. It was a bit less work than I thought.

  • One epiphany core is a bit faster than one Zynq ARM core! 15s vs 18s (but a small amount of NEON and a slightly different inner loop would make a huge difference);
  • Scaling is ok but not great at the high end: 4 cores = 5.5s, 8 cores = 4.6s, 16 cores = 3.9s;
  • The output dma isn't interlocked so it's losing about 1/2 the write performance once more than one core is active;
  • All memory and jobs are synchronous (ezesdk's async dma routines aren't working for some reason; luser error on that one);
  • Scheduling is static, each core does interleaved rows;
  • Over half of the total processing time for rendering this single triangle is spent on the float4 to uint32 rgba clamping and conversion and it can't be sped up. This cost is fixed per frame, but who would have thought the humble clamp() could be the main bottleneck?
  • Total on-core .text is under 2K (could easily increase the render size to 768 pixels wide?);
  • It's all just C but I don't think significant gains are possible in assembly.
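For reference, the interleaved-row scheduling and the float4 to uint32 conversion from the list above might look something like the following. This is only a rough sketch and not the actual soft-gpu code: the float4 struct, the function names, the 0xAABBGGRR channel order and the round-to-nearest step are all assumptions made up for illustration.

```c
#include <stdint.h>

/* Hypothetical pixel type; the post only calls it "float4". */
typedef struct { float r, g, b, a; } float4;

/* Clamp to [0,1]. Written so that a NaN input falls through to 0. */
static float clamp01(float v)
{
    if (!(v > 0.0f))
        return 0.0f;
    return v > 1.0f ? 1.0f : v;
}

/* Pack one pixel as 0xAABBGGRR (the channel order is an assumption). */
static uint32_t pack_rgba(float4 c)
{
    uint32_t r = (uint32_t)(clamp01(c.r) * 255.0f + 0.5f);
    uint32_t g = (uint32_t)(clamp01(c.g) * 255.0f + 0.5f);
    uint32_t b = (uint32_t)(clamp01(c.b) * 255.0f + 0.5f);
    uint32_t a = (uint32_t)(clamp01(c.a) * 255.0f + 0.5f);

    return r | (g << 8) | (b << 16) | (a << 24);
}

/*
 * Static interleaved scheduling: core number `core` out of `ncores`
 * converts rows core, core + ncores, core + 2 * ncores, and so on.
 * Returns the number of rows this core handled.
 */
static int convert_rows(const float4 *src, uint32_t *dst,
                        int width, int height, int core, int ncores)
{
    int rows = 0;

    for (int y = core; y < height; y += ncores) {
        for (int x = 0; x < width; x++)
            dst[y * width + x] = pack_rgba(src[y * width + x]);
        rows++;
    }
    return rows;
}
```

Because the row assignment depends only on the core number, no core needs to talk to any other to find its work, which is what keeps the scheduling static.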

The times are for rotating the triangle around the centre of the screen for 360 degrees, one degree per frame. The active playfield is 512x512 pixels. Z buffer testing is on.

Actually the first triangle was a bit too boring, so it's a few hundred triangles later.

Update: I was just about to call it a night when I spotted a bit of testing code left in: it was always processing 1280 pixels for each triangle rather than the bounding-box. So the times are somewhat out and it's more like arm(-O2)=15.5s, epu 1x=11.5s, 4x=3.9s, 8x=3.1s, 16x=2.4s. I also did some double-buffering and so on before I spotted this bug, but the timing was so shot it turned out to be pointless.
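To make the bounding-box point concrete, here is a minimal sketch of limiting the work to a triangle's screen-space bounds instead of a fixed pixel count. It assumes integer vertex coordinates and a half-open box; the struct and function names are hypothetical and are not taken from the real renderer.

```c
/* Half-open pixel rectangle: x0 <= x < x1, y0 <= y < y1. */
typedef struct { int x0, y0, x1, y1; } box;

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/*
 * Bounding box of a triangle given as three (x, y) integer vertices,
 * clipped to a width x height playfield.
 */
static box tri_bounds(const int v[3][2], int width, int height)
{
    box b;

    b.x0 = imax(0, imin(v[0][0], imin(v[1][0], v[2][0])));
    b.y0 = imax(0, imin(v[0][1], imin(v[1][1], v[2][1])));
    b.x1 = imin(width, imax(v[0][0], imax(v[1][0], v[2][0])) + 1);
    b.y1 = imin(height, imax(v[0][1], imax(v[1][1], v[2][1])) + 1);
    return b;
}

/* Pixels that need testing; 0 if the box is empty or fully off-screen. */
static int box_pixels(box b)
{
    if (b.x1 <= b.x0 || b.y1 <= b.y0)
        return 0;
    return (b.x1 - b.x0) * (b.y1 - b.y0);
}
```

A small triangle then only costs as many pixel tests as its box covers, rather than the same fixed amount as a large one.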

I did confirm that loading the primitive data is a major bottleneck however. But as a baseline the performance is a lot more interesting than it was a few hours ago.
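Going back to the corrected timings in the update, the scaling can be put into numbers with a few one-line helpers. This is just arithmetic on the figures already quoted; the function names are made up for illustration.

```c
/* Speedup of an n-core run relative to the single-core time. */
static double speedup(double t1, double tn)
{
    return t1 / tn;
}

/* Parallel efficiency: the speedup divided by the number of cores. */
static double efficiency(double t1, double tn, int ncores)
{
    return t1 / (tn * ncores);
}

/* Average frame rate for a run of `frames` frames taking `seconds`. */
static double frames_per_second(int frames, double seconds)
{
    return frames / seconds;
}
```

With epu 1x=11.5s and 16x=2.4s, 16 cores come out at roughly 4.8x one core, or about 30% efficiency, and 360 frames in 2.4s is about 150 frames per second.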
