I didn't think it would improve performance but I was trying to address the problem of generality and memory overhead.
The previous version grouped stages in blocks, but each stage with a given block needed to be complete. So the double-buffer buffers need to accomodate at least the single largest stage - which is over 7k even with the relatively simple cascade and "highly optimised" enncoding format I'm testing with. This together with keeping a few stages in LDS permanently meant it was using upwared of 20k just for the cascade data and wouldn't be able to handle bigger cascade stages without exhausting local memory. It was already extremely tight.
So I thought about how to break the cascade into fixed-size streamable blocks. The only state needed to be kept between blocks is the "stage sum", which needs to accumulate across the entire stage - so it is cheap to implement. Most of the time I spent writing the cascade compiler trying to fit it into a usable format, and then trying to debug the implementation (when it was the compiler which had the bug).
The new version uses tuneable block sizes, where the first block is stored permanently in LDS (to allow time to dma in the next block) and the rest DMAd from shared memory as with the previous implementation. Using 4k blocks (=12K LDS) even attains marginally better performance in the C implementation. Using 2k blocks (only 6k lds) is about equal.
Once I got that working I converted the assembly routine I had over - and it even worked the first time I ran it! Performance is identical to the previous version in this case although the increased generality and reduced memory requirements are a big plus.
I also found a way to avoid the need for dma abort (which as stated, doesn't work reliably and can crash the hardware) - by delaying it's request till part-way through processing the local cascade. Although the asm version is still too fast for it and some cycles are wasted waiting for the DMA channel to be clear.
And as a bonus I incidentally verified the results against another independent implementation which i know works well enough so i'm fairly confident the numbers are now valid too. Because it performs so many calculations even a small error should show up with wildly different results.
Best times on single epu:
c - 1.79s, dmawait=15MC asm - 0.87s, dmawait=31MC ARM c - 1.00sI could potentially reduce the image data dma external bandwidth needs by nearly 2/3 ... but i haven't looked at that yet - however it may be important for multi-core.
I'm fairly happy with these numbers, it will be interesting to see how it scales to multiple cores.
PS the assembly language leaf routine 149 instructions in 454 bytes. I used some interesting optimisations of the algorithm and took advantage of the separate float and integer status flags. However there may be room for micro-optimisation by improving the scheduling as todate it was only "by eye and intuition" (i.e educated guesses).