It happens when I try to communicate with the epu cores by writing directly to their memory like this:
struct job *jobqueue = @some pointer on-core@ int qid = ez_port_reserve(1) memcpy(jobqueue[qid], job, sizeof(job); ez_port_post(1);And when the core is busy reading from shared memory:
struct job jobqueue[2]; ez_port_t port; main() { while (1) { int qid = ez_port_await(1); ez_dma_memcpy(buffer, jobqueue[qid].input, jobqueue[qid].N); ez_port_complete(1); } }It's ok with one core but once I have multiple cores running concurrently it starts to fail - at 16 it fails fairly often. It often works once too but putting the requests in a loop causes failures.
If I change the memcpy of the job details to the core into a series of writes with long delays in between them (more than 1ms) it reduces the problem to a rarity but doesn't totally fix it.
I spent a few hours trying everything I could think of but that was the closest to success and still not good enough.
I'll have to wait to see if there is a newer fpga image which is compatible with my rev0 board, or perhaps it's time to finally start poking with the fpga tools myself, if they run on my box anyway.
fft
I was playing with an fft routine. Although you can just fit a 128x128 2D fft all on-core I was focusing on a length of up to 1024.
I did an enhancement to the loader so I can now specify the stack start location which lets me get access to 3 full banks of memory for data - which is just enough for 3x 1024 element complex float buffers to store the sin/cos tables and two work buffers for double-buffering. This lets me put the code, control data, and stack into the first 8k and leave the remaining 24k free for signal.
I also had to try to make an in-place fft routine - my previous one needed a single out-of-place pass to perform the bit reversal following which all operations are streamed in memory order (in two streams). I managed to include the in-place bit reversal with the first radix-2 kernel so execution time is nearly as good as the out of place version. However although it works on the arm it isn't working on the epiphany but I haven't worked out why yet. At worst I can use the out-of-place one and just shift the asynchronous dma of the next row of data to after the first pass of calculation.
Although i'm starting with radix-2 kernels i'm hoping I have enough code memory to include up to radix-8 kernels which can make a good deal of difference to performance. I might have to resort to assembly too, whilst the compiler is doing a pretty good job of the arithmetic all memory loads are just word loads rather than double-word. I think I can do a better job of the addressing and loop arithmetic, and probably even try out the hardware loop feature.
No comments:
Post a Comment