Saturday, 17 August 2013

DMA n stuff

So it looks like it's going to take a bit longer to get the object recognition code going on parallella.

First I was a bit dismayed that there is no way to signal the ARM with an interrupt - but subsequently satisfied that it is just work in progress and can be added to the FPGA glue logic later. Without this there is no mechanism for efficient epiphany-to-ARM communication, as the only option is polling a shared memory location (ugh, how x86).

Then I had to investigate just how to communicate using memory buffers. I had some trouble with the linker script (more on that later) so I hardcoded some addresses and managed to get it to work. I created a simple synchronous mailbox system that isn't too inefficient to poll from the ARM.
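The basic shape of it is just a pair of flags and an argument block sitting in the shared DRAM segment. Something like the sketch below - the struct layout, the address, and the flag values are all made up for illustration; the real offsets come from the linker (or, for now, hard-coding):

#include <stdint.h>

/* e-core view of the shared external RAM segment (illustrative address) */
#define SHARED_BASE 0x8e000000

struct mailbox {
    volatile uint32_t request;   /* ARM writes non-zero to post a job */
    volatile uint32_t response;  /* e-core writes non-zero when done  */
    volatile uint32_t arg[6];    /* job parameters                    */
};

static struct mailbox *const mbox = (struct mailbox *)SHARED_BASE;

/* e-core side: spin on the request word, run the job, flag completion.
 * The ARM maps the same memory and does the mirror image: write the
 * request, poll the response. */
void mbox_serve(void (*do_job)(volatile uint32_t *arg))
{
    while (mbox->request == 0)
        ;
    do_job(mbox->arg);
    mbox->request = 0;
    mbox->response = 1;
}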

Once I had something going I tried a few variations to judge performance: simple memory accesses, and different types of DMA.

The test loop just squares the elements of one array into another, and uses a single e-core. The arrays are 512x512 floats.
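The per-block work is nothing more than this (the function name is mine):

/* the whole "algorithm": square each element of a block */
static void square_block(const float *in, float *out, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}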

seq How                     Total cycles    DMA wait  verified
 1: Direct array access        216985626           0  ok
 3: Synchronous DMA             11076425           0  ok
 5: Simultaneous DMA byte       90242983    88205438  ok
 7: Simultaneous DMA short      46104601    44066927  ok
 9: Simultaneous DMA word       24047251    22009451  ok
11: Simultaneous DMA long       11179747     9131659  ok
13: Async Double DMA byte       88744259    88021739  ok
15: Async Double DMA short      44542240    43819736  ok
17: Async Double DMA word       22448029    21725573  ok
19: Async Double DMA long        9479016     8756500  ok

All numbers are in clock cycles. The DMA routines used one or two 8K buffers (not aligned to memory banks). The DMA wait column is the "wasted cycles" spent waiting for asynchronous DMA to complete (where that number is available).

As expected the simple memory access is pretty slow - about 20x slower than a simple synchronous DMA. The synchronous DMA is good for its simplicity - just use e_dma_copy() to copy each block in and out.
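That variant is more or less the following - buffer size and names are mine, e_dma_copy() is the blocking copy from e-lib:

#include "e_lib.h"

#define BLOCK (8192 / sizeof(float))   /* one 8K local buffer */

/* Synchronous DMA version: blocking copy in, square in place, blocking
 * copy out.  Nothing overlaps, so the core idles during every transfer. */
void square_sync(float *src, float *dst, unsigned n)
{
    static float buf[BLOCK];
    unsigned off, i;

    for (off = 0; off < n; off += BLOCK) {
        e_dma_copy(buf, src + off, BLOCK * sizeof(float));
        for (i = 0; i < BLOCK; i++)
            buf[i] *= buf[i];
        e_dma_copy(dst + off, buf, BLOCK * sizeof(float));
    }
}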

Simultaneous DMA uses two buffers and enqueues two separate DMA operations concurrently - one reading and one writing. Both still need to complete before moving on, and it appears they are bandwidth limited.
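For the record, this is roughly what that looks like. The descriptor setup uses e_dma_set_desc()/e_dma_start() from e-lib - the argument order below is from memory (and it has shifted between SDK revisions), so treat it as an outline rather than something to paste in:

#include "e_lib.h"

#define BLOCK (8192 / sizeof(float))

static e_dma_desc_t din, dout;
static float a[BLOCK] __attribute__((aligned(8)));
static float b[BLOCK] __attribute__((aligned(8)));

/* queue a simple 1D 64-bit-wide copy on the given channel (no chaining) */
static void queue_copy(e_dma_id_t chan, e_dma_desc_t *d,
                       void *dst, void *src, unsigned bytes)
{
    e_dma_set_desc(chan,
                   E_DMA_ENABLE | E_DMA_MASTER | E_DMA_DWORD,
                   0,                 /* no next descriptor            */
                   8, 8,              /* inner strides: src, dst       */
                   bytes / 8, 1,      /* inner count (dwords), outer   */
                   8, 8,              /* outer strides: src, dst       */
                   src, dst, d);
    e_dma_start(d, chan);
}

/* read the next input block and write the previous result concurrently,
 * then wait for both before doing the (trivial) compute */
void square_simultaneous(float *src, float *dst, unsigned n)
{
    unsigned off, i;

    for (off = 0; off < n; off += BLOCK) {
        queue_copy(E_DMA_0, &din, a, src + off, BLOCK * sizeof(float));
        if (off)
            queue_copy(E_DMA_1, &dout, dst + off - BLOCK, b,
                       BLOCK * sizeof(float));
        e_dma_wait(E_DMA_0);
        e_dma_wait(E_DMA_1);

        for (i = 0; i < BLOCK; i++)
            b[i] = a[i] * a[i];
    }
    e_dma_copy(dst + n - BLOCK, b, BLOCK * sizeof(float));   /* last block */
}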

Async double-buffered DMA uses the same two buffers, but queues a chained DMA operation to write out the previous result and read in the next input block - and that DMA runs asynchronously to the processing loop.
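The double-buffered version is shaped like this: a two-descriptor chain (write out the previous result, then re-fill the same buffer with the next input) is started and left to run while the core squares the other buffer in place. Same caveat as above - the e-lib descriptor calls are from memory; the buffer juggling is the point of the sketch, not the exact API:

#include "e_lib.h"

#define BLOCK (8192 / sizeof(float))

static e_dma_desc_t din, dout;
static float bufa[BLOCK] __attribute__((aligned(8)));
static float bufb[BLOCK] __attribute__((aligned(8)));

/* fill in a 1D 64-bit-wide copy descriptor, optionally chained to 'next' */
static void set_copy(e_dma_desc_t *d, unsigned extra, e_dma_desc_t *next,
                     void *dst, void *src, unsigned bytes)
{
    e_dma_set_desc(E_DMA_0,
                   E_DMA_ENABLE | E_DMA_MASTER | E_DMA_DWORD | extra,
                   next,
                   8, 8, bytes / 8, 1, 8, 8,
                   src, dst, d);
}

void square_async(float *src, float *dst, unsigned n)
{
    float *cur = bufa;   /* block being squared in place            */
    float *io  = bufb;   /* block in flight (result out, input in)  */
    unsigned off, i;

    e_dma_copy(cur, src, BLOCK * sizeof(float));         /* prime block 0 */

    for (off = 0; off < n; off += BLOCK) {
        if (off + BLOCK < n) {
            /* chain: write out the previous block's result from io, then
             * read the next block's input back into the same buffer */
            set_copy(&din, 0, 0, io, src + off + BLOCK, BLOCK * sizeof(float));
            if (off) {
                set_copy(&dout, E_DMA_CHAIN, &din,
                         dst + off - BLOCK, io, BLOCK * sizeof(float));
                e_dma_start(&dout, E_DMA_0);
            } else {
                e_dma_start(&din, E_DMA_0);
            }
        } else if (off) {
            set_copy(&dout, 0, 0, dst + off - BLOCK, io, BLOCK * sizeof(float));
            e_dma_start(&dout, E_DMA_0);
        }

        for (i = 0; i < BLOCK; i++)      /* compute overlaps the DMA */
            cur[i] *= cur[i];

        e_dma_wait(E_DMA_0);
        { float *t = cur; cur = io; io = t; }
    }

    e_dma_copy(dst + n - BLOCK, io, BLOCK * sizeof(float));  /* last result */
}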

Some notes:

Avoid reading buffers using direct access!
Slow slow slow. Writing shouldn't be too bad as write transactions are fire-and-forget. I presume that, since I used core 0,0, this is actually the best-case scenario at that ...
Avoid anything smaller than 64-bit transfers.
Every DMA element transfer takes up a transaction slot, so anything smaller than the maximum (64-bit) transfer size becomes very wasteful.
Concurrent external DMA doesn't help
Presumably it's bandwidth limited.
Try to use async DMA.
Well, obviously.
The e-core performance far outstrips the external memory bandwidth
So yeah, next time I should use something a bit more complex than a square op. This is both good - yeah, heaps of grunt - and not so good - memory scheduling is critical for maximising performance.
Multicore?
I have yet to experiment with bigger work-groups to see how it scales up (or doesn't).

Build Environment

So one reason I couldn't get the linker script to work properly (assigning data to particular memory blocks via section attributes) was my build setup. I was initially going to have a separate directory for epiphany code so that a makefile could just change CC, etc. But that just seemed too clumsy, so I decided to use some implicit make rules with new extensions to automate some of the work - .ec, .eo, .elf, .srec, etc. The only problem is that the linker script takes the file extension into account, so all my section attributes were being ignored ...

I copied it and added the new extensions in a couple of places, which fixed it - but I haven't gone back to adjust the code to take advantage of it yet.

Object recognition

So anyway I tried to fit this knowledge into the OR code, but I haven't got it working yet. The hack of hard-coding the address offsets doesn't work now since I'm getting the linker to drop some data into the same shared address space, and I'm not sure yet how I can resolve the linker-assigned addresses from the ARM side so I can properly map the memory blocks. So until I work this out there's no way to pass the job data to the e-cores.

I could just move all the data to the ARM side and have it initialise the tables, but then I would have to manually 'link' the addresses in - throwing away the facilities of the linker which are designed for exactly this kind of thing. I could create a custom linker script which hard-codes the addresses in another way, but that seems hacky and non-portable.

I might have to check the forums and see what others have come up with, and read that memory map a bit more closely.
