I included and/or implemented the various bits and pieces mentioned in the last few parallella-related posts - things like the global loader-defined barrier memory, async dma (via a queue and interrupts), and the startup routine that can pass arguments to kernels, track the current running state, and set a return code.
I decided to just allocate the barrier memory for the whole workgroup before the code address every time even if barriers aren't being used. It can always be changed. Still yet to test the implementation.
The start of the memory-map (excluding the isv entries) now looks like this:
 +----------+------
 |     0028 | extmem
 |     002c | group_id
 |     0030 | group_rows
 |     0034 | group_cols
 |     0038 | core_row
 |     003c | core_col
 |     0040 | group_size
 |     0044 | core_index
 +----------+------
 |     0048 | imask (short, but here so it can be loaded as int)
 |     004a | status (short)
 |     004c | exit code
 |     0050 | entry
 |     0054 | sp
 |     0058 | arg0
 |     005c | arg1
 |     0060 | arg2
 |     0064 | arg3
 +----------+------
 |     0068 | barrier, group_size bytes
 |          |
 +----------+------
 |    ≥006c | .text .data .bss
 |          | .text.bank0 .data.bank0 .bss.bank0
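To make the layout easier to check, here is a minimal sketch of the same block as a C struct with compile-time offset checks. The struct name, the field widths (4 bytes wherever the offsets step by 4) and the `code_start()` helper are my assumptions for illustration, not the loader's actual definitions. In particular, the 4-byte rounding after the barrier bytes is only a guess that happens to agree with the ≥006c entry.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical C view of the loader-defined block that starts at 0x0028.
 * Field widths are assumed from the spacing of the offsets in the table. */
#define CTRL_BASE 0x0028u

struct core_ctrl {
    uint32_t extmem;
    uint32_t group_id;
    uint32_t group_rows;
    uint32_t group_cols;
    uint32_t core_row;
    uint32_t core_col;
    uint32_t group_size;
    uint32_t core_index;
    uint16_t imask;      /* short, placed so imask+status load as one int */
    uint16_t status;
    uint32_t exit_code;
    uint32_t entry;
    uint32_t sp;
    uint32_t arg[4];     /* arg0..arg3 */
    uint8_t  barrier[];  /* group_size bytes, one per work-item */
};

/* Absolute on-core address of a field. */
#define CTRL_ADDR(field) (CTRL_BASE + (uint32_t)offsetof(struct core_ctrl, field))

static_assert(CTRL_ADDR(core_index) == 0x44, "core_index offset");
static_assert(CTRL_ADDR(imask)      == 0x48, "imask offset");
static_assert(CTRL_ADDR(status)     == 0x4a, "status offset");
static_assert(CTRL_ADDR(exit_code)  == 0x4c, "exit code offset");
static_assert(CTRL_ADDR(arg)        == 0x58, "arg0 offset");
static_assert(CTRL_ADDR(barrier)    == 0x68, "barrier offset");

/* First address available for .text/.data/.bss, ASSUMING the loader
 * rounds the end of the barrier bytes up to a 4-byte boundary. */
static inline uint32_t code_start(uint32_t group_size)
{
    uint32_t end = CTRL_ADDR(barrier) + group_size;
    return (end + 3u) & ~3u;
}
```

With these assumptions a single-core group starts its code at 0x6c and a 16-core group at 0x78, so the code address moves with the workgroup size even when barriers are unused.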
Not sure if it'll work but I experimented with a @workgroup "tag" on section names. If present, the allocation is multiplied by the workgroup size - this was whilst working on the barrier stuff, before I realised that won't work because the barrier location has to be the same across all work-items in the work-group even if they're running separately linked code. Something I can play with later anyway.
After getting the most basic test running I'm at the point of being able to debug the new features. And I just got the async dma interfaces to function (yay?) before writing this up. Actually it works out pretty nicely. I define the async dma queue inside the isr handler code, so using the c functions which reference the queue drags in the isr and isr vector automatically. The loader tracks this so that the new sync isr automagically sets the correct imask too.
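For illustration, the queue-plus-interrupt idea can be sketched roughly as below. This is not the actual implementation: the descriptor layout, the names (`dma_enqueue`, `dma_isr`, `hw_dma_start`) and the queue size are all hypothetical, and the hardware access is replaced by a stub so the logic can be followed on any host. The queue lives in the same file as the isr, which is the property that lets a reference to the queue functions pull the handler into the link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DMA_QUEUE_SIZE 4u          /* hypothetical depth */

/* Hypothetical descriptor, NOT the Epiphany hardware format. */
struct dma_desc { uint32_t src, dst, bytes; };

struct dma_queue {
    volatile uint32_t head;        /* transfer in flight; advanced by the isr */
    volatile uint32_t tail;        /* next free slot; advanced by enqueue */
    struct dma_desc slot[DMA_QUEUE_SIZE];
};

/* Defined next to the isr so that referencing it links the isr as well. */
static struct dma_queue queue;

/* Stub standing in for programming the DMA registers. */
static const struct dma_desc *hw_active;
static unsigned hw_starts;
static void hw_dma_start(const struct dma_desc *d) { hw_active = d; hw_starts++; }

/* Queue a transfer; start it immediately only if the engine is idle.
 * A real version would mask the DMA interrupt around this update. */
int dma_enqueue(uint32_t src, uint32_t dst, uint32_t bytes)
{
    uint32_t tail = queue.tail;
    if (tail - queue.head == DMA_QUEUE_SIZE)
        return -1;                 /* queue full */
    struct dma_desc *d = &queue.slot[tail % DMA_QUEUE_SIZE];
    d->src = src; d->dst = dst; d->bytes = bytes;
    int idle = (tail == queue.head);
    queue.tail = tail + 1;
    if (idle)
        hw_dma_start(d);
    return 0;
}

/* Completion interrupt: retire the finished entry, start the next one. */
void dma_isr(void)
{
    assert(queue.head != queue.tail);  /* only fires with a transfer in flight */
    queue.head = queue.head + 1;
    if (queue.head != queue.tail)
        hw_dma_start(&queue.slot[queue.head % DMA_QUEUE_SIZE]);
    else
        hw_active = NULL;
}

/* Number of transfers queued or in flight. */
int dma_pending(void) { return (int)(queue.tail - queue.head); }
```

In this sketch the caller never touches the engine directly: the first enqueue kicks off the hardware and every later transfer is chained from the interrupt handler.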
Once I've got everything going I've got a bit of housekeeping stuff to deal with before it can go further. But for now I've settled on two libraries:
The host-side library (surprise). This includes a fork of the adapteva esdk 'e-hal' as well as the elf-loader stuff. The on-core runtime interface has been changed to accommodate the new features so it won't work with pre-linked binaries.
This is the on-core support library and the equivalent of e-lib. Most of the functions are inline calls, which generate smaller code and allow more efficient compilation of leaf functions.
I have assembly versions of almost every non-inline routine too; they save some code-space but maybe not enough to be worth it. I might include them as another library option. Perhaps.
I'm probably also going to look at different runtime mechanisms such as a "job queue" mode rather than the current "one shot" mode. This will be changed by specifying a different crt0.o file. Already the crt0 implementation I have allows one to restart the core by just using e_start() without requiring a reset first because the exit routine just idles rather than trapping.