But what I do have working is most of a relocating loader which can take a file linked with -r and splat it anywhere in a given memory space and resolve all the references to create a runnable programme. I worked out the epiphany reloc types I required for a small test code and verified them with the disassembler.
I ran in to some strange problems with writing directly to epu and the shared memory block but now I look at the code it seems I forgot to reset/halt the epu first. So hopefully it was just the epu running random junk and crashing the computer and not something funny about the writes.
To work around that (before i spotted the error above) I just wrote to a separate 32k block in memory and then dump it to an assembly file (as .bytes) i can then compile to a binary ... on which i can run a disassembler (e-objdump) to verify that the reloc has been spliced into the code properly. And what I have working looks all pretty hunky dory. About the only thing missing are some of the c-runtime support variables and constants like the stack location. Oh and resolving the weak symbols but that's straightforward. Hmm, and the code support stuff like symbol resolution. So plenty left.
Moving the architecture details from the linker to the loader provides some interesting possibilities although it also presents some problems.
As outlined last post the intention is to use the section names to assign data+code to different regions such as specific data banks, or shared or private external memory. But this can quickly turn into a non-trivial optimisation problem if any of the fixed regions start to overflow and you try to fit them in anyway. i.e. if you assign to bank1 and subsequently bank0 overflows, what then? It might still be possible to honour the bank targets by shifting things around, although all the linker does currently is fill from the start and just borks if anything overlaps. Right now i'm just processing sections as they arrive (which is fairly random straight from the compiler) but they can be processed in any order and/or a linker script used to clean it up.
Having it in the loader also means I run up against some of the C runtime support stuff which is hardcoded into the current linker scripts and runtime. e.g. the bss range and stack pointer. The current crt0 (bootstrap) code only initialises a single bss section although I intend to have many and the bootstrap code still needs to be be the one initialising it. So i'll probably need my own crt0/bootstrap stuff and work out how to get gcc to use it (gcc 'specs' files have always been an indecipherable and opaque mess to me, but i think i can just do it with a command line). But because everything is relocated dynamically the loader can initialise dynamic/complex structures and link them into the bootstrap code if necessary.
Linker scripts are still needed for some purposes but they can be far simpler. e.g. to put libc into external memory it's easiest just to use a linker script to rewrite the target sections appropriately. It's probably worth using it to order the sections a bit more sanely and strip out some of the weird section names.
The real bummer about all this is that gdb wont work as is on any binary loaded in this manner - there are no programe headers and the load addresses of every section are dynamically assigned at runtime (although this doesn't necessarily need to hold for the on-core code, it's necessary for any external code).
TBH I don't like the idea of trying to run gdb on a 64-core chip anyway and i've managed with printf's (or raster bars!) and hard system resets ever since I started coding so I can probably deal with it personally.
My intention isn't to create an alternative sdk anyway just to experiment with some low-level code for a change.
After too many probing questions on the parallella forums Andreas of Adapteva has released an early draft of an updated epiphany handbook which has some pretty interesting newly documented stuff in it. Some of the cooler features include a separate oob signal bus capable of work-group-wide signalling and synchronisation, and a broadcast write feature. The latter is will be very useful for loading the same code on multiple cores for instance, and the former for some multi-core primitives like barrier.
Looking forward to exploring that stuff ... when i finally get to it.