I posted a simple ELF loader for the Parallella to the boards, but here's another link to it: elf-loader-0.0.tar.gz
This only handles the single-core case with nothing 'extra'; it still requires special linker scripts and other foo.
It was really just a test to check some assumptions and fortunately they checked out.
I spent a few hours the night before trying to nut out a multi-core relocating loader. Although I think I could pull it off, it's kind of messy and it can't be made to do what I want.
The problem is that the linker pulls in all the library functions statically. I can relocate per-core code onto the individual cores, but I would also need to copy every static library function used by any core, and that just isn't going to be good enough given the tight memory.
What I really want is to have the linker create a specific binary for each core, ... but somehow handle multiple cores too. The problem is that the linker will either whine or create redundant copies of any shared data structures.
So this morning I had a bit of an epiphany (pun intended): use weak symbols. I can then declare 'undefined' labels and just resolve them at load time. It does force me to use relocatable code, but I need to anyway. Resolving and relocating a few undefined labels isn't that difficult, and it's easy to tell if you failed.
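To make the load-time resolution step concrete, here is a minimal sketch of what the loader side might look like. It assumes the loader has already collected, from every executable in the workgroup, a table of defined symbols, and that each executable's undefined (weak) references are listed separately. The `struct sym` record, `lookup` and `resolve_weak` are hypothetical names, not part of the posted elf-loader, and the record is a simplified stand-in rather than the real ELF symbol layout.

```c
#include <stdio.h>
#include <string.h>

/* Core-side code would declare a reference like (GCC syntax):
 *   extern int shared_queue[] __attribute__((weak));
 * and leave it undefined in the -r linked object. */

/* Hypothetical loader-side symbol record (not Elf32_Sym). */
struct sym {
    const char *name;
    unsigned int value;   /* address once resolved */
    int defined;          /* non-zero once the address is known */
};

/* Search the definitions exported by every executable in the
 * workgroup. Returns 1 and stores the address in *out if found. */
static int lookup(const struct sym *defs, size_t ndefs,
                  const char *name, unsigned int *out)
{
    for (size_t i = 0; i < ndefs; i++) {
        if (defs[i].defined && strcmp(defs[i].name, name) == 0) {
            *out = defs[i].value;
            return 1;
        }
    }
    return 0;
}

/* Resolve each undefined reference of one executable.
 * Returns the number of references left unresolved, so a
 * failed load is trivial to detect. */
static int resolve_weak(struct sym *refs, size_t nrefs,
                        const struct sym *defs, size_t ndefs)
{
    int missing = 0;
    for (size_t i = 0; i < nrefs; i++) {
        if (refs[i].defined)
            continue;
        if (lookup(defs, ndefs, refs[i].name, &refs[i].value)) {
            refs[i].defined = 1;
        } else {
            fprintf(stderr, "unresolved: %s\n", refs[i].name);
            missing++;
        }
    }
    return missing;
}
```

A linear search is fine for the handful of cross-core labels expected here; a larger workgroup might warrant a hash table.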
That got me thinking a bit further: the way the linker is currently used isn't really that useful on the Epiphany, because it doesn't have an MMU. You can write code that sits on one core, or the same code that runs on multiple cores, but if you want different programs running on multiple cores, each using private global memory, you're stuck with the huge headache of manually assigning private memory ranges in the linker script! That might be acceptable for embedded programmers, but yeah, it's way too Commodore 64 for me these days.
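The manual range assignment is the kind of bookkeeping a loader could do instead. As a minimal sketch, assuming a single external memory window described by a base and size, a simple bump allocator can hand each core's private data a non-overlapping block at load time. `struct ext_pool` and `ext_alloc` are hypothetical names used only for illustration.

```c
#include <stdint.h>

/* Hypothetical description of an external memory window. */
struct ext_pool {
    uint32_t base;   /* first usable address */
    uint32_t size;   /* bytes available */
    uint32_t used;   /* bytes handed out so far */
};

/* Round v up to a multiple of align (align must be a power of two). */
static uint32_t align_up(uint32_t v, uint32_t align)
{
    return (v + align - 1) & ~(align - 1);
}

/* Carve a private, non-overlapping block of len bytes out of the pool.
 * Returns 0 and writes the block address to *addr, or -1 if the
 * pool is exhausted. */
static int ext_alloc(struct ext_pool *p, uint32_t len, uint32_t align,
                     uint32_t *addr)
{
    uint32_t off = align_up(p->used, align);
    if (off > p->size || len > p->size - off)
        return -1;
    *addr = p->base + off;
    p->used = off + len;
    return 0;
}
```

Calling `ext_alloc` once per core for each private section gives every instance its own range, with no addresses baked into a linker script.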
But AmigaOS never had a problem running multiple code blocks: it just relocated everything, even sparsely if that happened to be the case. Simply ignoring all the ELF features designed for MMU-based systems makes a lot more sense than trying to shoehorn them to fit. So does finding a better approach than a compile-time, linker-defined memory map.
These are my first thoughts on how to solve this problem. I drop the notion of being able to target multiple cores from within one file, but I get a pile of benefits:
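For comparison, the AmigaOS style of relocation is about as simple as loading gets: the image is linked as if it sat at address 0, and a table lists the offsets of every 32-bit word holding an absolute address, so loading means adding the real base to each one (the idea behind `HUNK_RELOC32`). The sketch below assumes a single base for the whole image and reads and writes words in host byte order for simplicity; `relocate` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add the load address `base` to every 32-bit word listed in
 * `offsets` (byte offsets into `image`). Assumes the image was
 * linked at address 0 and uses host byte order. memcpy keeps the
 * accesses safe for unaligned offsets. Returns -1 on a bad offset. */
static int relocate(uint8_t *image, size_t size,
                    const uint32_t *offsets, size_t count, uint32_t base)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t word;
        if (offsets[i] > size || size - offsets[i] < 4)
            return -1;
        memcpy(&word, image + offsets[i], 4);
        word += base;
        memcpy(image + offsets[i], &word, 4);
    }
    return 0;
}
```

Real Amiga hunks could each be placed independently, with one relocation list per target hunk; this sketch collapses that to one base.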
- A single executable will map to a single core.
- The same executable can be loaded onto multiple cores.
- Multiple executables can be defined for loading a whole work-group.
- Code is linked using -r to create a relocatable object. This is the only toolchain-specific option required.
- Any code can reference code or data in other cores using weak references.
- Platform specifics like the stack base or end of the bss section are also weak references.
- The loader relocates the on-core data onto the core directly.
- The loader could potentially automatically spill sections to external memory if they don't fit.
- The loader resolves all weak references at the workgroup level.
- Specific section names are used for global or private external memory.
- Platform specifics like the ISR table offset and the total on-core RAM can be handled in the loader.
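A rough sketch of how the loader might place sections onto a core, keeping the platform numbers out of the linker script: pack sections into local RAM after a reserved area for the interrupt table, and mark anything that doesn't fit for spilling to external memory. The RAM size and reserved byte count below are assumptions for illustration only (check the actual chip), and `struct section` and `place_sections` are hypothetical names. This simple first-fit pass doesn't reserve stack space.

```c
#include <stdint.h>

/* Assumed platform numbers; a real loader would take these from
 * the target description rather than hard-coding them. */
#define CORE_RAM_SIZE 0x8000u  /* assumed 32 KB of on-core RAM */
#define IVT_RESERVED  0x40u    /* assumed bytes kept free for the ISR table */

struct section {
    const char *name;
    uint32_t size;
    uint32_t align;   /* power of two */
    int on_core;      /* set by place_sections */
    uint32_t addr;    /* core-local address when on_core */
};

static uint32_t round_up(uint32_t v, uint32_t a)
{
    return (v + a - 1) & ~(a - 1);
}

/* Place sections in order into on-core RAM; any section that does
 * not fit is marked for external memory. Returns how many spilled. */
static int place_sections(struct section *s, int n)
{
    uint32_t next = IVT_RESERVED;
    int spilled = 0;
    for (int i = 0; i < n; i++) {
        uint32_t at = round_up(next, s[i].align);
        if (at <= CORE_RAM_SIZE && s[i].size <= CORE_RAM_SIZE - at) {
            s[i].on_core = 1;
            s[i].addr = at;
            next = at + s[i].size;
        } else {
            s[i].on_core = 0;
            spilled++;
        }
    }
    return spilled;
}
```

Spilled sections would then get an external address from whatever allocator manages the global and private external regions.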
I still have to nut out some C runtime issues and where the weak references' concrete definitions live, but I'm pretty confident this will work and, moreover, that it is a good solution.
Since all the code is relocatable, there's no reason to use position-independent code (which the compiler doesn't support yet), and there's no reason to include platform-specific details in the linker script. The linker script can be very simple; it's only used to partition sections between on-core and off-core memory.
Note that even though the code is statically linked, this allows multiple separate instances of the C (or other) library to be loaded across multiple cores. This might waste a bit of space, but it's much easier than dynamic linking and solves all the concurrency issues.
Relocating is more work than not having to, but it's not much work. And if a 7MHz 68000 can do it fast enough to be practical, surely a 1GHz ARM can too.
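With a relocatable object, the loader patches each relocation with the symbol's final address plus an addend (the generic ELF "S + A" form for an absolute 32-bit relocation), where S may be a weak reference that was only resolved at the workgroup level. The sketch below uses a hypothetical, simplified `struct rela` rather than the real `Elf32_Rela` (which packs the symbol index and type into `r_info`), and only handles a plain 32-bit data word in host byte order. Instruction operands on the Epiphany likely carry addresses split into 16-bit halves (mov/movt), which need their own relocation types.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simplified relocation entry. */
struct rela {
    uint32_t offset;  /* byte offset of the word to patch */
    uint32_t sym;     /* index into the resolved symbol address table */
    int32_t  addend;
};

/* Apply 32-bit absolute relocations: word = S + A.
 * sym_addr[] holds final addresses, weak references included.
 * Returns -1 on an out-of-range offset or symbol index. */
static int apply_rela(uint8_t *sec, size_t size,
                      const struct rela *r, size_t n,
                      const uint32_t *sym_addr, size_t nsyms)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].sym >= nsyms || r[i].offset > size || size - r[i].offset < 4)
            return -1;
        uint32_t v = sym_addr[r[i].sym] + (uint32_t)r[i].addend;
        memcpy(sec + r[i].offset, &v, 4);
    }
    return 0;
}
```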
Update: Moved the location of the elf-loader distribution.