It relies on the fact that, as is usually the case, relocs can have an additional fixed pointer-sized offset (the addend) added to them.
Basically the idea is that when you reference an external core, you define a workgroup-relative address by providing an addend which carries the group-relative core address in the upper 12 bits as normal. At load time this external-core link is detected and offset by the workgroup base (which can be dynamic). It should still work cleanly for all the normal cases, like referencing a member of a struct and so on, which is what these relative addends are for.
A few simple macros should make it trivial to use, but in raw code, defining a reference to 'bufferx' on a core in column 1 looks like this:
extern void *bufferx __attribute__ ((weak));
void *refx = (1<<20) + (void *)&bufferx; /* addend selects col 1 */
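As a rough idea of what those macros might look like, here is a minimal sketch. The names CORE_ADDEND and CORE_REF are made up for illustration, and the shift values assume the row=31-26, col=25-20 layout, which may well be the other way around.

```c
#include <stdint.h>

/* Assumed field positions (row in bits 31-26, col in bits 25-20).
 * Swap the shifts if the hardware has them the other way around. */
#define CORE_ROW_SHIFT 26
#define CORE_COL_SHIFT 20

/* Hypothetical macro: build the addend that selects core (row, col). */
#define CORE_ADDEND(row, col) \
    (((uintptr_t)(row) << CORE_ROW_SHIFT) | ((uintptr_t)(col) << CORE_COL_SHIFT))

/* Hypothetical macro: address of `sym` with the core-select addend applied.
 * The char * cast avoids relying on GNU void-pointer arithmetic. */
#define CORE_REF(sym, row, col) \
    ((void *)((char *)&(sym) + CORE_ADDEND(row, col)))
```

With these, the raw declaration above would read `void *refx = CORE_REF(bufferx, 0, 1);`, and the link-loader still sees an ordinary symbol-plus-addend reloc.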
Each case would need special handling in the link-loader for the core-address bits (row=31-26, col=25-20, iirc, or vice-versa; it isn't important here):
- row == 0, col == 0
  - Left alone - remains a local address. Allows for programmatic resolution as I'm currently using via elib.
- row == 0, col != 0
  - Resolves to this.row, this.col +- col (was: group.col + col).
- row != 0, col == 0
  - Resolves to this.row +- row, this.col (was: group.row + row).
- row != 0, col != 0
  - Resolves to this.row +- row, this.col +- col (was: group.row+row, group.col+col).
- Outside of workgroup or chip?
  - Undefined behaviour? Leave it as is? Clamp? Let it resolve as above?
- Matches dram "window" address.
  - Leave it alone.
Where group is the group root, and this is the core on which the code resides. I thought of using the non-zero values as 'this relative', but there aren't enough bits for a signed offset (actually, there are if I use wrap-around ... hmm, interesting thought; the more I think about it, the more I think it's the better solution, otherwise you can't reference 0,x or x,0). Given that a 64x64-core chip w/ 1MB LDS each might be some time away, I could always abuse some of the addressing bits for extra information anyway, but that's probably not a wise idea for the little benefit it might provide.
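To make the 'this relative' wrap-around idea concrete, here is a rough sketch of how the link-loader might resolve one relocated address. The function name resolve_core_ref is hypothetical, the field positions are the same assumed row=31-26, col=25-20 layout, and treating each 6-bit field as an offset that wraps modulo 64 is my reading of "wrap-around", not a settled design. It doesn't attempt the out-of-workgroup or dram window cases.

```c
#include <stdint.h>

#define ROW_SHIFT   26           /* assumed: row in bits 31-26 */
#define COL_SHIFT   20           /* assumed: col in bits 25-20 */
#define FIELD_MASK  0x3fu        /* each core-address field is 6 bits */
#define OFFSET_MASK 0x000fffffu  /* low 20 bits: offset within the core */

/* Hypothetical helper: resolve a 32-bit address for code being loaded onto
 * core (this_row, this_col). A zero row and col means "local" and is left
 * untouched; otherwise each field is added to this core's coordinate
 * modulo 64, so 63 acts as -1, 62 as -2, and so on. */
uint32_t resolve_core_ref(uint32_t addr, uint32_t this_row, uint32_t this_col)
{
    uint32_t row = (addr >> ROW_SHIFT) & FIELD_MASK;
    uint32_t col = (addr >> COL_SHIFT) & FIELD_MASK;

    if (row == 0 && col == 0)
        return addr;                       /* remains a local address */

    row = (this_row + row) & FIELD_MASK;   /* a zero field keeps this.row */
    col = (this_col + col) & FIELD_MASK;   /* a zero field keeps this.col */

    return (row << ROW_SHIFT) | (col << COL_SHIFT) | (addr & OFFSET_MASK);
}
```

Because a zero field leaves that coordinate alone, the three non-local cases in the list collapse into the same two additions, which is part of what makes the wrap-around version attractive.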
This will cover most of the common and useful cases and one can always just fall back to using e_get_global_address() for more complex data-flow topologies.
Unfortunately it requires more processing, because each programme must be (re)linked for each target core rather than being broadcast to all common cores, and these extra overheads might make it less attractive. OTOH it allows for load-time initialisation of data structures and less on-core code.
Hmm, how did it get to midnight. Blah.