So I veered off on another tangent ... just how to implement the software instruction cache from the previous post.
Came up with a few 'interesting' preliminary ideas, although they definitely need more work.
function linkage table
Each function call is mapped to an entry in a function linkage table. Each entry is a 16-byte record which contains some code and data:
;; assembly format, some d* macros define the offset of the given type
dstruct
dint32 flt_move ; the move instruction
dint32 flt_branch ; the branch instruction
dint16 flt_section ; section address (or index?)
dint16 flt_offset ; instruction offset
dint16 flt_reserved0 ; padding
dint16 flt_next ; list of records in this section
dend flt_sizeof
The move instruction loads the index of the function itself into scratch register r12, and the branch instruction then jumps either to a routine which loads the section containing the code or to a routine which handles the function call directly; the runtime patches the branch depending on whether the section is resident. Unfortunately, due to nesting it can't just invoke the target function directly. The data fields are needed to implement the rest of the machinery.
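For illustration, here is roughly the same record as a C struct (just a sketch; the first two words are the code the call site jumps through, the rest is data for the runtime):
/* sketch: C view of the 16-byte flt record above */
struct flt_entry {
        unsigned int   flt_move;      /* the 'mov r12,#index' instruction word */
        unsigned int   flt_branch;    /* branch to the hit or miss handler, patched by the runtime */
        unsigned short flt_section;   /* section this function lives in (address or index) */
        unsigned short flt_offset;    /* function offset within the section */
        unsigned short flt_reserved0; /* padding */
        unsigned short flt_next;      /* next flt record in the same section */
};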
section table
Each section table entry is also 16 bytes long and keeps track of the current base address of the section, among other things:
dstruct
dint16 sc_base ; loaded base address or ~0 if not loaded
dint16 sc_flt ; function list for this section
dint16 sc_size ; code size
dint16 sc_reserved0
dint16 sc_next ; LRU list
dint16 sc_prev
dint32 sc_code ; global code location
dend sc_sizeof
section header
To avoid needing an auxiliary sort at garbage collection time, the sections loaded into memory also carry a small header of 8 bytes (sized for alignment).
dstruct
dint32 sh_section ; the section record this belongs to
dint32 sh_size ; the size of the whole section + data
dend sh_sizeof
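And the equivalent C sketches of the section record and the in-memory header, since these are the shapes the load_section() code further below works with:
/* sketch: C view of the section record and the in-memory section header */
struct section {
        unsigned short sc_base;      /* loaded base address, or ~0 if not loaded */
        unsigned short sc_flt;       /* head of the flt list for this section */
        unsigned short sc_size;      /* code size */
        unsigned short sc_reserved0; /* padding */
        unsigned short sc_next;      /* LRU list links */
        unsigned short sc_prev;
        unsigned int   sc_code;      /* where the code lives in global memory */
};
struct section_header {
        unsigned int sh_section;     /* the section record this block belongs to */
        unsigned int sh_size;        /* size of the whole block, header included */
};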
Function calls, hit
If the section is loaded ("cache hit"), flt_branch points to a routine which bounces to the actual function and, more importantly, makes sure the calling function's section is loaded in memory before returning to it; this is the tricky bit.
Approximate algorithm:
rsp = return stack pointer
ssp = section pointer
docall(function, lr)
        ; save return address and current section
        rsp.push(lr - ssp.sc_base);
        rsp.push(ssp);
        ; get section and calc entry point
        ssp = function.flt_section
        entry = ssp.sc_base + function.flt_offset
        ; set rebound handler
        lr = docall_return
        ; call function
        goto entry

docall_return()
        ; restore ssp
        ssp = rsp.pull();
        ; if its section is still loaded, return to it (possibly at a different location)
        if (ssp.sc_base != -1) {
                lr = rsp.pull() + ssp.sc_base;
                goto lr;
        }
        ; otherwise the calling section must be loaded back in first
        load_section(ssp)
        ; return to it
        lr = rsp.pull() + ssp.sc_base;
        goto lr;
I think I have everything there. It's fairly straightforward if a little involved.
If the target is in the same section it could avoid most of this work, but the linker won't generate such code unless the code uses function pointers. The function reference loaded by the stub (the flt record) might just be an id (supporting 65K functions) or an address (i.e. all on-core functions).
I have a preliminary shot at it which adds about 22 cycles to each cross-section function call in the case where the section is already present.
Function calls, miss
If the section for the function is not loaded, then the branch will instead point to a stub which loads the section first before basically invoking docall() itself.
doload(function, lr)
        ; save return address and current section
        rsp.push(lr - ssp.sc_base);
        rsp.push(ssp);
        ; load in section
        ssp = function.flt_section
        load_section(ssp);
        ; calculate function entry
        entry = ssp.sc_base + function.flt_offset
        ; set rebound handler (same)
        lr = docall_return
        ; call function
        goto entry
load_section() is where the fun starts.
Cache management
So I thought of a couple of ways to manage the cache but settled on a solution which uses garbage collection and movable objects. This ensures every possible byte is available for function use, and I'm pretty certain it will take less code to implement.
This is where the sh struct comes into play: the cache needs both an LRU-ordered list and a memory-location-ordered list, and this is the easiest way to implement it.
Anyway, I've written up a bit of C to test the idea and I'm pretty sure it's sound. It's fairly long but each step is simple. I'm using AmigaOS/exec linked lists as they('re cool and) fit this type of problem well.
loader_ctx {
        list loaded;    ; sections currently resident, in LRU order (head = most recent)
        list unloaded;  ; sections not currently resident
        int alloc;      ; current allocation offset within the code area
        int limit;      ; total size of the code area
        void *code;     ; base of the on-core code area
} ctx;
load_section(ssp) {
        needed = ctx.limit - ssp.sc_size - sizeof(sh);
        if (ctx.alloc > needed) {
                ; not enough room - garbage collect based on LRU order
                ; scan LRU ordered list for sections which still fit
                used = 0;
                wnode = ctx.loaded.head;
                nnode = wnode.next;
                while (nnode) {
                        nused = wnode.sc_size + used + sizeof(sh);
                        if (nused > needed)
                                break;
                        used = nused;
                        wnode = nnode;
                        nnode = nnode.next;
                }
                ; mark the rest as removed
                while (nnode) {
                        wnode.sc_base = -1;
                        ;; fix all entry points to "doload"
                        unload_functions(wnode.sc_flt);
                        wnode.remove();
                        ctx.unloaded.addhead(wnode);
                        wnode = nnode;
                        nnode = nnode.next;
                }
                ; compact anything left, in address order
                src = 0;
                dst = 0;
                while (dst < used) {
                        sh = ctx.code + src;
                        wnode = sh.sh_section;
                        size = sh.sh_size;
                        ; has been expunged, skip it
                        if (wnode.sc_base == -1) {
                                src += size;
                                continue;
                        }
                        ; move if necessary
                        if (src != dst) {
                                memmove(ctx.code + dst, ctx.code + src, size);
                                wnode.sc_base = ctx.code + dst + sizeof(sh);
                        }
                        src += size;
                        dst += size;
                }
                ; allocation now resumes at the end of the compacted code
                ctx.alloc = dst;
        }
        ; load in new section
        ;; create section header
        sh = ctx.code + ctx.alloc;
        sh.sh_section = ssp;
        sh.sh_size = ssp.sc_size + sizeof(sh);
        ;; allocate section memory
        ssp.sc_base = ctx.code + ctx.alloc + sizeof(sh);
        ctx.alloc += ssp.sc_size + sizeof(sh);
        ;; copy in code from global shared memory
        memcpy(ssp.sc_base, ssp.sc_code, ssp.sc_size);
        ;; fix all entry points to "docall"
        load_functions(ssp.sc_flt);
        ;; move to loaded list
        ssp.remove();
        ctx.loaded.addhead(ssp);
}
The last couple of lines could also be used at each function call to keep the section LRU list in order, which is probably worth the extra overhead. Because the LRU order is only used to decide what to expunge, and the memory order is used for packing, it doesn't seem to need to move code very often - which is obviously desirable. It might look like a lot of code, but considering this is all that is required in totality it isn't that much.
The functions load_functions() and unload_functions() just set a synthesised branch instruction in the function stubs as appropriate.
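A sketch of what that might look like in C, using the flt_entry sketch from earlier and assuming flt_next is an index into a global flt table with ~0 terminating the list; synth_branch() is a hypothetical helper that encodes the Epiphany relative branch:
extern struct flt_entry flt_table[];      /* hypothetical: the whole flt table */
extern void docall(void), doload(void);   /* runtime entry points (really asm) */
extern unsigned int synth_branch(const void *slot, void (*target)(void));

/* patch every stub in one section so its branch lands on either the
   resident-case handler (docall) or the load-first handler (doload) */
static void patch_functions(unsigned short flti, void (*target)(void))
{
        while (flti != 0xffff) {
                struct flt_entry *flt = &flt_table[flti];

                flt->flt_branch = synth_branch(&flt->flt_branch, target);
                flti = flt->flt_next;
        }
}
/* load_functions(s)   == patch_functions(s->sc_flt, docall)
   unload_functions(s) == patch_functions(s->sc_flt, doload) */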
Worth it?
Dunno, and it's quite a lot of work to try out: all the code above basically needs to be in assembly language for various reasons. The loader also needs to create all the data structures required, namely the flt table, the section table, and the section blocks themselves. And ... there needs to be some more relocation work done when sections are loaded or moved if they use relative relocs (i.e. -mshort-calls), not that this is particularly onerous mind you.
AmigaOS exec Lists
The basic list operations for an exec list are always efficient, but they turn out to be particularly compact in Epiphany code if the node is guaranteed to be 8-byte aligned, which it should be due to the ABI.
For example, node.remove() is only 3 instructions:
; r0 = node pointer
ldrd r2,[r0] ; r2 = n.next, r3 = n.prev
str r2,[r3] ; n.prev.next = n.next
str r3,[r2,#1] ; n.next.prev = n.prev
The good thing about exec lists is that they don't need the list header to remove a node, thanks to some (possibly) modern-C-breaking address-aliasing tricks; assembly has no such problems.
list.addHead() is only 4 instructions if you choose your registers wisely (5 otherwise):
; r3 = list pointer
; r0 = node pointer
ldr r2,[r3] ; l.head
strd r2,[r0] ; n.next = l.head, n.prev = &l.head
str r0,[r2,#1] ; l.head.prev = n
str r0,[r3] ; l.head = n
By design, &l.head == l.
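In C, the same overlapping-header layout looks something like this; a minimal sketch of an exec-style list, not the real exec definitions, and it leans on exactly the aliasing trick mentioned above:
/* minimal exec-style list: the header doubles as two overlapping
   sentinel nodes, so removal never needs the list header */
struct node {
        struct node *next;
        struct node *prev;
};
struct list {
        struct node *head;     /* first real node, or &list->tail if empty */
        struct node *tail;     /* always NULL; terminates forward iteration */
        struct node *tailpred; /* last real node, or &list->head if empty */
};

static void list_init(struct list *l)
{
        l->head = (struct node *)&l->tail;
        l->tail = NULL;
        l->tailpred = (struct node *)&l->head;
}

static void node_remove(struct node *n)
{
        n->prev->next = n->next;  /* if n was first, this updates l->head */
        n->next->prev = n->prev;  /* if n was last, this updates l->tailpred */
}

static void list_addhead(struct list *l, struct node *n)
{
        n->next = l->head;
        n->prev = (struct node *)&l->head;  /* &l->head == l */
        l->head->prev = n;
        l->head = n;
}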
Unfortunately the list iteration trick of going through d0 loads (which set the condition codes directly) on the m68k doesn't translate quite as nicely, but it's still better than using a container list and still only uses 2 registers for the loop:
; iterate whole list
; r2 = list header
ldr r0,[r2] ; wn = l.head (work node)
ldr r1,[r0] ; nn = wn.next (next node)
sub r1,r1,0 ; while (nn) {
beq 2f
1:
; r0 is the node pointer and free to be removed from the list/reused
; the node pointer is also one's 'data' so it doesn't need an additional de-reference
mov r0,r1 ; wn = nn
ldr r1,[r1] ; nn = nn.next
sub r1,r1,0
bne 1b
2:
Again the implementation has a hidden bonus in that the working node can be removed and moved to another list or freed without changing the loop construct or requiring additional registers.
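In C, the same safe-iteration idiom with the sketch types above might look like this (move_matching() and match() are made-up names, just to show the shape):
/* walk an exec-style list, moving every node that matches some predicate
   onto another list; safe because nn is fetched before wn is touched */
static void move_matching(struct list *from, struct list *to,
                          int (*match)(struct node *))
{
        struct node *wn, *nn;

        for (wn = from->head; (nn = wn->next) != NULL; wn = nn) {
                if (match(wn)) {
                        node_remove(wn);
                        list_addhead(to, wn);
                }
        }
}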
For comparison, the same operation on the m68k (devpac syntax might be rusty, it's been 20 years):
; a0 = list pointer
move.l (a0),a1 ; wn = l.head (work node)
move.l (a1),d0 ; nn = wn.next (next node)
beq.s .done ; while (nn) {
.loop:
; a1 is node pointer, free to be removed from list/reused
move.l d0,a1 ; wn = nn
move.l (a1),d0 ; nn = wn.next
bne.s .loop
.done:
See lists.c from puppybits for a C implementation.