Friday, 18 April 2014

More thoughts on the software code cache

So I veered off on another tangent ... just how to implement the software instruction cache from the previous post.

Came up with a few 'interesting' preliminary ideas although it definitely needs more work.

function linkage table

Each function call is mapped to a function linkage table. This is a 16-byte record which contains some code and data:

 ;; assembly format, some d* macros define the offset of the given type

        dstruct
        dint32  flt_move        ; the move instruction
        dint32  flt_branch      ; the branch instruction
        dint16  flt_section     ; section address (or index?)
        dint16  flt_offset      ; instruction offset
        dint16  flt_reserved0   ; padding
        dint16  flt_next        ; list of records in this section
        dend    flt_sizeof

The move instruction loads the index of the function itself into scratch register r12 and then either branches to a function which loads the section that contains the code, or a function which handles the function call itself. The runtime manages this. Unfortunately due to nesting it can't just invoke the target function directly. The data is needed to implement the functionality.

section table

The section table is also 16 bytes long and keeps track of the current base address of the section

        dstruct
        dint16  sc_base         ; loaded base address or ~0 if not loaded
        dint16  sc_flt          ; function list for this section
        dint16  sc_size         ; code size
        dint16  sc_reserved0
        dint16  sc_next         ; LRU list
        dint16  sc_prev
        dint32  sc_code         ; global code location
        dend    sc_sizeof

section header

To avoid the need of an auxiliary sort required at garbage collection time, the sections loaded into memory also have a small header of 8 bytes (for alignment).

        dstruct
        dint32  sh_section      ; the section record this belongs to
        dint32  sh_size         ; the size of the whole section + data
        dend    sh_sizeof

Function calls, hit

If the section is loaded ("cache hit") flt_branch points to a function which bounces to the actual function call and more importantly makes sure the calling function is loaded in memory before returning to it, which is the tricky bit.

Approximate algorithm:

rsp = return stack pointer
ssp = section pointer

docall(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp.sc_base));
   rsp.push(ssp);

   ; get section and calc entry point   
   ssp = function.flt_section
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler
   lr = docall_return

   ; call function
   goto entry

docall_return()
   ; restore ssp
   ssp = rsp.pull();

   ; if still here, return to it (possibly different location)
   if (ssp.sc_base) {
      lr = rsp.pull() + ssp.sc_base;
      goto lr;
   }

   ; must load in section
   load_section(ssp)

   ; return to it
   lr = rsp.pull() + ssp.sc_base;
   goto lr;

I think I have everything there. It's fairly straightforward if a little involved.

If the section is the same it could avoid most of the work but the linker wont generate such code unless the code uses function pointers. The function loaded by the stub (flt record) might just be an id (support 65K functions) or an address (i.e. all on-core functions).

I have a preliminary shot at it which adds about 22 cycles to each cross-section function call in the case the section is present.

Function calls, miss

If the section for the function is not loaded, then the branch will instead point to a stub which loads the section first before basically invoking docall() itself.

doload(function, lr)
   ; save return address and current section
   rsp.push((lr-ssp));
   rsp.push(ssp);

   ; load in section
   ssp = function.flt_section
   load_section(ssp);

   ; calculate function entry
   entry = ssp.sc_base + function.flt_offset

   ; set rebound handler (same)
   lr = docall_return

   ; call function
   goto entry  

load_section() is where the fun starts.

Cache management

So I thought of a couple of ways to manage the cache but settled on a solution which uses garbage collection and movable objects. This ensures every byte possible is available for function use and i'm pretty certain will take less code to implement.

This is where the sh struct comes in to play - the cache needs both an LRU ordered list and a memory-location ordered list and this is the easiest way to implement it.

Anyway i've written up a bit of C to test the idea and i'm pretty sure it's sound. It's fairly long but each step is simple. I'm using AmigOS/exec linked lists as they('re cool and) fit this type of problem well.

loader_ctx {
  list loaded;
  list unloaded;
  int alloc;
  int limit;
  void *code;
} ctx;

load_section(ssp) {
   needed = ctx.limit - ssp.sc_size - sizeof(sh);

   if (ctx.alloc > needed) {
      ; not enough room - garbage collect based on LRU order

      ; scan LRU ordered list for sections which still fit
      used = 0;
      wnode = ctx.loaded.head;
      nnode = wnode.next;
      while (nnode) {
         nused = wnode.sc_size + used + sizeof(sh);

         if (nused > needed)
            break;

         used = nused;

         wnode = nnode;
         nnode = nnode.next;
      }

      ; mark the rest as removed
      while (nnode) {
         wnode.sc_base = -1;

         ;; fix all entry points to "doload"
         unload_functions(wnode.sc_flt);

         wnode.remove();
         ctx.unloaded.addhead(wnode);

         wnode = nnode;
         nnode = nnode.next;
      }

      ; compact anything left, in address order
      src = 0;
      dst = 0;
      while (dst < used) {
         sh = ctx.code + src;
         wnode = sh.sh_section;
         size = sh.sh_size;

         ; has been expunged, skip it
         if (wnode.sc_base == -1) {
            src += size;
            continue;
         }

         ; move if necessary
         if (src != dst) {
            memmove(ctx.code + dst, ctx.code + src, size);
            wnode.sc_base = dst + sizeof(sh);
         }

         src += size;
         dst += size;
      }
   }

   ; load in new section
   ;; create section header
   sh = ctx.code + ctx.alloc;
   sh.sh_section = ssp;
   sh.sh_size = ssp.size + sizeof(sh);

   ;; allocate section memory
   ssp.sc_base = ctx.alloc + sizeof(sh);
   ctx.alloc += ssp.size + sizeof(sh);

   ;; copy in code from global shared memory
   memcpy(ssp.sc_base, ssp.sc_code, ssp.sc_size);

   ;; fix all entry points to "docall"
   load_functions(ssp.sc_flt);

   ;; move to loaded list
   ssp.remove();
   ctx.loaded.addhead(ssp);   
}

The last couple of lines could also be used at each function call to ensure the section LRU list is correct, which is probably worth the extra overhead. Because the LRU order is only used to decide what to expunge and the memory order is used for packing it doesn't seem to need to move functions very often - which is obviously desirable. It might look like a lot of code but considering this is all that is required in totality it isn't that much.

The functions load_functions() and unload_functions() just set a synthesised branch instruction in the function stubs as appropriate.

Worth it?

Dunno and It's quite a lot of work to try it out - all the code above basically needs to be in assembly language for various reasons. And the loader needs to create all the data-structures needed as well, which is the flt table, the section table, and the section blocks themselves. And ... there needs to be some more relocation stuff done if the sections use relative relocs (i.e. -mshort-calls) when they are loaded or moved - not that this is particularly onerous mind you.

AmigaOS exec Lists

The basic list operations for an exec list are always efficient but turn out to be particularly compact in epiphany code, if the object is guaranteed to be 8-byte aligned which it should be due to the abi.

For example, node.remove() is only 3 instructions:

; r0 = node pointer
        ldrd    r2,[r0]         ; n.next, n.prev
        str     r2,[r3]         ; n.prev.next = n.next
        str     r3,[r2,#1]      ; n.next.prev = n.prev

The good thing about exec lists is that they don't need the list header to remove a node due to some (possibly) modern-c-breaking address-aliasing tricks, but asm has no such problems.

list.addTail() is only 4 instructions if you choose your registers wisely (5 otherwise).

; r3 = list pointer
; r0 = node pointer
        ldr     r2,[r3]         ; l.head
        strd    r2,[r0]         ; n.next = l.head, n.prev = &l.head
        str     r0,[r2,#1]      ; l.head.prev = n
        str     r0,[r3]         ; l.head = n

By design, &l.head == l.

Unfortunately the list iteration trick of going through d0 loads (which set the cc codes directly) on m68K doesn't translate quite as nicely, but it's still better than using a container list and still only uses 2 registers for the loop:

; iterate whole list
; r2 = list header
        ldr     r0,[r2]         ; wnhead  (work node)
        ldr     r1,[r0]         ; nn = wn.next (next node)
        sub     r1,r1,0         ; while (nn) {
        beq     2f
1:
        ; r0 is node pointer and free to be removed from list/reused
        ; node pointer is also ones 'data' so doesn't need an additional de-reference

        mov     r0,r1           ; wn = nn
        ldr     r1,[r1]         ; nn = nn.next
        sub     r1,r1,0
        bne     1b
2:

Again the implementation has a hidden bonus in that the working node can be removed and moved to another list or freed without changing the loop construct or requiring additional registers.

For comparison, the same operation using m68K (devpac syntax might be out, been 20 years):

; a0 = list pointer
        mov.l   (a0),a1         ; wn = l.head  (work node)
        mov.l   (a1),d0         ; nn = wn.next (next node)
        beq.s    2f             ; while (nn) {
.loop:
        ; a1 is node pointer, free to be removed from list/reused

        mov.l   d0,a1           ; wn = nn
        mov.l   (a1),d0         ; nn = wn.next
        bne.s   .loop
2:        

See lists.c from puppybits for a C implementation.

No comments: