Saturday, 19 July 2014

instruction matching

This was an interesting little diversion.

One of the requirements of a disassembler or code translator is to work out from the machine code what instruction is at a given address. Most instruction sets are variable in size so it has to determine that as well.

Depending on how the instruction set was designed this can either be a simple table lookup, ... or get somewhat more involved. It usually can't just be all a simple lookup though as there are usually some instructions that want to use most of the bits for data.

Linear

The current implementation in ezetool uses a simple linear search. For each instruction it sees if all the selector bits for the instruction match the test bits. This is trivial:

  boolean match = (select_mask & opcode) == select_bits;
It has to be executed up to two times because it doesn't know the instruction size yet. Bit 3 has that information for many instructions but not all.

Linear split

An obvious improvement that I didn't have time for initially is to split the search into two separate lists. One for 16-bit instructions and one for 32-bit instructions. Because it just Has To Be, all 32-bit instructions will be different in their first 16-bits to all 16-bit instructions so the lists can be separated.

This amounts to a pre-indexing of the instructions based on their size and lead to over 100% performance increase.

List of Hash Tables

A hashtable can't be used directly because you don't know in advance which of the bits are significant.

This can be shown by displaying the instructions which use the same selector bits as i'm using them. Some of the 'selector' bits here are separating instructions into different addressing modes which could be handled in a different way (the high-bits of the ldr and mov instructions) but this is the way i've written it so far.

 0000000f: b b  (16/32 bit variants)
 0000001f: ldr str ldr str ldr str mov lsr lsl asr bitr
 0000007f: add sub add sub add sub and orr eor asr lsr lsl fadd fsub fmul fmadd fmsub
 0000030f: mov
 000003ff: float fix fabs movts movfs jr jalr gie gid nop idle bkpt mkpt sync rti wand trap
 000f001f: lsr lsl asr bitr
 000f007f: add sub and orr eor asr lsr lsl fadd fsub fmul fmadd fmsub
 000f030f: mov
 000f03ff: float fix fabs movts movfs jr jalr
 0200001f: ldr str ldr str
 0060001f: ldr str testset ldr str
 1000001f: mov movt

A solution here would perform a linear search across all select-bit combinations, and each of those would be accessed via a hash table. Given the simplicity of the comparison test though it may as well just be a linear search.

I didn't try implementing this but as it removes the smask de-reference outside of the inner loop it may be ok.

Radix/index M-tree

Looking at the output above it is clear that at least 0x0f is used as a selector bit in every instruction. This can be used to create a first-level index and reduce the search space.

Indexing by the first nybble:

  0 [ 1]: b
  1 [ 2]: ldr str
  2 [15]: mov movts movfs jr jalr gie gid nop idle bkpt mkpt sync rti wand trap
  3 [ 3]: mov add sub
  4 [ 2]: ldr str
  5 [ 2]: ldr str
  6 [ 2]: lsr lsl
  7 [ 8]: fadd fsub fmul fmadd fmsub float fix fabs
  8 [ 1]: b
  9 [ 3]: ldr str testset
  a [ 8]: add sub and orr eor asr lsr lsl
  b [ 4]: mov movt add sub
  c [ 4]: ldr str ldr str
  d [ 2]: ldr str
  e [ 2]: asr bitr
  f [25]: lsr lsl asr bitr add sub and orr eor asr lsr lsl fadd fsub fmul fmadd
          fmsub float fix fabs mov movts movfs jr jalr

Well, it's simple ... but not very effective.

I did some analysis of the longer sets above and found most instructions use either 0x70 or 0xf0 as selector bits so a further level of m-tree can be added for those. For the others I just dump them into a linear search.

This reduces the worst-case to:

  f:   lsr lsl asr bitr mov
    0:  eor fadd movts
    1:  add fsub movfs
    2:  lsl fmul
    3:  sub fmadd
    4:  lsr fmsub jr
    5:  and float jalr
    6:  asr fix
    7:  orr fabs

Which involves: a direct radix-index based on the first 4 bits, a 2 or 3 element linear search from a 3-bit radix, and a 5-element linear search across the 'leftovers'. This turned out to be very fast - about 6x faster than the linear search, but did require a bit of human input for the tree sizes.

Radix/sparse tree

I'm not really sure what to call this but i guess its a sort of radix search but with a sparse tree. I was seeing if i could fully automate the tree building.

Each significant bit is indexed from the lsb upwards. Each tree node has a list of child nodes which specify which bit and bit-value they correspond to. The first 4 levels of the tree are fully filled but as the selector bits change it becomes a sparse tree.

My initial naive thought was that this could find a solution in one pass like a huffman code but of course this isn't the case: it still has to perform a fairly wide search because, again, it doesn't know which bits are significant (if it did, then it would work in one pass).

I implemented this as a 4-bit radix (i.e. index) into 16 separate trees. It was faster than a linear search but not as fast as the split-linear. I didn't think it was worth the effort to try to optimise the tree so that it would resemble the indexed m-tree.

8-bit index

So at this point I had a pretty good/quick implementation but wondered if i could get any better.

I tried using 8 bits for the first-stage index rather than 4. This complicates matters a little bit because many (most?) instructions don't use the first 8-bits as a selector which means they will alias to multiple locations. The solution: just alias them. So for example 'b' uses only 4 bits as a selector so one ends up with 16 copies.

Here's a part of the table.

   0: b
   1: ldr
   2: mov movts
   3: mov
   4: ldr
   5: ldr
   6: lsr
   7: fadd
   8: b
   9: ldr testset
  10: eor
  11: mov movt
  12: ldr ldr
  13: ldr
  14: asr
  15: lsr asr eor fadd mov movts

The worst case (over the whole table) is a single 8-bit indexed lookup followed by 6-element linear search.

Simple code, needs moderate space, and runs really fast ... well most of the time. At least in Java it has some strange cache-related effects so depending on the memory layout its runtime varies by nearly 100%.

Since most instructions don't use all 8-bits as a selector i also tried using 7 or 6 bits. For 7 bits the maximum run is still 6: i.e. well wasn't using 8 silly but with the half the index size. For 6-bits a couple of the runs got slightly longer but it takes half of the index size again.

Micro-Optimisations

I also looked at a bunch of micro-optimisations to the data-structures.

For the linear search, rather than iterating through a list of objects which require a dereference, iterate through a single array of integers which contain the selector mask and bits together. This could be applied anywhere a linear search is although other cases also need to include the instruction index.

Very slight improvement - Java seems good at object dereferences but it might be applicable to c.

I also tried flattening one of the array of lists implementations into a single list of integers. The first N elements are just indices into the array and then the structure at each array is a count followed by { mask, value, index } triplets. Well Java didn't really like this so it wasn't an improvement.

Numbers

I realised all implementations require two lookups so I did timed that. I didn't bother timing the ones which weren't competitive for a single lookup pass.

This is for 100 000 iterations of looking up every instruction in order (84 using my splits). So even the slowest implementation is only 120ns 18ns in Java on a Kaveri CPU. In all cases any lists required during building were collapsed to arrays of exactly the right size - arrays are much smaller than lists and iterate faster.

  time    memory               alg
  1.013                     0  linear search
  0.449                     2  linear split search
  0.157   (2+64*2)*2+84   344  0x3f index + linear search
  0.153   (2+128*2)*2+84  600  0x7f index + linear search
  0.180   6*(2+6)+72+84   204  multi-stage radix + linear search

The memory is the approximate extra words required to store the indices. An array is counted as 2 words + space for the contents (length+pointer+data). The multi-stage thing needs the bit number/mask and two arrays and I might have an error there.

Based on that I would probably go with the 0x3f indexed version. The building of the indices is easier than the multi-stage radix algorithm and it doesn't require an additional object (it just uses a 2d array) and memory requirements are modest compared to the 0x7f version with much the same runtime.

Still, it's only about 2.8x faster which is surprising given that the search is an order of magnitude less work, between 0 and 9 linear steps compared to 84 (or on average 4.5 vs 42). The benchmark is probably just not a good one.

Addendum

So there is a very cheap way to determine if the instruction is 16-bits or 32-bits. The first nybble is enough to determine this - i vaguely recall something like that but for some reason I thought it required a bigger table.

boolean isShortInstruction(int opcode) {
   return ((0x44ff >> (opcode&0x0f)) & 1) == 1;
}

I added it and re-ran the benchmarks. It makes a measurable but pretty insignificant difference to the indexed implementations. The lookup is already fast enough that loop overheads and other scaffolding must be the dominating factor.

The linear search gained a lot because for a 32-bit instruction all 16-bit instructions had to be scanned before you could determine it wasn't one of them.

  time    memory               alg
  1.013                     0  linear search (original)
  0.296                     2  split linear split search
  0.156   (2+64*2)*2+84   344  split 0x3f index + linear search
  0.149   (2+128*2)*2+84  600  split 0x7f index + linear search

TBH the linear search is fast for the purposes of a command-line tool but I rolled these improvements into the implementation anyway.

No comments: