After posting the last article I went and had a look at the instruction decoder I was working on. First I was hand-coding it all but then I realised how silly it was so I put it into a simple table. I was going to make a code-generator from that but it's really not necessary.
Here's a tiny bit of the table (it's only 84 lines long anyway). It has 3 fields: instruction name, addressing mode, and bit format.

  ; branches
  b   7 i{7-0},c{3-0},v{0000}
  b   7 i{23-0},c{3-0},v{1000}
  ; load/store
  ldr 8 d{2-0},n{2-0},i{2-0},b{1-0},v{00100}
  str 8 d{2-0},n{2-0},i{2-0},b{1-0},v{10100}
  ; alu
  add 3 d{5-3},n{5-3},m{5-3},x{000},v{1010},d{2-0},n{2-0},m{2-0},v{0011111}
  ; etc.

The bit format just defines the bits in order as they are displayed in the instruction decode table, so they were easy enough to enter.
From this table it's only about 10 lines of code to decode an instruction and not much more to display it - most of it is just handling the different addressing modes (ok it's a lot more but it's all a simple switch statement). It just searches for the instruction that matches all bits in the v{} sections; first in 16-bit instructions and if none are found then reads another 16 bits and looks in the 32-bit instructions. I still have some sign extension stuff to handle properly but here's some example output.
  09: 1 SHT_PROGBITS .text.2
    strd.l r4,[r13],#-2
    strd.l r6,[r13,#+1]
    strd.l r8,[r13,#+0]
    mov.l r12,#0x0000
    ldrd.l r44,[r12,#+0]
    mov.s r4,#0x0001
    ldrd.l r46,[r12,#+1]
    lsl.l r17,r4,r2
    ldrd.l r48,[r12,#+2]
    sub.s r5,r1,r3
    ldrd.l r50,[r12,#+3]
    sub.s r6,r5,#0x0002
    ldrd.l r52,[r12,#+4]
    lsl.s r7,r4,r6

Here's the output from objdump for comparison.
  Disassembly of section .text.2:

  00000000 <_e_build_wtable2>:
     0: 957c 0700   strd r4,[sp],-0x2
     4: d4fc 0400   strd r6,[sp,+0x1]
     8: 147c 2400   strd r8,[sp]
     c: 800b 2002   mov r12,0x0
    10: 906c a400   ldrd r44,[r12,+0x0]
    14: 8023        mov r4,0x1
    16: d0ec a400   ldrd r46,[r12,+0x1]
    1a: 312f 400a   lsl r17,r4,r2
    1e: 116c c400   ldrd r48,[r12,+0x2]
    22: a5ba        sub r5,r1,r3
    24: 51ec c400   ldrd r50,[r12,+0x3]
    28: d533        sub r6,r5,2
    2a: 926c c400   ldrd r52,[r12,+0x4]
    2e: f32a        lsl r7,r4,r6
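Going back to the decoder for a moment, the v{} matching can be sketched roughly like this. It is not the actual code, just a minimal version of the idea. The class and method names (`TableDecoder`, `Pattern`, `Slice`, `decode`) are made up for illustration. It assumes the groups are listed most-significant bit first, that x{} bits are simply ignored, and that a 32-bit instruction is formed with the second halfword in the upper 16 bits. It only knows the handful of rows quoted above.

```java
import java.util.ArrayList;
import java.util.List;

final class TableDecoder {
    /** `width` instruction bits starting at `shift` supply operand `name` from bit `lo` up. */
    static final class Slice {
        final char name; final int shift, width, lo;
        Slice(char name, int shift, int width, int lo) {
            this.name = name; this.shift = shift; this.width = width; this.lo = lo;
        }
    }

    /** One table row; mask and value are built from its v{} groups. */
    static final class Pattern {
        final String mnemonic; final int mode, bits; final long mask, value;
        final List<Slice> slices = new ArrayList<>();

        Pattern(String mnemonic, int mode, String format) {
            this.mnemonic = mnemonic; this.mode = mode;
            String[] groups = format.split(",");
            long m = 0, v = 0;
            int pos = 0;
            // Groups are assumed most-significant first, so walk them backwards from bit 0.
            for (int g = groups.length - 1; g >= 0; g--) {
                char name = groups[g].charAt(0);
                String body = groups[g].substring(2, groups[g].length() - 1);
                int width;
                if (name == 'v' || name == 'x') {      // literal bit string
                    width = body.length();
                    if (name == 'v') {                 // only v{} takes part in matching
                        m |= ((1L << width) - 1) << pos;
                        v |= Long.parseLong(body, 2) << pos;
                    }                                  // x{} is skipped (assumption)
                } else {                               // operand bit range "hi-lo"
                    String[] range = body.split("-");
                    int lo = Integer.parseInt(range[1]);
                    width = Integer.parseInt(range[0]) - lo + 1;
                    slices.add(new Slice(name, pos, width, lo));
                }
                pos += width;
            }
            bits = pos; mask = m; value = v;
        }

        /** Reassemble operand `name` from every slice that mentions it. */
        long field(char name, long word) {
            long out = 0;
            for (Slice s : slices)
                if (s.name == name)
                    out |= ((word >>> s.shift) & ((1L << s.width) - 1)) << s.lo;
            return out;
        }
    }

    /** A matched row together with the instruction word it matched. */
    static final class Decoded {
        final Pattern pattern; final long word;
        Decoded(Pattern pattern, long word) { this.pattern = pattern; this.word = word; }
        long field(char name) { return pattern.field(name, word); }
    }

    static Pattern find(List<Pattern> table, int bits, long word) {
        for (Pattern p : table)
            if (p.bits == bits && (word & p.mask) == p.value) return p;
        return null;
    }

    /** Try the 16-bit rows first, then the 32-bit rows with the second halfword on top. */
    static Decoded decode(List<Pattern> table, int first, int second) {
        long word = first & 0xffffL;
        Pattern p = find(table, 16, word);
        if (p == null) {
            word |= (second & 0xffffL) << 16;
            p = find(table, 32, word);
        }
        return p == null ? null : new Decoded(p, word);
    }

    /** Only the rows quoted in the table excerpt, nothing more. */
    static List<Pattern> sampleTable() {
        List<Pattern> t = new ArrayList<>();
        t.add(new Pattern("b", 7, "i{7-0},c{3-0},v{0000}"));
        t.add(new Pattern("b", 7, "i{23-0},c{3-0},v{1000}"));
        t.add(new Pattern("ldr", 8, "d{2-0},n{2-0},i{2-0},b{1-0},v{00100}"));
        t.add(new Pattern("str", 8, "d{2-0},n{2-0},i{2-0},b{1-0},v{10100}"));
        t.add(new Pattern("add", 3,
            "d{5-3},n{5-3},m{5-3},x{000},v{1010},d{2-0},n{2-0},m{2-0},v{0011111}"));
        return t;
    }

    public static void main(String[] args) {
        Decoded d = decode(sampleTable(), 0x1230, 0);
        System.out.printf("%s i=0x%x c=%d%n", d.pattern.mnemonic, d.field('i'), d.field('c'));
    }
}
```

Split operands like d{5-3} and d{2-0} fall out naturally because each slice records which bits of the operand it supplies, so `field('d')` just ORs them together. Formatting the result per addressing mode would still be the big switch statement.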
Because I wrote this in Java, before I could even test it ... I had to write an elf library as well. But elf is simple so it was just a few 'struct' accessors for a memory mapped Java ByteBuffer and only took half an hour via some referencing of the code in ezesdk and elf.h.
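For anyone curious what 'struct' accessors over a ByteBuffer look like, here is a minimal sketch. It is not the library itself. The class name `Elf32` and its method names are made up. The field offsets are the standard 32-bit ELF header and section header layouts from elf.h, and it assumes a little-endian 32-bit file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class Elf32 {
    private final ByteBuffer buf;

    Elf32(ByteBuffer data) {
        buf = data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (buf.get(0) != 0x7f || buf.get(1) != 'E' || buf.get(2) != 'L' || buf.get(3) != 'F')
            throw new IllegalArgumentException("not an ELF image");
        // e_ident[EI_CLASS] == ELFCLASS32, e_ident[EI_DATA] == ELFDATA2LSB
        if (buf.get(4) != 1 || buf.get(5) != 1)
            throw new IllegalArgumentException("expected 32-bit little-endian ELF");
    }

    // Elf32_Ehdr accessors
    int shoff()     { return buf.getInt(32); }
    int shentsize() { return buf.getShort(46) & 0xffff; }
    int shnum()     { return buf.getShort(48) & 0xffff; }
    int shstrndx()  { return buf.getShort(50) & 0xffff; }

    // Elf32_Shdr accessors for section i
    private int sh(int i, int field) { return buf.getInt(shoff() + i * shentsize() + field); }
    int sectionType(int i)   { return sh(i, 4); }
    int sectionOffset(int i) { return sh(i, 16); }
    int sectionSize(int i)   { return sh(i, 20); }

    /** Look up the NUL-terminated name of section i in the section header string table. */
    String sectionName(int i) {
        int p = sectionOffset(shstrndx()) + sh(i, 0);
        StringBuilder sb = new StringBuilder();
        for (byte b; (b = buf.get(p++)) != 0; ) sb.append((char) b);
        return sb.toString();
    }

    /** A little-endian view over just the bytes of section i. */
    ByteBuffer sectionData(int i) {
        ByteBuffer d = buf.duplicate();
        d.limit(sectionOffset(i) + sectionSize(i));
        d.position(sectionOffset(i));
        return d.slice().order(ByteOrder.LITTLE_ENDIAN);
    }

    public static void main(String[] args) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            Elf32 elf = new Elf32(ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
            for (int i = 0; i < elf.shnum(); i++)
                System.out.printf("%02d: %d %s%n", i, elf.sectionType(i), elf.sectionName(i));
        }
    }
}
```

Every accessor is an absolute read at a fixed offset, so nothing is copied out of the mapped file until a section's bytes are actually needed.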
A simple static analysis tool should be relatively straightforward at this point although to be useful it needs to do some more complicated things like determine dual-issue and so on. For that my guess is that I'll need a relatively complete pipeline simulator - it doesn't need to simulate the cpu instructions, just the register dependencies. A more dynamic analysis tool would require a full simulator but I guess that's possible since the cpu is so simple (performance might be a factor at that point though).
But I don't really know and I'm just piss farting about - I haven't written tools like this for ... forever. Last time was probably a disassembler I wrote in assembly language for the Commodore 64 about 25 years ago so I could dump the roms. Ahh those were the days. Actually these days aren't much different for me apart from different shit to be anxious about.
Productive enough afternoon anyway, I suppose I'd better go find some food and decide if I'm going to stay up to watch the soccer after watching some local footy and maybe the tour. 5am is a bit too late, or early, and tbh I don't really care too much who wins.
Update: Hacked a bit more last night, came up with a really shitty pipeline simulator.
From this code:
  fmadd.l r0,r0,r0
  fmadd.l r0,r0,r0
  add r17,r16,r16
  add r17,r16,r16
  add r17,r16,r16
  add r17,r16,r16
  add r17,r16,r16
  add r17,r16,r16
  fmadd.l r0,r0,r0
  rts

Assembled, then loaded from the elf:
       de      ra      e1           de        ra        e1        e2        e3        e4
  alu: -       -       -       fpu: 0 fmadd   -         -         -         -         -
  alu: 1 add   -       -       fpu: 1 fmadd   0 fmadd   -         -         -         -
  alu: 2 add   1 add   -       fpu: 1 fmadd   -         0 fmadd   -         -         -
  alu: 3 add   2 add   1 add   fpu: 1 fmadd   -         -         0 fmadd   -         -
  alu: 4 add   3 add   2 add   fpu: 1 fmadd   -         -         -         0 fmadd   -
  alu: 5 add   4 add   3 add   fpu: 1 fmadd   -         -         -         -         0 fmadd
  alu: 6 add   5 add   4 add   fpu: 6 fmadd   1 fmadd   -         -         -         -
  alu: 7 jr    6 add   5 add   fpu: 6 fmadd   -         1 fmadd   -         -         -
  alu: -       7 jr    6 add   fpu: 6 fmadd   -         -         1 fmadd   -         -
  alu: -       -       7 jr    fpu: 6 fmadd   -         -         -         1 fmadd   -
  alu: -       -       -       fpu: 6 fmadd   -         -         -         -         1 fmadd
  alu: -       -       -       fpu: -         6 fmadd   -         -         -         -
  alu: -       -       -       fpu: -         -         6 fmadd   -         -         -
  alu: -       -       -       fpu: -         -         -         6 fmadd   -         -
  alu: -       -       -       fpu: -         -         -         -         6 fmadd   -
  alu: -       -       -       fpu: -         -         -         -         -         6 fmadd
  alu: -       -       -       fpu: -         -         -         -         -         -

The number in front of the instruction is when it entered the pipeline.
Oops, so bit of a bug there, once it dual-issues the first add/fmadd pair it just keeps issuing the ialu ops, which shouldn't happen. I've got the register dependency test in the wrong spot. I can fiddle with the code to fix that up but I need to find out a bit more about how the pipeline works because there are some other details the documentation doesn't really cover in enough detail.
Update: After a bit of work on the house I had another look at the pipeline and did some hardware tests. So it looks like as soon as an instruction sequence arrives which might dual-issue, it gets locked into a 'dual issue' pair which will stall both instructions until both are ready to proceed - regardless of the order of the instructions and whether the first could advance on its own anyway.
So for example, these sequences all execute as dual-issue pairs (all else being equal, there are other alignment related things but I haven't worked them out yet).
fmadd.l r0,r0,r0 fmadd.l r0,r0,r0 mov r16,r16 fmadd.l r0,r0,r0 mov r16,r16 mov r16,r16 fmadd.l r0,r0,r0 mov r16,r16 fmadd.l r0,r0,r0 mov r16,r16 mov r16,r16 mov r16,r16 mov r16,r16 fmadd.l r0,r0,r0 fmadd.l r0,r0,r0 fmadd.l r0,r0,r0
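The locked-pair behaviour can be sketched with a toy issue-time model. This is not the timing tool, only an illustration of the rule. It assumes two adjacent instructions pair whenever they use different units and the second doesn't touch the first's destination. The latencies (5 cycles for the fpu, 1 for the ialu) are placeholder numbers, and alignment effects are ignored entirely. All names here (`DualIssueSketch`, `Insn`, `schedule`) are hypothetical.

```java
import java.util.Arrays;

final class DualIssueSketch {
    enum Unit { IALU, FPU }

    static final class Insn {
        final String name; final Unit unit; final int dst; final int[] src;
        Insn(String name, Unit unit, int dst, int... src) {
            this.name = name; this.unit = unit; this.dst = dst; this.src = src;
        }
        // Placeholder result latencies, not measured values.
        int latency() { return unit == Unit.FPU ? 5 : 1; }
    }

    /** Earliest cycle at which all source registers of `in` are available. */
    static int readyAt(Insn in, int[] ready) {
        int t = 0;
        for (int r : in.src) t = Math.max(t, ready[r]);
        return t;
    }

    /** Assumed pairing rule: different units, and b neither reads nor writes a's destination. */
    static boolean canPair(Insn a, Insn b) {
        if (a.unit == b.unit) return false;
        if (a.dst >= 0 && a.dst == b.dst) return false;
        for (int r : b.src) if (r == a.dst) return false;
        return true;
    }

    /** Issue cycle of each instruction, in order, with pairs locked together. */
    static int[] schedule(Insn... code) {
        int[] issue = new int[code.length];
        int[] ready = new int[64];   // cycle at which each register's value is available
        int next = 0;                // first cycle the issue slot is free
        for (int i = 0; i < code.length; ) {
            Insn a = code[i];
            boolean pair = i + 1 < code.length && canPair(a, code[i + 1]);
            int t = Math.max(next, readyAt(a, ready));
            // Locked pair: both wait until the slower of the two is ready.
            if (pair) t = Math.max(t, readyAt(code[i + 1], ready));
            for (int k = i; k <= (pair ? i + 1 : i); k++) {
                issue[k] = t;
                if (code[k].dst >= 0) ready[code[k].dst] = t + code[k].latency();
            }
            next = t + 1;
            i += pair ? 2 : 1;
        }
        return issue;
    }

    /** The test sequence from earlier in the post; rts is modelled as an ialu op with no registers. */
    static Insn[] example() {
        Insn[] code = new Insn[10];
        code[0] = new Insn("fmadd", Unit.FPU, 0, 0, 0);
        code[1] = new Insn("fmadd", Unit.FPU, 0, 0, 0);
        for (int i = 2; i < 8; i++) code[i] = new Insn("add", Unit.IALU, 17, 16, 16);
        code[8] = new Insn("fmadd", Unit.FPU, 0, 0, 0);
        code[9] = new Insn("rts", Unit.IALU, -1);
        return code;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(schedule(example())));
    }
}
```

With those placeholder latencies the first add gets the same issue cycle as the second fmadd, even though its own operands were ready much earlier, which is the stall-both-until-both-are-ready behaviour described above.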
Anyway, so re-running the timing tool with these new changes gives a better result:
  alu: -       -       -       fpu: 0 fmadd   -         -         -         -         -
  alu: 1 add   -       -       fpu: 1 fmadd   0 fmadd   -         -         -         -
  alu: 1 add   -       -       fpu: 1 fmadd   -         0 fmadd   -         -         -
  alu: 1 add   -       -       fpu: 1 fmadd   -         -         0 fmadd   -         -
  alu: 1 add   -       -       fpu: 1 fmadd   -         -         -         0 fmadd   -
  alu: 1 add   -       -       fpu: 1 fmadd   -         -         -         -         0 fmadd
  alu: 6 add   1 add   -       fpu: -         1 fmadd   -         -         -         -
  alu: 7 add   6 add   1 add   fpu: -         -         1 fmadd   -         -         -
  alu: 8 add   7 add   6 add   fpu: -         -         -         1 fmadd   -         -
  alu: 9 add   8 add   7 add   fpu: -         -         -         -         1 fmadd   -
  alu: 10 add  9 add   8 add   fpu: 10 fmadd  -         -         -         -         1 fmadd
  alu: 11 jr   10 add  9 add   fpu: -         10 fmadd  -         -         -         -
  alu: -       11 jr   10 add  fpu: -         -         10 fmadd  -         -         -
  alu: -       -       11 jr   fpu: -         -         -         10 fmadd  -         -
  alu: -       -       -       fpu: -         -         -         -         10 fmadd  -
  alu: -       -       -       fpu: -         -         -         -         -         10 fmadd
  alu: -       -       -       fpu: -         -         -         -         -         -

I also have another output format which is like the spu timing tool which shows each instruction in sequence with time horizontal. I don't have the correct labels yet but it shows the dual issue pairs more clearly. The register checking/writing might be in the wrong spot too but the delays look right.
  fmadd dr1234
  fmadd dr1234
  add dr1
  add dr1
  add dr1
  add dr1
  add dr1
  add dr1
  fmadd dr1234
  jr dr1
Still a few other details which can wait for another day.