Easiest to demonstrate in the epiphany instruction set.
A simple example:
    extern const e_group_config_t e_group_config;
    int id[8];

    void foo(void)
    {
        id[0] = e_group_config.core_row;
        id[1] = e_group_config.core_col;
    }

    -->

     0: 000b 0002   mov r0,0x0
            0: R_EPIPHANY_LOW  _e_group_config+0x1c
     4: 000b 1002   movt r0,0x0
            4: R_EPIPHANY_HIGH _e_group_config+0x1c
     8: 2044        ldr r1,[r0]
     a: 000b 0002   mov r0,0x0
            a: R_EPIPHANY_LOW  .bss
     e: 000b 1002   movt r0,0x0
            e: R_EPIPHANY_HIGH .bss
    12: 2054        str r1,[r0]
    14: 000b 0002   mov r0,0x0
            14: R_EPIPHANY_LOW  _e_group_config+0x20
    18: 000b 1002   movt r0,0x0
            18: R_EPIPHANY_HIGH _e_group_config+0x20
    1c: 2044        ldr r1,[r0]
    1e: 000b 0002   mov r0,0x0
            1e: R_EPIPHANY_LOW  .bss+0x4
    22: 000b 1002   movt r0,0x0
            22: R_EPIPHANY_HIGH .bss+0x4
    26: 2054        str r1,[r0]
Err, what?
It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the array. At first I thought this was just an epiphany-gcc thing but I cross-checked on amd64 and arm with the same result. Curious.
Curious also ...
    extern const e_group_config_t e_group_config;
    int id[8];

    void foo(void)
    {
        int *idp = id;
        const e_group_config_t *ep = &e_group_config;

        idp[0] = ep->core_row;
        idp[1] = ep->core_col;
    }

    -->

     0: 200b 0002   mov r1,0x0
            0: R_EPIPHANY_LOW  _e_group_config+0x1c
     4: 200b 1002   movt r1,0x0
            4: R_EPIPHANY_HIGH _e_group_config+0x1c
     8: 2444        ldr r1,[r1]
     a: 000b 0002   mov r0,0x0
            a: R_EPIPHANY_LOW  .bss
     e: 000b 1002   movt r0,0x0
            e: R_EPIPHANY_HIGH .bss
    12: 2054        str r1,[r0]
    14: 200b 0002   mov r1,0x0
            14: R_EPIPHANY_LOW  _e_group_config+0x20
    18: 200b 1002   movt r1,0x0
            18: R_EPIPHANY_HIGH _e_group_config+0x20
    1c: 2444        ldr r1,[r1]
    1e: 20d4        str r1,[r0,0x1]
This fixes the array references, but not the struct references.
If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most straightforward compilation, the one I get by just looking at the code and thinking in assembly (which is how I always look at memory-accessing code).
    #define e_group_config ((const e_group_config_t *)0x28)
    int id[8];

    void foo(void)
    {
        int *idp = id;

        idp[0] = e_group_config->core_row;
        idp[1] = e_group_config->core_col;
    }

    -->

     0: 2503        mov r1,0x28
     2: 47c4        ldr r2,[r1,0x7]
     4: 000b 0002   mov r0,0x0
            4: R_EPIPHANY_LOW  .bss
     8: 000b 1002   movt r0,0x0
            8: R_EPIPHANY_HIGH .bss
     c: 4054        str r2,[r0]
     e: 244c 0001   ldr r1,[r1,+0x8]
    12: 20d4        str r1,[r0,0x1]
Bit of a throwing-hands-in-the-air moment.
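A minimal sketch of the fixed-address pattern that can also be built off-target. The struct `demo_config_t`, the macro `DEMO_TARGET_BUILD` and the host stand-in values are all hypothetical names made up for illustration; only the 0x28 address and the take-the-base-once idiom come from the listing above.

```c
#include <stdint.h>

/* Hypothetical stand-in for the real config struct: just the two fields
 * used here, NOT the actual e-lib layout. */
typedef struct {
    uint32_t core_row;
    uint32_t core_col;
} demo_config_t;

#ifdef DEMO_TARGET_BUILD
/* On the device the structure sits at a fixed, known address (0x28 in the
 * post), so the base is an immediate constant and needs no relocation. */
#define demo_config ((const demo_config_t *)0x28)
#else
/* Host stand-in (assumed values) so the sketch can run on a PC. */
static const demo_config_t demo_config_host = { 2, 3 };
#define demo_config (&demo_config_host)
#endif

int id[8];

void foo(void)
{
    int *idp = id;                       /* take the array base once */

    idp[0] = (int)demo_config->core_row; /* indexed off one base pointer */
    idp[1] = (int)demo_config->core_col;
}
```

The point of the macro is only that the base address is known at compile time; whether a given compiler then picks the short indexed forms still depends on its optimisation settings.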
Using -O3 on the original example gives something reasonable:
     0: 200b 0002   mov r1,0x0
            0: R_EPIPHANY_LOW  _e_group_config+0x1c
     4: 200b 1002   movt r1,0x0
            4: R_EPIPHANY_HIGH _e_group_config+0x1c
     8: 4444        ldr r2,[r1]
     a: 000b 0002   mov r0,0x0
            a: R_EPIPHANY_LOW  .bss
     e: 24c4        ldr r1,[r1,0x1]
    10: 000b 1002   movt r0,0x0
            10: R_EPIPHANY_HIGH .bss
    14: 4054        str r2,[r0]
    16: 20d4        str r1,[r0,0x1]
Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.
I can only presume the cost model of offset address calculations is borrowing too much from x86 where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] compile this the same on amd64 as one would expect.
     0: 8b 05 00 00 00 00   mov 0x0(%rip),%eax    # 6
            2: R_X86_64_PC32 e_group_config+0x18
     6: 89 05 00 00 00 00   mov %eax,0x0(%rip)    # c
            8: R_X86_64_PC32 .bss-0x4
     c: 8b 05 00 00 00 00   mov 0x0(%rip),%eax    # 12
            e: R_X86_64_PC32 e_group_config+0x1c
    12: 89 05 00 00 00 00   mov %eax,0x0(%rip)    # 18
            14: R_X86_64_PC32 .bss+0x7c
It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.
Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on its own usually bloats the code too much).
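One way to experiment with a single flag without switching the whole build to -O3 is GCC's per-function `optimize` attribute. The sketch below is untested on epiphany-gcc and only shows where the attribute goes; `demo_config_t`, `store_ids` and the `OPT_TREE_VECTORIZE` macro are hypothetical names. GCC's own documentation describes this attribute as a debugging aid rather than something for production code, so passing the flag on the command line is probably the saner long-term choice.

```c
#include <stdint.h>

/* GCC only: request -ftree-vectorize for one function. Strings that do not
 * start with 'O' are treated as -f options by the optimize attribute.
 * Other compilers get an empty macro. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_TREE_VECTORIZE __attribute__((optimize("tree-vectorize")))
#else
#define OPT_TREE_VECTORIZE
#endif

/* Hypothetical stand-in for the config struct, with assumed values. */
typedef struct {
    uint32_t core_row;
    uint32_t core_col;
} demo_config_t;

static const demo_config_t demo_config = { 1, 2 };
int ids[8];

OPT_TREE_VECTORIZE
void store_ids(void)
{
    ids[0] = (int)demo_config.core_row;
    ids[1] = (int)demo_config.core_col;
}
```

Comparing the disassembly of the function with and without the attribute is the only way to tell whether it has the same effect as the command-line flag on a given target.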
Alternate runtime
I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.
I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.
I've come up with this group-info structure which leads to more compact code for a variety of reasons:
    struct ez_config_t {
        uint16_t reserved0;
        uint16_t reserved1;
        uint16_t group_size;
        uint16_t group_rows;
        uint16_t group_cols;
        uint16_t core_index;
        uint16_t core_row;
        uint16_t core_col;
        uint32_t group_id;
        void *extmem;
        uint32_t reserved2;
        uint32_t reserved3;
    };
The layout isn't random - the shorts all sit within a 3-bit offset (an index of 0 to 7, scaled by the data size) so a single 16-bit instruction can load them. The whole structure supports some expansion slots, all of which fit the 3-bit offset constraint for their data type, and there is room for some bytes if necessary.
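The constraint can be checked mechanically. This is a rough host-side sketch, assuming the short load/store form takes an index of 0..7 scaled by the access size, as described above. `ez_config_layout_t`, `fits_imm3` and `ez_layout_ok` are hypothetical names, and `extmem` is declared as a `uint32_t` here (an assumption) so the offsets match a 32-bit target even when compiled on a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-checkable copy of the layout; extmem is a uint32_t stand-in for the
 * 32-bit pointer on the target (assumption for this sketch). */
typedef struct {
    uint16_t reserved0;
    uint16_t reserved1;
    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;
    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;
    uint32_t group_id;
    uint32_t extmem;
    uint32_t reserved2;
    uint32_t reserved3;
} ez_config_layout_t;

/* A field fits the short form if its offset is a multiple of its size and
 * the scaled index fits in 3 bits (0..7). */
int fits_imm3(size_t offset, size_t size)
{
    return offset % size == 0 && offset / size <= 7;
}

/* Returns 1 if the last field of each width still fits; the earlier fields
 * have smaller offsets, so they fit too. */
int ez_layout_ok(void)
{
    return fits_imm3(offsetof(ez_config_layout_t, core_col), sizeof(uint16_t))
        && fits_imm3(offsetof(ez_config_layout_t, reserved3), sizeof(uint32_t));
}
```

Adding a ninth `uint16_t` after `core_col` would give it a scaled index of 8, which `fits_imm3` rejects; that is the sort of regression such a check would catch when the structure grows.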
To test it I access every value once:
    #define ez_configp ((ez_config_t *)(0x28))

    int *idp = id;

    idp[0] = ez_configp->group_size;
    idp[1] = ez_configp->group_rows;
    idp[2] = ez_configp->group_cols;
    idp[3] = ez_configp->core_index;
    idp[4] = ez_configp->core_row;
    idp[5] = ez_configp->core_col;
    idp[6] = ez_configp->group_id;
    idp[7] = (int32_t)ez_configp->extmem;

    -->

     0: 2503        mov r1,0x28
     2: 000b 0002   mov r0,0x0
     6: 4524        ldrh r2,[r1,0x2]
     8: 000b 1002   movt r0,0x0
     c: 4054        str r2,[r0]
     e: 45a4        ldrh r2,[r1,0x3]
    10: 40d4        str r2,[r0,0x1]
    12: 4624        ldrh r2,[r1,0x4]
    14: 4154        str r2,[r0,0x2]
    16: 46a4        ldrh r2,[r1,0x5]
    18: 41d4        str r2,[r0,0x3]
    1a: 4724        ldrh r2,[r1,0x6]
    1c: 4254        str r2,[r0,0x4]
    1e: 47a4        ldrh r2,[r1,0x7]
    20: 42d4        str r2,[r0,0x5]
    22: 4744        ldr r2,[r1,0x6]
    24: 27c4        ldr r1,[r1,0x7]
    26: 4354        str r2,[r0,0x6]
    28: 23d4        str r1,[r0,0x7]
And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.
I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.
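As a rough illustration of what pre-calculating means here, a loader-side helper could fill in the flat values once so that cores only ever load them. `ez_flat_t` and `ez_flat_init` are hypothetical names for this sketch, not part of any existing API, and the struct is only a subset of the fields.

```c
#include <stdint.h>

/* Hypothetical subset of the config structure holding the flat values. */
typedef struct {
    uint16_t group_size;   /* rows * cols, computed once */
    uint16_t group_rows;
    uint16_t group_cols;
    uint16_t core_index;   /* row * cols + col, computed once */
    uint16_t core_row;
    uint16_t core_col;
} ez_flat_t;

/* Hypothetical loader-side helper: do the multiplies up front so device
 * code needs only a single halfword load per value. */
void ez_flat_init(ez_flat_t *c, uint16_t rows, uint16_t cols,
                  uint16_t row, uint16_t col)
{
    c->group_rows = rows;
    c->group_cols = cols;
    c->core_row   = row;
    c->core_col   = col;
    c->group_size = (uint16_t)(rows * cols);
    c->core_index = (uint16_t)(row * cols + col);
}
```

For a 4x4 group, the core at row 2, column 3 ends up with `group_size` 16 and `core_index` 11. The same accesses written against the existing API, which has to do the arithmetic on the device, look like this: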
    int *idp = id;

    idp[0] = e_group_config.group_rows * e_group_config.group_cols;
    idp[1] = e_group_config.group_rows;
    idp[2] = e_group_config.group_cols;
    idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
    idp[4] = e_group_config.group_row;
    idp[5] = e_group_config.group_col;
    idp[6] = e_group_config.group_id;
    idp[7] = (int32_t)e_emem_config.base;

    --> -Os with default fpu mode

     0: 000b 0002   mov r0,0x0
     4: 000b 1002   movt r0,0x0
     8: 804c 2000   ldr r12,[r0,+0x0]
     c: 000b 0002   mov r0,0x0
    10: 000b 1002   movt r0,0x0
    14: 4044        ldr r2,[r0]
    16: 000b 4002   mov r16,0x0
    1a: 000b 0002   mov r0,0x0
    1e: 000b 1002   movt r0,0x0
    22: 010b 5002   movt r16,0x8
    26: 2112        movfs r1,config
    28: 0392        gid
    2a: 411f 4002   movfs r18,config
    2e: 487f 490a   orr r18,r18,r16
    32: 410f 4002   movts config,r18
    36: 0192        gie
    38: 0392        gid
    3a: 611f 4002   movfs r19,config
    3e: 6c7f 490a   orr r19,r19,r16
    42: 610f 4002   movts config,r19
    46: 0192        gie
    48: 0a2f 4087   fmul r16,r2,r12
    4c: 80dc 2000   str r12,[r0,+0x1]
    50: 800b 2002   mov r12,0x0
    54: 800b 3002   movt r12,0x0
    58: 4154        str r2,[r0,0x2]
    5a: 005c 4000   str r16,[r0]
    5e: 104c 4400   ldr r16,[r12,+0x0]
    62: 800b 2002   mov r12,0x0
    66: 800b 3002   movt r12,0x0
    6a: 412f 0807   fmul r2,r16,r2
    6e: 904c 2400   ldr r12,[r12,+0x0]
    72: 025c 4000   str r16,[r0,+0x4]
    76: 82dc 2000   str r12,[r0,+0x5]
    7a: 4a1f 008a   add r2,r2,r12
    7e: 41d4        str r2,[r0,0x3]
    80: 400b 0002   mov r2,0x0
    84: 400b 1002   movt r2,0x0
    88: 4844        ldr r2,[r2]
    8a: 4354        str r2,[r0,0x6]
    8c: 400b 0002   mov r2,0x0
    90: 400b 1002   movt r2,0x0
    94: 48c4        ldr r2,[r2,0x1]
    96: 43d4        str r2,[r0,0x7]
    98: 0392        gid
    9a: 611f 4002   movfs r19,config
    9e: 6c8f 480a   eor r19,r19,r1
    a2: 6ddf 480a   and r19,r19,r3
    a6: 6c8f 480a   eor r19,r19,r1
    aa: 610f 4002   movts config,r19
    ae: 0192        gie
    b0: 0392        gid
    b2: 011f 4002   movfs r16,config
    b6: 008f 480a   eor r16,r16,r1
    ba: 01df 480a   and r16,r16,r3
    be: 008f 480a   eor r16,r16,r1
    c2: 010f 4002   movts config,r16
    c6: 0192        gie

    --> -O3 with -mfp-mode=int

     0: 000b 0002   mov r0,0x0
     4: 000b 1002   movt r0,0x0
     8: 2044        ldr r1,[r0]
     a: 000b 0002   mov r0,0x0
     e: 000b 1002   movt r0,0x0
    12: 6044        ldr r3,[r0]
    14: 000b 0002   mov r0,0x0
    18: 000b 1002   movt r0,0x0
    1c: 804c 2000   ldr r12,[r0,+0x0]
    20: 4caf 4007   fmul r18,r3,r1
    24: 000b 4002   mov r16,0x0
    28: 000b 5002   movt r16,0x0
    2c: 000b 0002   mov r0,0x0
    30: 662f 4087   fmul r19,r1,r12
    34: 000b 1002   movt r0,0x0
    38: 204c 4800   ldr r17,[r16,+0x0]
    3c: 000b 4002   mov r16,0x0
    40: 4044        ldr r2,[r0]
    42: 000b 5002   movt r16,0x0
    46: 000b 0002   mov r0,0x0
    4a: 00cc 4800   ldr r16,[r16,+0x1]
    4e: 000b 1002   movt r0,0x0
    52: 491f 480a   add r18,r18,r2
    56: 605c 4000   str r19,[r0]
    5a: 80dc 2000   str r12,[r0,+0x1]
    5e: 2154        str r1,[r0,0x2]
    60: 41dc 4000   str r18,[r0,+0x3]
    64: 6254        str r3,[r0,0x4]
    66: 42d4        str r2,[r0,0x5]
    68: 235c 4000   str r17,[r0,+0x6]
    6c: 03dc 4000   str r16,[r0,+0x7]
Unless the code has no flops, -mfp-mode=int is probably not very useful, but this probably represents the best it could possibly do. And there's some real funky config register shit going on there in the -Os version but that just has to be a bug.
Oh blast, and the absolute loads are back anyway!
My hands might not stay attached if I keep throwing them up in the air at this point.
For the hexadecimal-challenged (i.e. me), the three fragments are 42, 200 and 112 bytes long respectively, and they use 3, 9 and 9 registers.
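Those byte counts are just the offset of the last instruction in each listing plus that instruction's length (2 bytes for a single halfword encoding, 4 for a pair). A trivial sketch, with `fragment_bytes` being a made-up helper name:

```c
/* Size of a listing in bytes: offset of its last instruction plus the
 * length of that instruction. */
unsigned fragment_bytes(unsigned last_offset, unsigned last_insn_len)
{
    return last_offset + last_insn_len;
}
```

So the listings ending at 0x28 (2-byte str), 0xc6 (2-byte gie) and 0x6c (4-byte str) come to 42, 200 and 112 bytes.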