Easiest to demonstrate in the epiphany instruction set.
A simple example:
extern const e_group_config_t e_group_config;
int id[8];

void foo(void) {
    id[0] = e_group_config.core_row;
    id[1] = e_group_config.core_col;
}
-->
0: 000b 0002 mov r0,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 000b 1002 movt r0,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 2044 ldr r1,[r0]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 000b 1002 movt r0,0x0
e: R_EPIPHANY_HIGH .bss
12: 2054 str r1,[r0]
14: 000b 0002 mov r0,0x0
14: R_EPIPHANY_LOW _e_group_config+0x20
18: 000b 1002 movt r0,0x0
18: R_EPIPHANY_HIGH _e_group_config+0x20
1c: 2044 ldr r1,[r0]
1e: 000b 0002 mov r0,0x0
1e: R_EPIPHANY_LOW .bss+0x4
22: 000b 1002 movt r0,0x0
22: R_EPIPHANY_HIGH .bss+0x4
26: 2054 str r1,[r0]
Err, what?
It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the local id array. At first I thought this was just an epiphany-gcc thing, but I cross-checked on amd64 and arm with the same result. Curious.
Curious also ...
extern const e_group_config_t e_group_config;
int id[8];

void foo(void) {
    int *idp = id;
    const e_group_config_t *ep = &e_group_config;

    idp[0] = ep->core_row;
    idp[1] = ep->core_col;
}
-->
0: 200b 0002 mov r1,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 200b 1002 movt r1,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 2444 ldr r1,[r1]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 000b 1002 movt r0,0x0
e: R_EPIPHANY_HIGH .bss
12: 2054 str r1,[r0]
14: 200b 0002 mov r1,0x0
14: R_EPIPHANY_LOW _e_group_config+0x20
18: 200b 1002 movt r1,0x0
18: R_EPIPHANY_HIGH _e_group_config+0x20
1c: 2444 ldr r1,[r1]
1e: 20d4 str r1,[r0,0x1]
This fixes the array references, but not the struct references.
If one hard-codes the pointer address (which is probably a better idea anyway - yes, it really is) and uses the pointer-to-array trick, things finally reach the most straightforward compilation - the one I get just by reading the code and thinking in assembly (which is how I always read memory-accessing code).
#define e_group_config ((const e_group_config_t *)0x28)
int id[8];

void foo(void) {
    int *idp = id;

    idp[0] = e_group_config->core_row;
    idp[1] = e_group_config->core_col;
}
-->
0: 2503 mov r1,0x28
2: 47c4 ldr r2,[r1,0x7]
4: 000b 0002 mov r0,0x0
4: R_EPIPHANY_LOW .bss
8: 000b 1002 movt r0,0x0
8: R_EPIPHANY_HIGH .bss
c: 4054 str r2,[r0]
e: 244c 0001 ldr r1,[r1,+0x8]
12: 20d4 str r1,[r0,0x1]
Bit of a throwing-hands-in-the-air moment.
Using -O3 on the original example gives something reasonable:
0: 200b 0002 mov r1,0x0
0: R_EPIPHANY_LOW _e_group_config+0x1c
4: 200b 1002 movt r1,0x0
4: R_EPIPHANY_HIGH _e_group_config+0x1c
8: 4444 ldr r2,[r1]
a: 000b 0002 mov r0,0x0
a: R_EPIPHANY_LOW .bss
e: 24c4 ldr r1,[r1,0x1]
10: 000b 1002 movt r0,0x0
10: R_EPIPHANY_HIGH .bss
14: 4054 str r2,[r0]
16: 20d4 str r1,[r0,0x1]
This is what it should have been doing to start with. After testing every optimisation flag that differs between -O2 and -O3, I found it is -ftree-vectorize that activates this 'optimisation'.
I can only presume the cost model for offset address calculations borrows too much from x86, where the lack of registers and the addressing modes favour pre-calculation every time. -O[s23] all compile this the same way on amd64, as one would expect.
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6
2: R_X86_64_PC32 e_group_config+0x18
6: 89 05 00 00 00 00 mov %eax,0x0(%rip) # c
8: R_X86_64_PC32 .bss-0x4
c: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 12
e: R_X86_64_PC32 e_group_config+0x1c
12: 89 05 00 00 00 00 mov %eax,0x0(%rip) # 18
14: R_X86_64_PC32 .bss+0x7c
It might seem insignificant, but the initial code is 40 bytes versus 24 for the optimised version (or 20 using the hard-coded address) - these minor things can add up pretty fast.
Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on its own usually bloats the code too much).
Alternate runtime
I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.
I was looking at creating a more epiphany-optimised version of e_group_config and e_mem_config, both to save a few bytes and to make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.
I've come up with this group-info structure which leads to more compact code for a variety of reasons:
struct ez_config_t {
    uint16_t reserved0;
    uint16_t reserved1;
    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;
    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;
    uint32_t group_id;
    void *extmem;
    uint32_t reserved2;
    uint32_t reserved3;
};
The layout isn't random - the shorts all sit within the 3-bit scaled offset a single 16-bit load can reach. The structure also leaves some expansion slots, all of which fit within the 3-bit offset constraint for their data type, and there is room for some bytes if necessary.
To test it I access every value once:
#define ez_configp ((ez_config_t *)(0x28))
int *idp = id;
idp[0] = ez_configp->group_size;
idp[1] = ez_configp->group_rows;
idp[2] = ez_configp->group_cols;
idp[3] = ez_configp->core_index;
idp[4] = ez_configp->core_row;
idp[5] = ez_configp->core_col;
idp[6] = ez_configp->group_id;
idp[7] = (int32_t)ez_configp->extmem;
-->
0: 2503 mov r1,0x28
2: 000b 0002 mov r0,0x0
6: 4524 ldrh r2,[r1,0x2]
8: 000b 1002 movt r0,0x0
c: 4054 str r2,[r0]
e: 45a4 ldrh r2,[r1,0x3]
10: 40d4 str r2,[r0,0x1]
12: 4624 ldrh r2,[r1,0x4]
14: 4154 str r2,[r0,0x2]
16: 46a4 ldrh r2,[r1,0x5]
18: 41d4 str r2,[r0,0x3]
1a: 4724 ldrh r2,[r1,0x6]
1c: 4254 str r2,[r0,0x4]
1e: 47a4 ldrh r2,[r1,0x7]
20: 42d4 str r2,[r0,0x5]
22: 4744 ldr r2,[r1,0x6]
24: 27c4 ldr r1,[r1,0x7]
26: 4354 str r2,[r0,0x6]
28: 23d4 str r1,[r0,0x7]
And the compiler's done exactly what you would expect here: load the object base address, then access everything via indexed addressing, taking advantage of the hand-tuned layout to use a 16-bit instruction for every access too.
I've included a couple of pre-calculated flat index values because these are often needed in practical code, and certainly to implement any group-wide primitives. This is somewhat better than the existing API, which must calculate them on the fly.
int *idp = id;
idp[0] = e_group_config.group_rows * e_group_config.group_cols;
idp[1] = e_group_config.group_rows;
idp[2] = e_group_config.group_cols;
idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
idp[4] = e_group_config.group_row;
idp[5] = e_group_config.group_col;
idp[6] = e_group_config.group_id;
idp[7] = (int32_t)e_emem_config.base;
--> -Os with default fpu mode
0: 000b 0002 mov r0,0x0
4: 000b 1002 movt r0,0x0
8: 804c 2000 ldr r12,[r0,+0x0]
c: 000b 0002 mov r0,0x0
10: 000b 1002 movt r0,0x0
14: 4044 ldr r2,[r0]
16: 000b 4002 mov r16,0x0
1a: 000b 0002 mov r0,0x0
1e: 000b 1002 movt r0,0x0
22: 010b 5002 movt r16,0x8
26: 2112 movfs r1,config
28: 0392 gid
2a: 411f 4002 movfs r18,config
2e: 487f 490a orr r18,r18,r16
32: 410f 4002 movts config,r18
36: 0192 gie
38: 0392 gid
3a: 611f 4002 movfs r19,config
3e: 6c7f 490a orr r19,r19,r16
42: 610f 4002 movts config,r19
46: 0192 gie
48: 0a2f 4087 fmul r16,r2,r12
4c: 80dc 2000 str r12,[r0,+0x1]
50: 800b 2002 mov r12,0x0
54: 800b 3002 movt r12,0x0
58: 4154 str r2,[r0,0x2]
5a: 005c 4000 str r16,[r0]
5e: 104c 4400 ldr r16,[r12,+0x0]
62: 800b 2002 mov r12,0x0
66: 800b 3002 movt r12,0x0
6a: 412f 0807 fmul r2,r16,r2
6e: 904c 2400 ldr r12,[r12,+0x0]
72: 025c 4000 str r16,[r0,+0x4]
76: 82dc 2000 str r12,[r0,+0x5]
7a: 4a1f 008a add r2,r2,r12
7e: 41d4 str r2,[r0,0x3]
80: 400b 0002 mov r2,0x0
84: 400b 1002 movt r2,0x0
88: 4844 ldr r2,[r2]
8a: 4354 str r2,[r0,0x6]
8c: 400b 0002 mov r2,0x0
90: 400b 1002 movt r2,0x0
94: 48c4 ldr r2,[r2,0x1]
96: 43d4 str r2,[r0,0x7]
98: 0392 gid
9a: 611f 4002 movfs r19,config
9e: 6c8f 480a eor r19,r19,r1
a2: 6ddf 480a and r19,r19,r3
a6: 6c8f 480a eor r19,r19,r1
aa: 610f 4002 movts config,r19
ae: 0192 gie
b0: 0392 gid
b2: 011f 4002 movfs r16,config
b6: 008f 480a eor r16,r16,r1
ba: 01df 480a and r16,r16,r3
be: 008f 480a eor r16,r16,r1
c2: 010f 4002 movts config,r16
c6: 0192 gie
--> -O3 with -mfp-mode=int
0: 000b 0002 mov r0,0x0
4: 000b 1002 movt r0,0x0
8: 2044 ldr r1,[r0]
a: 000b 0002 mov r0,0x0
e: 000b 1002 movt r0,0x0
12: 6044 ldr r3,[r0]
14: 000b 0002 mov r0,0x0
18: 000b 1002 movt r0,0x0
1c: 804c 2000 ldr r12,[r0,+0x0]
20: 4caf 4007 fmul r18,r3,r1
24: 000b 4002 mov r16,0x0
28: 000b 5002 movt r16,0x0
2c: 000b 0002 mov r0,0x0
30: 662f 4087 fmul r19,r1,r12
34: 000b 1002 movt r0,0x0
38: 204c 4800 ldr r17,[r16,+0x0]
3c: 000b 4002 mov r16,0x0
40: 4044 ldr r2,[r0]
42: 000b 5002 movt r16,0x0
46: 000b 0002 mov r0,0x0
4a: 00cc 4800 ldr r16,[r16,+0x1]
4e: 000b 1002 movt r0,0x0
52: 491f 480a add r18,r18,r2
56: 605c 4000 str r19,[r0]
5a: 80dc 2000 str r12,[r0,+0x1]
5e: 2154 str r1,[r0,0x2]
60: 41dc 4000 str r18,[r0,+0x3]
64: 6254 str r3,[r0,0x4]
66: 42d4 str r2,[r0,0x5]
68: 235c 4000 str r17,[r0,+0x6]
6c: 03dc 4000 str r16,[r0,+0x7]
Unless the code has no flops, -mfp-mode=int is probably not very useful, but this probably represents the best it could possibly do. And there's some real funky config-register shit going on in the -Os version, but that just has to be a bug.
Oh blast, and the absolute loads are back anyway!
My hands might not stay attached if i keep throwing them up in the air at this point.
For the hexadecimal-challenged (i.e. me), the fragments are 42, 200, and 112 bytes long respectively, and use 3, 9, and 9 registers.