Tuesday, 25 March 2014

compiler strangeness

So I hit a strange issue with gcc. Well i don't know ... not 'strange', just unexpected. It probably doesn't matter much on x86 because it has so few registers and such a shitty breadth of addressing modes but on arm and epiphany it generates some pretty shit load/store code outside of an unexpected optimisation flag (and -O3, and even then only sometimes?).

Easiest to demonstrate in the epiphany instruction set.

A simple example:

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 id[0] = e_group_config.core_row;
 id[1] = e_group_config.core_col;
}

 -->

   0:   000b 0002       mov r0,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   000b 1002       movt r0,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   000b 0002       mov r0,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   000b 1002       movt r0,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2044            ldr r1,[r0]
  1e:   000b 0002       mov r0,0x0
                        1e: R_EPIPHANY_LOW      .bss+0x4
  22:   000b 1002       movt r0,0x0
                        22: R_EPIPHANY_HIGH     .bss+0x4
  26:   2054            str r1,[r0]

Err, what?

It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the array array. At first I thought this was just an epiphany-gcc thing but i cross checked on amd64 and arm with the same result. Curious.

Curious also ...

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 int *idp = id;
 const e_group_config_t *ep = &e_group_config;

 idp[0] = ep->core_row;
 idp[1] = ep->core_col;
}

 -->
   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2444            ldr r1,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   200b 0002       mov r1,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   200b 1002       movt r1,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2444            ldr r1,[r1]
  1e:   20d4            str r1,[r0,0x1]

This fixes the array references, but not the struct references.

If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most-straightforward-compilation I get by just looking at the code and thinking in assembly (which is how i always look at memory-accessing code).

#define e_group_config ((const e_group_config_t *)0x28)
int id[8];
void foo(void) {
  int *idp = id;
  idp[0] = e_group_config->core_row;
  idp[1] = e_group_config->core_col;[/code]
}

 -->

   0:   2503            mov r1,0x28
   2:   47c4            ldr r2,[r1,0x7]
   4:   000b 0002       mov r0,0x0
                        4: R_EPIPHANY_LOW       .bss
   8:   000b 1002       movt r0,0x0
                        8: R_EPIPHANY_HIGH      .bss
   c:   4054            str r2,[r0]
   e:   244c 0001       ldr r1,[r1,+0x8]
  12:   20d4            str r1,[r0,0x1]

Bit of a throwing-hands-in-the-air moment.

Using -O3 on the original example gives something reasonable:

   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   4444            ldr r2,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   24c4            ldr r1,[r1,0x1]
  10:   000b 1002       movt r0,0x0
                        10: R_EPIPHANY_HIGH     .bss
  14:   4054            str r2,[r0]
  16:   20d4            str r1,[r0,0x1]

Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.

I can only presume the cost model of offset address calculations is borrowing too much from x86 where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] compile this the same on amd64 as one would expect.

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 
                        2: R_X86_64_PC32        e_group_config+0x18
   6:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # c 
                        8: R_X86_64_PC32        .bss-0x4
   c:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 12 
                        e: R_X86_64_PC32        e_group_config+0x1c
  12:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 18 
                        14: R_X86_64_PC32       .bss+0x7c

It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.

Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on it's own usually bloats the code too much).

Alternate runtime

I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.

I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.

I've come up with this group-info structure which leads to more compact code for a variety of reasons:

struct ez_config_t {
    uint16_t reserved0;
    uint16_t reserved1;

    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;

    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;

    uint32_t group_id;
    void *extmem;

    uint32_t reserved2;
    uint32_t reserved3;
};

The layout isn't random - shorts are all within a 3-bit offset so a single 16-bit instruction can load them. The whole structure supports some expansion slots all which fit in with the 3-bit offset constraint for the data-type, and there is room for some bytes if necessary.

To test it I access every value once:

        #define ez_configp ((ez_config_t *)(0x28))
        int *idp = id;
        idp[0] = ez_configp->group_size;
        idp[1] = ez_configp->group_rows;
        idp[2] = ez_configp->group_cols;
        idp[3] = ez_configp->core_index;
        idp[4] = ez_configp->core_row;
        idp[5] = ez_configp->core_col;

        idp[6] = ez_configp->group_id;
        idp[7] = (int32_t)ez_configp->extmem;

 -->
   0:   2503            mov r1,0x28
   2:   000b 0002       mov r0,0x0
   6:   4524            ldrh r2,[r1,0x2]
   8:   000b 1002       movt r0,0x0
   c:   4054            str r2,[r0]
   e:   45a4            ldrh r2,[r1,0x3]
  10:   40d4            str r2,[r0,0x1]
  12:   4624            ldrh r2,[r1,0x4]
  14:   4154            str r2,[r0,0x2]
  16:   46a4            ldrh r2,[r1,0x5]
  18:   41d4            str r2,[r0,0x3]
  1a:   4724            ldrh r2,[r1,0x6]
  1c:   4254            str r2,[r0,0x4]
  1e:   47a4            ldrh r2,[r1,0x7]
  20:   42d4            str r2,[r0,0x5]
  22:   4744            ldr r2,[r1,0x6]
  24:   27c4            ldr r1,[r1,0x7]
  26:   4354            str r2,[r0,0x6]
  28:   23d4            str r1,[r0,0x7]

And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.

I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.

    int *idp = id;
    idp[0] = e_group_config.group_rows * e_group_config.group_cols;
    idp[1] = e_group_config.group_rows;
    idp[2] = e_group_config.group_cols;
    idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
    idp[4] = e_group_config.group_row;
    idp[5] = e_group_config.group_col;

    idp[6] = e_group_config.group_id;
    idp[7] = (int32_t)e_emem_config.base;

 --> -Os with default fpu mode
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   804c 2000       ldr r12,[r0,+0x0]
   c:   000b 0002       mov r0,0x0
  10:   000b 1002       movt r0,0x0
  14:   4044            ldr r2,[r0]
  16:   000b 4002       mov r16,0x0
  1a:   000b 0002       mov r0,0x0
  1e:   000b 1002       movt r0,0x0
  22:   010b 5002       movt r16,0x8
  26:   2112            movfs r1,config
  28:   0392            gid
  2a:   411f 4002       movfs r18,config
  2e:   487f 490a       orr r18,r18,r16
  32:   410f 4002       movts config,r18
  36:   0192            gie
  38:   0392            gid
  3a:   611f 4002       movfs r19,config
  3e:   6c7f 490a       orr r19,r19,r16
  42:   610f 4002       movts config,r19
  46:   0192            gie
  48:   0a2f 4087       fmul r16,r2,r12
  4c:   80dc 2000       str r12,[r0,+0x1]
  50:   800b 2002       mov r12,0x0
  54:   800b 3002       movt r12,0x0
  58:   4154            str r2,[r0,0x2]
  5a:   005c 4000       str r16,[r0]
  5e:   104c 4400       ldr r16,[r12,+0x0]
  62:   800b 2002       mov r12,0x0
  66:   800b 3002       movt r12,0x0
  6a:   412f 0807       fmul r2,r16,r2
  6e:   904c 2400       ldr r12,[r12,+0x0]
  72:   025c 4000       str r16,[r0,+0x4]
  76:   82dc 2000       str r12,[r0,+0x5]
  7a:   4a1f 008a       add r2,r2,r12
  7e:   41d4            str r2,[r0,0x3]
  80:   400b 0002       mov r2,0x0
  84:   400b 1002       movt r2,0x0
  88:   4844            ldr r2,[r2]
  8a:   4354            str r2,[r0,0x6]
  8c:   400b 0002       mov r2,0x0
  90:   400b 1002       movt r2,0x0
  94:   48c4            ldr r2,[r2,0x1]
  96:   43d4            str r2,[r0,0x7]
  98:   0392            gid
  9a:   611f 4002       movfs r19,config
  9e:   6c8f 480a       eor r19,r19,r1
  a2:   6ddf 480a       and r19,r19,r3
  a6:   6c8f 480a       eor r19,r19,r1
  aa:   610f 4002       movts config,r19
  ae:   0192            gie
  b0:   0392            gid
  b2:   011f 4002       movfs r16,config
  b6:   008f 480a       eor r16,r16,r1
  ba:   01df 480a       and r16,r16,r3
  be:   008f 480a       eor r16,r16,r1
  c2:   010f 4002       movts config,r16
  c6:   0192            gie

 --> -O3 with -mfp-mode=int
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
   e:   000b 1002       movt r0,0x0
  12:   6044            ldr r3,[r0]
  14:   000b 0002       mov r0,0x0
  18:   000b 1002       movt r0,0x0
  1c:   804c 2000       ldr r12,[r0,+0x0]
  20:   4caf 4007       fmul r18,r3,r1
  24:   000b 4002       mov r16,0x0
  28:   000b 5002       movt r16,0x0
  2c:   000b 0002       mov r0,0x0
  30:   662f 4087       fmul r19,r1,r12
  34:   000b 1002       movt r0,0x0
  38:   204c 4800       ldr r17,[r16,+0x0]
  3c:   000b 4002       mov r16,0x0
  40:   4044            ldr r2,[r0]
  42:   000b 5002       movt r16,0x0
  46:   000b 0002       mov r0,0x0
  4a:   00cc 4800       ldr r16,[r16,+0x1]
  4e:   000b 1002       movt r0,0x0
  52:   491f 480a       add r18,r18,r2
  56:   605c 4000       str r19,[r0]
  5a:   80dc 2000       str r12,[r0,+0x1]
  5e:   2154            str r1,[r0,0x2]
  60:   41dc 4000       str r18,[r0,+0x3]
  64:   6254            str r3,[r0,0x4]
  66:   42d4            str r2,[r0,0x5]
  68:   235c 4000       str r17,[r0,+0x6]
  6c:   03dc 4000       str r16,[r0,+0x7]

Unless the code has no flops the fpumode=int is probably not very useful but this probably represents the best it could possibly do. And there's some real funky config register shit going on there in the -Os version but that just has to be a bug.

Oh blast, and the absolute loads are back anyway!

My hands might not stay attached if i keep throwing them up in the air at this point.

For the hexadecimal challenged (i.e. me) each fragment is 42, 200, and 112 bytes long respectively. And each uses 3, 9, or 9 registers.

No comments: