The functionality is basically the same as e-lib but uses a different runtime e_config and the apis were modified slightly in order to gain efficiency and sometimes ease of use. For example the dma code takes a the lower 16-bits of the dma config register base address rather than an index which makes no real difference to the user but shaves a few precious bytes off the implementation. Another example is the irq functions which take take a mask and a set of bits so any combination of bits can be modified in a single invocation rather than having to call multiple times for different ones. Some of the other functions like the global address mapping functions i de-parameterised into separate special purpose functions since generally you know what type of address you have and it makes quite a difference if you specialise and this is the sort of stuff the compiler needs some help with. I just noticed I missed some of the coreid manipulation stuff though.
The ultimate goal is actually to put a lot of stuff in inline blocks which will actually generate less code and help the compiler but to compare the total code sizes I just generated a file with every function in it and mirrored the functionality identically between assembly and c.
Anyway ... (C compiled with -O2 -mshort-calls -std=gnu99)
notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny-c.o text data bss dec hex filename 932 0 0 932 3a4 ez-lib-tiny-c.o notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny.o text data bss dec hex filename 696 0 0 696 2b8 ez-lib-tiny.o
As a comparison the current e-lib.a contains 3072 bytes of code (and a few more trivial functions). Also related is the c startup/support code (crt0 etc) which takes about another 2k above a minimal start-up implementation I am now using (none of which is included in the above numbers).
The smaller one is of course in assembly language. The biggest pieces were the timer stuff because it can't really be parameterised, and particularly the barrier implementation. I think i spotted some bugs in the timer implementation with it's handling of the config register (sigh, why did everything have to go through the one register) but i'm not sure they're bugs. In actual code this will actually shrink even in worst case because quite a few of these will be inlined instead, and generate code fragments which are smaller than a function invocation and without side effects like clobbering work registers.
I managed to squeeze the barrier implementation into only the lowest 8 registers which helped reduce the code-size quite a bit. The whole implementation is 88 bytes - it's amazing how quickly they add up (only 38 instructions). Of course I haven't verified it yet so it could all be terribly broken too.
I made some changes to the api to simplify the code and usage. It only takes a single array pointer which must be a local-core address of group-size bytes of memory. The memory must be at the same address in every core which allows the implementation to implicitly calculate any address required.
This is the C version, and the assembly is just a straightforward hand-compilation of that.
void ez_barrier_init(ez_barrier_t *barrier_array) { for (int i=0;i<ez_config->group_size;i++) barrier_array[i] = 0; }Rather than add code to special-case core0, just have every core clear their barrier block. Only one byte of the non-control core is actually used but why add the extra code to the library if it doesn't hurt.
void ez_barrier(ez_barrier_t *barrier_array) { volatile ez_barrier_t *ba = barrier_array; int index = ez_config->core_index; if (index == 0) { // Wait for all others (not us, we know we're here) for (int i=ez_config->group_size-1;i>0;i--) { while (ba[i] == 0) ; // We can reset the local flag immediately ba[i] = 0; } // Notify (do us because it doesn't hurt and simplifies the loops) for (int r=ez_config->group_rows;r>0;r--) { for (int c=ez_config->group_cols;c>0;c--) { volatile ez_barrier_t *rb = ez_global_core(barrier_array, r-1, c-1); rb[0] = 0; } } } else { volatile ez_barrier_t *root = ez_global_core(barrier_array, 0, 0); // Mark local signal and notify root ba[0] = 1; root[index] = 1; // Await clear while (ba[0]) ; } }
In contrast to the assembly version this compiles into 140 bytes (-O2). I know it's pretty much pissing in the wind at this point but, you know, hobbies generally are. The one in e-lib hits 252 but a sizeable chunk of that is the floating point unit config manipulation needed for an integer multiply - not needed in my case because I tweaked the workgroup config to include the same info pretty much exactly for this purpose (and wanting a flat index for the current core is extremely common in practice, even in 2d kernels). The e-lib version does require more run-time memory though: 16 + 64*4 = 80 bytes vs just 16 for a 4x4 epiphany.
(as an aside, it's a real bummer the hardware barrier has bugs which probably prevent it from working on anything but a whole-chip workgroup. That would've been much faster and taken much less code to implement. I will add a separate api entry point for hardware barriers; a whole-chip barrier is certainly useful for many applications.
I might be able to apply some of the optimisation techniques I employed in the assembly version to further improve it too - for example the assembly version basically collapses all 'calls' to ez_global_core() into a single calculation in the prologue. This feedback from coding in assembly and trying to work out what the compile is doing has happened a few times although it isn't very reliable and might make no difference. Note that the compiler has in-lined almost all of the local calls like ez_global_core() already which helps code-size quite a bit (this is a version of e_get_global_address() that doesn't have special case code for global addresses or E_SELF).
I removed the separate reset loop on the controller core code path since each slot in the array has a dedicated user and there's no need to synchronise with the others before clearing the local flags.
Oh, and in an attempt to help improve the latency it starts writing from the farthest away core first. If one takes a single row it should end up having all the writes arrive at approximately the same time as the train of writes arrives at each destination in lock-step. That's the idea anyway. I need to check it against the routing algorithm to see if it should be by columns instead of rows although it might not make any difference.
I should've really been out enjoying another unusually warm day but ... yeah i didn't. Mowed the lawn though and nearly headed into the city for an afternoon drink, but somehow got distracted for about 8 hours and now it's dark and the lights are still off (Hmmm. Just fixed). Better go hunt for food I guess.
No comments:
Post a Comment