Monday, 31 March 2014

a (slightly) better barrier, signals

So I kept thinking about the last point on the last post. That is, having a job-based runtime similar to hsa queuing.

One feature it employs is a signalling mechanism that I think lets waiters wait on multiple events before continuing. I came up with a possible epiphany-friendly primitive to support that.

First there is the sending object - this is what each sender gets to signal against. It requires a static allocation of slots by the runtime so that they can signal with a single asynchronous write without needing the atomic index update that a semaphore-like object would require.

struct ez_signal_t {
 uint slot;
 ez_signal_bank_t *bank;
};

void ez_signal(ez_signal_t s) {
 s.bank->s.signals[s.slot] = 0xff;
}

The signal bank is basically just an array of bytes, one per sender, but it also allows word access for a more efficient wait.

struct ez_signal_bank_t {
 uint size;
 uint reserved;
 union {
  ubyte signals[8];
  uint block[2];
 } s;
 // others follow
 // must be rounded up to 4 bytes
};
(the structure is aligned to 8 because of the union, it's probably not worth it and i will use an assembly version anyway)

The bank is initialised by setting every byte within size to 0, but those outside of that range which might fall into a word slot are pre-filled with the signal value of 0xff. This allows the waiter to execute a simple uint based loop without having to worry about edge conditions.

void ez_signal_wait(ez_signal_bank_t *bank) {
        uint size = bank->size;
        volatile uint *si = bank->s.block;

        size = ((size+3) >> 2);
        while (size > 0) {
                while (si[size-1] != 0xffffffff)
                        ;
                size -= 1;
        }
}
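To make the pre-fill trick concrete, here is a rough host-side sketch of the initialisation plus a non-blocking form of the wait check. The names signal_bank_t, signal_bank_init, signal_raise and signal_bank_ready are made up for illustration, and it assumes the fixed 8-slot bank shown above rather than whatever the final on-core version ends up being.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical host-side model of the 8-slot bank from the post. */
typedef struct {
    uint32_t size;
    uint32_t reserved;
    union {
        uint8_t  signals[8];
        uint32_t block[2];
    } s;
} signal_bank_t;

/* Slots [0,size) start at 0 (not signalled); the padding bytes that share
 * a word with live slots start at 0xff so they always look signalled. */
static void signal_bank_init(signal_bank_t *bank, uint32_t size) {
    if (size > sizeof bank->s.signals)
        size = sizeof bank->s.signals;
    bank->size = size;
    bank->reserved = 0;
    memset(bank->s.signals, 0xff, sizeof bank->s.signals);
    memset(bank->s.signals, 0x00, size);
}

/* What each sender does: one byte write, no read-modify-write. */
static void signal_raise(signal_bank_t *bank, uint32_t slot) {
    bank->s.signals[slot] = 0xff;
}

/* Word-at-a-time check, same rounding as ez_signal_wait(). */
static int signal_bank_ready(const signal_bank_t *bank) {
    uint32_t words = (bank->size + 3) >> 2;
    for (uint32_t i = 0; i < words; i++)
        if (bank->s.block[i] != 0xffffffffu)
            return 0;
    return 1;
}
```

With size 5 the last three bytes of the second word are pre-filled, so the word compare succeeds as soon as the five real senders have written.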

The same technique could be applied to barrier() as well - a tiny improvement but one nonetheless. It does need some edge case handling for the reset and init though which will take more code.

(Hmm, on closer inspection it isn't clear whether the hsa signalling objects actually support multiple writers or not - it seems perhaps not. However they do operate at a higher level of abstraction and may need multiple writers internally in order to implement the public interface when multiple cores are involved. Update: On further inspection it looks like they are basically signal semaphores but with a flexible condition check on the counter and the ability to use opencl atomics on the index.)

Actually another thought I had was that since the barrier is tied directly to the workgroup why is the user code even calling init on its own structure anyway? It may as well be a parameter-less function that just uses a pre-defined area as the barrier memory which is set up by the loader and/or the init routine. I think the current barrier implementation mechanism may have an issue because you can't initialise the barrier data structures in parallel without it being a race condition and so even the setup needs another synchronisation mechanism to implement it properly. By placing it in an area that can be initialised by the loader/runtime that race goes away quietly to die. And if the barrier is chip-wide or the WAND routing ever gets fixed it could just use a hardware barrier implementation instead.

This thought may be extended to other data structures and facilities that are needed at runtime. Why have the error prone and bulky code to set up a dma structure when the logic could be encoded in a routine with simple arguments that covers the most common cases with just two functions: 1D and 2D? And if you're doing that why not add an async copy capability with interrupt driven queuing as well? Well it may not be worth it in the end but it's worth exploring to find out.

gdb sim

So I actually used the gdb simulator for the first time yesterday to step through some assembly language to verify some code. So far I've just debugged using the run-till-it-stops-crashing method. Bit of a pain it doesn't seem to think assembly language is actually a thing though:

$ e-gdb
...

(gdb) target sim
Connected to the simulator.
(gdb) file a.out
Reading symbols from /home/notzed/src/elf-loader/a.out...done.
(gdb) load
Loading section ivt_reset, size 0x4 lma 0x0
Loading section .reserved_crt0, size 0xc lma 0x58
Loading section NEW_LIB_RO, size 0x170 lma 0x64
Loading section NEW_LIB_WR, size 0x450 lma 0x1d8
Loading section GNU_C_BUILTIN_LIB_RO, size 0x6 lma 0x628
Loading section .init, size 0x24 lma 0x62e
Loading section .text, size 0x268 lma 0x660
Loading section .fini, size 0x1a lma 0x8c8
Loading section .ctors, size 0x8 lma 0x8e4
Loading section .dtors, size 0x8 lma 0x8ec
Loading section .jcr, size 0x4 lma 0x8f4
Loading section .data, size 0x4 lma 0x8f8
Loading section .rodata, size 0x8 lma 0x900
Start address 0x0
Transfer rate: 17632 bits in <1 sec.
(gdb) b ez_signal_init
Breakpoint 1 at 0x85a
(gdb) r
Starting program: /home/foo/src/elf-loader/a.out 

Breakpoint 1, 0x0000085a in ez_signal_init ()
(gdb) stepi
0x0000085e in ez_signal_init ()
(gdb) 
0x00000860 in ez_signal_init ()
(gdb) 
0x00000862 in ez_signal_init ()
(gdb) 
0x00000864 in ez_signal_init ()

Yay? Had better tools on the C=64.

But I came across this article which provides something of a workaround. Not the best but it will suffice.

(gdb) display /3i $pc
1: x/3i $pc
=> 0x864 <ez_signal_init+10>:   lsl r1,r1,0x2
   0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
(gdb) stepi
0x00000866 in ez_signal_init ()
1: x/3i $pc
=> 0x866 <ez_signal_init+12>:   lsl r3,r3,0x3
   0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
(gdb) 
0x00000868 in ez_signal_init ()
1: x/3i $pc
=> 0x868 <ez_signal_init+14>:   beq 0x880 <ez_signal_init+38>
   0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
(gdb) 
0x0000086a in ez_signal_init ()
1: x/3i $pc
=> 0x86a <ez_signal_init+16>:   mov r2,0xffff
   0x86e <ez_signal_init+20>:   movt r2,0xffff
   0x872 <ez_signal_init+24>:   lsr r2,r2,r3
(gdb) 

set confirm off may also reduce offence, but otherwise it was all a bit easier than I remembered it seeming when I first read about it, as the few commands above show.

I tend to only use debuggers these days as a tool of absolutely last resort. Most 'bugs' I work on are 'known bugs' like missing or incomplete features where such a tool isn't any help. And debugger features which should be useful, like variable watches, are often just too much of a pain to set up and work with; so is trying to step to the 5000th iteration of some loop, even if the debugger has support for it. Obviously i'm not having to debug production code in-place these days, thankfully.

But I remember using it heavily in the bad old days when working on Evolution which uses threads a fair bit - whether each updated revision of gdb would even be able to produce usable stack traces of a threaded application was a bit of a lucky dip. I remember not upgrading for years at a time once I got a combination of kernel and gdb that actually worked. It's probably the single most influential experience that led me to become very conservative with my tool updates. There's nothing more frustrating than having to fix a breakage for something that wasn't broken in the first place.

Sunday, 30 March 2014

what happens after the cpu starts

So I started thinking about the startup code that you might want if you were executing code on the epiphany as a kernel processor as opposed to running a "standard" C interface. One problem is you can't be loading arbitrary code while the cpu is not in a halted state so there has to be some fixed block of code which is always there which can contain that state.

data

To get my bearings I created a full memory map of the entire epiphany 16-core device. Below is an excerpt from the lowest addresses on each core as I had it initially. Yeah ... didn't really feel like watching tv.

 +----------+---- local ram start for each core
 |     0000 | sync
 |     0004 | swe
 |     0008 | memfault
 |     000c | timer0
 |     0010 | timer1
 |     0014 | message
 |     0018 | dma0
 |     001c | dma1
 |     0020 | wand
 |     0024 | swi
 +----------+------
 |     0028 | extmem
 |     002c | group_id
 |     0030 | group_rows
 |     0034 | group_cols
 |     0038 | core_row
 |     003c | core_col
 |     0040 | group_size
 |     0044 | core_index
 +----------+------
 |     0048 | 
 |     004c | 
 |     0050 | 
 |     0054 | 
 +----------+------
 |     0058 | .text initial code
 |          |

The base load address matched the current sdk because I was originally compatible with e-lib.

I tried a couple of ideas but settled on leaving room for 8 words (easy indexing) and found a good use for every slot.

 |          |
 +----------+------
 |     0048 | argv0
 |     004c | argv1
 |     0050 | argv2
 |     0054 | argv3
 |     0058 | entry
 |     005c | sp
 |     0060 | imask
 |     0064 | exit code
 +----------+------
 |     0068 | .text.init
 |          |

Being able to load r0-r3 and set the stack pointer from the host allows any argument list to be created - i.e. the 'main' routine takes native types as arguments rather than nothing at all. This can help simplify some startup issues because you don't need to synchronise with anything to get the data on startup and for simple cases might be all you need.

Being able to set the entry point allows multiple 'kernels' to be stored in the same binary and be invoked without having to reload the code. It also 'fixes' the problem of having to trampoline to a 32-bit address via the startup routine if you want an off-core main (which may have limited uses).

And the imask just allows some initial system state to be configured. There may need to be more.

And finally the exit code allows the kernel to return some value. Actually i'm not sure how useful it is but there was an empty slot and it was easy to add.
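For reference, the eight slots could be described on the C side with something like the sketch below. The names ez_tcb_t, EZ_TCB_BASE and EZ_TCB_ADDR are hypothetical; only the offsets come from the map above. It also assumes doubleword loads want 8-byte aligned addresses, which is why the pairs (argv0, argv1), (argv2, argv3) and (entry, sp) each start on an 8-byte boundary.

```c
#include <stddef.h>
#include <stdint.h>

/* Local-memory address of the first slot (argv0) in the map above. */
#define EZ_TCB_BASE 0x48u

/* Hypothetical C view of the 8-word block; offsets mirror the map. */
typedef struct {
    uint32_t argv[4];    /* 0x48..0x54: end up in r0-r3 */
    uint32_t entry;      /* 0x58: kernel entry point */
    uint32_t sp;         /* 0x5c: initial stack pointer */
    uint32_t imask;      /* 0x60: initial interrupt mask */
    uint32_t exit_code;  /* 0x64: written by _exit */
} ez_tcb_t;

/* Core-local address of a given field. */
#define EZ_TCB_ADDR(field) (EZ_TCB_BASE + (uint32_t)offsetof(ez_tcb_t, field))

/* Host-side helper: fill a staging copy before writing it to the core. */
static void ez_tcb_prepare(ez_tcb_t *tcb, uint32_t entry, uint32_t sp,
                           uint32_t a0, uint32_t a1, uint32_t a2, uint32_t a3) {
    tcb->argv[0] = a0;
    tcb->argv[1] = a1;
    tcb->argv[2] = a2;
    tcb->argv[3] = a3;
    tcb->entry = entry;
    tcb->sp = sp;
    tcb->imask = 0;
    tcb->exit_code = 0;
}
```

The host would fill a staging copy like this, write it to the core at EZ_TCB_BASE, and then raise the sync signal.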

code

So to the reset interrupt handler. This is just a rough sketch of where my thoughts are at the moment and i haven't tried it on the machine so it may contain some silly mistakes or other thinkos.

        .section .text.ivt0
_ivt_sync:      
        b.l     _isr_sync       ;sync

Currently in ez-loader the section name .text.ivt0 sets the interrupt vector although because it can create the ivt entries on the fly it may be something that can be removed. Or have it work as an override in which case all the following code is never used. The compiler outputs various sections for interrupt handlers too and I intend to remain compatible with those even though interrupt handlers in c can be a bit nasty due to the size of the register file and lack of multi push/pull instructions.

        .section .text.init
_isr_sync:
        ;; load initial state
        mov     r7,#tcb_off
        ldrd    r0,[r7,#tcb_argv0_1]      ; argv0, argv1
        ldrd    r2,[r7,#tcb_argv2_3]      ; argv2, argv3
        ldrd    r12,[r7,#tcb_entry_sp]    ; entry, sp
        ldr     r4,[r7,#tcb_imask]        ; imask

Then we come to the meat and potatoes. First the state is loaded from the 'task control block'. Double-word loads are used where possible and most values are loaded directly into their final desired location.

        ;; set frame pointer
        mov     fp,sp

        ;; set imask
        movts   imask,r4

A couple of values then need to be set by hand. There may need to be other init work done here such as clearing ipend.

        ;; set entry point
        movts   iret,r12

Rather than drop down to user-mode to run setup routines as required by a full c runtime this just jumps straight to the kernel entry point via iret. This saves the need to resolve the pre-main 'start' trampoline routine too.

        ;; link to exit routine
        mov     lr,%low(_exit)

Similarly because main isn't being called via a user-mode stub using bl or jalr the link register needs to be set manually. This can also go directly to _exit without passing go. Because this is intended to be on-core it only needs to load the bottom 16 bits of the address.

        ;; launch main directly
        rti

And after that ... there's nothing more to do. rti will clear the interrupt state, enable interrupts and jump to the chosen kernel entry point. r0-r3 and the stack will contain any args required, and when it finishes it will return from the function ...

_exit:  mov     r7,#tcb_off
        str     r0,[r7,#tcb_exit_code]
1:      idle
        b.s     1b
        .size   _exit, .-_exit

... and end up directly at exit. This can save out the return code from the function and then 'shuts down' via an idle and repeat loop. And at this point new code can (relatively) safely be loaded, or just the entry and arguments changed to relaunch the code with a sync signal. It may need another field to indicate the core state such as running or exited and that might be more useful than having an exit code.

The nice thing about the dynamic loader I have is that I can separate this startup mechanism from the code completely - with a bit more work it doesn't even need to be linked to it. Because code is relocated at runtime I can change the startup code or tcb structure without forcing a recompile and only the workgroup config stuff needs to remain fixed.

For example another option may be to have a job dispatch loop replace most of the above and avoid the need to have to add it externally. It could even potentially be loading in the code itself from asynchronously en-queued jobs like hsa does. Possibly even using the hsa dispatch packet or some subset of it. Hmm, thoughts.

Saturday, 29 March 2014

ez-lib-tiny

So I just finished writing two versions of the 'tiny' elib equivalent - but completely untested so it will have bugs and may change once those are fixed. One is in C, and one is all in assembly.

The functionality is basically the same as e-lib but uses a different runtime e_config and the apis were modified slightly in order to gain efficiency and sometimes ease of use. For example the dma code takes the lower 16 bits of the dma config register base address rather than an index which makes no real difference to the user but shaves a few precious bytes off the implementation. Another example is the irq functions which take a mask and a set of bits so any combination of bits can be modified in a single invocation rather than having to call multiple times for different ones. Some of the other functions like the global address mapping functions i de-parameterised into separate special purpose functions since generally you know what type of address you have and it makes quite a difference if you specialise and this is the sort of stuff the compiler needs some help with. I just noticed I missed some of the coreid manipulation stuff though.
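The mask-plus-bits idea is just the usual masked update. A minimal sketch of the arithmetic, with a made-up helper name (ez_bits_update) and no claim that this is how the real irq functions are written:

```c
#include <stdint.h>

/* Replace only the bits selected by 'mask' with the matching bits of
 * 'bits'; everything outside the mask is left untouched. */
static inline uint32_t ez_bits_update(uint32_t old, uint32_t mask, uint32_t bits) {
    return (old & ~mask) | (bits & mask);
}
```

So one call with mask 0x0f and bits 0x05 sets bits 0 and 2 and clears bits 1 and 3, which would otherwise need separate set and clear calls.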

The ultimate goal is actually to put a lot of stuff in inline blocks which will actually generate less code and help the compiler but to compare the total code sizes I just generated a file with every function in it and mirrored the functionality identically between assembly and c.

Anyway ... (C compiled with -O2 -mshort-calls -std=gnu99)

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny-c.o
   text    data     bss     dec     hex filename
    932       0       0     932     3a4 ez-lib-tiny-c.o

notzed@minized:~/src/elf-loader$ e-size ez-lib-tiny.o
   text    data     bss     dec     hex filename
    696       0       0     696     2b8 ez-lib-tiny.o

As a comparison the current e-lib.a contains 3072 bytes of code (and a few more trivial functions). Also related is the c startup/support code (crt0 etc) which takes about another 2k above a minimal start-up implementation I am now using (none of which is included in the above numbers).

The smaller one is of course in assembly language. The biggest pieces were the timer stuff because it can't really be parameterised, and particularly the barrier implementation. I think i spotted some bugs in the timer implementation with its handling of the config register (sigh, why did everything have to go through the one register) but i'm not sure they're bugs. In actual code this will actually shrink even in worst case because quite a few of these will be inlined instead, and generate code fragments which are smaller than a function invocation and without side effects like clobbering work registers.

I managed to squeeze the barrier implementation into only the lowest 8 registers which helped reduce the code-size quite a bit. The whole implementation is 88 bytes - it's amazing how quickly they add up (only 38 instructions). Of course I haven't verified it yet so it could all be terribly broken too.

I made some changes to the api to simplify the code and usage. It only takes a single array pointer which must be a local-core address of group-size bytes of memory. The memory must be at the same address in every core which allows the implementation to implicitly calculate any address required.

This is the C version, and the assembly is just a straightforward hand-compilation of that.

void ez_barrier_init(ez_barrier_t *barrier_array) {
        for (int i=0;i<ez_config->group_size;i++)
                barrier_array[i] = 0;
}
Rather than add code to special-case core0, just have every core clear their barrier block. Only one byte of the non-control core is actually used but why add the extra code to the library if it doesn't hurt.
void ez_barrier(ez_barrier_t *barrier_array) {
        volatile ez_barrier_t *ba = barrier_array;
        int index = ez_config->core_index;

        if (index == 0) {
                // Wait for all others (not us, we know we're here)
                for (int i=ez_config->group_size-1;i>0;i--) {
                        while (ba[i] == 0)
                                ;
                        // We can reset the local flag immediately
                        ba[i] = 0;
                }

                // Notify (do us because it doesn't hurt and simplifies the loops)
                for (int r=ez_config->group_rows;r>0;r--) {
                        for (int c=ez_config->group_cols;c>0;c--) {
                                volatile ez_barrier_t *rb = ez_global_core(barrier_array, r-1, c-1);

                                rb[0] = 0;
                        }
                }
        } else {
                volatile ez_barrier_t *root = ez_global_core(barrier_array, 0, 0);

                // Mark local signal and notify root
                ba[0] = 1;
                root[index] = 1;

                // Await clear
                while (ba[0])
                        ;
        }
}

In contrast to the assembly version this compiles into 140 bytes (-O2). I know it's pretty much pissing in the wind at this point but, you know, hobbies generally are. The one in e-lib hits 252 but a sizeable chunk of that is the floating point unit config manipulation needed for an integer multiply - not needed in my case because I tweaked the workgroup config to include the same info pretty much exactly for this purpose (and wanting a flat index for the current core is extremely common in practice, even in 2d kernels). The e-lib version does require more run-time memory though: 16 + 16*4 = 80 bytes vs just 16 for a 4x4 epiphany.

(as an aside, it's a real bummer the hardware barrier has bugs which probably prevent it from working on anything but a whole-chip workgroup. That would've been much faster and taken much less code to implement. I will add a separate api entry point for hardware barriers; a whole-chip barrier is certainly useful for many applications.)

I might be able to apply some of the optimisation techniques I employed in the assembly version to further improve it too - for example the assembly version basically collapses all 'calls' to ez_global_core() into a single calculation in the prologue. This feedback from coding in assembly and trying to work out what the compiler is doing has happened a few times although it isn't very reliable and might make no difference. Note that the compiler has in-lined almost all of the local calls like ez_global_core() already which helps code-size quite a bit (this is a version of e_get_global_address() that doesn't have special case code for global addresses or E_SELF).
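As a sketch of why that collapses so nicely: assuming the usual epiphany address layout (the top 12 bits are the core id, 6 bits of row then 6 bits of column, and the low 20 bits are the core-local offset), moving to a neighbouring core is just adding a constant. The function below is hypothetical and takes absolute row/column numbers, unlike ez_global_core() which works relative to the workgroup.

```c
#include <stdint.h>

#define CORE_COL_STEP (1u << 20)  /* next column: core id + 1 */
#define CORE_ROW_STEP (1u << 26)  /* next row: core id + 64 */

/* Hypothetical helper: global address of a local offset on core (row,col). */
static inline uint32_t global_core_addr(uint32_t row, uint32_t col, uint32_t local) {
    uint32_t coreid = ((row & 0x3fu) << 6) | (col & 0x3fu);
    return (coreid << 20) | (local & 0xfffffu);
}
```

Computing the address for one corner of the group once and then stepping by CORE_COL_STEP or CORE_ROW_STEP inside the notify loops avoids redoing the shifts and masks per core.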

I removed the separate reset loop on the controller core code path since each slot in the array has a dedicated user and there's no need to synchronise with the others before clearing the local flags.

Oh, and in an attempt to help improve the latency it starts writing from the farthest away core first. If one takes a single row it should end up having all the writes arrive at approximately the same time as the train of writes arrives at each destination in lock-step. That's the idea anyway. I need to check it against the routing algorithm to see if it should be by columns instead of rows although it might not make any difference.

I should've really been out enjoying another unusually warm day but ... yeah i didn't. Mowed the lawn though and nearly headed into the city for an afternoon drink, but somehow got distracted for about 8 hours and now it's dark and the lights are still off (Hmmm. Just fixed). Better go hunt for food I guess.

add/or, or, add?

Another micro-optimisation that gcc isn't grabbing.

Example code:

 unsigned int a;
 unsigned int b;

 unsigned int foo(void *dma) {
     return ((unsigned int)dma << 16) | 1;
 }

 -->
00000000 _foo:
   0:   0216            lsl r0,r0,0x10
   2:   2023            mov r1,0x1
   4:   00fa            orr r0,r0,r1
   6:   194f 0402       rts

This is part of the calculation to form a dma start code.

This compiles literally as expected - but orr requires a register argument so it needs an additional register and the load (unfortunately, would be nice to have a fully orthogonal instruction set on such matters). Because the lower 16 bits will always be clear thanks to the shift one can just use an add instead.

 unsigned int fooa(void *dma) {
     return ((unsigned int)dma << 16) + 1;
 }

 -->
00000000 _fooa:
   0:   0216            lsl r0,r0,0x10
   2:   0093            add r0,r0,1
   4:   194f 0402       rts

If the constant is greater than 3 bits (or something that can be made with a negative 3-bit number) then the code-size will grow by two bytes - however it is still only 2 instructions and requires no auxiliary register and allows setting up to 10 bits with a constant.
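The substitution is safe because a | b equals a + b whenever a and b share no set bits, so nothing can carry. A small host-side check of that, with made-up function names:

```c
#include <stdint.h>

/* Both forms of the dma start word; desc_lo16 stands in for the low
 * 16 bits of the descriptor address. */
static inline uint32_t dma_start_or(uint32_t desc_lo16) {
    return (desc_lo16 << 16) | 1u;
}

static inline uint32_t dma_start_add(uint32_t desc_lo16) {
    return (desc_lo16 << 16) + 1u;
}

/* Returns 1 if the two forms agree for every 16-bit input. */
static int dma_start_forms_agree(void) {
    for (uint32_t v = 0; v <= 0xffffu; v++)
        if (dma_start_or(v) != dma_start_add(v))
            return 0;
    return 1;
}
```

The same holds for any constant that fits entirely in the cleared low bits.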

Actually trying to inline some dma setting stuff hit some interesting issues. Because the base pointer to the dma control packet is only turned into an integer it isn't referenced - it's possible for the compiler to completely optimise the initialisation code out of existence.

static void dma_start(int chan, ez_dma_desc_t *dma) {
   uint start = ((uint)dma << 16) | 1;

   set_reg(E_DMA0CONFIG, start);
}

static void dma_run(int chan, ez_dma_desc_t *dma) {
   dma_wait(chan);
   dma_start(chan, dma);
   dma_wait(chan);
}

void ez_dma_memcpy(int chan, void *dst, void *src, size_t size) {
 uint align = ((uint) dst | (uint)src | (uint)size) & 0x7;
 uint shift = dma_shift[align];
 ez_dma_desc_t dma;

 dma.config = 3 + (shift << 5);
 dma.inner_stride = 0x00010001 << shift;
 dma.count = 0x10000 | (size >> shift);
 dma.outer_stride = 0x00010001 << shift;
 dma.src_addr = src;
 dma.dst_addr = dst;

 dma_run(chan, &dma);
}
If this code is in the same file optimisation may decide to inline all the functions even if they weren't marked as such. Since the only use of dma is as an integer it "loses" its pointedness and thus its reference to the object. I dunno, I suppose it's a valid optimisation but an interesting gotcha nonetheless. It is fixed by making the dma declaration volatile.
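The dma_shift table isn't shown above. A plausible guess is that it maps the combined alignment of dst, src and size to the widest transfer (1 << shift bytes) that alignment allows. The sketch below is written under that assumption, with hypothetical names and a descriptor accessed through a volatile pointer so the stores can't be dropped; it only fills the descriptor and does not touch any hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for ez_dma_desc_t, same field names as the post. */
typedef struct {
    uint32_t config;
    uint32_t inner_stride;
    uint32_t count;
    uint32_t outer_stride;
    const void *src_addr;
    void *dst_addr;
} dma_desc_sketch_t;

/* Assumed table: index is (dst | src | size) & 7, value is log2 of the
 * widest transfer size that alignment permits (8, 4, 2 or 1 bytes). */
static const uint8_t dma_shift_sketch[8] = { 3, 0, 1, 0, 2, 0, 1, 0 };

/* Fill the descriptor the same way ez_dma_memcpy() does above. */
static void dma_desc_fill(volatile dma_desc_sketch_t *d,
                          void *dst, const void *src, size_t size) {
    uint32_t align = (uint32_t)(((uintptr_t)dst | (uintptr_t)src | (uintptr_t)size) & 0x7u);
    uint32_t shift = dma_shift_sketch[align];

    d->config = 3u + (shift << 5);
    d->inner_stride = 0x00010001u << shift;
    d->count = 0x10000u | (uint32_t)(size >> shift);
    d->outer_stride = 0x00010001u << shift;
    d->src_addr = src;
    d->dst_addr = dst;
}
```

An 8-byte aligned 64-byte copy would then get shift 3 and a count of 8 doublewords, while anything odd falls back to byte transfers.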

Actually I had some other strange behaviour with this routine. Originally I was using an initialiser to set the content of dma. As in:

    ez_dma_desc_t dma = {
        .config = 3 + (shift << 5),
        .inner_stride = 0x00010001 << shift,
        etc
    };

But yeah, this did weird shit. It seemed to build the full content of the structure on the stack in a staging buffer, and then copy it to the actual structure, also on the stack. In -Os mode this even uses a call to memcpy? *shrug* I'm really not sure why it should do that unless it's trying to implement some alignment restriction but I'm pretty sure structs are 8-byte aligned anyway (they would have to be). It's a syntax I use all the time because it's so handy ...

ez-lib-tiny

I've been working on a more compact 'e-lib' for epiphany. The savings come mostly from generous use of inline and inline asm where appropriate, in many cases converting a function call into fewer instructions than the invocation sequence. But I have also investigated the code size of almost every routine compared to hand crafted code.

The compiler always loses, sometimes by over a 100% increase over the hand-crafted code size.

Bullshit quote of the day: "compilers can create better code than you can".

So obviously I've been looking at a lot of compiler output in the last few days and that's clearly far from the truth.

Actually i'm pretty bummed out that compilers are still not able to do a really great job "in this day and age", even with a cpu that has such a simple instruction set and a ton of registers which should be very compiler friendly. I think it just shows what a complex optimisation problem translating high level text into machine code really is.

I'm not having a go at compiler writers but at those who claim how good compilers are, usually from a position of ignorance and because they read it somewhere and it sounded authoritative. Optimising compilers are dreadfully complex things and trying to convert expert knowledge into an algorithm isn't an easy task at all.

Thursday, 27 March 2014

unix or gnu: sh n awk. finding the next lun.

Had a query from a mate about writing a sh script for a specific purpose. He had one or more files containing a natural number on each line (luns) and wanted to find out where the next hole was.

Apart from being a bit ... i dunno, surprised that a "unix sysadmin" of 25 years (i.e. a lifer) wasn't a total Bourne Shell fiend with lashings of awk and perl for dessert ... it seemed a simple lunch-break 'challenge' so I came up with a couple of solutions.

First I just used bash. It's a bit clumsy though.

#!/bin/sh

# usage: nextlun file ...

n=0
cat $* | sort -n | while read c; do
    if [ $c -ne $n ] ; then
        echo $n
        exit 1;
    fi
    n=`expr $n + 1`
done
# above runs in sub-shell so doesn't update n
if [ $? -eq 0 ]; then
    last=`cat $* | sort -n | tail -1`
    echo `expr $last + 1`
fi

I probably should've known because I use it quite often but I didn't realise (or forgot) "while read" runs in a sub-shell.

However, awk makes this much easier and is the sort of thing it's really good at. The algorithm is identical though.

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }' 

Neither handles blank lines properly, so an easy fix should that be necessary in the awk:

#!/bin/sh

cat $* | sort -n \
 | awk -e 'BEGIN { n=0; } /[0-9]+/ { if (n != $0) { exit 0; } else { n=n+1; } } END { print n; }' 

A grep could perform the same duty in the bash version as well.

This is the core of what makes "unix" actually something worth using. The whole system itself is the "integrated development environment". And all that power is available to any user who wants it without having to buy some overpriced application to do it.

Imagine how many people would use a spreadsheet for such a simple task - and then have to do it manually every time to rub salt into the wound. So not only do you have to pay real cash for the privilege, you have to keep on paying with your own time - which is something you can never buy back for any price.

Update: As a further emphasis on the last point I was relaying this anecdote to a mate of mine, one who is also in an area where scripts should be a comfortable notion (dba on unix systems). He actually suggested using "excel" to do the same task. But then again he did earn his wings as a dba being paid by the hour ... so perhaps can be forgiven for not minding a bit of menial busy work ;-)

Wednesday, 26 March 2014

inlining register reads

42.

I've been looking at the way the epiphany on-core library implements a couple of functions in order to improve them. Among the simplest are the special register read/write functions used for dma and other purposes.

 unsigned e_reg_read(e_core_reg_id_t reg_id);

 unsigned e_reg_read(e_core_reg_id_t reg_id)
 {
        volatile register unsigned reg_val;
        unsigned *addr;

        // TODO: function affects integer flags. Add special API for STATUS
        switch (reg_id)
        {
        case E_REG_CONFIG:
                __asm__ __volatile__ ("MOVFS %0, CONFIG" : "=r" (reg_val) : );
                return reg_val;
        case E_REG_STATUS:
                __asm__ __volatile__ ("MOVFS %0, STATUS" : "=r" (reg_val) : );
                return reg_val;
        default:
                addr = (unsigned *) e_get_global_address(e_group_config.core_row,
                                     e_group_config.core_col, (void *) reg_id);
                return *addr;
        }
 }

As alluded to in the comments this actually breaks reading the status register anyway ... and it is incomplete. There are 42 special registers but because they are not contiguous and the actual memory address is passed to the function the compiler generates either a giant jump table or a long sequence of nested branches searching for the switch target.

And apart from this any calling code needs to go via a function invocation which may be as simple as a branch but is more likely to be a 32-bit load followed by a jsr, and then the function itself needs to implement a switch.

Update: I added some markup to the output. Bold is the code that is required to do the actual job, for the first examples it includes the e_reg_read function implementation but in the ideal case it is a single instruction. Italic is bad code either due to incorrect implementation or the compiler going whacko the didlio for whatever reason.

00000000 _e_reg_read:
   0:   40e2            mov r2,r0
   2:   000b 0042       mov r0,0x400
   6:   01eb 1002       movt r0,0xf
   a:   d65c 2700       str lr,[sp],-0x4
   e:   283a            sub r1,r2,r0
  10:   2800            beq 60 <_e_reg_read+0x60>
  12:   008b 0042       mov r0,0x404
  16:   01eb 1002       movt r0,0xf
  1a:   283a            sub r1,r2,r0
  1c:   1700            beq 4a <_e_reg_read+0x4a>
  1e:   000b 0002       mov r0,0x0
  22:   200b 0002       mov r1,0x0
  26:   000b 1002       movt r0,0x0
  2a:   200b 1002       movt r1,0x0
  2e:   600b 0002       mov r3,0x0
  32:   0044            ldr r0,[r0]
  34:   2444            ldr r1,[r1]
  36:   600b 1002       movt r3,0x0
  3a:   0d52            jalr r3
  3c:   d64c 2400       ldr lr,[sp,+0x4]
  40:   0044            ldr r0,[r0]
  42:   b41b 2402       add sp,sp,16
  46:   194f 0402       rts
  4a:   0512            movfs r0,status
  4c:   15dc 0400       str r0,[sp,+0x3]
  50:   15cc 0400       ldr r0,[sp,+0x3]
  54:   d64c 2400       ldr lr,[sp,+0x4]
  58:   b41b 2402       add sp,sp,16
  5c:   194f 0402       rts
  60:   0112            movfs r0,config
  62:   15dc 0400       str r0,[sp,+0x3]
  66:   15cc 0400       ldr r0,[sp,+0x3]
  6a:   d64c 2400       ldr lr,[sp,+0x4]
  6e:   b41b 2402       add sp,sp,16
  72:   194f 0402       rts
  76:   01a2    nop
And an example call:
unsigned a;
void foo(void) {
 a = e_reg_read(E_REG_STATUS);
}
 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo:
   0:   008b 0042       mov r0,0x404
   4:   200b 0002       mov r1,0x0
   8:   d55c 2700       str lr,[sp],-0x2
   c:   200b 1002       movt r1,0x0
  10:   01eb 1002       movt r0,0xf
  14:   0552            jalr r1
  16:   400b 0002       mov r2,0x0
  1a:   400b 1002       movt r2,0x0
  1e:   0854            str r0,[r2]
  20:   d54c 2400       ldr lr,[sp,+0x2]
  24:   04e2            mov r0,r1
  26:   b41b 2401       add sp,sp,8
  2a:   194f 0402       rts
  2e:   01a2    nop

Can we do better?

inline

So the solution is to inline it. Simply moving e_reg_read to an inline function in a header helps. Well sort of helps.

 static inline unsigned ex_reg_read(e_core_reg_id_t reg_id)
 .. exactly the same

void foo_inline(void) {
        a = ex_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000000 _foo_inline:
   0:   b41b 24ff       add sp,sp,-8
   4:   0512            movfs r0,status
   6:   15dc 0400       str r0,[sp,+0x3]
   a:   35cc 0400       ldr r1,[sp,+0x3]
   e:   000b 0002       mov r0,0x0
  12:   000b 1002       movt r0,0x0
  16:   2054            str r1,[r0]
  18:   810b 2002       mov r12,0x8
  1c:   b61f 248a       add sp,sp,r12
  20:   194f 0402       rts

Yeah, not sure what's going on there to make it go through the stack. Maybe the type or something. What the hell? (Oh, I later worked it out: the unnecessary volatile on the reg_val value is forcing a store to and a read from the stack, which is not desirable at all. I filed a bug in prickhub.)

Actually I'm going backwards here: I had already written this and wanted to compare it to what currently happens, so let's just forget all that and see what I came up with.

static inline uint32_t
ez_reg_read(e_core_reg_id_t id) {
        register uint32_t v;

        switch (id) {
        case E_REG_CONFIG:
                asm volatile ("movfs %0,config" : "=r" (v));
                break;
        case E_REG_STATUS:
                asm volatile ("movfs %0,status" : "=r" (v));
                break;
        default:
                v = *((volatile uint32_t *) e_get_global_address(e_group_config.core_row,
                     e_group_config.core_col, (void *) id));
                break;
        }
        return v;
}
And an example.
void fooz_inline(void) {
    a = ez_reg_read(E_REG_STATUS);
}

 --> e-gcc -std=gnu99 -O2  -c -o e-foo.o e-foo.c

00000024 _fooz_inline:
   0:   2512            movfs r1,status
   2:   000b 0002       mov r0,0x0
   6:   000b 1002       movt r0,0x0
   a:   2054            str r1,[r0]
   c:   194f 0402       rts

Ok that's what I wanted. This helps the compiler generate much better code too since the result can go in any register, no scratch registers need to be saved, etc.

So I took this and proceeded to add all 42 of the special registers ... and then compiled an example that had more than one register read and ... damn. Suddenly it decides that it doesn't want to inline it anymore and turns every get into a function call to a 912 byte function. Oops. __attribute__((always_inline)) fixed that at least. Although it's different depending on the compiler - on the device I don't need to add the always_inline thing (or maybe some tiny detail was different).

However, things get a bit nasty when a non-constant expression is passed. Suddenly it inlines as much of that gigantic switch statement as it needs - potentially the whole lot if it doesn't know the range. Obviously not much use.

gcc has one last trick ... __builtin_constant_p(x). This returns true if the compiler can tell at compile time that the argument is a constant. Since on the epiphany any special register can also be read by a global memory access (which the function already uses as its fallback), this can be used to decide which path to use.

#define ez_reg_read(x) (__builtin_constant_p(x) \
    ? _ez_reg_read(x) \
    : (*((volatile uint32_t *) e_get_global_address( \
       e_group_config.core_row, e_group_config.core_col, (void *) x))))

The macro decides whether to call _ez_reg_read() which will compile into a single movfs instruction, or fall-back to the memory load path (this may have some nasty unrolled-loop cases, hopefully unlikely). Although ... It's probably not terribly important to support non-constant parameters because any tools that need it can do it themselves and the api could just be for known registers (the enum implies that already).

Given how much code the current implementation compiles into it doesn't seem worth special casing any registers at all and it could just fall back to a global memory load every time. I'm led to believe that movfs/movts goes through the same physical path (unfortunately) and since the required address is the key to the function there's little more to do.

I think elib can be shrunk significantly by changing some of the parameterisation and judicious use of inlining.

So ... whilst playing with a version of e_dma_wait() using this ... I think I found another gcc bug when __builtin_constant_p() is used, so yeah, non-constant args are in the bin.

Update: So it seems I missed the brackets on the macro which someone kindly pointed out on the forums ... oops. Actually before that I had written up the implementation and realised that the macro wasn't doing anything magical and __builtin_constant_p() can just go into the inline function itself.
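To show why the missing brackets matter, here is a small host-side sketch. The macro names are made up for the illustration and have nothing to do with elib; the point is only that a cast in front of an unbracketed macro argument binds to the first operand rather than to the whole expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Argument not bracketed: the cast binds only to the first operand. */
#define TO_BYTE_BAD(x)  ((uint8_t) x)
/* Argument bracketed: the cast applies to the whole expression. */
#define TO_BYTE_GOOD(x) ((uint8_t)(x))

/* Stores both results for the expression a + b so they can be compared. */
static void cast_demo(int a, int b, int *bad, int *good) {
    assert(bad != NULL && good != NULL);
    *bad  = TO_BYTE_BAD(a + b);   /* expands to ((uint8_t) a + b)   */
    *good = TO_BYTE_GOOD(a + b);  /* expands to ((uint8_t)(a + b))  */
}
```

With a = 200 and b = 100 the first form yields 300 and the second 44, which is the same class of surprise as passing an expression to a macro that does (void *) x.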

Below is the code so far which results in identical output.

I thought it wouldn't work with no optimisation turned on but it seems to do the right thing. It still in-lines the ez_reg_read() but drops to the memory access path. This won't work for the config and status registers because they need special handling according to Andreas but I just realised I had separate entry points for those anyway so it should be ok. I'm not sure why C would ever be reading status anyway.

ez_regval_t 
EZ_ALWAYS_INLINE
ez_reg_read(ez_regid_t id) {
 register uint32_t v;

 if (__builtin_constant_p(id)) {
  switch (id) {
   // bank 0
  case E_REG_CONFIG:
   asm volatile ("movfs %0,config" : "=r" (v));
   break;
  case E_REG_STATUS:
   asm volatile ("movfs %0,status" : "=r" (v));
   break;
  case E_REG_PC:
   asm volatile ("movfs %0,pc" : "=r" (v));
   break;
  case E_REG_DEBUGSTATUS:
   asm volatile ("movfs %0,debug" : "=r" (v));
   break;
  ... every other named register ...
  default:
   // unknown register, who cares
   v = 0;
   break;
  }
 } else {
  v = *(volatile uint32_t *)ez_global_core_self((void *)id);
 }

 return v;
}
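
The same pattern can be tried on a host compiler without any hardware. This is only a sketch: the register ids, the fake_regs array and the demo_ names are stand-ins invented for the example, and an ordinary array load takes the place of both movfs and the memory-mapped read.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register ids and backing store standing in for the hardware. */
enum { REG_CONFIG = 0, REG_STATUS = 1, REG_COUNT = 2 };
static uint32_t fake_regs[REG_COUNT] = { 0x11u, 0x22u };

static inline __attribute__((always_inline)) uint32_t
demo_reg_read(unsigned id) {
    assert(id < REG_COUNT);
    if (__builtin_constant_p(id)) {
        /* Constant id: once inlined, only one case survives. */
        switch (id) {
        case REG_CONFIG: return fake_regs[REG_CONFIG];
        case REG_STATUS: return fake_regs[REG_STATUS];
        default:         return 0;
        }
    }
    /* Non-constant id: generic indexed load, like the memory access path. */
    return ((volatile uint32_t *)fake_regs)[id];
}

/* Reads the id through a volatile so the generic path is exercised. */
uint32_t demo_read_runtime(volatile unsigned *idp) {
    return demo_reg_read(*idp);
}
```

Both paths return the same value by construction, so the result does not depend on whether the compiler decided the argument was constant; only the generated code differs.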

Tuesday, 25 March 2014

epiphany stack frame

I came across this months ago but forgot some of the finer details.

I thought that 8-byte empty area looked odd in generated code but I think it's there to simplify stacking saved registers due to the lack of pre-decrement addressing modes (otherwise it doesn't make much sense - leaf function use wouldn't matter if it had such addressing modes).

It's in the gcc source.

/* EPIPHANY stack frames look like:

             Before call                       After call
        +-----------------------+       +-----------------------+
        |                       |       |                       |
   high |  local variables,     |       |  local variables,     |
   mem  |  reg save area, etc.  |       |  reg save area, etc.  |
        |                       |       |                       |
        +-----------------------+       +-----------------------+
        |                       |       |                       |
        |  arguments on stack.  |       |  arguments on stack.  |
        |                       |       |                       |
  SP+8->+-----------------------+FP+8m->+-----------------------+
        | 2 word save area for  |       |  reg parm save area,  |
        | leaf funcs / flags    |       |  only created for     |
  SP+0->+-----------------------+       |  variable argument    |
                                        |  functions            |
                                 FP+8n->+-----------------------+
                                        |                       |
                                        |  register save area   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  local variables      |
                                        |                       |
                                  FP+0->+-----------------------+
                                        |                       |
                                        |  alloca allocations   |
                                        |                       |
                                        +-----------------------+
                                        |                       |
                                        |  arguments on stack   |
                                        |                       |
                                  SP+8->+-----------------------+
   low                                  | 2 word save area for  |
   memory                               | leaf funcs / flags    |
                                  SP+0->+-----------------------+

compiler strangeness

So I hit a strange issue with gcc. Well i don't know ... not 'strange', just unexpected. It probably doesn't matter much on x86 because it has so few registers and such a shitty breadth of addressing modes, but on arm and epiphany it generates some pretty shit load/store code unless an unexpected optimisation flag is turned on (one that -O3 includes, and even then it only helps sometimes?).

Easiest to demonstrate in the epiphany instruction set.

A simple example:

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 id[0] = e_group_config.core_row;
 id[1] = e_group_config.core_col;
}

 -->

   0:   000b 0002       mov r0,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   000b 1002       movt r0,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   000b 0002       mov r0,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   000b 1002       movt r0,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2044            ldr r1,[r0]
  1e:   000b 0002       mov r0,0x0
                        1e: R_EPIPHANY_LOW      .bss+0x4
  22:   000b 1002       movt r0,0x0
                        22: R_EPIPHANY_HIGH     .bss+0x4
  26:   2054            str r1,[r0]

Err, what?

It's basically going to the linker to resolve every memory reference (all those R_* reloc records), even for the array. At first I thought this was just an epiphany-gcc thing but i cross checked on amd64 and arm with the same result. Curious.

Curious also ...

extern const e_group_config_t e_group_config;
int id[8];
void foo(void) {
 int *idp = id;
 const e_group_config_t *ep = &e_group_config;

 idp[0] = ep->core_row;
 idp[1] = ep->core_col;
}

 -->
   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   2444            ldr r1,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   000b 1002       movt r0,0x0
                        e: R_EPIPHANY_HIGH      .bss
  12:   2054            str r1,[r0]
  14:   200b 0002       mov r1,0x0
                        14: R_EPIPHANY_LOW      _e_group_config+0x20
  18:   200b 1002       movt r1,0x0
                        18: R_EPIPHANY_HIGH     _e_group_config+0x20
  1c:   2444            ldr r1,[r1]
  1e:   20d4            str r1,[r0,0x1]

This fixes the array references, but not the struct references.

If one hard-codes the pointer address (which is probably a better idea anyway - yes it really is) and uses the pointer-to-array trick, then things finally reach the most-straightforward-compilation I get by just looking at the code and thinking in assembly (which is how i always look at memory-accessing code).

#define e_group_config ((const e_group_config_t *)0x28)
int id[8];
void foo(void) {
  int *idp = id;
  idp[0] = e_group_config->core_row;
  idp[1] = e_group_config->core_col;
}

 -->

   0:   2503            mov r1,0x28
   2:   47c4            ldr r2,[r1,0x7]
   4:   000b 0002       mov r0,0x0
                        4: R_EPIPHANY_LOW       .bss
   8:   000b 1002       movt r0,0x0
                        8: R_EPIPHANY_HIGH      .bss
   c:   4054            str r2,[r0]
   e:   244c 0001       ldr r1,[r1,+0x8]
  12:   20d4            str r1,[r0,0x1]

Bit of a throwing-hands-in-the-air moment.

Using -O3 on the original example gives something reasonable:

   0:   200b 0002       mov r1,0x0
                        0: R_EPIPHANY_LOW       _e_group_config+0x1c
   4:   200b 1002       movt r1,0x0
                        4: R_EPIPHANY_HIGH      _e_group_config+0x1c
   8:   4444            ldr r2,[r1]
   a:   000b 0002       mov r0,0x0
                        a: R_EPIPHANY_LOW       .bss
   e:   24c4            ldr r1,[r1,0x1]
  10:   000b 1002       movt r0,0x0
                        10: R_EPIPHANY_HIGH     .bss
  14:   4054            str r2,[r0]
  16:   20d4            str r1,[r0,0x1]

Which is what it should've been doing to start with. After testing every optimisation flag different between -O3 and -O2 I found that it was -ftree-vectorize that activates this 'optimisation'.

I can only presume the cost model of offset address calculations is borrowing too much from x86 where the lack of registers and addressing modes favours pre-calculation every time. -O[s23] compile this the same on amd64 as one would expect.

   0:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 6 
                        2: R_X86_64_PC32        e_group_config+0x18
   6:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # c 
                        8: R_X86_64_PC32        .bss-0x4
   c:   8b 05 00 00 00 00       mov    0x0(%rip),%eax        # 12 
                        e: R_X86_64_PC32        e_group_config+0x1c
  12:   89 05 00 00 00 00       mov    %eax,0x0(%rip)        # 18 
                        14: R_X86_64_PC32       .bss+0x7c

It might seem insignificant but the initial code size is 40 bytes vs 24 for the optimised (or 20 using hard address) - these minor things can add up pretty fast.

Looks like epiphany will need a pretty specific set of optimisation flags to get decent code (just using -O3 on its own usually bloats the code too much).

Alternate runtime

I'm actually working toward an alternate runtime for epiphany cores. Just the e-lib stuff and loader anyway.

I was looking at creating a more epiphany optimised version of e_group_config and e_mem_config, both to save a few bytes and make access more efficient. I was just making sure every access could fit into a 16-bit instruction when a test build surprised me.

I've come up with this group-info structure which leads to more compact code for a variety of reasons:

struct ez_config_t {
    uint16_t reserved0;
    uint16_t reserved1;

    uint16_t group_size;
    uint16_t group_rows;
    uint16_t group_cols;

    uint16_t core_index;
    uint16_t core_row;
    uint16_t core_col;

    uint32_t group_id;
    void *extmem;

    uint32_t reserved2;
    uint32_t reserved3;
};

The layout isn't random - shorts are all within a 3-bit offset so a single 16-bit instruction can load them. The whole structure supports some expansion slots, all of which fit in with the 3-bit offset constraint for the data-type, and there is room for some bytes if necessary.
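
The 3-bit constraint can be checked mechanically on any host. The sketch below copies the layout as listed above, with one assumption: extmem is declared as a uint32_t so the offsets match the 32-bit epiphany even on a 64-bit host. The struct and macro names are made up for the check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the layout above; extmem is a uint32_t here (assumption) so the
   check gives the same answer on a 64-bit host as on the 32-bit target. */
struct ez_config_sketch {
    uint16_t reserved0, reserved1;
    uint16_t group_size, group_rows, group_cols;
    uint16_t core_index, core_row, core_col;
    uint32_t group_id;
    uint32_t extmem;
    uint32_t reserved2, reserved3;
};

/* Scaled index of a field: byte offset divided by the access size. */
#define FIELD_INDEX(type, field) \
    (offsetof(type, field) / sizeof(((type *)0)->field))

/* The short load/store form has a 3-bit scaled index, so it must be < 8. */
#define FITS_3BIT(type, field) (FIELD_INDEX(type, field) < 8)

/* Returns 1 if every field can be reached with a 3-bit scaled index. */
static int layout_is_compact(void) {
    typedef struct ez_config_sketch T;
    int ok = FITS_3BIT(T, group_size) && FITS_3BIT(T, group_rows)
          && FITS_3BIT(T, group_cols) && FITS_3BIT(T, core_index)
          && FITS_3BIT(T, core_row)   && FITS_3BIT(T, core_col)
          && FITS_3BIT(T, group_id)   && FITS_3BIT(T, extmem)
          && FITS_3BIT(T, reserved2)  && FITS_3BIT(T, reserved3);
    assert(sizeof(T) == 32);
    return ok;
}
```

Turning the checks into compile-time assertions would catch a field being added in the wrong place before it ever costs a 32-bit instruction.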

To test it I access every value once:

        #define ez_configp ((ez_config_t *)(0x28))
        int *idp = id;
        idp[0] = ez_configp->group_size;
        idp[1] = ez_configp->group_rows;
        idp[2] = ez_configp->group_cols;
        idp[3] = ez_configp->core_index;
        idp[4] = ez_configp->core_row;
        idp[5] = ez_configp->core_col;

        idp[6] = ez_configp->group_id;
        idp[7] = (int32_t)ez_configp->extmem;

 -->
   0:   2503            mov r1,0x28
   2:   000b 0002       mov r0,0x0
   6:   4524            ldrh r2,[r1,0x2]
   8:   000b 1002       movt r0,0x0
   c:   4054            str r2,[r0]
   e:   45a4            ldrh r2,[r1,0x3]
  10:   40d4            str r2,[r0,0x1]
  12:   4624            ldrh r2,[r1,0x4]
  14:   4154            str r2,[r0,0x2]
  16:   46a4            ldrh r2,[r1,0x5]
  18:   41d4            str r2,[r0,0x3]
  1a:   4724            ldrh r2,[r1,0x6]
  1c:   4254            str r2,[r0,0x4]
  1e:   47a4            ldrh r2,[r1,0x7]
  20:   42d4            str r2,[r0,0x5]
  22:   4744            ldr r2,[r1,0x6]
  24:   27c4            ldr r1,[r1,0x7]
  26:   4354            str r2,[r0,0x6]
  28:   23d4            str r1,[r0,0x7]

And the compiler's done exactly what you would expect here. Load the object base address and then simply access everything via an indexed access taking advantage of the hand-tuned layout to use a 16-bit instruction for all of them too.

I've included a couple of pre-calculated flat index values because these things are often needed in practical code and certainly to implement any group-wide primitives. This is somewhat better than the existing api which must calculate them on the fly.

    int *idp = id;
    idp[0] = e_group_config.group_rows * e_group_config.group_cols;
    idp[1] = e_group_config.group_rows;
    idp[2] = e_group_config.group_cols;
    idp[3] = e_group_config.group_row * e_group_config.group_cols + e_group_config.group_col;
    idp[4] = e_group_config.group_row;
    idp[5] = e_group_config.group_col;

    idp[6] = e_group_config.group_id;
    idp[7] = (int32_t)e_emem_config.base;

 --> -Os with default fpu mode
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   804c 2000       ldr r12,[r0,+0x0]
   c:   000b 0002       mov r0,0x0
  10:   000b 1002       movt r0,0x0
  14:   4044            ldr r2,[r0]
  16:   000b 4002       mov r16,0x0
  1a:   000b 0002       mov r0,0x0
  1e:   000b 1002       movt r0,0x0
  22:   010b 5002       movt r16,0x8
  26:   2112            movfs r1,config
  28:   0392            gid
  2a:   411f 4002       movfs r18,config
  2e:   487f 490a       orr r18,r18,r16
  32:   410f 4002       movts config,r18
  36:   0192            gie
  38:   0392            gid
  3a:   611f 4002       movfs r19,config
  3e:   6c7f 490a       orr r19,r19,r16
  42:   610f 4002       movts config,r19
  46:   0192            gie
  48:   0a2f 4087       fmul r16,r2,r12
  4c:   80dc 2000       str r12,[r0,+0x1]
  50:   800b 2002       mov r12,0x0
  54:   800b 3002       movt r12,0x0
  58:   4154            str r2,[r0,0x2]
  5a:   005c 4000       str r16,[r0]
  5e:   104c 4400       ldr r16,[r12,+0x0]
  62:   800b 2002       mov r12,0x0
  66:   800b 3002       movt r12,0x0
  6a:   412f 0807       fmul r2,r16,r2
  6e:   904c 2400       ldr r12,[r12,+0x0]
  72:   025c 4000       str r16,[r0,+0x4]
  76:   82dc 2000       str r12,[r0,+0x5]
  7a:   4a1f 008a       add r2,r2,r12
  7e:   41d4            str r2,[r0,0x3]
  80:   400b 0002       mov r2,0x0
  84:   400b 1002       movt r2,0x0
  88:   4844            ldr r2,[r2]
  8a:   4354            str r2,[r0,0x6]
  8c:   400b 0002       mov r2,0x0
  90:   400b 1002       movt r2,0x0
  94:   48c4            ldr r2,[r2,0x1]
  96:   43d4            str r2,[r0,0x7]
  98:   0392            gid
  9a:   611f 4002       movfs r19,config
  9e:   6c8f 480a       eor r19,r19,r1
  a2:   6ddf 480a       and r19,r19,r3
  a6:   6c8f 480a       eor r19,r19,r1
  aa:   610f 4002       movts config,r19
  ae:   0192            gie
  b0:   0392            gid
  b2:   011f 4002       movfs r16,config
  b6:   008f 480a       eor r16,r16,r1
  ba:   01df 480a       and r16,r16,r3
  be:   008f 480a       eor r16,r16,r1
  c2:   010f 4002       movts config,r16
  c6:   0192            gie

 --> -O3 with -mfp-mode=int
   0:   000b 0002       mov r0,0x0
   4:   000b 1002       movt r0,0x0
   8:   2044            ldr r1,[r0]
   a:   000b 0002       mov r0,0x0
   e:   000b 1002       movt r0,0x0
  12:   6044            ldr r3,[r0]
  14:   000b 0002       mov r0,0x0
  18:   000b 1002       movt r0,0x0
  1c:   804c 2000       ldr r12,[r0,+0x0]
  20:   4caf 4007       fmul r18,r3,r1
  24:   000b 4002       mov r16,0x0
  28:   000b 5002       movt r16,0x0
  2c:   000b 0002       mov r0,0x0
  30:   662f 4087       fmul r19,r1,r12
  34:   000b 1002       movt r0,0x0
  38:   204c 4800       ldr r17,[r16,+0x0]
  3c:   000b 4002       mov r16,0x0
  40:   4044            ldr r2,[r0]
  42:   000b 5002       movt r16,0x0
  46:   000b 0002       mov r0,0x0
  4a:   00cc 4800       ldr r16,[r16,+0x1]
  4e:   000b 1002       movt r0,0x0
  52:   491f 480a       add r18,r18,r2
  56:   605c 4000       str r19,[r0]
  5a:   80dc 2000       str r12,[r0,+0x1]
  5e:   2154            str r1,[r0,0x2]
  60:   41dc 4000       str r18,[r0,+0x3]
  64:   6254            str r3,[r0,0x4]
  66:   42d4            str r2,[r0,0x5]
  68:   235c 4000       str r17,[r0,+0x6]
  6c:   03dc 4000       str r16,[r0,+0x7]

Unless the code has no flops the -mfp-mode=int option is probably not very useful but this probably represents the best it could possibly do. And there's some real funky config register shit going on there in the -Os version but that just has to be a bug.

Oh blast, and the absolute loads are back anyway!

My hands might not stay attached if i keep throwing them up in the air at this point.

For the hexadecimal challenged (i.e. me) each fragment is 42, 200, and 112 bytes long respectively. And each uses 3, 9, or 9 registers.
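
For reference, the pre-calculated flat values mentioned earlier are cheap to produce once when the group is set up. A minimal sketch, assuming a row-major core index; the group_info struct and make_group_info() are hypothetical names for the illustration, not part of any existing api.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the group info, with the derived values included. */
struct group_info {
    uint16_t group_size, group_rows, group_cols;
    uint16_t core_index, core_row, core_col;
};

/* Fills in the two derived values once so kernels never recompute them. */
static struct group_info
make_group_info(uint16_t rows, uint16_t cols, uint16_t row, uint16_t col) {
    struct group_info g;
    assert(row < rows && col < cols);
    g.group_rows = rows;
    g.group_cols = cols;
    g.group_size = (uint16_t)(rows * cols);       /* cores in the group   */
    g.core_row   = row;
    g.core_col   = col;
    g.core_index = (uint16_t)(row * cols + col);  /* row-major flat index */
    return g;
}
```

Doing the multiply at setup time means the per-core code only ever needs a half-word load to get either value.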

Sunday, 23 March 2014

Saturday evening elf-loader hackery

Wasn't much on TV so I kept poking fairly late last night. I had a look at a Java binding to the code.

It's kind of looking a bit like OpenCL but without any of the queuing stuff.

I tried creating an empty demo to try out the api and I'm going to need a bit more runtime support to make it practical. So at present this is how the api might work for a mandelbrot painter.

First the communication structures that live in the epiphany code.

#define WIDTH 512
#define HEIGHT 512

struct shared {
 float left, top, right, bottom;
 jbyte status[16]; 
};

// Shared comm block
struct shared shared EZ_SHARED;
// RGBA pixels
byte pixels[WIDTH * HEIGHT * 4] EZ_SHARED;

And then an example main.

        EZPlatform plat = EZPlatform.init("system.hdf", EZ_SHARED_POINTERS);
        EZWorkgroup wg = plat.createWorkgroup(0, 0, 4, 4);

        EZProgram eprog = EZProgram.load("emandelbrot.elf");

        // Halt the cores
        wg.reset();

        // Bind program to all cores
        wg.bind(eprog, 0, 0, 4, 4);

        // Link/load the code
        wg.load();

        // Access comms structures
        ByteBuffer shared = wg.mapSymbol("_shared");
        ByteBuffer pixels = wg.mapSymbol("_pixels");

        // Job parameters
        shared.putFloat(0).putFloat(0).putFloat(1).putFloat(1).put(new byte[16]).rewind();

        // Start calculation
        wg.start();

        // Wait for all jobs to finish
        for (int i = 0; i < 16; i++) {
            while (shared.get(i + 16) == 0)
                try {
                    Thread.sleep(1);
                } catch (InterruptedException ex) {
                    Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
                }
        }
        
        // Use pixels.
        // ...

It's all pretty straightforward and decent until the job completion stuff. I probably want some way of abstracting that to something re-usable. Perhaps one day there will be some hardware support as well negating the need to poll the result. But just being able to look up structures by name is a big plus over the way you have to do it with the existing tools.
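
One way the completion check could be pulled out into something re-usable is sketched below in C rather than Java. The function name and the bounded poll count are assumptions made for the example; a real host loop would sleep between passes the way the Java code above does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 once every status byte is non-zero, or 0 if max_polls passes
   over the array did not see them all set. Spins rather than sleeps. */
static int wait_all_done(const volatile uint8_t *status, size_t n,
                         unsigned max_polls) {
    assert(status != NULL);
    for (unsigned poll = 0; poll < max_polls; poll++) {
        size_t done = 0;
        /* Advance past every job that has already flagged completion. */
        while (done < n && status[done] != 0)
            done++;
        if (done == n)
            return 1;
    }
    return 0;
}
```

The bound turns a hung core into a reportable failure instead of a host thread that never returns.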

This (non-existent) example is just a one-shot execution but it already supports a persistent server mode. Perhaps it would also be useful to be able to support multi-kernel one-shot operation, e.g. choose the kernel and then a SYNC will launch a different main. If I do that then supporting kernel arguments would become useful although it's only worth it if the latency is ok versus the code size of a dispatch loop approach.

At the moment the .load() function is probably the interesting one. Internally this first relocates and links all the code to an arm-local buffer. Then it just memcpy's each program to every core it is bound to. This state is remembered so it is possible to switch the functionality of a whole workgroup with a relatively cheap call. I don't think there's enough memory to do anything sophisticated like double-buffer the code though and given the alu to bandwidth mismatch as it is it probably wouldn't be much help anyway.

I do already have an 'EPort' primitive I included in the Java api. It's basically a non-locking cyclic counter which can be used to implement single writer / single reader queues very efficiently on the epiphany just using local memory reads and remote memory writes (i.e. non-blocking if not full and no mesh impact if it is). It's a bit limited though as for example you can only reserve or consume one slot at a time. Still useful mind you and it works with host-core as well as core-core.
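
To make the idea concrete, here is a rough sketch of a single writer / single reader queue built on two free-running counters. It is not the actual EPort code: the struct, the function names and the queue depth are invented, everything lives in one struct for simplicity, and it assumes the data write becomes visible before the counter update, which a real multi-core version would have to guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_SLOTS 4u   /* hypothetical queue depth, must be a power of two */

/* head is written only by the producer and tail only by the consumer, so
   neither side needs a lock or an atomic read-modify-write. */
struct port {
    volatile uint32_t head;   /* next slot to fill  */
    volatile uint32_t tail;   /* next slot to drain */
    uint32_t data[PORT_SLOTS];
};

/* Producer: returns 1 and stores v if a slot is free, 0 if the queue is full. */
static int port_put(struct port *p, uint32_t v) {
    assert(p != NULL);
    if (p->head - p->tail == PORT_SLOTS)
        return 0;
    p->data[p->head & (PORT_SLOTS - 1u)] = v;
    p->head = p->head + 1u;   /* publish after the data write */
    return 1;
}

/* Consumer: returns 1 and loads *v if an item is ready, 0 if empty. */
static int port_get(struct port *p, uint32_t *v) {
    assert(p != NULL && v != NULL);
    if (p->head == p->tail)
        return 0;
    *v = p->data[p->tail & (PORT_SLOTS - 1u)];
    p->tail = p->tail + 1u;
    return 1;
}
```

Like the primitive described above, this only reserves or consumes one slot per call; the unsigned subtraction keeps the full/empty tests correct when the counters wrap.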

I need to brush up again on some of the hardware workgroup support to see what other efficient primitives can be implemented (weird, the 4.13.x revision of the arch reference has vanished from the parallella site). Should be able to get a barrier at least although it's a bit more work having it work off-chip. Personally I think a mutex has no place in massive parallel hardware, although without a hardware atomic counter or mailboxes options are limited.

But maybe another day. I thought i'd had enough beer on Thursday (pretty much the last day of summer, 32 degrees and a warm balmy evening - absolutely awesome) but after finding out what the new contract is focussed on I'm ready for a Sunday Session even if it's just in my own back yard.

Saturday, 22 March 2014

Saturday arvo elf-loader hack-a-thon

So I hacked a bit more of the elf loader today. Initially it was just documenting where I got to so I could work out where to go next. I wrote up a bit more background and detail on how it works. Then I cleared out the out of date experimental stuff and focussed on the ez_ interfaces.

But then I got a bit side-tracked ...

Compact startup

First I thought i'd see if i could replace crt0 with my own. Apart from an initialisation issue with bss and data there isn't really anything wrong with the bundled one ... apart from it dragging in a bunch of C support stuff which isn't necessary if writing small kernel code that doesn't need the full libc.

So I took the crt0.s and stripped out a bunch of stuff. The trampoline to a potentially off-core start routine (it's going to be on-core). The atexit init (not necessary). The constructors init (hopefully not necessary). The .bss clearing (it's problematic when you have possibly more than one block of bss such as shared and given that .data isn't re-initialised anyway the C language behaviour is already broken). And the argument setting (three zeros isn't useful for anything and isn't even correct).

I toyed with the idea of passing arguments to the kernel but decided to just have a void main instead.

I just pass -nostartfiles e-crt0minimal.o to the link line to replace the standard start-up code.

 e-gcc -Wl,-r -o e-test-reloc.elf -nostartfiles e-crt0minimal.o e-test-reloc.o -le-lib

Worth it? Probably ...

Minimal crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
    548      64     120     732     2dc e-test-workgroup-a.elf

Standard crt0:

$ e-size e-test-workgroup-a.elf
   text    data     bss     dec     hex filename
   1418    1196     128    2742     ab6 e-test-workgroup-a.elf

When you've only got 32K to play with saving 2K isn't to be sniffed at.

Matching addresses (aka fake hsa)

Next I looked at just using mmap to map the shared memory so pointers can be shared between the epiphany and the host arm directly. I tried to use e_get_platform_info() to get the list of memory blocks but for some odd reason that zeros out the memory array pointer? Odd. So ... I just access the struct directly via an extern instead.

This is just an implementation of stuff from a previous post but using the platform_info to find the addresses.

I have no idea whether this will work on a multi-epiphany setup but since I don't have one it's not something i'll lose sleep over :)

*COM* symbols

About this point I noticed that some symbols weren't being allocated any location in the output file and thus could not be resolved during the loader-linker execution. These were symbols marked with the section id of COMMON. I had hit this before but I had forgotten all about it. Last time I solved it using a linker script but I found I can just pass '-d' to the linker to achieve the same result which puts any such values into bss.

Automatic remote-core on-chip symbol resolution

Then I had a look into implementing fully automatic resolution of remote but on-chip symbols. The options are limited but the desired target core can be indicated by the symbol name.

For example:

program a:
   extern int buffer[12];

program b:
   // current cell relative +0, +1
   extern int buffer_0_1[12];

   // group relative 0,1
   extern int buffer$0$1[12];

Or some variation thereof. This is easy enough to parse and implement, and not too ugly to use.
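
As a rough idea of how little parsing is involved, the sketch below splits such a name from the end of the string. The enum, struct and function names are invented for the example, and it only handles non-negative decimal offsets.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

enum ref_kind { REF_LOCAL, REF_CORE_RELATIVE, REF_GROUP_RELATIVE };

struct sym_ref {
    enum ref_kind kind;
    size_t base_len;   /* length of the base symbol name */
    int row, col;
};

/* Walks back over trailing digits; returns the index of the first digit,
   or end if there are none. */
static size_t digits_start(const char *s, size_t end) {
    size_t i = end;
    while (i > 0 && isdigit((unsigned char)s[i - 1]))
        i--;
    return i;
}

/* Parses "name<sep>row<sep>col" from the end of the string, where sep is
   '$' (group relative) or '_' (relative to the current core). Anything
   else is treated as a plain local symbol. */
static struct sym_ref parse_sym(const char *name) {
    struct sym_ref r = { REF_LOCAL, 0, 0, 0 };
    size_t len, c0, r0;
    char sep;

    assert(name != NULL);
    len = strlen(name);
    r.base_len = len;

    c0 = digits_start(name, len);        /* start of the column digits */
    if (c0 == len || c0 < 2)
        return r;
    sep = name[c0 - 1];
    if (sep != '$' && sep != '_')
        return r;
    r0 = digits_start(name, c0 - 1);     /* start of the row digits */
    if (r0 == c0 - 1 || r0 < 2 || name[r0 - 1] != sep)
        return r;

    r.kind = (sep == '$') ? REF_GROUP_RELATIVE : REF_CORE_RELATIVE;
    r.base_len = r0 - 1;
    r.row = atoi(name + r0);
    r.col = atoi(name + c0);
    return r;
}
```

Parsing from the end means a base name that itself contains underscores or digits is still handled, although a local symbol that happens to end in _N_M would be misread, which is one argument for the '$' form.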

But it would mean that the binary for every core would need to be linked individually and thus it wouldn't be possible to just copy the same code across to multiple cores when they share the implementation. For this reason I've dropped the idea for now. Having to use e_get_global_address() on a weak symbol isn't too difficult.

Friday, 21 March 2014

'easy' elf loader for parallella

I prettied up some of the stuff I did a few months ago on the parallella code loader and uploaded it to the home page.

It is still very much work in progress (just a bunch of experiments) but it is currently able to take several distinct epiphany-core programs and relocate and cross-link them to any on-core topology - at runtime. Remote addresses in other cores can be partially resolved automatically (to the local core address offset - suitable for e_get_global_address()) by using weak symbols and the host can resolve symbols by name. By default sections go on-core but .section directives can redirect individual records to specific banks or to global memory. bss/text/data are all supported for any such section using standard names (no 'code' sections!).
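
For anyone unfamiliar with why a core-local offset is enough, a global address is just the core id placed above the local address. The helper below is a host-side sketch of that arithmetic, assuming a 12-bit core id (6 bits of row, 6 bits of column) above a 20-bit local address. It takes absolute mesh coordinates, unlike e_get_global_address() which works relative to the workgroup, and the function name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Builds a global address from absolute mesh coordinates and a core-local
   offset. Layout assumed here: upper 12 bits are the core id (6 bits row,
   6 bits column), lower 20 bits are the local address. */
static uint32_t global_address(unsigned row, unsigned col, uint32_t local) {
    assert(row < 64 && col < 64);
    assert(local < (1u << 20));
    return ((uint32_t)row << 26) | ((uint32_t)col << 20) | local;
}
```

So a weak symbol that resolves to a local offset only needs the target core's coordinates added in at runtime to become a pointer into that core.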

Linker scripts are not needed for any of this and the only 'special sauce' is that the epiphany binaries be linked with -r. I mention this because this was the primary driving factor for me to write any of this. I would probably like to replace crt0 as well but that is something for the future (basically remove the bss init stuff).

When I next poke at it I want to work toward wrapping it in an accessible Java API. There is a bit to be done before that though (I think - it's been a while since I looked at it).

thoughts on opencl + array methods

I was going to have a quick look at removing the erroneous asynchronous Get/SetPrimitiveArrayCritical() stuff from zcl this morning but I've hit a complication too far for my tired brain.

I changed to just allocating a staging buffer and using Get/Set*ArrayRegion() for the read/writeBuffer commands. I only allocate enough memory for the transfer and copy the transfer size around and so on. It's a bit bulky but it's fairly straightforward.

Then I started looking at the image interfaces and realised doing the same thing is somewhat more complex - either I have to copy the whole array to/from the staging buffer each time (if the get/set updates only a portion of the image for example) or I have to flatten the transfer myself. The former pretty much makes the function pointless and the latter bulks out the binding and may require lots of jvm calls.

So now i'm deciding whether I just force synchronous transfers for all array interfaces because they probably have some use despite synchronisation being the mind-killer, or just delete them altogether to drop a ton of code. Since i've already got all the code the former will probably be the approach I take. The event callback stuff i'm using to finish up the transaction seems pretty expensive anyway so there may not be much net difference (against a net use count of zero for the library, at that - it's just something to pass the time).

On another note I decided to use cvs as my local repository backend to store this stuff. I think the all-day-never-finished checkout of gcc finally tipped me over but I never liked subversion because it's too slow and is just shit at merging. I was surprised netbeans detected it and offered to install the cvs plugin automatically (and i was a little surprised it was already installed too). I don't need or want to use tools that weren't designed for my use-case.

Ho hum, back to work Tuesday. My boss actually apologised for taking so long to get the contract sorted but yeah, i'm not complaining! Looks like it'll mostly be a continuation of one of the projects i'm not terribly keen on too. Bummer I guess. All I really care about right now is sleep though.

Update: (I kept poking) I just removed the async handling code and forced a blocking call. Get/ReleasePrimitiveArrayCritical() is used to access the arrays directly. I'll do a release another day though.

Monday, 17 March 2014

sumatra, graal, etc.

I didn't really wanna get stuck building stuff all day but that seems to have happened.

Sumatra uses auto* so it built easily no problem. Bummer there's no javafx but hopefully that isn't far off.

graal is a bit of a pain because it uses a completely custom build/update/everything mega-tool written in a single 4KLOC piece of python. Ok when it works but a meaningless backtrace if it doesn't. Well it is early alpha software I guess. Still ... why?

Anyway ... I tried building against 'make install' from sumatra but that doesn't work, you need to point your JAVA_HOME at the Sumatra tree as the docs tell you to. My mistake there.

So it turns out getting the hsail tools to build had some point after all. The hsailasm downloaded by the "build" tool (in lib/okra-1.8-with-sim.jar) won't work against the libelf 0.x included in slackware (might be easier just building libelf 1.x). So I added the path to the hsailasm I built myself ... and ...

 export PATH=/home/notzed/hsa/HSAIL-Instruction-Set-Simulator/build/HSAIL-Tools:$PATH
 ./mx.sh  --vm server unittest -XX:+TraceGPUInteraction \
   -XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest
[HSAIL] library is libokra_x86_64.so
[HSAIL] using _OKRA_SIM_LIB_PATH_=/tmp/okraresource.dir_7081062365578722856/libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8
.[thread:1] scope: 
  [thread:1] scope: GraalCompiler
    [thread:1] scope: GraalCompiler.CodeGen
    Nothing to do here
    Nothing to do here
    Nothing to do here
    version 0:95: $full : $large;
// static method HotSpotMethod
kernel &run (
        align 8 kernarg_u64 %_arg0,
        align 8 kernarg_u64 %_arg1,
        align 8 kernarg_u64 %_arg2
        ) {
        ld_kernarg_u64  $d0, [%_arg0];
        ld_kernarg_u64  $d1, [%_arg1];
        ld_kernarg_u64  $d2, [%_arg2];
        workitemabsid_u32 $s0, 0;
                                           
@L0:
        cmp_eq_b1_u64 $c0, $d0, 0; // null test 
        cbr $c0, @L1;
@L2:
        ld_global_s32 $s1, [$d0 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L12;
@L3:
        cmp_eq_b1_u64 $c0, $d2, 0; // null test 
        cbr $c0, @L4;
@L5:
        ld_global_s32 $s1, [$d2 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L11;
@L6:
        cmp_eq_b1_u64 $c0, $d1, 0; // null test 
        cbr $c0, @L7;
@L8:
        ld_global_s32 $s1, [$d1 + 12];
        cmp_ge_b1_u32 $c0, $s0, $s1;
        cbr $c0, @L10;
@L9:
        cvt_s64_s32 $d3, $s0;
        mul_s64 $d3, $d3, 4;
        add_u64 $d1, $d1, $d3;
        ld_global_s32 $s1, [$d1 + 16];
        cvt_s64_s32 $d1, $s0;
        mul_s64 $d1, $d1, 4;
        add_u64 $d2, $d2, $d1;
        ld_global_s32 $s2, [$d2 + 16];
        add_s32 $s2, $s2, $s1;
        cvt_s64_s32 $d1, $s0;
        mul_s64 $d1, $d1, 4;
        add_u64 $d0, $d0, $d1;
        st_global_s32 $s2, [$d0 + 16];
        ret;
@L1:
        mov_b32 $s0, -7691;
@L13:
        ret;
@L4:
        mov_b32 $s0, -6411;
        brn @L13;
@L10:
        mov_b32 $s0, -5403;
        brn @L13;
@L7:
        mov_b32 $s0, -4875;
        brn @L13;
@L12:
        mov_b32 $s0, -8219;
        brn @L13;
@L11:
        mov_b32 $s0, -6939;
        brn @L13;
};

[HSAIL] heap=0x00007f47a8017a40
[HSAIL] base=0x95400000, capacity=108527616
External method:com.oracle.graal.compiler.hsail.test.IntAddTest.run([I[I[II)V
installCode0: ExternalCompilationResult
[HSAIL] sig:([I[I[II)V  args length=3, _parameter_count=4
[HSAIL] static method
[HSAIL] HSAILKernelArguments::do_array, _index=0, 0xdd563828, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=1, 0xdd581718, is a [I
[HSAIL] HSAILKernelArguments::do_array, _index=2, 0xdd581778, is a [I
[HSAIL] HSAILKernelArguments::not pushing trailing int

Time: 0.213

OK (1 test)

Yay? I think?
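The HSAIL dump reads fairly directly if you assume the constants 12 and 16 are the offsets of the length field and the first element of a Java int[] (that is only a reading of the listing, not something the tools report). Under that assumption, a minimal C sketch of what each work-item appears to do, with a hypothetical struct and names standing in for the Java arrays, and a plain -1 standing in for the various negative codes the failure labels load:

```c
#include <stddef.h>

/* Stand-in for a Java int[]; the struct and names are hypothetical. */
struct int_array {
        int length;
        int *data;
};

/*
 * What one work-item of the dumped kernel appears to compute.
 * 'id' plays the role of workitemabsid.  Returns 0 on success, or -1
 * where the HSAIL branches to one of its failure labels.
 */
static int int_add_item(struct int_array *out, const struct int_array *a,
                        const struct int_array *b, unsigned int id)
{
        /* null test then unsigned bounds test, in the order the dump does them */
        if (out == NULL || id >= (unsigned int)out->length)
                return -1;
        if (b == NULL || id >= (unsigned int)b->length)
                return -1;
        if (a == NULL || id >= (unsigned int)a->length)
                return -1;

        out->data[id] = b->data[id] + a->data[id];
        return 0;
}
```

So most of the listing is null and bounds checks; the actual work is the three loads/stores and one add in the @L9 block.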

Maybe not ... it seems that it's only using the simulator. I tried using LD_LIBRARY_PATH and -Djava.library.path to redirect to the libokra from the Okra-Interface-to-HSA-Device library but that just hangs after the "base=0x95..." line after dumping the hsail. strace isn't showing anything obvious so i'm not sure what's going on. Might've hit some ubuntu compatibility issue at last or just a mismatch in versions of libokra.

On the other hand ... it was noticeable that something was happening with the gpu as the mouse started to judder, yet a simple ctrl-c killed it cleanly. Just that alone once it makes it into OpenCL will be worth its weight in cocky shit rather than just hard locking X as it does with catalyst.

Having just typed that ... one test too many and it decides to crash into an unkillable process and do weird stuff (and not long after I had to reboot the system). But at least that is to be expected for alpha software and i've been pretty surprised by the overall system stability all things considered.

I think next time I'll just have a closer look at aparapi because at least I have that working with the APU and i'm a bit sick of compiling other people's code and their strange build systems. Sumatra and graal are very large and complex projects and a bit more involved than I'm really interested in right now. I haven't used aparapi before anyway so I should have a look.

If the slackware vs ubuntu thing becomes too much of a hassle I might just go and buy another hdd and dual-boot; I already have to multi-boot to switch between opencl+accelerated javafx vs apu.

Update: Actually there may be something more to it. I just tried creating my own aparapi thing and it crashed in the same way so maybe i was missing some env variable or it was due to a suspend/resume cycle.

So I just had another go at getting the graal test running on hsail and I think it worked:

 export JAVA_HOME=/home/notzed/hsa/sumatra-dev/build/linux-x86_64-normal-server-release/images/j2sdk-image
 export PATH=${JAVA_HOME}/bin:${PATH}
 export LD_LIBRARY_PATH=/home/notzed/hsa/Okra-Interface-to-HSA-Device/okra/dist/bin
 ./mx.sh  --vm server unittest  -XX:+TraceGPUInteraction -XX:+GPUOffload -G:Log=CodeGen hsail.test.IntAddTest

[HSAIL] library is libokra_x86_64.so
[GPU] registered initialization of Okra (total initialized: 1)
JUnit version 4.8

...

Time: 0.198

OK (1 test)

Still, i'm not sure what to do with it yet ...

I was going to have a play today but I just got the word on work starting again (I can probably push it out to Monday) so I might just go to the pub or just for a ride - way too nice to be inside getting monitor burn. I foolishly decided to walk into the city yesterday for lunch with a mate I haven't seen for years (and did a few pubs on the way home - i was in no rush!) but just ended up with a nice big blister and sore feet for my troubles. It's about a 45 minute walk into the city but I don't do much walking.

I want to see if anything interesting comes out of Sony's GDC talk first though (the vr one? - yes the VR one, it's just started).

And ... done. Interesting, but still early days. Even if they release a model for the public at a mass-market price it's still going to have to be a long term project. Likely the first 5-10 years will just be experimentation and getting the technology to the point where it is good and cheap enough.

Update: Yep, i'm pretty sure it's just a problem with suspend/resume. I just tried running it after a resume and it panicked the kernel.

Building the hsail tools.

I had a lot more trouble than this post suggests getting this stuff to compile which is probably why i sound pissed off (hint: i am, actually pro tip: i often am), but this summarises the results.

I started with the instructions in README.hsa from the hsa branch of gcc. See the patches below though.

I installed libelf from slackware 14.1 but had to build libdwarf manually. I got libdwarf-20140208 from the libdwarf home page. I actually created a SlackBuild for it but i'm uncertain about providing it to slackbuilds.org right now (mostly because it seems a bit too niche).

Then this should probably work:

  mkdir hsa
  cd hsa
  git clone --depth 1 https://github.com/HSAFoundation/HSAIL-Instruction-Set-Simulator.git
  git clone --depth 1 https://github.com/HSAFoundation/HSAIL-Tools
  cd HSAIL-Instruction-Set-Simulator/src/
  ln -s ../../HSAIL-Tools
  cd ..
  mkdir build
  cd build
  cmake -DCMAKE_BUILD_TYPE=Debug ..
  make -j3 _DBG=1 VERBOSE=1

Actually I originally built it from the Okra-Interface-to-HSAIL-Simulator thing, but that checks out both of these tools as part of its android build process together with the llvm compiler this does. Ugh. And then proceeds to compile whilst hiding the details of what it's actually doing and simultaneously ignoring-all-fatal-errors-along-the-way through the wonders of an ant script. Actually I could add more but I might offend for no purpose - after all I did get it compiled in the end. I'm just a bit more impatient than I once was :)

Patches

It doesn't look for libelf.h anywhere other than /usr/include so I had to add this to HSAIL-Instruction-Set-Simulator. Of course if you have lib* in /usr/local you have to change it to that instead. Sigh. I thought this sort of auto-discovery was a solved problem.


diff --git a/CMakeLists.txt b/CMakeLists.txt
index 083e9c6..c86192c 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -25,8 +25,9 @@ set(CMAKE_CXX_FLAGS "-DTEST_PATH=${PROJECT_SOURCE_DIR}/test ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-DBIN_PATH=${PROJECT_BINARY_DIR}/HSAIL-Tools ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-DOBJ_PATH=${PROJECT_BINARY_DIR}/test ${CMAKE_CXX_FLAGS}")
 set(CMAKE_CXX_FLAGS "-I/usr/include/libdwarf ${CMAKE_CXX_FLAGS}")
+set(CMAKE_CXX_FLAGS "-I/usr/include/libelf ${CMAKE_CXX_FLAGS}")
 
-set(CMAKE_CXX_FLAGS_RELEASE "-O3 -UNDEBUG")
+set(CMAKE_CXX_FLAGS_RELEASE "-O2 -UNDEBUG")
 
 # add a target to generate API documentation with Doxygen
 find_package(Doxygen)

And because I used a newer version of libdwarf than ubuntu uses it has a world-shattering fatal error causing difference in one of the apis.

Apply this to HSAIL-Tools.


diff --git a/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp b/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
index 7e2d28d..fe2bc51 100644
--- a/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
+++ b/libHSAIL/libBRIGdwarf/BrigDwarfGenerator.cpp
@@ -261,7 +261,7 @@ public:
     //
     bool storeInBrig( HSAIL_ASM::BrigContainer & c ) const;
 
-    int DwarfProducerCallback2( char * name,
+    int DwarfProducerCallback2( const char * name,
                                 int    size,
                                 Dwarf_Unsigned type,
                                 Dwarf_Unsigned flags,
@@ -459,7 +459,7 @@ bool BrigDwarfGenerator_impl::generate( HSAIL_ASM::BrigContainer & c )
 // *sec_name_index is an OUT parameter, a pointer to the shared string table
 // (.shrstrtab) offset for thte name of the section
 //
-static int DwarfProducerCallbackFunc( char * name,
+static int DwarfProducerCallbackFunc( const char * name,
                                       int    size,
                                       Dwarf_Unsigned type,
                                       Dwarf_Unsigned flags,
@@ -476,7 +476,7 @@ static int DwarfProducerCallbackFunc( char * name,
 }
 
 int
-BrigDwarfGenerator_impl::DwarfProducerCallback2( char * name,
+BrigDwarfGenerator_impl::DwarfProducerCallback2( const char * name,
                                                  int    size,
                                                  Dwarf_Unsigned type,
                                                  Dwarf_Unsigned flags,

I had to use the 'git registry' to disable coloured diffs so I could just see what I was even doing here.

  git config --global color.ui false

Whoever thought that sort of bloat was a good idea in such a tool ... well I would only offend if i called him or her a fuckwit wouldn't i?

hsa, gcc hsail + brig

I didn't spend much time on it yesterday but most of it was just reading up on hsa. There's only a few slides plus a so-shittily-typeset-it-has-to-be-microsoft-word reference manual on hsail and brig. As a bonus it includes some rather poor typeface choices (too stout - the trifecta of short, thick, and fat), colours, and tables to complement the word-wrap that all say to me 'not for printing' (not that I was going to). The link to the amd IOMMUv2 spec is broken - not that i'm likely to need that. Bit of a bummer as amd docs are usually formatted pretty well and they seem to be the primary driver at this point. In general the documentation 'needs work' (seems to be my catchphrase of 2014 so far) although new docs and tools seem to be appearing in drips and drops over time.

I'd already skimmed some of it and had a general understanding but I picked up a few more details.

The queuing mechanism looks pretty nice - very simple yet able to do everything one needs in a multi-core system. I had been under the impression that the queue system had / needed some hardware support on the CPU too, but looking at it that doesn't seem necessary so it was just a misunderstanding. Or ... maybe it wasn't, since any work signalling mechanism would ideally avoid kernel interactions and/or busy waiting - either or both of which would be required for a purely software implementation. But maybe it's simpler than that.

To be honest i would have preferred that hsail was a proper assembly language rather than wrapping the meta-data in a pseudo-C++ syntax. And brig seems a bit unnecessarily on the bulky side for what is essentially a machine code encoding. At the end of the day neither are deal breakers.

The language itself is kind of interesting. Again I thought it was a slightly higher level virtual-processor than it is, something like llvm's intermediate representation or PTX. But it has a fixed maximum number of registers and the register assignment and optimisation occurs at the compiler stage and not in the finaliser - looks a lot more like say DEX than IR or Java bytecode. This makes a lot of sense unless you have a wildly different programming model. Seems a pretty reasonable and pragmatic approach to a universal machine code for modern processors.

The programming and queuing model looks like something that should fit into Epiphany reasonably well. And something that can be used to implement OpenCL with little work (beyond the compiler, but there's a few of those already).

GCC

I managed to get gcc checked out to build. The hsa tools page just points to the subversion branch with no context at all ... but after literally 8 hours trying to check it out and only being part-way through the fucking C++ standard library test suite, I gave up (I detest git but I'm no fan of subversion by any stretch of the imagination). I had to resort to the git mirror. Unfortunately gcc takes a lot longer to build than last time I had to despite having faster hardware, but that's 'progress' for you (no it's not).

I'm not sure how useful it is to me as it just generates brig directly (actually a mash up of elf with amd64 + brig) and there's no binutils to play with hsail that I can tell. But i'll document the steps I used here.

  git clone --depth 1 -b hsa git://gcc.gnu.org/git/gcc.git

  mkdir build
  cd build
  ../gcc/configure --disable-bootstrap --enable-languages=c,c++ --disable-multilib
  make

Slackware 64 is only 64-bit so I had to disable multilib support.

The example from gcc/README.hsa can then be compiled using:

  cd ..
  mkdir demo
  cd demo
  cat > hsakernel.c
extern void square (int *ip, int *rp) __attribute__((hsa, noinline));
void __attribute__((hsa, noinline)) square (int *in, int *out)
{
  int i = *in;
  *out = i * i;
}
CTRL-D
  ../build/gcc/xgcc -m32 -B../build/gcc -c hsakernel.c  -save-temps -fdump-tree-hsagen

Using -fdump-tree-hsagen outputs a dump of the raw HSAIL instructions generated.

[...]

------- After register allocation: -------

HSAIL IL for square
BB 0:
  ld_kernarg_u32 $s0, [%ip]
  ld_kernarg_u32 $s1, [%rp]
  Fall-through to BB 1

BB 1:
  ld_s32 $s2, [$s0]
  mul_s32 $s3, $s2, $s2
  st_s32 $s3, [$s1]
  ret_none 

[...]

Went through the gcc source and found a couple of useful bits. To get the global work-id I found you can use: __builtin_omp_get_thread_num() which compiles into workitemabsid_u32 ret,0. And __builtin_omp_get_num_threads() which compiles into gridsize_u32 ret,0. Both only work on dimension 0. And that seems to be about it for work-group functions.

I'm not really sure how useful it is and unless the git mirror is out of sync there hasn't been a commit for a few months so it's hard to know its future - but it's there anyway.

My understanding is that a reference implementation of a finaliser will be released at some point which will make BRIG a bit more interesting (writing one myself, e.g. for epiphany, is a bigger task than i'm interested in right now). I'm probably going to have more of a look at aparapi and the other java stuff for the time being but eventually get the llvm based tools built as well. But ugh ... CMake.

Sunday, 16 March 2014

Aparapi on HSA on Slackware on Kaveri on ASROCK

Although i've been waiting with bated breath for HSA to arrive ... the last I heard, about a month ago via the aparapi mailing list, was that the drivers weren't quite ready yet. So I was content to wait patiently a bit longer. Then somehow the first I heard that the alpha had become available was from one of the few comments on this blog, and apparently it's been out for a few weeks. I couldn't find any announcement about it?

So yesterday before I went out and this morning I followed Linux-HSA-Drivers-And-Images-AMD and SettingUpLinuxHSAMachineForAparapi trying to get something working. As i'm using a different motherboard and OS it was a little more involved although I made it more involved than it should've been by making a complete pig's breakfast out of every step along the way due to being a bit out of practice.

But after getting a working kernel built and X sorted I just ran the test example a few seconds ago:

$ ./runSquares.sh 
using source from Squares.hsail
0->0, 1->1, 2->4, 3->9, 4->16, 5->25, 6->36,
     ;7->49, 8->64, 9->81, 10->100, 11->121,
     ;12->144, 13->169, 14->196, 15->225, 16->256,
     ;17->289, 18->324, 19->361, 20->400, 21->441,
     ;22->484, 23->529, 24->576, 25->625, 26->676,
     ;27->729, 28->784, 29->841, 30->900, 31->961,
     ;32->1024, 33->1089, 34->1156, 35->1225, 36->1296,
     ;37->1369, 38->1444, 39->1521,
PASSED
$

I'm presuming 'PASSED' means it worked.

I'm not sure how much i'll do today but i'll next look at the hsa branch of aparapi, sumatra?, and then I want to look a bit closer. I haven't been able to find much detailed technical documentation yet but there is the kernel driver at least now and hopefully it's coming soon.

On Slackware

I'm using the ASROCK FM2A88X-ITX+ motherboard with Slackware64 14.1 and using the DVI and HDMI outputs in a dual-head configuration. Just getting Slackware 14.1 working on it reliably required a BIOS upgrade but i'm not sure what version it is right now.

To compile a fresh checkout of the correct kernel I tried the supplied kernel config file 3.13.0-config at first but that didn't work: it just hung on the loading kernel line from elilo. After a couple of aborted attempts I managed to get a working kernel by starting with /boot/config-generic-3.10.17 as the .config file, running make oldconfig and holding down return until it finished to accept all the defaults, then using make xconfig to make sure my filesystem driver wasn't a module (which i of course forgot the first time).

Getting dual-screen X was a bit confusing - searching for xorg.conf configuration is pretty much a waste of time, I think mostly because every config file is filled with non-important junk. But I finally managed to get it going, even if for whatever reason it comes up in cloned mode, but I can fix that manually by running xrandr after i login. Because I'm not ready to make this permanent that is good enough for me. As I was previously using the fglrx driver I had initially forgotten to de-blacklist the radeon kernel module but that was an easy fix.

This is how I set up the screen config.

$ xrandr --output HDMI-0 --right-of DVI-0

I'm not ready to make this my system yet because afaik OpenCL isn't available for this driver interface yet. Although the Okra stuff includes libamdhsacl64.so so presumably it isn't too far away.

Aparapi

I got aparapi going quite easily.

But beware, don't run '. ./env.sh' directly to start with - any error and it just closes your shell window! So test with 'sh ./env.sh' until it passes its checks.

I used the ant that comes with netbeans and I already had AMD APP SDK 2.9 and Java 8 installed.

Not sure if it's needed but I noticed a couple of variables were blank so I set them in env.sh.

export APARAPI_JNI_HOME=${APARAPI_HOME}/com.amd.aparapi.jni
export APARAPI_JAR_HOME=${APARAPI_HOME}/com.amd.aparapi

Once env.sh was sorted it built in a few seconds and the mandelbrot demo ran in suitably impressive fashion.

Well this should all keep me busy for a while ...

Wednesday, 12 March 2014

JNI, memory, etc.

So a never-ending hobby has been to investigate micro-optimisations for dealing with JNI memory transfer. I think this is at least the 4th post dedicated solely to the topic.

I spent most of the day just experimenting and kinda realised there wasn't much point but I do have some nice plots to look at.

This is testing 10M calls to a JNI function which takes an array - either byte[] or a ByteBuffer. In the first case these are pre-allocated outside of the loop.

The following tests are performed:

Elements

Uses Get/ReleaseArrayElements, which on hotspot always copies the memory to a newly allocated block.

Range alloc

Uses Get/SetArrayRegion, and inside the JNI code always allocates a new block to store the transferred data and frees it on exit.

Critical

Uses Get/ReleasePrimitiveArrayCritical to access the JVM memory directly.

ByteBuffer

Uses the JNIEnv entry points to retrieve the memory base location and size.

Range

Uses Get/SetArrayRegion but uses a pre-allocated (bss) buffer.

ByteBuffer field

Uses GetLongField and GetIntField to retrieve the package/private address and size values directly from the Buffer object. This makes it non portable.

I'm running it on a Kaveri APU with JDK 1.8.0-b129 with default options. All plots are generated using gnuplot.

Update: I came across this more descriptive summary of the problem at the time, and think it's worth a read if you've ended up here somehow.

Small arrays

The first plot shows a 'no operation' JNI call - the pointer to the memory and the size is retrieved but it is not accessed. For the Range cases only the length is retrieved.

What can be seen is that the "ByteBuffer field" implementation has the least overhead - by quite a bit compared to using the JNIEnv entry points. From the hotspot source it looks like they perform type checks which are adding to the cost.

Also of interest is the "Range alloc" plot which only differs from the "Range" operation by a malloc()/free() pair. i.e. the JNI call invocation overhead is pretty much insignificant compared to how willy-nilly C programmers throw these around. This is also timing the Java loop as well of course. The "Range" call only retrieves the array size in this case although interestingly that is slower than retrieving the two fields.
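To make the difference between those two plots concrete, here is a minimal sketch of the two native-side shapes. It is plain C so it stands alone: copy_region() is a hypothetical stand-in for GetByteArrayRegion() (just a memcpy() from a fake 'Java' array), and the function names and MAX_LEN are assumptions rather than the benchmark's own code. In the no-op plot the copy itself is skipped, which leaves the malloc()/free() pair as the whole difference.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_LEN 8192   /* arbitrary size for the sketch */

/* Hypothetical stand-in for GetByteArrayRegion(): copy out of the 'Java' array. */
static void copy_region(const unsigned char *jarray, size_t len, unsigned char *dst)
{
        memcpy(dst, jarray, len);
}

/* Pre-allocated buffer; as a zero-initialised static it lives in bss. */
static unsigned char bss_buffer[MAX_LEN];

/* "Range": copy into the pre-allocated (bss) buffer. */
static int range(const unsigned char *jarray, size_t len)
{
        if (len > MAX_LEN)
                return -1;
        copy_region(jarray, len, bss_buffer);
        /* ... the dummy load would run on bss_buffer here ... */
        return 0;
}

/* "Range alloc": identical except for the malloc()/free() pair. */
static int range_alloc(const unsigned char *jarray, size_t len)
{
        unsigned char *tmp = malloc(len ? len : 1);

        if (tmp == NULL)
                return -1;
        copy_region(jarray, len, tmp);
        /* ... the dummy load would run on tmp here ... */
        free(tmp);
        return 0;
}
```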


The next series of plots are for implementing a dummy 'load'. The read load is to add up every byte in the array, and the write load is to write the array index to the array. It's not particularly important just that it accesses the memory.
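For reference, the two loads are about as simple as they sound. A minimal sketch in C, where the function names, the unsigned byte type and the truncation of the index to a byte are all assumptions:

```c
#include <stddef.h>

/* Read load: add up every byte in the array. */
static unsigned int read_load(const unsigned char *data, size_t len)
{
        unsigned int sum = 0;

        for (size_t i = 0; i < len; i++)
                sum += data[i];
        return sum;
}

/* Write load: write the array index to the array (truncated to a byte). */
static void write_load(unsigned char *data, size_t len)
{
        for (size_t i = 0; i < len; i++)
                data[i] = (unsigned char)i;
}
```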

Well, they're all pretty close and follow the overhead plot as you would expect them to. The only real difference is between the implementations that need to allocate the memory first - but small arrays can be stored on the stack 'for free'.

The only real conclusion is: don't use GetArrayElements() or malloc space for short arrays!
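The usual way to get that 'for free' stack storage is a small fixed buffer with a malloc() fallback for anything bigger. A minimal sketch in plain C, where the 256 byte threshold and the names are assumptions, and memcpy() again stands in for GetByteArrayRegion():

```c
#include <stdlib.h>
#include <string.h>

#define STACK_LIMIT 256   /* arbitrary threshold, an assumption */

/* Sum 'len' bytes via a temporary copy, avoiding malloc() for short arrays. */
static long sum_via_copy(const unsigned char *src, size_t len)
{
        unsigned char small[STACK_LIMIT];
        unsigned char *tmp = small;
        long sum = 0;

        if (len > STACK_LIMIT) {
                /* only large transfers pay for the allocation */
                tmp = malloc(len);
                if (tmp == NULL)
                        return -1;
        }

        memcpy(tmp, src, len);   /* stand-in for GetByteArrayRegion() */
        for (size_t i = 0; i < len; i++)
                sum += tmp[i];

        if (tmp != small)
                free(tmp);
        return sum;
}
```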


Larger arrays

This is the upper area of the same plots above.

Here we see that by 8K the overhead of the malloc() is so insignificant to the small amount of work being performed that it vanishes from the time - although GetArrayElements() is still a bit slower. The Critical and field-peeking ByteBuffer edge out the rest.

And now some strange things start to happen which don't seem to have an obvious reason. Writing the data to bss and then copying it using SetArrayRegion() has become the slowest ... yet if the memory is allocated first it is nearly the fastest?

And even though the only difference between the ByteBuffer variants is how it resolves Buffer.address and Buffer.capacity ... there is a wildly different performance profile.

And now even more weirdness. Performing a read and then a write ... results in by far the worst performance from accessing a ByteBuffer using direct field access, yet just about the best when going through the JNIEnv methods. BTW the implementation rules out most cache effects - this is exactly the same memory block at exactly the same location in each case, and the linearity of the plot shows it isn't size related either.

And now GetArrayElements() beats GetArrayRegion() ...

I have no idea on this one. I re-ran it a couple of times and checked the code but perhaps I missed something.


Dynamic memory

Perhaps it's just not a very good benchmark. I also tried the other extreme of allocating the Java memory inside the loop. At least these should give some bracket.

Here we see Critical running away with it, except for the very small sizes which will be due to cache effects. The ByteBuffer results confirm the "common knowledge" that these things are expensive to allocate (much more so than malloc) so are only suitable for long-lived buffers.

Again with the SetArrayRegion + malloc stealing the show. Who knows.

It only gets worse for the ByteBuffer the more work that gets done.


The zoomed plots look a bit noisy so i'm not sure they're particularly valid. They are similar to the pre-allocated version except the ByteBuffer versions are well off the scale at that size.

After all this i'm not sure what conclusions to draw. Well for one OpenCL has so many other overheads I don't think any of these will even be a rounding error ...

Invocation

I also did some playing around with native method invocation. The goal is just to get a 'pointer' to a native resource in the JNI and just to compare the relative overheads. The calls just return it so it isn't optimised out. Each case is executed for 100M times and this is the result of a fourth run.

call

This is what I used in zcl. An object method is invoked and the instance method retrieves the pointer from 'this.p'.

calle

The same but the call is wrapped in a try { } catch { } within the loop and the method declares it throws an exception.

callp

An instance method where an anonymous pointer is passed to the JNI.

calls

A static method which takes the object as a parameter. The JNI retrieves 'this.p'.

callsp

This is the commonly used approach whereby an anonymous pointer is passed as a parameter to a static method.

The three types are the type of pointer. I was going to test this on a 32-bit platform but ran out of steam so the integers don't make much difference here. int and long are just a simple type and buffer stores a 'struct' as a ByteBuffer. This latter is how I originally implemented jjmpeg but clearly that was a mistake.

Results

    type    call    calle   callp   calls   callsp

    int     1.062   1.124   0.883   1.100   0.935
    long    1.105   1.124   0.883   1.101   0.936
    buffer  5.410   5.401   2.639   5.365   2.631

The results seemed pretty sensitive to compilation - each function is so small that there may be some margin of error.

Anyway the upshot is that there's no practical performance difference across all implementations and so the decision on which to use can be based on other factors. e.g. just pass objects to the JNI rather than the mess that passing opaque pointers creates.

And ... I think that it might be time for me to leave this stuff behind for good.