Monday, 17 March 2014

hsa, gcc hsail + brig

I didn't spend much time on it yesterday but most of it was just reading up on hsa. There's only a few slides plus a so-shittily-typeset-is-has-to-be-microsoft-word reference manual on hsail and brig. As a bonus it includes some rather poor typeface choices (too stout - the trifecta of short, thick, and fat), colours, and tables to complement the word-wrap that all say to me 'not for printing' (not that I was going to). The link to the amd IOMMUv2 spec is broken - not that i'm likely to need that. Bit of a bummer as amd docs are usually formatted pretty well and they seem to be the primary driver at this point. In general the documentation 'needs work' (seems to be my catchphrase of 2014 so far) although new docs and tools seem to be appearing in drips and drops over time.

I'd already skimmed some of it and had a general understanding but I picked up a few more details.

The queuing mechanism looks pretty nice - very simple yet able to do everything one needs in a multi-core system. I had been under the impression that the queue system had / needed some hardware support on the CPU too but looking at it but it doesn't look necessary so was just a misunderstanding. Or ... maybe it's not since any work signalling mechanism would ideally avoid kernel interactions and/or busy waiting - either or both of which would be required for a purely software implementation. But maybe it's simpler than that

To be honest i would have preferred that hsail was a proper assembly language rather than wrapping the meta-data in a pseudo-C++ syntax. And brig seems a bit unnecessarily on the bulky side for what is essentially a machine code encoding. At the end of the day neither are deal breakers.

The language itself is kind of interesting. Again I thought it was a slightly higher level virtual-processor that it is, something like llvm's intermediate representation or PTX. But it has a fixed maximum number of registers and the register assignment and optimisation occurs at the compiler stage and not in the finaliser - looks a lot more like say DEX than IR or Java bytecode. This makes a lot of sense unless you have a wildly different programming model. Seems a pretty reasonable and pragmatic approach to a universal machine code for modern processors.

The programming and queuing model looks like something that should fit into Epiphany reasonably well. And something that can be used to implement OpenCL with little work (beyond the compiler, but there's few of those already).

GCC

I managed to get gcc checked out to build. The hsa tools page just points to the subversion branch with no context at all ... but after literally 8 hours trying to check it out and only being part-way through the fucking C++ standard library test suite, I gave up (I detest git but I'm no fan of subversion by any stretch of the imagination). I had to resort to the git mirror. Unfortunately gcc takes a lot longer to build than last time I had to despite having faster hardware, but that's 'progress' for you (no it's not).

I'm not sure how useful it is to me as it just generates brig directly (actually a mash up of elf with amd64 + brig) and there's no binutils to play with hsail that I can tell. But i'll document the steps I used here.

  git clone --depth 1 -b hsa git://gcc.gnu.org/git/gcc.git

  mkdir build
  cd build
  ../gcc/configure --disable-bootstrap --enable-languages=c,c++ --disable-multilib
  make

Slackware 64 is only 64-bit so I had to disable multilib support.

The example from gcc/README.hsa can then be compiled using:

  cd ..
  mkdir demo
  cd demo
  cat > hsakernel.c
extern void square (int *ip, int *rp) __attribute__((hsa, noinline));
void __attribute__((hsa, noinline)) square (int *in, int *out)
{
  int i = *in;
  *out = i * i;
}
CTRL-D
  ../build/gcc/xgcc -m32 -B../build/gcc -c hsakernel.c  -save-temps -fdump-tree-hsagen

Using -fdump-tree-hsagen outputs a dump of the raw HSAIL instructions generated.

[...]

------- After register allocation: -------

HSAIL IL for square
BB 0:
  ld_kernarg_u32 $s0, [%ip]
  ld_kernarg_u32 $s1, [%rp]
  Fall-through to BB 1

BB 1:
  ld_s32 $s2, [$s0]
  mul_s32 $s3, $s2, $s2
  st_s32 $s3, [$s1]
  ret_none 

[...]

Went through the gcc source and found a couple of useful bits. To get the global work-id I found you can use: __builtin_omp_get_thread_num() which compiles into workitemabsid_u32 ret,0. And __builtin_omp_get_num_threads() which compiles into gridsize_u32 ret,0. Both only work on dimension 0. And that seems to be about it for work-group functions.

I'm not really sure how useful it is and unless the git mirror is out of sync there hasn't been a commit for a few months so it's hard to know it's future - but it's there anyway.

My understanding is that a reference implementation of a finaliser will be released at some point which will make BRIG a bit more interesting (writing one myself, e.g. for epiphany, is a bigger task than i'm interested in right now). I'm probably going to have more of a look at aparapi and the other java stuff for the time being but eventually get the llvm based tools built as well. But ugh ... CMake.

No comments: