It's starting to look a bit like OpenCL, but without any of the queuing stuff.
I tried creating an empty demo to try out the api, and I'm going to need a bit more runtime support to make it practical. So at present this is how the api might work for a Mandelbrot painter.
First, the communication structures that live in the epiphany code.
#define WIDTH 512
#define HEIGHT 512

struct shared {
    float left, top, right, bottom;
    jbyte status[16];
};

// Shared comm block
struct shared shared EZ_SHARED;

// RGBA pixels
byte pixels[WIDTH * HEIGHT * 4] EZ_SHARED;
And then an example main.
EZPlatform plat = EZPlatform.init("system.hdf", EZ_SHARED_POINTERS);
EZWorkgroup wg = plat.createWorkgroup(0, 0, 4, 4);
EZProgram eprog = EZProgram.load("emandelbrot.elf");

// Halt the cores
wg.reset();

// Bind program to all cores
wg.bind(eprog, 0, 0, 4, 4);

// Link/load the code
wg.load();

// Access comms structures
ByteBuffer shared = wg.mapSymbol("_shared");
ByteBuffer pixels = wg.mapSymbol("_pixels");

// Job parameters
shared.putFloat(0).putFloat(0).putFloat(1).putFloat(1).put(new byte[16]).rewind();

// Start calculation
wg.start();

// Wait for all jobs to finish
for (int i = 0; i < 16; i++) {
    while (shared.get(i + 16) == 0) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException ex) {
            Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

// Use pixels.
// ...
It's all pretty straightforward and decent until the job completion stuff. I probably want some way of abstracting that into something re-usable. Perhaps one day there will be some hardware support as well, negating the need to poll for the result. But just being able to look up structures by name is a big plus over the way you have to do it with the existing tools.
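One way the completion polling could be pulled out into something re-usable is sketched here. This is only an illustration: `StatusPoller` and `awaitAll` are made-up names and not part of the api above, and it assumes the status flags are plain bytes in a mapped `ByteBuffer` that the cores set to non-zero when they finish. It also adds a timeout, which the original loop does not have.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not part of the api described in the post. */
final class StatusPoller {
    private StatusPoller() {}

    /**
     * Polls `count` status bytes starting at `offset` until every one is
     * non-zero. Returns true once all are set, or false if the timeout
     * expires or the calling thread is interrupted.
     */
    static boolean awaitAll(ByteBuffer buf, int offset, int count, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        for (int i = 0; i < count; i++) {
            // Absolute get: does not disturb the buffer's position.
            while (buf.get(offset + i) == 0) {
                if (System.nanoTime() - deadline >= 0) {
                    return false; // timed out
                }
                try {
                    Thread.sleep(1);
                } catch (InterruptedException ex) {
                    // Restore the interrupt flag and give up waiting.
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return true;
    }
}
```

With the `shared` layout used above (four floats followed by sixteen status bytes), the wait loop would shrink to something like `StatusPoller.awaitAll(shared, 16, 16, 1000)`.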
This (non-existent) example is just a one-shot execution but it already supports a persistent server mode. Perhaps it would also be useful to be able to support multi-kernel one-shot operation, e.g. choose the kernel and then a SYNC will launch a different main. If I do that then supporting kernel arguments would become useful although it's only worth it if the latency is ok versus the code size of a dispatch loop approach.
At the moment the .load() function is probably the interesting one. Internally this first relocates and links all the code to an arm-local buffer. Then it just memcpy's this to each core the program is bound to. This state is remembered, so it is possible to switch the functionality of a whole workgroup with a relatively cheap call. I don't think there's enough memory to do anything sophisticated like double-buffering the code though, and given the ALU to bandwidth mismatch as it is, it probably wouldn't be much help anyway.
I do already have an 'EPort' primitive I included in the Java api. It's basically a non-locking cyclic counter which can be used to implement single writer / single reader queues very efficiently on the epiphany just using local memory reads and remote memory writes (i.e. non-blocking if not full and no mesh impact if it is). It's a bit limited though as for example you can only reserve or consume one slot at a time. Still useful mind you and it works with host-core as well as core-core.
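To show the general idea of a cyclic-counter port, here is a rough single writer / single reader sketch. It is not the actual EPort code: the class and method names are invented, and the atomics only stand in for the local-read / remote-write split on the real hardware. Each counter is written by exactly one side, so no locking is needed, and as with EPort only one slot can be reserved or consumed at a time. The slot count is assumed to be a power of two so the index stays valid when the counters wrap.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative only: not the author's EPort implementation. */
final class PortSketch {
    private final int mask;
    private final int size;
    private final AtomicInteger head = new AtomicInteger(); // written only by the producer
    private final AtomicInteger tail = new AtomicInteger(); // written only by the consumer

    PortSketch(int size) {
        if (size <= 0 || (size & (size - 1)) != 0) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        this.size = size;
        this.mask = size - 1;
    }

    /** Producer: index of the next free slot, or -1 if the queue is full. */
    int reserve() {
        int h = head.get();
        return (h - tail.get() >= size) ? -1 : (h & mask);
    }

    /** Producer: publish the slot returned by the last successful reserve(). */
    void post() {
        head.set(head.get() + 1);
    }

    /** Consumer: index of the next filled slot, or -1 if the queue is empty. */
    int poll() {
        int t = tail.get();
        return (head.get() == t) ? -1 : (t & mask);
    }

    /** Consumer: release the slot returned by the last successful poll(). */
    void done() {
        tail.set(tail.get() + 1);
    }
}
```

The slot data itself lives in a separate array indexed by the returned slot number; the producer fills the slot between reserve() and post(), and the consumer reads it between poll() and done().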
I need to brush up again on some of the hardware workgroup support to see what other efficient primitives can be implemented (weird, the 4.13.x revision of the arch reference has vanished from the parallella site). Should be able to get a barrier at least, although it's a bit more work having it work off-chip. Personally I think a mutex has no place in massively parallel hardware, although without a hardware atomic counter or mailboxes, options are limited.
But maybe another day. I thought I'd had enough beer on Thursday (pretty much the last day of summer, 32 degrees and a warm balmy evening - absolutely awesome) but after finding out what the new contract is focussed on I'm ready for a Sunday Session even if it's just in my own back yard.