Wednesday, 11 December 2013

Java, C, SSE, poor mans lambdas

So after being able to avoid them for decades I got sucked in to having to do some matrix maths this week. I'm mostly just using a library but there were some memory and performance problems so I had to investigate my own matrix multiply routine.

Java vs gcc

Mostly just because of curiosity I tried comparing C to Java ... and the performance difference was negligble, actually sometimes the java cpu time is neglibly less. And i'm timing executing the programme from the command line - so that includes jvm startup and just-in-time compilation. I expected more of a difference in the obvious direction given the jvm overheads so that was a nice surprise. I suppose I shouldn't really be surprised by the performance anymore ...

And then further curiosity led me to attempting a vector based implementation - just using the gcc vector types. This was only about 2.4x faster - it would be worth it worth it, but I just used threads and it works fast enough anyway.

The vector implementation in gcc is simple, one just defines a vector type and then it's much the same as OpenCL, although one must ensure the data is aligned properly otherwise performance is pants.

TBH i'm a little dissapointed hotspot isn't doing any SSE optimisations here automatically (or maybe it is, but the execution time compared to C is too close for it to be a coincidence).


To implement the multi-threaded code I started using a poor-mans implementation of lambdas, or at least the parallel foreach part of it which is imho the main point of interest. I'm waiting for the jdk8 ga before pushing any java8 requirement onto my customer. It doesn't work as well as the lambda code in jdk8 but it isn't too far off and the syntax is fine as far as i'm concerned.

For simplicity I just have a static class which manages one thread per cpu with a simple static foreach call:

  public class WorkPool {
     ... threads stuff ...

  public interface WorkItem {
    public void accept(int i);

  public static synchronized void foreach(int start, int end, WorkItem job) {
     ... statically partition work across threads ...
     ... pass job to threads ...
     ... await completion ...

With obvious usage:

    WorkPool.foreach(0, N, new WorkItem() {
       public void accept(int i) {
           array[i] = blah ...;
It's a bit clumsy without the 'effectively final' of Java8, and arguably clumsy due to the syntax but if each WorkItem does a sizable amount of work the overheads are acceptable. More importantly it just gets me thinking about how to solve problems in ways that will map immediately to Java8 when I start using it.

There's some other interesting stuff about the parallel execution model that I want to talk about but i'll leave that for another post. It's about mapping non-rectilinear work loads to a linear index and job execution order.

On another note, I keep running into problems with the thread job management on Java and keep end up having to write custom solutions. Executors seem like a great idea ... and they probably are for enterprise workloads but they basically suck for interactive desktop programmes - where you may be getting many many more update requests than you have time to process but you only need to keep (and must keep) the last one. Managing this with Future handles gets clumsy very fast. JavaFX comes with a new set of 'worker thread' primitives which I haven't really looked into much - they seemed to be light wrappers over existing functionality anyway and only seemed to muddy the waters last time I had a look.

One current implementation of WorkPool uses a ThreadPool executor but I will look at custom thread code too (curiosity again). Currently i'm also implementing static scheduling but it will be worth investigating something more dynamic.


Solerman Kaplon said...

Dunno in what environment you need SIMD vector, but maybe worth a look:

NotZed said...

Is! That! Big! Loud! Web! Page! Is! A! Bit! Of! A! Turn! Off? Yepp!

Sorry ...

Seems an interesting approach to the code generation (from what i can tell) although i'm not sure of it's scalability - by processing only primitives across a whole vector at a time you're basically throwing away all the parts of the cpu above the L1 cache (including registers) which function as memory bandwidth multipliers.

Pity the site doesn't have something more useful as benchmarks than some mega-ops of log().

Also seems kinda strange to be launching a new high performance api without direct support for multi-core or bigger?

Solerman Kaplon said...

From the library class reference, it seems they expect you to use threads and its is recommended to bind them to specific cores. The library can do the whole array of vector it seems, having hand-asm version of the stuff in many ARM versions plus x64. They got a bundle for jave with all the binaries in 800K, individual .so are around 200-300K (400-600K for 64 bits)

NotZed said...

By whole vector i meant the array not the internal cpu representation.

Having mostly simple operations on arrays just means memory is going to be the bottleneck.