Monday, 18 December 2017

'parallel' streams

I had a task which I thought naturally fitted the Java streams stuff, so I tried it out. Turns out it isn't so hot for this case.

The task is to load a set of data from files, process the data, and collate the results. It's quite CPU intensive, so it's a good fit for parallelisation on modern CPUs. Queuing theory would suggest the most efficient processing pipeline is to run each processing task on its own thread rather than trying to break the tasks up internally.

I tried a couple of different approaches:

  • Files.find().forEach() (serial, as a baseline)
  • Files.find().parallel().collect(custom concurrent collector)
  • Files.find().parallel().flatMap().collect(toList())
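A minimal sketch of the serial baseline and the flatMap variant, where loadAndProcess is a hypothetical stand-in for the real load/process steps:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamVariants {
    // Hypothetical stand-in for the real load + process steps.
    static Stream<String> loadAndProcess(Path p) {
        try {
            return Files.lines(p).map(String::toUpperCase);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo input: a temp directory with a few small files.
        Path dir = Files.createTempDirectory("demo");
        for (int i = 0; i < 4; i++) {
            Files.write(dir.resolve("f" + i + ".txt"), List.of("item" + i));
        }

        // Serial baseline: Files.find().forEach()
        try (Stream<Path> s = Files.find(dir, 1, (p, a) -> a.isRegularFile())) {
            s.forEach(p -> loadAndProcess(p).forEach(System.out::println));
        }

        // Parallel variant: Files.find().parallel().flatMap().collect(toList())
        List<String> results;
        try (Stream<Path> s = Files.find(dir, 1, (p, a) -> a.isRegularFile())) {
            results = s.parallel()
                       .flatMap(StreamVariants::loadAndProcess)
                       .collect(Collectors.toList());
        }
        System.out.println(results.size()); // 4
    }
}
```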

The result was a bit pants. At best they utilised two whole cores, and the total execution times were 1.0x, 0.77x, and 0.76x of the serial case respectively. The machine is an Intel laptop with 4 hyper-threaded cores (i.e. 8 hardware threads).

I thought maybe it just wasn't throwing enough threads at it and was stalling on the i/o, so I tried a separate flatMap() stage to just load the data.

  • Files.find().parallel().flatMap(load).flatMap(process).collect(toList())

But that made no difference and basically ran the same as the custom collector implementation.
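The two-stage pipeline can be sketched as follows; load and process here are hypothetical stand-ins for the real i/o and cpu stages:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TwoStagePipeline {
    // Hypothetical i/o stage: read a file into lines.
    static Stream<String> load(Path p) {
        try {
            return Files.lines(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical cpu stage: fan each line out to processed results.
    static Stream<String> process(String line) {
        return Stream.of(line.toUpperCase(), line.toLowerCase());
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("demo");
        Files.write(dir.resolve("a.txt"), List.of("One", "Two"));

        List<String> out;
        try (Stream<Path> files = Files.find(dir, 1, (p, a) -> a.isRegularFile())) {
            out = files.parallel()
                       .flatMap(TwoStagePipeline::load)     // i/o stage
                       .flatMap(TwoStagePipeline::process)  // cpu stage
                       .collect(Collectors.toList());
        }
        System.out.println(out); // [ONE, one, TWO, two]
    }
}
```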

So I hand-rolled a trivial multi-thread processing graph:

  • I/O x 1: Files.find().forEach(load | queue)
  • Processing x 9: queue | process | outqueue
  • Collator x 1: outqueue | List.add()

With a few sentinel messages to handle finishing off and cleanup.

The result was all 8 "cores" fully utilised and a running time of 0.30x the serial case.

I didn't record the numbers, but I also had a different implementation that parallelised parts of the numerical calculation instead, also using streams via IntStream.range().parallel() (depending on the problem size). Surprisingly, this had much better CPU utilisation (5 cores?) and improved runtime. It's surprising because that is much finer-grained concurrency with higher overheads, and it isn't applied to the full calculation.
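That inner-loop style looks roughly like this; the per-element kernel here is a hypothetical stand-in for the real numerical work:

```java
import java.util.stream.IntStream;

public class InnerParallel {
    // Parallelise the inner numeric loop rather than the outer file loop.
    static double compute(int n) {
        return IntStream.range(0, n)
                        .parallel()
                        .mapToDouble(Math::sqrt) // stand-in for the real per-element work
                        .sum();
    }

    public static void main(String[] args) {
        System.out.println(compute(1_000_000));
    }
}
```

Many small, uniform cpu-bound elements like this are exactly the shape of work the fork/join machinery behind parallel streams is tuned to split evenly across cores.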

I've delved into the stream implementation a bit trying to understand how to implement my own Spliterators and whatnot, and it's an extraordinarily large amount of code for these rather middling results.

Not that it isn't a difficult problem to solve in a general way; the stream "executor" doesn't know that my tasks involve slow, high-latency i/o, as opposed to the many small cpu-bound tasks it seems to be tuned for.

Still a bit disappointing.
