Sunday 1 June 2014

ipc primitive ideas

Had a couple of ideas for parallella IPC mechanisms whilst in the shower this morning. I guess i'm not out of ideas yet then.

atomic counter

These are really useful for writing non-blocking algorithms ... and would best be implemented in hardware. But since they aren't (yet - it could go into fpga too) here's one idea for a possible solution.

The idea is to implement it as an interrupt routine on each core - each core provides a single atomic counter. This avoids the problem of having special code on some cores and could be integrated into the runtime since it would need so little code.

Each core maintains a list of atomic counter requests and the counter itself.

  unsigned int *atomic_counter_requests[MAX_CORES];
  unsigned int atomic_counter;

To request an atomic number:

  unsigned int local_counter;

  local_counter = ~0;
  remote atomic_counter = &local_counter;
  remote ILATST = set interrupt bit;
  while local_counter = ~0
    // nop
  wend

The ISR is straightforward:

atomic_isr:
  counter = atomic_counter;
  for i [get_group_size()]
    if (atomic_counter_requests[i])
       *atomic_counter_requests[i] = counter++;
       atomic_counter_requests[i] = NULL;
    fi
  done
  atomic_counter = counter;

Why not just use a mutex? Under high contention - which is possible with many cores - they cause a lot of blockage and flood the mesh with useless traffic of retries. Looking at the mesh traffic required:

do {
  -> mutex write request 
  <- mutex read request return
} while (!locked)
  -> counter read request 
  <- counter read request return
  -> counter write request 
  -> mutex write request 
Even in the best-case scenario of no contention there is two round trips and then two writes which don't block the callee.

In comparison the mesh traffic for the interrupt based routine is:

  -> counter location write request 
  -> ILATST write request 
  <- counter result write request

So even if the interrupt takes a while to scan the list of requests the mesh traffic is much lower and more importantly - fixed and bounded. There is also only a single data round-trip to offset the loop iteration time.

By having just one counter on each core the overhead is minimal on a given core but it still allows for lots of counters and for load balancing ideas (e.g. on a 4x4 grid, use one of the central cores if they all need to use it). Atomic counters let you implement a myriad of (basically) non-blocking multi-core primitives such as multi reader/multi-writer queues.

This is something i'll definitely look at adding to the ezecore runtime. Actually if it's embedded into the runtime it can size the request array to suit the workgroup or hardware size and it can include one or more slots for external requests (e.g. allow the host to participate).

Broadcast port

Another idea was for a broadcast 'port'. The current port implementation only works with a single reader / single writer at a time. Whilst this can be used to implement all sorts of variations by polling one idea might be to combine them into another structure to allow broadcasts - i.e. single producer and multiple readers which read the same entries.

If the assumption is that every receiver is at the same memory location then the port only needs to track the offset into the LDS for the port. The sender then has to write the head updates to every receiver and the receivers have to have their own tail slots. The sender still has to poll multiple results but they could for example be stored as bytes for greater efficiency.

This idea might need some more cooking - is it really useful enough or significantly more efficient enough compared to just having a port-pair for each receiver?

No comments: