Thursday, 17 September 2015

empirically corrected-approximate integer division

Although I haven't been posting about it, I've been continuing to poke around in bits and pieces of code. Well, a little bit.

I did a bit of OpenCL last week and that was pretty fun. I had enough time to really dig into optimising a particular routine and was down to inspecting the ISA output from the driver. Good stuff. The GCN ISA is pretty foreign to me so I had to use small snippets to isolate operations of interest.

For example, one construct that comes up repeatedly when parallelising code is a divide and modulus pair, used when splitting up a job that isn't work-group sized into work-group sized blocks.

  int block_size = info.block_size;
  for (int id=get_local_id(0); id < limit; id+=64) {
    int block_no = id / block_size;
    int block_index = id % block_size;
    // do work
  }

Where possible one just chooses a power of 2 so this becomes a simple shift and mask; an integer divide by a constant isn't too bad either, as the compiler can usually optimise it. But this problem required a dynamic block size that wasn't a power of 2.
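For reference, the power-of-2 case mentioned above can be sketched in plain C as follows. The function names and the `shift` parameter (log2 of the block size) are made up for illustration and are not from the kernel.

```c
#include <assert.h>

/* Hypothetical helpers: block_size is assumed to be exactly 1 << shift. */

/* Equivalent to id / block_size for non-negative id. */
int pow2_block_no(int id, int shift)
{
    assert(id >= 0 && shift >= 0 && shift < 31);
    return id >> shift;
}

/* Equivalent to id % block_size for non-negative id. */
int pow2_block_index(int id, int shift)
{
    assert(id >= 0 && shift >= 0 && shift < 31);
    return id & ((1 << shift) - 1);
}
```

With a block size of 8 (shift of 3), id 37 lands in block 4 at index 5, the same as 37 / 8 and 37 % 8.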

The solution? Use a floating point multiply by the reciprocal, which can be calculated efficiently or, as here, off-line. The problem is that this introduces enough rounding error to be worthless without some more work.

I must admit I just found the solution empirically here: I had a limited range of possible values so I just exhaustively tested them all against a couple of guesses. Hey it works, I'm no scientist.

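  int block_size = info.block_size;
  // reciprocal of block_size (1.0f / block_size), calculated off-line
  float block_size_1 = info.block_size_1;
  for (int id=get_local_id(0); id < limit; id+=64) {
    // the 1.0f / 16384 bias absorbs the rounding error; found by testing
    int block_no = (int)(id * block_size_1 + 1.0f / 16384);
    int block_index = id - (block_no * block_size);
    // do work
  }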

This replaces the many-instruction integer division decomposition with a convert+mad+convert.
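The exhaustive check described above is easy to reproduce on the host. What follows is a minimal sketch in plain C, not the code actually used: the function names are invented here, and the reciprocal is assumed to be computed as `1.0f / block_size`. Only the `1.0f / 16384` bias comes from the kernel snippet.

```c
#include <assert.h>

/* Bias from the kernel snippet; it was found by testing, not derived. */
#define DIV_BIAS (1.0f / 16384)

/* Approximate id / block_size using a precomputed reciprocal. */
int approx_div(int id, float recip)
{
    return (int)(id * recip + DIV_BIAS);
}

/* Count the ids in [0, limit) where the approximation
   disagrees with the exact integer division. */
int count_mismatches(int block_size, int limit)
{
    assert(block_size > 0 && limit >= 0);
    float recip = 1.0f / (float)block_size; /* assumed host-side precompute */
    int bad = 0;
    for (int id = 0; id < limit; id++) {
        if (approx_div(id, recip) != id / block_size)
            bad++;
    }
    return bad;
}
```

Calling `count_mismatches` for every block size and limit the kernel can actually see, and expecting zero each time, is the whole test. A GPU compiler may fuse the multiply and add into a single instruction with different rounding, so a host run like this is only indicative; the same loop would need to run on the device to be sure.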

On some workloads this gave a 25% improvement to the entire routine, even though these 2 lines sit in an inner loop of about 50 lines of code.

Well it's been fun to play at this level again - it's ... mostly ... pointless going to this level but it just adds to the toolkit and I enjoy poking. Maybe one day I'll have a job where it's useful.

I gave zcl a go on this as originally I was thinking of trying some OpenCL 2 stuff but I may not bother now. Given the lack of use/testing it was pretty much bug free but I started filling out the API with some more convenient entry points. I also decided to add some more java-array interfaces here and there: they're just too convenient and it hides the mess in the C even if they might not be the most efficient in all cases.

This is the sort of thing I'm talking about:

  float[] data = new float[] { 1, 2, 3, 4, 5 };
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data);

vs

  float[] data = new float[] { 1, 2, 3, 4, 5 };
  ByteBuffer bb = ByteBuffer.allocateDirect(data.length * 4).order(ByteOrder.nativeOrder());
  bb.asFloatBuffer().put(data);
  CLBuffer buffer = cl.createBuffer(CL_MEM_COPY_HOST_PTR, data.length * 4, bb);

It's only two fewer lines of code ... but yeah that shit gets old fast. The first is more efficient too because this is a native method and it avoids the copy. In the first case CL_MEM_USE_HOST_PTR throws an exception though, and in the second it works (library call permitting).

The main downside is adding these convenience calls blows out the method count very quickly if you support all the primitive types - which detracts from the ease of use they're supposed to increase.

Another release? Who knows when.

And this week I've been poking at some OpenGL. My it's grown. I'm experimenting with using JOGL for this although I'm not a fan of some of its binding choices. It's crossed my mind but I'm pretty sure I don't want to create yet another binding as in a 'ZGL'. Hmm, I wonder if Vulkan will clean up the cross platform junk from OpenGL.

Unfortunately my home workstation seems to have developed a faulty DIMM or something (unrelated note of note).
