And so it goes ...
I've got a fairly convoluted convolution algorithm for performing a complex wavelet transform and I was looking to re-do it. Part of that re-doing is to move to using arrays rather than image types.
I got a bit side-tracked whilst revisiting convolutions again ... I started with the generator from socles for separable convolution and modified it to work with arrays too. Then I tried a couple of ideas and timed a whole bunch of runs.
One idea I wanted to try was using a rolling buffer to reduce the memory load for the Y convolution. I also wanted to see if using more work-items in a local workgroup to simplify the local memory load would help or hinder. Otherwise it was pretty much just getting an array implementation working. As is often the case I haven't fully tested that these actually work, but I'm reasonably confident they should, as I fixed a few bugs along the way.
The candidates
- convolvex_a: A simple implementation which uses local memory and a work-group size of 64x4. 128x4 words of data are loaded into local memory, and then 64x4 results are generated in parallel purely from the local memory (a rough sketch follows this list).
- convolvey_a: Uses no local memory and just steps through the addresses vertically, producing 64x4 results concurrently. As all memory loads are coalesced it runs quite well.
- convolvex_b: Tries to use extra work-items just to load the memory, afterwards only using 64x4 threads for the arithmetic. In some testing this seemed to be a win for small jobs, but for larger jobs it is a big hit to concurrency.
- convolvey_b: Uses a 64x4 'rolling buffer' to cache image values for all items in the work-group. For each row of the convolution the data is loaded once rather than 4x.
- imagex, imagey: From the socles implementation in ConvolveXYGenerator, which uses local memory to cache input data.
- simplex, simpley: From the socles implementation in ConvolveXYGenerator which relies on the texture cache only.
- convolvex_a (limit): A version of convolvex_a which attempts to load only the memory it needs, rather than a full work-group width each time.
- convolvex_a (vec): A version of convolvex_a which uses simple vector types for the local cache, rather than flattening all access to 32 bits to avoid bank conflicts. It is particularly poor with 4-channel input.
The array code implements CLAMP_TO_EDGE for source reads. The image code uses a 16x16 worksize, the array code 64x4. The image data is FLOAT format, and 1, 2, or 4 channels wide. The array data is float, float2, or float4. Images and arrays represent a 512x512 image. GPU is Nvidia GTX 480.
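For the array path CLAMP_TO_EDGE is simply a clamp on the computed source coordinate. Below is a rough sketch in the style of convolvey_a: no local memory, each work-item reads straight down its column so the loads across the 64-wide rows stay coalesced. As above, KSIZE/KOFF and the signature are illustrative assumptions.

```c
/* Illustrative sketch only; KSIZE/KOFF as in the previous sketch. */
#define KSIZE 9
#define KOFF  (KSIZE / 2)

__kernel void convolvey_a(__global const float *src, __global float *dst,
                          __constant float *filter, int width, int height) {
    int gx = get_global_id(0);
    int gy = get_global_id(1);

    if (gx >= width || gy >= height)
        return;

    float v = 0.0f;
    for (int k = 0; k < KSIZE; k++) {
        /* CLAMP_TO_EDGE: out-of-range rows re-read the edge row */
        int sy = clamp(gy - KOFF + k, 0, height - 1);
        v += filter[k] * src[sy * width + gx];
    }
    dst[gy * width + gx] = v;
}
```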
Results
The timing results - all timings are in microseconds as taken from computeprof. Most kernels were invoked for 1, 2, or 4 channels and a batch size of 1 or 4. Image batches are implemented by multiple invocations.
                       batch=1             batch=4
channels              1    2    4         1    2    4
convolvex_a          42   58  103       151  219  398
convolvey_a          59   70  110       227  270  429
convolvex_b          48   70  121       182  271  475
convolvey_b          85  118  188       327  460  738
imagex               61   77  110       239  303  433
imagey               60   75  102       240  301  407
simplex              87   88  169
simpley              87   87  169
convolvex_a (limit)  44   60   95       160  220  366
convolvex_a (vec)         58  141
Thoughts
- The rolling cache for the y convolution is a big loss. The address arithmetic and need for synchronisation seem to kill performance. So much for that idea. I guess there just isn't enough work per loop iteration to make it worthwhile (it only requires a single mad per thread).
- Using more threads for loading, then dropping back when doing the arithmetic, is also a loss for larger problems since it limits how many workgroups can execute on an SM.
- Trying to reduce the memory accesses to only those required slows things down until you hit 4-element vectors. I guess for float and float2 the cached reads are effectively free, whereas the divergent branch is not.
- Even with the texture cache, images benefit significantly from using a local cache.
- Even with the local cache, images trail the array implementation - until one processes 4-element vectors, in which case they are even stevens for single images.
- Arrays can also be batched - processing 'n' separate images concurrently (a rough sketch of the indexing follows this list). This adds a slight extra benefit as it can more fully utilise the SM cores and reduces the need for extra host interaction. For smaller problems this could be important, although this problem size already gives the GPU a good-sized workout, so the differences are minimal.
- Using single-channel data is under-utilising the GPU by quite a bit.
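As a rough illustration of the batching point above, the third NDRange dimension can pick which image of the batch a work-item operates on, so 'n' images go through a single kernel launch; the kernel name and argument layout are assumptions rather than the socles interface.

```c
/* Illustrative sketch of batched array convolution; not the socles API. */
#define KSIZE 9
#define KOFF  (KSIZE / 2)

__kernel void convolvey_batched(__global const float *src, __global float *dst,
                                __constant float *filter,
                                int width, int height) {
    int gx = get_global_id(0);
    int gy = get_global_id(1);
    int image = get_global_id(2);          /* which image of the batch */
    int base = image * width * height;     /* start of that image      */

    if (gx >= width || gy >= height)
        return;

    float v = 0.0f;
    for (int k = 0; k < KSIZE; k++) {
        int sy = clamp(gy - KOFF + k, 0, height - 1);
        v += filter[k] * src[base + sy * width + gx];
    }
    dst[base + gy * width + gx] = v;
}
```

The host side then only has to enqueue a (width, height, n) global work size instead of n separate invocations.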
When I get time and work out how I want to do it, I'll drop the array code into socles.