Perhaps not surprisingly, the card seems to be designed for graphics workloads more than computational ones.
Tests
The test runs a 31x31 separable convolution kernel over a 1024x768 image, implemented as two passes - a horizontal and then a vertical convolution. The image version is executed over both normalised unsigned byte data and float data (4x channel). The array version only uses single-channel float planes.
In both cases a single thread calculates each output pixel. Timings are from the NVidia Compute Visual Profiler and the card is an NVidia GTX 480.
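For reference, the basic shape of one pass - before any of the optimisations discussed below - looks something like the following minimal sketch of the X pass over a single-channel float plane: one work-item per output pixel, the 31 kernel weights and the source read straight from global memory, and accesses clipped to the image boundary. This is only illustrative (names such as convolve_x_naive are made up, not the measured code); the Y pass is the same with the roles of x and y swapped.

#define KSIZE 31
#define KHALF 15

__kernel void convolve_x_naive(
    __global const float *src,    // width*height source plane
    __global float *dst,          // width*height destination plane
    __global const float *kmask,  // the 31 kernel weights
    int width,
    int height) {
    int x = get_global_id(0);
    int y = get_global_id(1);

    if (x >= width || y >= height)
        return;

    float sum = 0.0f;
    for (int i = 0; i < KSIZE; i++) {
        // clip to the boundary rather than wrapping or zero-padding
        int sx = clamp(x + i - KHALF, 0, width - 1);
        sum += kmask[i] * src[y * width + sx];
    }
    dst[y * width + x] = sum;
}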
Array version
For the X convolution it copies the kernel and 128 elements of the source array to local memory, which are then shared amongst the 64 threads in the work group. For the Y convolution this approach is slower because of the memory access pattern, so it instead relies on memory coalescing for the reads, and on the memory accesses being interleaved with processing to hide the latency.
The code must manually handle the edges - it just clips to the boundary.
Timings: X=192μS Y=400μS Total=592μS (per plane) 1776μS (3x planes).
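A sketch of what that local-memory X pass might look like under the assumptions above: a 64x1 work group cooperatively copies the kernel weights plus a 128-element run of the source row (the 64 output pixels plus the 15-element apron on each side, rounded up to 128) into local memory, then convolves from there. The indexing and copy scheme of the actual measured code may well differ.

#define KSIZE 31
#define KHALF 15
#define LWIDTH 64      // work group width
#define LSTRIDE 128    // elements of the source row staged locally

__kernel __attribute__((reqd_work_group_size(LWIDTH, 1, 1)))
void convolve_x_local(
    __global const float *src,
    __global float *dst,
    __global const float *kmask,
    int width,
    int height) {
    // assumes the global work size is exactly the image size (1024x768),
    // which is a multiple of the work group size
    __local float lkern[KSIZE];
    __local float ldata[LSTRIDE];

    int lx = get_local_id(0);
    int x = get_global_id(0);
    int y = get_global_id(1);
    int x0 = get_group_id(0) * LWIDTH - KHALF;  // leftmost source element needed

    // cooperative copy of the kernel weights ...
    if (lx < KSIZE)
        lkern[lx] = kmask[lx];
    // ... and of 128 source elements, clipped to the row boundary
    for (int i = lx; i < LSTRIDE; i += LWIDTH) {
        int sx = clamp(x0 + i, 0, width - 1);
        ldata[i] = src[y * width + sx];
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    float sum = 0.0f;
    for (int i = 0; i < KSIZE; i++)
        sum += lkern[i] * ldata[lx + i];
    dst[y * width + x] = sum;
}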
I also tried changing the array types to float4 and processing 4 packed planes at once. This pretty much scaled linearly - I'd expected it to scale better than that.
Timings: X=820μS Y=1460μS Total=2280μS (4x planes) 570μS (per plane)
Image version
The first image version was a very simple implementation that just reads pixels directly from the source image. Although the data is stored in UBYTE RGBA format it only calculates 3 channels (4 channels can be done for <10% extra time). The X and Y convolution code is more or less identical save for the direction it works in.
Timings: X=618μS Y=618μS Total=1236μS (3x channels) 1269μS (4x channels)
A pretty clear win - but this is only with byte data.
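That naive image pass presumably looks roughly like the sketch below: pixels are read directly from the source image through the texture path, a clamp-to-edge sampler handles the boundary, and read_imagef returns normalised floats whether the storage is UBYTE or float. (This sketch computes all four channels; the names are again illustrative, not the measured code.)

#define KSIZE 31
#define KHALF 15

const sampler_t clampSampler = CLK_NORMALIZED_COORDS_FALSE
    | CLK_ADDRESS_CLAMP_TO_EDGE
    | CLK_FILTER_NEAREST;

__kernel void convolve_x_image_naive(
    __read_only image2d_t src,
    __write_only image2d_t dst,
    __global const float *kmask,
    int width,
    int height) {
    int x = get_global_id(0);
    int y = get_global_id(1);

    if (x >= width || y >= height)
        return;

    float4 sum = (float4) 0.0f;
    for (int i = 0; i < KSIZE; i++)
        sum += kmask[i] * read_imagef(src, clampSampler, (int2)(x + i - KHALF, y));
    write_imagef(dst, (int2)(x, y), sum);
}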
I then tried using floating point as the storage, and things weren't so rosy for the image version.
Timings: X=1824μS Y=2541μS Total=4365μS (3x channels)
So I started moving some of the optimisations required for the array version into the image version. First I just copied the kernel to local memory in both the X and Y versions. Pretty major improvement.
Timings: X=1176μS Y=2117μS Total=3293μS
And finally I added the code which copies 128 elements of the data to local memory. To do this for the Y convolution I also had to change the local work size to be 64 in Y rather than X - and this probably explains why it ran faster since it creates more work groups.
Timings: X=770μS Y=732μS Total=1502μS
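A sketch of what that final Y pass might look like: the work group is now 1x64, so the 64 work-items stage the kernel weights and a 128-pixel column of the image in local memory and convolve from there - essentially the array-version strategy with read_imagef/write_imagef at either end. This is an approximation of the measured code, not a copy of it.

#define KSIZE 31
#define KHALF 15
#define LHEIGHT 64     // work group height
#define LSTRIDE 128    // image rows staged locally

const sampler_t clampSampler = CLK_NORMALIZED_COORDS_FALSE
    | CLK_ADDRESS_CLAMP_TO_EDGE
    | CLK_FILTER_NEAREST;

__kernel __attribute__((reqd_work_group_size(1, LHEIGHT, 1)))
void convolve_y_image_local(
    __read_only image2d_t src,
    __write_only image2d_t dst,
    __global const float *kmask,
    int width,
    int height) {
    // assumes the global work size is exactly the image size (1024x768)
    __local float lkern[KSIZE];
    __local float4 ldata[LSTRIDE];

    int ly = get_local_id(1);
    int x = get_global_id(0);
    int y = get_global_id(1);
    int y0 = get_group_id(1) * LHEIGHT - KHALF;  // topmost row needed

    // stage the kernel weights and a column of pixels in local memory
    if (ly < KSIZE)
        lkern[ly] = kmask[ly];
    for (int i = ly; i < LSTRIDE; i += LHEIGHT)
        ldata[i] = read_imagef(src, clampSampler, (int2)(x, y0 + i));
    barrier(CLK_LOCAL_MEM_FENCE);

    float4 sum = (float4) 0.0f;
    for (int i = 0; i < KSIZE; i++)
        sum += lkern[i] * ldata[ly + i];
    write_imagef(dst, (int2)(x, y), sum);
}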
What is strange, though, is that this version is slower on the byte data. I guess the extra complication and overhead of copying stuff locally slows it down too much.
Timings: X=712μS Y=731μS Total=1444μS
And if I remove the local copy of the image data the timings improve further.
Timings: X=677μS Y=725μS Total=1402μS
But they are still behind the naive version for BYTE data.
Conclusions
Storing data in array buffers can, with properly written code, achieve similar performance to image storage - even though they have radically different data paths and cache characteristics. Array types can process individual planes separately, but can also process vector/multi-channel types fairly easily too.
Although a trivial implementation worked well for 32-bit packed pixel types, non-byte image types require almost identical treatment to the array-based implementation in order to gain good performance.
Even if it might not be the most efficient, the same code can be executed against different image storage types - the image read/write functions just use floating point values in registers, which is the most convenient format for the arithmetic (and tuned for the GPU). The array approach would require completely different code for each data type - e.g. normalising to float or using fixed-point arithmetic.
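To illustrate that point on the host side: switching between packed-byte and float storage only means changing the image format - the kernel source is untouched. A hedged sketch using the OpenCL 1.1-era clCreateImage2D call (error handling omitted, names made up):

#include <CL/cl.h>

/* Create the source image in either packed-byte or float form; the same
 * convolution kernel reads both via read_imagef. */
static cl_mem create_source_image(cl_context ctx, int use_float, cl_int *err) {
    cl_image_format fmt;

    fmt.image_channel_order = CL_RGBA;
    /* CL_UNORM_INT8 = normalised bytes packed into 32-bit pixels,
     * CL_FLOAT = 4x float per pixel */
    fmt.image_channel_data_type = use_float ? CL_FLOAT : CL_UNORM_INT8;

    return clCreateImage2D(ctx, CL_MEM_READ_ONLY, &fmt,
                           1024, 768, 0, NULL, err);
}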
In short, the NVidia GPU seems optimised for accessing data through image types. And particularly for typically screen-sized images stored in 32 bit packed format. Not so surprising for a graphics card.
It would be interesting to compare to the ATI card I have - I suspect it would be pretty much a similar result, and perhaps even more so since it doesn't have any L1 cache for array accesses. But profiling that is somewhat more work and I can't be bothered right now. I have also yet to try it with single-channel images.
Update: Actually I need to know about single-channel images, so I tried that and it was a bit disappointing for BYTE data: X=593μS Y=600μS Total=1193μS. The texture cache probably stores all channels anyway, and for all I know the image is being stored in memory at 32 bits per pixel. For float data using the optimised version things are somewhat better - X=263μS Y=301μS Total=564μS. And bizarrely, the optimised version is now faster for BYTE data as well - X=242μS Y=295μS Total=537μS. Presumably this is because the smaller amount of processing isn't able to hide the memory latency but the manual caching is (and the smaller local array sizes are less of a limitation on concurrency - the minuscule local memory is the main bottleneck for optimising OpenCL).
I'm running into some memory stress for work, and if the byte data were stored packed it might be a big benefit here - right now I'm using float arrays. Using images might simplify some of the code too, although it looks like the more memory-heavy stuff will still need to use local memory - though at least in this example that extra work would make it run faster than the array types.
3 comments:
Do you have any of this source code documenting your experiments? I was planning to perform these very same experiments on my own image data, and this would be a great head start.
I don't think I could recreate the source to match the numbers without some effort, as the code evolved over time and eventually got cleaned up/discarded.
socles has about the best implementation I could come up with for image data. It was based on the best code I could come up with for array data as well, although the Y convolution is a bit different - there's no need to use local memory because of the way it is accessed. The only other difference is that it uses a work-width of 64 instead of 16.
e.g. see ConvolveXYGenerator and Convolve2D at:
http://code.google.com/p/socles/source/browse/#svn%2Ftrunk%2Fsocles%2Fsrc%2Fau%2Fnotzed%2Fsocle%2Ffilter
It also has a 'naive' version of the separable convolution (relies on texture cache only) too.
If you want to discuss this further, maybe post to socles-discuss? As can be seen from further posts, it's something I continue to revisit.
PS I'm thinking of adding some array implementations to socles as well, based on work I've been doing recently. But I haven't had the spare cycles to work on it for a while.