Unfortunately when searching for 'opencl kill app' the first hit is my own post of the same title ... ahh well.
Anyway, along the way I found this this blog which has some interesting observations and ideas about OpenCL hacking.
For one I had missed that opencl 1.1 added a
shuffle
instruction: which is pretty much required for good performance for 'data streaming' applications: using SIMD to access non-aligned data. It can also be used to implement a vector index lookup. However, I'm not sure if this is a single-instruction function on GPU hardware, and it was just added to the spec for the CELL BE backend (without shuffle, the SPU's are kneecapped). Although it isn't too hard to find out by dumping the source code - which I did, and found that it isn't efficient at all. Oh well.So back to the question of OpenCL applications: as from the last time I wrote about, enabling desktop performance on a hand-held computer is probably still the main thing we will all eventually see (although it's taking it's sweet time). This latest iteration of the thin client/server model (which harkens back to the original mainframe idea) wont necessarily last, as no previous iteration ever did. And although having an internet connectedness will be pervasive, you wont need to go to a back-end server to do the work when you have 100G/flops in your pocket.
The other thing, which I think is more interesting, is that OpenCL will enable cheaper design of complex systems - which means more competition and more products for the general public. So for example it has always been that even from the days of 8 bit computers you could design a hardware component to increase performance (sprite hardware, sound synth, etc), and this still allows advanced performance from portable or low-power devices. As OpenCL technology matures and die sizes continue to shrink, it will become feasible to replace the expensive custom silicon with more and more software, and so as we went from custom silicon to fpgas, we may well move toward more software-only solutions even in portable devices. OpenCL might be a prick to code in, but it's a lot easier than FPGAs are, and you can still get higher performance from purpose-built functional units (FPGA's big edge is in power requirements, and concurrency).
So I think OpenCL's killer application is not so much a consumer-facing one as a tool for system developers. Or at least, removing the need for system design input for a given application-specific device, and opening up similar facilities to all developers. This is pretty much the gist of one of the AMD talks from a couple of years ago (I think from the AMD developer summit, perhaps from 2010 as I can't see it in the 2011 talks - many of which look interesting enough to watch) about how they were able to produce results in less time and with fewer steps than going the hardware route. i.e. lower costs, higher payoffs, and so on. I.e. pretty much what FPGAs did for system design, OpenCL may do as well: it might not be as fast, but the reduced development costs and overheads for low-volume production, Moore's Law and so on more than make up for it.
As an aside, when I was playing around with the beagleboard I also looked up some arduino stuff from time to time. I thought it was amusing that so many 'old hands' would whine about how overpriced the arduino was and that you could get away with something simpler like a PIC and so on. It was amusing because we can all see where this is going with a computer such as the Raspberry PI coming out: these things are getting so cheap and ubuquitous, before too long those 'low cost' simple parts will only be made by niche manufacturers to service dying equipment: i.e. they will become too expensive for anything (not to mention the skills required). Before too long it wont be economically viable to use anything smaller than a 32-bit floating point cpu for anything requiring some calculations.
It's a pity that the whole patent issue is getting so ridiculously out of hand: now that we are on the verge of effectively commoditising the entire platform as it pertains to signal processing, we're all going to be thrown into a dark age of propping up the leeches and rentiers who seem to have captured governments the world over for their own benefit.
Update: So ... I did end up spending a couple of hours today watching some of the AMD fusion summit videos from 2011 (with a bit of a hangover i wasn't up for much more than this!). Quite interesting, I guess I should keep an eye out on this stuff a bit more, but when i'm busy with work or on leave I don't always keep up with it.
The first keynote was quite interesting (after the waffle from the first guy), the future direction of the 'HSA' (heterogeneous system architecture) platform. At one point there is a graphic showing a bunch of hardware driven queues where applications feed in job tasks directly from their process space and they are picked up and executed by the hardware directly ... without a cpu context switch or kernel-mode memory copy in sight.
Now THAT is interesting.
Although ... TBH what the diagramme most reminded me of was ... non-copying asynchronous message passing, and where have I seen that before? Oh ... AmigaDOS 1!? Well, it only took nearly 30 years to finally catch up ... sure there are differences in the capability, but given the technology of the time it's definitely the closest to the model proposed here. i.e. unified memory (CHIP ram!), separate processors working together, non-copying data flows, etc. It's more than just about the non-copying of data (although that is pretty important too), it's about keeping the ALUs fed with data, and avoiding the overheads of system interactions and exception/syscall processing (this is of course why AmigaOS worked in a similar way). Already with GPUs keeping them fed with data is basically impossible - they spend a lot of time waiting around twiddling their thumbs.
Still, it's funny how nice hardware design is when intel aren't involved ...
Speaking of not-intel, the talk from the ARM guy also was quite interesting. From their perspective the push for non-heterogeneous compute is not just about getting the job done faster, it's about getting it faster whilst using less power. With the process shrink there isn't enough power/heat sink to run all the transistors at once, so you have to divvy up the work in a way which gets it done the fastest without wasting the silicon, the power, and so on. And on a mobile device it's about trying to utilise all the extra gates are your disposal to keep pushing the edge of performance; whilst still keeping within heat and power budget. This obviously fits directly in with AMD's software stuff: how to actually programme such a chip in a practical way.
Although to me what is most interesting about this approach is that it acknowledges that there really is a hard physical limit on the amount of work a single silicon chip can do. And even though they're making them smaller and can fit more transistors they can't turn the new ones on at the same time. I had seen this in the OMAP processors but I thought that was just about power saving: not burning out the chips. Thus the only way to improve performance and utilise the extra silicon further is not by increasing the concurrency but by increasing the specialisation of the functional units, and having more of them. This is quite a fundamental change to hardware design: where normally one tries to maximise resource utilisation. (and yes, it contradicts my speculation above to some extent: but I wrote that before seeing the talk)
And it has pretty far reaching implications for software development too. And particularly for free software; if the specific functional units aren't available to everyone or use proprietary instruction sets or firmware.
I'm not sure i'm convinced of the whole cache coherency model though: I thought the CELL BE was a better idea for high performance since each CPU had it's own dedicated simple memory that runs fast (=== LDS), and knowing you have no cache coherency just means you code differently (and avoid a few pitfalls along the way). The main benefit you get is much higher effective bandwidth: if it has to go through a global cache you're still hitting a bottleneck. Still, it's all about the system performance not just the fpu, and also being able to practically achieve a good fraction of its theoretical peak in a reasonable time with a decent programmer: and these are system issues beyond the raw numbers and queue mathematics.
Well off to the GCN video: although from the last AMD roadmap there is a NEW GPU architecture headed our way by the end of the year, so I'm not so sure what's going on with GCN.
No comments:
Post a Comment