The code runs in a loop, and there is no communications from the CPU to the GPU and only runs about 10-15 loops anyway (depending on the settings): thus the loop overheads are fairly small. But it still does require a 'scale features' step, which is useful on a CPU to avoid excessive calculations but isn't so important on a GPU.
So I tried a slightly different approach - that is, to perform the scaling inside the detector kernel, which allows each kernel to then work on different scales. i.e. to do all scales in one step.
My first attempt at this wasn't much faster - but that's because I was invoking the kernel for too many probes. So then I tried changing the way it works: each work-group still works together solving a single feature test stage together. But instead of calculating it's location and scale from the 2d work coordinates I create a 4 element descriptor with some of the information required and it just uses that. This gives me a bit more flexibility in the work assignment, e.g. I can utilise persistent work-groups and tune the work size to fit the hardware more directly. It requires less temporary memory since the features are scaled in-situ.
This change was definitely worth it, for a given test on the webcamfx code, I got the face detection down to around 13ms total time, vs 19ms - about 4ms overhead is fixed. A stand-alone test of Lenna registers about 8ms vs 19ms, so over 100% improvement.
Comparisons with other hardware are difficult - mostly because it depends a great deal on the subject matter and the settings and i haven't kept track of those - but I was pretty disappointed with the AMD performance up until now and I think this gets it on par with the nvidia hardware at least. Although really the 7970 should do measurably better ...
My guess is that the performance gained is mostly because with the greater amount of work done, it can more efficiently fit the total problem onto the hardware. There is usually a small amount 'modulus' where a given problem wont fill all hardware units leaving some idle, and in this newer version it only happens once rather than 10-15 times. Actually I did some more timing (and updated the numbers above), and 100% seems too much for this. Maybe? Oh I also changed the parallel sum mechanism - but I changed it in both implementations and it made no difference anyway. I changed the region description to a float array too, although that only affects the scaling function in the first instance.
If I run this on a CPU the performance is very poor - around 1.5s for this test case. If I go back to a test CPU version I wrote it's a more reasonable 240ms so i'm still getting a good 30x speedup over an Intel i7 X 980. Given I was getting 90ms before with the cpu driver and the nvidia test case i'm not really sure what's going on there.
I haven't checked the code in yet as it's a bit hacked up.
Update: I checked some stuff in, although left both implementations in-tact for now.
Update 2: So I did some further analysis of the cascades I have: it turns out the way i'm splitting the work is very wasteful of GPU resources. I'm using at least 64 work items per stage - using one work item per feature. But the earlier stages have only a small number of features to test - and the vast majority of probes don't go past the first few stages. e.g. the default cascade only has 9 tests. I tried a few variations to address this but the overheads of multiple kernel calls and the global communication required outweighed any better utilisation.
Update 3: So curiosity kept me poking. First I realised that using fixed scheduling for persistent kernels might not be idea. So I use an atomic to dole out work in a first-some-first-served consumer way. Made a tiny difference.
Then I thought I would try to see if using fewer work-items per feature stage would help. In this case I use 4x16 or 2x32 thread groups to work on 4 or 2 tests concurrently - with all the necessary (messy) logic to ensure all barriers are hit by all threads, etc. This was measurable - the lenna test case I have is now down to around 7ms (unfortunately when using sprofile the algorithm fails for some unknown reason - so this is now time measured with System.nanoTime()).
One big thing left to try is to see if localising the wide work queue would help. e.g. rather than call multiple kernels for each stage and having each work-item busy working on a sub-set of problems, do it within the kernel. e.g. if the stage count is 9, 12, ... do stage 1 with 7 concurrent jobs, if any pass then add them to a local queue. Then do stage 2 with (64/12) = 5 concurrent jobs, if any pass add them to a local queue. etc. Once you get to a stage longer than 32 items, just use 64 threads for all the rest. This way I get good utilisation with small stages as well as with large stages. I'm not sure whether this will be worth all the hassle, and the extra addressing mathematics required (and it's already using a lot of registers); but as i'm really curious to know if it would help I might attempt it.
Given that I now use a work queue, another possibility open is to re-arrange the jobs to see if any locality of reference can be exploited. Given the huge memory load this might help: although the image cache is so small it might not.
Update 4: Curiosity got the better of me, it's been crappy cold weather and I hurt my foot (i don't know how) so I had another look at the complex version this morning ...
Cut a long story short, too many overheads, and although it isn't slow it isn't faster than just using 16 or 32 threads per feature test. Too many dynamic calculations which cannot be optimised out, and so on. It's around 9.5ms on my test case.
Structurally it's quite interesting however ....
- Find out how many concurrent tests can be executed for stage 0, dequeue that many jobs from the work queue and copy them to a local work queue.
- If we exceeded the job length, stop.
- Work out how many jobs can be done for the current stage
- Process one batch of jobs.
- Parallel sum the stage sum.
- If it advances to the next stage, copy to a next-stage queue.
- Go back to 4 unless finished the in queue.
- If any are in the next-stage queue, copy them over, advance to the next stage.
- Go back to 3 if we had any work remaining.
- If we ran the full stage count, copy any work jobs remaining in the queue to the result list.
- Go back to 1.
There's a lot of messy logic used to make sure every thread in the workgroup executes every barrier, and there are lots of barriers to make sure everything works properly (i'm using locals a lot to communicate global info like the stage and topology information). So the code runs on a CPU (i.e. I got the barriers correct), although very inefficiently.
As is often the case with GPU's, the simpler version works better even if on paper it is less efficient at filling the ALU slots. Although I haven't confirmed this is the case mathematically: apart from stage 0, the more complex method will also have un-even slot fillage - it's one of those discrete maths/Knuth style problems I simply give up on.
No comments:
Post a Comment