I changed the code to perform a simple doubling-up of the U and V components without a separate pass, switched to an RGB565 output stage, and embedded it all into the code in another mess of crap. Then I did some profiling - comparing mainly to the frame-copying version.
Interestingly it is faster than sending the YUV planes to the GPU and letting it do the YUV conversion - and that is only counting the CPU time for the frame copy/conversion and the texture load. i.e. even using NEON it uses less CPU time (and presumably much less GPU time) despite doing more work. The volume of texture memory copied is also 33% more for the RGB565 case than for the YUV420p one.
Still, 1ms isn't very much out of 10 or so.
The actual YUV420p to RGB565 conversion runs at only about half the speed of a simple AVFrame.copy() - not bad considering it's writing 33% more data and I didn't try to optimise the scheduling.
Stop press: whilst writing this I thought I'd look at the scheduling, and also at using the saturating left shift to clamp the values implicitly. Got the inner loop down from 54 to 35 cycles (according to the cycle counter), although it only runs about 10% faster. Better than a kick in the nuts at any rate. Fortunately, due to the way I had already used the registers I could decouple the input loading/formatting from the calculations, so I simply interleaved the next block's data loads into the calculations wherever there were delay slots, and only made the data loading conditional.
The (unscheduled) output stage now becomes:
	@ saturating left shift automatically clamps to unsigned [0,0xffff]
	vqshlu.s16	q8,q8,#2	@ red in upper 8 bits
	vqshlu.s16	q9,q9,#2
	vqshlu.s16	q10,q10,#2	@ green in upper 8 bits
	vqshlu.s16	q11,q11,#2
	vqshlu.s16	q12,q12,#2	@ blue in upper 8 bits
	vqshlu.s16	q13,q13,#2
	vsri.16		q8,q10,#5	@ insert green
	vsri.16		q9,q11,#5
	vsri.16		q8,q12,#11	@ insert blue
	vsri.16		q9,q13,#11
	vst1.u16	{ d16,d17,d18,d19 },[r3]!
Which saves all those clamps.
As suspected, the 8-bit arithmetic leads to a fairly low-quality result, although the non-dithered RGB565 can't help either. Perhaps using 16-bit shorts could improve that without much impact on performance. Still, it's passable for a mobile device given the constraints (and source material), but it isn't much chop on a big TV.
Of course, all this wouldn't be necessary if one had access to the overlay framebuffer hardware present on pretty well all ARM SoCs ... but Android doesn't let you do that, does it ...
Update: I've checked a couple of variations of this into yuv-neon.s, although I'm not using it in the released JJPlayer yet.
Mele vs Ainol Elf II
The Elf is much faster than the Mele at almost everything - particularly video decoding (which uses multiple threads), but pretty much everything else is faster too (better memory? the Cortex-A9? the GPU?), and the dual cores mean it just works a lot better. Can't be good for the battery though.
(as an aside, someone who spoke English should've told the guys in China that "anal elf 2" is probably not a good name for a computer!)
But the code is written with multiple cores in mind - demux, decoding of video and audio, and presentation are all executed on separate threads. Having all of the CPU-bound tasks executed in a single thread might help on the Mele, although by how much I will only know if and when I try it ...
2 comments:
Hi there,
Got any numbers for the size of image you were using for the 1ms time mentioned? And the device being used?
I'm looking for a rough performance estimate of RGB->YUV conversion, and imagine your YUV->RGB565 has a similar cost.
I think from the previous post, my test case was VGA - i.e. 640x480.
It gets some of its performance from tricks that might not be applicable to a mathematically-correct conversion, such as the clamping to (0,255), which may or may not be possible in the other direction.