Just some notes on optimising the assembly language version of the
Viola-Jones cascade walker I've been working on for the Epiphany chip.
I'm still working toward a tech demo but I got bogged down with the
details of the resampling code - I'm using the opportunity to finally
grok how upfirdn works.
Excuse the typos, I did this in a bit of a rush and don't feel like
fully proof-reading it.
The algorithm
First the data structure. This allows the cascade to be encoded
using dword (64-bit) alignment. It's broken into 64-bit elements for
C compatibility.
union drecord {
        unsigned long long v;
        struct {
                unsigned int flags;
                float sthreshold;
        } head0;
        struct {
                unsigned int count2;
                unsigned int count3;
        } head1;
        struct {
                unsigned short a,b,c,d;
        } rect;
        struct {
                float weight1;
                float fthreshold;
        } f0;
        struct {
                float fsucc,ffail;
        } f1;
};
And then a C implementation using this data structure. The
summed-area-table (sat) sum calculates the average of all pixels
within that rectangle. The sat table size is hard-coded to a
specific width and encoded into the compiled cascade. Because it is
only ever processed as part of a window of a known size this doesn't
limit its generality.
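Since everything downstream hinges on four corner lookups into that table, here is a small self-contained sketch of how a summed-area table is built and queried in plain C. The helper names (`build_sat`, `rect_sum`) are mine, not from the cascade code; they just show the usual inclusive d + a - b - c corner identity that the packed a,b,c,d indices rely on:

```c
/* Build an inclusive summed-area table over a w x h image:
   sat[y*w+x] holds the sum of img over the rectangle (0,0)..(x,y). */
static void build_sat(const float *img, float *sat, int w, int h)
{
        for (int y = 0; y < h; y++) {
                float row = 0.0f;

                for (int x = 0; x < w; x++) {
                        row += img[y * w + x];
                        sat[y * w + x] = row + (y > 0 ? sat[(y - 1) * w + x] : 0.0f);
                }
        }
}

/* Sum of pixels in the inclusive rectangle (x0,y0)..(x1,y1):
   four corner lookups, d + a - b - c, whatever the rectangle size. */
static float rect_sum(const float *sat, int w, int x0, int y0, int x1, int y1)
{
        float d = sat[y1 * w + x1];
        float b = y0 > 0 ? sat[(y0 - 1) * w + x1] : 0.0f;
        float c = x0 > 0 ? sat[y1 * w + (x0 - 1)] : 0.0f;
        float a = y0 > 0 && x0 > 0 ? sat[(y0 - 1) * w + (x0 - 1)] : 0.0f;

        return d + a - b - c;
}
```

In the cascade format the four indices are pre-computed per rect, so the runtime only pays for the loads and three adds.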
It performs a feature test on a 2-region feature, which equates to
either a "this half is brighter than the other half" test in all 4
directions, or a "the middle half is brighter than the two quarter
sides" test in both directions and senses.
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
int test_cascade(float *sat, float var, const union drecord *p, float *ssump) {
        union drecord h0;
        union drecord h1;
        float ssum = *ssump;

        do {
                h0 = (*p++);
                h1 = (*p++);

                while (h1.head1.count2) {
                        union drecord r0, r1, f0, f1;
                        float rsum;

                        r0 = (*p++);
                        r1 = (*p++);
                        f0 = (*p++);
                        f1 = (*p++);

                        rsum = (sat[r0.rect.a] + sat[r0.rect.d]
                                - sat[r0.rect.b] - sat[r0.rect.c]) * -0.0025f;
                        rsum += (sat[r1.rect.a] + sat[r1.rect.d]
                                - sat[r1.rect.b] - sat[r1.rect.c]) * f0.f0.weight1;

                        ssum += rsum < f0.f0.fthreshold * var ? f1.f1.fsucc : f1.f1.ffail;
                        h1.head1.count2--;
                }

                /* ... 3-feature test is much the same ... */

                if (h0.head0.flags & 1) {
                        if (ssum < h0.head0.sthreshold) {
                                return 0;
                        }
                        ssum = 0;
                }
        } while ((h0.head0.flags & 2) == 0);

        *ssump = ssum;
        // keep on going
        return 1;
}
As one can see the actual algorithm is really very simple. The
problem with making it run fast is dealing with the amount of data
that it can chew through, as I've mentioned and detailed in previous
posts.
I don't have any timings but this should be a particularly fast
implementation on a desktop CPU too - most of the heavy lifting fits
in the L1 cache for example, and it pre-compiles as much as
possible.
Hardware specific optimisations
This covers a couple of optimisations made to take advantage of the
instruction set.
First issue is that there is no comparison operation - all one can do
is subtract and compare flags. Furthermore there are only limited
comparison operators available - equal, less-than and
less-than-or-equal. So in general a compare is at least 2
instructions (and more if you want to be IEEE compliant, but that isn't needed here).
On the other hand there are fmadd and fmsub instructions - AND these
set the flags. So it is possible to perform all three operations in
one instruction given that we don't need to know the precise value.
Another feature of the epu is that the floating point and integer
flags are separate so this can be utilised to fill instruction slots
and also perform control flow without affecting the flags.
The epu is most efficient when performing dword loads. It's the same
speed as a word load, and faster than a short or byte load. So the
format is designed to support all dword loads.
Another general optimisation is in pre-compiling the cascade for the
problem. So far I'm only using it to pre-calculate the array offsets,
but it could also be used to alter the sign of calculations to suit
the available fpu flags.
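The offset pre-calculation amounts to nothing more than baking the index arithmetic into the compiled cascade. A sketch of that compile-time step - the function name is mine and SAT_WIDTH of 32 is just an illustrative value, not the real stride:

```c
#include <stdint.h>

/* Illustrative compile-time step: turn a rect corner (x,y) into a SAT
   index pre-multiplied by sizeof(float), so the inner loop can use it
   directly as a byte offset with no shift.  SAT_WIDTH models the
   hard-coded row stride mentioned earlier; 32 is an example value. */
#define SAT_WIDTH 32

static uint16_t sat_byte_offset(int x, int y)
{
        return (uint16_t)((y * SAT_WIDTH + x) * sizeof(float));
}
```

The result still fits the unsigned short a,b,c,d fields of the drecord rect for any sensible window size.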
Update: Because the Epiphany LDS is so small, another optimisation was to make the cascade streamable. Although the single biggest stage in the test cascade fits in 8k it is pretty tight, and limits the code flexibility and tuning options (e.g. trading off space and time). It also limits generality - other cascades may not have the same topology. So the cascade format is designed so it can be broken at completely arbitrary boundary points with very little overhead - this is probably the single most important bit of engineering in the whole exercise and determines everything else. The difficulty isn't so much in designing the format as in recognising the need for it and its requirements in the first place. Having a streamable cascade adds a great deal of flexibility for dealing with large structures - they can be cached easily and implementing read-ahead is trivial.
There were some other basic optimisation techniques which became available after studying the actual data:
- 2-region features use only two variations of weights, so they can be encoded in 1 bit or in a single float (the first weight is always the same).
- 3-region features all use the same weights, therefore all 3 floats can be thrown away.
- The original cascade format had 2 and 3 region features scattered amongst the cascade randomly, which means any inner loop has to deal with the different number of elements (and branch!). Once one realises the only result is the sum, they can be processed in any order (summation algebra ftw ... again), meaning I could group them and optimise each loop separately.
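The grouping in the last point is a trivial offline pass. A sketch with illustrative names - the real cascade compiler works on the packed drecord stream, not a struct like this:

```c
#include <string.h>

struct feature {
        int nregions;   /* 2 or 3 */
        /* ... rects, weights, thresholds ... */
};

/* Stable-partition a stage's features so all 2-region features come
   first, then all 3-region ones.  The stage sum is order-independent,
   so each group can then run in its own branch-free inner loop. */
static void group_features(struct feature *f, int n, struct feature *tmp)
{
        int j = 0;

        for (int i = 0; i < n; i++)
                if (f[i].nregions == 2)
                        tmp[j++] = f[i];
        for (int i = 0; i < n; i++)
                if (f[i].nregions == 3)
                        tmp[j++] = f[i];
        memcpy(f, tmp, n * sizeof *f);
}
```

Keeping the partition stable preserves the original order within each group, which makes verifying the reordered cascade against the original easy.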
Some of these seem to lose the generality of the routine - but actually the weights always stand in the same relationship; they are just scaled to the size of the native cascade window. So making the algorithm general would not take much effort.
These are things I missed when I worked on my OpenCL version so I think I could improve that further too. But trying to utilise the concurrency and dealing with the cascade size is what kills the GPU performance so it might not help much as it isn't ALU constrained at all. If I ever get a GCN APU I will definitely revisit it though.
Unscheduled ASM
After a (good) few days' worth of hacking blind and lots of swearing I
finally came up with the basic code below. I was dreaming in register
loads ...
Actually this was de-scheduled in order to try to follow it and
re-schedule it more efficiently. This is the top part of the C code
and the entire 2-region loop.
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
0: ldrd r18,[r7,#1] ; count2, count3
ldr r16,[r7],#4 ; flags
and r0,r18,r5 ; check zero count
beq 1f
2: ldrd r0,[r7],#4 ; 0: load a,b,c,d
ldrd r2,[r7,#-3] ; 1: load a,b,c,d
lsr r4,r0,#16 ; 0:b index
ldr r21,[r6,r4] ; 0:load b
and r0,r0,r5 ; 0:a index
ldr r20,[r6,r0] ; 0:load a
lsr r4,r1,#16 ; 0: d index
ldr r23,[r6,r4] ; 0: load d
and r1,r1,r5 ; 0: c index
ldr r22,[r6,r1] ; 0: load c
lsr r4,r2,#16 ; 1: b index
ldr r25,[r6,r4] ; 1: load b
and r2,r2,r5 ; 1: a index
ldr r24,[r6,r2] ; 1: load a
lsr r4,r3,#16 ; 1: d iindex
ldr r27,[r6,r4] ; 1: load d
and r3,r3,r5 ; 1: c index
ldr r26,[r6,r3] ; 1: load c
ldrd r50,[r7,#-2] ; load w1, rthreshold
fsub r44,r20,r21 ; 0: a-b
fsub r45,r23,r22 ; 0: d-c
fsub r46,r24,r25 ; 1: a-b
fsub r47,r27,r26 ; 1: d-c
fmul r48,r51,r60 ; rthreshold *= var
fadd r44,r44,r45 ; 0[-1]: a+d-b-c
fadd r45,r46,r47 ; 1[-1]: a+d-b-c
fmsub r48,r44,r63 ; [-1]: var * thr -= (a+d-b-c) * w0
ldrd r52,[r7,#-1] ; [-1] load fsucc, ffail
fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
movblte r52,r53
fsub r17,r17,r52 ; [-2]: ssum -= var * thr > (rsum) ? fsucc: ffail
sub r18,r18,#1
bne 2b
1:
Apart from the trick with the implicit 'free' comparison operations,
it pretty much ended up a direct translation of the C code (much of
the effort was in the format design and getting the code to run). But even in this
state it will execute much faster than what the compiler generates for
the very simple loop above. Things the C compiler is missing:
- It doesn't use dword loads - more instructions are needed
- It does use hword loads - these cause fixed stalls
- It is using an ieee comparison function (compiler flags may change this)
- It doesn't use fmsub as much, certainly not for comparison
- It needs to multiply the array references by 4
Because there are no datatypes in asm, this can take advantage of the
fact that the array lookups are by the byte, and pre-calculate the
shift (multiply by sizeof(float)) in the cascade. In the C version I
do not, as it adds a shift for a float array reference - I do have a
way to remove that in C but it's a bit ugly.
Otherwise - it's all very straightforward in the inner loop.
First it loads all the rect definitions and then looks them up in the sat table (r6).
Then it starts the calculations, first calculating the average and
then using fmsub to perform the multiply by the weight and comparison
operation in one.
At the very end of the loop the last flop is to perform a subtraction
on the ssum - this sets the status flags to the final comparison (if
(ssum < h0.head0.sthreshold) in C). This actually requires some
negation in code that uses it which could be improved - the
threshold could be negated in the cascade for example.
If one looks closely one will see that the registers keep going up
even though many are out of scope and can be re-used. This is done on
purpose and allows for the next trick ...
I don't have the full profiling info for this version, but I have a note
that it includes 15 RA stalls, and I think from memory only dual-issues 2 of the 10 flops.
Scheduling
A typical optimisation technique is to unroll a loop, either manually
or by letting the compiler do it. Apart from reducing the relative
overhead of any loop support constructs it provides modern processors
with more flexibility to schedule instructions.
The code already has some loop unrolling anyway - the two regions are
tested using in-line code rather than in a loop.
But unrolling gets messy when you don't know the loop bounds or
don't have some other hard detail such as that there is always an even
number of loops. I didn't really want to try to look at pages of code
and try to schedule by hand either ...
So instead I interleaved the same loop - as one progresses through the
loop calculating the addresses needed for "this" result, the fpu is
performing the calculations for the "last" result. You still need a
prologue which sets up the first loop for whatever the result+1 code
is expecting, and also an epilogue for the final result - and if only
1 value is processed the guts is completely bypassed. I'll only show the guts here ...
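Before the asm itself, the shape of that interleave can be sketched in C. This is a toy example of mine, not the real loop: each iteration does "this" element's load while finishing "last" element's arithmetic, with an explicit prologue and epilogue, and with n == 1 the guts never runs.

```c
/* Toy software-pipelined reduction: the loop body overlaps "this"
   iteration's load with "last" iteration's arithmetic. */
static float pipelined_sum(const float *in, int n)
{
        float sum = 0.0f;
        float pending = in[0];          /* prologue: start result 0 */

        for (int i = 1; i < n; i++) {   /* guts: bypassed when n == 1 */
                float next = in[i];     /* "this" result's load */
                sum += pending;         /* "last" result's flop */
                pending = next;
        }
        sum += pending;                 /* epilogue: finish final result */
        return sum;
}
```

The asm below applies the same pattern, with the ialu doing the address work for result [0] while the fpu finishes result [-1].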
// Copyright (c) 2013 Michael Zucchi
// Licensed under GNU GPLv3
.balign 8
2:
[ 0] fsub r46,r24,r25 ; [-1] 1: a-b
[ 0] ldrd r0,[r7],#4 ; [ 0] 0: load a,b,c,d
[ 1] fsub r47,r27,r26 ; [-1] 1: d-c
[ 1] ldrd r2,[r7,#-3] ; [ 0] 1: load a,b,c,d
[ 2] fmul r48,r51,r60 ; [-1] rthreshold *= var
[ 2] lsr r4,r0,#16 ; [ 0] 0:b index
[ 3] fadd r44,r44,r45 ; [-1] 0: a+d-b-c
[ 3] ldr r21,[r6,r4] ; [ 0] 0:load b
[ 4] and r0,r0,r5 ; [ 0] 0:a index
[ 5] ldr r20,[r6,r0] ; [ 0] 0:load a
[ 6] lsr r4,r1,#16 ; [ 0] 0: d index
[ 6] fadd r45,r46,r47 ; [-1] 1: a+d-b-c
[ 7] ldr r23,[r6,r4] ; [ 0] 0: load d
[ 8] and r1,r1,r5 ; [ 0] 0: c index
[ 8] fmsub r48,r44,r63 ; [-1] var * thr -= (a+d-b-c) * w0
[ 9] ldr r22,[r6,r1] ; [ 0] 0: load c
[ 10] lsr r4,r2,#16 ; [ 0] 1: b index
[ 11] ldr r25,[r6,r4] ; [ 0] 1: load b
[ 12] and r2,r2,r5 ; [ 0] 1: a index
[ 13] ldr r24,[r6,r2] ; [ 0] 1: load a
[ 13] fmsub r48,r45,r50 ; [-1] var * thr -= (a+d-b-c) * w1
[ 14] ldrd r52,[r7,#-5] ; [-1] load fsucc, ffail
[ 15] lsr r4,r3,#16 ; [ 0] 1: d index
[ 16] and r3,r3,r5 ; [ 0] 1: c index
[ 17] ldr r27,[r6,r4] ; [ 0] 1: load d
[ 18] movblte r52,r53 ; [-1] val = var * thr < rsum ? fsucc : ffail
[ 19] fsub r44,r20,r21 ; [ 0] 0: a-b
[ 19] ldr r26,[r6,r3] ; [ 0] 1: load c
[ 20] fsub r45,r23,r22 ; [ 0] 0: d-c
[ 20] sub r18,r18,#1
[ 21] ldrd r50,[r7,#-2] ; [-1] load w1, rthreshold
[ 21] fsub r17,r17,r52 ; [-1] ssum -= var * thr > (rsum) ? fsucc: ffail
[ 22] bne 2b
[ 26] ; if back to the start of the loop
Update: I tried to improve and fix the annotations in the comments. The [xx] value is the index of the result this instruction is working on, the next x: value is the index of the region being worked on (where it is needed).
I've attempted to show the clock cycles the instructions start on (+ 4
for the branch), but it's only rough. I know from the hardware
profiling that every flop dual-issues and there are no register
stalls. The loop start alignment is also critical to the lack of
stalls. And it took a lot of guess-work to remove the final stall
which lingered in the last 5 instructions (someone will probably tell
me now that the SDK has a cycle timer, but no matter if they do).
It fell out almost completely symmetrically - that is, having
all ialu ops in loop 0 and all flops in loop 1 - but by rotating
the flops around a bit I managed to make the final flop the ssum
"subtraction + comparison" operation, with no stalls ...
The movblte instruction which performs the ternary is the one that
uses the implicit comparison result from the fmsub earlier. Not only
does this save one instruction, it also saves the 5 clock cycle
latency it would add - and this loop has no cycles to spare that I could find.
There is some more timing info for this one on the previous post. The version
that this is 30% faster than is not the unscheduled one above but an earlier scheduling attempt.
Oh, I should probably mention that I found the bugs, and the timings in the previous post did change a bit for the worse, but not significantly.