A Hacker's Craic: March 2010

Wednesday 31 March 2010

Bye bye CELL

Well, I guess that's really the last nail in the coffin for CELL.

Sony's just announced that the next firmware `upgrade' for the PS3 will drop Linux support (and it's so important, that's all it will do). This is very, very disappointing. They blame it on crackers or 'security', but it's obvious it is just a cost cutting exercise. Sony have been hurting financially for a while now, and the razor gang is out with their daggers looking for savings.

After the `ps3 slim' dropping support (due supposedly to lack of resources to write the hypervisor drivers), and then IBM dropping CELL for HPC ... I guess the writing was on the wall. I'm glad I gave up development on CELL BE some time ago and got hold of a beagleboard instead - overall it's been a more satisfying experience if only because things are simpler. The whole CELL thing was just a costly mistake for all by the looks of it - being a bit ahead of it's time lead to a few limitations that people couldn't cope with.

Even though I rarely use it anymore, the whole thing plainly stinks - this is not a device I rent, I bought it. And for them to come into my home and remove functionality (advertised on the box no less) from a device that I paid for in full (well over-paid) should simply be illegal, if it isn't already.

I can't even log onto the Sony blog to fruitlessly whine about it because they've changed the login system to some horrid mess that takes ages to load and only shows blank pages (I bet it works on ie6 though, if the comments in the page are anything to go by). Well if they don't want me as a customer it isn't really my loss is it?

Time to remove CELL BE from the subtitle of this page at least; not that it has had much point in being there for quite some time.

Tuesday 30 March 2010

Damn numbers

Hmm, that was frustrating.

Have been trying to write a `kernel' boot header - one that sets the MMU up for the kernel to execute at another address (0xC0000000) and then jumps to it. Been very tired from sleeping poorly and a bit brain-dead after work so I haven't been really switched on, but it's been dragging on so much I was about to give up (well not really, but it felt like I should).

Apart from a few little bugs, i was using the wrong TEXCB/AP flags for the level 2 page entry for devices ... but I don't know why it's wrong. It seems to check out in the manual, but for whatever reason it just crashes the code (FWIW I was using 0xb2 - 'non sharable device, rw everyone' rather than '0x16' 'sharable device, rw supervisor only). Blah. One little number change and now it works. $@%!$#

I plan to use the two translation table mode, which means the system memory will start at 0x80000000 - so it may make sense to just identity map the kernel at that address. But for now the memory map will have the kernel at 0xc0000000, and i'll start shared libraries or something else at 0x80000000.

So here it is ... in hindsight I may have done things in the wrong order, but this way makes things easy. I set aside some memory in the BSS section for the page tables and let the linker manage allocating space for them, also for the I/O devices - although this means a couple of physical pages are lost at present.

There is a few little `tricks' that I use so the code is position independent, although there are possibly better ways to do it. The init code has to be position independent because the linker script is set up so that all the code starts from the same virtual address - it could be done otherwise, but then I would need an ELF loader to relocate the image - which is somewhat more work.

 _start:
        adr     r12,_start              @ this will be physical load address
        mov     sp,r12
        push    { r0 - r3 }

First I just setup r12 and the stack to point to our load address - which is 0x80008000 as set by the linker script. This gives the code a fixed location from which to calculate physical and virtual addresses. The incoming arguments are saved too - although nothing uses them yet (das u-boot can pass in arguments or information about modules or filesystems it preloaded into memory).

        ldr     r1,bss_offset
        ldr     r2,bss_offset+4
        add     r1,r1,r12
        add     r2,r2,r12
        mov     r0,#0
1:      str     r0,[r1],#4
        cmp     r1,r2
        blo     1b

Clear the BSS - the code reads a relative offset that the linker creates, that indicates where the BSS starts and stops, and then uses r12 to map that to the physical address. The ldr r1,bss_offset is assembled into a pc-relative instruction so will work no-matter where it's loaded.

Then there is a loop which uses a table to initialise the page tables. I first need to find the space within the BSS where it is stored, and then iterate through the entries. Each range is defined by a virtual target address, a start offset relative to _start, a virtual end address, and the `small page' flags for the pages.

        ldr     r11,ttb_offset
        add     r11,r12                 @ physical address of kernel_ttb
        add     r10,r11,#16384          @ same for kernel_pages

        adr     r9,ttb_map
        mov     r8,#ttb_size
1:      ldm     r9!, { r4, r5, r6, r7 } @ virtual dest, start offset, virtual end, flags
        add     r5,r12                  @ physical address

2:      mov     r3,r4,lsr #20
        ldr     r2,[r11, r3, lsl #2]
        cmp     r2,#0

If the l2 page isn't set yet, then just allocate one and update the l1 entry.

        moveq   r2,r10
        addeq   r10,#1024
        orreq   r2,#1
        streq   r2,[r11, r3, lsl #2]

Form and store the l2 page table entry.

        bic     r2,#0xff                        @ r2 = physical address of l2 page
        mov     r1,r4,lsr #12
        and     r1,#0xff
        orr     r0,r5,r7
        str     r0,[r2, r1, lsl #2]

And then loop for all the pages and all the entries in the table. Here I compare for equality for the end address - I do this so I could map the last page of memory if I wanted to. But currently I don't use this.

        add     r4,#4096
        add     r5,#4096
        cmp     r4,r6
        bne     2b

        subs    r8,#1
        bne     1b

That's really the meat of it - the table has the smarts in it, and uses the linker to create the interesting values required.

Then it just turns on the MMU - this could probably be simplified as I can just enforce the state I want (i.e. don't bother preserving bits). Putting 1 in CP15_TTBCR means that two page tables are used, the TTBR1 table is used for any address with the top bit set (i.e. >= 0x80000000).

        mrc     15, 0, r0, CP15_SCTLR
        bic     r0,#SCTLR_ICACHE
        bic     r0,#SCTLR_AFE | SCTLR_TRE | SCTLR_DCACHE | SCTLR_MMUEN
        mcr     p15, 0, r0, CP15_SCTLR

        mov     r0,#0
        mov     r1,#1

        mcr     p15, 0, r0, CP15_TLBIALL
        mcr     p15, 0, r1, CP15_TTBCR          @ Top 2G uses TTBR1   
        mcr     p15, 0, r11, CP15_TTBR0
        mcr     p15, 0, r11, CP15_TTBR1
        mcr     p15, 0, r0, CP15_TLBIALL
        sub     r0,#1
        mcr     p15, 0, r0, CP15_DACR

        pop     { r0 - r3 }

        mrc     15, 0, r8, CP15_SCTLR
        orr     r8,#SCTLR_MMUEN
        mcr     p15, 0, r8, CP15_SCTLR

This last instruction turns the MMU on (and will probably eventually turn on the caches/etc). The input arguments are restored before turning on the MMU since the stack memory will no longer be valid or mapped (actually I should probably map the same 32K to the system stack wherever I decide to put that). The CPU now flushes the pipeline and starts executing instructions from the current pc - but with the MMU on. Because of this the code has to ensure this instruction is still mapped to the same address otherwise it's a one-way trip to la-la land.

In this case the ldr pc,=vstart will force the assembler to generate a constant load from the constant pool (via a pc-relative load). The linker will set this constant up to point to the virtual address properly.

        ldr     pc, =vstart

Now come the relative offsets used to locate the BSS range, as well as the page table memory from within BSS.

bss_offset:
        .word   __bss_start__ - _start
        .word   __bss_end__ - _start
ttb_offset:
        .word   kernel_ttb - _start

And then the important stuff - the page table mapping descriptions. Rather than store the 'virtual end' address it could probably store the length of the address range, but so long as they are aligned properly it doesn't really make much difference. Note that even with the relative addresses any range in memory can be accessed using the simple arithmetic that the linker supports.

ttb_map:
        @ this page, so mmu can be enabled
        .word   LOADADDR, 0, LOADADDR + start_sizeof, CODE
        @ kernel text at virt address
        .word   __executable_start, 0, __data_start__, CODE
        @ kernel data
        .word   __data_start__, __data_start__-_start, __bss_end__,DATA
        @ system stack, 32K, 4K from end of memory
        .word   0 - 32768 - 4096, 0x8000000 - LOADADDR, 0-4096, DATA
        @ i/o of gpio, for debug too (LEDs!)
        .word   GPIO5, 0x49056000 - LOADADDR, GPIO5+4096, NDEV
        @ do serial port too, for debug stuff
        .word   UART3, 0x49020000 - LOADADDR, UART3+4096, NDEV

        .set    ttb_size, (. - ttb_map) / 16
        .ltorg

The .ltorg ensures the constant pool is stored at this point, so we can guarantee they are within the one page which needs to be identity mapped immediately after turning on the MMU.

vstart:
        ldr     sp,=-4096                       @ init stack
@       bl      __libc_init_array               @ static intialisers
        mov     r8,#(0xf<<20)                   @ enable NEON coprocessor access (still off though)
        mcr     p15, 0, r8, c1, c0, 2
        b       main

And this is the 'virtual address' entry point. This could just occur immediately after the setup code, but separating it makes it more obvious it's separated. About the only necessary setup is the (system) stack pointer. I was going to place this at the end of the virtual memory but having it one page back protects from stack underflow as well.

And finally there is the size of this code, and the BSS which stores the bare minimum so I can set it up and see it works (i.e. the UART or blink the LEDs).

        .set    start_sizeof, ((. - _start)+4095) & 0xfffff000

        .bss
        .balign         16384
        .global kernel_ttb, kernel_pages, UART3
kernel_ttb:
        .skip   16384
kernel_pages:
        .skip   1024*32
GPIO5:  .skip   4096
UART3:  .skip   4096

And ... it's done. Phew.

Unfortunately this means all my 'library code' that uses fixed physical addresses wont work any more, including the debug printing stuff. But that's something to worry about later.

One goal I had was that code isn't just setting up the page table to be thrown away later - this is sufficient to remain the kernel page table forever. Either for a supervisor level kernel process/threads, or for in this case as the `system page table' which is used for any address above 0x80000000. It still needs a little tweaking - the page table should be write-through cache-able for instance - but now it works I can worry about the details. Well now hopefully I can move on to more interesting things.

Interpolating arbitrary values

For work I have been playing with a few things of some interest. I thought I needed a function that could interpolate a set of values spread across an arbitrary 2d plane into a grid of values. I came across this interesting implementation of Thin Plate Splines which seemed to do the job. Unfortunately it turned out that I needed to interpolate more values than is practical with this algorithm (it does it, it just takes too long), and I can probably just force the values to be in a grid anyway so I can use much simpler methods. But still, this is an interesting algorithm to have in the toolkit and it produces pleasant looking results. Interestingly I found the C++ 'ludecomposition' code too messy to convert to C (i'm using different data structures) and just used the Java one it references as a starting point instead. It was much more C-like and translated in a very straightforward manner.

So I wrote a basic bicubic interpolater - the code uses bilinear at the moment although in an inconsistent way which doesn't really work since values can be missing. I was hoping bicubic would be a more natural fit for what it is doing, and worry about the missing values later. Unfortunately it doesn't seem to help much - the input data is just too noisy/inconsistent so I guess there is more to fix first (sorry this doesn't make much sense, I can't really say what it's trying to do).

Walls, dirt

I have some photo's of the progress on the retaining walll but i'm too lazy to put them up today. I got some ag-pipe on the weekend, so I'm just about ready to back-fill at least some of the wall (i don't think I have enough gravel to do the whole lot, but i'll see), although I'm not sure where to run it - and an outlet mid-way along the wall i've already laid will be a bastard! I was going to have it coming out the ends but now i'm not so sure. I need to decide so I can get the right fittings too (which for some reason are rather expensive for what they are).

Boral are having a sale on bricks and whatnot this week so I went and ordered another pile of retaining wall blocks (40% off makes it worth it, even if I don't need them for a while). I wasn't really sure how many I needed to start with, and I used a lot more than I thought originally (just the main wall uses most of them). I have a better plan on what I want to end up with now, so hopefully I got it right ... I guess I can always put them around trees or something if I have too many, or create a lower wall if I don't have enough.

Since I wont need to use them for a while i'm going to try to get them delivered into the driveway - so I don't have to move them off the verge by hand. So today I also moved the rest of the roadbase off of the drive-way to a pile out the back. Unfortunately I overloaded my cheap wheelbarrow and it turned over and I bent the handle (well it was only $60), but it's still usable. If I get stuck into finishing off the walls around the paving area it will get used up pretty fast anyway - of the 3 tons I probably have under 1 left. I'll get the bricks before easter, so it could be a very long long weekend if I get stuck into it ...

Thursday 25 March 2010

Well said.

The Haiku mailing list is the strangest list I ever did subscribe to.

This post from Justin is completely super-fantastalistic. Very good read, if completely off-topic.

Still, this thread was basically the last straw for me at the moment on the Haiku list (but not enough on it's own) ... I guess I may return one day though

Update 1/4/2010. I've edited this post to change the tone a bit - I shouldn't have posted intoxicated as usual, or at the insane hour of the morning I did either, IIRC.

Read an enjoy.

from Justin x
reply-to haiku@freelists.org
to haiku@freelists.org
date 22 March 2010 02:32
subject [haiku] Re: The Holy Bible
mailing list haiku.freelists.org Filter messages from this mailing list

This whole discussion is irritating... when followers of these faiths ignore the fact that they have mutually exclusive religions and the claims of the Christian bible for instance specifically demand the death of false prophets etc (not to mention killing gay people, atheists, witches, anyone who tries to convert you, your own children if they are unruly, rape victims (or just forcing them to marry their rapists), etc and don't try your "Jesus changed that" crap because I can point out to you a litany of verses where he specifically states that he did NOT come to change the old laws but to ensure their fulfillment until the end of time... EVERY JOT AND TITTLE etc... and how he numerous times references the validity of those old laws INCLUDING THEIR PUNISHMENTS OF DEATH etc. So don't even try that ignorant and wrong excuse).

Islam is the same way in it's treatment of Kuffar etc... death to apostates (as Sharia law demands that in a Muslim country under Sharia law if you say that you are no longer a Muslim the punishment is to be killed. How enlightened and civilized!), conversion or death for non-Muslims etc... not to mention that Islam specifically denies the divinity of Christ, who for Christians IS God etc... so that COMPLETELY denies a shared belief in the same God... as they're just two absurd variations on an older myth... just like Mormons in the US today are an even more recent absurd fallacious re-invention of older mythology... exactly as what Judaism was to the pagan polytheistic pre-monotheistic god of Moses... what Christian was to Judaism... what Islam was to Christianity... what Mormonism is to Christianity... just newer and newer absurd twists on an already absurd mythology invented by painfully ignorant barbarians who knew less about the natural world than small children know today.

Not to mention that when it comes down to it any one of the followers of any of these faiths INHERENTLY denies the validity of the others... but is too stupid to actually comprehend that the very reasons they deny all those other religions are the exact same reasons that apply to their own stupid fairytale. They just can't put two and two together because they're brainwashed, compartmentalized, idiots who utterly lack the ability to apply logic, reason, critical assessment etc to their own beliefs even when they do use them to dismiss other beliefs.

And to top that off, again, you have people like this who smile and act like it's some happy thing that you believe in the same stupid fairytale in spite of the fact that in reality you DENY the validity of their belief and their different interpretation on a completely fabricated mythology.

So when I see grown adults behaving like idiots who believe in imaginary gods invented centuries or millennia ago by ignorant bronze age tribesmen etc... and then ignoring what the very holy scriptures they're tittering about actually say etc... and end up just patting each other on the back because their mutually shared delusion stems from the same older absurd mythology... and that they have in common that ignorant and irrational belief in the supernatural etc...

I honestly didn't expect to run across so many idiots on a mailing list for a not well known, in-development, operating system.

And for the record I actually hate Satanists... I made the earlier post just to illustrate the principle of people talking about a pretty stupid topic on a mailing list that other people might find offensive for its stupidity and implications. The reality is that you can't even legally reprint the Satanic Bible because it's still a copyrighted work, having only been written in 1969 etc... but I'd wager that probably nobody in this drooling bunch of idiots actually knows much of anything about the actual Satanic Bible... or the fact that true Satanists who are actual members of the Church of Satan DON'T ACTUALLY BELIEVE IN THE SUPERNATURAL AT ALL. They don't believe in a literal being called Satan and merely refer to him as a symbol of rebellion, self importance, etc... but pretty much not a single religious person I have met knows any of that because they actually do believe in make believe mythological beings and invisible all seeing men etc... and they spend so much time lying to each other about the boogieman that they never bother to actually educate themselves not only about the actual REAL history of their own religions and what their own scriptures actually say and command them to do etc... but certainly not anything that might ever, in any way, pose a threat to their delicate bubble of profound closed minded IDIOCY.

And I say all this as a person who was a Christian and youth minister for over 20 years.

I hate you people for being so damn stupid and for publicly blabbering about it like it's something to be proud of... and for having the false impression that you have some right to be exempt from being ridiculed as the idiots you actually are. That people like me need to sit and watch you publicly make asses of yourselves and seek to get together to try to bring your idiot cult and its literature to a new OS to try to share and spread your profound retardedness and brainwashing to future generations.

Like try this on for size you morons...

We know for a FACT that the Earth was not created 6,000 years ago... much less the whole universe. We now FOR A FACT that humanity did not suddenly appear 6,000 years ago in a magic garden in Mesopotamia. And because we know FOR A FACT that claim is false... we know that Adam and Eve weren't there strolling around and talking to a talking serpent either. Nor were there magic fruit trees that they ate from BECAUSE THEY DIDN'T EXIST. And because we know that, we know that they didn't eat that fruit and commit the original sin because THEY WEREN'T THERE. Thus there was nothing for the made up God to be mad about.. nothing for him to create himself as his son to forgive us from because it never happened because they were never there.

Can your tiny minds follow that train of logic?

http://en.wikipedia.org/wiki/Age_of_the_Earth
http://en.wikipedia.org/wiki/Human_evolution
http://en.wikipedia.org/wiki/Timeline_of_human_evolution
http://en.wikipedia.org/wiki/Human_mitochondrial_DNA_haplogroup

By the very same logic and facts, we know that Noah's Ark NEVER HAPPENED. Not only is it laughably absurd from a physical and logistical standpoint... but that much water has never in the history of this planet existed here on Earth. And had it actually rained that much (not to mention the laughably stupid claim that the bible actually makes about the heavens, the firmament, actually being a mechanical dome over head to hold the literal ocean above us up from falling on us... in which god opened up mechanical floodgates to let all the water pour down to flood the Earth and then made it all evaporate... but I digress...) had it actually rained that much and then evaporated in that span of time it would have saturated the entire atmosphere up into the stratosphere to over 100% and blocked out the sun completely and frozen the entire Earth wiping out life as we know it... and this all supposedly happened in the last 6,000 years... you know when the Egyptians had a kingdom etc... and these other civilizations failed to notice a world wide flood and subsequent world encompassing life obliterating ice age etc... IT'S LAUGHABLE at how absolutely detached from reality you people are that believe this tripe... and these weren't figurative stories either... as we know from a wealth of contemporary archaeological evidence that the people of those times LITERALLY BELIEVED these things because they were IGNORANT and knew essentially nothing about the natural world so they made up answers as best they could to describe the world around them and took it as fact... such as the fact that up until a few centuries before Christ they all still believed that the Earth was a circular flat disk. They LITERALLY believed it because the earth looked flat to them. All the older Old Testament books in the bible are written from that perspective and contain references to the Earth being flat etc.

But I digress...

When people are faced with reality and facts like this which threaten their stupidity bubble, they jump to make Ad Hominem personal attacks... to argue style over substance... to ignore the burden of proof for their ridiculous claims... to use special pleading to try to claim exemptions for their own idiotic beliefs that they don't allow to be applied to anything else... they erroneously and without any evidence presuppose the validity of their wrongheaded mythology and then use that unsupported presupposition to then fallaciously dismiss facts and actual solid real world evidence that disproves their claims.

They hem and haw, they try to claim Pascal's Wager... they intentionally (or simply out of ignorance or stupidity) confuse general deism with faith in a specific dogma such as Christianity etc... kind of like you idiots are doing giggling about how you share the same God etc... completely failing to recognize the very mutually exclusive and specific claims made by your various faiths etc... (and such fundamental schisms also exist WITHIN the faiths... Sunni and Shia... Catholic and Protestant... and people still MURDER EACH OTHER TODAY over these stupid and worthless disagreements over IMAGINARY SHIT.)

I could go on and on.

You people are IDIOTS and I resent you for bringing your idiocy to this list.

I would expect that people who are interested in an operating system such as this would be a little more educated than average... a little more information savvy etc... that people would be capable of pulling their own heads out of their own asses and seeing the absurdity and utter lack of knowledge they have about the world and their own beliefs as a part of that very real world etc.

THINK.

PLEASE.

Good day.
- Hide quoted text -

On Sun, Mar 21, 2010 at 11:20 AM, Mattia Tristo wrote:

Very nice, i think we belive in the same God.

Good work to you and cheers
Mattia Tristo

2010/3/18 Nicholas Otley

Hi Mattia,

I can't help you on the Bible, but I'm planning on developing The Holy Qur'an and Nahj al-Balaghah for Haiku (Arabic and English).

Take care,
Nik

On 18 March 2010 18:27, Mattia Tristo wrote:

Here in the italian mailing list we are discussing about developing the Holy Bible for Beos/Zeta/Haiku, there is someone who want to develop it in English too?

Hope don't wasting your time, let me say best regards to you
Mattia Tristo

Monday 22 March 2010

Google Summer of Code for BeagleBoard

Here's some good news - BeagleBoard.org has been accepted for the Google Summer of Code programme. At the request of the administrator Jason Kridner i've also signed up as a possible mentor.

There is a gsoc group, a page of project ideas, and there's also there's the list of existing project ideas.

I'm not really sure which projects I would like to mentor given the opportunity, but there are a few that piqued my interest:

Minix 3 Support - although TBH I've looked at the latest incarnation of the Minix 3 source, and it is very heavily tilted at x86 processors and not arranged in any particularly extensible way. It is also a far cry of untidiness compared to the Minix 3 of Andy Tannenbaum's book. Support for RTEMS (and embeded realtime os) could also be interesting although I don't know much about it.
JTAG Debugger. Unfortunately I myself don't know much about it so probably not for me to mentor - but it would be nice to have a free software hardware debugger.
Improved bootloader support, which mainly falls down to implementing a USB stack for a bootloader so USB ethernet and USB mass storage become available devices. There are currently no free and re-usable USB stacks outside of the main free operating systems so this could be useful for other reasons, including selfish ones.
Any of the many DSP acceleration or NEON acceleration projects. e.g. OpenVG on DSP and/or NEON would probably be a good sized project with utility. Both `heterogeneous computing' and SIMD processors are going to play a big role in computing for the foreseeable future.

But there's a ton of other interesting projects in there.

Sunday 21 March 2010

A couple of links, the wall of dread lurches forward, dried meat.

A couple of interesting posts I came across this evening:

Obliquity by John Kay, an old article from the Financial Times, via Naked Capitalism.

Showing examples of how direct goals such as 'increasing shareholder value' are the least successful way to achieve it. Food for thought in this crazy modern world where CEO bonuses and shareholder returns are the the sole quantification of success of a company or industry. Interesting in light of globalisation, out-sourcing, patenting of everything from ideas to seeds to animals, and so on and so forth that all put private profit before society.
Curtain of Tragedy Will Be Raised Soon Enough, But Perhaps Not Next in Japan, Jesse's Café Américain.

A pretty depressing look into the crystal ball for the next decade or so. I don't think Australia for one realises how fragile it's current `resource-driven' aka `lucky-country' or rather, `stupid-country' economy really is (as opposed to 'the clever country' era which gave the me the opportunity to go through Uni). A little god-bothery at the end, but don't let that put you off, Jesse always has lots of interesting things to say.
More than a four-letter word from Financial Armageddon, and Revolution, flashmobs, and brain chips. A grim vision of the future from The Guardian.

Both about a new report from the UK Department of Defence about the world-wide strategic threat environment in 30 years time. As The Guardian's sub-heading says, very grim reading. It's all about the population really; I can't imagine any palatable political solution, and any trend line with no limit has no scientific solution either. In for a bumpy ride even if these glimpses of the future are out by a few years. Although at my age I may just escape it through natural means.

All very depressing. I don't think it helped that I was sort-of watching a pretty grim tv series 'The Survivors' at the same time either. Although I don't think the writers there can do maths terribly well - a 90% reduction in population would still leave a sizeable number of people in Britain - over 5 million. Not one dozen, and even raw meat takes more than a few hours to go off on the nose in a disconnected fridge! Hmm all this reminds me for some reason about a book I read a few years ago called `The Genocides' that left me deeply disturbed for at least a week. It just a sadistic device of the author simply give the reader un unpleasant time and lasting memory. It is a grotesque tale with no redeeming features whatsoever. Avoid.

Wall of Dread

Well, on to more mundane and thankfully lighter topics. This afternoon I finally got off my fat bum and spent 4-5 hours working on the main retaining wall. Managed to finish the bottom layer and stacked another two on top since I was on a run - I still need some ag-pipe before I can back-fill each layer with gravel, so I should try to get that early this week so I can get most of the work just out of the way at last. I was going to walk it from the plumbing shop - it's only about 1km away. Although I may have stuffed up a bit - i'm not sure where the ag-pipe should drain to, and I didn't leave any drainage holes along the 12m of the run. It can still come out of the ends though, so maybe i'll just do that (unlikely i'll move all those damn blocks again, particularly the bottom row). I also re-cut the stair-case base to make the steps deeper, although I can't fill that in till I have that ag-pipe.

I measured the height along the length once I finished it rises a bit in the middle unfortunately - about 8mm, but at least in a relatively smooth curve. Yes yes, stupid I know - I should have regularly measured, but I did have another string-line at the brick level and used a couple of spirit levels, but all those little `close enough fuckit's add up. But it doesn't really matter because it only goes out `significantly' where the stair-case is, which is a natural break in the line. I guess i'll have to see how much it settles over time too. My $5 rubber mallet disintegrated before the last few base bricks were done, but I managed to tap down and level the last few with a bit of wood and a heavy ball-peen hammer - pretty well over it all by that stage, but I think I didn't drop off too much in finishing quality. Knowing my luck it will be a bit out compared to the boundary (and to-be-shed-wall), but that should only affect the aesthetics unless I made a really big mistake.

Dried Beef

The biltong turned out more or less ok. I've been sampling bits as I went along to see how it 'aged', and the first bits were a bit young (they were also thinner). The flavour didn't really settle till today, 6 days in - once the vinegar had dissipated. I made a slight variation for two sets of strips. One I rinsed in water and vinegar to remove the excess salt, as according to one recipe, and the other I left as is in an attempt to keep more flavour in-tact, according to another. The vinegar one ended up pretty bland, so i'm not sure that's the right idea. But the other one is a bit salty - ok with a beer in hand or after a long hot day in the sun, but not fantastic. Since I was caught up with some other things I also salted it for about 15 hours, so maybe next time I'll just try a shorter seasoning session. Although I imagine most of the salt would still be clinging to the outside once it is hung up. Need a way to get more chilli flavour on it too - maybe some sambal pedas as a final stage. Still, one bit I tried today I'd label a success, even with the saltiness. Almost has a touch of salami flavour to it, mixed with bbq steak, a bit more chilli and less salt and it could be a winner. 1kg of well-trimmed lean porterhouse made about 400g of biltong; so it's pretty cheap to make. Will have to investigate a better drying cabinet than a house-moving box with a shade-cloth cover though, at least if it becomes a regular task.

Thursday 18 March 2010

SSE, gcc, AMD

I'm a bit ahead of schedule on my work and I am awaiting some feedback so I thought i'd look at what SSE2 could do to speed up tiny bits of what it does. Might not need to use it, but it'd be good to have some sort of feel for what it can do - e.g on CELL it can make an order of magnitude difference. Some interesting and surprising results.

As a quick aside, while searching for why using an SSE intrinsic kept complaining about assigning to the wrong data-type (turned out it was just a typo and gcc didn't complain about the undefined `function') I came across this interesting comparison of SSE optimisations on common compilers. Actually I was quite surprised that gcc was doing quite so much with constant subexpression elimination of vector quantities - I thought intrinsics were pretty much in-line assembly with register aliases. Anyway good to see it is well ahead of the pack in all departments.

First, I tried implementing the following function:

X = abs(Z)*exp(Y*i)
Z, X complex, Y real

The C code is all pretty straightforward:

                        Xrow[x] = cabsf(Zrow[x])
                                * cexpf(Yrow[x]*I);

The SSE code was a bit messier, but for such a simple function quite manageable. I found a nice library of SSE optimised transcendental functions needed to implement a vector cexpf(), and the Cephes maths library was an excellent reference for the internal workings of functions like cexp - e.g. so I could remove redundant work.

                        v4sf d0, d1;
                        v4sf i, r;
                        v4sf absz;

                        v4sf rr, ri;
                        v4sf ci, si;

                        d0 = Zrow[x];
                        d1 = Zrow[x+1];

                        i = _mm_shuffle_ps(d0, d1, 0x88);
                        r = _mm_shuffle_ps(d0, d1, 0xDD);
                        absz = _mm_sqrt_ps(r*r + i*i);

                        i = Yrow[x/2];
                        sincos_ps(i, &si, &ci);

                        rr = absh*ci;
                        ri = absh*si;

                        d0 = _mm_unpacklo_ps(ri, rr);
                        d1 = _mm_unpackhi_ps(ri, rr);

                        Xrow[x] = d0;
                        Xrow[x+1] = d0;

I'm sure it's not the prettiest code, and it may not even work (not sure my shuffles are correct), but assuming it is otherwise correct, it'll do. I also tried a double version for simple C, and the above sse version with 1 unrolled loop.

I have a number of machines to try it on, a pentium-m core-duo-wtf-its-called, athlon 64 x2, and a phenom II 945. I'm using gcc 4.4.3. Running the loop 10 000 times over a 64x64 array using constant loop counts. I'm using the aggressive optimisation flags suggested by AMD, including -march=amdfam10 and -funroll-all-loops, and also c99 features like restrict.

Anyway I'll keep this brief. I'm not so much interested in the various CPU's as the relative differences within them.

 Code             A64-X2    Phenom II Pentium-M

 C single           1.16         0.84      1.17
 SSE single         0.69         0.31      0.37
 SSE unrolled       0.66         0.31      0.37
 C double           1.36         0.90      1.25

Well it's certainly an improvement, but nothing like on Cell. The Athlon-64 X2 really suffers from a 64-bit internal data-path, so the SSE is not even 2x faster, even though it's calculating 4x values at once for most operations. The intel shows the biggest improvement using SSE at just over 3x, but the Phenom is about the same. gcc has much improved since I tried similar things on the SPU so perhaps the non-sse code is 'just that good', but I can't help but imagine there is more silicon dedicated to making non-sse code run faster too. Oh and it is interesting that double's are pretty much on-par with singles, at least in this case. I was hoping that moving to singles would make a bigger difference - too lazy at this point to try a double SSE3 version (i'd have to write sincos_pd for example).

I tried this first because I thought the a little complexity (yet not too time consuming to write) would help the compiler do more optimisations, and it is complex enough it can't be memory constrained. I couldn't get gcc to inline the sincos_ps, but I doubt that would've made much difference.

Then I tried something much simpler, just adding two 2d arrays. I didn't use sse for this, but just used the vector types. And just looped over every element of every row - I used a stride calculation rather than treating it as completely linear. Hmm, perhaps I should have used non-constant dimensions to see what the compiler did with more general purpose code.

a[x] += b[x];

gcc unrolled the whole row-loop so things are about as simple as they get. Actually gcc vectorised the version using simple C types as well (well if it couldn't vectorise this loop it'd be a worry), and then it included the two versions: one vectorised for when the addresses are aligned appropriately (which they are), and one using the FPU for otherwise. Looking at x86 assembly gives me a headache so I didn't look too closely though. Since this loop is so simple I had to run it 1 000 000 times to get a reasonable timing.


  Code             A64-X2    Phenom II Pentium-M

  C single           8.5         1.1        2.1
  C vector single    6.0         0.98       1.2

Ok, so it's tuned for the phenom, but wtf is going on with the athlon64? Memory speed constrained? I presume they are all constrained by the L2 cache, or I might even be getting cache thrashing or something (this is the sort of thing you could guarantee on cell ... huge pity IBM aren't progressing it for HPC). Don't ask me why the pentim-m version is so different - afaict from the source they're both using SSE2 code since the memory should be aligned properly. *shrug*.

So after all that i'm not sure i'm much wiser. The question being - is it worth the effort to manually vectorise the code or even use sse intrinsics on an x86 cpu, at least for the algorithms i'm using? Hmm ... probably ... even a 2x increase on a minor operation is better than nothing, but it'll only be an incremental improvement overall- but at least in every case it was an improvement. These are all using 32-bit operating systems too, I wonder if I should try a 64-bit one. Well i'll see how much time I have to play with it.

Should really try GPGPU and/or OpenCL simply to get a good 'feel' of what to expect.

Tuesday 16 March 2010

Indecision, games, and the rest

Wow, tired. Then again I was up till 3am last night hacking on some MMU code, so that probably isn't surprising. And some damn person called me with a wrong number at 8:30am too. Been a long couple of days. Oh hmm, now i remember, I was up for 23 hours on Saturday too - after a very early awakening that led me to the last post, I spent all afternoon helping a mate put up a fence, then a few beers and pizzas and a late night later ended up crashing about 3am. No wonder it feels like I need a weekend, and it's only Tuesday.

Been very busy with work, and had a few wins today - I think i'm all out of endorphins now, so I can't get back into the MMU code as i'd planned to. As a side note, whilst generating data to test some algorithms I accidentally created the image below. What took me about it is that it is surprisingly uncomfortable to look at - it seems to jump around randomly, about 3 times a second (at least on a bright LCD monitor). I don't think it's just my tired or coffeed eyes ...

On the MMU code I was writing a more comprehensive bootstrap function. One that initialises the MMU to relocate the loaded image to somewhere appropriate, and in a way that it can remain in use for the running system. I have some code, and it's nice and compact but I'm still undecided on the final memory map. For example, the ARM CPU on the Beagle can have two page tables, one for high addresses, i.e. the system, and one for the rest. If I use that feature it forces the system table to use at least the top 2G of the virtual address space, which was different from what I had in mind. Not that it matters I guess since the machine only has 256MB of memory anyway ... so probably no big deal. I'd like to use that option since it simplifies the management of dynamic globally shared data the way I was thinking of doing it (i.e. shared library code/rodata). There's a few other details i'm still undecided on too, it would be nice to use section pages for instance (1MB pages that only require a level 1 PTE).

I spent the few nights earlier working on the higher level MMU code, managing translation tables, allocating physical pages and thinking about how to implement efficient `agile' (migratable) memory. The actual page table manipulation is pretty simple but you need a fair bit of infrastructure before you can use it.

So I also wrote a simple first-fit memory allocator that uses an AVL tree to coalesce free blocks. It's something i've researched quite a bit in the past and first-fit works surprisingly well for a general allocator particularly if there's a quick way to coalesce frees (the main drawback being not-so-great worst-case allocation performance). Even though I've written one before (for AROS), I'm using the AVL code derived from the parent-linked one from libavl, since I couldn't find the final version of the one I wrote and it's probably better anyway. I modified it though to remove the iterator and tree objects so it can be used without requiring any additional or dynamic memory. I also poked the balance factor into the low 3 bits of the parent pointer so I can jam in the free block size into 16 bytes - which determines the minimum allocation granularity. If the linear search for the first-fit allocation proves too inefficient, I can always add another tree and bump the minimum granularity to 32 bytes which is probably still ok. Although the overhead of managing two trees may not be.

And I played with a couple of different ideas for allocating physical pages. I tried a bitmap, which is pretty simple but tricky to allocate large ranges (>32 pages), has poor worst-case, and has no room to store any meta-data. So then I went to an indirectly linked list of words, mainly because I then have somewhere to store extra per-page data like a shared refcount, or other bits of information. But that's even messier to allocate any range more than 1 block in size; although I thought I didn't need to do that, I do at least need to allocate 16kb blocks for the first level translation tables (although I only need 8kb if I use the dual-translation table setup). So I may end up going back to the idea I had with my other os which was just to track the holes in the address space using a tree - much like with the gp memory allocator but without being able to use the holes themselves to store data. I suspect I chose to do it that way for a reason - it has a reasonably high worst-case memory requirement, but that is unlikely, and bounded. Since I will need the functionality for a global virtual address allocator for the 'agile' memory blocks, it probably makes sense to just use the same thing.

For something different i'm also trying to make some biltong. Thought about it for ages but never got motivated enough to try it. I was at the Central Market on Friday arvo and grabbed some dried meat and super-hot sausage and found it particularly yumscrious. But the extraordinary price helped me to decide that it was time to try to make something like that myself. I just started with 1kg of beef to see if it works or if I just end up making a batch of biological warfare agent. I should really have given the kitchen a good clean first. I thought i'd skip the receipe that starts with '25lbs of beef or game', at least to start with ;-)

Read an interesting article a few days ago about five creepy ways video games try to get you addicted. The point I found most interesting was the assertion that games are basically taking the place of an un-fulfilling work environment, and particularly since the points mentioned seem to match my hobby and at least the 'good bits' of the work I do:

Autonomy (that is, you have some say in what you do day to day);
Complexity (so it's not mind-numbing repetition);
Connection Between Effort and Reward (i.e. you actually see the awesome results of your hard work).

Maybe that's why I don't get terribly obsessed with games - in effect i'm already playing one. Particularly of late, this has been very obvious to me - I can't put a problem down until I get it to work, or at least finish some little mile-stone along the way. If I can do that I can go away and feel a bit pleased that that 'target' has been reached, even if I know they will never end. On the other hand, if I can't it can be very disappointing, and it can even act as a catalyst to something worse since I have no other distractions just now. The other point this ties into is quite interesting too:

What is a game?

Well, we humans play games because there is a basic satisfaction in mastering a skill, even if it's a pointless one in terms of our overall life goals. It helps us develop our brains (especially as children) and to test ourselves without serious consequences if we fail. This is why our brains reward us with the sensation we call "fun" when we do it

Basically games are a way to learn stuff. I guess a bit like fizzy drinks or other food with flavour but low nutritional content, some of these modern games are simply abusing key survival traits honed through evolution over millions of years simply to make a buck. This is what the world has become, hollow food, hollow entertainment, and hollow learning, for a hollow existence?

Actually I'm not so sure any of the survival skills gained through evolution are much use to us any more. Most of them don't work when you have unbounded supplies of high-energy food-stuffs within easy reach, spend all of your time sitting on a chair behind a desk or in a machine screaming across a layer of processed oil, or simply have so many people that there's no room for them or any other animals simply to live in peace. Evolution only optimises based on the past, not on the future. Yes, we're all screwed - keep an eye on the fish.

Saturday 13 March 2010

Fork'n Evolution!

I just saw a post suggesting it is time for Ubuntu to fork evolution and thought it deserved a comment ... Although now i've just about finished this post I realise just how silly the whole notion is to start with, so it probably doesn't.

It is a pretty bizarre suggestion.

Even just on a technical level, the author may not realise just how much work is involved in maintaining a piece of code the size of Evolution. Or the amount of organisation and effort required to make the sort of major structural changes it probably needs before any real progress can be made.

And politically it is a naive and arrogant to suggest one GNU/Linux company can 'go it alone' like that on any major piece of software, particularly one developed by a competitor. It is also pretty rude - I haven't even touched Evolution for 5 years as a user or developer, but as a free software developer and advocate I think forking should really be the point of last resort. And even then it would only be after a demonstrable and intractable conflict had arisen - e.g. refusal to accept many reasonable patches of significant size. A few 10 line patches hardly turns one into a maintainer. And I wonder if anyone who might be involved has even done that.

Evolution was always a strange free software project, which i've written about plenty of times before. Whilst I was working on it it had only a few external contributions, and even less so from any of the GNU/Linux vendors (apart from Sun). Most had patches just for (re)branding, but the occasional bug fix they did have rarely made it to the code-base through normal processes i.e. they rarely if ever submitted them to us. Not that we made it terribly easy, anally retarded patch submission and review processes, the copyright assignment crap, and simply being busy most of the time.

But even despite that - it is a free software project, and open to contributions from anyone. There is simply no need to fork it, if anyone desires to improve it they can, and indeed are encouraged to. It is part of the `social contract' if you will, of any GPL project. You get the code for free, and if you want to make changes you can, and add them to the public pool for the good of all. And unlike many 'open source' projects - Evolution was always a real free software product and not just an alpha-grade product looking for free beta-testers and code-monkeys like many seem to be.

"They didn’t understand that there is just one killer feature (just as with integrated desktop social networking) that needs to be in there which is Exchange support."

Back on topic ... this one statement probably undermines the post itself more than any other. As Jeff said in comments on the post this was Ximian's business model pretty much from the start, and one of the reasons Novell bought them. It was pretty much the reason Evolution existed anyway, although more to replace the need for Microsoft Exchange than to work with it - at least in my mind. As a free software developer I never saw the need to inter-operate with proprietary crap like that - which is probably why I was never working on that part of the code.

Also the whole bit about `social networking' mentioned in the post - this was a major part of the internal thinking at Ximian right from the start - before the term even existed. Although like then, I still don't really see the need for it, particularly in a corporate environment. We all just used IRC and it worked just fine for us.

I was working on the mail code even longer than Jeff - I can't believe I started over 10 years ago. I haven't looked at the code since leaving Novell and the project, and I never worked on the `microsoft exchange component' anyway, but i'd suggest the code was simply never up to snuff given the way it originally designed, and even just because who wrote it.

Nice guy and all, but not great at coming up with a maintainable design let alone writing reliable C. Jeff and I's experience with the original IMAP code was a nightmare that started when he simply abandoned it in a fit of terrible management. One day he simply began refusing emails or dealing with patches saying he'd moved to another project (might have even been the microsoft exchange plugin). So we were left maintaining a really nasty bit of code with no deep knowledge of how it worked. It was all too lisp-like with little or no structure which made it very difficult to grok particularly given it's requirement for multi-threading. Lisp might be a great language, but it simply isn't how you write C.

And apart from the code itself, I believe the microsoft exchange component had a messy architecture layered on top of the already overly complex ones (yes, multiple) within evolution. Partly because all the different data-types were routed through the same connection, and also because it started as a proprietary extension and had silly things like a symbol obfuscater. To be honest once I found out how it worked I was surprised it worked at all ... it was not surprising in the least that it was slow and buggy and probably always would be.

Disbanding the original team and moving the project wholesale to India can't have helped either, at least at the time. Apart from the brain drain that involved, India has a tendency to add layers of middle-management too, so big decisions we could have gotten away with by the end (probably without asking ;-) became completely impossible due to risk averse middle-managers only worried about their next promotion-due-to-tenure. High staff turn-over due to a liquid job market was also an issue.

The entire component architecture of the system should have been replaced years ago (I wouldn't know - perhaps it has been) - almost all of the early design decisions were wrong, and particularly outside of the mail code none of it was ever reviewed because of the high turn-over of developers. Inside the mail component there was a huge scope for doing things in a much more re-usable way. CORBA got a bad wrap because we had an absolutely terrible design and often simply used it incorrectly. The original bonobo design was just a COM clone, so no surprises there - pluggable multi-process widget systems was a stupid use of CORBA. But the evolution-data-server wasn't, although the interfaces were not as good as they could have been.

I don't think there is any conspiracy - but Novell are a proprietary software company first and foremost and a project like evolution never really fit it's culture - probably the major reason I left in the end. Once HR starts impacting on engineering I think you have worries as a technology company, and even more as an `open sauce' one where the most valuable `eye-pee' you have is in your employee's heads.

Canonical make some odd decisions about Ubuntu too - more odd since they are purporting to be such a community focused organisation. But I don't think even they would be arrogant or silly enough to seriously consider such a strange notion as forking evolution. I'm sure they could surprise me though.

Tuesday 9 March 2010

Syscalls and IPC

The next thing i've been looking at is system calls, and using them to implement IPC.

I wanted system calls to lead to a direct context switch efficiently, so that for example, in the simple case an IPC call to a server process acts synchronously. But on the other hand, there are other tasks for system calls which are really just isolated or system-level function calls, and they don't need the overhead.

So I came up with the idea of two levels - simple system calls which are invoked and return much any other function invocation, and heavier ones which might lead to a context switch. I just then use the system call number to chose what code to execute.

It's actually easier just showing the code I came up with than trying to explain it.

svc_entry:      
        cmp     r7,#SVC_SIMPLE_LAST
        push    { r12, lr }
        adrlo   r12,svc_vectors
        adrlo   lr,svc_exit             @ low vectors, call and return to svc_exit
        ldrlo   pc,[r12, r7, lsl #2]
  ...
svc_exit:
        ldmia   sp!,{r12, pc }^

Well that's the entirety of the code-path for 'simple' system calls. It doesn't strictly need to save r12, but it makes the context-switching case a bit simpler, and keeps the stack aligned properly too. It doesn't need to save the normal EABI work registers, because the target vector will save any it needs to.

At the svc_exit label I initially had `pop { r12, pc }^' thinking that the pop pseudo-op would work the same as ldmia as it does normally, but implement an exception return because of the trailing `^' ... but it doesn't ...

The "we might context switch" case is then much the same as every other exception handler, and re-uses some of the code too. It saves the state to the current TCB, and then invokes the vector. It also handles invalid system call numbers by suspending the task.

        srsdb   #MODE_IRQ
        cps     #MODE_IRQ
        stm     sp,{r0-r14}^
        cps     #MODE_SUPERVISOR

        cmp     r7,#SVC_LAST
        adr     r12,svc_vectors
        adr     lr,ex_context_exit
        ldrlo   pc,[r12, r7, lsl #2]
        b       task_error_handler      @ syscall # invalid - suspend task

One of the simplest useful IPC mechanism I can think of is simply a signal based one (the AmigaOS idea of signals, not the Unix one, although they can be made to work the same). If you're waiting on a signal and it arrives you wake up, otherwise it just gets registered asynchronously and if you wait later, it no-ops.

It only required 2 basic operations, and another 2 for a bit more flexibility.

Signal signals another task with a single signal bit, and Wait puts the current task to sleep until any one of a given set of signals arrives, or does nothing if it already has. Since either may trigger a context switch, these are implemented using the heavier-weight system call mechanism.

The other two functions are AllocSignal and FreeSignal whose usage is pretty obvious. These are implemented using non-context switching system calls.

The functions are very simple. First Signal, all it does is set the signal bit in the task control block, and checks if the task was waiting for it. If so it sets the return value the function expects, adds the task back to the run queue and reschedules the task, which may then execute depending on the scheduling algorithm.

void svc_Signal(int tid, int sigNum) {
        struct task *tcb = findTask(tid);

        if (!tcb)
                return;

        sigmask_t m = (1<<sigNum);
        sigmask_t sigs;

        tcb->sigReceived |= m;
        if (tcb->state == TS_WAIT) {
                sigs = tcb->sigReceived & tcb->sigWaiting;
                if (sigs) {
                        tcb->tcb.regs[0] = sigs;

                        ... move tcb to run queue ...
                        ... reschedule tasks ...
                }
        }
}

Wait is just as simple. In the simplest case, if any of the signals it is waiting on have already arrived, it just returns the intersection of the two sets. Otherwise, the task is put to sleep by placing it on the wait queue, storing the signals it is waiting on. The `trick' here is that when it wakes up, it will just act like it has returned from a context switch - and start executing immediately after the svc instruction. The Signal code above uses that to return the return-code in the delayed case via register r0 - the standard ABI register for function return values.

sigmask_t svc_Wait(sigmask_t sigMask) {
        struct task *tcb = thistask;
        sigmask_t sigs;

        sigs = tcb->sigReceived & sigMask;
        if (sigs) {
                tcb->sigReceived &= ~sigs;
        } else {
                tcb->sigWaiting = sigMask;
                tcb->state = TS_WAIT;

                ... move current task to wait queue ...
                ... reschedule tasks ...
        }

        return sigs;
}

Invoking system calls uses the svc instruction. This encodes the system-call number in the instruction, but you can also pass it in a register - and I follow the linux EABI which puts the system call number in r7.

With this mecanism it is trivial to implement a system-call in assembly language - tell the compiler what the args are, but the assembly just passes them to the system.

C code:

extern int Wait(int sigmask);

ASM:

Wait:   push { r7 }
        mov  r7,#3
        svc  #3
        pop  { r7 }
        bx   lr

But this wont let the compiler in-line the calls, which would be desirable in many cases. So one must resort to inline asm in the C source.

sigmask_t Wait(sigmask_t sigMask) {
        register int res asm ("r0");
        register int sigm asm("r0") = sigMask;

        asm volatile("mov r7,#3; svc #3"
                     : "=r" (res)
                     : "r" (sigm)
                     : "r7", "r1", "r2", "r3", "r12");

        return res;
}

(I'm not sure if that is 100% correct - but it seems to compile correctly.)

Code in ipc-signal.c and another version of context.S.

Other IPC

Well beyond simple signals there are a few other possibilities:

Mailboxes - These are usually a fixed-length queue which can be written to asynchronously if is it not full, but only with simple data type such as an integer or pointer. No dynamic memory is required inside the kernel so it is easy to implement. Shared memory and the like is required to send more complex information.
Shared memory - Fairly simple once you have a mechanism to share memory. No way to send or receive notifications about state changes though, so requires polling (bad for lots of reasons), or other IPC primitives such as signals. Requires some sort of serialisation mechanism too, such as mutexes or semaphores which are best avoided.
Sychronous message passing - This is used to implement a light-weight context switch in microkernels. Because the callee and caller rendezvous at the same time, there's no need to save state beyond the normal state saved in a function call, and simplifies scheduling. If asynchronous or multi-processing is required threads are needed though, and passing large messages can be expensive.
Asynchronous message passing - This uses a queue to keep track of the received messages and allows the target to receive them when ready. If the receiver is ready and waiting it can work much like the synchronous case, otherwise the sender can keep doing work. In a non-protected/non-virtual memory environment this easy and efficient to implement, but these both add complications. For example dynamic memory is needed in the kernel, and messages saved to a kernel-queue involve a double copy - although it can be implemented without copying too.

I'm fond of the async message passing model, particularly with non-copying of messages - I used it in Evolution and other threaded code i've written since. So i'd like to try and get that working efficiently in a protected/virtual memory environment if I can - I had it working before on x86. I don't know if it worked terribly efficiently, but I don't see why it shouldn't be too bad.

Mailboxes also look interesting - because of their simplicity mostly. They can convey more information than a single bit as with signals, but i'm not sure they could replace them (the CELL BE hardware implements mailboxes plus a bit-based system much the same as signals - and if they spent the hardware to do them both there's probably a good reason).

Once I get some mechanism to pass data between tasks I guess i'll look at something a bit more meaty, like a device driver - and go back to banging my head against that gigantic OMAP manual.

Sunday 7 March 2010

FOPS in context

Another late night bit of hacking last night to add FPU support to the context switch code (which incidentally is now committed).

It works by disabling the FPU when a new task is scheduled which doesn't match the owner of the FPU. FPU instructions then cause an illegal instruction exception, and that is used to switch the context if it was an FPU instruction.

I think I spent more time working out how to check for floating point instructions in the illegal instruction exception handler than anything else. Partly just finding out what the instruction bits were, but also just how to implement it without a big mess.

Basically, afaict the following bit patterns are those used by all of the FPU related instructions.

  1111001x        -        -        - advanced simd data processing
  xxxx1110        - xxxx101x xxx0xxxx vfp data processing
  xxxx110x        - xxxx101x xxxxxxxx ext reg l/s
  11110100 xxx0xxxx        -        - asimd struct l/s
  xxxx1110        - xxxx101x xxx1xxxx vfp<>arm transfer
  xxxx1100 010xxxxx xxxx101x        -

Well, the problem is that ARM can only compare to an immediate 8 bit value (albeit anywhere within the 32 bits). So it becomes a bit of a mess of mucking about with those, or loading constants from memory. In the end I opted for the constant load - I figure that a constant load from memory is not really any different from a constant load from the instruction stream, once in the L1 cache, so assuming you're going to save a lot of instructions it will probably be faster. Well it's smaller anyway ...

As a comparison I wondered what gcc would come up with, e.g. with:

void modecheck(unsigned int *a) {
        unsigned int v = *a;

        if ((v & 0xfe000000) == 0xf2000000
            || (v & 0x0f000e00) == 0x0e000a00
            || (v & 0x0e000e00) == 0x0c000a00
            || (v & 0xff100000) == 0xf4000000
            || (v & 0x0fe00e00) == 0x0c400a00)
                printf("do stuff\n");
        else
                printf("do other stuff\n");
}

And it's pretty ugly:

        ldr     r0, [r0, #0]
        and     r3, r0, #-33554432
        cmp     r3, #-234881024
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #234881024
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #-251658240
        bic     r3, r3, #16711680
        bic     r3, r3, #61696
        mov     r2, #201326592
        bic     r3, r3, #255
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        bic     r3, r0, #14680064
        mov     r3, r3, lsr #20
        mov     r3, r3, asl #20
        cmp     r3, #-201326592
        beq     .L2
        bic     r3, r0, #-268435456
        bic     r3, r3, #2080768
        bic     r3, r3, #12736
        mov     r2, #205520896
        bic     r3, r3, #63
        add     r2, r2, #2560
        cmp     r3, r2
        beq     .L2
        ... else
        b       done
.L2:    .. if
done:

So basically it's converting all the constants to 8 bit operations. AFAICT the `branch predictor' basically just has a cache of recently taken branches, and the `default prediction' is that branches are not taken. If that's the case, the above code has a 13 cycle penalty in the best case ... plus all the extra instructions to execute in the worst. The first 2 cases are the most likely with fpu instructions though.

Not that this is in any way time-critical code (at most it will execute once per context switch), it just seems a bit of a mess.

I came up with something somewhat simpler, although to be honest I have no idea which would run faster.

        .balign 64
vfp_checks:     @ mask    ,value
        .word   0x0f000e00,0x0e000a00   @ xxxx1110        - xxxx101x -
        .word   0x0e000e00,0x0c000a00   @ xxxx110x        - xxxx101x -
        .word   0xff100000,0xf4000000   @ 11110100 xxx0xxxx        - -
        .word   0x0fe00e00,0x0c400a00   @ xxxx1100 010xxxxx xxxx101x -

        ...

 ldr r2,[r3,#-8]  @ load offending instruction
 ldr r2,[r2]

 and r0,r2,#0xfe000000 @ check 1111001x xxxxxxxx xxxxxxxx xxxxxxxx
 cmp r0,#0xf2000000
 ldrned r0,r1,vfp_checks
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+8
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+16
 andne r0,r0,r2
 cmpne r0,r1
 ldrned r0,r1,vfp_checks+24
 andne r0,r0,r2
 cmpne r0,r1
 bne 1f

        ... if

1:      ... else

Given the constants are on a cache boundary, presumably their loads should all be 1 cycle (for all 64 bits) after the cache line is loaded, so they shouldn't be any worse than in-line constants that take more than 2 or more instruction to create (let alone the 7 required for both constants in the second C case).

Interestingly although not surprisingly the -Os code is more similar to the hand-rolled assembly, although it includes explicit branches rather than conditional execution. And does single-word loads instead of double.

Apart from that the context switching itself is pretty simple - just store the fpu registers above the others. irq_new_task is where the fpu is turned on or off depending on the current fpu owner, so it doesn't do anything if only one task is actively using the fpu. The existing context switch code required no changes - it is all handled 'out of band'.

Given that I now have some tasks to manage I also added a general illegal instruction handler for really `illegal instructions'. All it does is suspend the current task and remove it from the run queue as well, and the other tasks keep running just fine. For my uKernel it will then have to signal a process server to clean up/or whatever.

Code is in context.S and irq-context-fp.c.

OCaml

Well yesterday I should have been out working on the retaining wall but in the morning I got stuck thinking about optimising implementations of mathematical expressions (too much coffee late the night before methinks).

The primary reason is that I think I found a pretty extreme case of wasteful calculation in an algorithm I'm working on. As far as I can tell from manual inspection, in one place it calculates a `square' 4D intermediate result - actually of the 8M array entries (64^4), most of them, about ~7M, remain unset at (0+0i). But when it uses this array for further processing it only uses 2047 of the results! Considering the intermediate array is 256MB in size(!!!), this is an obvious target for optimisation - it spends far more time clearing the memory than calculating the results it uses!

(I had the mistaken impression that matlab did smart things with such expressions, but I guess it doesn't - it's really just a scripting language front-end to a bunch of C functions.)

I have a feeling this sort of thing is happening throughout the algorithm although this is probably at the extreme end, but analysing it manually seems error prone and time consuming.

Surely there's a better way ... functional languages?

I'm not sure if i'll be able to take advantage of any of them, at least in the short-term. I was reading about how FFTW uses OCaml to implement it's optimiser - this is the sort of thing I might have to look at, although I never could wrap my head around functional programming languages. But anwyay - it's something i'll keep investigating although my focus to start with will just be code translation. Perhaps some sort of symbolic expression analyser might help me reduce the equations - once I work out what they are at least.

Another brick in the wall

I did get a couple of hours in on the retaining wall, laying the foundation layer of bricks which is the tricky bit - but then just as I was starting to make good progress it started to rain. I got nearly half-way on the main wall at least. I assessed that the stair-well i'd created was too steep and started too high (a slack string led me to believe the first-run of bricks would be lower), so i'll have to dig that out more too, but that wont take too long. I need to get some ag-pipe before I can back-fill it though. Hmm, should really get out there and do a bit more for a couple of hours - at least do the first run - hmm, but alas I think the day has gotten away from me again, and I have some other things to do.

It's a long-weekend anyway, so maybe tomorrow is a better day to get dirty, and the clouds are looking a bit dark at the moment.

Friday 5 March 2010

Strange week.

Phew, another week down. That was a pretty long one. New job, all sort of things to learn about.

I went with CentOS for my workstation ... hmm, but i'm not really all that pleased so far. It's mostly working but there's a few weird bits. xterm's keep dying occasionally while i'm using them - they lock up hard and I can't even close the window. I have 'focus follows mouse' on, but it isn't always working properly - if the machine is busy when I move the mouse it seems to do weird shit. I also had a locked pointer grab - gee, I haven't had those outside of using gdb for years and years. And today I was switching virtual desktops and it just went mental - keep cycling the windows as fast it could once I let go of the keyboard (it still accepted keystrokes as the windows whizzed by) - had to kill X. The whole machine feels pretty slow too - it is an older machine being re-used, but it feels slower than it should. Might be the encrypted home mount too ... and ext3 must share some blame too. And finally the lack of packages - I expected it to be more limited than other more bleeding edge systems i've used lately, but the lack of packages is really stark; some what I thought really basic 3rd party packages are simply missing. I guess i'll have to re-install it with something a bit more production ready; CentOS 5.4 just isn't, at least for my needs. Sigh, not really what I wanted to have to do - stability is pretty much the whole reason I was looking at CentOS anyway.

As part of work i've been reading about convolutions and fft's and other fun stuff. Not all new to me, but I've never had to use it in anger before so it's pretty interesting delving into it. Unfortunately my maths was never really up to scratch in this area ... on the other hand I just have to use it, not derive it. Octave works pretty well for playing with ideas, and gnuplot does some nice plots too. Maybe if I come up with some interesting stuff as i'm learning I'll post some shots, but i'm a bit too worn out this week.

Last Saturday and today I spent a couple of hours shifting road-base and filling up the trench and tamping it down. Pretty hard work but not that bad in short spurts, and the trench is just about where I need it - 12 barrow loads so far. I was going to get some sand for the final layer, but I might try with the roadbase since I have it, and would have to move it all before I can order any sand. It's a lot harder making it level though, so I may regret that idea. I'll see - should really do some more on it this weekend, but there's other things I should probably do too. It might rain anyway (hmm, rain, i've forgotten what that feels like).

Monday 1 March 2010

Context switching

After working on it in bits and pieces over the last few days I managed to get context switching working (late) last night.

Again apart from one little error it might've happened a lot sooner - I read the APSR description about the 'big endian bit' and I think my mind just assumed 'not x86 ==> big endian' and I decided I needed to turn it on. Oops. Very odd, apparently scheduling tasks just fine but they don't seem to execute at all, nor cause any crashes. Oh this is the first time i've had user-mode code executing as well.

My initial idea was to only save the minimal state possible on interrupt entry, and only bother with a full save/restore in the event of an actual context switch. But then thinking about the microkernel design I want to implement later on, this seems a bit pointless since generally the `low' part of the interrupt handler will always run straight away so it will probably lead to a context switch. This simplifies the context switcher a little bit and adds a bit more flexibility internally too. For system calls i'm not so sure yet - quite a few system calls should lead to a context switch too but a lot wont, so i'm still thinking about whether it will work the same way. And I have to decide if system-calls can be interrupted - if they remain very tiny then they wont need to be.

The context switch code basically ends up the same as the one out of the manual, but hooked into the normal IRQ handler. The IRQ stack pointer is used as the point to ThisTask.tcb.regs[0] all the time, so nothing needs to be loaded at interrupt handler time. I have a trivial ASM function which lets the system code set the next task to run:

        @ r0 = pointer to r0 in tcb
        .global irq_new_task
irq_new_task:
        mrs     r1,cpsr
        cps     #MODE_IRQ
        mov     sp,r0
        msr     cpsr,r1
        bx      lr

The function below then becomes the new IRQ entry point. The lines highlighted in bold indicate the changes from the previous interrupt handler - there aren't too many, and one is just a cleanup (the mov lr,pc). The main difference is that all of the context state is stored on the `irq stack' rather than the supervisor stack. But it isn't really a stack, it's just a pointer to the TCB. This does have one consequence - it is impossible to implement re-entrant interrupts (at least without further code). If more state is required it may make sense to shift all interrupts to use fast interrupts, since it could then use the other FIQ banked registers to store per-processor state - although it could equally just be a pointer from the TCB to a per-processor or per-process struct.

        .global irq_entry
irq_entry:
        sub     lr,#4
        stm     sp,{ r0-r14 }^          @ save all user regs
        srsdb   #MODE_IRQ               @ save spsr and return pc

        cps     #MODE_SUPERVISOR
        push    { r12, lr }             @ save supervisor lr and r12 to supervisor stack

        ldr     r5,=INTCPS_BASE         @ find active interrupt from INTCPS
        ldr     r0,[r5,#0x40]
        
        ldr     r2,=irq_vectors         @ execute vectored handler
        and     r0,r0,#0x7f
        mov     lr,pc
        ldr     pc, [r2, r0, lsl #2]

        mov     r1,#1                   @ tell INTCPS we've handled int
        str     r1,[r5,#INTCPS_CONTROL]
        dsb

        pop     { r12, lr }             @ last of state on supervisor stack

        cps     #MODE_IRQ

        ldm     sp,{r0-r14}^
        rfedb   sp                      @ back to new or old task

Now the IRQ sp is just used as a pointer to the current TCB - so the code doesn't perform any write-back to the sp when writing or restoring values. The user registers are stored/restored above it, and the pc and spsr registers are stored below it. The push { r12, lr } doesn't actually need to store r12 since we already saved it, but this is used to keep the stack aligned at this point - it will need to change to ensure the alignment specifically so r12 wont need to be saved once that is done.

struct tcb {
        uint32_t pc;
        uint32_t spsr;
        // <- sp_irq points here always
        uint32_t regs[15];
};

The C structure that maps to this is shown above, indicating where the sp_irq actually points. I have a simple linked-list of tasks to hold this state, and whatever other state the kernel might need.

struct task {
        struct Node Node;
        int id;

        struct tcb tcb;
};

Within an interrupt routine, if I wish to schedule a new task all I need to do is call the aforementioned irq_new_task function with the value from inside the tcb, and that task will run once the interrupt is finished.

Although not particularly practical in the real world, a round-robin scheduler of all tasks in the run-queue is as simple as:

        Remove(&thistask->Node);
        AddTail(&tasks, &thistask->Node);

        thistask = (struct task *)tasks.Head;
        irq_new_task(&thistask->tcb.regs[0]);

There are (at least?) two other pieces of context that also need changing in a complete system, the MMU tables and the floating point registers.

Floating point registers don't need saving/restoring here because the kernel will never use floating point itself. And instead of wasting the time saving/restoring full FPU state at every interrupt, the code will just find out when a floating point instruction is used and swap the state then. The irq_new_task call can probably just disable the floating point unit, and when an undefined instruction interrupt occurs on the floating point co-processor the state can be changed then if it needs to be. And/or some combination there-of. e.g. irq_new_task could check which task `owns' the FPU and enable/disable it appropriately.

The MMU is a little trickier, it will need the page tables changed at every context switch. Since I will map the system memory globally across all tasks, I will probably also be able to put all of that logic into irq_new_task as well. Not sure how i'll deal with system calls that cause a context switch yet. I am looking at a process+task model too, so each task will be associated with a parent process which will encompass the memory map, so for example, a mutli-tasked process wont need an MMU switch if it is only switching between tasks.

And finally the last piece of the puzzle is boot-strapping the task system. There is only 1 CPU and it can only execute one bit of code at a time, so basically the initial booting process just 'falls through' to the 'current task' and automagically just starts executing as part of it's context. There is actually very little that needs to be set up - simply changing to user mode, and then initialising it's stack pointer, and that's it. The code then jumps to what is the entry point of the current task (as defined by the context switching mechanism above).

        // <- in supervisor state here, with boot-strap stack, etc
        asm volatile("cps #0x10");
        // <- user mode, undefined stack pointer
        asm volatile("ldr sp,=0x88000000 - 32768");
        // <- now sp is set to tasks's stack
        task(0);

The code above is just a demo, but the final thing will just jump to the idle task, so it can still be hard-coded in a similar way. Or it can be even simpler - the idle task does not need any stack, e.g. the following would suffice.

idle_task:
        wfi
        b idle_task

Code not committed yet.

Hmm, not sure what to look at next, maybe MMU context switching, or perhaps something simpler like system calls. Really exhausted tonight - spent the whole day staring at code not written by a programmer, and I need a good meal too.

Maiden Century

This is my first web diary that's made it to 100 posts, that being this post. Yay.

I've moved! So long and thanks for all the fish!

A Hacker's Craic