Tuesday, 30 March 2010

Damn numbers

Hmm, that was frustrating.

Have been trying to write a `kernel' boot header - one that sets the MMU up for the kernel to execute at another address (0xC0000000) and then jumps to it. Been very tired from sleeping poorly and a bit brain-dead after work so I haven't been really switched on, but it's been dragging on so much I was about to give up (well not really, but it felt like I should).

Apart from a few little bugs, I was using the wrong TEXCB/AP flags for the level 2 page entry for devices ... but I don't know why it's wrong. It seems to check out in the manual, but for whatever reason it just crashes the code (FWIW I was using 0xb2, 'non-shareable device, rw everyone', rather than 0x16, 'shareable device, rw supervisor only'). Blah. One little number change and now it works. $@%!$#
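
For reference, here is how those two values split into fields. This is only a sketch in C, assuming the ARMv7-A short-descriptor 'small page' layout with TEX remap off (XN in bit 0, B in bit 2, C in bit 3, AP[1:0] in bits 5:4, TEX in bits 8:6, AP[2] in bit 9, S in bit 10). The struct and function names are made up for illustration and are not part of the boot code.

```c
#include <stdint.h>

/* Fields of an ARMv7-A short-descriptor "small page" entry (low 12 bits). */
struct small_page_flags {
    unsigned xn;   /* bit 0: execute never */
    unsigned b;    /* bit 2: bufferable */
    unsigned c;    /* bit 3: cacheable */
    unsigned ap;   /* AP[2:0]: bit 9 followed by bits 5:4 */
    unsigned tex;  /* bits 8:6 */
    unsigned s;    /* bit 10: shareable */
};

/* Split the flag bits of a small page descriptor into its fields. */
static struct small_page_flags decode_small_page(uint32_t desc)
{
    struct small_page_flags f;

    f.xn  = desc & 1;
    f.b   = (desc >> 2) & 1;
    f.c   = (desc >> 3) & 1;
    f.ap  = (((desc >> 9) & 1) << 2) | ((desc >> 4) & 3);
    f.tex = (desc >> 6) & 7;
    f.s   = (desc >> 10) & 1;
    return f;
}
```

With this, 0xb2 comes out as TEX=2, C=0, B=0, AP=3 and 0x16 as TEX=0, C=0, B=1, AP=1, which matches the two descriptions above. It doesn't explain the crash, it just confirms the numbers say what I thought they said.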

I plan to use the two translation table mode, which means the system memory will start at 0x80000000 - so it may make sense to just identity map the kernel at that address. But for now the memory map will have the kernel at 0xc0000000, and I'll start shared libraries or something else at 0x80000000.

So here it is ... in hindsight I may have done things in the wrong order, but this way makes things easy. I set aside some memory in the BSS section for the page tables and let the linker manage allocating space for them, also for the I/O devices - although this means a couple of physical pages are lost at present.

There are a few little `tricks' that I use so the code is position independent, although there are possibly better ways to do it. The init code has to be position independent because the linker script is set up so that all the code starts from the same virtual address - it could be done otherwise, but then I would need an ELF loader to relocate the image - which is somewhat more work.
_start:
        adr     r12,_start              @ this will be physical load address
        mov     sp,r12
        push    { r0 - r3 }
First I just set up r12 and the stack to point to our load address - which is 0x80008000 as set by the linker script. This gives the code a fixed location from which to calculate physical and virtual addresses. The incoming arguments are saved too - although nothing uses them yet (Das U-Boot can pass in arguments or information about modules or filesystems it preloaded into memory).
        ldr     r1,bss_offset
        ldr     r2,bss_offset+4
        add     r1,r1,r12
        add     r2,r2,r12
        mov     r0,#0
1:      str     r0,[r1],#4
        cmp     r1,r2
        blo     1b
Clear the BSS - the code reads a pair of relative offsets that the linker creates, which indicate where the BSS starts and stops, and then uses r12 to map them to physical addresses. The ldr r1,bss_offset is assembled into a pc-relative instruction so it will work no matter where it's loaded.
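
The same idea in C, as a rough sketch rather than anything the boot code actually uses: the two offsets are added to wherever the image happens to be loaded, and everything in between is zeroed. The `bss_range` struct and `clear_bss` name are hypothetical, and the offsets are assumed to be multiples of 4.

```c
#include <stdint.h>

/* Hypothetical mirror of the two .word values stored at bss_offset:
 * byte offsets of the BSS start and end, relative to _start. */
struct bss_range {
    uint32_t start_off;
    uint32_t end_off;
};

/* image points at the load address (what r12 holds in the assembly).
 * Offsets are assumed to be multiples of 4. */
static void clear_bss(uint32_t *image, struct bss_range r)
{
    uint32_t *p = image + r.start_off / 4;
    uint32_t *end = image + r.end_off / 4;

    /* Like the assembly loop, store first and test afterwards. */
    do {
        *p++ = 0;
    } while (p < end);
}
```

Because only offsets are stored, nothing here depends on the address the image was linked for, which is the whole point of the trick.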

Then there is a loop which uses a table to initialise the page tables. I first need to find the space within the BSS where it is stored, and then iterate through the entries. Each range is defined by a virtual target address, a start offset relative to _start, a virtual end address, and the `small page' flags for the pages.
        ldr     r11,ttb_offset
        add     r11,r12                 @ physical address of kernel_ttb
        add     r10,r11,#16384          @ same for kernel_pages

        adr     r9,ttb_map
        mov     r8,#ttb_size
1:      ldm     r9!, { r4, r5, r6, r7 } @ virtual dest, start offset, virtual end, flags
        add     r5,r12                  @ physical address

2:      mov     r3,r4,lsr #20
        ldr     r2,[r11, r3, lsl #2]
        cmp     r2,#0
If the level 2 page table for this megabyte isn't set up yet, then just allocate the next one and update the level 1 entry.
        moveq   r2,r10
        addeq   r10,#1024
        orreq   r2,#1
        streq   r2,[r11, r3, lsl #2]
Form and store the l2 page table entry.
        bic     r2,#0xff                @ r2 = physical address of l2 page
        mov     r1,r4,lsr #12
        and     r1,#0xff
        orr     r0,r5,r7
        str     r0,[r2, r1, lsl #2]
And then loop for all the pages and all the entries in the table. Here I compare the end address for equality rather than less-than - I do this so I could map the last page of memory if I wanted to (its end address wraps around to 0). But currently I don't use this.
        add     r4,#4096
        add     r5,#4096
        cmp     r4,r6
        bne     2b

        subs    r8,#1
        bne     1b
That's really the meat of it - the table has the smarts in it, and uses the linker to create the interesting values required.
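
To make the loop easier to follow, here is a small host-side model of it in C. It is a sketch under a few assumptions: the tables live in an ordinary struct instead of the BSS, `l2_phys` stands in for the physical address of kernel_pages, and every name (`map_range`, `tables`, `map_pages`) is invented for the example. Like the assembly, it does not check for running out of level 2 tables.

```c
#include <stdint.h>

#define L1_ENTRIES 4096   /* one entry per 1 MB of virtual space */
#define L2_ENTRIES 256    /* one entry per 4 KB page within 1 MB */
#define MAX_L2     32     /* matches the 32 KB set aside for kernel_pages */

/* Hypothetical mirror of one 16-byte ttb_map row. */
struct map_range {
    uint32_t virt;    /* virtual start, page aligned */
    uint32_t offset;  /* start offset relative to the load address */
    uint32_t vend;    /* virtual end, page aligned, exclusive */
    uint32_t flags;   /* small page flag bits */
};

struct tables {
    uint32_t l1[L1_ENTRIES];
    uint32_t l2[MAX_L2][L2_ENTRIES];
    uint32_t next_l2;  /* index of the next unused level 2 table */
};

/* Map one range. l2_phys is the pretend physical address of t->l2. */
static void map_pages(struct tables *t, uint32_t l2_phys,
                      uint32_t load_addr, struct map_range r)
{
    uint32_t va = r.virt;
    uint32_t pa = load_addr + r.offset;  /* wraps modulo 2^32, as add does */

    do {
        uint32_t i1 = va >> 20;

        /* No level 2 table for this megabyte yet: take the next one. */
        if (t->l1[i1] == 0)
            t->l1[i1] = (l2_phys + 1024 * t->next_l2++) | 1;

        /* Recover which level 2 table the level 1 entry points at. */
        uint32_t *l2 = t->l2[((t->l1[i1] & ~0x3ffu) - l2_phys) / 1024];
        l2[(va >> 12) & 0xff] = pa | r.flags;

        va += 4096;
        pa += 4096;
    } while (va != r.vend);  /* equality, so vend may wrap around to 0 */
}
```

Mapping a range such as { 0xfffff000, some offset, 0, flags } works here because the loop ends when the virtual address wraps to 0. With a less-than comparison against 0 the loop would stop after the first page no matter how long the range was.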

Then it just turns on the MMU - this could probably be simplified as I can just enforce the state I want (i.e. don't bother preserving bits). Putting 1 in CP15_TTBCR means that two page tables are used: the TTBR1 table is used for any address with the top bit set (i.e. >= 0x80000000), and the TTBR0 table for everything below that.
        mrc     p15, 0, r0, CP15_SCTLR
        bic     r0,#SCTLR_ICACHE
        bic     r0,#SCTLR_AFE | SCTLR_TRE | SCTLR_DCACHE | SCTLR_MMUEN
        mcr     p15, 0, r0, CP15_SCTLR

        mov     r0,#0
        mov     r1,#1

        mcr     p15, 0, r0, CP15_TLBIALL
        mcr     p15, 0, r1, CP15_TTBCR  @ Top 2G uses TTBR1
        mcr     p15, 0, r11, CP15_TTBR0
        mcr     p15, 0, r11, CP15_TTBR1
        mcr     p15, 0, r0, CP15_TLBIALL
        sub     r0,#1
        mcr     p15, 0, r0, CP15_DACR

        pop     { r0 - r3 }

        mrc     p15, 0, r8, CP15_SCTLR
        orr     r8,#SCTLR_MMUEN
        mcr     p15, 0, r8, CP15_SCTLR
This last instruction turns the MMU on (and will probably eventually turn on the caches/etc). The input arguments are restored before turning on the MMU since the stack memory will no longer be valid or mapped (actually I should probably map the same 32K to the system stack wherever I decide to put that). The CPU now flushes the pipeline and starts executing instructions from the current pc - but with the MMU on. Because of this the code has to ensure this instruction is still mapped to the same address otherwise it's a one-way trip to la-la land.

In this case the ldr pc,=vstart will force the assembler to generate a constant load from the constant pool (via a pc-relative load). The linker will set this constant up to point to the virtual address properly.
        ldr     pc, =vstart
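
Going back to the TTBCR value written before the MMU was enabled: the rule for which base register a lookup starts from is simple enough to write down. This is a sketch of my reading of the ARMv7-A rule for the TTBCR.N field, with a made-up function name, not code from the kernel.

```c
#include <stdint.h>

/* Returns 0 if a table walk for va starts from TTBR0, 1 for TTBR1.
 * n is the TTBCR.N field (0..7). */
static int ttbr_for_address(uint32_t va, unsigned n)
{
    if (n == 0)
        return 0;                  /* N = 0: TTBR1 is never used */
    return (va >> (32 - n)) != 0;  /* any of the top N bits set: TTBR1 */
}
```

With N = 1 this gives TTBR0 for everything below 0x80000000 and TTBR1 for everything at or above it, so both the identity-mapped page at the load address and the kernel at 0xc0000000 are looked up through TTBR1.
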
Now come the relative offsets used to locate the BSS range, as well as the page table memory from within BSS.
bss_offset:
        .word   __bss_start__ - _start
        .word   __bss_end__ - _start
ttb_offset:
        .word   kernel_ttb - _start
And then the important stuff - the page table mapping descriptions. Rather than store the 'virtual end' address it could probably store the length of the address range, but so long as they are aligned properly it doesn't really make much difference. Note that even with the relative addresses any range in memory can be accessed using the simple arithmetic that the linker supports.
ttb_map:
        @ this page, so mmu can be enabled
        .word   LOADADDR, 0, LOADADDR + start_sizeof, CODE
        @ kernel text at virt address
        .word   __executable_start, 0, __data_start__, CODE
        @ kernel data
        .word   __data_start__, __data_start__-_start, __bss_end__, DATA
        @ system stack, 32K, 4K from end of memory
        .word   0 - 32768 - 4096, 0x8000000 - LOADADDR, 0-4096, DATA
        @ i/o of gpio, for debug too (LEDs!)
        .word   GPIO5, 0x49056000 - LOADADDR, GPIO5+4096, NDEV
        @ do serial port too, for debug stuff
        .word   UART3, 0x49020000 - LOADADDR, UART3+4096, NDEV

        .set    ttb_size, (. - ttb_map) / 16
        .ltorg
The .ltorg ensures the constant pool is emitted at this point, which guarantees it sits within the one page that has to stay identity mapped immediately after turning on the MMU.
vstart:
        ldr     sp,=-4096               @ init stack
        @ bl    __libc_init_array       @ static initialisers
        mov     r8,#(0xf<<20)           @ enable NEON coprocessor access (still off though)
        mcr     p15, 0, r8, c1, c0, 2
        b       main
And this is the 'virtual address' entry point. This could just occur immediately after the setup code, but keeping it apart makes the split more obvious. About the only necessary setup is the (system) stack pointer. I was going to place this at the end of the virtual memory but having it one page back protects from stack underflow as well.

And finally there is the size of this code, and the BSS which stores the bare minimum so I can set it up and see it works (i.e. the UART or blink the LEDs).
        .set    start_sizeof, ((. - _start)+4095) & 0xfffff000

        .bss
        .balign 16384
        .global kernel_ttb, kernel_pages, UART3
kernel_ttb:
        .skip   16384
kernel_pages:
        .skip   1024*32
GPIO5:  .skip   4096
UART3:  .skip   4096
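
Two bits of arithmetic are hiding in that last listing, spelled out here in C as a sketch with made-up function names: start_sizeof rounds the size of the init code up to whole 4K pages, and the 32K of kernel_pages is carved into 1K level 2 tables, each of which covers 1MB of virtual address space.

```c
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Round a byte count up to a whole number of 4 KB pages, using the same
 * add-and-mask as the start_sizeof expression. */
static uint32_t round_up_to_page(uint32_t bytes)
{
    return (bytes + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

/* Number of 1 KB level 2 tables that fit in a pool of the given size. */
static uint32_t l2_tables_in(uint32_t pool_bytes)
{
    return pool_bytes / 1024;
}
```

So the pool set aside here is enough for 32 distinct megabytes of virtual space, which is plenty for the handful of ranges in ttb_map.
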
And ... it's done. Phew.

Unfortunately this means all my 'library code' that uses fixed physical addresses won't work any more, including the debug printing stuff. But that's something to worry about later.

One goal I had was that the code isn't just setting up a page table to be thrown away later - this one is sufficient to remain the kernel page table forever, either for supervisor-level kernel processes/threads or, as in this case, as the `system page table' which is used for any address above 0x80000000. It still needs a little tweaking - the page table should be write-through cacheable for instance - but now it works I can worry about the details. Well, now hopefully I can move on to more interesting things.

Interpolating arbitrary values

For work I have been playing with a few things of some interest. I thought I needed a function that could interpolate a set of values spread across an arbitrary 2d plane into a grid of values. I came across this interesting implementation of Thin Plate Splines which seemed to do the job. Unfortunately it turned out that I needed to interpolate more values than is practical with this algorithm (it does it, it just takes too long), and I can probably just force the values to be in a grid anyway so I can use much simpler methods. But still, this is an interesting algorithm to have in the toolkit and it produces pleasant looking results. Interestingly I found the C++ 'ludecomposition' code too messy to convert to C (I'm using different data structures) and just used the Java one it references as a starting point instead. It was much more C-like and translated in a very straightforward manner.

So I wrote a basic bicubic interpolator - the code uses bilinear at the moment although in an inconsistent way which doesn't really work since values can be missing. I was hoping bicubic would be a more natural fit for what it is doing, and worry about the missing values later. Unfortunately it doesn't seem to help much - the input data is just too noisy/inconsistent so I guess there is more to fix first (sorry this doesn't make much sense, I can't really say what it's trying to do).
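
For anyone who hasn't met it, plain bilinear interpolation on a regular grid looks something like the following. This is a generic textbook sketch and not the work code: it assumes a row-major grid with no missing values, and the function name and argument layout are invented for the example.

```c
#include <stddef.h>

/* Bilinear sample of a row-major grid of width w and height h at (x, y),
 * in grid coordinates. Assumes w >= 2, h >= 2, 0 <= x <= w-1 and
 * 0 <= y <= h-1, and that every grid value is present. */
static double bilinear(const double *grid, size_t w, size_t h,
                       double x, double y)
{
    size_t x0 = (size_t)x;
    size_t y0 = (size_t)y;

    /* Keep the 2x2 neighbourhood inside the grid at the far edges. */
    if (x0 > w - 2) x0 = w - 2;
    if (y0 > h - 2) y0 = h - 2;

    double fx = x - (double)x0;
    double fy = y - (double)y0;

    const double *p = grid + y0 * w + x0;
    double top = p[0] + (p[1] - p[0]) * fx;      /* blend along the upper row */
    double bot = p[w] + (p[w + 1] - p[w]) * fx;  /* blend along the lower row */
    return top + (bot - top) * fy;               /* blend between the rows */
}
```

The missing-value problem is exactly what this ignores: all four neighbours have to exist for the blend to mean anything.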

Walls, dirt

I have some photos of the progress on the retaining wall but I'm too lazy to put them up today. I got some ag-pipe on the weekend, so I'm just about ready to back-fill at least some of the wall (I don't think I have enough gravel to do the whole lot, but I'll see), although I'm not sure where to run it - and an outlet mid-way along the wall I've already laid will be a bastard! I was going to have it coming out the ends but now I'm not so sure. I need to decide so I can get the right fittings too (which for some reason are rather expensive for what they are).

Boral are having a sale on bricks and whatnot this week so I went and ordered another pile of retaining wall blocks (40% off makes it worth it, even if I don't need them for a while). I wasn't really sure how many I needed to start with, and I used a lot more than I thought originally (just the main wall uses most of them). I have a better plan on what I want to end up with now, so hopefully I got it right ... I guess I can always put them around trees or something if I have too many, or create a lower wall if I don't have enough.

Since I won't need to use them for a while I'm going to try to get them delivered into the driveway - so I don't have to move them off the verge by hand. So today I also moved the rest of the roadbase off of the driveway to a pile out the back. Unfortunately I overloaded my cheap wheelbarrow and it turned over and I bent the handle (well it was only $60), but it's still usable. If I get stuck into finishing off the walls around the paving area it will get used up pretty fast anyway - of the 3 tons I probably have under 1 left. I'll get the bricks before Easter, so it could be a very long long weekend if I get stuck into it ...
