Skip to content

BMC Part 2 - The C Runtime

In this part of the tutorial we'll look deeper into the linker and the C-Library so that we end up with a working C-Library link. Exciting stuff, huh?! Let's look further into what the compiler and linker are doing in order to create our bare-metal executable.

The C-Runtime (different to the C-Library!) is currently missing from our code. In a lot of embedded systems the C-Runtime is essential, or else things break instantly. The most notable thing that's instantly visible in most embedded systems is that static variables are not initialised.

This is why in our previous example, we were working without pre-initialised variables. Instead, we initialise the variable in the code at the start of main from a pre-processor define.

Github The code for the tutorials is now on Github. You can either browse the code, checkout the code, fork, branch, or download as a zip from GibHub.

Understanding the C-Runtime Startup

Let's modify and use a pre-initialised variable instead:

part-2/armc-04

#include "rpi-gpio.h"

/** GPIO Register set */
volatile unsigned int* gpio = (unsigned int*)GPIO_BASE;

/** Simple loop variable */
volatile unsigned int tim;

/** Main function - we'll never return from here */
int main(void)
{
    /* Write 1 to the GPIO16 init nibble in the Function Select 1 GPIO
       peripheral register to enable GPIO16 as an output */
    gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);

    /* Never exit as there is no OS to exit to! */
    while(1)
    {
        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
           for plus models )*/
        gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);

        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin high ( Turn OK LED off for original Pi, and on
           for plus models )*/
        gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
    }
}

Compile it (using the build.sh script):

part-2/armc-04 $ ./build.sh rpi0
arm-none-eabi-gcc -g -nostartfiles -mfloat-abi=hard -O0 -DRPI0 -mfpu=vfp -march=armv6zk \
    -mtune=arm1176jzf-s /.../part-2/armc-04/*.c -o /.../part-2/armc-04/kernel.armc-04.rpi0.elf
/.../gcc-arm-none-eabi-7-2017-q4-major/bin/../lib/gcc/arm-none-eabi/7.2.1/../../../../\
    arm-none-eabi/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objcopy /.../part-2/armc-04/kernel.armc-04.rpi0.elf -O binary \
    /.../part-2/armc-04/kernel.armc-04.rpi0.img

and we notice a few things:

  • The size of the binary image is now 33k, but the previous version of this code was only a hundred bytes or so!
  • The code, when written to the SDCARD still works - this isn't really expected without a working C-Runtime in place to initialise the variable gpio before calling main!?
part-2/armc-04 $ ls -lah
total 40K
drwxr-xr-x 2 brian brian 4.0K Oct 11 21:07 .
drwxr-xr-x 8 brian brian 4.0K Jan  4  2018 ..
-rw-r--r-- 1 brian brian 1.2K Sep 21 00:19 armc-04.c
-rwxr-xr-x 1 brian brian 2.7K Oct 11 21:07 build.sh
-rwxr-xr-x 1 brian brian  108 Sep 21 00:19 disassemble.sh
-rwxr-xr-x 1 brian brian  35K Oct 11 21:07 kernel.armc-04.rpi0.elf
-rwxr-xr-x 1 brian brian  65K Oct 11 21:07 kernel.armc-04.rpi0.img
-rw-r--r-- 1 brian brian 1.7K Sep 21 00:19 rpi-gpio.h

In fact, this embedded system is different to a lot because we're loading an entire binary image into RAM and then executing from RAM. The majority of systems have a non-volatile memory section (Flash/ROM) where the executable code resides, and a volatile memory section (RAM) where the variable data resides. Variables exist in RAM, everything has a position in RAM.

When we compile for a target that executes from an image in Flash and uses RAM for variables, we need a copy of the initial values for the variables from Flash so that every time the system is started the variables can be initialised to their initial value, and we need the code in Flash to copy these values into the variables before main() is called.

This is one of the jobs of the C-Runtime code (CRT). This is a code object that is normally linked in automagically by your tool-chain. This is usually not the only object to get linked to your code behind your back - usually the Interrupt Vector Table gets linked in too, and a Linker Script tells the linker how to organise these additional pieces of code in your memory layout.

Normally of course, this happens without you knowing. In general, you'll select your processor or embedded system on the command line and the appropriate linker script and C-Runtime is chosen for you and linked in.

I urge you to go and look at your arm-none-gcc-eabi install now to see some of these files. Look under the arm-none-eabi sub-directory and then under the lib sub-directory. The C-Runtime code is a binary object file and is called crt0.o, the C Library for info is an archive of object files called libc.a (there may be several versions with different names), and then you'll have some .ld files. Under the ldscripts subdirectory you'll find the standard linker scripts.

It's just worth a look to know they're there. GCC uses a thing called specs files too, which allow specifying system settings so that you can create a machine specification that allows you to target a machine easily. You can select a custom specs file with a command line option for GCC, otherwise gcc uses it's built-in specs.

specs files are considered an advanced subject in the world of the GNU tool-chain, but they provide an excellent way of supplying machine-specific compilation settings. For the embedded engineer they're worth knowing about! :D

So, now we've got two questions, why does our code work - because the initialisation isn't present in the C-Runtime? and, why has our code size jumped from a 100 bytes or so to 64k?

The code works without any initialisation because the variables exist in the same memory space as the code. The bootloading process results in the raspberry-pi kernel being loaded into RAM in order to be executed, the GPU bootloader runs before the ARM processor we're targeting runs, and loads the kernel.img file from disk. Because of this, the variables position within the binary image becomes their variable memory location.

The image is loaded by the boot-loader at address 0x8000 and then executed. So the bootloader has essentially done a taskt that the C-Runtime would normally do, copy the initial values of initialised variables from non-volatile memory to volatile memory. Cool.

Look at the code produced closer with a disassembler. You've already got a disassembler! It comes with the toolchain; Welcome to the world of objdump (or in our case arm-non-eabi-objdump).

We disassemble the elf file because then objdump knows what processor the binary was built for. It also then has knowledge of the different code sections too. There's a disassemble.sh script so go ahead and disassemble the code to see what the compiler generated. You'll get a kernel*.asm file that looks similar to if not the same as the following (RPI0) code:

Disassembly of section .text:

00008000 <main>:
    8000:    e59f30b8     ldr    r3, [pc, #184]    ; 80c0 <main+0xc0>
    8004:    e5933000     ldr    r3, [r3]
    8008:    e2833004     add    r3, r3, #4
    800c:    e5932000     ldr    r2, [r3]
    8010:    e59f30a8     ldr    r3, [pc, #168]    ; 80c0 <main+0xc0>
    8014:    e5933000     ldr    r3, [r3]
    8018:    e2833004     add    r3, r3, #4
    801c:    e3822701     orr    r2, r2, #262144    ; 0x40000
    8020:    e5832000     str    r2, [r3]
    8024:    e59f3098     ldr    r3, [pc, #152]    ; 80c4 <main+0xc4>
    8028:    e3a02000     mov    r2, #0
    802c:    e5832000     str    r2, [r3]
    8030:    ea000004     b    8048 <main+0x48>
    8034:    e59f3088     ldr    r3, [pc, #136]    ; 80c4 <main+0xc4>
    8038:    e5933000     ldr    r3, [r3]
    803c:    e2833001     add    r3, r3, #1
    8040:    e59f207c     ldr    r2, [pc, #124]    ; 80c4 <main+0xc4>
    8044:    e5823000     str    r3, [r2]
    8048:    e59f3074     ldr    r3, [pc, #116]    ; 80c4 <main+0xc4>
    804c:    e5933000     ldr    r3, [r3]
    8050:    e59f2070     ldr    r2, [pc, #112]    ; 80c8 <main+0xc8>
    8054:    e1530002     cmp    r3, r2
    8058:    9afffff5     bls    8034 <main+0x34>
    805c:    e59f305c     ldr    r3, [pc, #92]    ; 80c0 <main+0xc0>
    8060:    e5933000     ldr    r3, [r3]
    8064:    e2833028     add    r3, r3, #40    ; 0x28
    8068:    e3a02801     mov    r2, #65536    ; 0x10000
    806c:    e5832000     str    r2, [r3]
    8070:    e59f304c     ldr    r3, [pc, #76]    ; 80c4 <main+0xc4>
    8074:    e3a02000     mov    r2, #0
    8078:    e5832000     str    r2, [r3]
    807c:    ea000004     b    8094 <main+0x94>
    8080:    e59f303c     ldr    r3, [pc, #60]    ; 80c4 <main+0xc4>
    8084:    e5933000     ldr    r3, [r3]
    8088:    e2833001     add    r3, r3, #1
    808c:    e59f2030     ldr    r2, [pc, #48]    ; 80c4 <main+0xc4>
    8090:    e5823000     str    r3, [r2]
    8094:    e59f3028     ldr    r3, [pc, #40]    ; 80c4 <main+0xc4>
    8098:    e5933000     ldr    r3, [r3]
    809c:    e59f2024     ldr    r2, [pc, #36]    ; 80c8 <main+0xc8>
    80a0:    e1530002     cmp    r3, r2
    80a4:    9afffff5     bls    8080 <main+0x80>
    80a8:    e59f3010     ldr    r3, [pc, #16]    ; 80c0 <main+0xc0>
    80ac:    e5933000     ldr    r3, [r3]
    80b0:    e283301c     add    r3, r3, #28
    80b4:    e3a02801     mov    r2, #65536    ; 0x10000
    80b8:    e5832000     str    r2, [r3]
    80bc:    eaffffd8     b    8024 <main+0x24    >
    80c0:    000180cc     andeq    r8, r1, ip, asr #1
    80c4:    000180d0     ldrdeq    r8, [r1], -r0
    80c8:    0007a11f     andeq    sl, r7, pc, lsl r1

Disassembly of section .data:

000180cc <gpio>:
   180cc:    20200000     eorcs    r0, r0, r0

Disassembly of section .bss:

000180d0 <tim>:
   180d0:    00000000     andeq    r0, r0, r0

NOTE: Unless we're using the exact same compiler, your mileage may vary here. So assume the assembly code above is what's come out of the compiler and follow the text below which goes through it in detail.

Let's take it line by line. The toolchain's linker has decided that the entry point for the code should be at memory address 0x8000. We see from the disassembled listing that this is where the machine code starts. Let's look at what it does.

00008000 <main>:
    8000:    e59f30b8     ldr    r3, [pc, #184]    ; 80c0 <main+0xc0>

This first line loads r3 with the value at the address contained at the Program Counter (PC) + 184. In this assembler, [] is kind of an equivalent to dereferencing a pointer in C. So instead of loading r3 with the value 80c0 it will instead load r3 with the 32-bit value at memory location 0x80c0

But wait a minute - if you do the maths of 0x8000 + 184 you get 0x80b8. How come the disassembler is suggesting the data comes from 0x80c0?

Well PC relative addressing works slightly different to what you may expect and really there's a issue with it, the value of PC is different depending on whether the instruction set is currently ARM or Thumb. Further information is available on the ARM website.

The important thing to note here is:

In ARM state, the value of the PC is the address of the current instruction plus 8 bytes.

In Thumb state:

  • For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.

  • For all other instructions that use labels, the value of the PC is the address of the current instruction plus 4 bytes, with bit[1] of the result cleared to 0 to make it word-aligned.

We're in ARM instruction mode here. If you're not sure about ARM and Thumb instruction sets you can do some googling. Thumb (there's more than one Thumb mode!) are smaller width instructions to allow for more compact code which is very useful in heavily embedded systems.

So actually the maths here sound be: 0x8000 + 0x8 + 0xb8 = 0x80c0 to get us to a memory address. At 0x80c0 there's the value 0x180cc (0x8000 + 0x80cc):

80c0:    000180cc     andeq    r8, r1, ip, asr #1

You can ignore the disassembled version of this value as it's not machine code, instead it's merely a data value.

The next line of the code loads r3 with the 32-bit value that's in memory at the address currently contained in r3. In c, this would look a bit horrid, but so you get the idea of what I've just explained, imagine something like this:

uint32_t r3 = 0x180cc;
r3 = *(uint32_t*)r3;

At the address 0x180cc in our disassembled version there's the 32-bit value 0x20200000 which is the value we want the variable gpio initialised to (for the RPI1, for the RPI2 this will be 0x3F200000. So this is why the code works without any explicit loading or initialisation, but let's look at exactly what's going on and find out why it works like this.

The value at the end of our executable image can be viewed by dumping the hex and having a look at the plain machine code that's in the binary file. The disassemble script does this for you with a tool called hexdump and puts the result into a kernel*.img.hexdump file. It's a plain text file - you can go ahead and crack it open in a text editor or cat it.

0000000 30b8 e59f 3000 e593 3004 e283 2000 e593
0000010 30a8 e59f 3000 e593 3004 e283 2701 e382
0000020 2000 e583 3098 e59f 2000 e3a0 2000 e583
0000030 0004 ea00 3088 e59f 3000 e593 3001 e283
0000040 207c e59f 3000 e582 3074 e59f 3000 e593
0000050 2070 e59f 0002 e153 fff5 9aff 305c e59f
0000060 3000 e593 3028 e283 2801 e3a0 2000 e583
0000070 304c e59f 2000 e3a0 2000 e583 0004 ea00
0000080 303c e59f 3000 e593 3001 e283 2030 e59f
0000090 3000 e582 3028 e59f 3000 e593 2024 e59f
00000a0 0002 e153 fff5 9aff 3010 e59f 3000 e593
00000b0 301c e283 2801 e3a0 2000 e583 ffd8 eaff
00000c0 80cc 0001 80d0 0001 a11f 0007 0000 0000
00000d0 0000 0000 0000 0000 0000 0000 0000 0000
*
00100c0 0000 0000 0000 0000 0000 0000 0000 2020
00100d0

The binary is in little-endian format because the processor is little endian. Adjust the GPIO_BASE value, recompile and disassemble again so you can see the value change. Sure the code won't work properly, but you can prove to yourself you're looking (and making sense of) the right thing.

The thing to note here is that our binary image ends at 0x100cf. After the bootloader has placed this binary image in memory at 0x8000 the last memory location we touch is 0x180cf.

Let's have a quick look at what happens next:

8008:    e2833004     add    r3, r3, #4

Increase the value in r3 from 0x20200000 to 0x20200004 and store it in r3. This is part of the C line:

gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);

LED_GPFSEL for this compilation for the RPI Zero is set by:

#define LED_GPFSEL      GPIO_GPFSEL1
#define GPIO_GPFSEL1    1

From the definition of gpio as volatile unsigned int* each item pointed to is an unsigned int which is 32-bit wide. That's 4 bytes, so we increase the pointer into gpio by 4 bytes to get the register address we require in r3.

800c:    e5932000     ldr    r2, [r3]

Load the value of gpio[LED_GPFSEL] into r2. This is common for read-modify-write operations such as the |= operator we're using in this example. We need to do exactly that, read the value, modify the value and then write the new value.

8010:    e59f30a8     ldr    r3, [pc, #168]    ; 80c0 <main+0xc0>
8014:    e5933000     ldr    r3, [r3]
8018:    e2833004     add    r3, r3, #4

The compiler being rather inefficient here. It's preparing r3 again to be the memory write destination which is exactly the same read memory location. Any sort of compiler optimisations should be able to get rid of the above three lines. They're really not required, but we understand what's going on still.

801c:    e3822701     orr    r2, r2, #262144    ; 0x40000

The modify part of |=. In this case (1 << LED_GPFBIT) has been reduced to a constant. The original gpio[LED_GPFSEL] value is still in r2. We OR that value with the constant to set the bit and store the new value in r2. The ARM architecture cannot modify a register value and store it to a memory location in a single instruction BTW in case you're thinking that would serve us better.

8020:    e5832000     str    r2, [r3]

Finally, the write part of the read-modify-write operation is to write the new value back to the destination gpio[LED_GPFSEL]

8024:    e59f3098     ldr    r3, [pc, #152]    ; 80c4 <main+0xc4>
...
80c4:    000180d0     ldrdeq    r8, [r1], -r0

Loads the value 0x180d0 into r3.

This value is a problem! It is outside our binary image space. If we were to use this value it would be a random value that isn't within our control. How come?

This memory location relates to the C variable tim in our program. This variable is determined by the C standard to have automatic storage duration and according to the C standard should therefore have a initialised value of 0, but we know here that this is not the case, instead we'll have a random value.

Actually what happens in the code next is the for loop starts by setting the tim variable to 0.

8028:    e3a02000     mov    r2, #0
802c:    e5832000     str    r2, [r3]

Let's do a sanity check and make sure we're right by assuming that this is the tim variable:

000180d4 B __bss_end__
000180d4 B _bss_end__
000180d0 B __bss_start
000180d0 B __bss_start__
000180cc D __data_start
000180d0 D _edata
000180d4 B _end
000180d4 B __end__
000180cc D gpio
00008000 T main
00080000 N _stack
         U _start
000180d0 B tim

Yep! We're talking the same as nm. That's good at least!

Now let's do an experiment. There's a reason I decoded the value 0x100cc which is the address of the gpio variable as 0x8000 + 0x80cc earlier:

part-2/armc05.c

#include "rpi-gpio.h"

/** GPIO Register set */
volatile unsigned int* gpio = (unsigned int*)GPIO_BASE;

/** Simple loop variable */
volatile unsigned int tim;

/** Main function - we'll never return from here */
int main(void) __attribute__((naked));
int main(void)
{
    /* Write 1 to the GPIO16 init nibble in the Function Select 1 GPIO
       peripheral register to enable GPIO16 as an output */
    gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);

    /* Never exit as there is no OS to exit to! */
    while(1)
    {
        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
           for plus models )*/
        gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);

        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin high ( Turn OK LED off for original Pi, and on
           for plus models )*/
        gpio[LED_GPSET] = (1 << LED_GPIO_BIT);

        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
           for plus models )*/
        gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);
    }
}

armc-05.c varies only slightly from armc-04.c in that it adds in some code. When we look at this with nm we see that the __data_start section shifts by the same amount as the code we've added:

part-2/armc-05 $ ./build.sh rpi0
...

part-2/armc-05 $ ./disassemble.sh
...

part-2/armc-05 $ cat kernel.armc-05.rpi0.elf.nm
 00018120 B __bss_end__
 00018120 B _bss_end__
 0001811c B __bss_start
 0001811c B __bss_start__
 00018118 D __data_start
 0001811c D _edata
 00018120 B _end
 00018120 B __end__
 00018118 D gpio
 00008000 T main
 00080000 N _stack
          U _start
 0001811c B tim

If we go back to the original disassembled output above, we see that address 0x80cc (0x180cc - (0x8000 * 2)) is the next available memory address after the constants data which is itself immediately after the code section.

I hope you're following this, we're really seeing a "bug" in the linker script. There's no need for us to have this offset. The offset load is correct because when we load this image in RAM the data is indeed going to be + 0x8000 because that's where the boot-loader is going to place the image, but something is spacing the data section away from the constants data.

There's a lot more output when building this code. That's because we've added an option (-Wl,-verbose) to get verbose output from the linker to see what's going on.

As you see from the start of the ld verbose output:

GNU ld (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 2.29.51.20171128
  Supported emulations:
   armelf
using internal linker script:

LD is using an internal linker script. That means the linker script it's using is compiled into the ld executable.

part-2/armc-06.c

The code is identical to armc-04.c. (So the original code without the additional loop in armc-05.c)

However, we add -Wl,-T,rpi.x to the compilation command to instruct LD to use a different linker script to the one it was using. We've retained -Wl,verbose so we can see what happens.

When passing options to the linker using GCC to compile and link, check the gcc documentation which details the options you can pass.

Now we have control of the linker script. We can try to find out what's "wrong". I quote wrong, because technically this works, but we've got an annoying 0x8000 offset. It'll be a pain to debug one day, I know it. Let's find out what we need to do to fix it instead. You can also see how complicated a linker script can get when it needs to deal with C++ sections!

I hope you're staying with me! I know this may seem far removed from C development on the Raspberry-Pi bare-metal, but in fact knowing how your tools work to construct the code is essential later on down the line - heck you wouldn't be getting it up and running on your own from scratch unless you knew at least some of this stuff!

Whilst I'm scan-reading the linker script I'm looking for things that stick out. Most of these sections are not used in our basic example so far, so whatever is messing with us must have a size set, and must be present only when we have something in the initialised variables section which we know from our investigations with nm is the __data_start section (it aligns with the gpio symbol which we initialised).

The "problem" must be between the __data_start section and end of the text section (the symbol main indicated the start of the text section). After a little bit of hunting, I see a comment and an alignment pertaining to the data section on line 107:

/* Adjust the address for the data segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = ALIGN(CONSTANT (MAXPAGESIZE)) + (. & (CONSTANT (MAXPAGESIZE) - 1));

Interesting, we apparently "want" to adjust up to the next page. If I comment this out with a standard C-style comment as can be seen in the rest of the file, we get back down to a size of a few hundred bytes with initialised data. Check again with the output of:

part-2/armc-06 $ ll

-rwxr-xr-x 1 brian brian   284 Oct 14 22:17 kernel.armc-06.rpi0.img*

When we disassemble the kernel code, we end up with something similar to the following:

part-2/armc-06 $ ./disassemble.sh
Disassembly of section .text:

00008000 <main>:
    8000:    e59f30b8     ldr    r3, [pc, #184]    ; 80c0 <main+0xc0>
    8004:    e5933000     ldr    r3, [r3]
    8008:    e2833010     add    r3, r3, #16
    800c:    e5932000     ldr    r2, [r3]
    8010:    e59f30a8     ldr    r3, [pc, #168]    ; 80c0 <main+0xc0>
    8014:    e5933000     ldr    r3, [r3]
    8018:    e2833010     add    r3, r3, #16
    801c:    e3822602     orr    r2, r2, #2097152    ; 0x200000
    8020:    e5832000     str    r2, [r3]
    8024:    e59f3098     ldr    r3, [pc, #152]    ; 80c4 <main+0xc4>
    8028:    e3a02000     mov    r2, #0
    802c:    e5832000     str    r2, [r3]
    8030:    ea000004     b    8048 <main+0x48>
    8034:    e59f3088     ldr    r3, [pc, #136]    ; 80c4 <main+0xc4>
    8038:    e5933000     ldr    r3, [r3]
    803c:    e2833001     add    r3, r3, #1
    8040:    e59f207c     ldr    r2, [pc, #124]    ; 80c4 <main+0xc4>
    8044:    e5823000     str    r3, [r2]
    8048:    e59f3074     ldr    r3, [pc, #116]    ; 80c4 <main+0xc4>
    804c:    e5933000     ldr    r3, [r3]
    8050:    e59f2070     ldr    r2, [pc, #112]    ; 80c8 <main+0xc8>
    8054:    e1530002     cmp    r3, r2
    8058:    9afffff5     bls    8034 <main+0x34>
    805c:    e59f305c     ldr    r3, [pc, #92]    ; 80c0 <main+0xc0>
    8060:    e5933000     ldr    r3, [r3]
    8064:    e283302c     add    r3, r3, #44    ; 0x2c
    8068:    e3a02902     mov    r2, #32768    ; 0x8000
    806c:    e5832000     str    r2, [r3]
    8070:    e59f304c     ldr    r3, [pc, #76]    ; 80c4 <main+0xc4>
    8074:    e3a02000     mov    r2, #0
    8078:    e5832000     str    r2, [r3]
    807c:    ea000004     b    8094 <main+0x94>
    8080:    e59f303c     ldr    r3, [pc, #60]    ; 80c4 <main+0xc4>
    8084:    e5933000     ldr    r3, [r3]
    8088:    e2833001     add    r3, r3, #1
    808c:    e59f2030     ldr    r2, [pc, #48]    ; 80c4 <main+0xc4>
    8090:    e5823000     str    r3, [r2]
    8094:    e59f3028     ldr    r3, [pc, #40]    ; 80c4 <main+0xc4>
    8098:    e5933000     ldr    r3, [r3]
    809c:    e59f2024     ldr    r2, [pc, #36]    ; 80c8 <main+0xc8>
    80a0:    e1530002     cmp    r3, r2
    80a4:    9afffff5     bls    8080 <main+0x80>
    80a8:    e59f3010     ldr    r3, [pc, #16]    ; 80c0 <main+0xc0>
    80ac:    e5933000     ldr    r3, [r3]
    80b0:    e2833020     add    r3, r3, #32
    80b4:    e3a02902     mov    r2, #32768    ; 0x8000
    80b8:    e5832000     str    r2, [r3]
    80bc:    eaffffd8     b    8024 <main+0x24>
    80c0:    000080cc     andeq    r8, r0, ip, asr #1
    80c4:    000080d0     ldrdeq    r8, [r0], -r0
    80c8:    0007a11f     andeq    sl, r7, pc, lsl r1

Disassembly of section .data:

000080cc <gpio>:
    80cc:    20200000     eorcs    r0, r0, r0

Disassembly of section .bss:

000080d0 <tim>:
    80d0:    00000000     andeq    r0, r0, r0

Use the ARM instruction set quick reference card to decypher if you're not familiar with ARM assembler.

By the way, as a quick note - the comment in the linker script is entirely useless as it only describes what the code is doing, the worst type of comment! There is no WHY in the comment. Why are we wanting to do this? What's the reason for forcing the alignment? Anyway this alignment was forcing the 0x8000 additional offset to our rodata (initialised data) section.

Cool, now the initialised data is tagged immediately on the end of the data and the values look sane too! I'll leave you to decode the assembly.

part-2/armc-07

The code here is the same as armc-06.c. Nothing has changed, we'll just modify our linker script. At this point, we can get rid of the annoying _start undefined symbol warning. At the top of the linker script, on line 9 you'll see a line that says ENTRY(_start)

In armc-07 we've changed this to read ENTRY(main)

The warning will go away at last! But we'll soon learn why we need that _start section for the C-Runtime anyway!

Have a quick check of the sections with nm again, now there's no Undefined sections!

part-2/armc-07 $ ./build.sh rpi0

arm-none-eabi-gcc -g -nostartfiles -mfloat-abi=hard -O0 -DRPI0 -mfpu=vfp -march=armv6zk \
    -mtune=arm1176jzf-s part-2/armc-07/*.c -o part-2/armc-07/kernel.armc-07.rpi0.elf

arm-none-eabi-objcopy part-2/armc-07/kernel.armc-07.rpi0.elf -O binary \
    part-2/armc-07/kernel.armc-07.rpi0.img
part-2/armc-07 $ ./disassemble.sh
part-2/armc-07 $ cat ./kernel.armc-07.rpi0.elf.nm
000080d4 B __bss_end__
000080d4 B _bss_end__
000080d0 B __bss_start
000080d0 B __bss_start__
000080cc D __data_start
000080d0 D _edata
000080d4 B _end
000080d4 B __end__
000080cc D gpio
00008000 T main
00080000 N _stack
000080d0 B tim

Sections and C Startup

As we've found out, there are a lot of sections, some which you may not know the meaning of. The .bss section is used for data that is implicitly initialised to 0 at startup (This is mandated by the C Standard). This means that all variables that are statically declared are set to zero initially. Statically declared essentially means global, so local (automatic) variables in a function that are not marked as static do not get initialised to zero if you do not explicitly add an initial value. It's easier with an example, take for example the following C file:

    unsigned int var1;
    unsigned int var2;
    unsigned int var3 = 10;

    void function( void )
    {
        unsigned int funcvar1;
        static unsigned int funcvar2;

        /* ... */

    }

In the above example var1, var2 and funcvar2 are in the bss section, the rest are not. For more information on the bss section, see the wikipedia page on it

The linker organises the data and sorts it out into the different sections. In the above, var3 for example goes into the data section. The code is compiled into machine code and then put in the .text section. See here for "a little!" more information about the text section. There can be many text sections and there can be many data sections too. These sections are wild-carded in the linker script to ensure different sections can be defined whilst still knowing whether they are code or data sections.

In our current code (armc-07) the bss section is not valid because it is not being initialised. That's because the C-Startup is missing. This is the importance of the _start symbol!

The _start section is run before the c main() entry point and one of it's jobs is to initialise the bss section. The linker provides us with a couple of symbols for the sections so we know where they start and end. In the startup code all we need to do is loop between the addresses defined by bss_start and bss_end and set all locations to zero. It's easy when you know what you're meant to do!

The _startup code should also setup the stack pointer. We'll need a working stack to get anything useful up and working! The stack is temporary memory space available for functions to use. The compiler tries to use registers for local (automatic) function variables, but if the size of data required by the local variables exceeds the amount of registers available, the compiler uses the stack for the local variables.

So lets go ahead and setup the stack pointer to a sane value and initialise the bss section.

Firstly, we'll set the linker script back to standard so the entry point is the _start symbol again, we'll have to generate this symbol and generally we'll need to do this is assembler. We need to setup the stack pointer before we enter the C code so that we're safe to write C which may attempt to use the stack straight away.

Normally the complete startup is done in assembler, including zeroing the bss section, but it doesn't have to! I prefer to get into C as soon as possible, so let's see how little assembler we can get away with, and see what trips us up next!

armc-08-start.S

.section ".text.startup"

.global _start

_start:
    // Set the stack pointer, which progresses downwards through memory
    // Set it at 64MB which we know our application will not crash into
    // and we also know will be available to the ARM CPU. No matter what
    // settings we use to split the memory between the GPU and ARM CPU
    ldr     sp, =0x8000

    // Run the c startup function - should not return and will call kernel_main
    b       _cstartup

_inf_loop:
    b       _inf_loop

Not bad! There's not exactly a lot there to set up the stack pointer. The stack is placed at the start of our program and grows downwards through memory. You don't need a large amount of memory for the stack.

Some important information about the .section declaration. With the linker script, I can see in rpi.x that the following text sections are available:

.text           :
{
  *(.text.unlikely .text.*_unlikely .text.unlikely.*)
  *(.text.exit .text.exit.*)
  *(.text.startup .text.startup.*)
  *(.text.hot .text.hot.*)
  *(.text .stub .text.* .gnu.linkonce.t.*)
  /* .gnu.warning sections are handled specially by elf32.em.  */
  *(.gnu.warning)
  *(.glue_7t) *(.glue_7) *(.vfp11_veneer) *(.v4_bx)
}

This is also the order in which the text section is linked together by the linker. The standard functions generally go in the standard .text section. Therefore, we can put our _startup function before the main .text section by putting it in the .text.startup section. Other than .text.unlikely and text.exit which we won't use, _startup will be the first thing to go into the text segment which is where execution will start at 0x8000

Then, we're in C for initialising the bss section by running the cstartup function:

armc-08-cstartup.c

extern int __bss_start__;
extern int __bss_end__;

extern void kernel_main( unsigned int r0, unsigned int r1, unsigned int atags );

void _cstartup( unsigned int r0, unsigned int r1, unsigned int r2 )
{
    /*__bss_start__ and __bss_end__ are defined in the linker script */
    int* bss = &__bss_start__;
    int* bss_end = &__bss_end__;

    /*
        Clear the BSS section

        See http://en.wikipedia.org/wiki/.bss for further information on the
            BSS section

        See https://sourceware.org/newlib/libc.html#Stubs for further
            information on the c-library stubs
    */
    while( bss < bss_end )
        *bss++ = 0;

    /* We should never return from main ... */
    kernel_main( r0, r1, r2 );

    /* ... but if we do, safely trap here */
    while(1)
    {
        /* EMPTY! */
    }
}

Again, not that bad and pretty reasonable. NOTE: We've now changed from main to kernel_main, and there's a reason for this - the bootloader is actually expecting a slightly different entry definition compared to the standard C main function. So as we're setting up our own C-Runtime anyway, we can define the correct entry format. The correct bootloader entry point defines a couple of values which we can check to know what system we're booting.

The actual C code hasn't really changed apart from the kernel_main difference:

armc-08.c

#include "rpi-gpio.h"

/** GPIO Register set */
volatile unsigned int* gpio = (unsigned int*)GPIO_BASE;

/** Simple loop variable */
volatile unsigned int tim;

/** Main function - we'll never return from here */
void kernel_main( unsigned int r0, unsigned int r1, unsigned int atags )
{
    /* Write 1 to the GPIO16 init nibble in the Function Select 1 GPIO
       peripheral register to enable GPIO16 as an output */
    gpio[LED_GPFSEL] |= (1 << LED_GPFBIT);

    /* Never exit as there is no OS to exit to! */
    while(1)
    {
        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin low ( Turn OK LED on for original Pi, and off
           for plus models )*/
        gpio[LED_GPCLR] = (1 << LED_GPIO_BIT);

        for(tim = 0; tim < 500000; tim++)
            ;

        /* Set the LED GPIO pin high ( Turn OK LED off for original Pi, and on
           for plus models )*/
        gpio[LED_GPSET] = (1 << LED_GPIO_BIT);
    }
}

We again confirm with nm that everything is ordered at least something like sensible:

part-2/armc-08 $ ./build.sh rpi0

arm-none-eabi-gcc -g -nostartfiles -mfloat-abi=hard -O0 -DRPI0 -mfpu=vfp -march=armv6zk \
    -mtune=arm1176jzf-s part-2/armc-08/*.S part-2/armc-08/*.c -o part-2/armc-08/kernel.armc-08.rpi0.elf

arm-none-eabi-objcopy part-2/armc-08/kernel.armc-08.rpi0.elf -O binary \
    part-2/armc-08/kernel.armc-08.rpi0.img
part-2/armc-08 $ ./disassemble.sh
part-2/armc-08 $ cat ./kernel.armc-08.rpi0.elf.nm

00008164 B __bss_end__
00008164 B _bss_end__
00008160 B __bss_start
00008160 B __bss_start__
0000800c T _cstartup
0000815c D __data_start
00008160 D _edata
00008164 B _end
00008164 B __end__
0000815c D gpio
00008008 t _inf_loop
00008078 T kernel_main
00080000 N _stack
00008000 T _start
00008160 B tim

Yup, looks good - _start is at 0x8000 where execution will start. Stick it on the card and make sure the OK LED is blinking still. I appreciate that we're doing a lot of work where you can't see more interesting results - but it'll get better soon, I promise! For now we're getting a great foundation for developing in C for the bare metal Raspberry-Pi. Knowing what your tools are doing is pretty essential!

part-2/armc-08 $ ./make_card.sh rpi0
part-2/armc-08 $ cat ./card.armc-08.rpi0.img > /dev/sdg && sync && eject /dev/sdg

The C-Library stubs

Now that we've got the C-Runtime pretty much setup, we can implement the C-Library again with our own C-Runtime startup and C-Library stubs. This is where we want to get to - easy compiling with the C-Library being linked in so that we can make use of the C-Library functions on the Raspberry-Pi without an operating system.

We are actually linking against the C-Library, but we're not using anything from within it. Therefore, the linker disregards the whole of the C-Library because there are no references to it within the code. Let's jump straight in and malloc some memory as I know that requires a stub. Compile armc-09 without the *-cstubs.c and you'll see the error:

.../arm-none-eabi/lib/fpu\libg.a(lib_a-sbrkr.o): In function `_sbrk_r':
sbrkr.c:(.text._sbrk_r+0x18): undefined reference to `_sbrk'
collect2.exe: error: ld returned 1 exit status

So the malloc call is calling something in the c library called _sbrk_r which is a re-entrant safe (or multi-thread/interrupt safe) C library function all called _sbrk. This is the C-stub function that we must implement as it is operating-system dependant.

It's worth looking at how other people have implemented these function calls. Look at them to see what they're doing. Generally you can look at other kernel code, or embedded system code:

http://www.opensource.apple.com/source/Libc/Libc-763.12/emulated/brk.c

http://linux.die.net/man/2/sbrk

However, the best place is if examples are in the C-Library you're using! Newlib is well documented and includes example minimal systemc calls/stubs.

We implement this in the next tutorial armc-09

armc-09-cstubs.c

#include <sys/stat.h>

/* A helper function written in assembler to aid us in allocating memory */
extern caddr_t _get_stack_pointer(void);


/* Increase program data space. As malloc and related functions depend on this,
   it is useful to have a working implementation. The following suffices for a
   standalone system; it exploits the symbol _end automatically defined by the
   GNU linker. */
caddr_t _sbrk( int incr )
{
    extern char _end;
    static char* heap_end = 0;
    char* prev_heap_end;

    if( heap_end == 0 )
        heap_end = &_end;

     prev_heap_end = heap_end;

     heap_end += incr;
     return (caddr_t)prev_heap_end;
}

armc-09-start.S

.section ".text.startup"

.global _start
.global _get_stack_pointer

_start:
    // Set it at 64MB which we know our application will not crash into
    // and we also know will be available to the ARM CPU. No matter what
    // settings we use to split the memory between the GPU and ARM CPU
    // ldr     sp, =0x8000
    ldr     sp, =(64 * 1024 * 1024)

    // Run the c startup function - should not return and will call kernel_main
    b       _cstartup

_inf_loop:
    b       _inf_loop

armc-09-cstartup.c

extern int __bss_start__;
extern int __bss_end__;

extern void kernel_main( unsigned int r0, unsigned int r1, unsigned int atags );

void _cstartup( unsigned int r0, unsigned int r1, unsigned int r2 )
{
    /*__bss_start__ and __bss_end__ are defined in the linker script */
    int* bss = &__bss_start__;
    int* bss_end = &__bss_end__;

    /*
        Clear the BSS section

        See http://en.wikipedia.org/wiki/.bss for further information on the
            BSS section

        See https://sourceware.org/newlib/libc.html#Stubs for further
            information on the c-library stubs
    */
    while( bss < bss_end )
        *bss++ = 0;

    /* We should never return from main ... */
    kernel_main( r0, r1, r2 );

    /* ... but if we do, safely trap here */
    while(1)
    {
        /* EMPTY! */
    }
}

part-2/armc-09.c

#include <string.h>
#include <stdlib.h>

#include "rpi-gpio.h"

/** GPIO Register set */
volatile unsigned int* gpio = (unsigned int*)GPIO_BASE;

/** Main function - we'll never return from here */
void kernel_main( unsigned int r0, unsigned int r1, unsigned int atags )
{
    int loop;
    unsigned int* counters;

    /* Set the LED GPIO pin to an output to drive the LED */
    gpio[LED_GPFSEL] |= ( 1 << LED_GPFBIT );

    /* Allocate a block of memory for counters */
    counters = malloc( 1024 * sizeof( unsigned int ) );

    /* Failed to allocate memory! */
    if( counters == NULL )
        while(1) {     LED_ON();/* Trap here */ }

    for( loop=0; loop<1024; loop++ )
        counters[loop] = 0;

    /* Never exit as there is no OS to exit to! */
    while(1)
    {
        /* Light the LED */
        LED_ON();

        for(counters[0] = 0; counters[0] < 500000; counters[0]++)
            ;

        /* Set the GPIO16 output low ( Turn OK LED on )*/
        LED_OFF();

        for(counters[1] = 0; counters[1] < 500000; counters[1]++)
            ;
    }
}

Build the example and program, now the counter variable is from malloc'd memory - usually malloc is the most mysterious of the c library stubs, but as you can see from the example, you can implement it easily and also implement it in C easily too! Of course occasionally it's necessary to drop down to assembler to get certain register values or talk to some special hardware features. Keep the cheat-sheet close at hand while you're working with the C-Stubs.

While we're here, we also remove the custom linker script at this point. We don't need to be using anymore. There is an option to control the MAXPAGESIZE variable in the linker script to keep the binary small. See the GNU LD documentation for details of the max-page-size option.

There's a great description of this option here and thanks to eyalabraham for bringing it to my attention.

For armc-09 we change the linker flags to include the max-page-size option and set it to just 4 bytes. The build.sh file now has the following linker setting:

lflags="${lflags} -Wl,-z,max-page-size=0x04"

Part 3

In the next part of the bare metal tutorial, we'll add on a build system so that we're not using a silly single command-line, and we can harness the power of a full build system for our bare-metal environment. We'll then fall back to the Cambridge ASM tutorials and progress to using the timer for flashing the LED rather than the incrementing loop counter we've used so far.

So, what are you waiting for? Head off to Part3 now!