COROS I: The Beating Heart

In an earlier post I'd mentioned that I'd written an RTOS in my spare time for fun and then incorporated a (slightly reworked) version of it into code at my job, much to my boss' amazement.

Since then I have been working quietly in the background on gussying up the OS for public release and am now at the stage where I can provide preliminary information on what it is, how it works, and why I'm making it the way I am.

Parts is parts

There are three major parts in this RTOS:

Coroutines

Today's topic is an architectural overview of the heart of the system: coroutines. If you don't know what a coroutine is or why you'd want to use one, think of them as functions that can return from anywhere in their body and then resume from exactly that point. The Wikipedia article may help.

I've always been fond of coroutines. They allow linear logic flow in a concurrent system: they avoid the “backwards, inside-out” logic that reactor systems (callback-based event systems like Node.js or most GUI frameworks) force on you, while also avoiding the immense defensive-programming effort that pre-emptive tasking demands. As a result they are easier to reason about than either competing approach, offer better performance than pre-emptive tasking (though they cost a bit more than reactors), and make the resulting software more reliable and more easily modified.

They do have one significant problem compared to pre-emptive tasking, however: they do not pre-empt. You must release control flow to other coroutines explicitly in a timely fashion. This will necessitate some changes to habits born of pre-emptive systems.

A good way to learn coroutines, and how to use them, in a friendly environment is to install the Lua programming language and try them out there. The coroutines of COROS are strongly modelled on Lua's.

COROS coroutines

The public-facing API of COROS' coroutines is this:

typedef struct _co_t *co_t;

typedef void (*co_function)(void *);

co_t co_create(co_function, void *, size_t, uint32_t *);
void co_destroy(co_t);
void *co_resume(co_t, void *);
void *co_yield(void *);
extern void *(*co_start)(co_t, void *);

A co_t is an opaque pointer to a structure describing a coroutine's runtime state (or its “identity”). It is obtained by calling co_create() and is used in all other API calls to indicate which coroutine is being operated upon.

co_create() takes a function (which will be the heart of the coroutine), an arbitrary pointer to some data that will be passed to the coroutine, a stack size (the number of 32-bit integer-equivalents in the stack), and a pointer to the stack area. Once this function is called, assuming the returned co_t isn't NULL, the coroutine is ready for business.

co_resume() (or equivalently co_start(), which is a synonym) will either start a not-yet-started coroutine or continue a coroutine that has yielded control. In the first case the void * that is passed in will show up as the void * parameter of the co_function being called. In the second case it will be the return value of co_yield().

co_yield() yields control flow out of the coroutine (effectively a return statement). Its single void * argument becomes the return value of the co_resume() call that started or continued the coroutine. As stated above, whatever was passed into co_resume() will show up as the return value of co_yield(). This provides coroutines with a simple way to exchange values and/or commands.

co_destroy() does what it says on the tin. It is not expected that this will be used often, but it is provided for completeness' sake.
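To make the flow concrete, here is a small sketch of the API in use. Everything in it (counter_task, counter_stack, the reset-on-non-NULL convention) is invented for illustration and is not COROS sample code; it assumes the declarations above plus <stdint.h>.

static uint32_t counter_stack[128];            /* 128 32-bit words of stack */

static void counter_task(void *unused)
{
    (void)unused;
    uintptr_t count = 0;
    for (;;)
    {
        /* hand the current count to whoever resumed us; whatever they pass
           to the next co_resume() comes back here as 'reset' */
        void *reset = co_yield((void *)count);
        count = (reset != NULL) ? 0 : count + 1;
    }
}

void counter_demo(void)
{
    co_t counter = co_create(counter_task, NULL, 128, counter_stack);
    void *v;

    v = co_start(counter, NULL);        /* runs counter_task to its first co_yield(); v is 0 */
    v = co_resume(counter, NULL);       /* resumes after that co_yield(); v is 1             */
    v = co_resume(counter, (void *)1);  /* a non-NULL value resets the count; v is 0 again   */

    co_destroy(counter);
    (void)v;
}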

Coroutine implementation

COROS is intended to be relatively architecture-neutral. Its only stipulation is that the native pointer size be 32 bits (though I'm rethinking this and trying to come up with simple ways to change even that). As a result, some of the heavy lifting of the coroutine implementation is pushed into architecture/compiler-specific implementation files. These files implement this API:

struct _co_t
{
    bool     static_stack;
    uint32_t *stack_store;
    uint32_t *stack_pointer;
    uint32_t *stack_canary;
    void     *data;
};

extern co_t co_active;

void co_halt_impl(void);
void co_init_impl(uint32_t **, uint32_t **, co_function, void *, uint32_t *, size_t);
__attribute__((naked)) void co_prologue_impl(void);
__attribute__((naked)) void *co_yield_impl(uint32_t **, void *);

__attribute__((naked)) signals that these functions use inline assembly and need none of C's prologue and epilogue code. (The compilers I have available all support this attribute, but if that turns out not to be enough, future versions of COROS will employ another mechanism here.)

struct _co_t provides the implementation of the co_t type. A flag indicates whether the stack was passed in when co_create() was called; it is used for cleanup in co_destroy(). stack_store points to the base of the memory block being used as this coroutine's stack. stack_pointer is the current top of stack. stack_canary points to a special, known value whose change indicates that the stack has overflowed.
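Checking that canary is cheap. A sketch of what such a check might look like on the implementation side, where struct _co_t is visible (the constant and the function name are mine, not necessarily COROS'; assumes <stdint.h> and <stdbool.h>):

#define CO_STACK_CANARY 0xC0FFEEEEu               /* illustrative guard value only */

static bool co_stack_overflowed(co_t co)
{
    /* the guard word lives at the far end of the stack buffer; if the
       coroutine's stack ever grows down over it, the value changes */
    return *(volatile uint32_t *)co->stack_canary != CO_STACK_CANARY;
}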

co_halt_impl() implements the logic for a halted coroutine. It is in effect just an endless loop of coroutine yields that ensures that an ended but not destroyed coroutine doesn't crash the system or endlessly use up the CPU. It is architecture-neutral.
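Though its source isn't shown here, in spirit it amounts to little more than this sketch:

void co_halt_impl(void)
{
    for (;;)
    {
        co_yield(NULL);    /* hand control back, forever */
    }
}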

The remaining functions are not, and so will be explained here specifically in the context of an ARM Thumb target using the GCC compiler. (Each implementation file is written for exactly such a specific architecture and compiler combination.) Note that some code (mainly error handling) has been stripped to make this as easy to understand as possible. When COROS gets a public release (soon!) all the missing code will be in place.

co_init_impl()

#define SET_STACK_VAL(O, V) (stack_buffer[stack_buffer_count - (O)] = V)

void co_init_impl(uint32_t **stack_pointer, 
                  uint32_t **stack_canary, 
                  co_function co_cb, 
                  void *co_data, 
                  uint32_t *stack_buffer, 
                  size_t stack_buffer_count)
{
    SET_STACK_VAL(1, (uintptr_t)co_prologue_impl); // lr
    SET_STACK_VAL(2, 0);                           // r7
    SET_STACK_VAL(3, (uintptr_t)co_halt_impl);     // r6
    SET_STACK_VAL(4, (uintptr_t)co_data);          // r5
    SET_STACK_VAL(5, (uintptr_t)co_cb);            // r4
    SET_STACK_VAL(6, 0);                           // r11
    SET_STACK_VAL(7, 0);                           // r10
    SET_STACK_VAL(8, 0);                           // r9
    SET_STACK_VAL(9, 0);                           // r8

    *stack_pointer = &stack_buffer[stack_buffer_count - 9];
    *stack_canary = stack_buffer;
}

Initializing a coroutine, once the architecture-independent code has run, involves setting up the stack so that when the coroutine is first switched to, the appropriate values are in the right places. Most notably, we arrange things so that our own version of a prologue is “returned” into when that first switch happens, and so that certain values (the coroutine function, its data pointer, and co_halt_impl()) are sitting in known registers for that prologue to use.
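Spelled out (writing count for stack_buffer_count), the freshly initialized stack therefore looks like this, highest address first; the canary value itself is presumably planted by co_create(), which isn't shown here:

stack_buffer[count - 1]   co_prologue_impl   // "lr" slot: popped into pc on the first switch
stack_buffer[count - 2]   0                  // r7
stack_buffer[count - 3]   co_halt_impl       // r6
stack_buffer[count - 4]   co_data            // r5
stack_buffer[count - 5]   co_cb              // r4
stack_buffer[count - 6]   0                  // r11
stack_buffer[count - 7]   0                  // r10
stack_buffer[count - 8]   0                  // r9
stack_buffer[count - 9]   0                  // r8   <-- *stack_pointer starts here
...
stack_buffer[0]           (canary)           //      <-- *stack_canary points here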

Which leads us to...

co_prologue_impl()

__attribute__((naked)) void co_prologue_impl(void)
{
    __asm volatile(
        "    mov lr, r6 \n"
        "    mov r0, r5 \n"
        "    bx r4      \n"
    );
}

Understanding this function requires a careful inspection of the previous one. co_prologue_impl() is the address that co_init_impl() planted in the stack slot where a saved LR (link register) would sit, so when the freshly created coroutine is first resumed, that address is popped into the pc and co_prologue_impl() is branched into. Here we mess with three registers. If you look at the comments in co_init_impl() you'll be able to follow along.

First, the LR is set to co_halt_impl(). Any return from this point onward lands in the aforementioned endless loop of yields, which ensures that should an exited coroutine ever be resumed by accident it will simply yield right back. Next, the coroutine's data pointer (sitting in r5, where co_init_impl() put it) is placed in r0 as the first parameter of the coroutine function co_cb. Finally we branch to the co_cb function pointer passed into co_create() (a so-called tail call).
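Expressed as ordinary C, the net effect is the sketch below (the real function must be assembly because co_cb and co_data arrive in registers r4 and r5 rather than as C parameters):

/* what co_prologue_impl() amounts to: run the coroutine body with its
   creation-time data pointer, and halt forever if it ever returns */
static void co_prologue_equivalent(co_function co_cb, void *co_data)
{
    co_cb(co_data);
    co_halt_impl();    /* only reached if the coroutine body returns */
}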

Why we perform all these bizarre gyrations becomes clear when we reach the final implementation function.

co_yield_impl()

This is the hardest part of the code to understand because it involves knowledge of the ARM Thumb ISA, how it differs from ARM Thumb2, what the calling conventions are for C compilers, and all the previous functions shown as well. Here it is in all its gory (sic).

__attribute__((naked)) void *co_yield_impl(uint32_t **, void *)
{
    __asm volatile (
        "    push {r4,r5,r6,r7,lr} \n"  // <1>
        "    mov r4, r8            \n"  // <2>
        "    mov r5, r9            \n"
        "    mov r6, r10           \n"
        "    mov r7, r11           \n"
        "    push {r4,r5,r6,r7}    \n"
        "    mov r2, sp            \n"  // <3>
        "    ldr r3, [r0]          \n"
        "    str r2, [r0]          \n"
        "    mov sp, r3            \n"
        "    mov r0, r1            \n"  // <4>
        "    pop {r4,r5,r6,r7}     \n"  // <5>
        "    mov r8, r4            \n"
        "    mov r9, r5            \n"
        "    mov r10, r6           \n"
        "    mov r11, r7           \n"
        "    pop {r4,r5,r6,r7,pc}  \n" // <6>
    );
}

The first thing we note with this bizarre function is that it has two parameters ... which we don't use.

Of course we do use them. They're buried in the stack frame and get shuffled around as part of the register layout. For details on where they end up I'd recommend reading the ARM calling conventions for your compiler (which itself is likely a copy of ARM's own default conventions). For your sanity, however, I'd just recommend that you trust that this code is correct and functional.

No? You want details? OK. You'll be sorry, though!

<1> pushes the low callee-saved registers (r4 to r7) and the link register onto the stack in a known order: the same order in which they were manually laid out in co_init_impl(), so consult that function for details.

Starting at <2>, the remaining callee-saved registers get stored. The roundabout approach used is a nod to one of the differences between the Thumb and Thumb2 ISAs: Thumb2 would have permitted pushing all the registers with a single push instruction, but Thumb only permits pushing registers r0 to r7 inclusive that way. The “extended” registers cannot be addressed like that, so we copy r8 to r11 into the low registers we have just saved and push those again.

At <3> we switch the stacks. This is again comically verbose because of Thumb: Thumb instructions don't permit ldr/str operations on sp (where Thumb2's do). So instead we copy sp into r2, load r3 with the value pointed to by the first argument (which arrived in r0 and is the saved stack pointer we are switching to), store r2 (the old sp) back through that same pointer (it is the stack_pointer member of a co_t, kept there for future co_yield() calls), and finally put the freshly loaded stack pointer into sp.
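Stripped of the register shuffling, the exchange on that saved-stack-pointer slot is just this (a sketch; the final step of moving the result into the live sp is exactly the part C cannot express, which is why the real code is assembly):

static uint32_t *swap_stacks(uint32_t **stack_pointer, uint32_t *current_sp)
{
    uint32_t *new_sp = *stack_pointer;   /* ldr r3, [r0]                        */
    *stack_pointer = current_sp;         /* str r2, [r0]                        */
    return new_sp;                       /* mov sp, r3 happens in the real code */
}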

At this point we have switched stacks and are working in another coroutine's stack.

First, at <4>, we put the passed-in data value (the void * second argument, still sitting in r1) into r0, making it the return value of the function. We then reverse the operations we did above: <5> restores r8 to r11, and <6> restores the remaining callee-saved registers and pops what was the saved lr (link register, a.k.a. return address) value into the pc, causing us to branch to that address immediately.

Well that was ... different

We have now “returned” into an entirely different function's stack frame. The code is hairy when you look at it, but it is sound. Both parameters are used, and the system is arranged so that a call to co_yield() switches back to the stack of the co_resume() that invoked it and, further, so that co_resume() switches either onto the initial stack fabricated in co_init_impl() or onto the coroutine's stack exactly as it stood when it last yielded.
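Neither co_resume() nor co_yield() appears in this post, but given co_yield_impl() they need be little more than thin wrappers that remember which coroutine is active. The sketch below is my guess at that plumbing, not COROS' actual source:

void *co_resume(co_t co, void *data)
{
    co_t previous = co_active;                               /* remember who is resuming      */
    co_active = co;
    void *result = co_yield_impl(&co->stack_pointer, data);  /* switch onto co's stack        */
    co_active = previous;                                    /* we're back; restore the owner */
    return result;
}

void *co_yield(void *data)
{
    /* the active coroutine's stack_pointer slot currently holds the resumer's
       saved sp (the swap in co_yield_impl() left it there), so yielding is just
       another switch through that very same slot */
    return co_yield_impl(&co_active->stack_pointer, data);
}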

People who have implemented (or at least looked into) a pre-emptive tasking system will recognize most of what was done here. In this regard coroutines are not all that different from pre-emptive concurrency systems. Where they differ is in the lack of need for synchronization primitives to ensure shared memory structures aren't pre-empted and botched mid-update. Because the transitions are synchronous, there is no need of the usual system of semaphores, critical sections, mutexes, monitors, condition variables, etc. that pre-emptive tasking systems demand. In the rare cases where these are needed, their implementation is far simpler and can easily be done ad hoc, without the gory edge cases of their pre-emptive cousins.
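For example (purely illustrative, with a hypothetical shared SPI peripheral), a guard can be nothing more than a flag and a polite yield loop, because nothing can pre-empt the coroutine between testing the flag and setting it:

static bool spi_busy = false;      /* hypothetical guard for a shared peripheral */

void use_spi_exclusively(void)
{
    while (spi_busy)               /* someone else holds it...                     */
    {
        co_yield(NULL);            /* ...so let other coroutines run and re-check  */
    }
    spi_busy = true;               /* safe: no yield between the test and this set */

    /* ... drive the shared peripheral here ... */

    spi_busy = false;              /* release it for the next coroutine            */
}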