COROS Ia: My Heart Must Go On

Continuing with the description of COROS that I'd begun earlier, today's article starts applying some structure to what is, at the core, really just a fancy system of continuable function calls.

Where we're at

COROS thus far is just a fairly simple coroutine system. This is, by itself, already pretty nifty and gives us lots of capabilities.

Generators

Out of the box we can get generators, for example. This is nothing particularly special. You can get them in C with normal functions. Here's a cheap and stupid Fibonacci Sequence generator in bog-standard C:

#include <stdio.h>

unsigned next_fib(void)
{
    static unsigned n = 0U, f0 = 0U, f1 = 1U;

    if (n++ != 0U)
    {
        unsigned fn = f0 + f1;
        f0 = f1; f1 = fn;
    }
    return f0;
}

int main(int argc, char** argv)
{
    for (unsigned i = 0; i < 10; i++) { printf("%u ", next_fib()); }
    printf("\n");
    return 0;
}

This simple function is a form of generator. Each call to it elicits the next entry in the Fibonacci Sequence. And for many use cases this is actually fine. This is what you want. It's fast and it works.

Until you want, for whatever reason, two different Fibonacci generators. Then things get hairy. The existence of static variables makes this impossible to restart from the beginning and solving this starts getting very wasteful of CPU time, RAM, or both. Enter a coroutine:

#include <stdbool.h>
#include <stdio.h>

#include "co.h"

void next_fib(void *unused)
{
    unsigned n = 0U, f0 = 0U, f1 = 1U;
    while (true)
    {
        if (n++ != 0U)
        {
            unsigned fn = f0 + f1;
            f0 = f1; f1 = fn;
        }
        co_yield(&f0);
    }
}

int main(int argc, char** argv)
{
    co_t next_fib_co1 = co_create(next_fib, NULL, 32, NULL);
    co_t next_fib_co2 = co_create(next_fib, NULL, 32, NULL);

    for (unsigned i = 0; i < 10; i++)
    {
        printf("%u ", *((unsigned *)co_resume(next_fib_co1, NULL)));
    }
    printf("\n");
    for (unsigned i = 0; i < 10; i++)
    {
        printf("%u ", *((unsigned *)co_resume(next_fib_co2, NULL)));
    }
    printf("\n");
    return 0;
}

Since the coroutines each have their own stack, and since that stack frame doesn't go away between resumes/yields, we lose the static variables. Without static variables we can have as many Fibonacci generators as we like in our code, each starting from the beginning of the set and moving independently forward in it. It's more expensive than just the function-based one, so don't use this if you only need the one generator! If you need more than one, however, this one can't be beat, especially if you wrap the co_resume() call into a friendlier function invocation.

Sequential, interrupted processing

Consider a command processor in an embedded system. Commands are entered via a serial port. Upon reaching a CR LF pair, the command is completed and processed. (Think of the near-ubiquitous “AT” command systems that abound in communications, for example.)

This is a very typical producer/consumer problem and it is specifically the situation that coroutines were created to deal with. Building a command processor using regular functions is well-known, but it tends to involve either static variables or manually held and passed-around state block that keeps track of the current built-up string. The first has flexibility problems already pointed out above (you can have only one command processor) and the latter tends to make code obfuscated and difficult to reason about without a debugger to inspect the state block.

The coroutine, by virtue of being its own state container (you can view a coroutine as a function that carries its stack frame and state around with it), renders the final logic much simpler, rather like this:

void process_command(void *size)
{
    /* one-time initialization */
    void *rv = NULL
    size_t offset = 0;
    size_t buffer_size = *((size_t *)size);
    command_buffer = malloc(buffer_size);
    memset(command_buffer, 0, buffer_size);

    /* main loop */
    while (true)
    {
        char next_char = *((char *)co_yield(rv));
        if (next_char == '\r')
        {
            next_char = *((char *)co_yield(NULL));
            if (next_char == '\n')
            {
                /* we have CR/LF */
                rv = dispatch_command(command_buffer);
                offset = 0;
            }
            else
            {
                command_buffer[offset++] = '\r'
                command_buffer[offset++] = next_char;
                rv = NULL;
            }
        }
        else
        {
            command_buffer[offset++] = next_char;
            rv = NULL;
        }
    }
}

This coroutine function accumulates characters passed to it until it recognizes a full command line (CR/LF termination). It then processes that command and returns the status code from the command's execution. All other times it returns a NULL, signalling “nothing yet”. Typical use case would be like this:

/* stuff goes here */

    size_t cmd_buf_siz = COMMAND_BUFFER_SIZE;
    size_t stk_siz = 32;
    cmd_proc_co = co_create(process_command, &cmd_buf_siz, stk_siz, NULL);
    co_start(cmd_proc_co, NULL);
    
/* stuff goes here */

    while (true)
    {
        char ch = get_char_from_serial();
        void *rv = co_resume(cmd_proc_co, &ch);
        if (end_loop(rv))
        {
            break;
        }
    }

/* stuff goes here */

(Of course the various calls to co_resume() and co_yield() would normally be wrapped in static helper functions to give them readable, descriptive names, but this has been left out to show the role coroutines are playing in the example.)

So what's happening here is that two independent subsystems have been decoupled. The serial port system is likely a simple one that reads the port (or is interrupt-driven and reads the next character in the port's buffer) while the command processor is passed one character at a time until it gets a full command buffer and processes that. The command itself returns something that determines if the command processing loop ends or not.

The coroutine's logic is straightforward and works in simple, linear logic. A more typical system running off of some kind of state machine with callbacks would be harder to work out and harder to spot bugs in. This is doubly true if there's nesting of state machines. (Ask any Node.js developer about Callback Hell.) Each piece is trivial to follow and yet hooks up beautifully.

And this segues into the next nice piece:

Trampolining and IPC

The way the code is used above, it assumes that cmd_proc_co yields only a flag that indicates if the command processor should exit or not. But this isn't the only use of it and, indeed, there's room for argument that we want far more from our system.

In an ideal world coroutines could yield to any other coroutine, but there are complex technical reasons for why this isn't common. (Some coroutine systems actually do this: the yield calls specify which specific coroutine to yield to.) One of the main problems with these kinds of systems, however, is that they get very hairy, very quickly, even if the underlying implementation wasn't itself already a problem. This is why most coroutine systems are yield-to-parent varieties. This doesn't mean, however, that we can't simulate the broader world.

To do this we'd only have to change the dispatch_command() function we handwaved above, and the parent code that invokes the cmd_proc_co. Indeed the processing loop would turn out to look something more like this:

    while (true)
    {
        char ch = get_char_from_serial();
        co_t rv = co_resume(cmd_proc_co, &ch);
        if (rv && co_resume(rv, NULL))
        {
            break;
        }
    }

The (not shown) dispatch_command() now returns a coroutine's co_t handle. If anything is returned at all, it is resumed and that return value determines if we end the processor or not. Literally a two-line change moves us from a value return to an active process return.

And here we have concurrency (albeit cooperative concurrency, not pre-emptive). The individual returned co_t implementations may be more function-like (calculate a result and yield) or they may contain within them complex state machines that yield in several locations using straightforward linear code. We don't know at the level of the mainline code, and more importantly we don't care.

This technique is called the “trampoline” and is a powerful way of doing concurrency in a more understandable, easier to reason about way.

Where we're going

Three little examples show how coroutines can fit in an application in three different use cases. And once you have a trampoline, you have, in effect, a task scheduler.

An operating system, in short, or at least the kernel of it.

Stopping just here at this point already brings with it benefits. You could build good small- to medium-sized embedded systems with these uses of coroutines (and a few other obvious extensions of them). Trampolines in particular can serve as the foundation for a very powerful tool for concurrent (but not parallel!) computing in a small space that's easy to reason about.

But what fun is stopping here?

After all, if there are standard techniques for doing things (like trampolining for concurrent dispatching) does it not stand to reason that this could be wrapped into its own API that hides the gory implementation details behind a facade of cleanliness?

That's what the next blog post will be about.