qqmrichter

An embedded systems programmer living the Shenzhen I/O life in another city.

When last I wrote about COROS I explored the EVQ component of it with a focus on the API and some of its underlying construction. In this post I will expand on that underlying construction, giving reasons for some of the design decisions as well as providing some example use cases.

The event queue itself.

Stripped down of runtime irrelevancies, this is what an event queue looks like:

typedef struct _evq_queue_t
{
    evq_queue_t down, up; // <1>

    evq_event_t head;     // <2>

    // ... elided content ...
} evq_queue_t_impl;

At <1> we establish where in the tree of queues this particular queue sits: down points to the child of this queue, up points to the parent. The parent value is only ever used when managing queue relationships; it is not used in navigating queue processing. The child value is used both in managing queue relationships as well as when processing queued events.

At <2> we point to the first event in this queue. Events are chained as a list, though each event also has a pointer to its parent.

Queue insertion and deletion

When evq_attach_queue(parent, child) is called, the meat of the operation is as follows:

child->down = parent->down;
parent->down = child;
child->up = parent;

if (child->down)
{
    child->down->up = child;
}

As can be seen, the list of children is a push-down stack, notionally, in that a new child is placed in front of previous children. This has implications for fancier uses of EVQ in priority management, but is simple enough to follow.

Calling evq_detach_queue(q) effectively reverses the above operations, using knowledge of parents and children to snap itself out of wherever the queue finds itself. Note that queues cannot be deleted, only removed from the tree. Since they occupy so little actual memory, the added complexity of code for dealing with deleted queues (including all of its attached events) is likely to cost more in code space than is saved in reclaiming such a paltry amount of memory.
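
As a rough illustration, the reversal might look something like the following minimal sketch, which assumes only the down/up fields shown above and a queue with no children of its own attached; the real evq_detach_queue() may do more housekeeping than this:

void evq_detach_queue(evq_queue_t q)
{
    if (q->up)
    {
        q->up->down = q->down;  /* splice ourselves out of the chain above us */
    }

    if (q->down)
    {
        q->down->up = q->up;    /* repair the back-link of whatever follows us */
    }

    q->up   = NULL;             /* the queue now stands alone */
    q->down = NULL;
}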

Implications of this design

There are some key implications to this design. First, as will be seen when we discuss event processing below, the tree-of-lists structure of EVQ lends itself to establishing, at need, both a priority system and a means to circumvent the priority scheme. In addition, the ability to detach and attach queues at need allows entire sets of events to be paused in-flight. When the third component of COROS is discussed, the importance of this will be clearer, but in effect it permits a system to turn pieces of itself off and on at need. This is especially valuable if, for example, the system does power management and needs to be able to disable and enable component processing quickly.

Events

Stripped to focus on important constructs, an event looks like this:

typedef struct _evq_event_t
{
    tick_ms  trigger_time; // <1>
    delay_ms period;

    bool is_coroutine; // <2>
    void *data;
    union
    {
        evq_callback cb;
        co_t         co;
    };

    evq_event_t chain;     // <3>
    evq_queue_t container;
} evq_event_t_impl;

As shown at <1>, events have two time-related pieces of information: the time at which the event is triggered, and, for repeated events, the period of repetition. Note that although the delay and period are relative to the current time in the API, the trigger_time is absolute: it is calculated when the event is created. This is important for reasons which will be explained later.

<2> deals with how events are fired. There are two kinds of events: they can be simple callback functions or they can be coroutines. Each kind is activated differently, so a flag is provided to identify which kind and a union holds either a callback function or a coroutine handle. In either case the event data is stored as a void *.

<3> marks where the event management occurs. Each event contains a link to the next event in its chain (singly-linked list) as well as a pointer to its owning event queue. The latter is used for the deletion or rescheduling of events. The former plays a key role in how events are inserted and how they are selected for firing.

Event queue processing

Now the pieces are in place to understand both how event processing works and the implications of some of the design decisions.

Whenever evq_queue_now/in/every() or evq_resume_now/in/every() are called, behind the scenes it all boils down to the manual posting function evq_post_event(evq_queue_t q, evq_event_t e, void *data). This function inserts event e into queue q, associating the provided data with the event (data which will be passed into the callback or coroutine when it is called or resumed, as appropriate). The container of e is set to q for ease of navigation and queue management.

Event insertion

To insert the event, the logic is simple: 1. the chain is searched for the first event whose trigger_time is greater than e's trigger_time, and e is placed ahead of it in the chain; 2. if no such event is found by the end of the chain, e is placed at the end of the chain.

The trigger_time is a calculated value: the current tick counter plus the delay value given when the event was created. This is done to prevent an event from starving the queue by being perpetually inserted at the head. By calculating the trigger_time as an offset from the current tick, and by placing the inserted event after all events with a lower or equal trigger_time, the system ensures that all events will eventually be fired as the tick counter inexorably counts upward. (Note that events can't starve their own queue, but they can starve queues lower in the chain, so some care still needs to be taken.)
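
In code, the insertion walk might look something like this minimal sketch (the field names follow the structs shown earlier; the real implementation differs in its details and also has to worry about things like tick wraparound):

static void insert_event(evq_queue_t q, evq_event_t e)
{
    evq_event_t prev = NULL, cur = q->head;

    /* skip every event due at or before e's trigger time */
    while (cur && cur->trigger_time <= e->trigger_time)
    {
        prev = cur;
        cur  = cur->chain;
    }

    e->chain     = cur;     /* may be NULL: e goes at the end of the chain */
    e->container = q;

    if (prev)
    {
        prev->chain = e;
    }
    else
    {
        q->head = e;        /* e is now the earliest event in this queue */
    }
}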

Another, more subtle facet of the design, however, requires understanding how the assorted evq_process*() functions process their queue.

Event extraction

evq_process() and its siblings all take an evq_queue_t as an argument. While their processing is active (permanently, for a given duration, or for a single pass, depending on which function of the family is called), they follow this logic, given q as the passed-in evq_queue_t:

  1. The chain of events in q->head is visited one by one.
  2. Each event in that chain has its callback or coroutine called or resumed iff the stored trigger_time is less than or equal to the current timer tick value.
  3. As soon as an event with a trigger_time that is greater than the current timer tick is encountered the chain is abandoned.
  4. At this point q->down is processed recursively.
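
A minimal sketch of a single pass, using the structure fields shown earlier, might look like this (removal and rescheduling of fired events is elided; the real functions handle that as well):

static void process_queue_once(evq_queue_t q, tick_ms now)
{
    for (evq_event_t e = q->head; e; e = e->chain)
    {
        if (e->trigger_time > now)
        {
            break;                      /* chain is sorted: nothing further is due */
        }

        if (e->is_coroutine)
        {
            co_resume(e->co, e->data);  /* resume a coroutine handler */
        }
        else
        {
            e->cb(e->data);             /* or call a plain callback */
        }
    }

    if (q->down)
    {
        process_queue_once(q->down, now);  /* then recurse into the children */
    }
}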

This is where the utility of having chained, tree-structured event queues shows itself. Usually the evq_process*() family will be invoked on a “master queue” that is system-wide, and thus all events of all components in the system will be processed in a strictly-defined order. Since, however, there's no requirement for the “master queue” to be the one processed at any given point, it is easily possible to limit event processing to a subset of the overall system at need. Indeed, since not every event queue even has to be attached to the overall event tree, it's possible to have critical services on their own detached queue, all serviced before systemwide services are processed.

Implications

EVQ is not only a good tool for decoupling subsystems and governing their interactions, it is a powerful tool for establishing priorities and processing critical code independently of the rest of the system. And there is one final feature that is hinted at in the design, but not stated openly.

Note that an event is associated with a void *data. The nature of void * means that any kind of data can be passed into an event handler.

Including an event queue. Or another event.

Using EVQ gives the ability to do Pi-calculus-like operations since the communications channel can accept a communications channel as an argument. Exactly how this would work will be explored in the next two articles in the COROS series, but the gist of it is that subsystems can flexibly address other subsystems without knowing they exist.

For a practical example, consider an embedded system that logs events. Where it logs will depend on which hardware modules are present and/or active. If, for example, the Ethernet hardware is present and it has an active connection, the logger can send things there. If it's not present, it may be able to instead log to a serial port. If that isn't available, maybe in a pinch it can record its logs to built-in or peripheral Flash.

The key is that the component that's sending out the log messages should not be required to know of all these. It should just know that when a loggable event happens in any subsystem, that subsystem knows to talk to it (via a message queue), and that it just has to add its decoration (severity, time stamp, etc.) to that information and push it out...

...to the message queue that it was passed in when initialized.

This keeps coupling in the overall system to a minimum. The other subsystems only have to know that a logger is available that accepts messages in a very limited number of formats. The logger only has to know that it is connected to something that accepts a particular set of messages. None of the subsystems needs to know that there are possibly a dozen different places to log messages, and the logger doesn't need to know it either. Only the portion of the code that configures the relationships needs to care about how each piece is connected.
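
As a sketch of that wiring (and only a sketch: the logger names here are hypothetical, and only the evq_* calls are real EVQ API), the logger ends up holding nothing but what the configurator handed it:

/* Hypothetical logger component: all it holds is what the configurator gave it. */
static evq_queue_t log_output_q;   /* the chosen backend's queue                */
static evq_event_t log_output_ev;  /* a persistent event bound to that backend  */

void logger_init(evq_queue_t output_queue, evq_event_t output_event)
{
    log_output_q  = output_queue;  /* Ethernet? UART? Flash? The logger can't tell. */
    log_output_ev = output_event;
}

void logger_submit(void *msg)
{
    /* decorate msg (severity, timestamp, ...) here, then push it out */
    evq_post_event(log_output_q, log_output_ev, msg);
}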

One could argue that this could be handled using a global variable containing a collection of function pointers, and indeed this would be the usual way to do this. (It could very well be the way that gets used in your system even if using COROS!) The strength, however, of doing it this way is that all configuration is kept in one place, and all involved components are decoupled. Log clients only have to know the logging message format (or API, as will be seen when the COMP component is discussed). Loggers will only have to know the logging message format. Configurators only need to know of the existence of logging backends and a single logger message format. And all the subsystems communicate in exactly the same way using exactly the same communications medium.

And, most importantly from my perspective as a builder and maintainer of dozens of embedded systems, the code for each component can be easily ripped out of an existing project and copied wholesale to another project, especially if the COMP component discipline is followed. (This is foreshadowing, a well-known sign of quality writing!)

Caveats

Events in EVQ are usually dynamically created and destroyed. This can be murder in embedded systems as it tends to fragment the heap, leading to ridiculous situations where there's plenty of free SRAM, but no contiguous piece big enough to allocate an event object. To mitigate this, an object pool allocator is used that has proved to be useful in real-world applications. In this case it employs the default doubling strategy and individual object allocation with a starting size of five events. All of these can be adjusted in a single line of code in evq.c to fine-tune memory costs and fragmentation issues.
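
Something along these lines is what is meant: a free list of fixed-size event objects that grows by doubling when exhausted. This is only a sketch; the actual allocator inside evq.c differs in its details (and handles allocation failure, which is elided here):

#include <stdlib.h>

#define EVQ_POOL_INITIAL 5          /* the single tunable line */

static evq_event_t free_list = NULL;
static size_t      pool_size = 0;

static void pool_grow(size_t count)
{
    struct _evq_event_t *block = calloc(count, sizeof *block);

    for (size_t i = 0; i < count; i++)
    {
        block[i].chain = free_list;  /* thread the new objects onto the free list */
        free_list = &block[i];
    }
    pool_size += count;
}

static evq_event_t pool_alloc(void)
{
    if (!free_list)
    {
        pool_grow(pool_size ? pool_size : EVQ_POOL_INITIAL);  /* doubling strategy */
    }

    evq_event_t e = free_list;
    free_list = e->chain;
    return e;
}

static void pool_free(evq_event_t e)
{
    e->chain = free_list;
    free_list = e;
}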

EVQ supports “classic” event style, in the form of callback events, but these should contain only very simple logic with little to no persistent state between calls. A perfect use case for callback events would be something that just turns a LED on or off (or toggles it). If there is any significant logic, the backwards-inside-out nature of callback-based systems (“callback Hell”) suggests that using coroutine event handlers is the better approach.

In either case, disciplined use of event queues to decouple distinct subsystems that work together but should not necessarily know anything about each other beyond “existence” is the way to approach overall system design. Use of COROS' CO and EVQ components can lead to very robust overall systems. Which segues nicely into the final component of COROS.

Parts is parts

COROS' final component, subject of the next article, is the COMP component. (COMP being short for “component”. So the component component.) COROS is, at its heart, an opinionated system, albeit with its opinions weakly stated so that people with different opinions can use different parts of it in isolation. The final opinion of COROS (the first two being that coroutines are superior to threading and that loose coupling of subsystems is a must) is that object-orientation is king.

But not object-orientation the way that it is traditionally viewed. You will not see any classes and instances and inheritance. You will not see prototypes and cloning, even. COROS is a nod to the originator of OOP, whose words people mouth without ever actually putting them into practice. COROS components are Kay-style OOP: independent software entities that communicate via messages.

For a brief taste of what a COROS COMP is like, a stereotypical COMP will have:

  1. A configuration function.
  2. Its own private coroutine for processing messages.
  3. Its own private message queue for accepting messages.
  4. An API that it presents to the outside world to conceal precisely how the messages are implemented and handled.
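
Purely as a guess at the flavour (the names below are invented, only the CO and EVQ calls are real, and the actual COMP conventions and helper macros are the topic of the next article), such a component's skeleton might look a bit like this:

static void handle_blinker_message(void *msg);  /* hypothetical message handling */

static evq_queue_t blinker_q;          /* 3. its own private message queue */
static co_t        blinker_co;         /* 2. its own private coroutine     */

static void blinker_loop(void *first_msg)
{
    void *msg = first_msg;
    for (;;)
    {
        if (msg)
        {
            handle_blinker_message(msg);
        }
        msg = co_yield(NULL);          /* park until the next message */
    }
}

void blinker_init(evq_queue_t parent)  /* 1. a configuration function */
{
    blinker_q  = evq_create_queue();
    evq_attach_queue(parent, blinker_q);

    blinker_co = co_create(blinker_loop, NULL, 64, NULL);
    co_start(blinker_co, NULL);        /* park the coroutine at its first yield */
}

void blinker_send(void *msg)           /* 4. the public API hides all of the above */
{
    evq_resume_now(blinker_q, blinker_co, msg);
}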

The details of how this plays out will be the topic of the next article.

Software has a problem.

OK, it has many problems. I've already highlighted one of them. But this is another important one.

The problem is that software—all software, with no exceptions—sucks. The reason for this is multifaceted and we could spend years and years arguing about who has the larger list of reasons, but in the end it boils down to the proverbial shoemaker's children: Our development tools are the worst of the worst in software.

The problem in a single question …

How are we meant to make good software when the software we rely on to make our software is such utter shit? Why do we tolerate such shit tooling?

Picture a bicycle repair shop. Go ahead. Get that picture in your mind. Now picture the person in the shop working with, say, wrenches that are made of such soft and malleable steel that they routinely get stripped or bent out of shape. What do you think the response would be? (Hint: it rhymes with “they'd get fed up and toss them out before buying new ones that work”. Because identical rhymes are rhymes.)

Picture a carpenter. Picture said carpenter working with, say, chisels that can't hold an edge for longer than four or five taps. How long until that carpenter throws them out (adding a lot of salty language to the bargain) and buys a set that doesn't suck?

Picture any tool-using profession. Picture how any such tool-using profession deals with tools that are simply bad.

Now look around you in the programming world and ask yourself “what the flying fuck happened to us!?”

The state of software tooling

This is a question that should be asked because software tooling is utterly fucking terrible. For example, our problem domains have ballooned in size and complexity exponentially, yet we still chiefly use products that are literally a single one-dimensional line of text. I mean sure, we try to make it look at least two dimensional like this:

switch (foo)
{
case bar:
    do_something_with(bar);
    break;
case baz:
    do_something_with(baz);
    break;
default:
    do_something_else();
    break;
};

But the tragic secret here is that this is all just formatting convention. It's literally identical to this when the compiler sees things:

switch (foo) { case bar: do_something_with(bar); break; case baz: do_something_with(baz); break; default: do_something_else(); break; };

There's something fundamentally broken in how we try to map these pathetic single-dimensional strings into a world that has concurrency, parallelism, distributed computation, and a whole host of other horrifically complicated, multi-dimensional logic.

N.B. Before you proudly trot out your language that isn't just a one-dimensional string, answer two questions:

  1. ARE YOU SURE? For example Python is not, despite its attempt to force that “structured” style above.
  2. How common is your favourite tool if it really is? (Hint: not very.)

But this goes beyond tooling choice. It's frustrating, it's difficult, and it's error-prone, but it is possible to make software that isn't fundamentally broken, even with the tooling we've got. So there's more to the problem than tooling alone.

Programmer attitude

This is the real culprit. It's the shoemaker's children problem only it's not the children who don't have shoes. It's the shoemakers themselves. The shoemakers are so intent on making (slipshod) shoes for other people (because that's how they get their money) that they don't realize they're barefoot themselves.

While crouching on a floor consisting of sharp flint shards.

They don't understand that if they only had, say, a good workbench, some nice hammers and nails, some high quality leather, and even (gasp!) a chair, they could make nicer shoes that don't make people want to throw them at the shoemakers' heads.

A case study

I use microcontrollers from STMicroelectronics (the STM32 line) in my day-to-day work. One thing that ST likes to push is their STM32CubeMX code generator. It generates startup code, peripheral configuration code, and even, lately, code for specific external peripherals and middlewares. They recently added Azure RTOS support to CubeMX for select processors, and in the December time frame they expanded it to cover almost their entire line.

Including their STM32WL line of LoRa-enabled MCUs.

Let's take a look at the Azure pack for STM32WL in STM32CubeMX, shall we?

First, here's my version of STM32CubeMX: STM32CubeMX v6.4.0

Version is 6.4.0. Now let's check ST's site for what the latest version is: Latest version is v6.4.0

Yep. I have the latest version. Now let's go install the Azure RTOS pack for STM32WL: Azure RTOS pack needs v6.5.0

WTF!? It really wants a version that doesn't exist! This must have just been put out and v6.5.0 is due to be released soon, right? When was this pack put up? December 31st, 2021

… What the actual…? This has been sitting out there, unusable with any released version of STM32CubeMX for almost two months! WHY!?

Why this happened

Well the reason why is clear: in modern business thought, work on the STM32Cube infrastructure is not considered a benefit. It costs money to make the firmware (STM32Cube HAL and LL drivers), the code generator (STM32CubeMX), and the various packs and they don't sell these. There's no direct income from them, so they're loss items on the balance sheet. You don't direct engineering, marketing, and quality control resources to things that don't directly earn you money, so the quality languishes and development is slow and shoddy. (When I look at the STM32Cube HAL drivers, for example, I'm absolutely positive they assigned interns to write them given just how naively and stupidly written they are. And don't get me started on the STM32Cube documentation!)

The deeper reason

Of course there's a reason why they don't sell the tools: nobody would buy them. Even if they didn't suck bowling balls through garden hoses (and ST's tooling is nearly best of breed in this industry: Altera's tooling sucks matter out of galactic core black holes!), nobody buys development tools. If it's not free, it's not used, essentially. Why is this?

Another CASE study (hah!)

Way back when, in the days I'd succumbed to the OOP Kool-Aid, I spent my own money on a copy of Rational Rose, a UML modeling/code generation tool. I'd fallen for the marketing. It could read in my C++ (ugh!) code and generate UML models that would help me understand it better, refactor it, and improve it all! It could generate code from models I made to practically write my application for me! It would improve my productivity in ways that I'd never believed possible!

And it was all a lie.

Rational Rose was a buggy piece of shit that crashed multiple times per session of use (usually taking its fragile model database with it each time). It could only import a subset of C++, choking (and, naturally, crashing) on constructs it couldn't work out … you know, such alien C++ things as nested templates … and it generated code that wouldn't compile with any compiler known to humanity at the time. (Seriously: I tried to get it to compile with MSVC, gcc, and several commercial compilers and never had any one of them compile the generated code out of the box.) It not only didn't improve my productivity, it actively damaged it.

And for this I paid five digits. In 1990s money!

And I played this out for several tools in a row, spending easily in the range of six figures of personal money (in 1990s money!), with only one such tool—Clarion—having actually profited me any, before finally deciding that, well, software tooling sucks no matter how much you pay for it, so I might as well just use free (as in beer) software.

The shoemaker paradox death spiral

This is where the death spiral begins. I'm not the only developer in the world who's spotted the pattern “software tools don't work as advertised, so spending money on them is pointless”. This means software tooling (like the aforementioned CubeMX) is not viewed as a profit line item. This means that resources aren't properly allocated to make them work properly. This means that the quality slips, reinforcing the position that spending money on software tooling is a fool's game.

In the end this results in:

  • compilers that are so buggy that I have a hobby of collecting bizarre compiler bugs;
  • libraries that are so buggy I often just read the source (if available) for inspiration as I hand-roll my own (differently buggy) clones;
  • non-coding tools (like SCMs) that are so hopelessly convoluted (cough Git cough) that their use is a full-time job of frustration; and, finally,
  • a software ecosystem of crappy software made because programmers won't self-nurture.

What's the solution?

There isn't one. Not any longer. Not in the current climate. Because the real solution is professionalism, and this is anathema to the Big Tech™ TechBroDudes churning out one shitty web page after another to monetize all teh thingz!.

No, what's going to have to happen first is a major disaster—a huge catastrophe that involves loss of life on a grand scale, like five figures or more in a single incident—so that software and its practitioners are called out on the societal carpet for being the incompetents we are. Legislation will have to be passed that mandates standards of behaviour and quality in software (like every other branch of engineering already has!). The software industry will have to collapse into a fiery heap, and then get rebuilt from the ground up based on sound principles with a focus on quality engineering.

And then we have to get people away from thinking that quality is free. We'll have to pay for tools and customers will have to pay for software again.

Yes, even if it's F/OSS.

In the last installment talking about COROS, I built up on various uses for coroutines, ending with a primitive scheduling tool. I then hinted at wrapping this up into an API that conceals the gorier elements of micromanaging coroutines.

This is that API.

EVQ

Where the CO (coroutine) component of COROS is the heart of the RTOS, EVQ is its circulatory system and skeleton. It is, at its simplest, an event queue mechanism, but one with some facilities that will permit (as is to be seen in later articles) some surprisingly sophisticated constructs.

The main job of EVQ, as with any event queue, is to decouple components in a system, having them communicate solely in terms of events. Events can be “wake up” calls (for example to flash a LED) or they can be sophisticated, state-carrying and -modifying messages. Similarly event handlers can be simple callback functions (familiar to anybody who has ever programmed Win16 or “modern” web frameworks), or they can be full-blown, independent coroutines using the CO component.

The mile-high view

Events are placed into an event queue. An application can have many event queues. Event queues are typically (but not necessarily) placed into a tree-structured event environment. Posting an event to an event queue means it will be processed when that queue, or any event queue higher in the chain, is serviced; conversely, servicing an event queue processes all triggered events in all of its child queues.

This tree-structured approach allows standalone, plug-in components for ease of porting: components in a system only need to know of their own event queue and to have a parent queue passed in upon initialization. A form of priority mechanism can be implemented through the judicious use of child queues as well.

Event queues in EVQ are first class entities and can be passed around a system, even in messages, plausibly enabling Pi Calculus-like operations.

The EVQ API

The EVQ API is of necessity more involved than the CO API. That being said, it is not a particularly large one, and it is divided into five related groups:

  • event queue management;
  • persistent event management;
  • ad-hoc event management;
  • event processing;
  • user-supplied functions.

Data types

Aside from the data types from CO (co_t and co_function), the following data types are used in EVQ:

typedef uint32_t delay_ms;
typedef uint32_t tick_ms; 
typedef struct _evq_queue_t *evq_queue_t;
typedef struct _evq_event_t *evq_event_t;
typedef void (*evq_callback)(void *);

delay_ms and tick_ms are simple counter values counting out milliseconds, used primarily to make the intent of a variable clear. The first signals a duration for delays or repetitions. The second is a snapshot of a point in time.

evq_queue_t is the type of an event queue's opaque handle. It contains the details of an event queue including its parent, its children, the head of the events list, and other such information needed for running the queue.

evq_event_t contains the information about an event including its trigger time, its period for repeating events, the nature of its callback (coroutine or function), data that is associated with the event, its containing queue, its chaining information, and other data associated with moving the event smoothly through the system.

evq_callback is the type of the function used in simple function callbacks. (Coroutine callbacks are, naturally, of the co_t type.)

Event queue management

Event queue management is done with the following functions:

evq_queue_t evq_create_queue(void);
void evq_attach_queue(evq_queue_t parent, evq_queue_t child);
void evq_detach_queue(evq_queue_t child);

Event queues are created with evq_create_queue() and once created cannot be destroyed. Children are attached to a parent via evq_attach_queue(). A child can remove itself from its parent queue using evq_detach_queue() if needed, to either pause its own processing (say in preparation for a low-power phase of operation) or to move its relationship around.

Not all queues are necessarily attached at all times, or indeed ever. Values of evq_queue_t can be used as regular data, passed around to functions or through events, and can be directly processed at need, permitting a style of operation that is remarkably flexible even while it is carefully managed.
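
A quick usage sketch (the queue names here are illustrative):

evq_queue_t system_q = evq_create_queue();   /* the system-wide “master queue” */
evq_queue_t uart_q   = evq_create_queue();
evq_queue_t sensor_q = evq_create_queue();

evq_attach_queue(system_q, uart_q);
evq_attach_queue(system_q, sensor_q);

/* ... later, heading into a low-power phase: */
evq_detach_queue(sensor_q);                  /* sensor events now sit, paused */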

Persistent event management

Persistent events are those events which tend to be created statically, or at least long-term, and are then posted at need. Persistent events are most useful in the context of repeated events that cannot be periodically scheduled. An example of such a use case is an event to signal that a DMA buffer has been filled. Rather than adding the overhead of creating an event inside the DMA's interrupt handler, a well-designed system would create the event up front and then post it manually at need.

Persistent events are managed with the following functions:

evq_event_t evq_create_fn_event(delay_ms delay, delay_ms period, evq_callback, void *);
evq_event_t evq_create_co_event(delay_ms delay, delay_ms period, co_t, void *);
void evq_post_event(evq_queue_t, evq_event_t, void *);
void evq_cancel_event(evq_event_t);
delay_ms evq_query_time_remaining(evq_event_t);
void evq_destroy_event(evq_event_t);

There are two functions to create events: one that creates an event intended to call a function on firing (evq_create_fn_event()), and one that creates an event intended to resume a coroutine on firing (evq_create_co_event()). The delivery mechanism aside, the two are otherwise functionally identical.

Events can be of three kinds:

  • immediate
  • deferred
  • periodic

An immediate event is ready to be fired the moment it is put in the event queue. This is signalled by providing a delay and period of 0.

A deferred event is delayed by a provided number of milliseconds. This is signalled by providing a delay of some non-zero value and a period of 0.

Both immediate and deferred events are automatically cancelled upon being fired and will not be in the event queue afterward.

A periodic event is a long-lived event that has a non-zero period value. It will first fire after its delay has passed in milliseconds (0 meaning, naturally, immediately), and will then re-fire every period milliseconds thereafter until cancelled or destroyed.

Events are placed on the event queue for processing at their allotted time with the evq_post_event() function. Note that events can be created with a data pointer and that when posted a data pointer is also provided. If evq_post_event() supplies a non-NULL data, this overrides the data provided on creation. If it passes NULL, then the data provided on creation is used instead.

The amount of time an event has left before firing can be queried with evq_query_time_remaining(). An event in the queue can be removed from the queue with evq_cancel_event(), while evq_destroy_event() both removes the event and reclaims the memory it holds. (Note that immediate and deferred events are automatically cancelled, but not destroyed.)
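
Returning to the DMA example above, a minimal sketch might look like this (every name other than the EVQ calls is illustrative, and ISR-safety concerns are ignored for brevity):

#include <stdint.h>

static evq_queue_t dma_q;
static evq_event_t dma_done_ev;
static uint16_t    dma_buffer[64];

void dma_events_init(evq_queue_t parent, co_t consumer_co)
{
    dma_q = evq_create_queue();
    evq_attach_queue(parent, dma_q);

    /* an immediate event (delay 0, period 0) that resumes the consumer coroutine */
    dma_done_ev = evq_create_co_event(0, 0, consumer_co, NULL);
}

void dma_irq_handler(void)    /* the interrupt handler stays tiny: no allocation here */
{
    /* the non-NULL data here overrides the NULL given at creation time */
    evq_post_event(dma_q, dma_done_ev, dma_buffer);
}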

Ad-hoc event management

Ad-hoc events are dynamically-created events that are queued immediately upon creation. There are six functions for dealing with them, in two groups—one for coroutine events and one for function callback events:

/* callback events */
evq_event_t evq_queue_now(evq_queue_t, evq_callback, void *);
evq_event_t evq_queue_in(evq_queue_t, evq_callback, void *, delay_ms delay);
evq_event_t evq_queue_every(evq_queue_t, evq_callback, void *, delay_ms period);
/* coroutine events */
evq_event_t evq_resume_now(evq_queue_t, co_t, void *);
evq_event_t evq_resume_in(evq_queue_t, co_t, void *, delay_ms delay);
evq_event_t evq_resume_every(evq_queue_t, co_t, void *, delay_ms period);

The three functions in each group correspond to immediate, deferred, and periodic events, with the proviso that the periodic event's first firing is deferred by its period; in effect this is the equivalent of creating an event with evq_create_*_event() whose delay and period values are the same.

In all cases, as soon as the function is called, behind the scenes the event is created from the provided data and inserted into the queue in a single operation. Unlike persistent events, however, ad-hoc events that are immediate or deferred will be destroyed upon firing, not merely cancelled. (Periodic ad-hoc events will, naturally, remain in being until manually destroyed.)
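
A quick usage sketch (the queue and callback names are illustrative):

evq_queue_now(ui_q, show_splash_cb, NULL);        /* immediate                */
evq_queue_in(ui_q, hide_splash_cb, NULL, 2000);   /* deferred: in two seconds */

evq_event_t blink = evq_queue_every(ui_q, toggle_led_cb, NULL, 500);  /* every 500 ms */

/* ... much later ... */
evq_destroy_event(blink);                         /* stop and reclaim it      */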

The main advantage to ad-hoc events is ease of use. The main disadvantage to them is the prospect of heap fragmentation as events and event data of various sizes are constantly allocated and freed, leaving the possibility in the long term of a system failing because no available fragment is large enough to allocate an event. (There are mitigation strategies available and, indeed, COROS uses the one outlined.)

Event processing

The core of an application in an event-driven system like COROS is always the event pump. A typical application will initialize hardware, initialize any software components and then simply process events, letting control flow and activity be governed by events flowing through the system. To this end there are three functions reflecting three different strategies for managing events in applications:

void evq_process(evq_queue_t);
void evq_process_for(evq_queue_t, delay_ms);
void evq_process_one_tick(evq_queue_t);

The first of these should never exit. (If it does this is a gross failure and should lead to an application shutdown and/or restart!) evq_process() processes the provided queue and does nothing else in an endless loop.

evq_process_for() does the same as evq_process() but will return after a given duration. This permits the mainline code to periodically perform global maintenance duties like feeding a watchdog timer or the like, giving, in effect, an “idle task” for the application.

evq_process_one_tick() is even more extreme. It will process all currently-due events (if any) and then immediately return. This is provided for more involved or time-sensitive global duties and for hybrid systems which may not be able to push everything they need into the event queue (because of, say, a third-party library). This is also the likely function to be used when processing independent event queues or when children elevate themselves in priority by only processing their own events.
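
Putting the three together, a typical mainline ends up shaped something like this sketch (hardware_init(), components_init(), and feed_watchdog() are placeholders, not part of EVQ):

int main(void)
{
    evq_queue_t system_q = evq_create_queue();

    hardware_init();
    components_init(system_q);           /* each component attaches its own queue */

    for (;;)
    {
        evq_process_for(system_q, 100);  /* pump events for roughly 100 ms ...    */
        feed_watchdog();                 /* ... then do the “idle task” duties    */
    }
}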

User-supplied functions

tick_ms evq_get_tick(void);

The way that various systems get timer ticks is too varied for EVQ to supply one of its own. Instead EVQ provides a weakly-linked implementation that asserts if called. To use EVQ a user must provide their own implementation of this function. This function must return a tick_ms number that increments as accurately as possible once every millisecond. In extreme cases where this granularity is not available (though this is not common on the targets EVQ is aimed at), the returned value may be incremented by however many milliseconds actually transpire between ticks. For example, if the old MS-DOS timer tick of 50ms is used, each tick should increment the return value by 50.
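
For example, on an STM32 target using the ST HAL (whose HAL_GetTick() already counts milliseconds), the implementation could be as small as this:

tick_ms evq_get_tick(void)
{
    return (tick_ms)HAL_GetTick();   /* the HAL's millisecond tick counter */
}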

Closing thoughts

EVQ is of necessity more involved than CO in COROS. It is not complicated, but it does have design decisions that could legitimately cause some scratching of heads.

An obvious design choice in the API is the split between callbacks and coroutines. Very simple event reactions (like toggling a LED or flipping a GPIO) are not well served by coroutines: these are rather heavyweight for such simple, nigh-instantaneous actions and callbacks are just better suited there. On the other hand, doing any kind of complicated logic in callbacks alone is the kind of nightmare that keeps people awake at night.

Less obvious a design choice, however, is the decision to provide a tree structure of event queues in addition to the ability to have independent event queues. The next article in this series will explore some of the reasoning behind this decision (including a Pi Calculus-type approach to task management).

Continuing with the description of COROS that I'd begun earlier, today's article starts applying some structure to what is, at the core, really just a fancy system of continuable function calls.

Where we're at

COROS thus far is just a fairly simple coroutine system. This is, by itself, already pretty nifty and gives us lots of capabilities.

Generators

Out of the box we can get generators, for example. This is nothing particularly special. You can get them in C with normal functions. Here's a cheap and stupid Fibonacci Sequence generator in bog-standard C:

#include <stdio.h>

unsigned next_fib(void)
{
    static unsigned n = 0U, f0 = 0U, f1 = 1U;

    if (n++ != 0U)
    {
        unsigned fn = f0 + f1;
        f0 = f1; f1 = fn;
    }
    return f0;
}

int main(int argc, char** argv)
{
    for (unsigned i = 0; i < 10; i++) { printf("%u ", next_fib()); }
    printf("\n");
    return 0;
}

This simple function is a form of generator. Each call to it elicits the next entry in the Fibonacci Sequence. And for many use cases this is actually fine. This is what you want. It's fast and it works.

Until you want, for whatever reason, two different Fibonacci generators. Then things get hairy. The static variables make it impossible to restart the sequence from the beginning, and working around that starts getting very wasteful of CPU time, RAM, or both. Enter a coroutine:

#include <stdbool.h>
#include <stdio.h>

#include "co.h"

void next_fib(void *unused)
{
    unsigned n = 0U, f0 = 0U, f1 = 1U;
    while (true)
    {
        if (n++ != 0U)
        {
            unsigned fn = f0 + f1;
            f0 = f1; f1 = fn;
        }
        co_yield(&f0);
    }
}

int main(int argc, char** argv)
{
    co_t next_fib_co1 = co_create(next_fib, NULL, 32, NULL);
    co_t next_fib_co2 = co_create(next_fib, NULL, 32, NULL);

    for (unsigned i = 0; i < 10; i++)
    {
        printf("%u ", *((unsigned *)co_resume(next_fib_co1, NULL)));
    }
    printf("\n");
    for (unsigned i = 0; i < 10; i++)
    {
        printf("%u ", *((unsigned *)co_resume(next_fib_co2, NULL)));
    }
    printf("\n");
    return 0;
}

Since the coroutines each have their own stack, and since that stack frame doesn't go away between resumes/yields, we lose the static variables. Without static variables we can have as many Fibonacci generators as we like in our code, each starting from the beginning of the set and moving independently forward in it. It's more expensive than just the function-based one, so don't use this if you only need the one generator! If you need more than one, however, this one can't be beat, especially if you wrap the co_resume() call into a friendlier function invocation.

Sequential, interrupted processing

Consider a command processor in an embedded system. Commands are entered via a serial port. Upon reaching a CR LF pair, the command is completed and processed. (Think of the near-ubiquitous “AT” command systems that abound in communications, for example.)

This is a very typical producer/consumer problem and it is specifically the situation that coroutines were created to deal with. Building a command processor using regular functions is well-known, but it tends to involve either static variables or a manually held and passed-around state block that keeps track of the current built-up string. The former has the flexibility problems already pointed out above (you can have only one command processor) and the latter tends to make code obfuscated and difficult to reason about without a debugger to inspect the state block.

The coroutine, by virtue of being its own state container (you can view a coroutine as a function that carries its stack frame and state around with it), renders the final logic much simpler, rather like this:

void process_command(void *size)
{
    /* one-time initialization */
    void *rv = NULL;
    size_t offset = 0;
    size_t buffer_size = *((size_t *)size);
    char *command_buffer = malloc(buffer_size);
    memset(command_buffer, 0, buffer_size);

    /* main loop */
    while (true)
    {
        char next_char = *((char *)co_yield(rv));
        if (next_char == '\r')
        {
            next_char = *((char *)co_yield(NULL));
            if (next_char == '\n')
            {
                /* we have CR/LF */
                rv = dispatch_command(command_buffer);
                offset = 0;
            }
            else
            {
                command_buffer[offset++] = '\r';
                command_buffer[offset++] = next_char;
                rv = NULL;
            }
        }
        else
        {
            command_buffer[offset++] = next_char;
            rv = NULL;
        }
    }
}

This coroutine function accumulates characters passed to it until it recognizes a full command line (CR/LF termination). It then processes that command and returns the status code from the command's execution. All other times it returns a NULL, signalling “nothing yet”. Typical use case would be like this:

/* stuff goes here */

    size_t cmd_buf_siz = COMMAND_BUFFER_SIZE;
    size_t stk_siz = 32;
    cmd_proc_co = co_create(process_command, &cmd_buf_siz, stk_siz, NULL);
    co_start(cmd_proc_co, NULL);
    
/* stuff goes here */

    while (true)
    {
        char ch = get_char_from_serial();
        void *rv = co_resume(cmd_proc_co, &ch);
        if (end_loop(rv))
        {
            break;
        }
    }

/* stuff goes here */

(Of course the various calls to co_resume() and co_yield() would normally be wrapped in static helper functions to give them readable, descriptive names, but this has been left out to show the role coroutines are playing in the example.)

So what's happening here is that two independent subsystems have been decoupled. The serial port system is likely a simple one that reads the port (or is interrupt-driven and reads the next character in the port's buffer) while the command processor is passed one character at a time until it gets a full command buffer and processes that. The command itself returns something that determines if the command processing loop ends or not.

The coroutine's logic is straightforward and works in simple, linear logic. A more typical system running off of some kind of state machine with callbacks would be harder to work out and harder to spot bugs in. This is doubly true if there's nesting of state machines. (Ask any Node.js developer about Callback Hell.) Each piece is trivial to follow and yet hooks up beautifully.

And this segues into the next nice piece:

Trampolining and IPC

The way the code is used above, it assumes that cmd_proc_co yields only a flag that indicates if the command processor should exit or not. But this isn't the only use of it and, indeed, there's room for argument that we want far more from our system.

In an ideal world coroutines could yield to any other coroutine, but there are complex technical reasons for why this isn't common. (Some coroutine systems actually do this: the yield calls specify which specific coroutine to yield to.) One of the main problems with these kinds of systems, however, is that they get very hairy, very quickly, even if the underlying implementation wasn't itself already a problem. This is why most coroutine systems are yield-to-parent varieties. This doesn't mean, however, that we can't simulate the broader world.

To do this we'd only have to change the dispatch_command() function we handwaved above, and the parent code that invokes the cmd_proc_co. Indeed the processing loop would turn out to look something more like this:

    while (true)
    {
        char ch = get_char_from_serial();
        co_t rv = co_resume(cmd_proc_co, &ch);
        if (rv && co_resume(rv, NULL))
        {
            break;
        }
    }

The (not shown) dispatch_command() now returns a coroutine's co_t handle. If anything is returned at all, it is resumed and that return value determines if we end the processor or not. Literally a two-line change moves us from a value return to an active process return.
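
A sketch of what such a dispatch_command() could look like, assuming a hypothetical command table whose handler coroutines were created elsewhere at initialization time:

#include <string.h>

typedef struct
{
    const char *name;
    co_t        handler;     /* coroutine that carries out the command */
} command_entry_t;

static command_entry_t commands[] =
{
    { "AT+RESET", NULL },    /* handler co_t values filled in at init time */
    { "AT+SEND",  NULL },
};

co_t dispatch_command(const char *buffer)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++)
    {
        if (strncmp(buffer, commands[i].name, strlen(commands[i].name)) == 0)
        {
            return commands[i].handler;   /* the mainline will co_resume() this */
        }
    }

    return NULL;                          /* unknown command: nothing to resume */
}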

And here we have concurrency (albeit cooperative concurrency, not pre-emptive). The individual returned co_t implementations may be more function-like (calculate a result and yield) or they may contain within them complex state machines that yield in several locations using straightforward linear code. We don't know at the level of the mainline code, and more importantly we don't care.

This technique is called the “trampoline” and is a powerful way of doing concurrency in a more understandable, easier to reason about way.

Where we're going

Three little examples show how coroutines can fit in an application in three different use cases. And once you have a trampoline, you have, in effect, a task scheduler.

An operating system, in short, or at least the kernel of it.

Stopping at just this point already brings with it benefits. You could build good small- to medium-sized embedded systems with these uses of coroutines (and a few other obvious extensions of them). Trampolines in particular can serve as the foundation for a very powerful tool for concurrent (but not parallel!) computing in a small space that's easy to reason about.

But what fun is stopping here?

After all, if there are standard techniques for doing things (like trampolining for concurrent dispatching) does it not stand to reason that this could be wrapped into its own API that hides the gory implementation details behind a facade of cleanliness?

That's what the next blog post will be about.

In an earlier post I'd mentioned that I'd written an RTOS in my spare time for fun and then incorporated a (slightly reworked) version of it into code at my job, much to my boss' amazement.

Since then I have been working quietly in the background on gussying up the OS for public release and am now at the stage where I can provide preliminary information on what it is, how it works, and why I'm making it the way I am.

Parts is parts

There are three major parts in this RTOS:

  • Its heart: the coroutine system that handles concurrency;
  • Its skeletal and circulatory systems: the event queue (well, event tree) mechanism that provides its signalling backbone; and,
  • Its organs: the component architecture that it is intended to support (complete with helpful macros to make most components trivially-written code).

Coroutines

Today's topic is an architectural overview of the heart of the system: coroutines. If you don't know what a coroutine is or why you'd want to use one, think of them as functions that can return from anywhere in their body and then resume from exactly that point. The Wikipedia article may help.

I've always been fond of coroutines. They allow linear logic flow in a concurrent system, eschewing the “backwards, inside-out” logic that reactor systems (callback-based event systems like Node.js or most GUI systems) force, while also avoiding the immense effort in defensive programming that pre-emptive tasking demands. They are easier, as a result, to reason about than either of the competing approaches, offer superior performance to pre-emptive tasking (though a bit more costly than reactors), and make the resulting software more reliable and more easily modified.

They do have one significant problem compared to pre-emptive tasking, however: they do not pre-empt. You must release control flow to other coroutines explicitly in a timely fashion. This will necessitate some changes to habits born of pre-emptive systems.

A good way to learn coroutines and how to use them in a friendly environment is to install the Lua Programming Language and try them out there. The coroutines of COROS are strongly modelled off of Lua's.

COROS coroutines

The public-facing API of COROS' coroutines is this:

typedef struct _co_t *co_t;

typedef void (*co_function)(void *);

co_t co_create(co_function, void *, size_t, uint32_t *);
void co_destroy(co_t);
void *co_resume(co_t, void *);
void *co_yield(void *);
extern void *(*co_start)(co_t, void *);

A co_t is an opaque pointer to a structure describing a coroutine's runtime state (or its “identity”). It is obtained by calling co_create() and is used in all other API calls to indicate which coroutine is being operated upon.

co_create() takes a function (that will be the heart of the coroutine), an arbitrary pointer to some data that will be passed to the coroutine, a stack size (number of integer-equivalents in the stack), and a pointer to the stack area. Once this function is called, assuming the returned co_t isn't NULL, the coroutine is ready for business.

co_resume() (equivalently co_start()—the latter is a synonym for the former) will either start a not-yet-started coroutine or it will continue a coroutine that has yielded control. In the first case the void * that is passed in will show up as the void * parameter of the co_function being called. In the latter case it will be the return value of co_yield().

co_yield() yields control flow out of the coroutine (effectively a return statement). Its single void * argument becomes the return value of the co_resume() call that started or continued the coroutine. As stated above, whatever was passed into co_resume() will show up as the return value of co_yield(). This provides coroutines with a simple way to exchange values and/or commands.

co_destroy() does what the name says on the tin. It is not expected this will be often used, but it is provided for completeness' sake.
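
As a tiny taste of the API in action (the stack size and names here are arbitrary), consider a coroutine that simply echoes back whatever its resumer hands it:

void echo(void *unused)
{
    void *v = NULL;
    for (;;)
    {
        v = co_yield(v);            /* yield the previous value, receive the next */
    }
}

/* ... in the mainline ... */
co_t ec = co_create(echo, NULL, 32, NULL);
co_start(ec, NULL);                 /* runs echo() up to its first co_yield()     */

int x = 42;
int *px = co_resume(ec, &x);        /* echo hands &x straight back: *px == 42     */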

Coroutine implementation

COROS is intended to be relatively architecture-neutral. Its only stipulation is that the native pointer size is 32-bits (though I'm rethinking this and trying to come up with simple ways to change even this). As a result some of the heavy lifting of coroutine implementation is put into specific architecture/compiler implementation files. These files implement this API:

struct _co_t
{
    bool     static_stack;
    uint32_t *stack_store;
    uint32_t *stack_pointer;
    uint32_t *stack_canary;
    void     *data;
};

extern co_t co_active;

void co_halt_impl(void);
void co_init_impl(uint32_t **, uint32_t **, co_function, void *, uint32_t *, size_t);
__attribute__((naked)) void co_prologue_impl(void);
__attribute__((naked)) void *co_yield_impl(uint32_t **, void *);

__attribute__((naked)) signals here that we're using inline assembly and that there is no need for C's prologue and epilogue code. (The compilers that I have available all support this attribute, but if this turns out not to be enough, future versions of COROS will employ another mechanism here.)

struct _co_t provides the implementation of the co_t type. A flag indicates if the stack was passed in when co_create() was called and is used for cleaning up in co_destroy(). stack_store points to the base of the memory block being used as this coroutine's stack. stack_pointer is the current top of stack. stack_canary is a pointer to a special, known value whose change indicates that a stack overflowed.

co_halt_impl() implements the logic for a halted coroutine. It is in effect just an endless loop of coroutine yields that ensures that an ended but not destroyed coroutine doesn't crash the system or endlessly use up the CPU. It is architecture-neutral.

The remaining functions are not, and thus will be explained specifically in the context of an ARM Thumb target using the GCC compiler. (Implementation files are written for exactly that specific a combination.) Note that some code (mainly error handling) has been stripped to make this as easy to understand as possible. When COROS gets a public release (soon!) all the missing code will be in place.

co_init_impl()

#define SET_STACK_VAL(O, V) (stack_buffer[stack_buffer_count - (O)] = V)

void co_init_impl(uint32_t **stack_pointer, 
                  uint32_t **stack_canary, 
                  co_function co_cb, 
                  void *co_data, 
                  uint32_t *stack_buffer, 
                  size_t stack_buffer_count)
{
    SET_STACK_VAL(1, (uintptr_t)co_prologue_impl); // lr
    SET_STACK_VAL(2, 0);                           // r7
    SET_STACK_VAL(3, (uintptr_t)co_halt_impl);     // r6
    SET_STACK_VAL(4, (uintptr_t)co_data);          // r5
    SET_STACK_VAL(5, (uintptr_t)co_cb);            // r4
    SET_STACK_VAL(6, 0);                           // r11
    SET_STACK_VAL(7, 0);                           // r10
    SET_STACK_VAL(8, 0);                           // r9
    SET_STACK_VAL(9, 0);                           // r8

    *stack_pointer = &stack_buffer[stack_buffer_count - 9];
    *stack_canary = stack_buffer;
}

Initializing a coroutine, once the architecture-independent code has executed, involves setting up the stack so that upon invocation the appropriate values are in the right places. Most notably we set things up so that our own version of a prologue is branched into when the coroutine is first resumed, and so that certain values are sitting in known registers for that prologue to use.

Which leads us to...

co_prologue_impl()

__attribute__((naked)) void co_prologue_impl(void)
{
    __asm volatile(
        "    mov lr, r6 \n"
        "    mov r0, r5 \n"
        "    bx r4      \n"
    );
}

Understanding this function requires a careful inspection of the previous one. co_prologue_impl() was placed in the stack slot where the saved LR (link register) sits, so when a freshly created coroutine is first resumed, the final pop into pc branches into co_prologue_impl(). Here we mess with three registers. If you look at the comments in co_init_impl() you'll be able to follow along.

First, the LR is set to co_halt_impl(). Any return from this point onward returns into the aforementioned endless loop of yields. This ensures that should an exited coroutine ever be resumed by accident it will simply immediately yield back. Next the data item passed in to co_resume() is set as the first parameter of the coroutine function co_cb. Finally we branch to the co_cb function pointer passed into co_create() (a so-called tail call).

Why we perform all these bizarre gyrations thus far becomes clear when the final implementation function is called.

co_yield_impl()

This is the hardest part of the code to understand because it involves knowledge of the ARM Thumb ISA, how it differs from ARM Thumb2, what the calling conventions are for C compilers, and all the previous functions shown as well. Here it is in all its gory (sic).

__attribute__((naked)) void *co_yield_impl(uint32_t **, void *)
{
    __asm volatile (
        "    push {r4,r5,r6,r7,lr} \n"  // <1>
        "    mov r4, r8            \n"  // <2>
        "    mov r5, r9            \n"
        "    mov r6, r10           \n"
        "    mov r7, r11           \n"
        "    push {r4,r5,r6,r7}    \n"
        "    mov r2, sp            \n"  // <3>
        "    ldr r3, [r0]          \n"
        "    str r2, [r0]          \n"
        "    mov sp, r3            \n"
        "    mov r0, r1            \n"  // <4>
        "    pop {r4,r5,r6,r7}     \n"  // <5>
        "    mov r8, r4            \n"
        "    mov r9, r5            \n"
        "    mov r10, r6           \n"
        "    mov r11, r7           \n"
        "    pop {r4,r5,r6,r7,pc}  \n" // <6>
    );
}

The first thing we note with this bizarre function is that it has two parameters ... which we don't use.

Of course we do use them. They're buried in the stack frame and get shuffled around as part of the register layout. For details on where they end up I'd recommend reading the ARM calling conventions for your compiler (which itself is likely a copy of ARM's own default conventions). For your sanity, however, I'd just recommend that you trust that this code is correct and functional.

No? You want details? OK. You'll be sorry, though!

<1> pushes all the user-saved registers into the stack in a known order (that matches the order they were manually placed in co_init_impl() so consult that function for details).

Starting at <2>, the remaining user registers get stored. The roundabout approach used is a nod to one of the differences between the Thumb and Thumb2 ISAs. Thumb2 would have permitted pushing all the registers with a single push instruction. Thumb only permits pushing registers r0 to r7 inclusive that way. “Extended” registers cannot be addressed that way, so we pull registers r8 to r11 into the registers we just stored and store them again.

At <3> we switch the stacks. This is comically verbose, again because of Thumb. Thumb instructions don't permit ldr/str operations on sp (where Thumb2 does). So instead we copy the sp into r2, load r3 with the value pointed to by the first argument (passed in r0), which happens to be the new stack address, store r2 (the old sp) through that same pointer (which happens to be the stack_pointer member of a co_t, kept for future co_yield() calls), and then finally put the new stack address into sp.

At this point we have switched stacks and are working in another coroutine's stack.

First we put the passed in data value (void *) into r0 as the return value of the function at <4>. We now reverse the operations we did above. <5> restores r8 to r11. <6> restores the remaining user-saved registers and stores what was the lr (link register, a.k.a. return address) value into the pc, causing us to branch to that address immediately.

Well that was ... different

We have now “returned” into an entirely different function's stack frame. The code is hairy when you look at it, but it is sound. Both parameters are used, and the system is set up so that calls to co_yield() switch back to the stack of the co_resume() that called them and, further, so that co_resume() uses either the initial stack laid out manually in co_init_impl() or the stack as the coroutine left it when it last yielded.

People who have implemented (or at least looked into) a pre-emptive tasking system will recognize most of what was done there. In this regard coroutines are not all that different from pre-emptive concurrency systems. Where they differ is the lack of need for synchronization primitives to ensure shared memory structures aren't pre-empted and botched mid-stream. Because the transitions are synchronous, there is no need of the usual system of semaphores, critical sections, mutexes, monitors, condition variables, etc. of pre-emptive tasking systems. In the rare cases these are needed, their implementation is far simpler and can be easily done ad hoc without the gory edge cases of their pre-emptive cousins.

In most programming, dynamic allocation and subsequent freeing of memory resources is normal. In embedded programming, dynamic SRAM allocation from the heap can be an application murderer (largely because having 128KB of SRAM is nigh luxurious and allocating 32KB of that to heap is crazy talk).

The issue that kills is heap fragmentation. If you're dynamically allocating and freeing objects willy-nilly, unless all of the objects happen to be exactly the same size you are going to eventually wind up in the ridiculous position of having enough free SRAM for an object, but no single piece big enough to hold it. Your application dies, starved for memory even though you have plenty of free space.

Response patterns

In response to this, several patterns of memory usage pop up in embedded software. The three most common patterns used by embedded systems programmers are:

  1. All-static, all the time.
  2. One-time dynamic allocation.
  3. Locally-paired allocation and deallocation.

I'll briefly explain what each of these patterns is and then go over some of the issues each one brings with it: constraints on system capabilities and designs, so that the trade-offs you accept when using them are made manifest.

(C++ users, when reading this, should s/malloc/new/ and s/free/delete/ with attendant minor syntax changes as needed in subsequent code reading.)

All-static, all the time

This one is simple and is most common in simpler embedded systems with fixed resources. Consider, for example, a simple device that reads an ADC and periodically calculates stats on the read data, perhaps logging those stats to a UART. Your ADC will, accordingly, need a buffer to write its data into and from which data will be read and analyzed for the reports.

Where do you put your buffer?

You put it in static space:

static uint16_t adc_buffer[64];

When you start your ADC, or analyze its results, your functions take the pointer to that static buffer like this:

start_adc(adc_buffer, 64);

Easy-peasy! If you want to do premature optimization (and thus be evil) you might even foolishly try to make adc_buffer a global variable and forego the trivial extra cost of passing a pointer.

Here's why you don't do that, though.

One-time dynamic allocation

All-static as a strategy works when you have fixed resources known at compile time. But what do you do when the resources can vary at run time and you're not able to predict at compile time? Let's say, for example, that the device mentioned above actually has up to 8 ADCs. You only know, however, which are connected when the software fires up. How do you cope with that?

If you went all-static you'd be declaring your adc_buffer more like this:

static uint16_t adc_buffer[8][64];

Your calls to start or analyze would then pick the right buffer to pass in based on which ADC you're reading:

start_adc(adc_buffer[3], 64);

This is, however, wasteful of resources if you're running in a device that only has 3 ADCs hooked up. A better approach would be something like this:

static uint16_t *adc_buffer;

You'd initialize it at application start-up by doing this:

adc_buffer = malloc(count_active_adcs() * 64 * sizeof(uint16_t));

(I am eliminating a check for that malloc return for clarity. If you don't check the return value of malloc in production code you need to be shot.)

You'd then use it like this, assuming you want the 4th ADC started:

start_adc(adc_buffer+3*64, 64);

(Of course really you'd use a symbolic name like BUFFER_SIZE instead of 64, and a variable like current_buffer instead of 3, but I'm trying to minimize excess code baggage here.)
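For the curious, the de-baggaged version might look something like the sketch below. BUFFER_SIZE and current_buffer are the illustrative names just mentioned, and count_active_adcs()/start_adc() remain the made-up functions from earlier; the malloc check is, as ever, left out here but must not be in real code.

#define BUFFER_SIZE 64

static uint16_t *adc_buffer;

void adc_buffers_init(void)
{
    adc_buffer = malloc(count_active_adcs() * BUFFER_SIZE * sizeof(uint16_t));
    /* check that return value in production code! */
}

void adc_start_one(size_t current_buffer)
{
    start_adc(adc_buffer + current_buffer * BUFFER_SIZE, BUFFER_SIZE);
}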

This is a perfectly safe use of dynamic allocation because it is never freed. It is functionally the same as using static allocation, in terms of heap fragmentation, but it is more flexible in memory assignment. Indeed, for a variety of reasons, I tend to use this instead of static allocation in most of my projects, even simple ones with static resources. (One of the chief reasons: nothing ever stays simple and static over time.)

But it's still not universally applicable.

Locally-paired allocation and deallocation

A common situation I find myself in is having to allocate a temporary buffer to pass into a function for processing. The size of the buffer may vary from call to call and, worse, the buffer may be sizable: say, a 1K buffer. In a typical embedded environment this is far too large an entity to allocate on the stack (as a local variable). On the other hand, it's also a bad thing to have allocated statically. There may be a dozen functions that need this kind of buffering. If each one statically allocated its own buffer (each too large, because the needed size varies unpredictably from call to call), my SRAM would be eaten up before I could even link the app, never mind run it; and if they tried sharing a single static buffer, there's too great a chance that one steps on another's data. So we instead do this:

void my_funky_function(void *data, size_t data_bytes)
{
    uint8_t *buffer = malloc(data_bytes);
    do_something_with(data, data_bytes, buffer);
    free(buffer);
}

(Again, eschewing return value checks, etc. for brevity. Check your malloc in production code!)

This works without fragmenting memory, even if do_something_with() uses this same pattern internally, and maybe a further, deeper function also_do_with(). As long as the malloc and free calls are paired and nested this is not a problem.
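To make the nesting explicit, here is what the innards of those helpers might plausibly look like (pure illustration: the names match the prose, the bodies are filler). do_something_with()'s allocation corresponds to the 2s in the diagrams below and also_do_with()'s to the 3s; the sizes in the diagrams are arbitrary.

#include <stdlib.h>
#include <string.h>
#include <stdint.h>

static void also_do_with(uint8_t *scratch, size_t n)
{
    uint8_t *deep = malloc(n);          /* allocated after 1 and 2       */
    memset(deep, 0, n);                 /* ... some work ...             */
    free(deep);                         /* freed before we return        */
}

static void do_something_with(void *data, size_t n, uint8_t *buffer)
{
    uint8_t *scratch = malloc(n / 2);   /* allocated after 1             */
    memcpy(buffer, data, n);            /* ... some work ...             */
    also_do_with(scratch, n / 2);
    free(scratch);                      /* freed before my_funky_function
                                           frees its own buffer          */
}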

In the following diagrams, _ is a bit of heap memory that is free. 1 is heap memory used by my_funky_function, 2 is used by do_something_with, and 3 is used by also_do_with, a function called by do_something_with. * is a bit of heap memory that is used by something out of scope.

When my_funky_function is entered, the heap will look something like this:

**_***__**____________________

When the memory for the buffer is allocated, and do_something_with is called, it will look like this:

**_***__**11111_______________

When do_something_with calls also_do_with the heap might look something like this:

**_***22**11111_______________

And finally while also_do_with is running, after its allocation, it might look like this:

**_***22**111113333333________

It's all nice and cosy. And here's the sequence as each function finishes and frees its resources:

**_***22**111113333333________   /* also_do_with() active      */
**_***22**11111_______________   /* do_something_with() active */
**_***__**11111_______________   /* my_funky_function() active */
**_***__**____________________   /* my_funky_function() exited */

In any sane heap implementation, the heap's layout upon exiting my_funky_function will be identical to what it was before entering it. There's no room for fragmentation and churn because the nested nature of allocation and deallocation ensures that the structure of the heap is reconstructed on the way out.

But...

All of this is fine and dandy and you can make some very nice applications that efficiently use SRAM without fragmentation this way. But there is one increasingly-common use case not represented here that some readers are already champing at the bit to correct.

What happens if you can't know in advance (even at application startup) how many items you need, and/or you can't put everything neatly into nested pairs?

I faced this problem head-on when writing an event queue system. Events in this system contain a lot of information:

  • time due
  • repetition interval
  • callback/coroutine
  • event data

And while there is an API for firing events from pre-allocated memory blocks, this API is inconvenient to use and makes the event system less attractive.

On the other hand, the first rendition of the event system, which used calloc to allocate an event's resources and then called free (both on the event and on its embedded data pointer) after the event's callback had fired, had a major problem: since the point of firing and the point of activation were often separated in time, and since allocations of events and of event data blocks were interlaced and usually of different sizes, fragmentation of the heap was inevitable. And sure enough, my test system would start failing to allocate new events after only about a thousand iterations of the test.

So what's the solution?

Dynamically static allocation

I borrowed a page from the “One-time dynamic allocation” section above, with a twist: the “one-time” allocation was actually “exponentially-less-frequent multi-time allocation”.

It works like this. Given an event definition:

typedef struct event
{
    uint32_t activation_tick, repetition_interval;
    coroutine_t *callback;
    void *event_data;
    bool in_use;
} event_t;

Declare, in the static space of all the event queue management functions, a pointer to an array of event_t pointers:

event_t **allocated_events;
size_t num_allocated_events;
#define INITIAL_SIZE 4

(INITIAL_SIZE is a number I pull out of my butt at first, but later tune according to measured high water marks to keep reallocations to a minimum.)

Allocating a new event in the system now looks something like this (keeping in mind that I'm not checking error returns and such for clarity):

event_t *allocate_event(void)
{
    if (!allocated_events) /* <1> */
    {
        allocated_events = calloc(INITIAL_SIZE, sizeof(event_t *));
        num_allocated_events = INITIAL_SIZE;
    }
retry:
    for (size_t i = 0; i < num_allocated_events; i++) /* <2> */
    {
        if (!allocated_events[i]) /* <3> */
        {
            allocated_events[i] = calloc(1, sizeof(event_t));
            allocated_events[i]->in_use = true;
            return allocated_events[i];
        }
        else if (allocated_events[i]
             && !allocated_events[i]->in_use) /* <4> */
        {
            allocated_events[i]->in_use = true;
            return allocated_events[i];
        }
    }
    double_allocated_events(); /* <5> */
    goto retry; /* <6> */
}

Wut

This needs some breaking down to explain and justify. (It may help, if you come from OOPsland, to look at the above again attaching the name “object pool” to the technique. Then the key concepts might fall into place, this being, in effect, the procedural precursor to object pools.)

<1> is a check that ensures that there are event slots available at all. If that pointer has not been initialized yet, we make an initial array based on an INITIAL_SIZE that has been selected with an eye toward typical usage patterns observed in the wild. Note that we've allocated an array of pointers to event_t here, not the actual event_t objects themselves!

<2> simply iterates over every item in allocated_events looking for the first available slot. An available slot is either one that hasn't been allocated an event_t object yet (<3>) or one that has been allocated but is no longer in use (<4>).

(I commit the dark evil here of returning from the middle of a function, yes. And I commit an even darker evil later. Both are because the logic is otherwise difficult to follow for no good reason beyond sticking to dogma.)

If that loop finishes without returning, there were no event slots available, so I double the size of the array (<5>) and try again (<6>).

(<6> is my second, even darker evil. I use a gasp! goto! Pull my programmer license stat! Even if the goto-less version of this obfuscates intent for no good reason!)

This trick of double-and-retry looks dangerous at first glance, but keep in mind that in production code I actually test whether each allocation fails and bug out if it does. This is a very simplified version of the final product, pared down specifically to highlight a technique. The way the real code works, it's guaranteed either that there will be a slot available the second time through or that the second pass won't even execute.
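For the curious, the shape of the checked version is roughly this. It's a sketch only: it assumes the slot array has already been set up, die() stands in for whatever “bugging out” means on your system, and try_double_allocated_events() is a hypothetical variant of the grow function that reports failure instead of returning void.

event_t *allocate_event_checked(void)
{
    for (int attempt = 0; attempt < 2; attempt++) /* at most one grow-and-retry */
    {
        for (size_t i = 0; i < num_allocated_events; i++)
        {
            if (!allocated_events[i])
            {
                allocated_events[i] = calloc(1, sizeof(event_t));
                if (!allocated_events[i])
                {
                    die("out of memory allocating an event");
                }
            }
            if (!allocated_events[i]->in_use)
            {
                allocated_events[i]->in_use = true;
                return allocated_events[i];
            }
        }
        if (!try_double_allocated_events())
        {
            die("out of memory growing the event slot array");
        }
    }
    die("no free slot even after growing"); /* unreachable if the grow succeeded */
    return NULL;
}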

So, this is a lot to take in. One question that will immediately come to mind is “why are you allocating individual events dynamically instead of making it an array of event_t?” The reason for this lies in the secret sauce of double_allocated_events().

Secret sauce

void double_allocated_events(void)
{
    size_t current_size = num_allocated_events;
    size_t new_size = current_size * 2;
    allocated_events = realloc(allocated_events, 
                       new_size * sizeof(event_t *));
    memset(&allocated_events[current_size], 0, 
            current_size * sizeof(event_t *));
    num_allocated_events = new_size; /* record the new size for the scan loop */
}

Events fly through the system as pointers. If you had an array of event_t, the moment realloc is called there is a very real (almost certain!) chance that events in flight have suddenly been rendered invalid, because the whole array may have moved. It is for this reason that the array of event slots is an array of pointers, each of which is individually allocated. While in theory this leads to some inevitable heap fragmentation, in practice (if the system is properly tuned with a good INITIAL_SIZE) the number of such reallocations is minimal and thus so is any attendant fragmentation. Even using a small INITIAL_SIZE value (2, say) leads to a minimal amount of fragmentation, because the exponential nature of the doubling means your array of event slots will rapidly grow to the size you need, after which it never has to be reallocated again.

Deallocation

Deallocating an event now looks like this:

void deallocate_event(event_t *e)
{
    e->in_use = false;
}

Events are just flagged as available, and the next time a “dynamic” event is needed, an available event is returned.

Improvement

There is room for improvement in the above design. Specific improvements could include:

  1. Linear growth.
  2. Chunked event allocation.
  3. Event availability management.

None of these are currently implemented because the event queue system currently in place more than meets my needs in real-time embedded control. Should this change, however, there are obvious improvements which could be made that will not in any way change the API (and thus the clients) of the event queue system.

Linear growth.

I use a doubling scheme here. This means that if I use a single event past a threshold, I have a lot of slack space. For example with an INITIAL_SIZE of 4, if my high water mark is 5, I will have three event_t * spaces left doing nothing.

This is not a major problem in my systems thus far precisely because the slot array is an array of pointers. The slack space in the above example (taken from real world data) is 12 bytes on my target machine. I'm not yet at the point where I'm counting down to double-digit bytes for eking out memory. If I get there, I can use a different growth scheme like going up by INITIAL_SIZE instead of doubling (possibly using a linked list so that I'm not even reallocating). This will increase the number of (re)allocations (with commensurate risk of fragmentation) in exchange for lower overall memory usage.
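A sketch of what that alternative growth function might look like, using the same names as above (and eliding the realloc check just as the other listings do):

void grow_allocated_events_linear(void)
{
    size_t old_size = num_allocated_events;
    size_t new_size = old_size + INITIAL_SIZE;     /* grow by a fixed step     */

    allocated_events = realloc(allocated_events,
                               new_size * sizeof(event_t *));
    memset(&allocated_events[old_size], 0,
           INITIAL_SIZE * sizeof(event_t *));      /* clear only the new slots */
    num_allocated_events = new_size;
}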

Chunked event allocation.

Currently I calloc new events as I grow into the empty slots. Given that this can occur at random points in the execution time, it guarantees a small amount of fragmentation. Again real-world data suggests it's not a problem for me, but should it become a problem, I would improve this with chunk-allocation of events both the first time that the events are allocated and when they're doubled (or otherwise grown).

The technique would be to calloc(count, sizeof(event_t)) and then treat the result as an array of event_t, putting the pointer of each item into the allocated_events array. The advantage of this is that by allocating the chunks all at once I reduce fragmentation opportunities. The disadvantage is that I waste, in the previous example (high water 5 with an INITIAL_SIZE of 4), 60 bytes of RAM (3 unnecessarily-allocated events, each sized 20 bytes).
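A sketch of what that chunked growth might look like: one calloc() for the lot, with each element's address dropped into consecutive (and currently empty) slots. The parameter names are illustrative, and the calloc check is elided as elsewhere.

void chunk_allocate_events(size_t first_new_slot, size_t count)
{
    event_t *chunk = calloc(count, sizeof(event_t));      /* one allocation for the lot  */

    for (size_t i = 0; i < count; i++)
    {
        allocated_events[first_new_slot + i] = &chunk[i]; /* slots point into the chunk  */
    }
}

The one wrinkle is that events allocated this way must never be individually free()d, which costs nothing here because deallocate_event() only clears in_use anyway.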

Event availability management.

Currently I read the event slots from the beginning of the array each time. This wastes a few cycles in a loop (sometimes twice!) which could be mitigated by either keeping track of the lowest available slot and starting from there, or by moving the pointers around to the beginning on availability.
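A sketch of the first variant, tracking the lowest available slot. The invariant is that every slot below lowest_free_hint is in use, so scans can start at the hint rather than at zero. slot_index is a field that would have to be added to event_t, the error checks are elided as in the other listings, and none of this is in the real code, for the reasons given next.

static size_t lowest_free_hint; /* every slot below this index is in use */

event_t *allocate_event_hinted(void)
{
retry:
    for (size_t i = lowest_free_hint; i < num_allocated_events; i++)
    {
        if (!allocated_events[i])
        {
            allocated_events[i] = calloc(1, sizeof(event_t));
        }
        if (!allocated_events[i]->in_use)
        {
            allocated_events[i]->in_use     = true;
            allocated_events[i]->slot_index = i;
            lowest_free_hint = i + 1;        /* everything up to i is now in use */
            return allocated_events[i];
        }
    }
    double_allocated_events();               /* the array really was full: grow it */
    goto retry;
}

void deallocate_event_hinted(event_t *e)
{
    e->in_use = false;
    if (e->slot_index < lowest_free_hint)
    {
        lowest_free_hint = e->slot_index;    /* a lower slot is free again */
    }
}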

The truth, however, is that allocation time costs are minimal and the code complexity of managing more efficient allocation is not warranted for my problem domain. I deal with events on the multi-millisecond scale. Walking the array of events on allocation is on the multi-microsecond scale. Three orders of magnitude difference in time means that it will be a while before this cost is large enough to warrant the extra complexity.

Summary

Dynamic memory in embedded systems is an application killer if not handled carefully. In this little essay I surveyed three very common techniques to rein in the madness (all static, one-time dynamic, nested allocate/deallocate) and introduced another less common one that also serves to rein in the reign of terror (along with outlining a few ways to enhance even that).

There is really no reason to be afraid of dynamic memory in embedded systems. Just be cautious and creative.

There's a lot of talk (including from me) about how unreliable software is and how it's embarrassing that it can't ever be gotten right. The reason for this, however, is pretty clear if you're inside of it making it. I'll illustrate with a problem I found and solved today.

The problem code

So, I spent all of Friday, all of Monday, and half of Tuesday trying to figure out a problem in a library I'm responsible for. There's a set of states being driven by a call to a “booster” function that simultaneously keeps the hardware alive and kicking (since I'm eschewing interrupts for assorted reasons) and that reports the ensuing state of the beast.

The code looked something like this (vertically compressed, virtualized):

switch (booster()) /* <1> */
{
case 1: handle_booster1(); break;
case 2: handle_booster2(); break;
case 3: handle_booster3(); break;
case 4: handle_booster4(); break;
default: break; /* no special handling */
};

This worked. Once. The first time that block of code was executed, the appropriate case was caught and handled. Every subsequent iteration had it fail. It fell through the switch matching nothing, 100% of the time.

And it was weird. No matter how I approached it, if I instrumented it (blinking a LED) it failed. If I debugged it, however, everything worked in the mechanisms behind the call to booster(). Everything worked leading up to booster(). Everything worked after booster(). But that switch statement only worked once.

Then, even more bizarrely, if there's a breakpoint set on <1> above, the switch statement worked 100% of the time. It just had to pause right there, then continue, and it all worked out fine.

The solution code

int rc = booster();
switch (rc) /* <1> */
{
case 1: handle_booster1(); break;
case 2: handle_booster2(); break;
case 3: handle_booster3(); break;
case 4: handle_booster4(); break;
default: break; /* no special handling */
};

This code works. 100% of the time. With or without debugging. Every iteration. There's no reason for this. Semantically this code is identical to the previous version. Hell, even an '80s-era peephole optimizer should be generating identical code in both cases. Yet the first case broke, the second case does not.

And this is the problem

What's here is obviously a compiler bug (and a pretty bizarre and ugly one at that). Now I work in embedded systems, so I face less of this problem, but even I get hit with this single, huge, elephant in the room of software development: We make (buggy) software with (buggy) software that is itself based on (buggy) software. (And usually sits on top of buggy hardware. But that's a story for another time.)

How do you make reliable software in that case? In my own far simplified case of writing software I have to deal with:

  1. Hardware. (Which is often buggy. Errata sheets are a thing for a reason.)
  2. Driver. Vendor-supplied or my own, they're usually buggy. And they're usually written in C. Which is itself a piece of buggy software. (Which is itself usually written in C which is a piece of buggy—stop me when you spot the pattern here.) Which itself uses a standard library that is buggy (and written in a buggy compiler). Which … (I think the point is clear now?)
  3. Standard C libraries. Which are buggy and built on buggy software.
  4. My own libraries. Which are buggy and built on buggy software.
  5. My application logic. Which is buggy and built on buggy software.

That's five (often recursive) layers of bugginess. And I'm programming “close to the metal”. Consider a more traditional programmer: Hardware → drivers (note: plural) → kernel → OS services → compiler → standard library → web browser → interpreter → application framework → network stack → … and that list goes on and on and on.

Why is anybody surprised at how slipshod software is?

And this is part of the solution

One of the major problems that this wobbly stack of fruit-flavoured gelatin we call software has at its foundation is the selection, at its core, of a single, terrible programming language: C. Even when C isn't the language of direct implementation, I guarantee you that throughout that stack, even the truncated version I supply, there is C code clutching at the software with its sloppy fingers. The drivers are likely written in C (with a smattering of assembler), as is the kernel and half or more of the OS services. The compiler was either written in C or it likely generated C in the back-end. The standard library was likely largely written in C. The web browser likely has a huge blob of C in it. The interpreter was likely implemented in C, as was the network stack.

And C is a terrible language for reliability. It was seemingly designed to trip up programmers by defaulting to bad technique. Can you write reliable software in C? Sure! You'll just be incredibly unproductive as you go out of your way to second-guess every aspect of the language: checking bounds on each operation, for example, or inspecting each and every return value (yes, printf() included!) to ensure correct operation.
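A contrived taste of what taking that seriously looks like, just for one innocuous-looking output line (and almost nobody writes C this way, which is rather the point):

#include <stdio.h>

int log_line(const char *msg)
{
    if (printf("%s\n", msg) < 0)   /* yes, printf() can fail            */
    {
        return -1;
    }
    if (fflush(stdout) == EOF)     /* ...and so can the flush behind it */
    {
        return -1;
    }
    return 0;
}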

And every time, in this exhausting, easy-to-overlook process you slip up, that's a bug waiting to slip through even in someone else's code entirely. Like my mysterious switch statement problem.

So stop using C unless you're forced to. And if you're forced to, start planning your exit from C. (And not in the direction of C++ either: it's literally got all of C's flaws and then adds a few hundred of its own!) There are languages out there which are good for writing stable, reliable software. On the low-level/imperative side there's the Modula family (Modula-3 being my historic personal favourite). There's also Ada (my current favourite in that level) and Rust.

On the high level side there's other languages you can look at. OCaml is pretty devastatingly reliable, yet fast, for example, despite being a “wasteful” functional language with honest-to-goodness automated memory management. My friend Paul Bone is working on the Plasma programming language which is also another non-C language (implemented in a non-C language) that should give, when complete, a good, solid platform for high-level, reliable software.

TL;DR LOL!

Stop using C. Stop using languages whose design notes have been informed by C. C was a fine language for its time, its place, and its target platform. It has no business being so central to software half a century later like we didn't learn anything.

We did. Start applying it!

There is a crisis in software development.

Wait, sorry, there are so many crises in software development that I need to be more specific.

There is a crisis in generating new programmers.

I know this sounds crazy in the face of an industry that employs, seemingly, more programmers than there are people on the face of the planet, but I stand by it nonetheless. This is because there's a certain kind of programmer that's missing: the programmer that's doing it for the love of programming.

There's a huge difference between the programmer who loves programming and the programmer (the so-called “tech-bro-dude”) who loves the dream of money as they churn out Yet Another CRUD Application And Ad Delivery Mechanism. And here's the thing.<1> Programmers who do it for the love are:

  1. the source of all actual advancement in the field; and,
  2. made, not genetically predetermined.

The kind of people who become programmers out of love are likely determined at least partially by genetics, but that kind of person could just as easily get fascinated by music, or by mechanical devices, or by any number of other fields of interest. It all depends on the on ramps available to catch (and hold) their attention long enough to plant that little seed of love.

“I's looking behind us now into history back …”

Back when I started programming—back when I fell in love with it—there was really only one game the beginner could play: BASIC. All the computers a beginner had access to had BASIC, whether this was DEC machines like the PDP-11 or personal computers like the Commodore PET. Many of the latter would start up in BASIC, in fact (as did some of the former!). You turned on the machine (or logged into your account) and there, after a (hopefully) brief burst of nonsense words that meant nothing to beginner's eyes (but often made important questions arise) came that lovely, welcoming, reassuring prompt:

Ok
_

This prompt told us what we needed to hear. That the machine was ready. And that everything was OK.

From here we could type commands, slavishly following some book or magazine (remember those?) perched precariously in our laps or to one side:

print 'Hello, world!'
Hello, world!
Ok
_

We could change the commands and see what happened. We could set variables and print them out. We could input values into variables and print them out. Maybe even alter them in the process!

Such power was at our fingertips and we got our results back instantly in a tight feedback loop of joy and dopamine. And then we got to programming.

10 input "What's your name? ", name$
20 print "Hello, ", name$, " ",
30 goto 20

The specifics of syntax will have varied slightly from machine to machine, but the gist was the same. And at our fingertips we had power. Power to command a machine to repeatedly type the same thing over and over and over again for all eternity if we so chose. (We never did. We did, however, quite frequently alter the message to something profane. Because children.)

There are two key metrics that BASIC nailed from the get-go:

  1. Time to 'Hello, world!'
  2. Time to altering 'Hello, world!'

Very few languages before it or since had that heady mixture of instant feedback and (seeming) endless power. More nerds got hooked on programming, falling almost immediately in love, because of that feedback loop and, in the process, learned quickly. BASIC was my on ramp to programming and led very quickly to learning 6502 assembler, then Pascal, then another assembler dialect and, before long, a whole bunch of other things in computer science, software engineering, and programming.

I have BASIC to blame for, essentially, over 4 decades of my life. And my story is the story of thousands of people just like me scattered to the four winds in the software industry. It's not an accident that the early to mid 1980s were some of the most creative (if sometimes insane) periods of program design and development. It's what happens when you unleash nerds in love on an industry.

“Time counts and keeps counting …”

Over time, of course, the limitations of BASIC became manifest. Yes, amazing things were made with BASIC. (Like an astonishingly realistic flight simulator on a TRS-80 Model II I typed in from a magazine and then tinkered with.) Yes, BASIC was actually a powerful programming language that lurked, all muscular and jacked, behind the easy-seeming interface.

But it was flawed. So flawed that big names in computing dumped on it. And as much as we nerds who used it as an on ramp loved our first language (you never forget your first!), we knew, too, that it was flawed.

So we spread out. Pascal. Modula-2. That nifty “C” language we'd heard so much about. C++. Java. Python. Ruby. Language after language flowed beneath our fingertips and we accomplished so much more with these languages, with their power and their extensive communities and libraries. In the process we became software engineers and internalized a lot of engineering “best practices”.

But we forgot our roots.

We forgot to keep a good on-ramp for burgeoning programmer nerds. An easy way to do cool things with ever-more-powerful hardware that would attract the attention of our kind before they got swallowed up by video games, by music, by anything else out there that wasn't programming.

People tried to make pedagogical tools for teaching programming, but that was the problem. They were pedagogical tools, not authentic programming languages. In the distant past there was Logo, for example, and it was a powerful tool for teaching programming concepts … but not for programming. It was like those interminable “educational toys” that get foisted on parents as being good for their children; the kind children would tinker with for a while before going back to their toy cars, G.I. Joes, Barbies, and other such toys that were actually fun.

I saw lots of children put in front of Logo. I saw them fascinated by it for a while. Then I saw that interest fade as the limitations of the environment closed in on them. And I've seen the same thing with successor or competitor environments up to today's Kojo or Scratch. These kinds of tools are an onramp that leads to a brick wall. To get past that brick wall we're faced with engineering tools.

And as powerful and as capable as those engineering tools are, they are not fun for newcomers. And the interest drops off since to do “real programming” you have to do all this other confusing, boring, tedious stuff. The result is we get the wrong kind of programmer. We get the kind who hear that “coding is money” and come in to get themselves a slice of that pie. The kind who only want to learn things that look good on a resume so they can get a thicker slice next job.

The kind who have no joy in the art and the science of programming. The kind who think the height of technical achievement is finding ways to blast ads into the eyes and ears of everybody around them.

“… finding the trick of what's been and lost ain't no easy ride, but that's our track.”

We need to find that on ramp again. That thing that requires a minimal amount of effort (like launching an app) to get you programming, the way turning on a '70s or '80s era personal computer did. Yet also a thing you can (and would want to!) do serious programming in, here in the 2020s.

This is a tall order, I realize, but I also think it's an essential for our industry. Our industry has not meaningfully progressed in decades, hitching a ride on hardware improvements without really improving our software techniques in any profound way. (No, object-orientation is not a profound improvement. Functional programming might barely qualify. Barely.) And to improve our software techniques, we need to get our nerds back. The people who program because they love programming, not because they love stock options and paycheques. (I mean they'll love stock options and paycheques too, obviously, but that shouldn't be their primary motivation.)

Some people might point to things like FreeBASIC and they may have a point. Others may point to venerable projects like Tcl/Tk and may have a similar point. There's plenty of possible on-ramps, but they tend to fail in practice for reasons that are subtle because they're not being built to be on-ramps. Consider this for FreeBASIC, for example:

FreeBASIC gives you the FreeBASIC compiler program (fbc or fbc.exe), plus the tools and libraries used by it. fbc is a command line program that takes FreeBASIC source code files (*.bas) and compiles them into executables. In the combined standalone packages for windows, the main executable is named fbc32.exe (for 32-bit) and fbc64.exe (for 64-bit)

fbc is typically invoked by Integrated Development Environments (IDEs) or text editors, from a terminal or command prompt, or through build-systems such as makefiles. fbc itself is not a graphical code editor or IDE!

This is not an on-ramp. This is a compiler (an engineering tool!) that requires several non-coding steps to use. There's no feedback loop. There's no instant result from changes made in front. There's no way to test a piece of code you want to try out before formally entering it into your program. FreeBASIC is a very good language and piece of code … but it's not a viable on-ramp.

Tcl/Tk is better as an on-ramp. You get a prompt (although it's not as friendly and reassuring as Ok was!). You can do things live in that prompt. Hell, like some BASIC environments (chiefly on personal computers like the PET, but also in some PDP-8 and PDP-11 environments I used) you can even access commands of the underlying operating system and, importantly, capture their output and use them in your programs. It's all very powerful.

Pity it's such a pain in the ass to install.

No, really. I'm not joking. I've been working with computers since 1979 in some capacity or another, and I still have troubles figuring out which Tcl/Tk to install and how to get it to run and such. It's a powerful language (far more powerful than people give it credit for!) and it can accomplish amazing things, but it is definitively not a “download this app and run it and it just works” kind of experience. A prospective on-ramper is hit with decision paralysis right from the outset before even typing puts "Hello, world!". This is an on-ramper who is now at risk of walking off to pick up a guitar, or a paintbrush, or whatever else catches their fancy and another proper programmer lost to us.

And that's not the end of it all. There's still one more problem with FreeBASIC, Tcl/Tk and other projects in that vein: they're not web-native.

BASIC spoke the native language of computers of the time, as did Tcl/Tk: text and occasional bits of graphics or sound. The native language of computing today, for good or for bad, is the World Wide Web. And none of the viable on-ramps in terms of combining ease of use with power is web-native. Making them work with the web, indeed, is so fraught with difficulty that it's off-putting, not attractive.

“… we knows there'll come a night when they sees the distant light and they'll be coming home.”

So this is where we stand. As an industry we've stagnated. Most of our programmers want to put ads in front of our eyes, and that's the extent of their ambitions (because they're in it for the paycheque, and ads before eyes is what pays). The people who could change the status quo—who could introduce the world to whole new ways of looking at computers—are dying out and not being replaced by enough new blood to move us forward. We're losing the creative, interested, inquiring minds to other fields that still reward this and as a result we're in trouble.

We need an on-ramp. We need a tool that does for modern children what BASIC did for my generation. We need a tool that is:

  1. web-native, because that's where children live;
  2. low-friction to enter, so that children can slide on in with ease;
  3. high-power behind the scenes, so children don't quickly grow into the boundaries and get bored.

We need this so that the people who program out of love can see the distant light and come home.


<1> Yes, I apologize for that. It might not happen again.

I'll put the TL;DR summary of my experience with FreeBSD on a Raspberry Pi 4B here: https://www.youtube.com/watch?v=K4Fvsgv0bYw The best thing I can say about the experience is “at least it wasn't OpenBSD”. (More on that in the postscript to this essay.)

OK, so, that's a bit unfair. Not everything about the experience stank. And this is the problem. I really wanted to like this. Of the non-Linux, non-Windows alternatives, FreeBSD comes so close to being usable that it's incredibly frustrating that it fails so profoundly in the end.

I'll be dividing this review up into three parts: Plus, Minus, and Interesting. Plus has the good things about the experience of trying out FreeBSD on a Raspberry Pi 4B. Minus has the bad things. Interesting is a grab bag of observations that are not good, not bad, but, well, interesting. Things to note that may be bad or good for individual users, or which are just noteworthy in some way or another but not really a judgement. Then, at the end, I'll have a postscript, as mentioned, about OpenBSD and why there will not be a full review of OpenBSD on the Raspberry Pi 4B.

Plus

Despite the negative overall tone this review is going to have, there actually are positive things in my FreeBSD experience.

Installation

The first (and largest) of these is that FreeBSD is astonishingly easy to install. The last time I tried FreeBSD (on a PC) was version … I want to say 9, but it might have been 7. Then it was a problem. A big one. The only reason I continued into it as far as I did was that I was curious to see where the rabbit hole ended. And I did finally get a working FreeBSD. FSVO “working”.

This time the process was utterly painless. I dded the image to an SD card, stuck that into the Raspberry Pi, plugged in the network cable and powered up the machine. Eventually nmap found a new host on my network and I was able to ssh into it using the account freebsd (password freebsd). From there everything just worked, modulo the different approach to things.

Documentation

Other nice things were that it was pretty easy to find information. Between the installed man pages and the FreeBSD web site, I was rarely left at a point where I had to decode what the next step would be. (This is not new. I had a similar experience the first time I tried FreeBSD on a PC.) The FreeBSD community seems to believe in documentation, which is a rarity in the field. (It's still documentation written by software people, so terrible, but it's actually present which is a huge bonus.)

pkg is kind of slick

In terms of functionality and ability, pkg is pretty slick. I rarely had to refer to its documentation on how to do things. The only thing that confused me a little until I checked the docs was what pkg list really means. (And the docs cleared that up instantly.) Were it not for a major, glaring, horror show of a problem I'd put pkg up there on the list of package managers I don't hate.

Minus

All that being said, the minuses in my opinion grossly outweighed the pluses. One was a showstopper for me, though arguably not for all possible users.

pkg problems

One of the very first things that is suggested is to type the command pkg list. This begins a rabbit hole of lovingly invoking every single one of the 8 fallacies of distributed computing. For some reason the package manager isn't installed by default on the boot image. Instead a placeholder script downloads it and installs it. (For all I know it may even build it.)

This was disastrous. Typing pkg list (or any pkg subcommand) resulted in a message saying that pkg isn't installed, that it would have to be downloaded before it could be used. It even said what URL it was downloading it from. Then nothing. For ages. No feedback. No progress indication. No pacifier of any kind. Then, about 15 minutes later, typically, it would just say it failed. Type pkg <whatever> again and it starts up the process all over again. And again. And again. And again. Whatever progress it had made was lost and it seemed the task just began from the beginning again.

Three hours later, just as I was about to give up and scrap the use of FreeBSD entirely, it finally came through. Then normal operations started to work and, thankfully, with both progress feedback and without the constant failure. The package lists got updated and the package database could be queried quickly and easily.

But…

That heralded the real problem: poor infrastructure combined with even worse handling of the poor infrastructure.

The infrastructure

Poor infrastructure is a given when you're a fringe project. This is unavoidable. Being fringe means having fewer resources (in this case mirror sites with high bandwidth). The result is an infrastructure that is painfully slow outside of narrow blessed areas.

I'm not in a blessed area. My download speeds for packages ranged from about 8-32KB/s. Picture downloading a full GCC implementation (the embedded systems spin) at that rate. We're talking download times of over an hour. (About 1.5 hours, actually, predicted by the progress pacifier.)

Only, when you smear a download out over that much time, you dramatically increase the chances of an interrupting glitch. Indeed, about 15 minutes on average was how long it took before it just gave up.

The handling

Which is where we segue neatly into the handling of the infrastructure. If your infrastructure sucks (through no fault of your own!) you have to be smarter about its use.

The FreeBSD developers are not being smart about its use.

When (and not if) the download fails, instead of saving progress thus far and continuing, they throw away the download. You start over again from scratch. So if you get 10% of the way through, instead of continuing with 90% left over, they continue with 100% left over. Repeatedly and at length.

In practice this means that the package will never be installed.

There is no excuse for this handling either. We've known how to do restartable downloads for decades. If I partially install a package using APT or YUM, for example, killing it (or having it die) mid-install, it picks up from where it last was and continues, because it uses signatures at the end to make sure that the whole file got picked up without an invalid package getting into the loop. (An even smarter system could use hash trees to identify the specific blocks that got damaged and re-download just those blocks.) Restarting from scratch on something that failed is just going to make it fail again with no progress.

From the 8 fallacies of distributed computing this is the first three put together into one flaw. It's a flaw that all package managers used to have, but most today don't. Except, apparently, pkg. And it's the flaw that made me in the end have to drop FreeBSD as an option on my Raspberry Pi 4B not because I didn't like it, but because it literally couldn't actually do what I needed from it.

Which is a real pity.

Performance

The very badly-implemented package manager aside, there was one more weird problem that made FreeBSD look dubious: looooooooooooooooooong delays in things.

  • It took forever from boot to being able to SSH in.
  • Every time I used su, there was a palpable delay of up to ten seconds before it would pop up the password prompt.
  • Even signing in by SSH once the SSH daemon was running took forever and a day.

If I were to take an educated guess, the hardware RNG is not being used, so FreeBSD's kernel is using software means to fill its entropy pool. This wasn't a showstopper problem (and may have had a findable solution had the pkg problem not stopped me in my tracks), but it was definitely not a good look. And really, given that this is a build for a specific machine with fixed, known-in-advance hardware, if a solution does exist there isn't a good excuse for not shipping it enabled by default.

Interesting

Some observations that aren't praise or critiques (or may be one or the other depending purely on taste).

sudo

The FreeBSD approach to system use and administration is very old school. Where literally every Linux distribution I've tried on the Raspberry Pi 4B used the sudo approach (to the point that logging in as actual root was actually a problem in some cases!), FreeBSD defaults to old-school root/user divides where you have to su to the root account and do maintenance tasks there. It took me a while to unlearn the reflex habit of typing sudo before admin commands. Note that you can install sudo if you really want to do it the other way, but this is not the native way of doing things and I'm not sure that it's fully supported. I'm not sticking around (for reasons already cited) to figure out if using sudo is viable or not, but it does at least seem to exist which suggests it might be.

“Ports” vs. “Packages”

Another case where FreeBSD differs from most other OSes is the schizophrenic divide it has between its “Ports” and its “Packages” infrastructure. “Packages” are precompiled binaries of software (in the vein of Debian's APT or Red Hat's YUM suites) while “Ports” are a massive tree of Makefiles that you use to download and build software according to presupplied recipes (and configuration scripts) more in the style of Gentoo and that family of Linux distros. Packages tend to be easier to work with while Ports covers more software. Which you use is a matter of taste.

It is generally strongly advised in the FreeBSD docs to pick one or the other and to not try and mix them. I chose Packages because the notion of compiling everything (with the inevitable screw-ups caused by bad configuration files—the reason why my previous trial of FreeBSD ended, in fact) just gives me hives. You may choose differently and it's interesting that FreeBSD supports both. (It's a pity, however, that the two don't appear to coexist peacefully.)

Conclusion

FreeBSD is not ready for mass consumption yet, but it is frustratingly close. The FreeBSD community has put a lot of work into improving it and it shows. Unfortunately it doesn't show enough. There is still the traditional Unix cowboy coding attitude showing through in both the utilities they provide and the infrastructure that supports the system, and that results in a system that is, for me personally, completely unusable.

But it's close. Indeed it's so close that if you're in a place that is well-supported by their infrastructure it may actually be ready for you. It may be a viable alternative. But until the developers of the tooling and infrastructure learn and understand the 8 fallacies of distributed computing and incorporate this understanding into their tooling, FreeBSD remains an also-ran for the majority of their prospective user base.

Which is too bad, because the world needs more competition in OS space, not less, and FreeBSD is oh-so-close to being viable as that.

Postscript: At least it wasn't OpenBSD

As bad as the FreeBSD experience was, the OpenBSD experience never happened because it's a fucking disaster from before you even install the operating system. I was six clicks deep into that before I finally hit the ARM64 support page for OpenBSD. That page contained link after link after link after link after link after link to things that had absolutely nothing to do with OpenBSD on ARM64 platforms. They were mostly link after link after link after link after link to product home pages that didn't mention OpenBSD in any way, shape, or form. Finally, in the bottom three lines of a very long web page there were links to:

Getting and installing OpenBSD/arm64:

The latest supported OpenBSD/arm64 release is OpenBSD 7.0. Here are the OpenBSD/arm64 installation instructions.

Snapshots are made available from time to time, in this location as well as on a few mirrors. Here are the OpenBSD/arm64 snapshot installation instructions as well.

What's a snapshot? Fuck knows. Which is the preferred approach? Don't look here for clues. I mean why the fuck would you want to make installation of your product easy to gather more users? Obviously you want to make it as difficult and confusing as possible so that you can only have the 1337 users, leaving behind the people who actually just want to do things with their hardware. (You know, the fools who think computers are tools, not an end unto themselves!)

Oh, note that the instructions are delivered in plain text. Despite, you know, having links inside of them. That's a bonus “fuck you” for the end-user, as is the fact that you require the serial console to install and run the images … if you ever figure out how the fuck the images are supposed to be loaded (foreshadowing! hint: it's install70.img).

I'm going to say this again for the people in the bleachers: unlike literally every operating system I tested for the Raspberry Pi 4B, including FreeBSD!, OpenBSD requires the serial terminal to get anything working. (Almost as bad, to be fair, are Fedora and Void, two Linux distributions that need a screen and keyboard to configure before you can use them. No headless installs.)

At this point I stopped even considering OpenBSD (as I did Fedora and Void). BSD fans are always going on about how, supposedly, it was legal shenanigans that killed BSD in favour of Linux. The reality is that BSD killed itself by sucking. FreeBSD is frustratingly close to being usable and really just needs an actual infrastructure in place, but OpenBSD is almost the Platonic Form of an utterly useless piece of shit for 99.44% of people who might be interested in installing something other than The Latest Version of Linux™.

I make no secret of my love for weird programming languages. Whether spending time on Rosetta Code adding bizarre sorting algorithms in bizarre programming languages, or writing tutorials for how to use modules in a Prolog dialect, or helping a friend build a new programming language from scratch, programming languages are the thing that has fascinated me from my early days with Commodore and TRS-80 BASIC, through assorted assemblers, and then an explosion of other (often very obscure) languages over the decades. The very idea that we could use something that looks, at least at a trivial glance, like actual words to tell computers what to do is just a deep joy in my life of programming.

But...

I generally hate software and the process of creating it. This is largely because most people who make software are either:

  1. unable to make good, reliable software;
  2. unwilling to make good, reliable software; and/or
  3. forced to make shoddy, unreliable software by outside forces.

One of the major problems, I feel, in the making of software, is a very bad choice in programming languages. Most especially the choice to use C or one of its descendent languages. Thus I'd like to first explain why I think C is a bad base language and why I view it as a blight on computing, and why even the descendants that improve on it miss the mark. After that I'll introduce some languages I think do the job properly.

The case against C

C was a powerhouse of a language for its original, intended target. That original intended target was for porting the then-nascent Unix operating system from a DEC PDP-7 to a DEC PDP-11.

For reference, the PDP-7 had 4K (18-bit) words standard memory and could be expanded (at tremendous cost) to 64K. The PDP-11 starts at 8KB (bytes) memory and over the years could go as high as 4MB. If you have a USB charger it has more computing horsepower in it than the PDP-7. If you have a USB 2.0 hub it has about as much computing horsepower (possibly more) as a higher-end PDP-11.

This was the environment in which C was created. The compiler had to be small and simple because the machines were cramped. This was OK, however, because the amount of complexity you could have in your programs was similarly limited. For writing small kernels and simple utilities—which all original Unix utilities were—C was acceptable. Not great. Acceptable. It's when you start programming at scale that the C 'philosophy' (read: 'attitude') fails.

So where does C fail us?

Modularity

The problem is that C doesn't scale. At all. The larger your program gets, the more obvious this becomes. You start relying on ad hoc rules of thumb and weird boilerplate to get around its limitations. A prime example of this is the eternal …

#ifndef FAUX_MODULE_HEADER_INCLUDED
#define FAUX_MODULE_HEADER_INCLUDED

.
.
.

#endif /*!FAUX_MODULE_HEADER_INCLUDED*/

… you will find in every piece of C code of substance in some form or another.

This is C's answer to modularity. It's three lines of boilerplate added to practically every header file, and it exists only because C doesn't do actual modularity. The closest thing C gets to a declaration/implementation divide is a header file (.h) and a corresponding source file (.c). And that divide is purely one of convention. Convention that starts getting messy very quickly.

But it goes beyond weird-looking boilerplate. Modularity in C simply doesn't exist unless you force it to by convention. Convention that's actually hard to pull off in many cases.

In theory, for example, you're going to want to have only public API declarations in your .h files and only private API declarations and implementations in your .c files. In practice, however, there's nothing enforcing this. Header files are brought in through blind textual substitution via the #include directive to the pre-processor. The compiler does not even see header files. It sees the expanded text that the pre-processor provides.

Inevitably people start putting things into headers that are actually code, not just declarations. Sometimes they're even almost forced into it. Any time, for example, that you work with inline functions (or their elder macro predecessors), it's sufficiently painful that a lot of the time (all of the time with macros) it's just easier to put the implementation in a header file.
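A hypothetical header showing how it happens. The declaration is fine; the inline function is implementation, now living where every includer can see it and quietly depend on it:

/* faux_module.h */
#ifndef FAUX_MODULE_HEADER_INCLUDED
#define FAUX_MODULE_HEADER_INCLUDED

#include <stdint.h>

uint32_t module_do_work(uint32_t input);        /* declaration: fine           */

static inline uint32_t module_scale(uint32_t x) /* implementation: leaked      */
{
    return (x << 4) + 7;                        /* internal detail, now public */
}

#endif /* !FAUX_MODULE_HEADER_INCLUDED */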

And now you've polluted the declaration/implementation divide.

Modularity and encapsulation are the foundation of sane programming. While C doesn't directly prevent modularity, it does little to support it either. And this continues through many of its children. C++ isn't much better (and indeed, because of how clumsy the class construct is, it often has even more code than C in its headers). Objective-C (and later Swift), along with D, share much of this flawed approach. (Other C-descended languages like Java and C# make attempts to fix this, but are constrained by what C programmers are used to. This makes them better but not great.)

Type safety

C has one data type: integers. There are some mild differences in interpretation of them: sometimes by width (though it took an astonishingly long time to even have standard ways of specifying type widths!), sometimes by sign interpretation, and sometimes by interpretation as a pointer.

On top of that one data type it has ways of grouping integers (of various sizes) into a namespace (i.e. struct and union declarations).

And finally it has a few bits of syntax sugar around pointer-flavoured integers to make them act somewhat like strings (without being actual strings) or to make them act somewhat like arrays (without being actual arrays).

Frustratingly, the one tool C has always had for providing integers of varying widths—bitfields—is so lackadaisically defined as to be utterly useless in practice. Which means that even though bit fields sound like they would be a perfect way to specify, well, bit fields in hardware registers, in practice they cannot be used reliably for that, so ugly macros or enums are used instead, involving manual (and thus error-prone) shifting and masking to manipulate the fields.
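The shape of the shift-and-mask workaround, for a hypothetical control register with an enable bit and a 3-bit clock divider (register name and layout invented for illustration):

#define CTRL_ENABLE_POS   0u
#define CTRL_ENABLE_MASK  (1u << CTRL_ENABLE_POS)
#define CTRL_CLKDIV_POS   4u
#define CTRL_CLKDIV_MASK  (7u << CTRL_CLKDIV_POS)

/* read-modify-write the divider field; get one shift or mask wrong
   and the compiler will not say a word */
#define CTRL_SET_CLKDIV(reg, div) \
    ((reg) = ((reg) & ~CTRL_CLKDIV_MASK) | \
             (((div) << CTRL_CLKDIV_POS) & CTRL_CLKDIV_MASK))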

Now there is some attempt to prevent conversion willy-nilly from one flavour of integer to another, but it's trivially overridden and indeed, because of C's limitations, it's often necessary to override this: lacking any form of genericity pretty much forces casting as a regular tool, for example. The result is that it's easy for an invalid use of casting to slip past and into production code.
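An illustrative (and deliberately broken) example of the kind of thing that slips through. The void * “generic” routine compiles cleanly, the mismatched use of it compiles cleanly, and the garbage only shows up at run time:

#include <stdio.h>

static void *first_item(void **items)     /* "generic": works for anything...  */
{
    return items[0];
}

int main(void)
{
    double values[] = { 3.14, 2.71 };
    void  *items[]  = { &values[0], &values[1] };

    int *broken = first_item(items);      /* wrong element type: no warning    */
    printf("%d\n", *broken);              /* reads a double as an int: garbage */
    return 0;
}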

Arrays and strings

And then there's the problem of the faux-arrays and faux-strings. I'll use array sizes as an example of what's wrong with C's approach to types in general.

int array1[5] = { 1, 2, 3, 4, 5 };
int *array2 = (int[]){ 1, 2, 3, 4, 5 };

/* stuff goes here */

void func1(int input[])
{
    for (int i = 0; i <= 5; i++) { printf("%d ", input[i]); }
    printf("\n");
}

What happens if you call func1(array1)? What happens with func1(array2)? What happens is that both do exactly the same thing because array1 and array2 are the same thing internally: pointer-flavoured integers that point to the head of a block of memory.

Oh, and both will have unexpected outputs because we walked one step past the size of the “array”. If the test had been i < 1000000 the function would have merrily continued printing 999995 invalid values because C's faux-arrays have no intrinsic size. The int input[] parameter is fooling you. It's really just an int *input: no size information anywhere. (And don't expect const to save you either. const in C isn't. Or might be. Sometimes. It depends on your hardware. No, really!)

There is no way for a function that receives an “array” in C to know the size of the array. What we really needed was this:

void func2(const int *input, int input_size)
{
    for (int i = 0; i < input_size; i++) { printf("%d ", input[i]); }
    printf("\n");
}

And here we face another problem: getting the size of the faux-array. The wrong way for array1 is func2(array1, sizeof(array1)), because the sizeof operator returns the number of bytes array1 occupies, and an int isn't a byte. So instead we need this mouthful: func2(array1, sizeof(array1) / sizeof(array1[0]));

All this ugly crud needed because C doesn't have actual arrays!

And it gets worse for array2. func2(array2, sizeof(array2)); won't do what you want (and its problem will be the opposite of the array1 failure). But there's no quick, easy programmatic way to calculate the number of elements as with array1. Indeed short of just magically knowing the size and manually inserting it into the call to func2(), it's not possible.

So it gets even uglier, with spooky action at a distance rearing its head just to use the second faux-array.
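The usual coping mechanism is to wrap the mouthful in a macro (the macro name here is mine, though the convention is widespread), but it only helps for real arrays and silently produces nonsense for pointers:

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

func2(array1, ARRAY_LEN(array1)); /* fine: 5 */
func2(array2, ARRAY_LEN(array2)); /* compiles happily, but passes sizeof(int *) / sizeof(int): typically 2 on a 64-bit target */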

Strings are no better. (I leave it as an exercise for the student to explain why “an array of bytes terminated by a NUL character” is a maladaptive solution to string representation. There is more than one reason. Enjoy digging them all up!)
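As a starter for that exercise, consider what happens the moment the data contains a zero byte, or the moment you need the length of anything at all:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Eight bytes of perfectly good data... */
    char data[] = { 'a', 'b', 'c', '\0', 'd', 'e', 'f', '\0' };

    printf("%zu\n", strlen(data)); /* 3: everything after the first NUL has silently vanished */
    printf("%s\n", data);          /* prints just "abc" */
    return 0;                      /* and strlen had to walk the bytes to find even that out */
}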

And a cast of thousands

These flaws are just a taste of C's failures as a language for serious software engineering. Entire books could be written going into deep detail on its flaws (its lack of memory safety, its utter mismatch with real-world hardware, its perennial security holes, …). And really, that's not the main point of this essay. The main point is languages that are good for software engineering. It was just necessary to point out why C (and most of its offshoots) are a bad choice.

The imperative world: Ada

If I had absolute control over a low-level project from beginning to end, I would choose Ada as the implementation language. Oft-derided as “too complicated” when it was first introduced (1983 was the first published standard), it has aged well against its competitors. Yes, Ada-83 was more complicated than the C of the time, but … it did more for software engineering.

First and foremost, Ada built on the legacy of Niklaus Wirth's languages in the Modula family to provide an actual module system instead of ad-hoc conventions around textual substitution. The “specification” file of a “package” (roughly equivalent to a header file) is rigorously limited to declarations, and even determines which features are “visible” outside of the package and which aren't. Declarations of types and of functions/procedures (these are distinct things in Ada) go into the specification; implementation code does not, and attempts to put any in will result in an error. The “body” file of a package contains the private declarations and the definitions of all functions, public or otherwise. As a result there is no choice but to use only the public API: there is no 'clever' way for a package user to circumvent the public-facing API and poke at implementation details, which frees the package to change its implementation without changing its front end.

Oddly, this constrained approach to programming turns out to be very liberating.

Type safety, too, is vastly superior in Ada. Integer types are constrained by value ranges, representation sizes, and other such restrictions. If a given integer must fall in the range of -2 to +5672, it can be constrained to those values and those values only. Enumerations are their own type, not just a fancy way of naming integer values. Subtypes can be defined at need (of both enumerations and integers). Pointers have their own semantics (and, unlike in C, it is a compile-time error to let a pointer escape the scope of the data it designates). Arrays and strings carry with them the information needed to use them: their sizes, for example. Ada also has genericity mechanisms, so casting (though available for the rare occasions it's genuinely required) is rarely needed and type safety is maintained.

Is Ada perfect? Naturally not. It is a human product and, even worse, a product of bureaucracy (the source of much of its initially-derided “complexity”). It cannot be perfect. But it is certainly a language that supports actual software engineering instead of mere coding. (Note: supports, not permits. There are no languages which don't permit software engineering.) Time has also shown that its initial “complexity” was well thought out. Ada-83 was significantly more complicated than the later C89, but it did more out of the box. It covered more ground. Over the years C's standard has expanded and expanded, making C an ever-more-complicated language without actually adding a whole lot of new capability. The modern C17 standard is only a little smaller than the modern Ada 2012 standard. Ada, from 1983 to 2012, primarily added object orientation and contracts, making very few changes to the rest of the language to do so. C, from 1983 to 2017, made a lot more (sometimes-breaking) changes (function prototypes, void pointers, inline functions, variadic macros, threading primitives, etc.) that still amount to a less capable language than Ada … while having a standard that is roughly the same size.

“Modern” C is better than pre-ANSI C, yes, but it's still a mess (and most compilers don't really support all of the standard anyway: even C11 support isn't guaranteed across all live compilers!). In all respects, Ada is the superior choice for engineering solid software. Similar criticisms can be levelled against the C++ suite of standards, except that the C++ standard is considerably larger than Ada's … while being, again, strictly less capable.

The declarative world: Logtalk

The declarative world has likely been nodding along with my critique of C, smugly patting its favoured language while doing so. And if that favoured language is, say, SML or OCaml or the like, the smugness is even justified. But Prolog programmers don't have a reason to be smug. Is Prolog better than C?

Let's take a look.

Prolog is, without a doubt, a far higher-level programming language than C, able to operate on abstractions that C users could only dream of working with. It's a powerhouse of a language, and a particularly amazing one given its age.

But it didn't age well.

I won't be getting into the virtues and flaws of dynamic vs. static type systems. Suffice it to say that I see reasons for both approaches and use both at need. That being said, I lean toward static in my preferences: as static as possible, but no more. Prolog, however, is militantly dynamic. Indeed, making a static Prolog (as Mercury kinda/sorta is) tends to drastically change the semantics of Prolog and reduce its power. It's almost as if great power comes with great responsibility …

What I can critique, however, is the module system used by most Prolog implementations. (Not all Prologs even have a module system. Those that do, including the popular SWI-Prolog package, conform largely, with minor variations, to the Quintus Prolog one.) Look in particular at the stupid module tricks I outlined in SWI-Prolog's module system. Count the number of times the whole point of modules is missed in the way they're handled.

This is why I've largely dropped pure Prolog from my toolkit. Aside from quick tests of things, I rarely program in plain Prolog. No, I instead use Logtalk where Prolog once featured, and I use this chiefly because Logtalk does modularity right.

In many ways it does modularity even better than some of my all-time favourite languages: Ada, Dylan, Modula-3, Eiffel, etc.

At its easiest, Logtalk can be used as a succ(Prolog, Logtalk). (What!? If C++ can use a syntax gag, so can I!) Specifically, instead of messing around with Prolog's inconsistent, incoherent, and fundamentally broken module system, wrapping what would have been a Prolog module in an object declaration gives you better, more portable results.

Over and above using object as a module replacement, however, Logtalk gives us a veritable, tasty stew of modular abstractions: object (naturally), but also protocol and category. These can be mixed and matched in relationships expressed by keywords like implements, imports, extends, instantiates, or specializes. Further, objects (the module stand-ins) can be parameterized at need, while all of these constructs (object, protocol, and category) can be created statically at compile time or programmatically at run time. And, naturally, given that this is a Prolog superset, it is all open to introspection at run time.

Logtalk has taken all the concepts of object-oriented and modular programming and turned all the dials to 11. While still being, at its core, relatively simple and elegant. It's far simpler for a Prolog user to learn Logtalk, for example, than it is for a C user to learn C++.

On this front alone, modularity, Logtalk leaves even Ada in the dust, making modules, ML-like, first-class constructs in the language. Its further engineering focus, however, includes one of my favourite things: built-in documentation.

Any language can have a third-party documentation tool. (Most languages, even oddball “dead” ones like SNOBOL4, have them, in fact.) Very few languages, however, make documentation part of the language itself the way Logtalk does. The impact of doing so is that documentation is far more tightly tied to the code than third-party solutions like Doxygen can manage, so the resulting documentation pairs more accurately with the constructs it describes.

And it accomplishes all of this while keeping the powerful aspects of Prolog front and centre. Logtalk doesn't diminish Prolog in the same way that, say, Mercury tends to. It manages to elevate the power of Prolog and turns Prolog into a tool for serious software engineering.

Choice and consequences

In conclusion, this is not an attempt to tell you to use Ada or Logtalk. Ada and Logtalk happen to fit my needs for engineering-focused languages, but they may not fit yours. What is almost guaranteed, however, is that most common languages don't fit yours either, even when you think they do. (There are a whole lot of 'my cold dead fingers' attitudes in language choice out in the programming world, largely the result of ignorance of other languages.)

What I'm saying, however, is that your choice of language will constrain your ability to program. Some languages—the ones focused more on engineering software—will constrain your ability to do stupid things, dangerous things, or just plain make silly mistakes. Others—the ones focused more on making your immediate, in-the-moment coding simpler at the expense of safety, security, or correctness—will constrain your ability to make quality software that works.

Choose the right one, whichever it may be.