Dynamically static allocation in embedded systems

In most programming, dynamic allocation and subsequent freeing of memory resources is normal. In embedded programming, dynamic SRAM allocation from the heap can be an application murderer (largely because having 128KB of SRAM is nigh luxurious and allocating 32KB of that to heap is crazy talk).

The issue that kills is heap fragmentation. If you're dynamically allocating and freeing objects willy-nilly, unless all of the objects happen to be exactly the same size you are going to eventually wind up in the ridiculous position of having enough free SRAM for an object, but no single piece big enough to hold it. Your application dies, starved for memory even though you have plenty of free space.

Response patterns

In response to this, several patterns of memory usage pop up in embedded software. The three most common patterns used by embedded systems programmers are:

  1. All-static, all the time.
  2. One-time dynamic allocation.
  3. Locally-paired allocation and deallocation.

I'll explain both what these patterns are (in brief) and then explain some of the issues each brings to the table: constraints on system capabilities and designs, so that the trade-offs you're bringing to the table when using these are made manifest.

(C++ users, when reading this, should s/malloc/new/ and s/free/delete/ with attendant minor syntax changes as needed in subsequent code reading.)

All-static, all the time

This one is simple and is most common in simpler embedded systems with fixed resources. Consider, for example, a simple device that reads an ADC and periodically calculates stats on the read data, perhaps logging those stats to a UART. Your ADC will, accordingly, need a buffer to write its data into and from which data will be read and analyzed for the reports.

Where do you put your buffer?

You put it in static space:

static uint16_t adc_buffer[64];

When you start your ADC, or analyze its results, your functions take the pointer to that static buffer like this:

start_adc(adc_buffer, 64);

Easy-peasy! If you want to do premature optimization (and thus be evil) you might even foolishly try to make adc_buffer a global variable and forego the trivial extra cost of passing a pointer.

Here's why you don't do that, though.

One-time dynamic allocation

All-static as a strategy works when you have fixed resources known at compile time. But what do you do when the resources can vary at run time and you're not able to predict at compile time? Let's say, for example, that the device mentioned above actually has up to 8 ADCs. You only know, however, which are connected when the software fires up. How do you cope with that?

If you went all-static you'd be declaring your adc_buffer more like this:

static uint16_t adc_buffer[8][64];

Your calls to start or analyze would then pick the right buffer to pass in based on which ADC you're reading:

start_adc(adc_buffer[3], 64);

This is, however, wasteful of resources if you're running in a device that only has 3 ADCs hooked up. A better approach would be something like this:

static uint16_t *adc_buffer;

You'd initialize it at application start-up by doing this:

adc_buffer = malloc(count_active_adcs() * 64);

(I am eliminating a check for that malloc return for clarity. If you don't check the return value of malloc in production code you need to be shot.)

You'd then use it like this, assuming you want the 4th ADC started:

start_adc(adc_buffer+3*64, 64);

(Of course really you'd use a symbolic name like BUFFER_SIZE instead of 64, and a variable like current_buffer instead of 3, but I'm trying to minimize excess code baggage here.)

This is a perfectly safe use of dynamic allocation because it is never freed. It is functionally the same as using static allocation, in terms of heap fragmentation, but it is more flexible in memory assignment. Indeed, for a variety of reasons, I tend to use this instead of static allocation in most of my projects, even simple ones with static resources. (One of the chief reasons: nothing ever stays simple and static over time.)

But it's still not universally applicable.

Locally-paired allocation and deallocation

A common situation I find myself in is having to allocate a temporary buffer to pass into a function for processing. The size of the buffer may vary from call to call and, worse, the buffer may be sizable: like say a 1K buffer. In a typical embedded environment this is far too large an entity to allocate on the stack (as a local variable). On the other hand, it's also a bad thing to have allocated statically. There may be a dozen functions that need this kind of buffering. If each one statically allocated its own (too-large because the size varies unpredictably each iteration) my SRAM would be eaten up before I can even link the app, not to mention run it, and if they tried sharing there's too great a chance that one steps on another's buffer. So we instead do this:

void my_funky_function(void *data, size_t data_bytes)
{
    uint8_t *buffer = malloc(data_bytes);
    do_something_with(data, data_bytes, buffer);
    free(buffer);
}

(Again, eschewing return value checks, etc. for brevity. Check your malloc in production code!)

This works without fragmenting memory, even if do_something_with() uses this same pattern internally, and maybe a further, deeper function also_do_with(). As long as the malloc and free calls are paired and nested this is not a problem.

In the following diagrams, _ is a bit of heap memory that is free. 1 is heap memory used by my_funky_function, 2 is used by do_something_with, and 3 is used by also_do_with, a function called by do_something_with. * is a bit of heap memory that is used by something out of scope.

When my_funky_function is entered, the heap will look something like this:

**_***__**____________________

When the memory for the buffer is allocated, and do_something_with is called, it will look like this:

**_***__**11111_______________

When do_something_with calls also_do_with the heap might look something like this:

**_***22**11111_______________

And finally while also_do_with is running, after its allocation, it might look like this:

**_***22**111113333333________

It's all nice and cosy. And here's the sequence as each function finishes and frees its resources:

**_***22**111113333333________   /* also_do_with() active      */
**_***22**11111_______________   /* do_something_with() active */
**_***__**11111_______________   /* my_funky_function() active */
**_***__**____________________   /* my_funky_function() exited */

In any sane heap implementation, the heap's layout will be identical upon exiting my_funky_function as it was before entering. There's no room for fragmentation and churn because the nested nature of allocation and deallocation ensures that the structure of the heap is reconstructed on the way out.

But...

All of this is fine and dandy and you can make some very nice applications that efficiently use SRAM without fragmentation this way. But there is one increasingly-common use case not represented here that some readers are already champing at the bit to correct.

What happens if you can't know in advance (even at application startup) how many items you need, and/or you can't put everything neatly into nested pairs?

I faced this problem head-on when writing an event queue system. Events in this system contain a lot of information:

And while there is an API for firing events from pre-allocated memory blocks, this API is inconvenient to use and makes the event system less attractive.

On the other hand, the first rendition of the event system that used calloc to allocate an event's resources, and then called free (both on the event and on its embedded data pointer) after the event's callback had been fired had a major problem: since the point of firing and the point of activation were often separated in time, and since allocating events themselves and event data blocks were interlaced and usually of different sizes, fragmentation of the heap was inevitable. And sure enough my test system would start being unable to allocate new events after only about a thousand iterations of the test.

So what's the solution?

Dynamically static allocation

I borrowed a page from the “One-time dynamic allocation” section above, with a twist: the “one-time” allocation was actually “exponentially-less-frequent multi-time allocation”.

It works like this. Given an event definition:

typedef struct event
{
    uint32_t activation_tick, repetition_interval;
    coroutine_t *callback;
    void *event_data;
    bool in_use;
} event_t;

Declare a pointer to the event_t in the static space of all the event queue management functions:

event_t *allocated_events;
size_t num_allocated_events;
#define INITIAL_SIZE 4

(INITIAL_SIZE is a number I pull out of my butt at first, but later tune according to measured high water marks to keep reallocations to a minimum.)

Allocating a new event in the system now looks something like this (keeping in mind that I'm not checking error returns and such for clarity):

event_t *allocate_event(void)
{
    if (!allocated_events) /* <1> */
    {
        allocated_events = calloc(INITIAL_SIZE, sizeof(event_t *));
        num_allocated_events = INITIAL_SIZE;
    }
retry:
    for (size_t i = 0; i < num_allocated_events; i++) /* <2> */
    {
        if (!allocated_events[i]) /* <3> */
        {
            allocated_events[i] = calloc(1, sizeof(event_t));
            allocated_events[i]->in_use = true;
            return allocated_events[i];
        }
        else if (allocated_events 
             && !allocated_events[i]->in_use) /* <4> */
        {
            allocated_events[i]->in_use = true;
            return allocated_events[i];
        }
    }
    double_allocated_events(); /* <5> */
    goto retry; /* <6> */
}

Wut

This needs some breaking down to explain and justify. (It may help, if you come from OOPsland, to look at the above again attaching the name “object pool” to the technique. Then the key concepts might fall into place, this being, in effect, the procedural precursor to object pools.)

<1> is a check that ensures that there are event slots available at all. If that pointer has not been initialized yet, we make an initial array based on an INITIAL_SIZE that has been selected with an eye toward typical usage patterns observed in the wild. Note that we've allocated an array of pointers to event_t here, not the actual event_t objects themselves!

<2> simply iterates over every item in allocated_events looking for the first available slot. An available slot is either one that hasn't been allocated an event_t object yet (<3>) or one that has been allocated but is no longer in use (<4>).

(I commit the dark evil here of returning from the middle of a function, yes. And I commit an even darker evil later. Both are because the logic is otherwise difficult to follow for no good reason beyond sticking to dogma.)

If that loop finishes, there were no event slots available so I double the size of the array (<5>) and try again (<6>).

(<6> is my second, even darker evil. I use a gasp! goto! Pull my programmer license stat! Even if the goto-less version of this obfuscates intent for no good reason!)

This trick of double-and-retry looks dangerous at first glance, but keep in mind that if the allocation fails, I actually test it in production code, and bug out. This is a very simplified version of the final product specifically to highlight a technique. The way the real code works it's guaranteed that there will be a slot available the second time, or the second time won't even execute.

So, this is a lot to take in. One question that will immediately come to mind is “why are you allocating individual events dynamically instead of making it an array of event_t? The reason for this lies in the secret sauce of double_allocated_events().

Secret sauce

void double_allocated_events(void)
{
    size_t current_size = num_allocated_events;
    size_t new_size = current_size * 2;
    allocated_events = realloc(allocated_events, 
                       new_size * sizeof(event_t *));
    memset(&allocated_events[current_size], 0, 
            current_size * sizeof(event_t *));
}

Events fly through the system as pointers. If you had an array of event_t, the moment realloc is called, there is a very real (almost certain!) chance that events in flight have suddenly been rendered bad. It is for this reason that the array of event slots is an array of pointers, each of which is individually allocated. While in theory this leads to some inevitable heap fragmentation, in practice (if the system is properly tuned with a good INITIAL_SIZE) the number of such reallocations is minimal and thus so is any attendant fragmentation. Even using a small INITIAL_SIZE value (2, say), leads to a minimum amount of fragmentation because the exponential nature of the doubling means your queue of event slots will rapidly grow to the size you need after which it will never have to be allocated again.

Deallocation

Deallocating an event now looks like this:

void deallocate_event(event_t *e)
{
    e->in_use = false;
}

Events are just flagged as available, and the next time a “dynamic” event is needed, an available event is returned.

Improvement

There is room for improvement in the above design. Specific improvements could include:

  1. Linear growth.
  2. Chunked event allocation.
  3. Event availability management.

None of these are currently implemented because the event queue system currently in place more than meets my needs in real-time embedded control. Should this change, however, there are obvious improvements which could be made that will not in any way change the API (and thus the clients) of the event queue system.

Linear growth.

I use a doubling scheme here. This means that if I use a single event past a threshold, I have a lot of slack space. For example with an INITIAL_SIZE of 4, if my high water mark is 5, I will have three event_t * spaces left doing nothing.

This is not a major problem in my systems thus far precisely because the slot array is an array of pointers. The slack space in the above example (taken from real world data) is 12 bytes on my target machine. I'm not yet at the point where I'm counting down to double-digit bytes for eking out memory. If I get there, I can use a different growth scheme like going up by INITIAL_SIZE instead of doubling (possibly using a linked list so that I'm not even reallocating). This will increase the number of (re)allocations (with commensurate risk of fragmentation) in exchange for lower overall memory usage.

Chunked event allocation.

Currently I calloc new events as I grow into the empty slots. Given that this can occur at random points in the execution time, it guarantees a small amount of fragmentation. Again real-world data suggests it's not a problem for me, but should it become a problem, I would improve this with chunk-allocation of events both the first time that the events are allocated and when they're doubled (or otherwise grown).

The technique would be to calloc(count, sizeof(event_t)) and then treat the result as an array of event_t, putting the pointer of each item into the allocated_events array. The advantage of this is that by allocating the chunks all at once I reduce fragmentation opportunities. The disadvantage is that I waste, in the previous example (high water 5 with an INITIAL_SIZE of 4), 60 bytes of RAM (3 unnecessarily-allocated events, each sized 20 bytes).

Event availability management.

Currently I read the event slots from the beginning of the array each time. This wastes a few cycles in a loop (sometimes twice!) which could be mitigated by either keeping track of the lowest available slot and starting from there, or by moving the pointers around to the beginning on availability.

The truth, however, is that allocation time costs are minimal and the code complexity of managing more efficient allocation is not warranted for my problem domain. I deal with events on the multi-millisecond scale. Walking the array of events on allocation is on the multi-microsecond scale. Three orders of magnitude difference in time means that it will be a while before this cost is large enough to warrant the extra complexity.

Summary

Dynamic memory in embedded systems is an application killer if not handled carefully. In this little essay I surveyed three very common techniques to rein in the madness (all static, one-time dynamic, nested allocate/deallocate) and introduced another less common one that also serves to rein in the reign of terror (along with outlining a few ways to enhance even that).

There is really no reason to be afraid of dynamic memory in embedded systems. Just be cautious and creative.