Why software reliability is hard

There's a lot of talk (including from me) about how unreliable software is and how embarrassing it is that we can't ever seem to get it right. The reason for this, however, is pretty clear if you're on the inside making it. I'll illustrate with a problem I found and solved today.

The problem code

So, I spent all of Friday, all of Monday, and half of Tuesday trying to figure out a problem in a library I'm responsible for. There's a set of states being driven by a call to a “booster” function that simultaneously keeps the hardware alive and kicking (since I'm eschewing interrupts for assorted reasons) and that reports the ensuing state of the beast.

The code looked something like this (vertically compressed, virtualized):

switch (booster()) /* <1> */
{
case 1: handle_booster1(); break;
case 2: handle_booster2(); break;
case 3: handle_booster3(); break;
case 4: handle_booster4(); break;
default: break; /* no special handling */
}

This worked. Once. The first time that block of code was executed, the appropriate case was caught and handled. Every subsequent iteration had it fail. It fell through the switch matching nothing, 100% of the time.

And it was weird. No matter how I approached it, if I instrumented it (blinking an LED), it failed. If I debugged it, however, everything worked in the mechanisms behind the call to booster(). Everything worked leading up to booster(). Everything worked after booster(). But that switch statement only worked once.
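
(For the curious: “instrumenting” on a bare-metal target means something like the sketch below. The register and pin names are invented for illustration; every MCU has its own, so treat this as the shape of the technique and not code from the actual library.)

#include <stdint.h>

/* Stand-in for a memory-mapped GPIO register. On real hardware this
   would be a fixed address from the vendor header, something like:
   #define LED_PORT (*(volatile uint8_t *)0x0025)                    */
static volatile uint8_t LED_PORT;
#define LED_PIN (1u << 3)   /* hypothetical pin assignment */

/* Crude bare-metal instrumentation: blip the LED so a human can see
   that a given branch executed. */
static void led_blip(void)
{
    LED_PORT |= LED_PIN;
    for (volatile uint32_t i = 0; i < 50000; i++) { /* busy-wait */ }
    LED_PORT &= (uint8_t)~LED_PIN;
}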

Then, even more bizarrely, if a breakpoint was set on <1> above, the switch statement worked 100% of the time. It just had to pause right there, then continue, and everything worked out fine.

The solution code

int rc = booster();
switch (rc) /* <1> */
{
case 1: handle_booster1(); break;
case 2: handle_booster2(); break;
case 3: handle_booster3(); break;
case 4: handle_booster4(); break;
default: break; /* no special handling */
}

This code works. 100% of the time. With or without debugging. Every iteration. There's no reason for this: semantically it is identical to the previous version. Hell, even an '80s-era peephole optimizer should generate identical code in both cases. Yet the first version broke and the second does not.
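
(Had the local variable alone not fixed it, the next tool I'd have reached for is volatile. This is a sketch of a defensive fallback, not what I shipped: qualifying the temporary as volatile forces the compiler to emit a genuine store and reload, fencing off whatever the optimizer might be doing to the bare call.)

volatile int rc = booster();   /* volatile: a real store and a real load,
                                  no caching, no clever elision */
switch (rc)
{
case 1: handle_booster1(); break;
case 2: handle_booster2(); break;
case 3: handle_booster3(); break;
case 4: handle_booster4(); break;
default: break; /* no special handling */
}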

And this is the problem

What's here is obviously a compiler bug (and a pretty bizarre and ugly one at that). Now I work in embedded systems, so I face less of this problem than most, but even I get hit with this single huge elephant in the room of software development: we make (buggy) software with (buggy) software that is itself based on (buggy) software. (And it usually sits on top of buggy hardware. But that's a story for another time.)

How do you make reliable software under those conditions? Even in my own comparatively simple case, writing software means dealing with:

  1. Hardware. (Which is often buggy. Errata sheets are a thing for a reason.)
  2. Drivers. Vendor-supplied or my own, they're usually buggy. And they're usually written in C, whose compiler is itself a piece of buggy software. (Which is itself usually written in C, which is compiled by a piece of buggy—stop me when you spot the pattern here.) Which itself uses a standard library that is buggy (and built with a buggy compiler). Which … (I think the point is clear now?)
  3. Standard C libraries. Which are buggy and built on buggy software.
  4. My own libraries. Which are buggy and built on buggy software.
  5. My application logic. Which is buggy and built on buggy software.

That's five (often recursive) layers of bugginess. And I'm programming “close to the metal”. Consider a more traditional programmer: Hardware → drivers (note: plural) → kernel → OS services → compiler → standard library → web browser → interpreter → application framework → network stack → … and that list goes on and on and on.

Why is anybody surprised at how slipshod software is?

And this is part of the solution

One of the major problems with this wobbly stack of fruit-flavoured gelatin we call software is the selection, at its core, of a single, terrible programming language: C. Even when C isn't the language of direct implementation, I guarantee you that throughout that stack, even the truncated version I supplied, there is C code clutching at the software with its sloppy fingers. The drivers are likely written in C (with a smattering of assembler), as are the kernel and half or more of the OS services. The compiler was either written in C or likely generates C in the back-end. The standard library was largely written in C. The web browser likely has a huge blob of C in it. The interpreter was likely implemented in C, as was the network stack.

And C is a terrible language for reliability. It was seemingly designed to trip up programmers by defaulting to bad technique. Can you write reliable software in C? Sure! You'll just be incredibly unproductive as you go out of your way to second-guess every aspect of the language: checking bounds on each operation, for example, or inspecting each and every return value (yes, printf() included!) to ensure correct operation.
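
To make that concrete, here's a sketch of the style (illustrative code, not from the library in question): even a trivial copy-and-print has to be wrapped in checks before it counts as “correct” C.

#include <stdio.h>
#include <string.h>

/* Copy src into dst without running off the end. Returns 0 on success,
   -1 if dst is too small. In C you get to write (and remember to call)
   this sort of thing yourself. */
static int checked_copy(char *dst, size_t dst_len, const char *src)
{
    size_t n = strlen(src);
    if (n + 1 > dst_len)          /* bounds check before every write */
        return -1;
    memcpy(dst, src, n + 1);
    return 0;
}

int main(void)
{
    char buf[16];

    if (checked_copy(buf, sizeof buf, "hello, world") != 0)
        return 1;                 /* would have overflowed; bail out */

    /* yes, even printf() can fail, and its return value must be checked */
    if (printf("%s\n", buf) < 0)
        return 1;

    return 0;
}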

And every time you slip up in that exhausting, easy-to-overlook process, that's a bug waiting to bite, possibly surfacing in someone else's code entirely. Like my mysterious switch statement problem.

So stop using C unless you're forced to. And if you're forced to, start planning your exit from C. (And not in the direction of C++ either: it has literally all of C's flaws and then adds a few hundred of its own!) There are languages out there that are good for writing stable, reliable software. On the low-level/imperative side there's the Modula family (Modula-3 being my historic personal favourite). There's also Ada (my current favourite at that level) and Rust.

On the high-level side there are other languages you can look at. OCaml, for example, is devastatingly reliable, yet fast, despite being a “wasteful” functional language with honest-to-goodness automated memory management. My friend Paul Bone is working on the Plasma programming language, another non-C language (implemented in a non-C language) that should give, when complete, a good, solid platform for high-level, reliable software.

TL;DR LOL!

Stop using C. Stop using languages whose design notes have been informed by C. C was a fine language for its time, its place, and its target platform. It has no business being so central to software half a century later, as if we hadn't learned anything since.

We did. Start applying it!