Languages for software engineering

December 29, 2021

I make no secret of my love for weird programming languages. Whether spending time on Rosetta Code adding bizarre sorting algorithms in bizarre programming languages, or writing tutorials for how to use modules in a Prolog dialect, or helping a friend build a new programming language from scratch, programming languages are the thing that have fascinated me from my early days with Commodore and TRS-80 BASIC, through assorted assemblers, and then an explosion of other (often very obscure) languages over the decades. The very idea that we could use something that look, at least at a trivial glance, like actual words to tell computers what to do is just a deep joy in my life of programming.

But...

I generally hate software and the process of creating it. This is largely because most people who make software are either:

unable to make good, reliable software;
unwilling to make good, reliable software; and/or
forced to make shoddy, unreliable software by outside forces.

One of the major problems, I feel, in the making of software, is a very bad choice in programming languages. Most especially the choice to use C or one of its descendent languages. Thus I'd like to first explain why I think C is a bad base language and why I view it as a blight on computing, and why even the descendants that improve on it miss the mark. After that I'll introduce some languages I think do the job properly.

The case against C

C was a powerhouse of a language for its original, intended target. That original intended target was for porting the then-nascent Unix operating system from a DEC PDP-7 to a DEC PDP-11.

For reference, the PDP-7 had 4K (18-bit) words standard memory and could be expanded (at tremendous cost) to 64K. The PDP-11 starts at 8KB (bytes) memory and over the years could go as high as 4MB. If you have a USB charger it has more computing horsepower in it than the PDP-7. If you have a USB 2.0 hub it has about as much computing horsepower (possibly more) as a higher-end PDP-11.

This was the environment in which C was created. The compiler had to be small and simple because the machines were cramped. This was OK, however, because the amount of complexity you could have in your programs was similarly limited. For writing small kernels and simple utilities—which all original Unix utilities were—C was acceptable. Not great. Acceptable. It's when you start programming at scale that the C 'philosophy' (read: 'attitude') fails.

So where does C fail us?

Modularity

The problem is that C doesn't scale. At all. The larger your program gets, the more obvious this becomes. You start relying on ad hoc rules of thumb and weird boilerplate to get around its limitations. A prime example of this is the eternal …

#ifndef FAUX_MODULE_HEADER_INCLUDED
#define FAUX_MODULE_HEADER_INCLUDED

.
.
.

#endif /*!FAUX_MODULE_HEADER_INCLUDED*/

… you will find in every piece of C code of substance in some form or another.

This is C's answer to modularity. It's three lines of code added to every header file, practically, that exists only because C doesn't do actual modularity. The closest thing C gets to a declaration/implementation divide is a header file (.h) and a corresponding source file (.c). And that divide is purely one of convention. Convention that starts getting messy very quickly.

But it goes beyond weird-looking boilerplate. Modularity in C simply doesn't exist unless you force it to by convention. Convention that's actually hard to pull off in many cases.

In theory, for example, you're going to want to have only public API declarations in your .h files and only private API declarations and implementations in your .c files. In practice, however, there's nothing enforcing this. Header files are brought in through blind textual substitution via the #include directive to the pre-processor. The compiler does not even see header files. It sees the expanded text that the pre-processor provides.

Inevitably people start putting things into headers that are actually code, not just declarations. Sometimes they're even almost forced into it. Any time, for example, that you work with inline functions (or their elder macro predecessors), it's sufficiently painful that a lot of the time (all of the time with macros) it's just easier to put the implementation in a header file.

And now you've polluted the declaration/implementation divide.

Modularity and encapsulation is the foundation of sane programming. While C doesn't directly prevent modularity, it does little to support it either. And this continues through many of its children. C++ isn't much better (and indeed because of how clumsy the class construct is, it often has even more code than C in its headers). Objective-C (and later Swift), along with D, share much of this flawed approach. (Other C-descended languages like Java and C# make attempts to fix this, but are constrained by what C programmers are used to. This makes them better but not great.)

Type safety

C has one data type: integers. There are some mild differences in interpretation of them: sometimes by width (though it took an astonishingly long time to even have standard ways of specifying type widths!), sometimes by sign interpretation, and sometimes by interpretation as a pointer.

On top of that one data type it has ways of grouping integers (of various sizes) into a namespace (i.e. struct and union declarations).

And finally it has a few bits of syntax sugar around pointer-flavoured integers to make them act somewhat like strings (without being actual strings) or to make them act somewhat like arrays (without being actual arrays).

Frustratingly, the one tool that C always had to provide integers of varying widths—bitfields—is so lackadaisically defined that they are in effect utterly useless. Which means that even though bit fields sound like they would be a perfect way to specify, well, bit fields in hardware registers, in practice they cannot be used reliably for that so instead ugly macros or enums are used instead, involving manual (and thus error-prone) shifting and masking procedures to manipulate them.

Now there is some attempt to prevent conversion willy-nilly from one flavour of integer to another, but it's trivially overridden and indeed, because of C's limitations, it's often necessary to override this: lacking any form of genericity pretty much forces casting as a regular tool, for example. The result is that it's easy for an invalid use of casting to slip past and into production code.

Arrays and strings

And then there's the problem of the faux-arrays and faux-strings. I'll use array sizes as an example of what's wrong with C's approach to types in general.

int array1[5] = { 1, 2, 3, 4, 5 };
int *array2 = { 1, 2, 3, 4, 5 };

/* stuff goes here */

void func1(int input[])
{
    for (int i = 0; i <= 5; i++) { printf("%d ", input[i]); }
    printf("\n");
}

What happens if you call func1(array1)? What happens with func1(array2)? What happens is that both do exactly the same thing because array1 and array2 are the same thing internally: pointer-flavoured integers that point to the head of a block of memory.

Oh, and both will have unexpected outputs because we walked one step past the size of the “array”. If the test had been i < 1000000 the function would have merrily continued printing 999995 invalid values because C's faux-arrays have no intrinsic size. The int input[] parameter is fooling you. It's really a const int *input. (Note: const in C isn't. Or might be. Sometimes. It depends on your hardware. No, really!)

There is no way for a function that receives an “array” in C to know the size of the array. What we really needed was this:

void func2(const int *input, int input_size)
{
    for (int i = 0; i < input_size; i++) { printf("%d ", input[i]); }
    printf("\n");
}

And here we face another problem: getting the size of the faux-array. The wrong way for array1 is func2(array1, sizeof(array1)); Because the sizeof operator returns the number of bytes array1 occupies, but int isn't a byte. So instead we need this mouthful: func2(array1, sizeof(array1) / sizeof(array1[0]));

All this ugly crud needed because C doesn't have actual arrays!

And it gets worse for array2. func2(array2, sizeof(array2)); won't do what you want (and its problem will be the opposite of the array1 failure). But there's no quick, easy programmatic way to calculate the number of elements as with array1. Indeed short of just magically knowing the size and manually inserting it into the call to func2(), it's not possible.

So it gets even uglier, with spooky-action-at-a-distance rearing its ugly head to use the second faux-array.

Strings are no better. (I leave it as an exercise for the student to explain why “an array of bytes terminated by a NUL character” is a maladaptive solution to string representation. There is more than one reason. Enjoy digging them all up!)

And a cast of thousands

These flaws in C are just a taste of C's failures as a language for serious software engineering. There are entire books that could be written that go into deep detail on its flaws (like its lack of memory safety, its utter mismatch to real-world hardware, its perennial security flaws, …). And really, it's not the main point of this essay. The main point is languages that are good for software engineering. It was just necessary to point out why C (and most of its offshoots) are a bad choice.

The imperative world: Ada

If I had absolute control over a low-level project from beginning to end, I would choose Ada as the implementation language. Oft-derided as “too complicated” when it was first introduced (1983 was the first published standard), it has aged well against its competitors. Yes, Ada-83 was more complicated than the C compilers of the time, but … it did more for software engineering.

First and foremost, Ada built on the legacy of Niklaus Wirth's languages in the Modula family to provide an actual module system instead of ad-hoc conventions around textual substitution. The “specification” file of a “package” (roughly equivalent to a header file) is rigorously limited only to declarations, and even determines which features are “visible” outside of the package and which aren't. Only type and function/procedure (these are different in Ada) declarations can go into a specification. Attempts to put in anything else will result in error. The “body” file of a package contains private declarations and definitions of all functions, public or otherwise. As a result there is no choice but to use only the public API. No 'clever' way for a package user to circumvent the public-facing API to use implementation details, freeing that package to change implementations without changing the front-end.

Oddly, this constrained approach to programming turns out to be very liberating.

Type safety too is vastly superior in Ada. Integer types are constrained by value ranges, representation sizes, and other such restrictions. If a given integer must fall in the range of -2 to +5672, it can be constrained to those values and those values only. Enumerations are their own type and not just a fancy way of making names for integer values. Subsets can be defined at need (of both enumerations and integers). Pointers have their own semantics (and it is literally a syntax error to “leak” dynamically-allocated memory outside of a stack frame, unlike C). Arrays and strings carry with them the information needed to use them: sizes, for example. It also has genericity mechanisms so that casting (although available for the rare times it's needed) is not needed and type safety is maintained.

Is Ada perfect? Naturally not. It is a human product and, even worse, a product of bureaucracy (the source of much of its initially-derided “complexity”). It cannot be perfect. But it is certainly a language that supports actual software engineering instead of mere coding. (Note: supports, not permits. There are no languages which don't permit software engineering.) And, in addition, over time, it's shown that its initial “complexity” was well-thought out. Ada-83 was significantly more complicated than the later C89 was. But it did more out of the box. It covered more ground. And over the years C's standard has expanded and expanded, making C an ever-more-complicated language without actually adding a whole lot of new capabilities to it. The modern C17 standard is only a little bit smaller than the modern Ada 2012 standard. Ada, from 1983 to 2012 added object orientation and contracts, primarily, making very few changes to the language to do so. C from 1983 to 2017 made a lot more (sometimes-breaking) changes (like function prototypes, void pointers, inline functions, variadic macros, threading primitives, etc.) that still amount to being less capable than Ada … while having a standard that is roughly the same size.

“Modern” C is better, yes, than pre-ANSI C, but it's still a mess (and most compilers don't really support all of the standard anyway—even C11 support isn't guaranteed across all live compilers!). In all respects, Ada is the superior choice for engineering solid software. Similar criticisms can be levied against the C++ suite of standards, except that the C++ standard is roughly an order of magnitude larger than Ada's … while being, again, strictly less capable.

The declarative world: Logtalk

The declarative world has likely been nodding along with my critiques of C and smugly patting their favoured language while doing so. And if that favoured language is, say, SML or OCaml or the like, they even have reason to be smug. But Prolog programmers don't have a reason to be smug. Is Prolog better than C?

Let's take a look.

Prolog is, without a doubt, a far higher-level programming language than C that can operate on abstractions that C users could only dream of working with. It's a powerhouse of a language and rendered particularly amazing given its age.

But it didn't age well.

I won't be getting into the virtues and flaws of dynamic vs. static type systems. Suffice it to say that I see reasons for both approaches and use both approaches at need. That being said, I lean toward static in my preferences: as static as possible, but no more. Prolog, however, is militantly dynamic. Indeed making a static Prolog (like Mercury kinda/sorta is) tends to drastically change the semantics of Prolog and reduce the power. It's almost as if great power comes with great responsibility …

What I can critique, however, is the module system used by most Prolog implementations. (Not all Prologs even have a module system. Those that do, including the popular SWI-Prolog package, conform largely, with minor variations, to the Quintus Prolog one.) Look in particular at the stupid module tricks I outlined in SWI-Prolog's module system.. Count the number of times the whole point of modules is missed in the way they're handled.

This is why I've largely dropped pure Prolog from my toolkit. Aside from quick tests of things, I rarely program in plain Prolog. No, I instead use Logtalk where Prolog once featured, and I use this chiefly because Logtalk does modularity right.

In many ways even better than some of my all-time favourite languages: Ada, Dylan, Modula-3, Eiffel, etc.

At its easiest, Logtalk can be used as a succ(Prolog, Logtalk). (What!? If C++ can use a syntax gag, so can I!). In specific instead of messing around with Prolog's inconsistent, incoherent, and fundamentally broken module system, wrapping what would be a Prolog module in an object declaration instead is going to give you better, more portable results.

Over and above using object as a module replacement, however, Logtalk gives us a veritable, tasty stew of modular abstractions including (naturally) object, but also protocol, and category. These can be mixed and matched in relationships shown by keywords like implements or imports or extends or instantiates or specializes. Further, objects (module stand-ins) can be parameterized at need while all coding constructs (object, protocol, and category) can be created statically at compile time or programmatically at run time. And, naturally, given that this is a Prolog superset, this is all open to introspection at run-time.

Logtalk has taken all the concepts of object-oriented and modular programming and turned all the dials to 11. While still being, at its core, relatively simple and elegant. It's far simpler for a Prolog user to learn Logtalk, for example, than it is for a C user to learn C++.

On this front alone, modularity, Logtalk leaves even Ada behind in the dust, making, ML-like, modules first-class constructs in the language. Further engineering focus, however, includes one of my favourite things: built-in documentation.

Any language can have a third-party documentation tool. (Most languages, even oddball “dead” ones like SNOBOL4, have them, in fact.) Very few languages, however, make documentation part of the language like Logtalk does. And the impact of making it part of the language is that documentation is far more tightly tied to the language than it is when using third-party solutions like Doxygen, making the resulting documentation more accurately paired with language constructs.

And it accomplishes all of this while keeping the powerful aspects of Prolog front and centre. Logtalk doesn't diminish Prolog in the same way that, say, Mercury tends to. It manages to elevate the power of Prolog and turns Prolog into a tool for serious software engineering.

Choice and consequences

In conclusion, this is not an attempt to tell you to use Ada or Logtalk. Ada and Logtalk happen to fit my needs for engineering-focused languages, but they may not fit yours. What is almost guaranteed, however, is that most common languages also don't fit yours, even when you think they do. (There's a whole lot of 'my cold dead fingers' attitudes in language choices out in the programming world, largely the result of ignorance of other languages.)

What I'm saying, however, is that your choice of language will constrain your ability to program. Some languages—the ones focused more on engineering software—will constrain your ability to do stupid things, dangerous things, or just plain make silly mistakes. Others—the ones focused more on making your immediate, in-the-moment coding simpler at the expense of safety, security, or correctness, will constrain your ability to make quality software that works.

Choose the right one, whichever it may be.