Crash-only-Software


.noise
Hey there,
It’s been a while since my last post. I’m working on multiple projects now, so scheduling for new articles is kinda messed up. Also, I have this little side-op running, calling it “project happy ever after”, and if everything works out, I’ma open up my life for a 2-player speedrun.
So, what are we going to look into today? Well, I happened to be very upset that my Ubuntu didn’t have a built-in sticky notes app, so I went ahead and wrote one. Yeah, there are some half-assed snaps in the software center, but looking at these repositories makes me wanna bite my keyboard.
Problem is, sticky notes isn’t just your average text editor. We don’t have any means to [ save document as ] or [ load from file ]. So what can we do? We could periodically check for changes in data and write them to disk. Well, we’ll kinda do that, as you’ll see, but it seems like “periodically” won’t do. We need to constantly write to disk, ideally only small chunks of data. A sticky-notes app is probably small enough that you could write the whole application state to disk in the blink of an eye, but that won’t work for every piece of software, as we’ll discuss.
First things first, if you wanna clone the repository and play around with sticky notes, feel free to just grab it from my GitHub repository. And pls. don’t bite your keyboard, omae. I didn’t take the time to set up a proper project structure and kind of made it all up as I went. I chose Java because it is easy to use and build GUIs with, no other reason.
And secondly, here is something for your ears, currently trying to play it myself on some equipment I soldered up - only problem is my voice can’t be fine tuned that well (yet).
What is crash-only-software?
Crash-only-software is a useful design philosophy that’s mostly applied in embedded programming, but there are use cases for this in all kinds of applications. The most basic example would be a small embedded board, let’s assume a Raspberry Pi here (strictly speaking a single-board computer rather than a microcontroller, but the problem is the same). Because a RbP will run until you pull its power, there is no way of knowing how long it will be up. And yes, technically there IS a way of shutting down a RbP, but I’ve never seen anyone do it.
So, when designing software to run on a Raspberry, we need to take into account that a power-off could occur any minute. That being said, we can’t wait for the user to click [ save ] or to gracefully terminate their database connection. We’ll have to do this stuff on the fly.
There are many microcontrollers and programs with similar problems. The stereo in your car won’t know when you’re about to shut down the car, a rocket explodes instead of shutting down, and a sticky notes app will probably just run in the background until the computer is turned off.
How to design crash-only?
Good advice for designing crash-only would be to follow two rules:
1. When something changes, we immediately save it.
2. We don’t assume anything when starting up.
The first rule is easy to understand, and we’ll look at sticky notes for reference here. So, if the user inputs anything, the first thing we do is save the data. This sounds kinda nuts, I know. When you think about it, most software waits for you to tell it to save. Take GIMP for example. GIMP won’t save/export anything unless you explicitly tell it to, and in a way that is how production software should behave to be efficient. However, if you used GIMP on a system that could crash any moment, you’d surely be sad about losing an hour’s work in an instant. By saving user input directly, we make sure the delta between input and save is as small as possible. Worst case, a single letter is missing because the software crashed while saving the content between key presses.
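Here is a minimal sketch of rule #1, assuming a Swing text component like the one sticky notes uses. The class name `AutoSaver` and the temp file are made up for illustration and are not taken from the repository: a `DocumentListener` simply writes the complete text to disk after every single edit.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.swing.event.DocumentEvent;
import javax.swing.event.DocumentListener;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.PlainDocument;

// Hypothetical helper: persists a document's full text after every edit.
final class AutoSaver implements DocumentListener {
    private final Path target;

    AutoSaver(Path target) {
        this.target = target;
    }

    @Override public void insertUpdate(DocumentEvent e)  { save(e.getDocument()); }
    @Override public void removeUpdate(DocumentEvent e)  { save(e.getDocument()); }
    @Override public void changedUpdate(DocumentEvent e) { save(e.getDocument()); }

    private void save(Document doc) {
        try {
            String text = doc.getText(0, doc.getLength());
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException | BadLocationException ex) {
            // A failed save must not take the editor down; report and carry on.
            System.err.println("save failed: " + ex.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("note", ".txt");
        Document doc = new PlainDocument();
        doc.addDocumentListener(new AutoSaver(file));
        doc.insertString(0, "buy noodles", null); // this edit triggers a save
        System.out.println(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
    }
}
```

With a real text area, hooking it up would look something like `noteArea.getDocument().addDocumentListener(new AutoSaver(path))`, where `path` is whatever save location you pick.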
But it goes beyond that. Sticky notes will write out its position on the screen when it changes, and also its size and the last active tab. I have yet to try it out on multi-screen systems, and maybe this’ll break, but that leads us to rule #2:
We don’t assume things when starting. We go all the way and initialize like it’s our very first boot. That means we load in anything that we previously saved, but we’re not dependent on it. Take for example my attempt at loading the actual sticky notes:
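```java
public void loadNotes() {
    // May return null, e.g. on the very first run when nothing was saved yet
    noodlez = fileHandler.loadNotes();

    // Sanity checks: never depend on what was (or wasn't) saved before
    if (noodlez == null) {
        noodlez = new ArrayList<String>();
        noodlez.add("Welcome to noodleZoup");
    }
    // Pad with empty notes so that there are always at least 5 entries
    for (int c = noodlez.size(); c < 5; c++) {
        noodlez.add("");
    }

    // Show the note of the last active tab and highlight its button
    noteArea.setText(noodlez.get(config.getActiveTab()));
    buttons.get(config.getActiveTab()).setBackground(Color.orange);
    buttons.get(config.getActiveTab()).setForeground(Color.DARK_GRAY);
}
```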
Here, we use our file handler to load data, but this data could also be null, the same as if we just did our first run on a new system. In this case, we need to create the objects we want to use: one ArrayList holding 5 notes in this case.
The config in this example works in exactly the same way, the initialization just took place at an earlier stage. The point I’m trying to make is, I could’ve just required the user to have a file with 5 notes in it, or always load default data when starting up. But that is not the design philosophy here. We don’t want the user to install our application via an installer and have their system in a specific state that serves our purpose. What if we didn’t have any read/write permissions on the system? Are we expected to just fail and abort with a lame-ass error code, asking to “contact support for help”? Of course not!
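To make the “no assumptions” idea a bit more concrete, here is a small sketch of how a config value could be loaded. It is not the config class from the repository: `StartupConfig`, the `activeTab` key and the properties file format are all assumptions for illustration. Whatever is on disk (nothing, garbage, or a file we aren’t allowed to read), startup still ends with a usable value.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Hypothetical config loader: whatever is on disk, startup always succeeds.
final class StartupConfig {
    static final int TAB_COUNT = 5;

    // Returns the saved active tab, or 0 if the value is missing or invalid.
    static int loadActiveTab(Path file) {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        } catch (IOException | IllegalArgumentException | SecurityException ex) {
            // Missing, unreadable or malformed file: treat it like a first run.
        }
        try {
            int tab = Integer.parseInt(props.getProperty("activeTab", "0").trim());
            return (tab >= 0 && tab < TAB_COUNT) ? tab : 0;
        } catch (NumberFormatException ex) {
            return 0; // the file exists, but the value is garbage
        }
    }

    public static void main(String[] args) {
        // Works even though this file was never created.
        System.out.println(loadActiveTab(Paths.get("does-not-exist.properties")));
    }
}
```

The range check matters as much as the null check: a saved tab index that is out of range is just another thing we shouldn’t trust at startup.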
I encourage you to just take a look at these 4 classes and try implementing a piece of software like this yourself. I won’t go over the code in detail this time, because this isn’t about writing a basic text input, but rather a way of thinking.
When to apply Crash-only
Probably the hardest part is knowing when this style of coding is appropriate, and there are more use cases than meet the eye. As I already mentioned, a microcontroller is crash-only by design. Your Arduino needs to reach a well-defined state each time you power it up, and there can’t be any artifacts left from the last run before power went down. As seen in sticky notes, it is nevertheless useful or even required to save details about the last run, be that the actual notes or really any data.
As a rule of thumb, ask yourself: does my software have a goal? Usually, a microcontroller will just run forever, in contrast to, let’s say, PowerPoint or VS Code, which you would close after finishing your work. Another example would be a shell application using ssh. Each of these applications will sooner or later find itself in a [ finished ] state when it’s time to clean up and exit.
The shell itself, however, is a good example of crash-only software, since it just runs and waits without a [ finished ] state.
Also, there is software that is kind of similar to crash-only by design. I’m talking about video games. Usually, you’d select an option in the menu and instantaneously get a response from the software (vs needing to restart the whole game). Oftentimes games would rather use toggles than selectors, like [ autoaim ] or [ show tutorials ], which makes the game somewhat faster (to use). Especially in multiplayer games, selected options are directly saved somewhere (even if it’s on the server you’re playing on). And if the game crashes in the middle of a session, it’s normally no big deal to just rejoin a match. Also, your chosen options would normally be restored with ease, but even if they were gone, the game would just load up the default configuration.
I like to imagine a rocket. The rocket doesn’t really care whether it’s been locked on 5 times already, it has to function NOW. Also it would probably not even care what it’s carrying or whether it’s failing on its way to [ the moon ] / [ a military target ] - as soon as it’s restarting, it has to function properly again.
A good sci-fi example would be an android waking up, not knowing where or who it is. Would it just shut down again, going to sleep mode? That’d be pretty boring, right? It would work until it suffers critical system failure / no battery. Then after repairing, it would just start anew, with or without its old data.
As you probably already figured, oftentimes clean software design patterns and crash-only are one and the same, since good software should be able to recover from just about anything, even having no data at startup. However, companies often build way too complicated software with way too many requirements. Didn’t properly shut down subsystem A? Well, too bad a lock wasn’t released. Didn’t close database connection? Well, time to restart the whole system and walk through all regression tests again. I’m rambling here, but my point should be clear.
Optimizing the software
If you walked through my example code, you’d probably be wondering how to optimize it. After all, saving after each input doesn’t really sound that efficient. And you’d be right to assume that. So, how could we make the software more robust and efficient? Here are some thoughts, all hypothetical:
Write to file using pointers
This one should be self-explanatory. Writing the full data set is not very efficient and also gets slower the more data you collect. Sticky notes and my 2025 laptop can probably manage just fine, but in critical, real-time software with huge datasets, we’d be better off using pointers (offsets) into the files that we save, replacing only the parts that we changed. This adds complexity, but reduces overhead in return.
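A rough sketch of what “pointers into the file” could look like in Java, using `RandomAccessFile`. The fixed-slot layout, the `SlotStore` name and the 4096-byte slot size are assumptions made up for this example, not something sticky notes does: every note owns a fixed region of the file, so changing one note only rewrites that region.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical fixed-slot store: note i lives at byte offset i * SLOT_SIZE.
final class SlotStore {
    static final int SLOT_SIZE = 4096; // assumed size per note, incl. 4-byte length prefix

    // Overwrites only the slot of one note; the other notes are not touched.
    static void writeNote(Path file, int index, String text) throws IOException {
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        if (data.length > SLOT_SIZE - 4) {
            throw new IllegalArgumentException("note too large for its slot");
        }
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.seek((long) index * SLOT_SIZE); // the "pointer" into the file
            raf.writeInt(data.length);          // length prefix
            raf.write(data);
        }
    }

    // Reads one note; a slot beyond the end of the file or with a bad length is empty.
    static String readNote(Path file, int index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            long offset = (long) index * SLOT_SIZE;
            if (offset + 4 > raf.length()) return "";
            raf.seek(offset);
            int len = raf.readInt();
            if (len < 0 || len > SLOT_SIZE - 4 || offset + 4 + len > raf.length()) return "";
            byte[] data = new byte[len];
            raf.readFully(data);
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("notes", ".bin");
        writeNote(file, 0, "milk");
        writeNote(file, 2, "call mom");
        writeNote(file, 0, "milk, eggs"); // rewrites slot 0 only
        System.out.println(readNote(file, 0) + " / " + readNote(file, 2));
    }
}
```

The sanity checks in `readNote` are rule #2 again: the reader doesn’t trust the length prefix it finds on disk.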
However! We’d have to be extra careful not to mess up the save file and ultimately defeat the purpose of crash-only-software, e.g. by crashing halfway through an in-place write and leaving a half-written record (or a stale lock) behind. One of the benefits of rewriting the whole save file is that we can write the new version to a temporary file first and only swap it in for the old one once it is completely written.
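For the “rewrite the whole file” variant, a common way to keep the old save intact until the new one is complete is to write to a temporary file and then rename it over the target. Below is a minimal sketch of that technique; `AtomicSave` is a made-up name and this is not how the repository does it. Note that whether the rename is truly atomic depends on the file system, which is why there is a fallback.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: the target file is either the old or the new version, never half of each.
final class AtomicSave {
    static void save(Path target, String content) throws IOException {
        // The temp file sits in the same directory, so the rename stays on one file system.
        Path dir = target.toAbsolutePath().getParent();
        Path tmp = Files.createTempFile(dir, "save", ".tmp");
        try {
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            try {
                Files.move(tmp, target,
                        StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException ex) {
                // Fallback for file systems without atomic rename (weaker guarantee).
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            Files.deleteIfExists(tmp); // only does something if the move failed
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("notes");
        Path target = dir.resolve("notes.txt");
        save(target, "first version");
        save(target, "second version");
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

This sketch doesn’t force the data to the physical disk before renaming. For real power-loss safety you’d likely also want to flush the temp file (for example via `FileChannel.force`) before the move.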
Add a writing delay
This one is pretty easy to achieve, too. Just build a timer that delays writing. Ideally, the timer receives a kind of signal so it knows we are going to write data in the next cycle. After the first signal is received, all further calls just return until the write has happened. This reduces overhead, at the price of a slightly bigger window of unsaved input.
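The delay idea could be sketched like this, using a `ScheduledExecutorService` as the timer. `DelayedWriter` and `signalChange` are hypothetical names and the 200 ms delay is just an assumption: the first signal schedules a write, every further signal within that window returns immediately.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical debouncer: many change signals within one delay window cause one write.
final class DelayedWriter {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delayed-writer");
                t.setDaemon(true); // don't keep the JVM alive just for the timer
                return t;
            });
    private final AtomicBoolean pending = new AtomicBoolean(false);
    private final Runnable writeAction;
    private final long delayMillis;

    DelayedWriter(Runnable writeAction, long delayMillis) {
        this.writeAction = writeAction;
        this.delayMillis = delayMillis;
    }

    // Called on every change. Only the first call per window schedules a write.
    void signalChange() {
        if (pending.compareAndSet(false, true)) {
            timer.schedule(() -> {
                pending.set(false); // reset first, so changes made during the write get their own write
                writeAction.run();
            }, delayMillis, TimeUnit.MILLISECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayedWriter writer = new DelayedWriter(() -> System.out.println("writing to disk"), 200);
        for (int i = 0; i < 50; i++) {
            writer.signalChange(); // a burst of key presses
        }
        Thread.sleep(500); // give the daemon timer thread time to run the write
    }
}
```

The trade-off is the one you’d expect: with a 200 ms delay, a crash can cost you up to 200 ms of typing instead of a single letter.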
Implement multithreading
This will add to the stability of our software. Ideally, design each thread crash-only, so that in case of a crash of a subcomponent, the program can eventually recover. Try making the threads independent of one another.
However! Overdoing this will leech CPU time and won’t help. In the case of our sticky notes, 3 threads would be the maximum I’d create. One for our frame and the GUI, one for file reading/writing and maybe one for logic. If we went full ham here and did a thread for every panel or component, we wouldn’t be doing ourselves a favor, and it would also get pretty hard to maintain (not to mention all the thread synchronization and signal handling, yikes).
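As a sketch of the “one thread for file reading/writing” idea: a single worker thread pulls save jobs from a queue, and a job that blows up doesn’t take the worker (or the GUI) down with it. `SaveWorker` is a made-up class for illustration and not part of the sticky notes code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical I/O worker: one thread handles all save jobs and survives failing ones.
final class SaveWorker {
    private final BlockingQueue<Runnable> jobs = new LinkedBlockingQueue<>();
    private final Thread thread = new Thread(this::loop, "save-worker");

    SaveWorker() {
        thread.setDaemon(true); // the worker never blocks application exit
        thread.start();
    }

    // Called from the GUI thread; returns immediately.
    void submit(Runnable job) {
        jobs.add(job);
    }

    private void loop() {
        while (true) {
            try {
                jobs.take().run(); // blocks until there is something to do
            } catch (InterruptedException ex) {
                return; // asked to stop
            } catch (RuntimeException ex) {
                // One broken job must not kill the worker; log and continue.
                System.err.println("save job failed: " + ex);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SaveWorker worker = new SaveWorker();
        worker.submit(() -> { throw new IllegalStateException("disk full"); });
        worker.submit(() -> System.out.println("second job still runs"));
        Thread.sleep(200); // give the daemon worker time to process both jobs
    }
}
```

Because there is only one worker, save jobs run in the order they were submitted, which avoids a lot of the synchronization headaches mentioned above.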
If you have anything to add, feel free to leave a comment, I will add it to this list. Kudos!
Final thoughts
As always, one should consider when and how to apply these design philosophies. A Game Boy probably doesn’t care which cartridge is plugged in (or whether one is in at all), but a robot arm might want to nope out of starting when it’s in a dangerous position from not properly shutting down before. A car should probably not start running when there is a wheel missing, even if it’s possible to recover from that state.
Point is, there are situations where you can’t start, and there are situations where a graceful shutdown is required. As with every design philosophy, don’t use crash-only as a golden hammer.
Thanks for reading through till the end, I hope I could impart what I understand as crash-only-software and you took something useful away from here.
See you, Space Cowboy
