Prevent Software Disasters by asking ‘What if?’

Dan Abel

Software fails in production far more than it should - yielding to the pressures of use, whether intentional or accidental.

Here's a product team game (and process) to help teams identify and avoid the disasters that might befall their product.

If you want to help your team gain a shared perspective and vocabulary on what’s important, and to build plans to make their software more resilient, keep reading!

Before I present the game, let me tell you why it's so useful.

When we make software, we forget the important

When building software we think in specifics: the things it should do for a user; the problem it solves. Great software needs to be more than that.

It needs to work night and day. Scale to support many, many users, whilst protecting secrets and assets. And when some disaster looms, it needs to be resilient enough to not crumble.

This does not always happen. Many are the tales of software failing. Breaking under load. Oversharing secrets. Data loss.

We can do better

Teams deliver better when they think and talk about disasters to be avoided. A shared understanding and vocabulary of the operational challenges leads to better designs, prioritisation and software.

In addition, this allows teams to take greater ownership, often leading to learning and a more rugged (and less risky) system in production.

The Game

The aim of this game is to capture the biggest risks to your product. You’ll do this by working through a list of ‘what if’ disaster cards and wildcard challenges. You’ll quickly talk about which ones apply and rate them by how disastrous they could be.

Image: two ‘what if’ cards asking 'What if an important customer disputes our data or costs?' and 'What if the service is used far more than expected?'

You’ll then rank your cards to find the key disasters your team needs to guard against. These then can contribute to your backlog of work or your risk action plans.

What you’ll need

You’ll need the deck of what-if cards, your product team and a space to play. Playing in person or remotely both work, though the setup is slightly different.

In real life: you’ll need the cards printed out, some Sharpies and Blu Tack, a table to work on and a whiteboard to rank and display your disasters on.

To play remotely you’ll need to set up a remote whiteboard with the set of cards, a play area and a place to rank and display your disasters - here’s an example Mural board.

Image: the start of the game, showing cards, a graph and a reminder of the game rules

Somewhere in your space you’ll need to draw out a big graph with ‘Impact’ and ‘Likelihood’ as the two axes.

You should set aside the Delivery and Scaling Challenges cards for your first game. They are to be used for extended games - details at the end.

How to play

Step 1: Get everyone on the same page

Before the game can properly start, everyone in the group needs clarity on what the group is de-risking. Clarify the product or feature and how it delivers value by answering three questions:

  1. Who are its users?
  2. How do the users get value?
  3. How is it funded?

Step 2: Find your key disasters

The group should pick three cards and for each one ask: ‘What if?’

After a brief discussion (writing notes on the cards can be helpful) the group should gauge the risk of this disaster to their product. Note that financial, reputational, exposure, loss, or health and safety risks are all valid - this is not just about technical disasters!

Ask:

  1. What's the likelihood and impact?
  2. What controls and mitigations would help?

Once the three cards are reviewed the team should place them on the likelihood and impact graph - deciding which of them looks most disastrous.

Repeat swiftly

You should be able to work through all your cards in sets of 3, building a map of your quantified risks.


You may come across cards where you already have preventions in place - my tip here is to record that, but rate the card as if it were undefended. This shows you how important that defence, and maintaining it, really is.
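If you play remotely, or want to keep the results after the session, the graph can also be captured as data. The sketch below is one possible way to do that and is not part of the game itself: the 1 to 5 scales, the likelihood times impact score, the example ratings, and the names `Card` and `rank_cards` are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Card:
    """One 'what if' card as placed on the likelihood/impact graph."""
    question: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (disastrous)
    # Existing defences are recorded but deliberately do not lower the
    # rating, so the card is scored as if it were undefended.
    existing_defence: str = ""

    @property
    def score(self) -> int:
        # Assumption: likelihood x impact as a simple ranking heuristic.
        return self.likelihood * self.impact


def rank_cards(cards: list[Card]) -> list[Card]:
    """Return cards ordered from most to least disastrous.

    Ties on score are broken by impact, highest first.
    """
    return sorted(cards, key=lambda c: (c.score, c.impact), reverse=True)


# Hypothetical ratings for the two example cards.
cards = [
    Card("What if an important customer disputes our data or costs?", 2, 4),
    Card("What if the service is used far more than expected?", 3, 5,
         existing_defence="autoscaling"),
]

for card in rank_cards(cards):
    print(card.score, card.question)
```

A numeric score is only a conversation aid. The team's discussion of where each card sits on the graph matters more than the number.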

Step 3: Review learning

Once out of cards, the team should review their ranked disasters.

Are there surprises? Questions? Worries?

Build an action plan

Pick out the critical disasters to be avoided. Ask what can be done to manage, mitigate or control them.

You have your disasters. What then?

I’ve found two good places for managing and progressing software and product operational risks.

Risk Registers

Risk registers are a place to record and manage key dangers - those that are important to understand, monitor, mitigate and control.

Risk registers are super useful for sharing risks outside of a delivery team - they give the wider business a heads up of both the most tricksy risks the team needs to deal with, as well as risks that go beyond the team's ability to control or guard against.
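As a rough illustration, a register entry can be as small as the record below. This is a sketch under assumptions, not a standard format: the field names, the `Status` values, the `needs_escalation` helper and the sample entries are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    MITIGATED = "mitigated"
    ACCEPTED = "accepted"


@dataclass
class RiskEntry:
    """One line in a (hypothetical) risk register."""
    disaster: str
    owner: str
    mitigation: str
    status: Status = Status.OPEN
    # False for risks beyond the team's ability to control or guard against.
    within_team_control: bool = True


def needs_escalation(register: list[RiskEntry]) -> list[RiskEntry]:
    """Open risks the team cannot control, to share with the wider business."""
    return [
        entry for entry in register
        if entry.status is Status.OPEN and not entry.within_team_control
    ]


# Sample entries, invented for illustration.
register = [
    RiskEntry("Service is used far more than expected", "delivery team",
              "load test and autoscale", status=Status.MITIGATED),
    RiskEntry("Important customer disputes our data or costs", "account lead",
              "agree an audit trail with the customer",
              within_team_control=False),
]

for entry in needs_escalation(register):
    print(entry.disaster, "->", entry.owner)
```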

Operational Requirements (and Service Levels)

If the team can manage, control or mitigate the risk through technology, infrastructure and software, it should become a requirement that can be built and measured.

Not all risks can be managed by building a feature. Your risks might guide you to put in place infrastructure, SLOs and production monitoring. You might find cross-cutting concerns that can be managed through scanners, linters or checklists.
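To make "built and measured" concrete, here is a minimal sketch of checking a request-based SLO against production counts. It assumes the team has picked a success-rate target and can count total and failed requests. The function name and the example figures are hypothetical.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the required success rate, e.g. 0.999 for 99.9%.
    1.0 means no budget used; 0.0 means exactly used up; a negative
    value means the SLO has been breached.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if failed == 0 else float("-inf")
    return 1.0 - failed / allowed_failures


# Hypothetical month: 1,000,000 requests, 250 failures, 99.9% target.
# About 1,000 failures are allowed, so roughly three quarters of the
# budget is left.
remaining = error_budget_remaining(0.999, total=1_000_000, failed=250)
print(f"{remaining:.0%} of the error budget remains")
```

A check like this can feed an alert or a regular review, so a risk found in the game turns into a number the team watches.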

And that’s it?

Not quite. This is just the start. Most software gets built in iterative and incremental ways. Risk avoidance needs to work the same way - threaded through delivery.

With the team now thinking about the disasters that might happen to their product, the conversation should continue as work is done. The team might want to put checkpoints in place to review their decisions as they learn more, or use production measurements, reports and monitoring to indicate upcoming problems or successful controls.

Many teams have regular learning and review sessions where discussions might fit well. If not, why not try it?

Appendix 1: Running bigger games and extensions

The ‘What If?’ card deck also includes a set of ‘Delivery and Scaling Challenges’. These can be used for an extended Game of Disasters that also looks at challenges to your delivery plan.

Playing the Delivery and Scaling extension

To play this version you will need to add extra time for the group to think beyond the operational risks and consider the delivery commitments and work.

As well as describing the product, the team playing needs to have a shared vision of the delivery. You can do this by drawing out a future timeline on a large whiteboard.

Build a reference for discussion. Have the whole team help, by adding future team happenings, external events, commitments and expectations.

Once drawn out and discussed, play the game as usual, using the delivery risk cards and wildcards.

Playing with the full set of cards

It’s possible to use the full deck of cards to look for operational and delivery risks at the same time. Expect this to take a few hours - plan for snacks and breaks!

Appendix 2: Resources, further reading and research

Resources

Further reading

This game is similar to a few other risk discovery 'games'.
