Master of disaster

Andrew Plummer, Senior Developer

When you paddle in the murky waters of business continuity, you must surely begin to imagine sharks lurking in every rockpool. After all, an event that ultimately spells disaster for a business can come from something huge – such as the recently very pertinent risk of fire or flood – or from something apparently insignificant – a tiny bug worming its way into the system, perhaps.

So, given that Isotoma senior devops engineer Andrew Plummer is knee-deep in business continuity approaches for our Amazon Web Services (AWS) clients, how are his nerves?

“My job isn’t really about guaranteeing to our customers that they’re protected come what may,” he says. “It’s more about making sure they’ve got the full picture about what measures we’ve taken and what those measures protect against. It’s ultimately over to them to judge what’s suitable for their business – and their budget.”

Andrew’s recent wholesale review and update of our approach to business continuity means our clients can (to stretch the metaphor to breaking point) swim along fairly happily, knowing where the dangers are.

But how robust are your own plans, should disaster strike? Have you discussed with your cloud developers what you’d do if something went wrong? No? Well, it’s all about balancing likelihood with investment.

“We frame business continuity for our AWS clients by thinking about what sorts of disasters might happen and what sorts of prevention steps you could take,” explains Andrew. “Then, together, we weigh up if these steps are worth it, based on the likelihood of that disaster, the potential impact on their business and what that could mean for their own customers, and their reputation.”

Even with the most bullet-proof systems, things can go wrong. At the lowest end, a developer could write a bug that gets into production, causing the application to delete some information from your database that wasn’t meant to be deleted.

The solution? Configure the database to take backups and keep them for a while.

As Andrew says, “The likelihood of this happening is relatively high and the cost of prevention is very low. So, we just do this as a matter of course.”
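To make that concrete, here is a minimal sketch of what "take backups and keep them for a while" might look like. It assumes the database is an Amazon RDS instance managed through boto3, which the article does not state; the function name, the instance identifier and the 14-day figure are all hypothetical.

```python
def set_backup_retention(rds_client, db_identifier, days=7):
    """Ask RDS to keep automated backups for `days` days.

    `rds_client` is assumed to be a boto3 RDS client, e.g. boto3.client("rds").
    RDS accepts a retention period of 0 to 35 days, where 0 switches
    automated backups off, so this sketch insists on at least 1.
    """
    if not 1 <= days <= 35:
        raise ValueError("retention must be between 1 and 35 days")
    return rds_client.modify_db_instance(
        DBInstanceIdentifier=db_identifier,
        BackupRetentionPeriod=days,
        ApplyImmediately=True,
    )


# Example usage (needs AWS credentials; the identifier is made up):
#   import boto3
#   set_backup_retention(boto3.client("rds"), "example-app-db", days=14)
```

How long "a while" should be is a business decision: a longer retention period costs a little more storage, but gives you longer to notice that a bug has been quietly deleting data.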

But at the other extreme, what would you do if AWS was somehow totally vaporised? How could you guard against that? Well, you’d have to back everything up entirely to somewhere outside AWS and make sure the application could be completely restored.

“In practice, few of our customers would want to go to the kind of effort you’d need for the full ‘multi-cloud’ approach – especially given the very low likelihood of AWS disappearing altogether,” says Andrew. “Multi-cloud spreads your system across, for example, AWS and Azure, so if either one hits the skids in any way, the other takes over seamlessly, with no impact on the service you’re providing to your customers.

“And although that takes very deep pockets, a more cost-effective compromise could be, for example, a weekly database backup from AWS to Azure.”

As this compromise shows, the important work lies in the myriad possibilities that sit between two poles: doing nothing, and investing in the full belt, braces and more belts and braces approach of multi-cloud. How much you spend on prevention and backups will be different for everyone, depending on your business case.

Brief guide to AWS global infrastructure

Before we dive into everything that could go wrong, let’s take a quick detour into the AWS global infrastructure, to help us understand where the risks are.

Broadly, AWS is made up of AWS Regions and Availability Zones (AZs). AWS accounts are logically isolated collections of resources, while control planes do the work of managing those resources within the physical Regions and AZs. Part of the control plane’s job is to enforce that logical isolation and make sure those different accounts are run separately – and safely.

Here’s a very top-level explanation.

Availability Zones: One or more data centres, housed separately, each with its own backup power supply, networking and connectivity.

AWS Regions: Geographical areas, made up of multiple Availability Zones.

AWS control plane: The machinery involved in making changes to a system and making those changes effective in the right place.

What could go wrong?

Let’s take a lighthearted look at the all-too-serious potential disasters that could, theoretically, bring down your application:

Loss of AWS

Weird billionaire buys AWS and it is somehow permanently destroyed.

Loss of AWS Region

Comet destroys a region.

AWS control plane compromise

A control plane is deliberately tripped up by a country under hostile dictatorship, which can then see and change everything across AWS.

Ransomware

Data is encrypted and ransomed. That pesky dictatorship strikes again.

Data centre fire

Fire destroys a data centre, permanently taking out an Availability Zone.

Attrition

A server dies, losing its data.

Transient

Weird, flappy failure in AWS that persists for an unknown amount of time.

Fat finger

Someone with a fat finger accidentally deletes some resources.

Compromise

An attacker steals your admin login and changes the data within the system.

Software failure

A piece of software goes wrong and it deletes or corrupts some data.

What can I do?

This array of potential horrors gives you a framework to ask the two key questions:

For each application you have in AWS, what measures are already in place, and which disasters from the list above do they defend against?
What else could be done to defend against the disasters you’re not protected from? How much would that cost?
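One way to work through those two questions is to write the answers down as a simple coverage table. The sketch below is only an illustration: the disaster names come from the list above, but the measures and the disasters each one is assumed to cover are hypothetical placeholders, not a statement of what any particular setup protects against.

```python
# Disaster names taken from the list in this article.
DISASTERS = [
    "Loss of AWS",
    "Loss of AWS Region",
    "AWS control plane compromise",
    "Ransomware",
    "Data centre fire",
    "Attrition",
    "Transient",
    "Fat finger",
    "Compromise",
    "Software failure",
]

# Hypothetical measures for one application, and the disasters each is
# ASSUMED to defend against. Replace with your own agreed answers.
MEASURES = {
    "Automated database backups": {"Software failure"},
    "Weekly database backup outside AWS": {"Loss of AWS", "Loss of AWS Region"},
}


def coverage(disasters, measures):
    """Map each disaster to the list of measures that defend against it."""
    return {
        disaster: [name for name, covers in measures.items() if disaster in covers]
        for disaster in disasters
    }


def uncovered(disasters, measures):
    """Return the disasters that no measure defends against, in list order."""
    covered = set().union(*measures.values())
    return [d for d in disasters if d not in covered]


if __name__ == "__main__":
    for disaster, defences in coverage(DISASTERS, MEASURES).items():
        print(f"{disaster}: {', '.join(defences) or 'NOT COVERED'}")
```

The uncovered list is the starting point for the second question: for each gap, what could be done, and what would it cost?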

“As well as helping clients to think about what could go wrong, the conversation is also an opportunity to clear up adjacent risk – and any conclusions that people might have leapt to,” says Andrew.

“For example, we can explain what’s in and out of scope and understand what data is critical for business continuity, and where that’s stored. This would include the code for their application, plus important documentation such as user guides or customer onboarding.

“Our goal is to be explicit and flush out any assumptions. But for the customer, the goal is to help them worry constructively about the right things and make an informed choice, picking solutions from a menu that they understand. Ultimately, we want everyone to relax, knowing that everything reasonable has been taken care of.”

AWS is popular because it’s flexible, fast, cost-effective – and robust. Thinking about business continuity is like writing a will; no one wants to think about bad times or sad times. But a realistic chat, with no scaremongering, about mitigation could be the most important conversation you have all year. And if you’d like to have it with us, get in touch.
