Last week Amazon Web Services (AWS), likely the world’s largest cloud provider, suffered an outage that disrupted thousands of websites and services around the world.
We’d like to take you through what happened, and how you can minimise disruption to your own systems.
Many of last week’s headlines were alarmist:
- Amazon outage breaks large parts of the internet – Engadget
- Internet down: Many of the world’s biggest websites stop working… – The Independent
- Amazon cloud service outage breaks parts of the internet – LA Times
And technically, they’re true. Many websites and services stopped working. Ultimately, it came down to a typo by an AWS employee. AWS’s size and swift growth then exacerbated the problem. It had grown so big that staff weren’t as familiar with how its systems would behave, and some maintenance tasks hadn’t been undertaken in “many years”. Remediation therefore took longer than it otherwise would have.
But it’s not really AWS that “broke” things. If the affected websites and services had been built with good redundancy and business continuity planning, they probably would have stayed up. That was the real problem: a lack of planning, and of asking the “what-if” questions.
Cloud is fallible – but you can minimise the risk
When considering availability and reliability, there are some great advantages to cloud.
- Data centres storing your files typically have back-up power and multiple internet connections. That increases the chances of the services remaining available.
- Backups can be automated by the provider.
- Cloud providers can keep copies of your data and systems in more than one data centre. If problems impact one, the other(s) can take over.
But you can’t just assume you’ll get these benefits by signing up to a provider. You need to check and confirm they’re offered, as part of a comprehensive business continuity plan.
Hosting your data and services in multiple locations – with good backups – is the best approach
AWS only had problems in one “region”, meaning the data and services hosted in one geographic location. But the impact was particularly big, because it’s the default region you get when you sign up to AWS.
Many organisations don’t expand to another region, either because they don’t know they can, or because it’s cheaper to stick with one. But that saving can cost a business a lot down the line.
Signing up for multiple regions helps. And last week, organisations with the same data in more than one location didn’t have the same problems.
A great example of this is our documentation host IT Glue. IT Glue uses AWS, but has designed its services so that one region going down doesn’t stop them. (We also store backups of our IT Glue documentation here, just in case a bigger problem hit them.)
Another of our partners, ShareFile, did have problems. ShareFile powers our file-sending and sharing service CloudDrive. While our CloudDrive clients’ files aren’t stored on AWS (they’re right here in New Zealand), the authentication service does run from there. That meant that while the files were safe and available here in NZ, the login to access them was slow.
We’ve asked ShareFile if and how they plan to fix this. But the way we’ve designed things, the files themselves were never in peril. Had the problems continued, we could have retrieved them for clients from the multiple locations in which they’re stored.
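For the technically minded, the idea of retrieving a file from more than one location can be sketched in a few lines of Python. This is only an illustration of the failover principle, not how ShareFile, IT Glue or AWS actually implement it. The names `fetch_with_failover`, `primary_region` and `secondary_region` are hypothetical, and the two "regions" are stand-in functions rather than real storage services.

```python
from typing import Callable, Sequence, Tuple


class AllLocationsFailed(Exception):
    """Raised when no storage location could return the file."""


def fetch_with_failover(
    key: str,
    locations: Sequence[Tuple[str, Callable[[str], bytes]]],
) -> Tuple[str, bytes]:
    """Try each location in order and return the first successful result.

    `locations` is a list of (name, fetch_function) pairs. Each fetch
    function is assumed to raise an exception when its location is down.
    """
    errors = []
    for name, fetch in locations:
        try:
            return name, fetch(key)
        except Exception as exc:  # any failure means "try the next location"
            errors.append(f"{name}: {exc}")
    raise AllLocationsFailed("; ".join(errors))


# Hypothetical stand-ins for two storage locations.
def primary_region(key: str) -> bytes:
    raise ConnectionError("region unavailable")  # simulates the outage


def secondary_region(key: str) -> bytes:
    return b"contents of " + key.encode()


used, data = fetch_with_failover(
    "report.pdf",
    [("primary", primary_region), ("secondary", secondary_region)],
)
print(f"Served from: {used}")
```

In this sketch the first location fails, so the file is served from the second. The same pattern only helps if the copies in each location are kept in sync, which is where good backups and replication come in.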
When designing your system, think about what would happen if it failed
When designing a system, it’s best to draw on knowledge and advice from people who’ve been there. And it’s that knowledge and advice we strive to give you. We go through the business continuity implications of any option, be it cloud, on-premises or hybrid.
We can’t guarantee 100% availability – no one can. But we can maximise that availability. And we can minimise your risk of lost time, data and productivity. You’ll keep working, even if something in the system isn’t.