By now, just about everyone’s been affected by – or at the very least, heard of – the Amazon EC2 (Elastic Compute Cloud) outage that occurred from April 21 to 22, 2011. Unfortunately, the incident took down many businesses and websites, including the well-known Heroku, Quora, Reddit, Hootsuite, Foursquare and Engine Yard. This event, combined with a number of other cloud failures, has brought the dependability of cloud services into question.
Experts and observers called the two-day EC2 outage the worst in the history of cloud computing. Regular users were surprised and outraged to learn how many websites and online services actually rely on EC2 to operate. While the entire EC2 service did not shut down, contrary to popular belief, significant portions of Amazon’s customer base were affected.
However, this was not the first time Amazon Web Services had gone down. Back in July 2008, Amazon’s Simple Storage Service (S3) experienced an eight-hour failure.
Amazon’s cloud infrastructure is organized into regions, each roughly analogous to a data center. For example, US-East-1 refers to Amazon’s Northern Virginia data center, and US-West-1 to the Silicon Valley data center.
Regions are divided into availability zones, each analogous to a cluster: a grouping of physical and logical resources. Availability zones are differentiated by letters, for example, US-East-1a and US-East-1b. The letter mappings are customer-specific, so one customer’s US-East-1a is not necessarily the same physical zone as another customer’s US-East-1a.
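The region/zone hierarchy can be sketched as a simple data model. The region and zone names below are real; the per-account shuffling is a hypothetical illustration of the idea that zone letters are assigned independently per customer, not Amazon’s actual assignment mechanism:

```python
# Hypothetical sketch of the region/availability-zone hierarchy.
# The shuffle below only illustrates that zone letters are
# account-specific; it is not Amazon's real mapping algorithm.
import random

REGIONS = {
    "us-east-1": ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1d"],
    "us-west-1": ["us-west-1a", "us-west-1b", "us-west-1c"],
}

def zones_for_account(region: str, account_id: str) -> list:
    """Return a region's zones as one particular account might see them.

    The mapping is deterministic per account but differs between
    accounts, mirroring the customer-specific letter designations.
    """
    zones = REGIONS[region][:]
    random.Random(account_id).shuffle(zones)
    return zones

# Two accounts see the same set of zones, possibly in different orders.
print(zones_for_account("us-east-1", "111111111111"))
print(zones_for_account("us-east-1", "222222222222"))
```

Both accounts see the same underlying zones; only the letter-to-zone correspondence differs, which is why “US-East-1a is down” meant different things to different customers during the outage.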
Amazon’s virtual machine service is the Elastic Compute Cloud (EC2). Clients who provision an EC2 instance receive an allocation of instance storage, which is transient: it exists only as long as the virtual machine instance does. For persistent storage, Amazon offers the Elastic Block Store (EBS), which is essentially network-attached storage.
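The transient/persistent distinction shows up directly when provisioning. As a hedged sketch, a CloudFormation template (the AMI ID and resource names here are placeholders, not real values) can attach an EBS volume configured to outlive its instance:

```yaml
# Hypothetical CloudFormation sketch: an EC2 instance with an attached
# EBS volume that persists after the instance is terminated.
# The AMI ID and resource names are placeholders.
Resources:
  AppServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-00000000           # placeholder AMI
      InstanceType: m1.small
      AvailabilityZone: us-east-1a
      BlockDeviceMappings:
        - DeviceName: /dev/sdf
          Ebs:
            VolumeSize: 100             # GiB
            DeleteOnTermination: false  # EBS data survives the instance
```

Instance storage, by contrast, is allocated implicitly with the instance and disappears with it; nothing in a template needs to be set for it to be lost.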
Lack of Communication a Problem
One of the most glaring issues in this situation was the lack of communication from Amazon. The company waited an unacceptable 40 minutes before posting the first status message about the outage, and the updates that followed were too vague to be useful to clients dependent on the service, with little of the background information customers needed.
For instance, Amazon mentioned “a small percentage of instances/volumes” and referred to “impacted availability zones” and “multiple availability zones,” rather than specifically pinpointing the affected zones. Some commented that it would have been helpful if Amazon had provided an overview with each status update, listing the functions still affected versus those already repaired.
Vulnerability & Risk Management
Many of Amazon’s customers run in a single region, so when the availability zones in the US-East-1 region went down, the impact of the failure was especially severe. According to Gartner analyst Lydia Leong, a long-time researcher in the IT industry:
“You have to architect your app to have continuous availability across multiple data centers, if it can never ever go down. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue… There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation – the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.”
Although many customers came away from this experience outraged, and understandably so, there are some lessons that we can take away from the Amazon cloud outage.
- Cloud outages are inevitable. Amazon’s service level agreement (SLA) promises 99.95% uptime, not 100%.
- If you’re in the cloud, you need to prepare disaster recovery and failover strategies. Customers who dispersed their infrastructure throughout different availability zones were not as negatively impacted as those who were completely dependent on a single zone.
- Customers must understand the SLAs before they agree to the contract. Many customers did not know the amount of uptime they were guaranteed until it was too late.
- Rely on experienced service providers who are familiar with architecting cloud solutions and can answer support questions.
- Don’t rely on blind trust. Up until Amazon’s high-profile cloud failure, many customers had simply hopped on the bandwagon and put their faith in the cloud.
- Management and maintenance are still necessary, even if organizations have moved to the cloud.
- Don’t assume that issues of resiliency, backup and disaster recovery are the responsibility of cloud providers.
“When you use a cloud service, whether you are consuming an application (backup, CRM, email, etc.), or just using raw compute or storage, how is that data being protected? A lot of companies assume that the provider is doing regular backups, storing data in geographically redundant locations or even have a hot site somewhere with a copy of your data. Here’s a hint: ASSUME NOTHING. Your cloud provider isn’t in charge of your disaster recovery plan, YOU ARE!”
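The multi-zone advice above can be sketched as a minimal failover routine. The zone names follow Amazon’s naming, but the endpoints and the health-check input are hypothetical stand-ins; in practice this role is played by a load balancer or DNS failover, not application code:

```python
# Minimal sketch of zone-aware failover, assuming a hypothetical
# per-zone health signal. Endpoints are placeholder hostnames.
ENDPOINTS = {
    "us-east-1a": "app-1a.example.com",
    "us-east-1b": "app-1b.example.com",
    "us-west-1a": "app-w1a.example.com",
}

def pick_endpoint(healthy_zones: set) -> str:
    """Return the first endpoint whose zone is reporting healthy."""
    for zone, endpoint in ENDPOINTS.items():
        if zone in healthy_zones:
            return endpoint
    raise RuntimeError("no healthy zone available -- invoke DR plan")

# If both us-east-1 zones are down, traffic fails over to us-west-1.
print(pick_endpoint({"us-west-1a"}))  # app-w1a.example.com
```

The point of the sketch is the last line: a deployment spread across zones (ideally across regions) degrades to a failover, while a single-zone deployment degrades to the `RuntimeError` branch, i.e. a full outage.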
In April 2011, Amazon’s cloud offering went down, leaving many popular services and websites unable to function and calling into question the reliability of cloud services. This article looks at the outage, Amazon’s cloud infrastructure and service offerings, and the responses the failure garnered. It highlights the importance of setting up risk management and disaster recovery strategies in advance.
CCSK Exam Preparation
In preparation for the Certificate of Cloud Security Knowledge (CCSK), a security professional should be comfortable with topics related to this post, including:
- Enterprise and Information Risk Management (Domain 2)
- Contract Enforceability (Domain 3)
- Disaster Recovery (Domain 7)
- Technical Support (Domain 8)