Organizations are increasingly deploying in the cloud, either moving applications from datacenters to the cloud or creating new applications for it. Today, clouds convey an image of high availability, reliability and scalability unsurpassed in computing history. With sophisticated technology and advanced practices, vendors portray their SaaS, IaaS and PaaS solutions built “in the cloud” as almost impervious to disasters and other acts of God.
Except when they’re not.
Case in point: the recent violent thunderstorm that burst Amazon’s cloud. The resulting outage impacted Netflix, Instagram and Pinterest, as well as the PaaS solution Heroku and some of its clients.
More precisely, at 11:21pm EDT on June 29th, Amazon’s US-EAST-1 region data center in Northern Virginia lost power due to a violent thunderstorm. This caused one of the availability zones running Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) in that region to become unavailable. This in turn caused Amazon customers’ applications running in the impacted zone to become unavailable. While some customers’ services were restored in a few hours, others took almost a day.
Initially, all Amazon would say was that its primary and secondary generators had both failed, but it has since published a detailed post-mortem indicating that power loss and software bugs were to blame. To add insult to injury, this outage came two weeks after another outage of Amazon’s cloud services that was also caused by a power loss.
What’s interesting in these cases, as in so many disasters, is that the outage was preventable on multiple levels. Amazon’s competitor Joyent had a data center nearby and its customers weren’t impacted. While Amazon would never blame its customers for problems, it does advertise eight regions around the world where customers can deploy their applications to avoid this specific scenario.
Lessons for Deploying in the Cloud
Because we live in a world of natural disasters, we have to be prepared for contingencies. This is especially important if not only your production hosting but also all of your development and deployment tools are running in the cloud (true agile cloud development).
What can we learn from these recent events and how can you avoid the same fate as these high-profile applications? It turns out there are three deployment strategies that can be used to help you improve availability. All of these rely on the time-honored principle of not putting all your eggs in one basket:
- Multi-Region Deployments
- Multi-Vendor Deployments
- Hybrid Cloud Deployments
Multi-Region Deployments
The simplest and most cost-effective way is to deploy your applications across multiple, geographically dispersed regions within your cloud provider. If your customers are in the US, deploy your application to both East and West coast data centers. While you’ll need to evaluate the performance, cost and complexity impacts on your application and your customers’ experience, a little extra cost and complexity is certainly preferable to an outage.
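As a rough illustration of why a second region helps, the sketch below estimates combined availability under the simplifying assumption that regions fail independently. Real outages can be correlated, so treat the result as an optimistic upper bound. The 99.9% per-region figure and the function names are illustrative assumptions, not numbers published by any vendor.

```python
HOURS_PER_YEAR = 365 * 24


def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # All regions must be down at once for the application to be down.
    return 1.0 - (1.0 - per_region) ** regions


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is a hypothetical per-region availability chosen for illustration.
    for n in (1, 2):
        a = combined_availability(0.999, n)
        print(f"{n} region(s): availability {a:.6f}, "
              f"about {expected_downtime_hours(a):.2f} hours down per year")
```

Under these assumptions, a second region cuts expected downtime by roughly three orders of magnitude, which is why the added cost is usually easy to justify.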
Multi-Vendor Deployments
If you’re concerned about being solely reliant on a single vendor, the solution is to deploy your applications using multiple vendors’ cloud solutions, again ideally in geographically dispersed locations. While more expensive to set up and maintain, this solution insulates you from mistakes or outages at a single cloud provider.
Hybrid Cloud Deployments
Similar to multi-vendor deployments, deploying your application across private and public clouds also insulates you from issues at a single physical location. This is an ideal solution for organizations with existing data centers but wouldn’t be ideal for a new start-up with little or no infrastructure.
All of these solutions rely on load balancing, such as DNS-based geographic routing or Amazon’s ELB, to route traffic away from troubled instances. This in turn ensures your applications are available to your customers even when there’s an outage affecting one geographic region. Real-time Application Performance Monitoring is another critical piece to ensure your teams are notified the minute something is amiss so they can effectively handle the issue. Finally, a rehearsed process for what to do when a situation like this happens is also critical. DevOps teams need to prepare and drill for how to handle these situations so that when they occur (and they eventually will), teams know what to do, how to respond and, more importantly, how to communicate status to external stakeholders.
Although these components are more expensive and complex to set up and maintain, the costs typically pale when compared to the lost revenue and the impact on customer experience and brand during an outage. Perhaps this episode will serve as a wake-up call to other organizations leveraging the cloud to re-assess their deployment models and disaster recovery plans.
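The routing idea can be sketched in a few lines. The following is a minimal illustration, not any vendor’s actual algorithm: it assumes each regional endpoint reports a health-check result and a measured latency, and it picks the lowest-latency endpoint that is still healthy. The `Endpoint` class, the region labels and the latency values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Endpoint:
    name: str          # hypothetical region label
    healthy: bool      # result of the most recent health check
    latency_ms: float  # measured latency from the client's location


def pick_endpoint(endpoints: List[Endpoint]) -> Optional[Endpoint]:
    """Return the lowest-latency healthy endpoint, or None if all are down."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda e: e.latency_ms)


if __name__ == "__main__":
    endpoints = [
        Endpoint("us-east", healthy=False, latency_ms=20.0),  # simulated outage
        Endpoint("us-west", healthy=True, latency_ms=80.0),
        Endpoint("eu-west", healthy=True, latency_ms=120.0),
    ]
    chosen = pick_endpoint(endpoints)
    print(chosen.name if chosen else "no healthy endpoint")
```

In practice the health flag would come from your monitoring system, and the `None` case (every region down) is the scenario your rehearsed incident process needs to cover.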
Lessons for Platform as a Service (PaaS) Vendors
Infrastructure as a Service (IaaS) vendors such as Amazon can justifiably claim their customers are ultimately responsible for creating a deployment and disaster recovery plan using their infrastructure. Even so, the fact that Heroku went down when a single availability zone in one of Amazon’s regions went down is a poor reflection on Heroku.
This has implications for all PaaS vendors because their customers pay extra to have the complexity of high availability and disaster recovery details handled for them. While PaaS vendors can cite productivity gains and reduced costs from using their platforms, if they can’t effectively design for and implement highly available solutions, it calls into question some of their value proposition.
This incident also highlights the dependencies between PaaS solutions and their underlying IaaS partners in ensuring high availability. PaaS customers should ask for more details from PaaS vendors about how their solutions are designed for high availability and disaster recovery.
Organizations deploying their applications in the cloud must still plan for high availability and disaster recovery; these aren’t automatically taken care of just by moving to the cloud. Based on Amazon’s two outages in the last month, organizations may want to revisit their deployment strategies to ensure they could successfully withstand an outage with minimal impact on their customers.
Update July 16
In addition to Amazon, Netflix has posted its own post-mortem on the outage, indicating that an issue with its middle-tier load balancing software contributed to its customers’ outage.