Organizations are increasingly deploying in the cloud, either moving applications from datacenters to the cloud or creating new applications for it. Today, clouds convey an image of high availability, reliability and scalability unsurpassed in computing history. With sophisticated technology and advanced practices, vendors portray their SaaS, IaaS and PaaS solutions built “in the cloud” as almost impervious to disasters and other acts of God.

Except when they’re not.

Case in point: the recent violent thunderstorm that burst Amazon’s cloud. The resulting outage impacted Netflix, Instagram and Pinterest. The failure also impacted the PaaS solution Heroku and some of its clients.

More precisely, at 11:21pm EDT on June 29th, Amazon’s US-EAST-1 region data center in Northern Virginia lost power due to a violent thunderstorm. This caused one of its availability zones running Elastic Compute Cloud (EC2), Elastic Block Store (EBS) and Relational Database Service (RDS) in that region to become unavailable. This in turn caused Amazon’s customers’ applications running in the impacted zone to become unavailable. While some customers’ services were restored in a few hours, others took almost a day.

Initially, all Amazon said was that its primary and secondary generators had both failed, but it has since published a detailed post-mortem indicating that power loss and software bugs were to blame. To add insult to injury, this outage came two weeks after Amazon experienced another outage of its cloud services, also caused by a power loss.

What’s interesting in these cases, as with so many disasters, is that the outage was preventable on multiple levels. Amazon’s competitor Joyent had a data center nearby and its customers weren’t impacted. While Amazon would never blame its customers for problems, it does advertise eight regions around the world where customers can deploy their applications to avoid this specific scenario.

Lessons for Deploying in the Cloud

Because we live in a world of natural disasters we have to be prepared for contingencies. This is especially important if not only your production hosting but all of your development and deployment tools are running in the cloud (true agile cloud development).

What can we learn from these recent events and how can you avoid the same fate as these high-profile applications? It turns out there are three deployment strategies that can be used to help you improve availability. All of these rely on the time-honored principle of not putting all your eggs in one basket:

  1. Multi-Region Deployments
  2. Multi-Vendor Deployments
  3. Hybrid Cloud Deployments

Multi-Region Deployments

The simplest and most cost-effective way is to deploy your applications across multiple, geographically dispersed regions within your cloud provider. If your customers are in the US, deploy your application to both East and West coast data centers. While you’ll need to evaluate the performance, cost and complexity impacts to your application and your customers’ experience, a little extra cost and complexity is certainly preferable to an outage.
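The benefit of a second region can be estimated with simple arithmetic. The sketch below assumes regions fail independently, which is optimistic because a shared software bug can affect several regions at once. The function names and the 0.999 per-region figure are illustrative assumptions, not numbers published by any vendor.

```python
def combined_availability(per_region: float, regions: int) -> float:
    """Probability that at least one region is up, assuming independent failures."""
    if not 0.0 <= per_region <= 1.0:
        raise ValueError("per_region must be between 0 and 1")
    if regions < 1:
        raise ValueError("regions must be at least 1")
    # All regions must be down at the same time for the application to be down.
    return 1.0 - (1.0 - per_region) ** regions


def expected_downtime_hours(availability: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours per year during which no region is serving traffic."""
    return (1.0 - availability) * hours_per_year


if __name__ == "__main__":
    # 0.999 is a hypothetical per-region availability chosen for illustration.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(f"{n} region(s): {expected_downtime_hours(a):.4f} expected hours of downtime per year")
```

Under these assumptions each added region multiplies the expected downtime by the per-region failure probability, which is why even one extra region changes the picture so much.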

Multi-Vendor Deployments

If you’re concerned about being solely reliant on a single vendor, the solution is to deploy your applications using multiple vendors’ cloud solutions, again ideally in geographically dispersed locations. While more expensive to set up and maintain, this solution insulates you from mistakes or outages at a single cloud provider.

Hybrid Cloud Deployments

Similar to multi-vendor deployments, deploying your application across private and public clouds also insulates you from issues at a single physical location. This is an ideal solution for organizations with existing data centers but wouldn’t be ideal for a new start-up with little or no infrastructure.

All of these solutions rely on DNS geographic load balancing, together with load balancers such as Amazon’s ELB, to route traffic away from troubled instances. This in turn ensures your applications are available to your customers even when there’s an outage affecting one geographic region. Real-time Application Performance Monitoring is another critical piece, ensuring your teams are notified the minute something is amiss so they can effectively handle the issue. Finally, a rehearsed process for what to do when a situation like this happens is also critical. DevOps teams need to prepare and drill for these situations so that when they occur (and they eventually will), teams know what to do, how to respond and, more importantly, how to communicate status to external stakeholders.
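The routing decision described above can be sketched in a few lines. This is a simplified illustration, not how any particular DNS or load balancing product works: the `Endpoint` type, the `pick_endpoint` function, the region labels and the addresses are all hypothetical, and the health check is passed in as a plain function so nothing is contacted over the network.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Endpoint:
    region: str    # hypothetical label such as "us-east" or "us-west"
    address: str   # placeholder address from a documentation range
    priority: int  # lower number means more preferred


def pick_endpoint(
    endpoints: List[Endpoint],
    is_healthy: Callable[[Endpoint], bool],
) -> Optional[Endpoint]:
    """Return the most preferred endpoint that passes its health check, or None."""
    for endpoint in sorted(endpoints, key=lambda e: e.priority):
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    endpoints = [
        Endpoint("us-east", "198.51.100.10", priority=1),
        Endpoint("us-west", "203.0.113.20", priority=2),
    ]
    down = {"us-east"}  # simulate an outage in the preferred region
    chosen = pick_endpoint(endpoints, lambda e: e.region not in down)
    print(chosen.region if chosen else "no healthy endpoint, notify the on-call team")
```

In practice the health check and the routing would be delegated to the DNS or load balancing service itself; the `None` case is where monitoring alerts and the rehearsed response process take over.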

Although these components are more expensive and complex to set up and maintain, these costs typically pale when compared to the lost revenue and the impact on customer experience and brand during an outage. Perhaps this episode will serve as a wake-up call to other organizations leveraging the cloud to re-assess their deployment models and disaster recovery plans.

Lessons for Platform as a Service (PaaS) Vendors

Infrastructure as a Service (IaaS) vendors such as Amazon can justifiably claim their customers are ultimately responsible for creating a deployment and disaster recovery plan using their infrastructure. The fact that Heroku went down when one of Amazon’s regions went down is still a poor reflection on Heroku.

This has implications for all PaaS vendors because their customers pay extra to have the complexity of high availability and disaster recovery details handled for them. While PaaS vendors can cite productivity and reduced costs using their platforms, if they can’t effectively design for and implement highly available solutions, it calls into question some of their value proposition.

This incident also highlights the dependencies between PaaS solutions and their underlying IaaS partners in ensuring high availability. PaaS customers should ask for more details from PaaS vendors about how their solutions are designed for high availability and disaster recovery.

Summary

Organizations deploying their applications in the cloud must still plan for high availability and disaster recovery; these aren’t automatically taken care of just by moving to the cloud. Based on Amazon’s two outages in the last month, organizations may want to revisit their deployment strategies to ensure they could successfully withstand an outage with minimal impact on their customers.

Update July 16

In addition to Amazon, Netflix has posted its own post-mortem on the outage, indicating that an issue with its middle-tier load balancing software contributed to its customers’ outage.

Ryan Shriver (9 Posts)

Ryan Shriver is a Managing Consultant with Dominion Digital, an award-winning process and technology-consulting firm. Based in Richmond, VA he leads their Innovative Products solution that helps people deliver the right products at the right time, quickly with high quality. With services related to experience, value, speed and quality, he helps clients holistically focus on improving their customer's experience.

Working with Internet technologies since 1995 and practicing Agile since 2001, Ryan has deep experience in systems architecture and large-scale agile product development. He's presented internationally on topics including Scaling Agile, Measurable Business Value and Agile Engineering. He's founder and program chair of Innovate Virginia and posts his thoughts, articles and conference presentations to the agile engineer.


1 comment for “Deploying in the Cloud: Lessons When Clouds Burst”

  1. July 9, 2012 at 9:12 AM

    Great article. And it all makes sense, right? I think a lot of customers don’t want to, or don’t feel they should, add the cost and complexity to get a certain level of high availability. Some of the big players should definitely follow the points you make, though everyone should consider them, especially if they think the cloud is going to give them MORE availability or disaster avoidance than they could provide themselves. I’d still fault Amazon to some degree because it seems that proper periodic planned testing could have prevented or caught these failure zones before the unplanned downtime occurred.
