Here is an interesting question. How can the undisputed leader in a category, who is experiencing rapid growth, also be guilty of some combination of neglect and arrogance that may damage the reputation and therefore the future success of the category in its entirety? First of all, the details. On Monday night (Christmas Eve) starting at around 3:30 PM US Eastern Time, applications using the Elastic Load Balancing Service (EBS) at Amazon’s US East data center in Virginia experienced outages. Those applications included Netflix, Scope, and the PaaS cloud Heroku.
Amazon’s Position in the Public Cloud Computing Market
The Wall Street Journal quoted some research from Baird Equity Research that said estimated the AWS contributed $1.5B in revenue to Amazon this year, about triple what it contributed in 2010, and Baird further estimated that AWS revenue will double to $3B in two years. Although comparable numbers for other public cloud computing vendors are hard to come by, these numbers arguably make AWS into both the revenue share and unit share leader of the public cloud computing market. Netflix is quoted in the same WSJ article that it relies upon AWS for 95% of it needs for computation and cloud storage. It has been separately reported that Netflix runs over 5,000 concurrent Amazon images in various Amazon data centers. Other high profile online web properties like Foursquare, Pinterest, and Scope also apparently rely either heavily or exclusively upon AWS.
So we have a very interesting situation. We have a vendor, Amazon, whose service is so flexible and affordable that putting tactical workloads that do not need constant availability and constant excellent response time on that service is nearly a no-brainer. And we have companies whose very revenue and existence depends upon continuous availability and excellent user experience relying almost exclusively upon this service.
These issues need to be looked at in light of Amazon’s SLA. Amazon’s SLA was last updated in October of 2008 (which in and of itself indicates a problem), and states “AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year“. Let’s analyze this SLA in light of the Christmas Eve outage:
- Amazon states that it will use “reasonable commercial efforts” to meet this SLA. That give Amazon an escape for any outage. Amazon can simply say that it used reasonable commercial efforts and the outage happened anyway, so tough luck. It is not known whether or not Amazon has ever invoked this excuse to avoid giving service credits, but the escape clause exists.
- Amazon states that it will provide 99.95% up time for a calendar year. That allows for (1-.9995)*365*24 or 4.38 hours of downtime in a year. The Christmas Eve outage apparently lasted a day and a half (36 hours). So we have to assume Netflix and other customers got some service credits. But obviously, the value of those credits pale in comparison to the damage in terms of revenue and reputation that occurred to Netflix and other online properties.
However the fact that your service can be down for 4.38 hours a year on Amazon and that Amazon stays within its SLA under these circumstances is not the real problem. The real problem is that Amazon has no SLA for performance. So Amazon can be up, but if resource contention of any kind in the Amazon infrastructure is at fault for the poor response time of an application running in the Amazon cloud, Amazon entirely washes its hands of any responsibility on that front.
Customer Reaction to Amazon Outages
The same WSJ articles that reported on the outages also reported that Amazon customers like Scope whose CEO was quoted as saying “I am looking into what options I have” are clearly looking to insulate themselves from the impact of Amazon outages upon their businesses. This is where the potential damage to Amazon in particular and public cloud computing in general starts to get real. At the other end of the spectrum from running in the Amazon cloud lies the option of standing up your own data center and taking control of your operational reliability and performance into your own hands. Many enterprises already pursue a strategy of “develop and test and Amazon, and then deploy internally”. In support of this approach Hotlink offers a management solution that allows for the seamless management of instances across VMware, Hyper-V and Amazon, and the seamless migration of instances between the three environments.
There is one other customer reaction to these outages which is even more dangerous to public cloud computing. That reaction on the part of the customer is to assume that it is the customer’s responsibility to code around the unreliability in the Amazon infrastructure. In the Netflix blog “Chaos Monkey Released Into The Wild“, Netflix chronicles how it tries to make its code resilient to failure, and how it has written a “Chaos Monkey” whose job it is to randomly take down individual Netflix services to ensure that the entire Netflix service is not vulnerable to any single point of failure. This same blog speculates that what Netflix really needs is a “Zone Monkey” that takes down an entire Netflix instance on an Amazon Zone and make sure that an entire Zone failure is a recoverable event (which it was not on Christmas Eve).
Public Cloud Computing Reliability is Not the Customer’s Problem
This is where Amazon’s apparent approach to reliability and performance endangers the whole notion of public cloud computing. Imagine if your electricity company said that it was up to you to buy a generator to cover your needs for electricity if the power went out. Imagine if your water utility said that it was up to you to keep a water tank in your back yard, in case the water supply went out. This entire idea that the vendor of the service does not stand behind the availability and quality of that service (as evidenced in Amazon’s worthless SLA), and that this it is somehow the customer’s responsibility to code and or design around the vagaries of the public cloud infrastructure is wrong and dangerous to the future of public cloud computing.
It is wrong and dangerous to the future of public cloud computing because it is going to create the perception in the minds of enterprise customers (who are somewhat skeptical of running important applications in public clouds anyway) that public clouds are not to be trusted with important workloads. Since Amazon is the high profile market leader that it is in the public cloud market, Amazon’s failures to step up with a quality SLA is going to damage not just Amazon, but the entire notion of public cloud computing. The fact that a vendor like Virtustream offers a response time based SLA for SAP running in its cloud is just not going to matter if Amazon ruins the reputation of the entire public cloud computing concept.
Update – Amazon Explanation and Apology
On its blog, Amazon has issued an explanation and apology for the December 24 2012 ELB Service Event. The upshot is that a developer deleted state data from production servers thinking that he was only deleting it from non-production servers. Amazon has admitted that this occurred because of a flaw in their change management procedures (they did not require change management approval prior to the incident and now do), and have apologized for the mistake. This leaves Amazon struggling with the tradeoff between agility and change management just like many enterprises do, and also does not resolve the issue of the lack of a truly useful and meaningful SLA.
The Christmas Eve Amazon outage that resulted in Netflix being unavailable for 36 hours results from an unacceptable attitude on the part of Amazon towards reliability and performance. Unless Amazon steps up to the plate with a meaningful SLA, Amazon risks damaging its own growth, and the entire concept of public cloud computing.