Here is an interesting question: how can the undisputed leader in a category, one that is experiencing rapid growth, also be guilty of some combination of neglect and arrogance that may damage the reputation, and therefore the future success, of the category in its entirety? First, the details. On Monday (Christmas Eve), starting at around 3:30 PM US Eastern Time, applications using the Elastic Load Balancing (ELB) service at Amazon's US East data center in Virginia experienced outages. Those applications included Netflix, Scope, and the PaaS cloud Heroku.

Amazon’s Position in the Public Cloud Computing Market

The Wall Street Journal quoted research from Baird Equity Research estimating that AWS contributed $1.5B in revenue to Amazon this year, about triple what it contributed in 2010, and Baird further estimated that AWS revenue will double to $3B in two years. Although comparable numbers for other public cloud computing vendors are hard to come by, these numbers arguably make AWS both the revenue share and the unit share leader of the public cloud computing market. Netflix is quoted in the same WSJ article as saying that it relies upon AWS for 95% of its needs for computation and cloud storage. It has been separately reported that Netflix runs over 5,000 concurrent Amazon instances in various Amazon data centers. Other high profile online web properties like Foursquare, Pinterest, and Scope also apparently rely either heavily or exclusively upon AWS.

So we have a very interesting situation. We have a vendor, Amazon, whose service is so flexible and affordable that putting tactical workloads that do not need constant availability and constant excellent response time on that service is nearly a no-brainer. And we have companies whose very revenue and existence depend upon continuous availability and an excellent user experience relying almost exclusively upon this service.

Amazon’s SLA

These issues need to be looked at in light of Amazon's SLA. Amazon's SLA was last updated in October of 2008 (which in and of itself indicates a problem), and states: "AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year." Let's analyze this SLA in light of the Christmas Eve outage:

  • Amazon states that it will use "commercially reasonable efforts" to meet this SLA. That gives Amazon an escape for any outage: Amazon can simply say that it used commercially reasonable efforts and the outage happened anyway, so tough luck. It is not known whether Amazon has ever invoked this clause to avoid giving service credits, but the escape clause exists.
  • Amazon states that it will provide 99.95% uptime for a calendar year. That allows for (1 − 0.9995) × 365 × 24, or 4.38 hours, of downtime in a year (see the sketch just after this list). The Christmas Eve outage apparently lasted a day and a half (36 hours). So we have to assume Netflix and other customers got some service credits. But obviously, the value of those credits pales in comparison to the damage in terms of revenue and reputation that Netflix and other online properties suffered.
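
To make the arithmetic concrete, here is a minimal Python sketch (my own illustration, not anything Amazon publishes) of how many hours of downtime a given annual uptime percentage permits:

```python
# A service year here is treated as 365 days (8,760 hours).
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_pct):
    """Hours of downtime permitted by a given uptime percentage over one year."""
    return (1 - uptime_pct / 100) * HOURS_PER_YEAR


print(allowed_downtime_hours(99.95))  # ~4.38 hours allowed per year
print(allowed_downtime_hours(99.5))   # ~43.8 hours at the often-misread 99.5%
```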

However, the fact that your service can be down for 4.38 hours a year on Amazon while Amazon stays within its SLA is not the real problem. The real problem is that Amazon has no SLA for performance. So Amazon can be up, but if resource contention of any kind in the Amazon infrastructure is at fault for the poor response time of an application running in the Amazon cloud, Amazon entirely washes its hands of any responsibility on that front.
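
Because Amazon offers no such guarantee, any response-time objective has to be defined and measured by the customer. As an illustration only, here is a minimal sketch of a customer-side probe; the URL, threshold, and sample count are hypothetical placeholders, not anything Amazon specifies:

```python
import time
import urllib.request

# Hypothetical values; Amazon defines no response-time objective of its own.
TARGET_URL = "https://example.com/health"
RESPONSE_TIME_OBJECTIVE_SECONDS = 0.5
SAMPLES = 10


def measure_once(url):
    """Time a single HTTP GET against the target URL, in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as response:
        response.read()
    return time.monotonic() - start


latencies = [measure_once(TARGET_URL) for _ in range(SAMPLES)]
violations = sum(1 for latency in latencies if latency > RESPONSE_TIME_OBJECTIVE_SECONDS)
print(f"{violations}/{SAMPLES} samples exceeded the {RESPONSE_TIME_OBJECTIVE_SECONDS}s objective")
```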

Customer Reaction to Amazon Outages

The same WSJ article that reported on the outage also reported that Amazon customers like Scope, whose CEO was quoted as saying "I am looking into what options I have," are clearly looking to insulate themselves from the impact of Amazon outages upon their businesses. This is where the potential damage to Amazon in particular, and to public cloud computing in general, starts to get real. At the other end of the spectrum from running in the Amazon cloud lies the option of standing up your own data center and taking your operational reliability and performance into your own hands. Many enterprises already pursue a strategy of "develop and test on Amazon, then deploy internally." In support of this approach, Hotlink offers a management solution that allows for the seamless management of instances across VMware, Hyper-V, and Amazon, and the seamless migration of instances between the three environments.

There is one other customer reaction to these outages that is even more dangerous to public cloud computing: the assumption that it is the customer's responsibility to code around the unreliability of the Amazon infrastructure. In its blog post "Chaos Monkey Released Into The Wild," Netflix chronicles how it tries to make its code resilient to failure, and how it has written a "Chaos Monkey" whose job is to randomly take down individual Netflix services to ensure that the overall Netflix service is not vulnerable to any single point of failure. The same post speculates that what Netflix really needs is a "Zone Monkey" that takes down an entire Netflix deployment in an Amazon Zone and makes sure that a whole-Zone failure is a recoverable event (which it was not on Christmas Eve).
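
Netflix's actual Chaos Monkey is a far more sophisticated tool than can be shown here, but the following minimal sketch illustrates the idea: randomly terminate one instance that has been explicitly marked as eligible, and see whether the service survives. It uses the boto3 AWS SDK for Python, and the "chaos-target" tag is a hypothetical convention of this sketch, not part of Netflix's implementation:

```python
import random

import boto3  # AWS SDK for Python; assumes credentials are already configured

# Hypothetical tag used to mark instances that are fair game for termination.
ELIGIBLE_FILTER = {"Name": "tag:chaos-target", "Values": ["true"]}
RUNNING_FILTER = {"Name": "instance-state-name", "Values": ["running"]}

ec2 = boto3.client("ec2", region_name="us-east-1")


def pick_random_instance():
    """Return the ID of a randomly chosen running, chaos-eligible instance, or None."""
    response = ec2.describe_instances(Filters=[ELIGIBLE_FILTER, RUNNING_FILTER])
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    return random.choice(instance_ids) if instance_ids else None


if __name__ == "__main__":
    victim = pick_random_instance()
    if victim is not None:
        print(f"Terminating {victim} to verify the service survives the loss")
        ec2.terminate_instances(InstanceIds=[victim])
```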

Public Cloud Computing Reliability is Not the Customer’s Problem

This is where Amazon's apparent approach to reliability and performance endangers the whole notion of public cloud computing. Imagine if your electricity company said that it was up to you to buy a generator to cover your needs for electricity if the power went out. Imagine if your water utility said that it was up to you to keep a water tank in your back yard in case the water supply went out. The entire idea that the vendor of the service does not stand behind the availability and quality of that service (as evidenced by Amazon's worthless SLA), and that it is somehow the customer's responsibility to code and/or design around the vagaries of the public cloud infrastructure, is wrong and dangerous to the future of public cloud computing.

It is wrong and dangerous to the future of public cloud computing because it is going to create the perception in the minds of enterprise customers (who are somewhat skeptical of running important applications in public clouds anyway) that public clouds are not to be trusted with important workloads. Because Amazon is the high profile leader of the public cloud market, Amazon's failure to step up with a quality SLA is going to damage not just Amazon, but the entire notion of public cloud computing. The fact that a vendor like Virtustream offers a response-time-based SLA for SAP running in its cloud is just not going to matter if Amazon ruins the reputation of the entire public cloud computing concept.

Update – Amazon Explanation and Apology

On its blog, Amazon has issued an explanation and apology for the December 24, 2012 ELB Service Event. The upshot is that a developer deleted state data from production servers thinking that he was only deleting it from non-production servers. Amazon has admitted that this occurred because of a flaw in its change management procedures (it did not require change management approval for this kind of action prior to the incident and now does), and it has apologized for the mistake. This leaves Amazon struggling with the tradeoff between agility and change management just as many enterprises do, and it does not resolve the issue of the lack of a truly useful and meaningful SLA.

Summary

The Christmas Eve Amazon outage that left Netflix unavailable for 36 hours resulted from an unacceptable attitude on Amazon's part toward reliability and performance. Unless Amazon steps up to the plate with a meaningful SLA, Amazon risks damaging its own growth and the entire concept of public cloud computing.

Bernd Harzog (324 Posts)

Bernd Harzog is the Analyst at The Virtualization Practice for Performance and Capacity Management and IT as a Service (Private Cloud).

Bernd is also the CEO and founder of APM Experts, a company that provides strategic marketing services to vendors in the virtualization performance management and application performance management markets.

Prior to these two companies, Bernd was the CEO of RTO Software, the VP Products at Netuitive, a General Manager at Xcellenet, and Research Director for Systems Software at Gartner Group. Bernd has an MBA in Marketing from the University of Chicago.

19 comments for "Is Amazon Ruining Public Cloud Computing?"

  1. Garry Martin
    December 28, 2012 at 12:43 PM

    Your uptime calculation is incorrect; you’ve used 99.5% and not 99.95%. The uptime guarantee at 99.95% would kick in after 4.38 hours of downtime in the Service Year.

  2. December 28, 2012 at 1:14 PM

    Yet another bizarrely inaccurate retelling of what happened. Go read Janko at GigaOM. Bottom line: this was a partial outage that affected a small number of minor device types for 18 hours and a larger number of devices for about 8 hours, while the web browser site and streaming were working throughout.

    The SLA also has no effect on availability. You can’t mandate availability, and getting compensated for an outage is all an SLA can do. Hardening a service by using it a lot and fixing what breaks is how you get to availability in the underlying infrastructure. Building architectures that are antifragile is how you get to highly available services.

  3. December 28, 2012 at 1:23 PM

    I live in a suburban area where I have well water and a pump powered by my electricity. If the power goes out, I lose both power and water. My electric company can be fairly quick at fixing problems, but when the power is out, it's on me to get what I need. I can imagine exactly what you are talking about, and the example is a bad one.

    Everything goes down at one point or another. Sometimes it happens at really bad times. Sometimes in ways you don’t think will happen. For example, I used to work at a telecom company and a construction company accidentally damaged the fiber lines to the building. It didn’t matter that there were 3 uplink providers. All 3 fiber connections were damaged.

    If an enterprise, or anyone else for that matter, wants to build something that isn't going to go down easily, you need to architect for that. Any part of your stuff can go away at any time. This can include a whole data center.

    This isn’t always cost effective so many people will have to pull back and not go to the full extent. In that case you’ll have some downtime. The only real failings I see are in companies that can put the money behind it but fail to architect a fault tolerant system or in the expectation users have of the uptime of really inexpensive products.

  4. December 28, 2012 at 5:50 PM

    My water service has not had 99.95% availability where I live in Scottsdale, AZ, an affluent community. People have learned to just put up with the situation.

    My electricity in Scottsdale certainly has not beat 99.95% uptime. UPSes cover the gaps and sometimes the UPSes are not able to cover the size of the gaps that, at times over the summer, can be hours worth of downtime by the power company (APS).

    Amazon is looked upon as the “cheap” alternative by bean counters. With cheap you get cost savings, not necessarily an expectation of unbelievable uptime (the wide-eyed and surprised bean counters at NetFlix are probably looking at the situation as I write this).

    If I’m not mistaken, it’s the Amazon installation in Virginia that is causing most of the bad-press problems for Amazon so until that is fixed, Amazon will continue to have noted downtime.

    Datto

  5. December 29, 2012 at 8:03 AM

    The power supply does fail sometimes, of course. If a power supply failure is life threatening, as in hospitals, you do get a generator. Business critical server rooms generally have generators too, and large enterprises have redundant power feeds. Basically, that’s the equivalent of paying for multiple Amazon regions and making sure you can fail over quickly.

    It’s not clear that 100% uptime is within the current state of the art, or at least not affordably. Amazon, I expect, learns something new from each outage, so hopefully the failure rate will fall over time.

  6. December 30, 2012 at 3:31 PM

    I think it's a good point to raise awareness of the issues of moving to the public cloud, and Amazon's SLA wording has rightly been highlighted before, but I can't quite follow the direct reference to public utilities and service.

    "Imagine if your electricity company said that it was up to you to buy a generator"... while this does appear odd for consumers (although those who experience regular power outages quite often have their own backups), it is not an odd request for enterprises. Server rooms have UPSes, and large organisations will maintain independent power supplies and generators for vital services. Netflix builds in system testing to validate its own systems.

    I'd agree this latest outage emphasizes the need for understanding of the SLA, and it should focus any organisation on properly assessing its own SLA to customers and its attitude to risk in assigning services to the public cloud. But I'd stop short of saying that Amazon's attitude is unacceptable until there is an outage that properly tests their SLA.

  7. January 2, 2013 at 3:04 PM

    IMHO there is a shared responsibility when using public cloud services.

    This means the service provider should have SLAs/SLOs for the reliability/durability of data (being intact, safe, and secure) as well as for accessibility (being able to get to it), and optionally for performance (e.g. response time, activity based). This means the service provider does what is needed to meet those obligations as advertised and agreed to by the terms of service use.

    However, this also means that service providers should step up and help educate their consumers, providing tools, wizards, menus and other resources for them to decide what service levels and options to select. IMHO Amazon AWS is doing a good job in this area, granted it could be better.

    On the other hand, the consumer of the service, whether for free or for fee, has the shared responsibility of making informed decisions, using the tools and information available from the provider. This includes actually reading the terms of use, leveraging education and other content to make decisions, or asking for help, as opposed to simply going with the lowest cost option. I hope that this sounds familiar, as it is the same as what applies to making decisions on virtual or physical servers, storage, networks, hardware or software.

    I'm an AWS (EC2/EBS, S3 and Glacier), Bluehost, and Rackspace cloud customer with different SLAs for the services that I use. It is also my responsibility to determine when I use those services, how I use them, and how I configure or leverage them, just as with any other resources. This also means I subscribe to the notion that any particular technology where people are involved will fail, so the question is what can be done to mitigate that risk while using technology in ways that enable business and productivity.

    Btw, I like the water and electricity examples. I too have my own well, and when the power goes out, which it does several times a year, the water pump does not work. However, in addition to having a couple of jugs of water around just in case, I also have a 12KVA generator that automatically comes on to provide electrical power for the water pump and other things. My responsibility is then to call and see if the power company knows there is an outage (or check online), as well as make sure the propane fuel tanks are topped off and that I have a spare oil filter and oil for the genset. Of course, I could have saved the expense and dealt with the downtime, waiting for somebody else to fix things or configuring around it.

    Here are some additional thoughts and perspectives:

    Cloud conversations: confidence, certainty and confidentiality
    http://storageioblog.com/?p=3476

    Only you can prevent cloud data loss (with poll on shared responsibility)
    http://storageioblog.com/?p=3125

    Amazon Web Services (AWS) and the NetFlix Fix?
    http://storageioblog.com/?p=3246

    The blame game: Does cloud storage result in data loss?
    http://storageioblog.com/?p=2170

    Cheers gs @storageio

  8. Sinclair stockman
    January 3, 2013 at 11:30 AM

    The cloud should really be no different from any other commercial technology construct. Based upon its level of criticality, it should be properly engineered, its behaviours well understood, the risks associated with it transparent, and the users of the technology should be clear on what they are paying for and what level of performance can be delivered by the choice they have made.
    The truth is that for many, though I cannot understand how Netflix ended up in this category, a starry eyed approach is taken towards the use of cloud based technology, and scant attention is paid to reliability and performance; all too often it is assumed that someone else is taking care of it. In a commercial arrangement, both the provider and the purchaser have an obligation to understand the risks associated with the transaction. Given the outage time, it would appear that the Amazon platform was not fit for the purpose that Netflix bought it for. This type of scenario is nothing new, and experienced programme managers and companies have approaches which can de-risk catastrophic failures. They have to have, if they want to survive, because accidents always happen. Yes, I do have a backup for my electricity: they are called batteries (telephone exchanges have had battery and diesel generator backups for decades, which is why the telephone network normally far outperforms the electricity grid for availability).
    The lesson here is that cloud computing is now entering into the field of mature commercial technologies, and the successful companies will be those who understand what it really takes to make systems work at the required levels of availability.

  9. January 8, 2013 at 2:29 PM

    >Imagine if your electricity company said that it was up to you to buy a generator to cover your needs for electricity if the power went out.

    Yup, exactly how it was at the beginning of the electricity age.

    Also, look under your table – is there a UPS device there to power your PC in case of power failure?

  10. January 9, 2013 at 2:06 PM

    Quoting "Public Cloud Computing Reliability is Not the Customer's Problem": are you serious? How is it any different from having servers in a colo, where you're responsible? For any application that you're going to build, on any platform, regardless of whether you do it on your own private equipment or in the public cloud, your architects and developers MUST design for reliability. Of course everyone wants as stable and as reliable a platform as they can possibly have, but ultimately you have to assume that at some point something will break. Depending on your tolerance for downtime, you need to consider how much effort goes into keeping your applications up and running when things do break.

  11. Bharzog
    January 9, 2013 at 3:09 PM

    If you have a server in a colo, you are responsible for everything from the hardware on up. All you are renting is ping, power, and pipe. You are not renting an Elastic Load Balancing service or anything like that. Amazon's IaaS service is very appealing precisely because it offers these value added services. But when they break, it creates serious problems, which gets back to the whole SLA question.

  12. Bharzog
    January 9, 2013 at 3:15 PM

    I agree that enterprise data centers should have backup power for obvious reasons. But the point of public cloud computing is to take much more of the worry and maintenance out of the hands of the customer than has ever been the case before. What I really object to is the notion that it is the customer's responsibility to code around the vagaries of the cloud provider's service, as Netflix has so publicly done with its Chaos Monkey.

  13. Bharzog
    January 9, 2013 at 3:16 PM

    Thank you. I fixed it.

  14. January 20, 2013 at 3:56 PM

    Outages will be conquered. DevOps will prevail. Best practices, governance and cloud management will overtake these ‘acts of god’ so the future of cloud will be built on robust policies. Weigh in on ‘Outages In The Cloud’ at thecloudist.com
