Is it time to plan for the virtual future in our virtual designs? Happy New Year and welcome to 2013!! What a year 2012 turned out to be for virtualization and cloud computing in general. Microsoft Hyper-V, Red Hat, and VMware have all made quite a few enhancements to the hypervisor, and we have finally reached a point where we really have some good competition between hypervisors. Also, the competition boundaries are being expanded to include much more than just the hypervisor itself as we start to focus on the ecosystem as a whole. Toward the end of 2012 the industry had really begun presenting multi-hypervisor management capabilities and solutions. I see this area as something to really watch in 2013, and I propose this question. Continue reading Virtual Future in our Virtual Designs
After a recent snowstorm, and due to pending work on our generator, I had to dig out paths to the generator, the propane tank, etc. We normally dig out a few paths for moving wood around our yard, access to oil, the driveway, etc. But when we finished, we dug a moat around our entire house. This got me thinking about cloud security. The ongoing desire to put moats between us and the attackers. But what is us, in the cloud? Can we prevent the attacks? What are the current moat style technologies in play today? Continue reading Cloud Security: On Moats
Nivio have announced a DaaS solution aimed at SME space. Offering access to Microsoft Windows on any device, rentable applications, and data storage in the cloud, it sounds as if Nivio’s service could be just the ticket for the tablet wielding, dead-PC shunning organisations with a workforce who have their own devices, and need to team collaboration with access to Windows based applications.
The thing is, this road has been trodden before: it is a rocky one. OnLive attempted to offer a solution and failed. Even Desktone had a strategy that attempted to directly appeal to this segment but found the return on effort too miserly.
Yet, Nivio have created a service offering delivering Windows applications to Windows, Mac, iOS and Android devices. A web service providing common file storage to store user and group files for that can be syncronised to devices to work offline for editing directly, or automatically made available within the public cloud hosted Windows desktop service. A desktop service that has an on-demand, rentable application interface. User management is in your own hands. While Nivio are targeting their market at the 20-50 user sized organisation space which suggests small business, Nivio are getting a number of calls from project teams in larger organisations.
What are Nivio doing that is different? Will this model be successful? What, if anything, can be learned by other DaaS providers, and what in turn could be learned by Nivio?
Herewith we fearlessly predict some important events and trends for the virtualization and cloud computing industry. May we also wish everyone had a Happy Holiday Season and a prosperous 2013. Continue reading 2013 Virtualization and Cloud Computing Predictions
Here is an interesting question. How can the undisputed leader in a category, who is experiencing rapid growth, also be guilty of some combination of neglect and arrogance that may damage the reputation and therefore the future success of the category in its entirety? First of all, the details. On Monday night (Christmas Eve) starting at around 3:30 PM US Eastern Time, applications using the Elastic Load Balancing Service (EBS) at Amazon’s US East data center in Virginia experienced outages. Those applications included Netflix, Scope, and the PaaS cloud Heroku.
Amazon’s Position in the Public Cloud Computing Market
The Wall Street Journal quoted some research from Baird Equity Research that said estimated the AWS contributed $1.5B in revenue to Amazon this year, about triple what it contributed in 2010, and Baird further estimated that AWS revenue will double to $3B in two years. Although comparable numbers for other public cloud computing vendors are hard to come by, these numbers arguably make AWS into both the revenue share and unit share leader of the public cloud computing market. Netflix is quoted in the same WSJ article that it relies upon AWS for 95% of it needs for computation and cloud storage. It has been separately reported that Netflix runs over 5,000 concurrent Amazon images in various Amazon data centers. Other high profile online web properties like Foursquare, Pinterest, and Scope also apparently rely either heavily or exclusively upon AWS.
So we have a very interesting situation. We have a vendor, Amazon, whose service is so flexible and affordable that putting tactical workloads that do not need constant availability and constant excellent response time on that service is nearly a no-brainer. And we have companies whose very revenue and existence depends upon continuous availability and excellent user experience relying almost exclusively upon this service.
These issues need to be looked at in light of Amazon’s SLA. Amazon’s SLA was last updated in October of 2008 (which in and of itself indicates a problem), and states “AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year“. Let’s analyze this SLA in light of the Christmas Eve outage:
- Amazon states that it will use “reasonable commercial efforts” to meet this SLA. That give Amazon an escape for any outage. Amazon can simply say that it used reasonable commercial efforts and the outage happened anyway, so tough luck. It is not known whether or not Amazon has ever invoked this excuse to avoid giving service credits, but the escape clause exists.
- Amazon states that it will provide 99.95% up time for a calendar year. That allows for (1-.9995)*365*24 or 4.38 hours of downtime in a year. The Christmas Eve outage apparently lasted a day and a half (36 hours). So we have to assume Netflix and other customers got some service credits. But obviously, the value of those credits pale in comparison to the damage in terms of revenue and reputation that occurred to Netflix and other online properties.
However the fact that your service can be down for 4.38 hours a year on Amazon and that Amazon stays within its SLA under these circumstances is not the real problem. The real problem is that Amazon has no SLA for performance. So Amazon can be up, but if resource contention of any kind in the Amazon infrastructure is at fault for the poor response time of an application running in the Amazon cloud, Amazon entirely washes its hands of any responsibility on that front.
Customer Reaction to Amazon Outages
The same WSJ articles that reported on the outages also reported that Amazon customers like Scope whose CEO was quoted as saying “I am looking into what options I have” are clearly looking to insulate themselves from the impact of Amazon outages upon their businesses. This is where the potential damage to Amazon in particular and public cloud computing in general starts to get real. At the other end of the spectrum from running in the Amazon cloud lies the option of standing up your own data center and taking control of your operational reliability and performance into your own hands. Many enterprises already pursue a strategy of “develop and test and Amazon, and then deploy internally”. In support of this approach Hotlink offers a management solution that allows for the seamless management of instances across VMware, Hyper-V and Amazon, and the seamless migration of instances between the three environments.
There is one other customer reaction to these outages which is even more dangerous to public cloud computing. That reaction on the part of the customer is to assume that it is the customer’s responsibility to code around the unreliability in the Amazon infrastructure. In the Netflix blog “Chaos Monkey Released Into The Wild“, Netflix chronicles how it tries to make its code resilient to failure, and how it has written a “Chaos Monkey” whose job it is to randomly take down individual Netflix services to ensure that the entire Netflix service is not vulnerable to any single point of failure. This same blog speculates that what Netflix really needs is a “Zone Monkey” that takes down an entire Netflix instance on an Amazon Zone and make sure that an entire Zone failure is a recoverable event (which it was not on Christmas Eve).
Public Cloud Computing Reliability is Not the Customer’s Problem
This is where Amazon’s apparent approach to reliability and performance endangers the whole notion of public cloud computing. Imagine if your electricity company said that it was up to you to buy a generator to cover your needs for electricity if the power went out. Imagine if your water utility said that it was up to you to keep a water tank in your back yard, in case the water supply went out. This entire idea that the vendor of the service does not stand behind the availability and quality of that service (as evidenced in Amazon’s worthless SLA), and that this it is somehow the customer’s responsibility to code and or design around the vagaries of the public cloud infrastructure is wrong and dangerous to the future of public cloud computing.
It is wrong and dangerous to the future of public cloud computing because it is going to create the perception in the minds of enterprise customers (who are somewhat skeptical of running important applications in public clouds anyway) that public clouds are not to be trusted with important workloads. Since Amazon is the high profile market leader that it is in the public cloud market, Amazon’s failures to step up with a quality SLA is going to damage not just Amazon, but the entire notion of public cloud computing. The fact that a vendor like Virtustream offers a response time based SLA for SAP running in its cloud is just not going to matter if Amazon ruins the reputation of the entire public cloud computing concept.
Update – Amazon Explanation and Apology
On its blog, Amazon has issued an explanation and apology for the December 24 2012 ELB Service Event. The upshot is that a developer deleted state data from production servers thinking that he was only deleting it from non-production servers. Amazon has admitted that this occurred because of a flaw in their change management procedures (they did not require change management approval prior to the incident and now do), and have apologized for the mistake. This leaves Amazon struggling with the tradeoff between agility and change management just like many enterprises do, and also does not resolve the issue of the lack of a truly useful and meaningful SLA.
The Christmas Eve Amazon outage that resulted in Netflix being unavailable for 36 hours results from an unacceptable attitude on the part of Amazon towards reliability and performance. Unless Amazon steps up to the plate with a meaningful SLA, Amazon risks damaging its own growth, and the entire concept of public cloud computing.
The Virtualization Practice will be moving from our internal virtual environment and cloud configuration to an external hosted cloud configuration, at least temporarily. However, what we have found is that not all clouds are alike (we all knew that), and that some of our processes were not cloud friendly but what does it mean for moving to the cloud? How do we manage our change to these processes as we move to the cloud? Continue reading Change: Moving to the Cloud