Google has delivered live migration in its Google Compute Engine cloud offering. Now comes word from Barb Darrow at Gigaom that Amazon is working on live migration as well. Is it possible that the cloud bambies are waking up to the fact that not all applications are stateless and that for many applications, shutting down instances is simply unacceptable? Are the cloud bambies waking up to enterprise requirements for availability and performance management?
Some Realities of Enterprise Application Architectures
While a great deal of excellent work is going on in the building and deployment of rapidly evolving web-scale applications in various public clouds, these net new applications are being built with the cloud in mind. This means that they are being built to withstand the vagaries of public clouds. Netflix’s Chaos Monkey (which randomly kills instances of your application to make sure that you have no single point of failure) is a great example of going to extreme lengths to build things that can withstand the issues that are part of commercial public cloud environments.
But the reality is that most enterprise applications are not net new web-scale applications built to assume a public cloud execution environment. In fact, a lot of enterprise application architectures look like the image below.
So, not only are most enterprise application architectures complex, but they are also, by and large, stateful. That means that stateful transactions are flowing through them, and that if a component of the application system goes down, then the transactions (of which there could be many) running on that component fail.
The Role of VMware vMotion
Sometimes an entire company can get built on one product, which may have only one or two important features. It is fair to say that the ability to consolidate physical operating systems into virtual machines and the ability to move those virtual machines from one host to another are the features that built VMware vSphere. And vSphere is the product that is the foundation of the entire VMware company. So why is vMotion so incredibly important?
- If a host requires maintenance, a system admin can vMotion the VMs off that host to another host, perform the maintenance, and then migrate the VMs back. This obviates the need for admins to work night shifts, as maintenance can now be done during the day and during the production window. If you want to know why admins are among VMware's most loyal customers, perhaps the fact that vSphere lets them get more sleep has something to do with that loyalty.
- vMotion enables Distributed Resource Scheduler (DRS), which automatically balances CPU and memory utilization across the hosts in a vSphere cluster. This has eliminated a large number of problems caused by temporary spikes in resource utilization, which in turn has saved administrators a great deal of time and aggravation.
- vMotion is complemented by High Availability (HA), which detects when a host has failed and automatically restarts its VMs on a different host. It also senses when a guest has gone down and automatically restarts it.
- vMotion enables Site Recovery Manager (SRM), a new disaster recovery option for enterprises. SRM puts into place a mechanism by which to vMotion all of the running VMs in a data center to a DR site and then to incrementally keep them up to date, so that they can be brought up in case of a failure of the primary site.
These capabilities have all proven to lie somewhere between “extremely valuable” and “critical” to the enterprise accounts that have standardized upon vSphere as their data center management solution. These very same capabilities are behind the fact that the number of servers virtualized with vSphere continues to grow.
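To make the DRS idea concrete, here is a toy sketch of load balancing by migration. It is not VMware's algorithm and calls no vSphere API. The `Host` class, the `rebalance` function, the host and VM names, and the 10-point threshold are all hypothetical, chosen only to illustrate moving running VMs from a busy host to an idle one.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host model: maps VM name -> CPU demand (percent of one host)."""
    name: str
    vms: dict = field(default_factory=dict)

    @property
    def load(self) -> float:
        return sum(self.vms.values())


def rebalance(hosts, threshold=10.0, max_moves=20):
    """Greedily move VMs from the busiest host to the idlest host.

    Stops when the load gap is within `threshold`, when no single move
    would narrow the gap, or after `max_moves` moves.
    Returns the list of (vm, source, destination) moves performed.
    """
    moves = []
    for _ in range(max_moves):
        busiest = max(hosts, key=lambda h: h.load)
        idlest = min(hosts, key=lambda h: h.load)
        gap = busiest.load - idlest.load
        if gap <= threshold:
            break
        # Moving demand d changes the gap to |gap - 2d|, so only 0 < d < gap helps.
        candidates = {vm: d for vm, d in busiest.vms.items() if 0 < d < gap}
        if not candidates:
            break
        # Pick the VM whose move leaves the two hosts closest to even.
        vm = min(candidates, key=lambda v: abs(gap - 2 * candidates[v]))
        idlest.vms[vm] = busiest.vms.pop(vm)
        moves.append((vm, busiest.name, idlest.name))
    return moves


# Illustrative two-host cluster (all names are made up).
hosts = [
    Host("esx-a", {"web1": 40.0, "db1": 35.0, "app1": 15.0}),
    Host("esx-b", {"web2": 10.0}),
]
moves = rebalance(hosts)
```

In this example a single move of `web1` evens the two hosts out at 50 each. The point is that the balancing only works because a running VM can be moved without being shut down, which is exactly what live migration provides.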
The Cloud Bambies and Enterprise Applications
By “cloud bambies,” we mean people who believe that all enterprise workloads have already migrated to public clouds such as Amazon Web Services, Google Compute Engine, or Microsoft Azure, or that they will shortly do so. There are several times more virtual machines among VMware’s on-premises customer base than there are cloud images at Amazon and Microsoft Azure combined, so the idea that everything has already migrated to the cloud is ludicrous.
The more interesting question is “What will the rate of migration be, and what are the impediments to that migration?” The following factors are likely to heavily influence the rate at which enterprise applications migrate to public clouds:
- Security: At the most recent Gigaom Structure Conference, Amazon CTO Werner Vogels said, “If you care about the security of your data, store it at Amazon.” According to Urs Hölzle, Google’s VP of Infrastructure, “We have a security team that no one else could afford.” It is clear that the public cloud vendors have made tremendous strides when it comes to security. However, it is also clear that for certain applications, with certain types of data, in certain industries, customers either have a strong preference for their data being on-premises, or they feel required by regulation to have their data on-premises.
- Availability Management: This is where live migration and vMotion come into play. From a customer’s perspective, live migration is a crucial tool that enables continuous availability, and continuous availability is a requirement for many stateful applications. It is also important to note that what Google has done with live migration does not go far enough. All Google has done is enable live migration as an option for Google to use during maintenance procedures. In other words, if Google wants to take down the host that your application is running on, then Google can use live migration to migrate your workload to another host. That is completely different from allowing customers to migrate their image to another host at their discretion.
- Performance Management: This is an area in which the public cloud vendors have a long way to go. The problem is, quite frankly, one of opaqueness. If you want to see the kinds of problems that layers of opaqueness can create, go read the blog post up on VentureBeat in which the founder of Rap Genius outlines his travails. The cloud vendor is simply not forthcoming about what is really going on in the infrastructure supporting its customers’ applications. Amazon can publish all of the CloudWatch metrics that it wants to, but until customers can understand the fine-grained utilization of the physical compute, networking, and storage resources as well as the end-to-end latency of requests for work in the cloud, customers are simply not going to trust public clouds for response-time-sensitive applications. Customer-initiated live migration plays an important role here as well. When response time suffers in the public cloud, many customers assume that it is caused by resource contention on the part of a “noisy neighbor.” With a stateless application, they kill the affected instance and start a new one under the assumption that it will end up on a less-conflicted host. Giving customers the ability to live migrate images when their APM products detect a response-time slowdown would be a huge step in the right direction for public cloud vendors.
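The remediation choice described under Performance Management can be sketched as a small decision function. This is only an illustration under stated assumptions: the `Action` enum, the `Instance` class, `choose_action`, and the latency figures are hypothetical, and no real cloud provider API is called. In particular, the customer-initiated live migration branch models a capability that, as argued above, the public clouds do not yet offer.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    REPLACE_INSTANCE = "kill and relaunch"
    LIVE_MIGRATE = "request live migration"
    ESCALATE = "escalate to provider"


@dataclass
class Instance:
    """Hypothetical description of a cloud instance."""
    name: str
    stateful: bool


def choose_action(instance, p95_latency_ms, slo_ms, customer_migration_available):
    """Pick a response to a slowdown reported by an APM tool."""
    if p95_latency_ms <= slo_ms:
        return Action.NONE
    if not instance.stateful:
        # Stateless: cheap to kill and hope the new instance lands on a quieter host.
        return Action.REPLACE_INSTANCE
    if customer_migration_available:
        # Stateful: killing the instance would fail in-flight transactions.
        return Action.LIVE_MIGRATE
    # Stateful with no customer-initiated migration: nothing safe to do locally.
    return Action.ESCALATE


# Illustrative calls with made-up latency numbers.
stateless_result = choose_action(Instance("web-frontend", False), 900, 500, False)
stateful_result = choose_action(Instance("order-db", True), 900, 500, False)
```

With today's options the stateful case falls through to `ESCALATE`, which is the gap that customer-initiated live migration would close.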
vSphere is the standard data center management software for most large enterprises. Enterprises rely on vMotion and the capabilities it enables to ensure the continuous availability and acceptable performance of business-critical stateful application systems. Those application systems are never going to get migrated to public clouds unless they can be managed in the same manner in which they are currently being managed on-premises. Cloud providers must implement live migration and make it and its enabled services available to customers in the same manner as is the case with vSphere if they wish to capture these workloads.