At VMworld 2013, one of the three imperatives for IT was to “Replace Management with Automation”. This builds on work that VMware has been doing for years, and restarts an initiative that Paul Maritz first kicked off a few years ago. On the one hand, VMware has long been replacing manual management with automation. On the other hand, VMware has talked about automated problem resolution in the past but never delivered it. All of this inevitably leads to the notion of automated service assurance.
Automated Service Assurance Defined
Since this article is devoted to automated service assurance, let’s define it at the outset. Automated Service Assurance means the following things:
- Availability of key services is automatically assured. This means that there is sufficient redundancy of hardware in the environment so that if one component fails another takes its place automatically – with that automation provided by the virtualization platform and the associated management tools.
- The notion of key services extends to more than hardware. It extends to the availability of individual virtual machines, the operating systems running in them, and the applications running on the guest operating systems.
- Allocation of resources is automated in a manner that delivers Service Assurance. At a minimum this means that if a server is running out of memory, workloads get moved to free up some memory. Fully implemented, this means that when resources are scarce, the highest priority workloads get the resources first.
- Finally and most importantly, application response time and throughput are assured. That means that if a high priority application should deliver a response time of 500 ms or less 99% of the time at a transaction rate of 10,000 transactions per hour, then both metrics are measured and then assured.
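To make the last point concrete, here is a minimal Python sketch of the measurement half of that objective. It checks one window of response-time samples against a target such as 500 ms for 99% of requests at 10,000 transactions per hour. The class and function names, the nearest-rank percentile method, and the synthetic data are all assumptions made for illustration, not part of any VMware or VMTurbo product. Measuring is only the first step; assuring the objective would also require acting on the result.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceLevelObjective:
    max_response_ms: float      # response-time ceiling, e.g. 500 ms
    fraction_within: float      # share of requests that must meet it, e.g. 0.99
    transactions_per_hour: int  # load the application is expected to sustain


def nearest_rank_percentile(samples, fraction):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def evaluate(slo, response_times_ms, window_hours=1.0):
    """Compare one measurement window against the objective."""
    observed_ms = nearest_rank_percentile(response_times_ms, slo.fraction_within)
    observed_rate = len(response_times_ms) / window_hours
    return {
        "response_time_ok": observed_ms <= slo.max_response_ms,
        "throughput_ok": observed_rate >= slo.transactions_per_hour,
    }


# Hypothetical objective taken from the example in the text.
slo = ServiceLevelObjective(500.0, 0.99, 10_000)

# Synthetic one-hour window: 9,950 fast requests and 50 slow ones.
window = [120.0] * 9_950 + [900.0] * 50
print(evaluate(slo, window))
```

Response time and throughput are reported separately on purpose: an application can meet its response-time target only because it is handling far less load than the objective assumes.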
The History of Automated Service Assurance
If you think about the core values that VMware brings to its customers (after the hard dollar ROI of server consolidation), many of the core benefits of vSphere have to do with automated service assurance in some way. For example:
- VMware HA can restart workloads on a new host if their original host fails. It can also detect that a guest OS has failed and restart that individual virtual machine. HA therefore provides a crucial level of automated service assurance for availability.
- VMware DRS is able to detect hosts that are running out of resources and then “load balance” the work so that workloads are moved from hosts that are under stress to hosts that have headroom.
- VMware FT (Fault Tolerance) is able to keep two virtual machines in complete synchronization so that if one fails, the second can take its place in exactly the same state, including all in-flight transactions.
So on the three fronts above, VMware has been delivering Automated Service Assurance for quite some time. Furthermore, features like vMotion, DRS and HA are in very wide use in the VMware customer base, so VMware's customers have been seeing substantial Automated Service Assurance benefits in practice.
The Current State of Automated Service Assurance
Even giving VMware full credit for the level of automation that it has brought to availability and resource allocation, we still have a long way to go. For example, if you want the scarce resources on your servers (all four of them: CPU, memory, networking and storage) to go first to your most important applications, then the only way to implement that is with VMTurbo, which offers precisely this capability.
If what you really want to do is address the application response time and throughput use case listed above, there is no way to do this today. Solving this problem would have to start with measuring application response time and throughput for every application you care about, and then using these metrics as an input to an automated resource allocation engine like the one from VMTurbo.
VMware also gets credit for delivering a great deal of automation on the private and hybrid cloud provisioning front. With vCloud Automation Center (vCAC), VMware allows an infrastructure, platform or n-tier application service to be ordered out of a service catalog and then provisioned in a completely automated manner (with the last mile of the automated provisioning done by a next generation IT Automation solution like Puppet). But as yet there is no glue between the automatic provisioning of high level services and the quality that those services must deliver to their business constituents.
Automated Service Assurance and “Replacing Management with Automation”
By stating that IT should replace management with automation VMware is implying that VMware will deliver the solutions that will enable this to occur. And in so doing VMware is jumping right into one of the most interesting debates in the modern management software industry. The debate boils down to the “monitor everything” point of view vs. the “preventative maintenance” point of view. Let’s contrast these points of view:
- The “Monitor Everything” Point of View. Splunk is perhaps the leading proponent of this point of view. Splunk says not only should you monitor everything at the level of machine data (log files, operations management data, security data, and application performance data), but that you should put all of this data into their back end data store so that they can index it all for you and allow for easy queries across what used to be extremely disparate domains. The underlying premise is that you need to monitor everything and you need to monitor it frequently, and you need to monitor it at a fine grained level of detail precisely because if you do not have all of that data, management of a highly dynamic, abstracted and shared environment is impossible. The bottom line to this point of view is that monitoring is a big data problem and you had better put the big data back end and the analytics in place to deal with it.
- The “Preventative Maintenance” Point of View. VMTurbo is the poster child for this point of view. While VMTurbo does not claim to make it unnecessary to monitor your security logs, it does claim that operations management does not in general need to be a big data problem. VMTurbo's claim, on which it delivers, is that if you tell it the priority order of your workloads, it will allocate scarce resources to those workloads automatically by either moving the workloads or changing their virtual resource allocations. VMware has no such priority based workload movement or resource allocation capabilities in its product today, so we will just have to see what the future holds.
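The idea of allocating a scarce resource in priority order can be sketched in a few lines of Python. This is a deliberately simple strict-priority, greedy illustration and is not VMTurbo's actual algorithm, which is not described here. The `Workload` class, the `allocate_by_priority` function, the convention that a lower number means higher priority, and the workload names and sizes are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    priority: int   # assumption: lower number means more important
    demand_mb: int  # memory the workload would like to have


def allocate_by_priority(workloads, capacity_mb):
    """Grant memory in priority order until the capacity runs out."""
    remaining = capacity_mb
    grants = {}
    for workload in sorted(workloads, key=lambda w: w.priority):
        granted = min(workload.demand_mb, remaining)
        grants[workload.name] = granted
        remaining -= granted
    return grants


# Hypothetical host with 8,000 MB to share among three workloads
# that together ask for 14,000 MB.
workloads = [
    Workload("batch-reports", priority=3, demand_mb=4_000),
    Workload("billing", priority=1, demand_mb=6_000),
    Workload("intranet", priority=2, demand_mb=4_000),
]
print(allocate_by_priority(workloads, capacity_mb=8_000))
```

Under contention the most important workload is satisfied in full and the least important one absorbs the shortfall. A real engine would also have to weigh the cost of moving a workload against the cost of resizing it, which this sketch ignores.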
Replacing Management with Automation also implies a great many things beyond the monitoring use cases discussed above. It implies things like:
- Everything that is now done manually, or through scripts, or for that matter through Puppet or Chef will be done with policies (that may invoke Puppet or Chef).
- Whenever anything is done through automation, the associated service level policy is attached to whatever has just been provisioned or changed.
- Service level policies rise in value so that they express what the people who run and use applications care about.
- Most importantly, problems can be correctly identified, and fixes to them can be automatically and correctly applied.
The Holy Grail of Automated Problem Resolution
The last bullet directly above leads to the single most difficult aspect of “Replacing Management with Automation”. Those difficulties are encapsulated in the following points:
- Management is today about events and metrics. Monitoring collects logs about events and metrics of various kinds.
- Management software is for the most part unable to deterministically relate specific log entries and specific levels of specific metrics to specific problems.
- In other words, any system that relies upon interpreting logs and various operations metrics is going to be rife with false negatives (missed problems) and false positives (false alarms).
- This is why most systems administrators either turn off all of the alarms and alerts or ignore them until they get a call from a human being saying that there really is a problem.
- Therefore, given the current state of the art, there is no way to know from the available monitoring information whether there is a real problem or not.
- Worse yet, even given a set of metrics that are clearly out of bounds, translating those out of bounds metrics into the correct action is an as yet unresolved computer science problem.
- Therefore the most realistically achievable form of automated problem resolution is in fact automated problem prevention, which is precisely what VMTurbo is delivering today.
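The false positive and false negative problem described above is easy to see with a static threshold. The Python sketch below compares threshold alerts against whether a problem really existed. The `classify_alerts` function, the 90% CPU threshold, and the four samples are invented for illustration and do not come from any real system.

```python
def classify_alerts(samples, threshold):
    """Count alert outcomes for (metric_value, real_problem) pairs."""
    counts = {
        "true_positive": 0,   # alert fired and there was a problem
        "false_positive": 0,  # alert fired but nothing was wrong
        "false_negative": 0,  # no alert, yet there was a problem
        "true_negative": 0,   # no alert and nothing was wrong
    }
    for value, real_problem in samples:
        alerted = value > threshold
        if alerted and real_problem:
            counts["true_positive"] += 1
        elif alerted:
            counts["false_positive"] += 1
        elif real_problem:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts


# Hypothetical CPU utilization readings (percent) paired with whether
# users actually experienced a problem at that moment.
samples = [
    (95, False),  # busy batch job, users unaffected
    (97, True),   # genuinely saturated host
    (60, True),   # problem caused elsewhere, CPU looks healthy
    (40, False),  # quiet and healthy
]
print(classify_alerts(samples, threshold=90))
```

Even in this tiny example the threshold is right only half of the time, and no choice of threshold fixes the 60% case, because the metric being watched is not the cause of that problem.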
Replacing management with automation is a bold vision statement on the part of VMware in pursuit of the goal of fully automated service assurance. However, as of today, the poor quality of the available data and the limited ability to translate abnormalities in that data into the correct set of automated actions make this a worthy goal, not a realistic near term product deliverable.