VMware Articulates a Compelling Management Vision – Automated Service Assurance

VMware has made it known for quite some time that virtualization, private clouds (IT as a Service), hybrid clouds, and public clouds will create the need for a new management stack, and that VMware intends to be an aggressive supplier of such a stack. However, VMware has never before said precisely how this new management stack would differ (other than by explicitly supporting vSphere) from the management stacks that already exist for every other computing platform.

Well, in his keynote at VMworld 2011, the CEO of VMware, Paul Maritz, articulated a simple and profound strategy for re-inventing how computer systems are managed:

  • The Old Way – First Monitor, then when something goes wrong, Alert, and then manually Respond to the issue.
  • The New Way – Monitor, then when something goes wrong, automatically Respond (fix it automatically), and then Alert to notify the humans that it has been fixed.
Monitor -> Respond -> Alert = Service Assurance
For two years we have been writing about the concept of Service Assurance on this site. We have framed Service Assurance in application performance terms. What this means is that Service Assurance needs to start with the idea of measuring application performance in terms meaningful to the users of the application (response time), setting a service level around the average response time and the variation in response times, and then having a system that can automatically take the right action to restore the response time to the correct level when a deviation occurs.
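As a rough illustration of the idea, the sketch below checks a sample of response times against a service level defined on both the average and the variation, then runs one Monitor -> Respond -> Alert pass. The function names, the use of the population standard deviation as the measure of variation, and the `respond` and `alert` callbacks are all assumptions made for this example, not part of any VMware product.

```python
from statistics import mean, pstdev


def violates_service_level(response_times_ms, max_mean_ms, max_stdev_ms):
    """Return True if the sample breaches the average or the variation target."""
    if not response_times_ms:
        raise ValueError("need at least one response time sample")
    too_slow = mean(response_times_ms) > max_mean_ms
    too_variable = pstdev(response_times_ms) > max_stdev_ms
    return too_slow or too_variable


def service_assurance_cycle(sample_ms, max_mean_ms, max_stdev_ms, respond, alert):
    """One pass of Monitor -> Respond -> Alert.

    respond: hypothetical callable that attempts a fix and returns True on success.
    alert:   hypothetical callable that notifies a human with a message.
    """
    if not violates_service_level(sample_ms, max_mean_ms, max_stdev_ms):
        return True  # service level met, nothing to do
    fixed = respond()
    alert("fixed automatically" if fixed else "automatic fix failed, human needed")
    return fixed
```

Note that a sample such as `[100, 100, 500, 100]` has an acceptable average of 200 ms against a 250 ms target, yet still breaches a 50 ms variation target, which is why the service level needs both terms.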
The Service Assurance Challenge
While the idea of a system that automatically heals itself, or automatically delivers the right level of service to your most important applications and customers, is incredibly appealing, the reality is that this is very hard to accomplish. To be clear, at a minimum the following issues must be addressed:
  • The problem must be accurately identified. For example, a degradation in the response time of one application is a different problem from a degradation in the response time of many or all applications. A degradation of the average response time is a different problem from an increase in the variation of response times with the average remaining the same. An increase in the error rate for the transactions in the application is likely a completely different problem from anything having to do with response time.
  • Once the problem has been identified, the most likely cause of the problem needs to be determined. Problems roughly fall into two categories – those that are in the application itself, and those that are in the infrastructure that supports the application. Application-level problems are easily found for modern applications with tools that instrument code in production, like AppDynamics, New Relic, and dynaTrace (now part of Compuware). However, finding the infrastructure issues that are degrading application response time is very difficult, because it is hard to tie specific application issues to specific behaviors in the infrastructure.
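The distinctions drawn in the first point above can be sketched as a small labeling step. This is only an illustration: the `Symptoms` record, the `classify` function, and the label strings are hypothetical, and the flags are assumed to come from whatever monitoring tool is already in place.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """Hypothetical per-application flags produced by a monitoring tool."""
    mean_degraded: bool = False        # average response time got worse
    variation_increased: bool = False  # spread of response times got wider
    errors_increased: bool = False     # transaction error rate went up


def classify(symptoms_by_app):
    """Label each application's problem so different problems are not conflated."""
    slow = [name for name, s in symptoms_by_app.items() if s.mean_degraded]
    widespread = len(slow) > 1  # more than one application is slower on average
    labels = {}
    for name, s in symptoms_by_app.items():
        found = []
        if s.mean_degraded:
            found.append("widespread slowdown" if widespread else "isolated slowdown")
        elif s.variation_increased:
            found.append("higher variation, average unchanged")
        if s.errors_increased:
            found.append("higher error rate")
        labels[name] = found
    return labels
```

Keeping the labels separate matters because each one points the later root cause step in a different direction.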
Stochastic (Statistical) vs Deterministic Root Cause
The question of identifying the action that should be taken when an application has an issue (a degradation in response time, an increase in the variation of response time, or an increase in error rate, where an error rate of 100% means that the application is no longer available because every transaction is failing) can be addressed by one of two fundamental approaches.
The most desirable approach to addressing issues in the infrastructure that are causing an application to have a problem would be to use deterministic root cause techniques to identify the issue in the infrastructure. However, we do not currently have the instrumentation in place to be able to do deterministic root cause. The instrumentation that we lack is a method of collecting the data as to what infrastructure elements are responsible for any particular transaction or set of transactions in an application. For example, this would require being able to trace a transaction from its inception in a web server, through the entire stack of the application (this is easily done by a good APM tool), and then from the database server through the SAN to the spindle on the storage array and back.
Since we cannot today deterministically link an application to the elements of the infrastructure that support it, we have to rely upon stochastic (statistical) means to link infrastructure issues to application issues. The simplest of these methods is time-based correlation, where one assumes that if something in the infrastructure is behaving badly at the same time as the application is becoming slow, then the bad behavior in the infrastructure is the cause of the slowdown in the application.
However, statistical methods also suffer from some fundamental limitations. The most important of these is that you simply cannot assume that because application response time is degrading at the same time as the queue depth on a storage array is increasing, the increase in queue depth is causing the response time degradation. This is known in the world of statistics as not confusing correlation with causation.
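A minimal sketch of time-based correlation follows, assuming that response time and each infrastructure metric have been sampled at the same instants. It uses the Pearson correlation coefficient as the measure, which is one common choice and not necessarily what any product mentioned here uses; the metric names and the `rank_suspects` helper are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(vx * vy)


def rank_suspects(response_times, infra_metrics):
    """Rank infrastructure metrics by how strongly they move with response time.

    infra_metrics: dict mapping a (hypothetical) metric name to its series.
    The result is a list of suspects to investigate, not a proven cause.
    """
    scores = {name: pearson(series, response_times)
              for name, series in infra_metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

A high score only says that two series moved together during the sampled window. Both could be driven by a third factor, so the ranking narrows the search but does not establish causation.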
Key Components of a Service Assurance Solution
The question of how to do root cause turns out to be central to the question of how to construct a Service Assurance solution. Consider the diagram below, which is a flow chart of how a Service Assurance solution would operate. Let’s look at each of the functions that need to be performed and how one would perform it:
  • Monitoring application performance is easily done by a modern APM solution that is built for dynamic environments. Good choices include AppDynamics, New Relic, BlueStripe, ExtraHop, and dynaTrace (now part of Compuware).
  • Taking action automatically is not hard. VMware has already built substantial automation features into vSphere that are accessible via APIs. Enterprise-focused IT as a Service vendors like DynamicOps, Embotics, Platform Computing, Nimbula, and Gale Technologies all have substantial orchestration capabilities in their solutions.
  • Notification is not a hard problem and is solved in a variety of ways in virtually every product that plays a role here.
  • It is the Decision Engine where the rubber meets the road. This is where the decision is made as to what automated action should be taken to try to fix the problem that is at hand. If you notice the loop that starts with the first “No, the problem is not solved”, you can probably envision the disastrous consequences of a bad decision (ever heard of “vMotion sickness”?).
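The loop described in the last point can be sketched as follows. The cap on attempts is an assumption added here to show one way of keeping a bad decision from repeating indefinitely; the function name, the `(name, action)` pairs, and the callbacks are hypothetical and do not correspond to any vSphere API.

```python
def run_decision_loop(problem_solved, candidate_actions, alert, max_attempts=3):
    """Try candidate fixes in order until the problem is solved or attempts run out.

    problem_solved:    callable returning True once monitoring shows recovery.
    candidate_actions: ordered list of (name, callable) pairs chosen by the
                       decision engine, most likely fix first.
    alert:             callable that notifies a human with a message.
    """
    tried = []
    for name, action in candidate_actions[:max_attempts]:
        action()
        tried.append(name)
        if problem_solved():
            alert("resolved after: " + ", ".join(tried))
            return True
        # "No, the problem is not solved": fall through to the next candidate.
    alert("unresolved after: " + (", ".join(tried) or "no actions")
          + "; escalating to a human")
    return False
```

Bounding the loop means that a wrong diagnosis costs a few unnecessary actions and an escalation, rather than an endless series of migrations.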
Who’s Who in Service Assurance
Since no complete service assurance solution is available from any single vendor, it is important to call out who the leaders are in advancing the state of the art:
  • VMware deserves an enormous amount of credit for simply stating that IT Operations needs to be reinvented around automation, for building the ability to programmatically control vSphere so as to guarantee resource levels to certain workloads, for committing to identify the workloads running in VMs, and for releasing vCenter Operations with a root cause capability based upon the stochastic technology acquired from Integrien.
  • Netuitive is the only independent vendor with a self-learning performance management capability that can be applied to Service Assurance. A logical way to construct an independent Service Assurance solution would be to combine the APM solution that fits your applications with Netuitive and possibly your choice of an enterprise focused private cloud management vendor.
  • In order for APM to play a role in Service Assurance, APM solutions need to be significantly modernized with respect to what most enterprises have installed today. Legacy APM solutions that are expensive, hard to install, and require constant manual re-configuration simply do not fit the bill here. Look to vendors like AppDynamics, AppFirst, New Relic, ExtraHop, BlueStripe, and dynaTrace/Compuware for modern solutions that provide fast time to value and low cost and effort of ownership.
  • Several vendors of private cloud (IT as a Service) management solutions have already implemented significant service assurance functionality in their solutions. Platform Computing, Abiquo and Gale Technologies all fall into this category.
  • In a category of its own is VMTurbo, which is the only management solution for vSphere that attempts to package up a full Service Assurance capability in one product. VMTurbo is able to identify the constrained resources in a vSphere environment, and is able to ensure that the applications that are the most important workloads get priority access to those constrained resources. VMTurbo does not yet have the ability to act upon application response time, but this will likely occur in partnership with one or more APM vendors.
VMware has put a stake in the ground by stating that automated IT Operations will be the benefit that carries virtualization and private cloud to the next level of business critical and performance critical applications. VMware has taken important steps in terms of instrumenting vSphere so that APIs can be used to affect the resources assigned to workloads. Furthermore, VMware has put together a management strategy in the form of vCenter Operations that leverages the self-learning technology from Integrien to determine what actions to take.
As VMware has raised the level of its game to address Service Assurance, we can be certain that the third party ecosystem will rally, partner, and attempt to provide better solutions than VMware provides. This sets up a win-win situation for customers who are looking for the next level of business benefit to drive the next generation of virtualization and private cloud projects.