Are we going to see real progress in IT Automation and Service Assurance in 2012?

In “VMware Articulates a Compelling Management Vision – Automated Service Assurance”, we detailed the strategy that VMware announced at VMworld Las Vegas in the fall of 2011. The cornerstone of that strategy was to open up a new ROI for virtualization. This new ROI is based upon the OPEX savings that come from automating IT Operations, in contrast to the CAPEX savings that come from the server consolidation that has fueled the virtualization industry so far.

The basics of this strategy and vision are:

  1. Detect a problem (through monitoring)
  2. Determine what actions need to be taken to fix the problem (through some intelligence)
  3. Take those actions automatically (through orchestration)
  4. Verify (again through monitoring) that the problem has been fixed
  5. Notify the human operators of the system that the problem has occurred, that it has been fixed, and how it was fixed.
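To make the loop concrete, here is a minimal sketch of the five steps above in Python. Everything in it is hypothetical: the function names, the metrics dictionary, and the 500 ms threshold are invented stand-ins for a real monitoring and orchestration stack, not any vendor's API.

```python
def assurance_loop(get_metrics, diagnose, remediate, notify, threshold_ms=500):
    """One pass through the detect -> decide -> act -> verify -> notify cycle."""
    before = get_metrics()                          # 1. detect via monitoring
    if before["response_time_ms"] <= threshold_ms:
        return "healthy"                            # nothing to fix
    action = diagnose(before)                       # 2. intelligence picks a fix
    remediate(action)                               # 3. orchestration applies it
    after = get_metrics()                           # 4. verify via monitoring
    fixed = after["response_time_ms"] <= threshold_ms
    notify(f"Response time was {before['response_time_ms']} ms; "
           f"took action '{action}'; fixed={fixed}")  # 5. tell the humans
    return "fixed" if fixed else "escalate"
```

In practice steps 1 and 4 come from the monitoring layer, step 3 from the orchestration engine, and step 2 from the intelligence in the middle, which is where the real difficulty lies.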

Notice that there are three essential pieces of technology that one needs to pull this off:

  • Monitoring. There is actually more here than you might think. Just monitoring the infrastructure for how resources are being used is woefully inadequate. What is needed here is end-to-end monitoring of the infrastructure for latency, and end-to-end monitoring of every application (through all of its components) for response time.
  • Intelligence. This is where automated service assurance gets really hard. There are two options for how to implement the required form of intelligence (only two right now because we have not figured out how to grow an “IT brain” in a jar yet). The first is a deterministic rules engine that knows that if X occurs, then do Y. The second is a self-learning statistical engine that can detect abnormalities in patterns AND identify the most likely cause.
  • Orchestration. This is the easiest of the three – especially if you assume that you are running on a virtualization platform like vSphere that has the control API’s and orchestration engine needed to implement actions.
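The two forms of intelligence differ sharply in how they scale. A deterministic rules engine of the “if X occurs then do Y” kind can be sketched in a few lines; the condition fields and action names below are invented for illustration (loosely modeled on what HA and DRS do), not taken from any product:

```python
# Each rule pairs a condition on the current metrics with a remediation action.
RULES = [
    # HA-style rule: a host is failing, so restart its VMs elsewhere.
    (lambda m: m["host_health"] == "failing", "restart_vms_on_other_host"),
    # DRS-style rule: a host is overloaded, so migrate a VM away.
    (lambda m: m["cpu_utilization"] > 0.90, "migrate_vm_to_idle_host"),
]

def decide(metrics):
    """Return the action for the first rule whose condition matches, or None."""
    for condition, action in RULES:
        if condition(metrics):
            return action
    return None  # no human anticipated this situation
```

The weakness is visible in the last line: every failure mode that nobody wrote a rule for falls through to None, which is exactly the scaling problem with this approach.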

Drilling Down into Intelligence

Since the intelligence required to sit between the monitors and the orchestration is the most unsolved of these problems, let's drill into it a bit. There are significant challenges with both the deterministic (rules-based) and statistical (self-learning) approaches:

  • For rules-based systems (if X goes wrong, do Y), the problem is quite simply that there are too many variations of X, and it is nearly impossible for humans to specify them all ahead of time along with remediation steps. For this reason rules engines have largely fallen by the wayside, as they have proven too expensive and time consuming to implement and maintain. Sure, there are exceptions. Both HA and DRS are simple examples of rules-based IT Automation. And both HA and DRS work because they take a very simple set of inputs (a host is about to go down, for example) and take a very simple action (move the VMs somewhere else). The problem with rules-based systems is: what if the symptom is a degradation in the response time of a business critical application? There are hundreds of things that could cause this, and writing a rule in anticipation of each one is simply not possible.
  • The lack of scalability and maintainability of rules-based systems leads us to self-learning, statistically based systems. Here we have three great examples of technology to work with: the analytics in vC Operations that came from Integrien, the market leading performance analytics solutions from Netuitive, and Prelert. Before we get into these in more detail, however, it is very important to realize that they all share one common pitfall: knowing that two sets of things both went out of bounds at the same time is not the same thing as knowing that A caused B. In the world of statistics this is known as confusing correlation with causation.
Therefore, in order for us to make significant progress on the front of automated service assurance (guaranteeing application response time by automatically taking the correct remediation steps), self-learning technology has to progress from being able to tell you what else is out of whack when response time degrades, to being able to tell you which of those out-of-whack things actually caused the degradation in application response time.
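A tiny simulation shows why correlation alone is not enough to pick a remediation target. In this invented example, a hidden common cause (latency on a shared storage array) drives both application response time and database queue depth, so the two symptoms correlate strongly with each other even though neither causes the other; a remediation triggered against the database would fix nothing:

```python
import random

random.seed(42)

# Hidden common cause: latency on a shared storage array.
storage_latency = [random.gauss(10, 3) for _ in range(200)]

# Both observed metrics track the storage array, plus their own noise.
app_response_ms = [2.0 * s + random.gauss(0, 1) for s in storage_latency]
db_queue_depth  = [0.5 * s + random.gauss(0, 1) for s in storage_latency]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Strong correlation between the two symptoms, yet zero causal link between
# them: both are effects of storage_latency.
print(round(pearson(app_response_ms, db_queue_depth), 2))
```

A self-learning engine that stops at this correlation would flag the database queue as implicated in the slowdown; one that can trigger safe remediation has to go further and single out the storage array as the cause.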
Why is Automated Service Assurance so Important?
To date, virtualization has basically been fueled by the CAPEX ROI that comes from server consolidation. When the low hanging fruit is virtualized, the investment required to implement virtualization typically gets repaid within 12 to 18 months, and then continues to pay off in the form of having fewer servers to upgrade and manage from that point on.
When we get to the business critical applications that have not been virtualized yet, it is likely that the capacity that the owners of those applications purchased was purchased for a reason, and therefore consolidation ratios may not be nearly as high going forward as they have been to date. For this reason a new ROI based upon OPEX savings from IT automation is necessary.
What are we Going to see this Year?
Given that VMware has put two stakes in the ground (with vCenter Operations, and with vFabric APM), both of which are intended, in their own way, to deliver upon certain aspects of automated service assurance, we are certain to see some progress. We should expect limited capabilities from VMware this year, as this is really the first year that VMware has to actually implement what it announced in 2011.
However, there are some extremely interesting developments from independent vendors. Netuitive has more customers (and some very large ones) in production than any other self-learning vendor, and is already today cross-correlating both application response time data from APM solutions and infrastructure data from infrastructure management solutions. By proving the ability to connect and accurately cross-correlate disparate data sources like this, Netuitive has taken an important step towards being able to trigger remedial actions. Prelert has shown similar capabilities, albeit at a smaller number of customers.
VMTurbo has the only shipping example of a product that allows you to prioritize your workloads based upon how much work they need to do (throughput, not response time (yet)), and then automatically adjust placement and resource allocation in a vSphere environment to achieve those priorities.
We have stressed the importance of the monitoring aspect of this equation. This is actually a new requirement for APM solutions: what is needed here is automatic discovery of applications as they show up, automatic discovery of their topology, and then end-to-end response time measurement for all of the applications in the environment (not just a select few). BlueStripe and ExtraHop Networks are both uniquely positioned to fill this need.
VMware is going to make progress on its automated service assurance vision this year, with initial steps coming in the Q1/2012 version of vCenter Operations and the initial release of vFabric APM. On the third party vendor front, progress is most likely to come from partnerships between vendors who have interesting pieces of the puzzle, but do not have the entire puzzle themselves. On this front the most interesting vendors are Netuitive, Prelert, BlueStripe, ExtraHop Networks, and VMTurbo. The wild card in this equation is how service assurance will fit with cloud management and offerings from vendors like DynamicOps, Abiquo, Platform Computing and Gale Technologies.