In “VMware Articulates a Compelling Management Vision – Automated Service Assurance”, we detailed the strategy that VMware announced at VMworld Las Vegas in the fall of 2011. The cornerstone of that strategy was to open up a new ROI for virtualization. This new ROI is based upon OPEX savings that come from automating IT Operations, in contrast to the CAPEX savings that come from the server consolidation that has fueled the virtualization industry so far.
The basics of this strategy and vision are:
- Detect a problem (through monitoring)
- Determine what actions need to be taken to fix the problem (through some intelligence)
- Take those actions automatically (through orchestration)
- Verify (again through monitoring) that the problem has been fixed
- Notify the human operators of the system that the problem has occurred, that it has been fixed and how it was fixed.
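The five steps above can be sketched as a single pass through a closed loop. This is only an illustrative sketch, not VMware's implementation: the function `assurance_loop`, its callback parameters, and the toy "slow response" scenario are all hypothetical names invented for the example.

```python
from typing import Callable, List, Optional


def assurance_loop(
    detect: Callable[[], Optional[str]],
    decide: Callable[[str], Optional[str]],
    act: Callable[[str], None],
    notify: Callable[[str], None],
) -> str:
    """Run one pass of detect -> decide -> act -> verify -> notify."""
    problem = detect()                 # 1. detect a problem (monitoring)
    if problem is None:
        return "healthy"
    action = decide(problem)           # 2. determine the fix (intelligence)
    if action is None:
        notify(f"{problem}: no known remediation, escalating to operators")
        return "escalated"
    act(action)                        # 3. take the action (orchestration)
    if detect() is None:               # 4. verify (monitoring again)
        notify(f"{problem}: fixed by '{action}'")  # 5. notify the operators
        return "fixed"
    notify(f"{problem}: '{action}' did not help, escalating to operators")
    return "escalated"


# Toy environment (hypothetical): one application with a response time.
state = {"response_ms": 900}
messages: List[str] = []


def detect() -> Optional[str]:
    return "slow response" if state["response_ms"] > 500 else None


def decide(problem: str) -> Optional[str]:
    return "add capacity" if problem == "slow response" else None


def act(action: str) -> None:
    state["response_ms"] = 200  # pretend the remediation worked


result = assurance_loop(detect, decide, act, messages.append)
```

In this toy run the loop finds the slow response, applies the one remediation it knows, confirms the fix by monitoring again, and records a message for the operators. The hard part in practice is the `decide` step.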
Notice that there are three essential pieces of technology that one needs to pull this off:
- Monitoring. There is actually more here than you might think. Just monitoring the infrastructure for how resources are being used is woefully inadequate. What is needed here is end-to-end monitoring of the infrastructure for latency, and end-to-end monitoring of every application (through all of its components) for response time.
- Intelligence. This is where automated service assurance gets really hard. There are two options for how to implement the required form of intelligence (only two right now because we have not figured out how to grow an “IT brain” in a jar yet). The first is a deterministic rules engine that knows that if X occurs then do Y. The second is a self-learning statistical engine that can detect abnormalities in patterns AND detect the most likely cause.
- Orchestration. This is the easiest of the three, especially if you assume that you are running on a virtualization platform like vSphere that has the control APIs and orchestration engine needed to implement actions.
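To make the two intelligence options concrete, here is a minimal sketch of each. It assumes a made-up rule table and a simple z-score baseline. The names `RULES`, `rule_based_action`, and `is_abnormal` are hypothetical and do not come from any product. The statistical half covers only the "detect abnormalities" part; finding the most likely cause is a separate and harder problem.

```python
from statistics import mean, stdev
from typing import List, Optional

# Option 1: deterministic rules, "if X occurs then do Y".
# Hypothetical symptom and action names, for illustration only.
RULES = {
    "host_failing": "restart_vms_elsewhere",
    "datastore_full": "expand_datastore",
}


def rule_based_action(symptom: str) -> Optional[str]:
    """Return the scripted action for a symptom, or None if no rule exists."""
    return RULES.get(symptom)


# Option 2: a learned baseline. Flag a value that is far from what the
# metric has normally looked like, without anyone writing a rule for it.
def is_abnormal(history: List[float], value: float, threshold: float = 3.0) -> bool:
    """True if value is more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


normal_response_ms = [100, 102, 98, 101, 99]
known = rule_based_action("host_failing")
unknown = rule_based_action("app_response_time_degraded")
spike = is_abnormal(normal_response_ms, 180)
steady = is_abnormal(normal_response_ms, 101)
```

The rule table answers instantly for symptoms someone anticipated and has nothing to say about the rest. The baseline notices anything unusual, but on its own it says nothing about what to do.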
Drilling Down into Intelligence
Since the intelligence required to sit between the monitors and the orchestration is the least solved of these problems, let's drill into this one a bit. There are significant challenges with both the deterministic (rules-based) and statistical (self-learning) approaches:
- For rules-based systems (if X goes wrong, do Y), the problem is quite simply that there are too many variations of X, and it is nearly impossible for humans to specify them all ahead of time with remediation steps. For this reason rules engines have largely fallen by the wayside, as they have proven too expensive and time consuming to implement and maintain. Sure, there are exceptions. Both HA and DRS are simple examples of rules-based IT automation. And both HA and DRS work because they take a very simple set of inputs (the server is about to go down) and take a very simple action (move the VMs somewhere else). The problem with rules-based systems is, what if the symptom is a degradation in the response time of a business critical application? There are hundreds of things that could cause this, and writing a rule in anticipation of each one is simply not possible.
- The lack of scalability and maintainability of rules-based systems leads us to self-learning, statistically based systems. Here we have three great examples of technology to work with: the analytics in vC Operations that came from Integrien, the market leading performance analytics solutions from Netuitive, and Prelert. Before we get into these in more detail, however, it is very important to realize that they all share one common pitfall. That pitfall is that knowing that two sets of things both went out of bounds at the same time is not the same thing as knowing that A caused B. In the world of statistics this is known as not confusing correlation with causation.
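The correlation pitfall is easy to reproduce with made-up numbers. In the sketch below, two applications share storage. A latency spike in that storage (the hidden cause) pushes both response times out of bounds at the same samples. All data, metric names, and the `out_of_bounds` helper are hypothetical. The simple z-score check stands in for a learned baseline and is not how any of the products named above work.

```python
from statistics import mean, stdev
from typing import List


def out_of_bounds(series: List[float], baseline_len: int, threshold: float = 3.0) -> List[int]:
    """Indices after the baseline window that deviate from the baseline mean
    by more than `threshold` baseline standard deviations."""
    base = series[:baseline_len]
    mu, sigma = mean(base), stdev(base)
    return [
        i
        for i in range(baseline_len, len(series))
        if abs(series[i] - mu) > threshold * sigma
    ]


# Hidden common cause: shared storage latency spikes at samples 8 and 9.
storage_latency_ms = [5, 6, 5, 4, 5, 6, 5, 4, 40, 42]

# Two applications that both depend on the storage, but not on each other.
app_a_response_ms = [20 + 2 * s for s in storage_latency_ms]
app_b_response_ms = [100 + 3 * s for s in storage_latency_ms]

a_alerts = out_of_bounds(app_a_response_ms, baseline_len=8)
b_alerts = out_of_bounds(app_b_response_ms, baseline_len=8)

# Both metrics alert on exactly the same samples.
alerts_coincide = a_alerts == b_alerts
```

An engine that watches only the two application metrics sees them go out of bounds together and could be tempted to conclude that A caused B, or B caused A. Neither is true here. Remediating either application would leave the real cause, the storage, untouched.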