The principal objectives of monitoring a system (a set of hardware and software infrastructure), an application, or a service (a combination of the two that accomplishes a business objective, such as the ability to enter an order and ensure that it ships) are to find problems and then to find out what caused each problem, so that it can be eliminated and, ideally, prevented from occurring again. The process of monitoring and management therefore consists of several steps that are broadly repeated across a wide array of problems.
These steps can be briefly summarized as:
- Collect data about whatever is important to you. If you are the systems administrator in charge of the virtual infrastructure in your company, then what you want is data about how the resources in your environment are being used, what load is being placed on your environment, how your environment is performing in support of the applications and services that rely upon it, and how these numbers are changing over time so that you can plan capacity additions in an orderly manner. This step alone can be daunting, as it is not easy to select the metrics to watch from the thousands that are available. Purchasing a resource and availability monitoring solution from a vendor like Vizioncore or Zenoss can easily be justified on the basis of the time saved in selecting the most important metrics alone, since these vendors have often done a great deal of the homework for you through extensive interactions with their customers over a long period of time.
- Figure out what constitutes an anomaly in the data. This is where the really hard work starts. The simple way to do this is to look at the historical pattern of each metric and to set a manual threshold that is high enough to ignore normal deviations but that catches truly abnormal ones. However, this approach does not scale when you have thousands of metrics to set thresholds for, and a large system, with its large number of components, can have tens of thousands, hundreds of thousands, or even millions. The other side of this problem is that as soon as you set those thresholds they risk becoming out of date due to changes in how your systems are being used. This is particularly true in virtualized systems, since they are so dynamic. For these reasons many products have simple “automatic baselining” features that let you pick a percentage (say 95%) that maps to a number of standard deviations from the mean (for roughly normally distributed data, 95% corresponds to about two standard deviations), and this baseline is applied on a time of day and day of week basis. This automates the process of setting thresholds for each metric and is an important part of reducing the administration burden of a monitoring solution.
- Now that you have the monitoring solution configured to produce alarms, figure out what your alarm response process is going to be. This is where the real work and frustration start. The reality is that if you define normal too broadly in step 2 above, you will define as normal things that you should have paid attention to. If you define normal tightly, then you will probably not miss any important incidents, but you will also get a lot of false alarms. You will also find that there is no such thing as one product that can monitor business services, applications, and systems in a cohesive manner, and that even if you buy something to watch your virtualized environment, there will still be quite a number of point tools in use by the teams responsible for specific items or layers in the environment, such as storage tools, database tools, J2EE application tools, network tools, etc. Even very virtualization-aware Infrastructure Performance Management solutions like Akorri, CA Virtual Assurance, Xangati, and Virtual Instruments, while very capable, will not cover the entire waterfront. This means that no matter what you do, when serious issues arise you will end up in “blame storming” meetings where people compare the outputs of various tools in an attempt to exonerate their layer of the environment and earn the right to leave the meeting. This leads directly to step 4 below.
- Now we get to the fun part, which is root cause analysis. Let’s take a business critical web based application as an example. This application is composed of a web server, a J2EE middle tier server, and a Microsoft SQL Server back end database. All of it runs in your VMware vSphere environment. The application team uses an APM tool, perhaps a first generation J2EE APM solution like CA Wily, or perhaps one of the newer ones specifically tuned for virtualization and the cloud, like AppDynamics or New Relic. The virtualization team uses either a tool that watches resource utilization in the environment, like Vizioncore vFoglight or Zenoss, or an infrastructure performance management tool like Akorri, CA Virtual Assurance, Virtual Instruments, or Xangati. The database team uses database specific tools and the storage team uses storage specific tools. The network team uses its own tools to watch the network at a high level of detail. Now when a problem occurs the fun starts.
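The “automatic baselining” idea from step 2 above can be illustrated with a short sketch. It is a simplified illustration, not any vendor's implementation: the function names `build_baselines` and `is_anomalous`, the sample CPU values, and the choice of one bucket per (day of week, hour) are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baselines(samples, num_sigmas=2.0):
    """Build a normal band per (day of week, hour of day) slot.

    samples: iterable of (datetime, value) pairs for ONE metric.
    Returns {(weekday, hour): (low, high)}, where the band is
    mean +/- num_sigmas * sample standard deviation.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)

    baselines = {}
    for slot, values in buckets.items():
        if len(values) < 2:
            continue  # not enough history to estimate a spread
        mu = mean(values)
        sigma = stdev(values)
        baselines[slot] = (mu - num_sigmas * sigma, mu + num_sigmas * sigma)
    return baselines


def is_anomalous(baselines, ts, value):
    """True if value falls outside the band learned for ts's slot."""
    band = baselines.get((ts.weekday(), ts.hour))
    if band is None:
        return False  # no baseline learned for this slot yet
    low, high = band
    return value < low or value > high


if __name__ == "__main__":
    # Hypothetical history: CPU % at 10:00 on four consecutive Mondays.
    first_monday = datetime(2024, 1, 1, 10, 0)
    history = [
        (first_monday + timedelta(weeks=i), cpu)
        for i, cpu in enumerate([40.0, 42.0, 38.0, 40.0])
    ]
    bands = build_baselines(history)
    next_monday = first_monday + timedelta(weeks=4)
    print(is_anomalous(bands, next_monday, 41.0))
    print(is_anomalous(bands, next_monday, 75.0))
```

Because the band is recomputed from history rather than typed in by hand, it can be refreshed on a schedule, which addresses the problem of thresholds going stale. It still treats every metric in isolation, which is the limitation the later sections return to.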
The Root Cause Challenge
In the scenario described above, root cause is either going to be pretty simple or maddeningly complex. The (relatively) simple problems will be the ones that show up in one of the tools, and where the underlying problem is apparent in that tool. If a VM is running out of memory, then this will be obvious in the tools used by the virtualization team. If two applications are placing load on one spindle in the array and this is creating a bottleneck, then Akorri is adept at finding these issues and making the solution obvious. If there is a problem in the code running in the application server, a byte code instrumentation solution like one from Wily, AppDynamics, or New Relic will be adept at showing where the time is being spent, which will often lead the development team to a solution (fix the bug) in a relatively short period of time.
The hard problems have to do with issues in business services and applications whose root cause lies somewhere in the underlying environment. These problems are hard today, when tier 1 applications reside on physical infrastructures, and they will get even more difficult as these applications are moved to virtual and dynamic infrastructures. The difficulty lies in the fact that the tools capable of measuring the response time of an application (like AppDynamics, BlueStripe, New Relic, and CA Wily) do not share their response time deviation data with infrastructure management solutions. This is true for the legacy physical tools as well as for the new virtualization-aware tools.
It is therefore imperative that a method for orderly root cause analysis be put in place before enterprises can confidently virtualize tier 1 business critical and performance critical applications. Root cause analysis for these applications on a static physical infrastructure is already hard. Root cause analysis for these types of applications on a virtual infrastructure will be an order of magnitude more difficult, as the dynamic behavior of the virtualized (or cloud based) environment will make problems even harder to troubleshoot.
Deterministic Root Cause
There are two possible approaches to achieving a streamlined root cause process. The first candidate is a deterministic process. In this case deterministic means that the products in question completely and automatically determine what issue in the infrastructure caused the problem. This involves integrating performance deviation data from applications into infrastructure management solutions and having these solutions “look for” the root cause. It would also require that the application topology maps discovered by the APM products be shared with the infrastructure performance management solutions, so that those solutions have a reasonable domain (or subset) of the infrastructure to examine when trying to find the issue at hand. Unfortunately, sharing of the application topology map (dynamically derived and kept up to date in real time) with the infrastructure topology map (also dynamically derived and kept up to date in real time) is not available today in any product you can buy, which makes the deterministic approach unattainable as a solution.
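To make the topology-sharing idea concrete, here is a minimal sketch of what joining the two maps would produce if both were available. No product exposes this; the dictionary layout, the function name `candidate_infrastructure`, and the component names are all hypothetical.

```python
def candidate_infrastructure(app_topology, infra_topology, application):
    """Return every component the application depends on.

    app_topology:   {application name: [VMs it runs on]}
    infra_topology: {component: [components it depends on]}
    Both maps are hypothetical stand-ins for the dynamically
    discovered topologies described in the text.
    """
    to_visit = list(app_topology.get(application, []))
    seen = set()
    while to_visit:
        node = to_visit.pop()
        if node in seen:
            continue  # shared components are visited only once
        seen.add(node)
        to_visit.extend(infra_topology.get(node, []))
    return seen


# Hypothetical maps for the three-tier example application.
app_topology = {"order-entry": ["web-vm", "j2ee-vm", "sql-vm"]}
infra_topology = {
    "web-vm": ["esx-host-1"],
    "j2ee-vm": ["esx-host-1"],
    "sql-vm": ["esx-host-2"],
    "other-vm": ["esx-host-3"],
    "esx-host-1": ["datastore-a"],
    "esx-host-2": ["datastore-b"],
    "esx-host-3": ["datastore-a"],
}

if __name__ == "__main__":
    print(sorted(candidate_infrastructure(app_topology, infra_topology, "order-entry")))
```

The result is the restricted domain an infrastructure tool would need to examine. The hard part is not this traversal but keeping both maps correct in real time in an environment where VMs move between hosts.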
This leads us to Statistical Root Cause which is described below.
Statistical Root Cause
If Deterministic Root Cause is basically impossible to achieve over a wide set of use cases and environments, then what is left? What is left is to use statistics and correlation to find relationships between deviations in applications and business services and issues in the infrastructure. There are several different approaches that can be taken here:
- Time Based Correlation. This approach simply says that if something goes wrong in the application (response time is too high), then let's go find what is out of bounds in the infrastructure at the same time. This presumes that the process of defining normal for the infrastructure was done correctly and is not out of date (if it was done manually, then it is almost certainly not correct and almost certainly out of date), and that one can quickly go through all of the tools required to view the infrastructure in order to see all of the potential anomalies. At the end of the day, this process is really not much better than having everyone show up in a room and ask, “The problem started at 10:00 AM on Tuesday; what do your products show as being wrong at that time?”
- Self-Learning Models. There are two vendors who provide solutions targeted at the VMware environment that apply sophisticated statistical models to the data collected from the virtualization environment, its supporting infrastructure, and the applications that run on it. Netuitive is the market leader in this space due to a long history and a track record of successful implementations in some very large enterprises. The Netuitive technology is unique in that it goes far beyond the idea of automated baselines for every metric and actually builds a dynamic model at each time point that cross-correlates all metrics with each other. Netuitive therefore dynamically understands the relationship between each metric and where the system is in time (time of day and day of week), and also the relationship between each metric and every other metric. The other vendor in this space is Integrien, which is also successful, but which has not been focused on the virtualization market for as long as Netuitive and does not have quite as many customers yet.
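The two approaches above can be contrasted with a small sketch. The first function is the time based lookup: list what each tool flagged near the incident. The second ranks infrastructure metrics by how strongly they move with application response time, using a plain Pearson correlation. This is a deliberately simple stand-in and not how Netuitive or Integrien build their models; the function names, metric names, and sample values are all assumptions.

```python
from datetime import datetime, timedelta
from math import sqrt


def anomalies_near(incident_time, anomalies, window):
    """Time based correlation: anomalies within +/- window of the incident.

    anomalies: list of (datetime, tool name, description) tuples.
    """
    return [a for a in anomalies if abs(a[0] - incident_time) <= window]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_suspects(response_times, infra_metrics):
    """Rank infrastructure metrics by |correlation| with response time."""
    scores = {
        name: pearson(response_times, series)
        for name, series in infra_metrics.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken at the same six points in time.
response_ms = [100, 105, 98, 250, 400, 390]
infra = {
    "host_cpu_pct": [50, 52, 49, 51, 50, 52],
    "datastore_latency_ms": [5, 6, 5, 20, 35, 33],
    "vm_mem_pct": [60, 60, 60, 60, 60, 60],
}

if __name__ == "__main__":
    incident = datetime(2024, 1, 2, 10, 0)
    flagged = [
        (datetime(2024, 1, 2, 9, 58), "storage tool", "latency above threshold"),
        (datetime(2024, 1, 2, 6, 15), "network tool", "link flap"),
    ]
    print(anomalies_near(incident, flagged, timedelta(minutes=10)))
    for name, score in rank_suspects(response_ms, infra):
        print(name, round(score, 2))
```

The time based lookup only works if each tool's thresholds were right in the first place. The correlation ranking needs no thresholds, but a pairwise correlation over a handful of samples is a long way from a model that relates every metric to every other metric and to time of day and day of week.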
Virtualizing tier 1 business critical and performance critical applications will require that the virtualization team be able to provide assurances about infrastructure performance and application performance to the application teams and their constituents. This is a dauntingly complex requirement, because meeting it requires integrating tools that are not integrated today, and because the dynamic behavior of virtualized systems adds risk to the equation. The only current solution to this problem therefore involves putting a layer of real time, self-learning statistical correlation software on top of all of the tools that gather the required metrics, and letting this software build models that eliminate false alarms and tie infrastructure issues to service level issues in applications and business services. Tools like Netuitive and Integrien are the only currently available options for performing this critical task. These tools should be evaluated on the basis of their ability to collect the required data directly from VMware vSphere itself, to collect application and business service data via integration with tools at that layer, to collect data about the physical infrastructure that is not available via the vCenter APIs, and to accurately and automatically cross-correlate all of this information in real time in order to provide actionable root cause information for each application and business service event that occurs.