It is pretty much the consensus that most large enterprises have not virtualized more than about 30% of their servers. What they have virtualized are the “tactical” applications that IT owns and that IT can virtualize without the permission of a group that owns an application and has to answer to a set of business constituents. The next set of applications to be virtualized are variously referred to as “Business Critical”, “Tier 1”, or “Performance Critical”, but these terms all mean the same thing: the application has people who care for it and people who use it who demand that it perform well all of the time.
Dealing with these types of applications in the physical environment has always been an expensive, time consuming and “hand crafted by elves” exercise (no – applications support people are not being called elves – it is just an analogy). The process in general looks something like this:
- The entire physical infrastructure for the application (WAN, LAN, Servers, SAN, and Storage) tends to get massively (and expensively) over-provisioned just to be sure to eliminate the chance that hardware constraints will cause a performance triage exercise (also called a blamestorming meeting).
- The teams in charge of each of the components of the system use their own point tools to monitor their portion of the infrastructure. This means separate tools for the network, the storage layer, the servers, the operating systems and JVMs, and finally the applications themselves.
- The decision as to what data to monitor and how frequently to collect it is made separately in each tool. This pretty much guarantees that a deterministic integration of this data into a system that identifies problems and then finds their solutions is almost impossible to create. Therefore these products have to be integrated statistically via a solution like Netuitive that can automatically figure out what matters and what does not. Most teams are instead left to meet in a conference room and compare reports when things go wrong.
What Changes When You Virtualize a Tier 1 Application?
Given that the way Tier 1 applications are monitored in the physical environment is a mess, why don’t we just apply that same mess to those same applications in a virtual environment? The answer is that the virtual environment is different enough that those differences will sink the project to virtualize those Tier 1 applications unless they are addressed. Those differences are:
- Part of the rationale to virtualize is to save some money by taking some of the over-provisioning out of the environment. This will only work if the quality of the monitoring improves to the point that it can prove that performance is acceptable despite the higher rates of utilization of the infrastructure.
- Different workloads will share a common infrastructure. This again is essential to the hard dollar ROI from virtualization, but it introduces an unknown number of new variables into the performance equation. Again, monitoring must keep up and be able to prove innocence and identify the guilty when resource collisions occur.
- A static environment becomes dynamic. Not only is a given infrastructure more shared than before, but workloads will be moving around due to things like vMotion and DRS. This again requires that the monitoring keep up with these changes as they occur.
- Resource utilization data (which we used to use in the physical environment to infer infrastructure performance) collected from within guests or VMs gets corrupted by the timekeeping problem, making these metrics useless in a virtual environment.
These changes mean that in order for monitoring to keep up with a virtualized environment, it must be much more frequent, granular and event driven than it was for a static infrastructure. This means that monitoring needs to be rebuilt from the bottom up – starting with how the data is actually collected.
The Underlying Data Problem
Solving the problem of virtualizing these Tier 1 applications must therefore start with the issue of how to collect the data that is needed to ensure that these applications are performing well, and that points to where the problem is when an application is not performing well. VMware has made enormous progress on this front by taking the position that its own hypervisor will collect basic resource utilization data from the vSphere environment and make this data available via the vCenter APIs. This means that the plethora of agents that would otherwise get installed in virtualized servers are supposed to no longer be necessary.
However, there is a major issue with the VMware vSphere API data: it is only available on a 5-minute interval. When it comes to ensuring the performance of a Tier 1 application, 5 minutes is an eternity. In fact, 5 or 10 seconds of degraded application performance can be long enough to impact user productivity or revenue generation. For an online retail organization, if two 10-second degradations occur every 5 minutes and each one causes an abandoned shopping cart, these short and intermittent problems can rapidly add up to a significant amount of lost revenue.
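To make the arithmetic concrete, here is a minimal sketch of what happens to two 10-second degradations inside one 5-minute interval once the interval is averaged. All of the numbers (a 5 ms healthy response time, 200 ms while degraded, and the seconds at which the degradations start) are assumptions chosen for illustration, not measurements.

```python
from statistics import mean

INTERVAL_SECONDS = 300            # one 5-minute reporting interval
BASELINE_MS = 5.0                 # assumed healthy response time
DEGRADED_MS = 200.0               # assumed response time while degraded
DEGRADATION_STARTS = (60, 200)    # assumed start second of each degradation
DEGRADATION_LENGTH = 10           # each degradation lasts 10 seconds


def per_second_latency():
    """Return one latency sample per second for a single 5-minute interval."""
    samples = [BASELINE_MS] * INTERVAL_SECONDS
    for start in DEGRADATION_STARTS:
        for second in range(start, start + DEGRADATION_LENGTH):
            samples[second] = DEGRADED_MS
    return samples


samples = per_second_latency()
five_minute_average = mean(samples)   # what a 5-minute roll-up would report
worst_second = max(samples)           # what per-second data would reveal
degraded_seconds = sum(1 for s in samples if s > BASELINE_MS)

print(f"5-minute average: {five_minute_average:.1f} ms")
print(f"worst second:     {worst_second:.1f} ms")
print(f"degraded seconds: {degraded_seconds}")
```

Under these assumptions the 20 bad seconds are blended with 280 good ones, so the interval reports an average of 18 ms even though users saw 200 ms twice.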
The Need for Real Time Data
Virtualizing Tier 1 applications therefore requires that we rethink how we collect data from the environment. At the core of this rethink is the reality that the current 5-minute data collected by VMware and exposed in its APIs (and used by every monitoring vendor who declares “VMware support” in their product) is woefully inadequate. The bottom line is that true performance data (not resource utilization data, but data that measures the infrastructure and application response time of the system and that points to root causes) must be collected in the following manner:
- It must be collected in real time.
- It must also be collected comprehensively (all of it in real time).
- It must be collected in the most granular manner possible. Aggregating data across multiple entities should be avoided if at all possible.
- It must be collected deterministically (meaning the actual data, not an average of multiple time points, or a statistical estimation of the data).
- Where possible it must be event driven, so as to avoid the problem of the monitoring system creating such a load on the system that it becomes a source of the very performance problems one is trying to avoid. This is often described as applying the Heisenberg Uncertainty Principle to monitoring (strictly speaking it is the observer effect): the more you try to measure something, the more you influence its behavior.
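As a rough illustration of the event driven and deterministic requirements above, the sketch below has the measurement point push one record per completed transaction to its subscribers, instead of having a monitoring system poll on a timer. The class names, the entity labels and the latency values are all hypothetical; this is not the API of any product mentioned in this article.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[str, float], None]


class CompletionEvents:
    """Hypothetical event source: notifies subscribers once per completed transaction."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def publish(self, entity: str, latency_ms: float) -> None:
        # Work happens only when a transaction completes; nothing polls.
        for handler in self._handlers:
            handler(entity, latency_ms)


class LatencyRecorder:
    """Keeps every individual measurement per entity, with no averaging."""

    def __init__(self) -> None:
        self.by_entity: Dict[str, List[float]] = defaultdict(list)

    def record(self, entity: str, latency_ms: float) -> None:
        self.by_entity[entity].append(latency_ms)

    def worst(self, entity: str) -> float:
        return max(self.by_entity[entity])


events = CompletionEvents()
recorder = LatencyRecorder()
events.subscribe(recorder.record)

# Invented completions: (entity, latency in ms).
for entity, latency_ms in [("lun-a", 4.0), ("lun-b", 6.5), ("lun-a", 41.0)]:
    events.publish(entity, latency_ms)

print(recorder.worst("lun-a"))
```

The design point is that the cost of monitoring follows the actual activity on the system, and each record is the real measurement for one entity rather than an average or an estimate.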
Examples of Real Time Data for Virtualized Systems
The single most important realization that one must come to if one wants to virtualize Tier 1 applications is that collecting only the 5-minute vCenter data is a woefully insufficient approach to monitoring the environment. If relying upon the data that VMware makes available is insufficient, then one must look to vendors who have taken it upon themselves to collect their own data in a manner that meets the needs of the problem at hand.
One such vendor is Virtual Instruments (VI), which specializes in collecting real time, comprehensive performance data from the SAN. VI does this with a unique approach. Since the SAN does not come with a mirror or span port that you can just plug into, VI sells you the hardware to create these ports (TAPs) in your SAN and then sells you a monitoring solution that captures true SAN latency data on a per LUN basis.
The table below compares two sets of data about SAN latency for the identical set of servers in a production environment. The left column is the VMware vCenter API data, the right column is the VI Virtual Wisdom data.
|VMware vCenter 5 Minute|Virtual Instruments Virtual Wisdom Real Time Data|
Let’s go through the key differences between these two sets of data:
- The vCenter API data is aggregated and averaged across 5-minute intervals. Therefore even when it reports the highest write latency, that is the highest average write latency. The VI Virtual Wisdom solution measures individual Exchange Completion Times, which are for individual write transactions.
- The vCenter API data is also aggregated across all of the I/O activity for that VM. The VI data breaks this down by LUN and port so you see multiple entries in the table for each server.
- As you can see from the table above, the vCenter API data suggests that the longest write latencies are in the range of 21 to 22 milliseconds. The truth is that the longest write latencies are as long as 41 milliseconds, and they are occurring on a server that is number 3 in the VMware list.
The bottom line is that averaging latency or response time data at 5-minute intervals can (and does in the case above) obscure problems, because bad transactions are averaged in with good transactions and the magnitude of the bad transactions gets diluted in the averaging process. There is also a granularity issue with the VMware data, as things that VI is able to measure separately are lumped together by the VMware aggregation process.
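The granularity half of this point can be sketched in a few lines. The latencies below are invented for illustration (they are not the data from the table above), and the server and LUN names are hypothetical. The sketch pools every sample for a server and averages it, as a per-VM roll-up would, and then compares that ranking with the worst individual transaction per LUN.

```python
from statistics import mean

# Invented write latencies in ms, keyed by (server, LUN). Not measured data.
WRITE_LATENCY_MS = {
    ("server-1", "lun-0"): [20.0, 22.0, 24.0],
    ("server-2", "lun-0"): [19.0, 21.0, 23.0],
    ("server-3", "lun-0"): [3.0, 4.0, 5.0],
    ("server-3", "lun-1"): [4.0, 41.0, 3.0],
}


def average_per_server(data):
    """Pool every sample for a server and average it, as a per-VM roll-up would."""
    pooled = {}
    for (server, _lun), samples in data.items():
        pooled.setdefault(server, []).extend(samples)
    return {server: mean(samples) for server, samples in pooled.items()}


def worst_per_lun(data):
    """Keep the single slowest transaction for each (server, LUN) pair."""
    return {key: max(samples) for key, samples in data.items()}


averages = average_per_server(WRITE_LATENCY_MS)
worst = worst_per_lun(WRITE_LATENCY_MS)

ranked_by_average = sorted(averages, key=averages.get, reverse=True)
slowest_key = max(worst, key=worst.get)

print("ranked by average:", ranked_by_average)
print("slowest transaction:", slowest_key, worst[slowest_key], "ms")
```

With these invented numbers the averaged view ranks server-3 last at 10 ms, while the per-LUN view shows that server-3 owns the single slowest write at 41 ms.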
Lest this be perceived as being too harsh on VMware, let’s also discuss why this is the case. VMware is getting its data from the hypervisor which is certainly a good place to get uniform data across a broad range of systems. However the hypervisor has to be a very thin and very efficient piece of code. It probably is not possible to collect detailed data about individual storage transactions from the hypervisor as the measurement point without causing the hypervisor to create the kind of overhead on the system which would be totally unacceptable. Therefore it is probably fair to say that VMware is making the best tradeoff that it can given its measurement point and that in order to get real time and more granular data, a different measurement point is needed. This is precisely where a vendor like Virtual Instruments that has a unique measurement point deep in the SAN can step in and add a significant amount of value to the question of how to instrument a VMware vSphere environment that has to support Tier 1 applications.