Virtualization Performance Management and Misleading Averages

Your teenager takes the car out for a drive. Your teenager does an hour of driving and comes back with an additional 30 miles on the odometer. Your teenager also comes back with a speeding ticket for doing 60 MPH in a 30 MPH zone. When confronted with this unacceptable behavior, your teenager responds that their average speed while driving the car was only 30 MPH and that the police officer was “unfair” for focusing only on the peak and not the average. You explain the obvious: you only have to go 60 MPH in a 30 MPH zone for an instant to be guilty.

VMware has just pointed out a related but opposite problem to the one discussed above: outliers can skew an average. To use the example above, if the teenager had not gone 60 MPH for a short period of time, the average speed during the measurement interval would have been significantly less than 30 MPH. The VMware Knowledgebase article is specifically about how storage latency averages can be skewed by storage events that take a long time (for example an XCOPY, one of the new features of VAAI).

Having something that takes a long time skew an average of n data points brings up two points, one of them obvious, and the other not so obvious. The obvious point is, “yes, of course, we learned this in sophomore statistics,” and if this is a problem, then why not report the peak value for the interval along with the average?
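To make the skew concrete, here is a minimal sketch in Python. The latency values are invented for illustration and do not come from the VMware article; the point is only that a single long-running operation drags the average far away from what most requests experienced, while reporting the peak (and the median) alongside it makes the outlier visible.

```python
from statistics import mean, median

def summarize(latencies_ms):
    """Return the average, median, and peak for one reporting interval."""
    return {
        "average": mean(latencies_ms),
        "median": median(latencies_ms),
        "peak": max(latencies_ms),
    }

# Hypothetical interval: 19 ordinary I/Os at 2 ms plus one 400 ms outlier
# (standing in for a long-running storage operation).
samples = [2.0] * 19 + [400.0]
stats = summarize(samples)

print(
    f"average={stats['average']:.1f} ms, "
    f"median={stats['median']:.1f} ms, "
    f"peak={stats['peak']:.1f} ms"
)
# prints: average=21.9 ms, median=2.0 ms, peak=400.0 ms
```

The average of 21.9 ms describes none of the twenty requests: nineteen finished in 2 ms and one took 400 ms.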

The not so obvious point gets to one of the core heartburn issues standing in the way of further adoption of VMware vSphere in enterprises, especially as penetration goes beyond 40% and into business critical and performance critical applications. It is clear that application owners have different expectations of the performance assurance of a virtual infrastructure (due to its shared and dynamic nature) than they do of a static and dedicated infrastructure. In particular, what application owners want to ask the virtualization team is one very simple question.

End-to-End Infrastructure Performance Management

That question is, “When my application or workload places a request upon your infrastructure, how long does it take your infrastructure to complete my request end-to-end (from the initiation of the request in a web server or a database server to the spindle on the array and back again)?”

The issue raised above about the value of data that is in fact an average plays into the answer to this question. The issue with averages cannot be resolved unless and until end-to-end infrastructure data can be collected and made available in the following manner:

  • Real Time. Now real time is a phrase that appears in the description of many products, but the accurate definition of real time is that you find out the information the instant that the measurement required to collect the information is completed. Real Time by definition does not involve averaging, as if you wait to average two things, you will not get the first observation in real time at all.
  • Deterministic. This is a corollary of real time. This means that you get the actual data, not a sample, not an average, and not a statistical approximation of the data. This is particularly important in the realm of infrastructure performance as milliseconds of latency can matter a great deal when it comes to the performance of the supported applications and workloads.
  • Comprehensive. This means all of the data. Every single transaction. Every single packet across the LAN, every single frame across the SAN, and every single write or read from the disk. While you may not want every bit of this data, what you do not want is for the management system to obscure any of it from you. So you may want to say, “give me all of the disk writes that take longer than 20 milliseconds”, and let the management system filter them for you. But the key is for the management system to have the true raw data at its disposal at the start of the process.
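
The “give me all of the disk writes that take longer than 20 milliseconds” request can be sketched as a simple filter over raw, per-operation records. This is an illustration only: the `DiskWrite` record, its fields, and the `slow_writes` function are hypothetical names, not the data model of any particular product. What matters is that the filter runs over every individual write rather than over pre-averaged values.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

@dataclass(frozen=True)
class DiskWrite:
    """One observed disk write (hypothetical record layout)."""
    lun: str
    latency_ms: float

def slow_writes(
    writes: Iterable[DiskWrite], threshold_ms: float = 20.0
) -> Iterator[DiskWrite]:
    """Yield each write whose latency exceeds the threshold, as it is seen."""
    for write in writes:
        if write.latency_ms > threshold_ms:
            yield write

# Invented sample data: three ordinary writes and two slow ones.
raw = [
    DiskWrite("lun-01", 3.2),
    DiskWrite("lun-01", 41.0),
    DiskWrite("lun-02", 4.8),
    DiskWrite("lun-02", 20.0),
    DiskWrite("lun-03", 27.5),
]

for write in slow_writes(raw):
    print(write.lun, write.latency_ms)
```

Because `slow_writes` is a generator, each slow write is reported as soon as it is observed, without waiting for an interval to close. A write of exactly 20.0 ms is not reported, since the request was for writes that take longer than 20 milliseconds.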

The Right Measurement Point

In the entire history of the evolution of our industry, it has always been the case that as new platforms have arrived and become accepted, those new platforms have been enhanced over time to provide management data about themselves. Switches and routers supported first SNMP and then NetFlow. Windows Servers got enhanced first with PERFMON and then WMI. Storage devices got enhanced with SMI-S.

The question that faces VMware and the virtualization industry that has arisen around VMware is “Is it different this time?” The question about averaging raises the question as to why VMware bothers to average. That raises the question of how frequently VMware can afford to get data from its “operating system” without producing a huge amount of data and destroying the very performance that measurement is trying to ensure (also known as the Heisenberg problem in systems management).

It is entirely possible that the hypervisor is so busy, and is being asked to do so much already, that it cannot be the source of real-time, deterministic, and comprehensive information about the performance (end-to-end latency) of the virtualized environment. If the hypervisor is the first case of a platform that cannot provide measurement data about its own performance in the required way (again real-time, deterministic, and comprehensive), then how will we solve this problem?

The Outside-In Approach

If we assume as posited above that vSphere is the first platform that will be unable to collect real-time, comprehensive, and deterministic data about its own end-to-end latency, then for the sake of the continued progress in virtualization and cloud computing this problem must be solved in another way.

The most likely answer will be to observe the performance or latency of the system from the outside in. The good news is that measurement points exist that allow for application and infrastructure performance to be measured in this way. VMware has made a virtual mirror port available on its vSwitch that allows a software appliance to get every packet and the timing of every packet. Physical switches have supported mirror and SPAN ports for years. And Virtual Instruments has pioneered the concept of inserting a tap into the Fibre Channel SAN and observing the Exchange Completion Time for storage transactions from this vantage point.
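
As a rough sketch of what measuring from the outside in means, the Python below pairs each observed request with its matching response and reports the elapsed time for every exchange individually. The event format (an exchange identifier, a kind, and a timestamp in microseconds) is an assumption made for illustration; it is not the wire format seen at a mirror port or SAN tap, and it is not how any vendor named here implements the measurement.

```python
def completion_times(events):
    """Pair requests with responses and return (exchange_id, elapsed_us).

    `events` is an iterable of (exchange_id, kind, timestamp_us) tuples in
    the order an outside observer saw them. This layout is hypothetical.
    """
    started = {}   # exchange_id -> timestamp of the request
    results = []
    for exchange_id, kind, timestamp_us in events:
        if kind == "request":
            started[exchange_id] = timestamp_us
        elif kind == "response" and exchange_id in started:
            elapsed = timestamp_us - started.pop(exchange_id)
            results.append((exchange_id, elapsed))
    return results

# Invented observations: exchange 2 takes far longer than exchange 1.
observed = [
    (1, "request", 100),
    (2, "request", 150),
    (1, "response", 600),
    (2, "response", 25150),
]

print(completion_times(observed))
# prints: [(1, 500), (2, 25000)]
```

Nothing here is sampled or averaged: every exchange gets its own measurement, and the measured system does none of the bookkeeping.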

Who is Leading the Charge

So which vendors are doing a good job of taking advantage of this outside-in approach to the collection of real-time data? Here is a partial list:

  • Many vendors have appliances that attach to physical mirror or SPAN ports and that can decode various application-level protocols. Compuware acquired Adlex several years ago and has continued to build out the set of supported protocols. BMC just acquired Coradiant, which was the market leader in understanding the performance of web applications from this vantage point. Quest’s Foglight product line includes this capability, as does the CA product line due to the acquisition of NetQoS by CA. A recent and very strong entrant into this space is ExtraHop, who have distinguished themselves by being able to string together TCP/IP transactions at the level of several of the most interesting application-level protocols. Xangati combines this approach with a deep understanding of the NetFlow protocols.
  • VMware acquired B-hive, a virtual appliance that understood HTTP, Java and SQL protocols, launched the product as AppSpeed, and then did not do much with it. One hopes that the AppSpeed technology will re-emerge in a more prominent way as VMware fleshes out its offerings in the APM space (perhaps around a combination of the Hyperic and the AppSpeed technologies).
  • As mentioned above, Virtual Instruments has pioneered the approach of “tapping the SAN”. This results in real-time, deterministic, and comprehensive Exchange Completion Time data for every LUN in the SAN.

The Missing End-to-End View

Unfortunately, what is still missing here is any kind of end-to-end view of infrastructure latency that is also real-time, deterministic, and comprehensive. The marrying of the SAN point of view with the IP network point of view is the obvious combination. The hard issue here will be the identification of the applications so that these views of infrastructure performance can be surfaced on a per-application basis. In summary, we have a long way to go here, and this just might be why so many of those virtualization projects for business critical and performance critical applications are having so much trouble getting traction.