Infrastructure Performance Management Heats Up

IT Operations groups that are responsible for managing the availability, performance, and capacity of a virtual environment in support of performance-critical tier 1 applications have two fundamental goals:

  • Ensure that the environment does not become unavailable or perform poorly due to a lack of physical resources (CPU, memory, network I/O capacity, SAN I/O capacity, and capacity for I/O operations in the storage array).
  • Ensure that the entire virtual infrastructure performs as expected in support of performance-critical applications.

Due to a wide range of factors covered in the Virtualization Performance Management White Paper, one cannot effectively infer the performance of a virtual environment (unlike a physical environment) from how its resources are being used. The dynamic and shared nature of a virtual environment dilutes the value of resource utilization metrics for assessing true infrastructure performance.

For these reasons, the idea of Infrastructure Performance Management needs to be reinvented around a key metric: Infrastructure Response Time (IRT). IRT is defined as the round-trip time that it takes for the infrastructure (from the guest to the spindle and back again) to respond to a request for work from any workload or application. It is critical that IRT be calculated in an application-independent manner (the calculation needs to work for every application), but also in an application-aware manner (the calculation needs to be done per application, and even per application within a given VM).
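As a minimal sketch of what "application independent but application aware" could look like, assume each observed request yields a sample carrying a VM name, an application label, and issue/completion timestamps. The `IoSample` type, its field names, and the sample values are all hypothetical and do not reflect any vendor's data model. The round-trip calculation is identical for every sample; only the grouping key makes the result per application and per VM.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class IoSample:
    """One observed request. All field names are hypothetical."""
    vm: str              # guest that issued the request
    app: str             # application label (e.g. from a process list or port)
    issued_ms: float     # time the request left the guest, in milliseconds
    completed_ms: float  # time the response arrived back at the guest

    @property
    def round_trip_ms(self) -> float:
        # Same calculation for every application: completion minus issue time.
        return self.completed_ms - self.issued_ms


def irt_by_app_and_vm(samples):
    """Return the mean round-trip time in ms, keyed by (app, vm)."""
    grouped = defaultdict(list)
    for s in samples:
        grouped[(s.app, s.vm)].append(s.round_trip_ms)
    return {key: mean(values) for key, values in grouped.items()}


# Invented sample data for illustration only.
samples = [
    IoSample("vm-01", "sql", 0.0, 4.0),
    IoSample("vm-01", "sql", 1000.0, 1006.0),
    IoSample("vm-02", "sql", 500.0, 520.0),
    IoSample("vm-02", "web", 700.0, 702.0),
]
irt = irt_by_app_and_vm(samples)
for (app, vm), value in sorted(irt.items()):
    print(f"{app} on {vm}: {value:.1f} ms")
```

Grouping by the `(app, vm)` pair rather than by application alone is what lets the same "sql" workload show a different response time on each VM.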

It is also important that this IRT metric be calculated on as close to a real-time and continuous basis as possible. A dynamic and shared environment can have transient performance issues that are missed by 5-minute and 15-minute polling intervals. If those issues accumulate over time, they can rise to the level of a business issue that impedes the further progress of virtualization in the enterprise.
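To illustrate why coarse polling can hide a transient problem, the sketch below builds a synthetic five-minute window with one latency reading per second, in which a 10-second burst raises latency from 5 ms to 200 ms. Every number here, including the 50 ms alert threshold, is an invented assumption chosen for illustration.

```python
from statistics import mean

WINDOW_S = 300        # one 5-minute polling interval, in seconds
BASELINE_MS = 5.0     # assumed normal latency
SPIKE_MS = 200.0      # assumed latency during the transient
SPIKE_START = 120     # second at which the transient begins
SPIKE_LEN = 10        # duration of the transient, in seconds
THRESHOLD_MS = 50.0   # hypothetical alert threshold

# One latency reading per second across the window.
per_second = [
    SPIKE_MS if SPIKE_START <= t < SPIKE_START + SPIKE_LEN else BASELINE_MS
    for t in range(WINDOW_S)
]

# What a 5-minute poller reports: a single average for the whole window.
polled_average = mean(per_second)

# What continuous measurement sees: each second that breached the threshold.
breach_seconds = [t for t, ms in enumerate(per_second) if ms > THRESHOLD_MS]

print(f"5-minute average: {polled_average:.1f} ms")
print(f"seconds over threshold: {len(breach_seconds)}")
```

With these assumed values the window averages 11.5 ms, well under the threshold, even though ten individual seconds breached it. The polled figure alone would give no hint that the burst occurred.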

Several vendors have come to market with solutions that provide a perspective on IRT. The existing vendors are:

  • Akorri. Akorri is the pioneer in this field and was the first vendor to actually calculate an Infrastructure Response Time Metric for an entire virtual environment. Akorri combines deep and broad data collection capabilities with robust analytics to be able to calculate an IRT from the VM to the spindle on the array and back again.
  • Virtual Instruments. Virtual Instruments pioneered an IRT solution that is focused upon the SAN. VI is the only vendor that has a Fibre Channel tap that can be inserted inline into the Fibre Channel fabric so that VI sees all of the SAN data all of the time (in the same manner that an appliance that sits on a mirror port of an IP switch sees all of the data flowing through that switch).
  • NetQos. NetQos (now a division of CA) pioneered using the combination of NetFlow data from switches with appliances on mirror ports on IP switches to provide a robust understanding of network performance. Last fall NetQos announced a virtual appliance version of its SuperAgent (the physical appliance that sits on physical mirror ports). NetQos now has the ability to see response times for the physical and virtual infrastructure across the physical LANs and WANs and, most importantly, the interactions that occur between VMs within a single host.

A new vendor, Xangati, has now entered this fray with an approach based upon physical and virtual appliances that collect NetFlow data from all tiers of the physical (LAN and WAN) and virtual (vSwitch) networks. Xangati previously offered only a physical appliance, which provided substantial information about how the performance of the physical LAN and WAN was impacting the performance of the virtualized systems. Xangati has recently announced that its system is now available as a virtual appliance, which means that if one of those appliances is deployed on each host, interactions on the virtual networks within hosts become visible as well.

A diagram showing how the new release of the Xangati solution is deployed is below.

Xangati Deployment Diagram

The four solutions (Akorri, CA | NetQos, Virtual Instruments, and Xangati) are compared below, attribute by attribute.

Data Collection Methods
  • Akorri: vCenter APIs, direct instrumentation of SANs and storage arrays
  • CA | NetQos: vCenter APIs, NetFlow data from physical switches, application performance data from virtual and physical switches via mirror ports
  • Virtual Instruments: vCenter APIs, FC switch polling through SNMP, proprietary taps into the SAN fabric
  • Xangati: vCenter APIs, external directories like DNS and Active Directory, NetFlow data from physical and virtual switches/routers

Breadth and Depth of Infrastructure Response Time Data Collected
  • Akorri: Infrastructure Response Time is collected end to end (from guest to spindle on the storage array)
  • CA | NetQos: Infrastructure Response Time is collected for each application, identified via port and protocol, from the guest through the entire IP network (LAN, WAN, and IP storage)
  • Virtual Instruments: Measures the response times of individual Fibre Channel frames and maps them to LUNs
  • Xangati: Infrastructure Response Time is collected through IP SLA technology and Xangati's Remote Object Viewer

Storage Performance Visibility
  • Akorri: Has specific instrumentation for storage arrays. Captures IOPS and storage latency to physical spindles. Maps guests and workloads to spindles
  • CA | NetQos: Only for IP-attached storage devices using iSCSI
  • Virtual Instruments: Taps the SAN data directly for latency and load information for all Fibre Channel traffic to the LUN
  • Xangati: Only for IP-attached storage devices using iSCSI, NFS, or CIFS

LAN and WAN Performance Visibility
  • Akorri: No visibility into the LAN and the WAN
  • CA | NetQos: Deep visibility into all IP traffic (LAN and WAN)
  • Virtual Instruments: No visibility into the LAN and the WAN
  • Xangati: Deep visibility into all IP traffic (LAN and WAN)

Server Performance Visibility
  • Akorri: Direct calculation of IRT impacts on a per-guest and per-host basis
  • CA | NetQos: Sees server impacts from the perspective of the network
  • Virtual Instruments: Relies upon vCenter data to infer server-level performance issues from resource utilization data
  • Xangati: Sees server impacts from the perspective of the network

Visibility to Performance Issues between Guests on One Host
  • Akorri: No
  • CA | NetQos: Virtual appliance on the mirror port of the vSwitch sees interactions between guests on one host
  • Virtual Instruments: No
  • Xangati: Virtual appliance on the mirror port of the vSwitch sees interactions between guests on one host

Level of Application Identification
  • Akorri: Pulls the process list from guests via WMI. Able to identify certain key applications and workloads
  • CA | NetQos: Identifies applications based upon ports and protocols
  • Virtual Instruments: No ability to tie applications to slowdowns in infrastructure
  • Xangati: Identifies applications based upon ports and protocols and key servers

Data Collection Interval
  • Akorri: Polls the entire virtual infrastructure every 15 minutes
  • CA | NetQos: Real-time
  • Virtual Instruments: Real-time
  • Xangati: Real-time with to-the-second data presentation

Built-in Analytics
  • Akorri: Automatically calculates a Performance Index which compares IRT against capacity utilization
  • CA | NetQos: Automatic baselines, thresholds, Top-N reporting. Optional investigations and notifications when performance degrades. Integrated reporting with the full IT Management Suite
  • Virtual Instruments: Custom baselines, thresholds, alerts, and correlation reporting done for every customer at installation time
  • Xangati: Automatically learns the behavior profile of every VM (up to 5K VMs) and application in the virtual ecosystem

Deployment Model
  • Akorri: Deployed as one subnet-attached virtual appliance in the VMware Resource Pool
  • CA | NetQos: Deployed as one virtual appliance on the vSwitch in each VMware host, physical appliances on the physical mirror ports on the LAN switches, and one management appliance
  • Virtual Instruments: Deployed as a physical tap on the Fibre Channel SAN, plus standards-based interfaces to switches and vCenter
  • Xangati: Deployed as a virtual appliance in each VMware host, and a separate virtual appliance for the management dashboard


Infrastructure Performance Management is the single most important performance and capacity management issue that owners of a virtual environment need to address. Now that the low-hanging fruit has been virtualized, what is left is business-critical and performance-critical applications in the hands of application owners and their business constituents. To convince these groups that the virtual infrastructure is performing acceptably in support of these applications, Operations groups in charge of virtual environments need to move beyond trying to infer infrastructure performance from resource utilization patterns. That approach does not work, and application owners and their business constituents will not accept it as credible. The solutions profiled in this article take important steps toward addressing these issues and should be evaluated as part of putting any performance-critical or business-critical application on a virtualization platform like VMware vSphere.