Virtualization Performance Management – A Focus Upon the Data

Prior to virtualization, there were a few relatively straightforward ways in which data about the performance of systems and applications was collected. Virtualization has introduced the need for new sources of data, and new methods of collecting that data. The reasons that new types of data and new methods of data collection are necessary in virtualized environments are:

  • Data collected by agents running in VMs can become corrupted by what is known as the “Timekeeping Problem”. VMware maintains an excellent white paper on this subject which explains the reasons for this issue and its impacts. The bottom line is that time-based metrics collected by agents (whether agents built into the OS, like the one in Windows that provides the Perfmon and WMI data, or ones from third-party monitoring companies) can get randomly corrupted by the virtualization process. Time-based metrics include CPU utilization, disk I/O rates, network I/O rates, page fault rates and context switch rates. Prior to virtualization these metrics were used in many products to infer the performance of the infrastructure. Once servers and networks are virtualized, this method of inferring infrastructure performance no longer works.
  • The dynamic nature of the virtual environment requires that additional data be collected, and that performance metrics be kept up to date in close to real time. The most important pieces of additional data are a map of the infrastructure and a map of each application system. Since VMs can move around, both the topology of the infrastructure and the topology of applications can change. The rate of change in the infrastructure requires that performance data be collected much more frequently than the 5-minute and 15-minute intervals that were prevalent in the physical world.
  • The shared nature of the virtual environment and the multi-tenant nature of the cloud require a new level of granularity in performance data. It is not enough to be able to say that the performance of an infrastructure is acceptable. It is now necessary to be able to say that the performance that an infrastructure is providing to an application (or in the cloud to a customer) is at a certain level.
  • For the reasons articulated above, inferring the performance of the infrastructure from how utilized the resources are in the infrastructure is no longer viable. Much like Applications Performance Management has focused upon Applications Response Time as THE metric which characterizes the performance of an application, management of the infrastructure now needs to focus upon Infrastructure Response Time as THE metric that characterizes the performance of the infrastructure.
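
To make the Timekeeping Problem from the first point above concrete, here is a minimal sketch in Python of how a rate metric goes wrong when the clock used to compute it runs slow. The function names and all of the numbers are hypothetical and chosen only for illustration, and the scenario (a guest that under-counts elapsed time) is an assumption; this is not how any particular agent or hypervisor computes its metrics.

```python
def rate_per_second(event_count: int, elapsed_seconds: float) -> float:
    """Turn a raw event count into a per-second rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return event_count / elapsed_seconds


def relative_error(guest_elapsed: float, true_elapsed: float) -> float:
    """Fractional error of a rate computed with the guest's view of elapsed
    time instead of the true elapsed time (the event count cancels out)."""
    guest_rate = rate_per_second(1, guest_elapsed)
    true_rate = rate_per_second(1, true_elapsed)
    return (guest_rate - true_rate) / true_rate


# Hypothetical interval: 6000 disk I/Os complete during 60 real seconds,
# but the guest OS believes only 50 seconds have passed (assumed clock lag).
io_count = 6000
true_rate = rate_per_second(io_count, 60.0)   # what the hypervisor would report
guest_rate = rate_per_second(io_count, 50.0)  # what an in-guest agent would report
error = relative_error(50.0, 60.0)

if __name__ == "__main__":
    print(f"true rate:  {true_rate:.1f} IOPS")
    print(f"guest rate: {guest_rate:.1f} IOPS")
    print(f"relative error: {error:.0%}")
```

The same event count yields a different rate purely because the denominator is wrong, which is why a metric measured from outside the guest does not suffer from this distortion.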

Prior to virtualization the data was gathered using one of these methods:

  • Collecting resource utilization statistics from the computing infrastructure. This included the number of IOPS on the spindles of the arrays; the amount of traffic and the latency on LANs and WANs; and CPU, memory, disk, page fault, and context switch data from servers. This approach is taken by a wide variety of vendors ranging from the large enterprise systems management vendors like CA, IBM/Tivoli, HP, and BMC, to hundreds of smaller companies.
  • Byte code instrumentation for Java and .Net based application servers. This involved including what is effectively an agent in the Java Virtual Machine or the .Net CLR that is able to collect detailed information about how transactions are flowing through code, including the timing of and issues with these transactions. The leaders in this category are products like CA/Wily, IBM ITCAM, HP Diagnostics, Quest Foglight, Compuware Vantage and Dynatrace. Coradiant, which licenses and resells the Dynatrace solution, is also present in this space.
  • Putting an HTTP appliance on a mirror or spanned port of the switch that serves the web servers of an application and collecting detailed HTTP request/response data. Some of the major vendors (like CA, Quest, and Compuware) have products in this space, but there is also an independent vendor, Coradiant, which has a very strong product and position in this space. This approach is able to get good end user response time data from the perspective of the edge of the application system, as long as the application system is a web-based (HTTP) application.
  • Collecting true end user experience data directly from end user workstations. Vendors like Knoa and Aternity do this by putting agents on the end users’ PCs. This provides the only true picture of the actual experience that the end user is getting on their “pane of glass”, but only really applies to situations where the owner of the application can get the agents installed on the end users’ PCs.

Given that we need new data, and new approaches to get this data, in order to understand (and assure) the performance of a virtual or cloud based environment, it is useful to go through the new types of data that are being collected and how they are being collected:

  • VMware took an incredibly important step fairly early in the evolution of its solutions by making most of the data that one would have collected with an operating system level agent, or agentlessly via WMI, available via the Virtual Center APIs. This data is collected from the VMware hypervisor, and is primarily resource utilization data made available at 5-minute intervals. Since this data comes from the hypervisor (which owns the hardware clock), and not the operating systems in the VMs (which no longer own the clock), this data is not time-shifted, and represents a reliable indicator of resource utilization in VMware hosts and guests. However, this data is only about resource utilization, and it is only available at 5-minute intervals, and is therefore insufficient to be the basis of either an infrastructure performance management or an application performance management solution.
  • VMware also took an important step towards moving Applications Performance Management forward, by buying B-hive and productizing it as AppSpeed. AppSpeed is a virtual appliance that sits on a virtual mirror (promiscuous) port on the vSwitch in the VMware host. AppSpeed includes protocol decoders for HTTP, J2EE, .Net, and popular database protocols. So whereas in the physical world you might have used a physical appliance on the switch that serves the web servers, and a byte code instrumentation product for the J2EE servers, AppSpeed gets you almost all of the way there with just a virtual appliance. The key difference between AppSpeed and, for example, the combination of a CA/Wily CEM and a CA/Wily Introscope is that with byte code instrumentation you can see transactions flow through the JVM, whereas with AppSpeed’s “over the wire” approach you can only see interactions between JVMs and their adjacent servers, not details occurring within the JVM.
  • VMware also continues to take steps to make APIs available to ISVs that provide for more granular and more real-time data collected on an “outside-in” basis. Two examples are the vStorage APIs, which were really designed to let storage vendors surface some of their value-add to vSphere but can also be used by monitoring vendors to get IOPS data down to the LUN, and the VMSafe API, which is being used by vendors like Reflex Systems to get very granular configuration change data and to serve as the source of a deep packet inspection capability that can identify individual applications running in the environment.
  • It is important to note that the steps that VMware is taking to collect this data and make it available via APIs to members of the VMware ecosystem are one of the major differentiating features of vSphere. The level of performance management data, and the number of products that take advantage of this data, are significant in vSphere and are largely absent from the competing virtualization platforms from Microsoft and Citrix.
  • Akorri has taken the unique step of building specific instrumentation into a wide variety of storage arrays. This instrumentation is unique to each array, and often unique to a version of software in the array. While this has been quite time consuming and expensive for Akorri to create, it has resulted in Akorri being the only performance management vendor in the VMware ecosystem to have the ability to map the infrastructure down to the spindle in the array, and to find IOPS based hot spots that negatively impact performance. Akorri is also the first and only vendor who has created an end-to-end infrastructure response time metric from each VM to the associated spindles and back again.
  • Virtual Instruments is the only vendor that can actually measure the latency of every Fibre Channel frame on the SAN on a continuous and deterministic basis. VI does this by creating a physical mirror port (by putting a tap in the fiber network), and then attaching data collection software to those mirror ports. A very high percentage of the performance issues in the storage arrays manifest themselves in the Fibre Channel latency data. It is important to note that this latency data is not available via SNMP or SMI-S, which are the standard management interfaces to arrays and SAN switches. The only way to get the real SAN performance data is to tap the SAN in the manner that VI is able to do, which puts VI in the position of having a uniquely valuable perspective on the performance of everything that uses the SAN.
  • The NetQoS division of CA pioneered some significant advances in the network performance management space by combining NetFlow data collected from routers and switches with TCP/IP response time data collected via a physical appliance on the mirror port of the physical switch. That physical appliance is now available as a virtual appliance that can determine applications response time for all of the applications running in the VMs on a host.
  • As mentioned above, Reflex Systems leverages the VMware VMSafe.Net API. Leveraging this API requires that the vendor provide a driver that gets included in vSphere, which in turn must be certified by VMware, something that Reflex Systems was the first vendor to achieve. The combination of this driver and the VMSafe.Net API gives Reflex unparalleled visibility into changes in the virtual environment. This is critical for performance management, as misconfiguration is the single most common reason for performance issues in the infrastructure and in applications.
  • Xangati entered the market based upon a physical appliance that collected NetFlow data from physical switches and routers. Xangati now has a virtual appliance version of this collector that can collect NetFlow data from virtual switches. This allows Xangati to be the only vendor that can see the network performance data across virtual LANs, physical LANs and WANs. This is a great example of using an “outside-in” approach to collect data about the performance of the virtual infrastructure without the credibility of the data being affected by the virtualization process.
  • BlueStripe is the only vendor that can provide an end-to-end and hop-by-hop applications response time metric for all TCP/IP based applications across disparate physical and virtual infrastructures. BlueStripe does this by putting an agent in each OS (physical or virtual) that watches the network I/O, dynamically and continuously maps the application system, and calculates the ART metrics.
  • New Relic broke new ground in the Applications Performance Management market by offering its solution on a SaaS basis, which had two significant benefits to customers. The first is that it removed all of the complexity, time, effort and cost associated with bringing an APM solution into the enterprise and properly configuring it. With New Relic, you simply put their agent into your Java or Ruby based application, deploy the application and then log on to the New Relic hosted web console to see how your application is performing. The second benefit is that since New Relic is SaaS based it is inherently cloud competent. In fact New Relic was the first vendor to offer an APM solution that contemplated parts of applications running inside of the data center and parts running in a public cloud, with the entire APM system working despite the attendant firewall issues that characterize this deployment model. The out-of-the-box value provided by New Relic RPM is of particular value to organizations that use Agile Development to release code into production weekly or monthly, as any APM system that requires constant reconfiguration as the application changes is just too brittle for these agile environments.
  • AppDynamics built a next generation Wily Introscope, with features specifically designed for modern virtualization, cloud, highly distributed, and Agile environments. The key new piece of data that AppDynamics provides is an automatically created topology map of the application system. Whereas previous J2EE byte code instrumentation products took a JVM centric view of the application, AppDynamics assumed that the application was broken up into hundreds of JVMs, some of which were running inside the firewall, and some of which were running in a public cloud. AppDynamics is the first J2EE based APM solution that assumes highly scaled out applications (it is built and priced for this scenario), and also includes some very robust cloud orchestration rules that allow the APM solution to actually make decisions about where things run.
  • Zenoss is a resource and availability monitoring solution that is designed for large scaled out infrastructures that support any combination of physical and virtual workloads. Zenoss is unique among the resource monitoring vendors in that it automatically builds and maintains a model of the environment. This is again crucial to the ability of a performance management solution to keep up with the dynamic nature of virtual and cloud based environments.
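
The end-to-end and hop-by-hop response time idea that recurs in the list above can be illustrated with a small sketch. The Python below assumes that per-hop latencies have already been measured by some collector, that the hops form a single serial path, and that each latency covers only the time spent on that hop. The `Hop` type, the tier names, and the numbers are all hypothetical, and this is not a description of how any of the vendors above actually computes its metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hop:
    """One measured leg of a transaction path (hypothetical structure)."""
    source: str
    target: str
    latency_ms: float


def end_to_end_ms(hops: List[Hop]) -> float:
    """End-to-end response time as the sum of the serial per-hop latencies."""
    return sum(h.latency_ms for h in hops)


def slowest_hop(hops: List[Hop]) -> Hop:
    """Return the hop that contributes the most latency to the path."""
    if not hops:
        raise ValueError("at least one hop is required")
    return max(hops, key=lambda h: h.latency_ms)


# Hypothetical path from a web tier down to storage.
path = [
    Hop("web", "app", 4.0),
    Hop("app", "database", 11.0),
    Hop("database", "storage", 35.0),
]

if __name__ == "__main__":
    worst = slowest_hop(path)
    print(f"end-to-end: {end_to_end_ms(path):.1f} ms")
    print(f"slowest hop: {worst.source} -> {worst.target} ({worst.latency_ms:.1f} ms)")
```

An end-to-end number alone says that a transaction is slow; the hop-by-hop breakdown says where, which is what makes the topology map of the application system so important.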


Both Infrastructure Performance Management and Applications Performance Management vendors who are targeting the virtualization and cloud markets have realized that new and unique data is needed in order to manage the performance of these new environments and the applications that run on them. This is a dramatic departure from the old physical world where most vendors simply relied upon the data that was provided via standard OS APIs to infer systems and applications performance. While great progress has been made, there remains a great deal of work to be done. The ability to identify applications is currently limited in most products to an analysis of port and protocol, which are not unique enough to identify all applications. The deep packet inspection approach used by Reflex shows great promise on this front. Significant progress needs to be made on the front of getting more comprehensive data and getting it in more of a real-time fashion. The trick here is to increase the amount and frequency of the data collected without having the data collection process become one of the performance management problems that this entire exercise is designed to detect and prevent. If vendors of competing virtualization platforms are going to be serious about competing with VMware vSphere as the platform for business critical and performance critical applications, then they must take a page from VMware’s book and provide APIs into their hypervisors that allow more performance and configuration data to be collected on an “outside-in” basis. Finally, since few business critical application systems will be 100% virtualized anytime soon, these solutions need to broaden to handle the mixed physical/virtual case.
