Prior to virtualization, “performance management” was roughly broken into two groups of products:
- Products that focused upon how utilized the infrastructure (hardware and software, everything from the spindle on the storage array to the operating system in the servers) was, is currently, and is projected to be. These products came in many forms, from many different vendors, focusing upon many different parts of the infrastructure stack, but they were invariably put to one of two purposes. The first was to manage and plan the capacity of elements of the infrastructure. The second was to infer (or in some cases, assert) that if resource utilization was within a set of either manually set or statistically derived thresholds, then the performance that the infrastructure was providing to applications was “normal”. These products were largely sold to the teams within IT who supported one or more of the layers of this infrastructure. Examples abound but include IBM Tivoli, HP Business Availability Center, BMC, CA Unicenter (now renamed CA Spectrum), and products from hundreds of other companies.
- Products that focused upon the performance of certain applications. These products started by capturing the response times of the application system at some layer in the application's architecture, and then usually tried to find response time anomalies and provide information about the causes of these anomalies in the applications themselves. These products were most often sold to the teams that supported these applications in production. Examples include J2EE/.NET application monitors like CA/Wily Introscope, Compuware Vantage, Quest Foglight, IBM ITCAM, and HP Diagnostics.
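The threshold-based inference described in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's method: the function names, the sample values, and the choice of a band of the mean plus or minus three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def derive_thresholds(history, k=3.0):
    """Return a (low, high) band: mean -/+ k sample standard deviations.

    The band width k is an assumed tuning knob, not a standard value.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_normal(sample, thresholds):
    """Treat a utilization sample inside the band as 'normal'."""
    low, high = thresholds
    return low <= sample <= high


# Hypothetical CPU utilization history, in percent.
cpu_history = [40.0, 42.0, 38.0, 41.0, 39.0]
thresholds = derive_thresholds(cpu_history)

print(is_normal(43.0, thresholds))  # inside the band
print(is_normal(90.0, thresholds))  # far outside the band
```

Note that the check only says whether utilization looks typical. It says nothing directly about response time, which is the inference this class of product relied upon.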
As tier 1 applications have started to get virtualized, virtualization has dramatically shaken up this existing categorization of performance monitoring products. It has created a new set of challenges in how one does performance management for business critical applications once they are virtualized. By responding to these challenges, vendors have reshaped the performance management landscape, resulting in new categories of solutions.
Performance Management Challenges Created by Virtualization
When a tier 1 application system is moved from a dedicated set of physical servers into one or more dynamic resource pools, several things change that ultimately have a great deal of impact upon how the performance of the infrastructure for the application is monitored and managed:
- Resources become much more shared than before. Server virtualization creates shared pools of CPU and memory resources (VMware TPS shares the same piece of physical memory across N virtualized servers). Storage virtualization creates pools of logical storage which are frequently not mapped in any discrete manner to the underlying physical storage.
- Resource sharing becomes pervasive. The hard dollar ROI from virtualization stems largely from driving up server and network utilization through the sharing of these resources. This creates significant performance management challenges, as this sharing is the first thing that an application support team will want to blame when there are performance problems.
- The system becomes dynamic. Due to features like VMotion, DRS, HA, Distributed Power Management and many others, servers will get moved from one host to another based upon a variety of events, rules and manual administrator actions.
- The measurement of time-based metrics from within a virtualized guest becomes impaired. VMware has an excellent white paper on this subject, but the short answer is that if the guest OS does not have good timekeeping behavior, then time-based metrics collected from that guest via its timer (like CPU utilization) will be skewed by the fact that execution time is now scheduled by the hypervisor instead of by the guest OS itself.
Impact of These Challenges on Performance Management
The net effects of these challenges upon how one should manage the performance of a virtual as opposed to a physical system are:
- The hypervisor becomes an extremely important measurement point for resource utilization data, both at the level of the physical host and at the level of the virtualized guests. VMware has done an excellent job of making this data available via the vCenter APIs. Data collected by the hypervisor and exposed through these APIs is not subject to the timekeeping problems described above and is therefore the basis of accurate resource utilization information, which is incorporated into a wide variety of products from vendors that target resource and availability management for VMware. Some of the more high-profile and successful vendors that use this data in their products include up.time Software, Veeam, VizionCore, vKernel, and VMware itself in the Hyperic product set, which came with the SpringSource acquisition. VMware is continuing down this path, and vendors are making use of interfaces like the vStorage APIs and the VMSafe APIs to gather ever more granular data about the performance and configuration of the VMware infrastructure.
- One can no longer reliably infer the performance (response time) of either the applications running on the virtual infrastructure or the infrastructure itself by looking at this resource utilization data. In other words, whereas on a physical system you could use resource utilization as an inverse indicator of application and infrastructure performance, you can no longer do this once the infrastructure and the applications are virtualized. Even if you get good data from the vCenter APIs, the dynamic nature of the system, along with the degree to which these resources are now shared, significantly weakens the link between resource utilization and performance (response time).
- For these reasons, a new and separate category of performance management solutions has emerged that focuses upon Infrastructure Response Time, and not just resource utilization. Since we can no longer infer the performance of an application by measuring its resource utilization on a virtual infrastructure, a new metric is needed to assess the level of service that the infrastructure is providing to the applications. That new metric is Infrastructure Response Time.
- In an important evolution, motivated not necessarily by virtualization itself but by the desire of customers of certain business critical transactional applications for visibility into the end-to-end flow of individual transactions, vendors like Optier, dynaTrace, Correlsense and Quest (Foglight) added this capability to their applications management solutions. This created a new category of vendors focused upon Transaction Performance Management, instead of just Applications Performance Management.
The New Performance Management Categories
As shown in the diagram below, the two previous categories, Application Performance Management and Infrastructure Performance Management, have been split by virtualization and by innovation in the APM category into four new groups of products.
The Performance Management industry (at least the part of it that is aligned with the virtualization industry) is now organized along the lines of the diagram above into the following categories.
Resource and Availability Management
This category is about collecting credible and reliable resource utilization metrics from every tier of the infrastructure (disk, SAN, network, CPU, memory) and analyzing this data for the primary purposes of capacity management and capacity planning. Capacity Management is the continuous, real-time activity that must be done in dynamic virtualized systems to ensure that capacity bottlenecks, which would in turn impact performance, are not being created. Capacity Planning is the process of forecasting resource utilization far enough into the future that if a shortfall in resources is forecast, new resources can be procured and deployed within the forecast window. There are many vendors that fall into this category, and they include Veeam (Veeam Monitor), VizionCore (vFoglight), VMware (Hyperic), vKernel, and up.time Software.
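The capacity planning idea, forecasting far enough ahead that new resources can be procured in time, can be sketched as a simple trend projection. This is a minimal Python illustration under stated assumptions: it uses an ordinary least-squares line over evenly spaced daily samples, and the function names, sample data, capacity limit, and lead times are hypothetical. Real products use far richer models.

```python
def fit_trend(samples):
    """Least-squares line through (day_index, utilization) points."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(samples) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


def days_until_limit(samples, limit):
    """Days after the last sample until the trend reaches the limit.

    Returns None when the trend is flat or declining.
    """
    slope, intercept = fit_trend(samples)
    if slope <= 0:
        return None
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - (len(samples) - 1))


def must_order_now(samples, limit, procurement_lead_days):
    """True when the forecast shortfall falls inside the procurement window."""
    remaining = days_until_limit(samples, limit)
    return remaining is not None and remaining <= procurement_lead_days


# Hypothetical daily storage utilization, in percent of capacity.
daily_utilization = [50.0, 52.0, 54.0, 56.0, 58.0]
print(days_until_limit(daily_utilization, limit=80.0))
print(must_order_now(daily_utilization, limit=80.0, procurement_lead_days=30))
```

The comparison against the procurement lead time is the point of the sketch: a forecast is only useful if it looks at least as far ahead as it takes to buy and deploy new capacity.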
Infrastructure Performance Management
This is perhaps the most interesting of the new categories, and the most relevant to the teams managing virtual infrastructure. Since (per the analysis above) one can no longer use resource utilization to infer the performance that the infrastructure is providing to the applications, a new metric to determine infrastructure performance, and a new category of product to create that metric, is needed. That new metric is Infrastructure Response Time (IRT). IRT is the round trip time that it takes the infrastructure to respond to a request for work from any application or service. IRT is entirely application agnostic, meaning that it does not need to know or care what the application is or how it is constructed. Rather, IRT treats all applications (no matter how constructed) the same: merely as entities that are requesting work to be done. Since IPM is a relatively new category, it has relatively few vendors in it, and the solutions in this category are all quite different from each other. The vendors included in this category are Akorri (BalancePoint), NetQos and Virtual Instruments.
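As a rough illustration of the IRT definition above, the sketch below computes round trip times from request and response timestamps and summarizes them without looking at what the requester is. It is written in Python under stated assumptions: the Exchange record, its field names, and the sample timestamps are hypothetical, and it does not represent how any of the named products collect their data.

```python
from dataclasses import dataclass


@dataclass
class Exchange:
    """One request for work and the infrastructure's response (hypothetical record)."""
    requester: str       # carried along for reference, never used in the math
    request_ts: float    # seconds, when the request was observed
    response_ts: float   # seconds, when the response was observed


def irt_seconds(exchange):
    """Round trip time for a single request."""
    return exchange.response_ts - exchange.request_ts


def summarize_irt(exchanges):
    """Application-agnostic summary: every requester is treated the same."""
    times = [irt_seconds(e) for e in exchanges]
    return {
        "count": len(times),
        "mean": sum(times) / len(times),
        "worst": max(times),
    }


# Hypothetical observations from three unrelated requesters.
observed = [
    Exchange("web-frontend", 10.0, 10.5),
    Exchange("batch-job", 20.0, 20.25),
    Exchange("database", 30.0, 31.0),
]
summary = summarize_irt(observed)
print(summary)
```

The requester name is deliberately ignored by the calculation, which mirrors the point that IRT treats every application simply as an entity requesting work.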
Applications Performance Management
The APM category has seen substantial innovation in the last few years, driven by three primary factors: 1) APM vendors continue to get better at instrumenting a variety of applications at ever deeper levels, 2) they continue to improve their diagnostics capabilities, which allow them to determine why an application is failing or why it is slow, and 3) they have modified their products to make them virtualization aware. Virtualization-aware APM solutions address the timekeeping problem by using new VMware APIs that provide real and apparent time, and by taking physical HTTP and network performance measurement appliances and packaging them up as virtual appliances that use the virtual mirror port on the vSwitch. The vendors included in this category are BlueStripe (FactFinder), Quest (Foglight), OPNET (Panorama and ACE), and VMware (AppSpeed).
Transaction Performance Management
The TPM category is a relatively recent offshoot of the APM category. The difference between the two is that TPM vendors specialize in tracking individual transactions from end to end across an entire multi-tier application system, whereas APM vendors typically focus just upon the response time between the tiers of the application system without having full visibility into each individual transaction. TPM is important for applications where each transaction, or a large set of transactions, is so important that its performance must be measured individually, in addition to the performance of the application system as a whole. These vendors have for the most part not built virtualization-specific features into their products. This reflects the fact that the applications that these vendors monitor will be the last ones to get virtualized, and many of these vendors are just starting to see interest on the part of their customers in embarking upon the virtualization journey for these applications. Vendors in this category include dynaTrace, Optier, Quest, and Correlsense.
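The difference between the two views can be illustrated with a small aggregation example. This is a simplified Python sketch, not a description of how TPM products trace transactions: the record layout, tier names, and timings are hypothetical, and it only contrasts a per-tier summary with a per-transaction, end-to-end summary of the same data.

```python
from collections import defaultdict

# Hypothetical measurements: (transaction_id, tier, seconds spent in that tier).
hops = [
    ("t1", "web", 0.1), ("t1", "app", 0.2), ("t1", "db", 0.1),
    ("t2", "web", 0.1), ("t2", "app", 0.2), ("t2", "db", 0.1),
    ("t3", "web", 0.1), ("t3", "app", 0.2), ("t3", "db", 2.0),
]


def per_tier_average(records):
    """Tier-level view: average time per tier, individual transactions blended together."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _txn, tier, seconds in records:
        totals[tier] += seconds
        counts[tier] += 1
    return {tier: totals[tier] / counts[tier] for tier in totals}


def per_transaction_total(records):
    """Transaction-level view: end-to-end time for each individual transaction."""
    totals = defaultdict(float)
    for txn, _tier, seconds in records:
        totals[txn] += seconds
    return dict(totals)


tier_view = per_tier_average(hops)
transaction_view = per_transaction_total(hops)
slowest = max(transaction_view, key=transaction_view.get)
print(tier_view)
print(transaction_view)
print(slowest)
```

In the tier-level view the database tier merely looks somewhat slow on average, while the transaction-level view shows that one specific transaction took several times longer than the others from end to end.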
The Cross-Platform Question
When choosing among the categories and tools listed above, an important question to be addressed is that of platform diversity. There are two dimensions of platform diversity to be considered. The first is whether the tool needs to work across both existing physical platforms and the virtualization platform. The second is whether the tool needs to work across multiple virtualization platforms (for example, VMware ESX, VMware vSphere and Microsoft Hyper-V). In general, the APM and TPM tools tend to be agnostic as to physical or virtual platform, as they are primarily concerned with the applications and not the physical or virtual infrastructure. On the other hand, the Resource and Availability Management and IPM tools tend to heavily leverage VMware interfaces and therefore tend to be very VMware oriented at this time.
Virtualization has been a catalyst for significant changes in the performance management business at all layers of the IT stack (from hardware to transaction). These changes have only begun. As more and more tier 1 applications get migrated over to a virtual infrastructure, these vendors will advance their functionality, and more vendors will jump into the fray. It is also highly likely that over the next 24 months, the larger traditional vendors (HP, IBM, BMC) will get more active in this space, driven primarily by the fact that CA has now gotten active via its acquisition of NetQos. Enterprises looking to address the virtualization performance management problem should carefully consider whose problem is being solved (the infrastructure team's or the applications team's), and the composition of the applications and platforms that need to be addressed. In most cases, more than one tool will be needed to fully accomplish the job. Capacity Management, Infrastructure Performance Management and Applications Performance Management are different enough from each other as problems to justify a unique tool for each job.