On May 28th 2008, VMware announced that they were buying Applications Performance Management Vendor B-hive. The fact that VMware made this acquisition over a year ago, and is only now getting ready to ship the resulting product (AppSpeed) in concert with the vSphere rollout is in and of itself significant. The key message for vSphere is that it has the performance and scalability to virtualize “every” application, including business critical high transaction rate applications. The issue for enterprises with these applications is that the teams responsible for these applications are highly reluctant to insert another variable into their performance without a mechanism to assure the performance of these applications within the virtual environment. Assuring the performance of these types of applications in a virtual environment is a unique, demanding and largely unsolved problem. This problem must be solved in order for the growth of virtualization to extend beyond “low hanging fruit” and to include these business critical applications.
The Issue with Traditional Application Performance Management Tools
Applications Performance Management (APM) is a large (over $4B in size) mature market (with over 100 vendors). Vendors have been providing solutions to applications performance management problems for many years – with those solutions focused upon a variety of platforms and applications architectures. However, virtualization significantly changes the landscape for APM in the following respects:
- Traditional APM products focus heavily upon the resource utilization profile of an application running on a physical server. The normal amount of CPU, Memory, Network Activity and Disk Activity for the application is learned or configured, and the system alerts when variances occur. Virtualization pools resources across servers (a resource pool). Guests are allocated “shares” of resources like CPU and Memory across the resource pool. It is easy to find out how much memory and CPU a Guest is using at a point in time. It is not so easy to find out which applications in the Guest are using these resources. Nor is it necessarily meaningful if a guest is using more virtual CPU or memory than normal as long as the total available resources in the pool are not under pressure. So the idea of large pools of virtual CPU and memory resources makes measuring applications performance by looking at how applications use these resources less meaningful than was the case on physical servers.
- Even once you get the application resource utilization statistics from within a guest (say via WMI for a Windows Guest), those numbers are themselves not meaningful. The reason is that any number that is a rate over time (like CPU utilization) is “time shifted” because the hypervisor is scheduling multiple operating systems, and the operating system itself does not know that it is being scheduled out. This causes the clock in the guest OS to drift, which in turn makes measurements that rely upon this clock incorrect.
- The combination of the first two points above means that unlike in a physical environment, in a virtual environment one cannot assume that true applications performance (response time) is the reciprocal of resource utilization. The nature of the virtual environment therefore invalidates the incumbent method of doing APM. Resource based metrics are no longer the right way to look at applications performance (due to resource pooling), and even if they were, many of the application specific numbers are warped by the virtualization process and cannot be relied upon.
- A multi-tier application environment with scaled out redundant presentation (web servers) and business logic (application server) tiers running within a virtual environment is a highly dynamic system. The static mapping of operating systems and applications to specific hardware no longer exists in the virtual environment. Furthermore due to DRS, FT, SVMotion, VMotion, and even actions taken by a product like AppSpeed, the location of applications components will be constantly changing. Therefore the “map” of an application system needs to be constantly discovered and updated in order to find out which transactions are taking which path through a constantly changing allocation of guests to hosts.
- A shared virtual environment is by its very nature more collapsed, denser, and more centralized than its physical predecessor (this is why virtualization has a compelling hard dollar ROI derived from just server consolidation). With 50 web servers each running on their own physical hardware, a process on one web server can at the most impact the other processes running on that same web server. In a shared virtual environment, a problem can impact all of the guests on that host, and ultimately the entire resource pool if that problem causes a cascading set of guest relocations.
- The concentration of more work into fewer physical servers also increases the chance that shared resources like a SAN port or a spindle on a disk array will be impacted, and that this impact will manifest itself in a performance problem for applications and end users.
- Baselines which were previously calculated against a known and fixed quantity of physical resource are no longer meaningful in a virtual environment. This problem is compounded by the time shifting issue discussed above.
- Finally, with products like VMware DRS taking dynamic actions with guests based upon VMware’s perceptions of CPU and Memory resource utilization, variability is inserted into the performance of the application system that has an unpredictable impact upon applications performance. Since DRS is completely unaware of the actual performance (response time) that the application is delivering to end users, DRS is making changes to how the application is running with no understanding of how those changes will impact actual applications performance.
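The time shifting problem described above can be illustrated with a little arithmetic. The following is a minimal Python sketch under a deliberately simplified assumption: the guest clock simply fails to advance while the guest is descheduled. Real guests and hypervisors handle timekeeping in more nuanced ways, and the function names and numbers here are hypothetical, chosen only for illustration.

```python
def true_rate(events: int, wall_seconds: float) -> float:
    """Rate as seen by an observer outside the guest, using wall clock time."""
    return events / wall_seconds


def guest_observed_rate(events: int, wall_seconds: float,
                        descheduled_seconds: float) -> float:
    """Rate as computed inside the guest.

    Assumption (simplified model): the guest clock does not advance
    while the hypervisor has the guest scheduled out, so the guest
    divides by less elapsed time than actually passed.
    """
    guest_elapsed = wall_seconds - descheduled_seconds
    if guest_elapsed <= 0:
        raise ValueError("guest must have run for some positive time")
    return events / guest_elapsed


if __name__ == "__main__":
    events = 1000   # work items completed during the interval
    wall = 10.0     # seconds that really elapsed
    stolen = 2.0    # seconds the guest was scheduled out

    print("true rate:", true_rate(events, wall))
    print("rate seen inside guest:", guest_observed_rate(events, wall, stolen))
```

Under this model the same work appears 25% faster from inside the guest than it really was, which is why a rate measured against the guest clock cannot be trusted on its own.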
Requirements for a New Approach
Since the owners of business critical applications will not let these applications get virtualized unless their performance (response time) can be guaranteed in the virtual environment, and since traditional APM tools cannot provide this functionality in a virtual environment, a new approach to managing applications performance in a virtual environment is needed. This new approach should meet the following criteria:
- While an understanding of resource utilization from the perspective of the hypervisor (the vCenter API data) is necessary to find host pool resource conflicts, resource based data is insufficient to address the need. Specifically, it is not possible in the virtual environment to infer that changes in how guests and applications within guests use resources impact actual applications performance. Therefore, a set of response time based metrics is necessary in order to accurately characterize the performance of applications running in a virtual environment.
- In fact three different classes of response time metrics are needed. The first group of metrics needs to accurately measure Infrastructure Response Time (IRT) – the responsiveness of the infrastructure to requests made upon the infrastructure by applications. The second group of metrics needs to measure Applications Response Time (ART) – a rolled up response time number for every application hosted in the virtual environment. Finally, the last group of metrics needs to measure Transaction Response Time (TRT) – the response times for transactions within each application.
- Applications owners only care about the performance of their application (which may be built to a specific set of application infrastructure like .Net, J2EE, SQL Server or Oracle). Applications owners need an ART number for their application and a TRT that provides a “deep dive” into the transactions that comprise the application. The IT Operations staff needs an IRT number for each resource pool and for each application, and an ART number for every application hosted in the virtual environment irrespective of how that application is built. There is therefore a clear divergence in requirements between the applications owner and the owner of the virtual environment.
- These response time metrics must be collected passively, and based upon real interactions in the virtual environment. Synthetic transactions lack the granularity and flexibility needed to automatically discover new atomic transactions when an application changes, and cannot be programmed to cover the waterfront of possible combinations and permutations of user actions within applications.
- This new approach must flexibly and continuously adapt to the constant changes in the virtual environment. This means constantly discovering and rediscovering where the components of each application system are running and how they are interacting with each other. It also means that the product should not require any manual configuration to adapt to changes in the infrastructure, the addition of new applications, or changes to existing applications.
- Data (especially the response time data) must be collected in a manner that is accurate in a virtual environment. Today this largely means that response time data about applications and transactions needs to be collected from outside of the guests that house the applications themselves. This is due to the fact that it is difficult (although not impossible) to collect accurate response time data down to the millisecond level of granularity with an agent running inside of the guest OS. This is especially true if the transaction rate is high enough so that response time measurements will be taken very frequently.
- The method by which the response time data is collected should be carefully designed so as not to create the kinds of problems that monitoring is designed to find and address in the first place. Agents within guests can be a problem for this reason. Collecting data from a virtual mirror port, as AppSpeed and Reflex Systems do, is a way to get a read-only copy of the data without the monitoring process being in the execution path of the actual application transactions. The VMSafe and vStorage APIs hold great promise as a source of data for monitoring vendors, but only if VMware finds a way to certify all of the required third party drivers, so as to avoid a finger pointing mess when things go wrong.
- Sampling of data needs to be replaced with methods that deterministically capture every transaction of interest and quickly throw away the ones that are not of interest (to avoid creating a data warehouse of normal data). 5 minute or even 15 second samples of data will not be frequent enough to catch interactions in this dense and dynamic environment.
- If the product uses statistically derived baselines to characterize normal and abnormal behavior, then these baselines need to be aware of the dynamic nature of the virtual environment. Unlike a physical server which has a static set of available resources, a guest in a virtual environment has a share of a resource pool, and will affect both the resource pool that it is leaving and the resource pool where it is placed upon a VMotion. Baselines therefore need to be normalized for the effects of dynamic actions within the virtual environment.
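Two of the criteria above (rolling transaction response times up into an application level number, and capturing every transaction while quickly discarding the uninteresting ones) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Observation` record, the `process` function, the use of a simple mean as the ART rollup, and the fixed `slow_ms` cutoff are all hypothetical choices made here, not a description of how any product computes these numbers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One passively observed transaction (hypothetical record layout)."""
    app: str
    transaction: str
    response_ms: float


def process(observations, slow_ms):
    """Look at every transaction once; keep only the slow ones.

    Returns (art, retained) where art maps each application to the mean
    response time of all its transactions (an assumed rollup), and
    retained holds only the observations slower than slow_ms.
    """
    totals = defaultdict(lambda: [0.0, 0])  # app -> [sum of ms, count]
    retained = []
    for obs in observations:
        running = totals[obs.app]
        running[0] += obs.response_ms
        running[1] += 1
        if obs.response_ms > slow_ms:
            retained.append(obs)  # normal transactions are not stored
    art = {app: total / count for app, (total, count) in totals.items()}
    return art, retained


if __name__ == "__main__":
    seen = [
        Observation("orders", "GET /cart", 40.0),
        Observation("orders", "POST /checkout", 360.0),
        Observation("hr", "GET /profile", 100.0),
    ]
    art, retained = process(seen, slow_ms=250.0)
    print(art)
    print(retained)
```

Every transaction contributes to the rolled up number, but only the exceptions are stored, which avoids building a data warehouse of normal data.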
When AppSpeed ships later this summer, VMware will deliver an Applications Performance Management solution that is very different from anything that is offered by competing hypervisor vendors, or by third party vendors that are focused upon the virtualization performance management market. AppSpeed will bring the following unique features and benefits to the table:
- AppSpeed is implemented as a virtual appliance that is installed in each VMware host, and that collects data from a virtual mirror port (a promiscuous port) on the virtual switch in the host. When it ships, AppSpeed will be one of two performance management products that collect data in this manner (the other is Reflex Systems VMC, historically a security solution that is being augmented with performance data). AppSpeed will also collect the resource utilization data from the vCenter APIs. This means that AppSpeed is collecting its data in a manner that ensures the accuracy of the data in a virtual environment.
- AppSpeed will automatically discover the applications running within the virtual environment and will automatically and continuously map how the components of each application are talking to each other.
- AppSpeed will automatically discover the atomic transactions (the individual request/responses in an HTTP application for example), calculate their round trip response time, calculate their per-hop response times through the layers of the applications system, and map those transactions to the SQL statements and tables in the corresponding back end databases.
- Atomic transactions can be combined into business transactions of interest. So the 200 HTTP request/responses that comprise processing an order can be combined into one transaction of interest. The response time in total and on a per hop basis can then be calculated for this business transaction (a feature that will be greatly treasured by applications owners).
- Baselines are automatically calculated for this response time data, allowing thresholds to be set as a function of variation from a baseline instead of as hard values that do not change by time of day or day of week. This will dramatically cut down on false alarms.
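The idea of alerting on variation from a baseline, rather than on a hard value, can be sketched as follows. This is a minimal Python illustration, not AppSpeed's actual algorithm (which is not described here). The bucketing by weekday and hour, the mean plus `k` standard deviations rule, and the function names are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(history):
    """history: iterable of (weekday, hour, response_ms) samples.

    Returns {(weekday, hour): (mean_ms, stdev_ms)} so that "normal"
    depends on time of day and day of week.
    """
    buckets = defaultdict(list)
    for weekday, hour, ms in history:
        buckets[(weekday, hour)].append(ms)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_abnormal(baseline, weekday, hour, response_ms, k=3.0):
    """Flag a sample more than k standard deviations above its slot's mean."""
    if (weekday, hour) not in baseline:
        return False  # no history for this slot yet, so nothing to compare
    avg, spread = baseline[(weekday, hour)]
    return response_ms > avg + k * spread


if __name__ == "__main__":
    history = [
        (0, 9, 100.0), (0, 9, 120.0), (0, 9, 110.0),  # Monday 9am, busy
        (0, 3, 20.0), (0, 3, 22.0), (0, 3, 21.0),     # Monday 3am, quiet
    ]
    baseline = build_baseline(history)
    print(is_abnormal(baseline, 0, 9, 130.0))  # same value, busy slot
    print(is_abnormal(baseline, 0, 3, 130.0))  # same value, quiet slot
```

With this sample history, a 130 ms response is within the normal range for the busy Monday morning slot but is flagged in the quiet overnight slot, which a single hard threshold could not express.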
Issues with AppSpeed
- It would be wonderful if AppSpeed worked for every application in the virtual environment but it does not. AppSpeed is best suited to applications that are based upon web servers, .Net or J2EE middle tier applications servers, and Microsoft or Oracle database servers. The reason for this is that AppSpeed “understands” the specific applications level protocols used by these layers of applications infrastructure, and uses that understanding to do the mapping and response time calculations. If an application is not built to any of the specific middleware layers supported by AppSpeed, then AppSpeed will be of minimal value to that application.
- AppSpeed is a VMware vSphere specific solution. There are three issues with this: 1) you have to upgrade to vSphere to have an opportunity to buy AppSpeed, 2) AppSpeed is unaware of physical infrastructure that may support an application and may create problems for its performance (AppSpeed can map the databases that support the application as long as the physical database server is being directly accessed by a .Net or J2EE layer in the virtual environment), and 3) AppSpeed is of no use if you have more than one virtualization platform and want to measure response times for an application system that spans them.
- AppSpeed is not an IRT solution, nor does it provide a rolled up ART for every application. Therefore it does not provide the vCenter Administrator the numbers that the vCenter Administrator needs the most – specifically an IRT for every resource pool and application, and an ART for every application running in the virtual environment.
- AppSpeed will be packaged as a plug-in to vCenter, and will be sold by the VMware sales force to the owners and support staff of VMware vSphere as well as the VMware virtual environment in a bid to get upgrades to VMware vSphere. This will expose a great deal of “inside of the application” data to vCenter administrators who will not know what it means (and who really do not care about which transactions inside of an application are slow). It will also require that applications owners (who do care about transaction level granularity) be given access to a vCenter console to get the AppSpeed data about their applications.
- AppSpeed is at least initially an island. Its data is not accessible to third parties via the vCenter APIs, nor is it able to forward summary data or alerts to management frameworks like Microsoft System Center and HP Operations Manager.
- The VMware sales force is going to be challenged as they try to sell an applications specific performance management solution to an audience (IT Operations) that really wants a broad scale infrastructure performance management solution. If the sales start to involve the applications support teams, the VMware sales force will then have to learn how to properly position AppSpeed against focused transaction performance management tools like CA/Wily Introscope, Dynatrace, and OpTier.
Challenges for Competing Virtualization Performance Management Vendors
The delivery of AppSpeed will create a set of significant technical and business challenges for existing virtualization performance management vendors. Depending exactly upon who the vendor is, those challenges are:
- The VMware sales force (along with VMware channel) will now be actively marketing a performance management solution. This will cause a significant amount of confusion, as there is no way that the VMware sales force will understand the nuances of applications performance management for different kinds of applications, vs. a general infrastructure performance management solution. This will in turn cause each third party vendor to have to get very crisp about how they fit or compete with AppSpeed. Vendors like Akorri and Virtual Instruments who have entirely complementary positions (due to their unique value at the SAN/Storage layers of the stack) will have a relatively easy time with proper positioning. Vendors who primarily live off of the vCenter API data, and who have no corresponding response time data face significant product strategy and positioning challenges.
- The VMware sales force and to a certain extent the VMware channel is likely to be much less open to bringing third party solutions into “their” accounts – at least until everyone learns what AppSpeed can do and what it cannot do. Third party performance management vendors will need to significantly hone their marketing and sales strategies in order to be able to execute in this more challenging environment.
- Whatever level of success AppSpeed achieves, it will change the terms of the debate in the market for virtualized infrastructure performance management and applications performance management. The market leader (VMware) is formally embracing response time as the metric by which infrastructure performance, applications performance and transaction performance will be measured and delivered within VMware vSphere. Any vendor that can help customers get business critical applications into production within a virtual environment (by providing response time based performance assurance for a set of applications or for the entire virtual environment) will find great success. Vendors that stick with just the vCenter and WMI resource utilization data will find that the marketplace will pass them by.
For a complete review of all of the virtualization performance management products (including AppSpeed), please read the white paper below:
Virtualized Performance and Capacity Management