VMware Addresses the TimeKeeping Issue

Performance management for virtualized systems (a topic covered in great detail in the white paper referenced at the end of this article) is very different from performance management for physical systems, for the following reasons:

  1. Virtualized systems are based upon putting groups of servers into resource pools, which has the effect of creating shared pools of CPU and memory. As a result, measuring how much CPU or memory is available on an individual server, or how much is used by an application, is much less relevant than it was on physical systems.
  2. Virtualized systems are highly dynamic, with workloads moving automatically based upon demand and supply between physical servers. This makes discovery of where applications are running and how they are interacting with each other critical to the understanding of what is slow and why.
  3. Due to what is known as the “Time Keeping Problem” in virtual machines, resource utilization metrics like total CPU and CPU per process, as well as other time-based metrics collected within a virtual machine, are “time shifted” by the degree to which the VM is scheduled out relative to other VMs on the same server. This makes CPU measurements and time-based measurements taken from within a guest unreliable.
  4. Due to #1, #2 and #3 above, measuring performance by understanding how much resource an application is using on a virtual server is no longer nearly as useful as it was when the application was running on a physical server.
  5. Since resource utilization is no longer a useful proxy for applications performance, a new metric is needed. Actually what is needed is new prominence for an old metric – response time.
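
To make the "time shifted" effect in point 3 concrete, here is a deliberately simplified toy model in Python. It assumes a guest clock that simply stalls whenever the VM is scheduled out; real guests behave in more complicated ways (for example, catching up missed ticks later), and the function name and the numbers are illustrative only, not taken from VMware.

```python
def apparent_elapsed_ms(start_ms, end_ms, descheduled_windows):
    """Elapsed time seen by a toy guest clock that stalls while the VM is scheduled out.

    start_ms, end_ms: real (host) times bracketing a transaction.
    descheduled_windows: list of (begin_ms, finish_ms) real-time windows
        during which the VM is not running (an assumption of this model).
    """
    stalled = 0
    for begin_ms, finish_ms in descheduled_windows:
        # Only the part of the window that overlaps the transaction matters.
        overlap = min(finish_ms, end_ms) - max(begin_ms, start_ms)
        if overlap > 0:
            stalled += overlap
    return (end_ms - start_ms) - stalled


# A transaction that really takes 100 ms, during which the VM is
# scheduled out from 20-50 ms and again from 90 ms onward.
real_ms = 100 - 0
guest_ms = apparent_elapsed_ms(0, 100, [(20, 50), (90, 120)])
print(real_ms, guest_ms)  # 100 60
```

In this model the guest reports 60 ms for a transaction that took 100 ms of real time, and the size of the error depends entirely on what the other VMs on the host were doing.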

There has, however, been a set of challenges associated with making response time the primary metric by which applications performance (and system capacity) is judged. Those challenges are:

  1. Good metrics do not exist to comprehensively measure the response time of the infrastructure to requests that are placed upon it. Akorri does a great job of measuring the responsiveness of the physical arrays that it supports in its product, and cross-correlating that data with resource data from the Virtual Center APIs. Virtual Instruments does a great job of comprehensively capturing the SAN data from the Fibre Channel switches and mapping that back into guests and resource pools. But a true end-to-end picture of infrastructure response time has yet to emerge.
  2. A consistently collected applications response time metric does not yet exist for all applications running in the virtual infrastructure. The VMware AppSpeed product is a huge step forward here because it calculates applications response time from the perspective of a virtual appliance that sits on a mirror port on the vSwitch in the host. But AppSpeed does not calculate applications response time for all applications in the virtual environment, only those for which it has decoded the applications level protocols (for example HTTP, SQL Server and Oracle).
  3. There has been no good way to measure applications response time and, more importantly, transaction response time from an agent running in a guest. Business critical applications require that individual transactions be traced across the dynamic virtual infrastructure, which can only be done with an agent that lives in the guest. However, the aforementioned timekeeping issue has prevented these measurements from being taken reliably.

Recently, VMware has updated its seminal document on this subject, Timekeeping in VMware Virtual Machines. Page 9 of this document now contains a new section on Pseudoperformance Counters. VMware has now made available an API call that can be made from user mode in a VM and that returns the real time (as opposed to the apparent time). To measure the time that a transaction takes, one simply calls this API before the transaction starts, calls it again after the transaction ends, and subtracts the first reading from the second. Since the time numbers provided are the time on the physical host, the resulting transaction time is not shifted by the clock drift issues that plague the system clock in the guest OS.
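
The call-before, call-after, subtract pattern described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not VMware's API: `read_real_time_ns` is a hypothetical callable standing in for whatever wrapper an agent would use to read the pseudoperformance counter, and the default `time.perf_counter_ns` is only a placeholder guest clock that does not have the host-time property.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_transaction(
    transaction: Callable[[], T],
    read_real_time_ns: Callable[[], int] = time.perf_counter_ns,
) -> Tuple[T, int]:
    """Run a transaction and return (result, elapsed nanoseconds).

    read_real_time_ns is a hypothetical hook: inside a VMware guest it would
    wrap the real-time counter; here it defaults to an ordinary guest clock.
    """
    start_ns = read_real_time_ns()   # reading before the transaction starts
    result = transaction()
    end_ns = read_real_time_ns()     # reading after the transaction ends
    return result, end_ns - start_ns


# Usage with a fake clock so the arithmetic is easy to follow.
fake_readings = iter([1_000, 4_500])
result, elapsed_ns = time_transaction(lambda: "ok", lambda: next(fake_readings))
print(result, elapsed_ns)  # ok 3500
```

Passing the clock in as a parameter keeps the measurement logic independent of how the counter is actually read, so the same agent code could be tested on a physical machine and then pointed at the host-backed counter inside a guest.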

The availability of this API has the potential to precipitate some fundamental changes in how performance and capacity management are implemented for virtualized systems (at least for virtualized systems that are based upon VMware):

  1. Vendors of agent-based transaction performance management products are now in an excellent position to bring their value to VMware. The vendors that are best positioned here are Quest (with the full Foglight product), BlueStripe, dynaTrace and Optier, all of whom specialize in measuring transaction performance in the server tiers of applications systems, and Knoa and Aternity, who specialize in measuring transaction performance from the perspective of the actual end user’s workstation.
  2. VMware’s historical bias against agents in guest operating systems appears to be mellowing a bit. Clearly, by delivering AppSpeed, VMware has said that response time measurements should be taken from outside of the guests if at all possible. However, VMware is also now being very pragmatic and recognizing that measuring response time from outside of the guests on a VMware host lacks end-to-end transaction level granularity. This level of granularity is essential for certain cases (like business critical systems that will not get virtualized unless response time data at this level of granularity exists).
  3. As AppSpeed is rolled out, and as new agent-based solutions that take advantage of this new counter in VMware are delivered, a sea change will occur in how the performance and capacity of virtualized systems are assessed. The bottom line is that far less attention (and money) will be spent on taking a resource consumption based view of performance and capacity, and far more attention (and money) will be spent on taking a response time view of performance and capacity.

This advance on the part of VMware, like all others, is not a panacea and comes with caveats. Issues may occur when VMs are migrated from one host to another, and the counter may experience variation under these conditions as well. It is unknown at this time what the overhead of this new counter will be for heavily loaded transaction systems in production. That is something that will get sorted out as products that use this new counter are delivered and field tested.

Even with the above caveats, it is clear that VMware has laid the groundwork for a significant change in how the performance of business critical systems is assessed and managed. This is critical to VMware’s ability to virtualize these applications and extend its reach to more than the existing set of “low hanging fruit” that has been virtualized to date. It is also clear that VMware has taken another small but significant step in the process of clearly differentiating the vSphere platform from its alternatives as a host for these business critical applications (as no comparable capability exists or has been announced by other virtualization platform vendors).

Virtualized Performance and Capacity Management
