Service Assurance – The Key to Virtualizing Tier 1 Applications

This article asks (but unfortunately cannot fully answer) the single most important question facing enterprises that want to get to 80% or 100% virtualized environments. That question is, “How can I guarantee the performance of a business-critical, performance-critical tier 1 application on my virtual infrastructure?”

Before we address the question of how to do this in a virtual environment, let’s look at how this is done today when such applications are deployed on a purely physical infrastructure:

  • Much of the physical infrastructure that supports each application is unique to that application. This includes the servers that run the application, and often the LUNs and spindles in the storage array that support that application.
  • The teams that own and support that application have budget control of these unique resources and can add resources (throw hardware at the problem) whenever they feel that a lack of hardware might be impacting performance.
  • The shared physical infrastructure that supports many applications tends to be massively over-provisioned. This includes in particular the LAN and the SAN.
  • There is a clear dividing line between the team that supports the OS and the team that supports the applications. Often these teams do not even report to the same person until one gets very high up in the organization (to the CIO or even above).
  • When there is a performance problem, everyone shows up in a room with the data from the tools that manage their layer (storage, SAN, LAN, WAN, server, operating system, middleware, and application), and everyone uses this data to exonerate themselves so that they can leave the meeting. This is called a Blamestorming meeting, and the key metric for each person in the meeting is mean time to innocence.

When this business-critical application is placed upon a virtual infrastructure, the following things change:

  • Dedicated resources become shared. At a minimum, CPU and memory become more shared than before, as virtualization enables consolidation of over-provisioned physical servers into more heavily utilized hosts supporting guests.
  • The layer of software that enables this sharing is the virtualization platform (for example VMware vSphere).
  • Application owners naturally assume that any performance problem that occurs after the application is virtualized is the fault of the virtualization platform and the resource sharing it implements.
  • Since the IT Operations team owns the virtualization platform, and the virtualization platform is getting blamed for performance issues, the IT Operations team now needs tools that allow it both to know how the infrastructure and the applications are performing and, more importantly, to ensure (assure) that performance-critical applications always get the priority they need to deliver the required response times to users.
  • Applications teams need virtualization-aware and cloud-aware APM solutions that allow them to manage the response times of the applications themselves (after the infrastructure has been exonerated).

Meeting the needs described above requires that a new category of Virtualization Management solutions come to the market. This new category borrows capabilities from the Infrastructure Performance Management, Applications Performance Management, and Configuration Management segments of the overall Virtualization Management space. This new category is about assuring (guaranteeing) the performance of business critical applications, which includes both the ability to ensure that the required resources are allocated to each application and doing so with an understanding of the relative priority of the applications involved.

To understand what is required here, let's take a look at the feature of vSphere that addresses these needs in a preliminary and very limited manner. That feature is the Distributed Resource Scheduler (DRS). DRS is able to move VMs from one host to another, but does so without awareness of all of the relevant resources (DRS is unaware of the storage performance implications of the moves that it makes). Furthermore, DRS is completely unaware of the actual performance of the applications it is moving around (performance being defined as Applications Response Time, not how much resource the application is using) and of the relative importance of different applications. Finally, DRS has no mechanism to ensure that, when a move is made, the configuration of the target environment is identical to the configuration of the source environment. To turn this around, an effective Virtualization Service Assurance solution would have to incorporate the following elements:

  1. Be able to measure directly, and/or collect from existing Applications Performance Management solutions, the Applications Response Time and the allowed variation (the actual service level) for each application for which service will be assured.
  2. Understand the relative priority of applications. Just because two applications each require an average response time of 1 second, with 90% of the response times being between 0.8 seconds and 1.2 seconds, does not mean that they are equal in priority and value to the business.
  3. Measure the resource utilization and/or the infrastructure response time in the environment in order to determine constraints in the system.
  4. Allocate the resources in the system that comprise the performance bottlenecks or constraints to their highest and best use. This is analogous to an auction where the application that is willing to “pay” the most for performance gets the resources that constrain performance ahead of the next most important application. Google Search and the ads that appear on the right are a great example of such an economic allocation system. The vendor of the ad that was willing to pay the most to be associated with the search words is at the top of that list of ads, the vendor that was willing to pay the next highest amount is second, and so on.
  5. Ensure that the required level of performance is not impacted by either configuration events or configuration mismatches between an existing location and the location to which the application may be moved.
  6. Finally, from a planning perspective, provide tools that allow infrastructure and applications teams to manage the capacity of the environment for the current set of hosted applications and to perform what-if analysis. For example, if there are already four applications in a resource pool that require a response time of 1 second +/- 0.2 seconds 90% of the time, would it be possible to add a fifth that has N users without violating existing performance guarantees and while meeting the requirements of the fifth application?
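The service level used as an example in the list above (a 1 second response time, +/- 0.2 seconds, 90% of the time) can be expressed as a simple check against measured response times. The following is a minimal sketch in Python, not the method of any product named in this article. The names ServiceLevel, fraction_in_band and meets_service_level are hypothetical, and the samples are assumed to come from whatever APM tool measures Applications Response Time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ServiceLevel:
    """Hypothetical service-level definition for one application."""
    target_s: float     # target response time in seconds, e.g. 1.0
    tolerance_s: float  # allowed deviation in seconds, e.g. 0.2 for +/- 0.2 s
    fraction: float     # share of samples that must fall in the band, e.g. 0.9


def fraction_in_band(samples: Sequence[float], sl: ServiceLevel) -> float:
    """Return the share of response-time samples inside target +/- tolerance."""
    if not samples:
        raise ValueError("need at least one response-time sample")
    low = sl.target_s - sl.tolerance_s
    high = sl.target_s + sl.tolerance_s
    inside = sum(1 for s in samples if low <= s <= high)
    return inside / len(samples)


def meets_service_level(samples: Sequence[float], sl: ServiceLevel) -> bool:
    """True when enough samples fall inside the allowed band."""
    return fraction_in_band(samples, sl) >= sl.fraction


if __name__ == "__main__":
    sl = ServiceLevel(target_s=1.0, tolerance_s=0.2, fraction=0.9)
    # Illustrative samples only: nine inside the band, one outlier at 1.5 s.
    samples = [0.9, 1.0, 1.1, 0.95, 1.05, 1.0, 0.85, 1.15, 1.0, 1.5]
    print(meets_service_level(samples, sl))
```

A check like this says only whether a service level is being met. It says nothing about which application should win when two of them compete for the same constrained resource, which is why relative priority has to be tracked separately.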

There are two vendors that are directly focused upon meeting these needs for enterprises. Platform Computing launched Platform ISF back in early 2009. Platform ISF builds upon Platform’s existing and strong position in scheduling resources in high performance computing grids to ensure the required job completion time. Platform ISF uses the same scheduling algorithm to provide resource guarantees across private clouds, public clouds and physical resources. A new vendor, VMTurbo, has just launched with an extremely innovative solution that combines a complete picture of resource constraints (memory, CPU, network and storage) with an economic equilibrium model. This model allows different applications to value these resources differently, and therefore ensures that resources go to the application that places the highest value on them (that is, the application that is the most important).
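To make the idea of allocating a constrained resource to its highest and best use concrete, here is a deliberately simplified sketch in Python. It is not the algorithm used by Platform ISF or VMTurbo, whose internals are not described in this article. The Bid class, the allocate function, the application names and the integer priority value are all assumptions made for illustration. The sketch simply grants a single constrained resource to applications in descending order of business priority until capacity runs out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Bid:
    """Hypothetical request for one constrained resource."""
    app: str        # application name
    priority: int   # assumed business value; a higher number wins
    demand: float   # units of the constrained resource requested


def allocate(capacity: float, bids: List[Bid]) -> Dict[str, float]:
    """Grant the resource in descending priority order until it runs out."""
    grants: Dict[str, float] = {}
    remaining = capacity
    # sorted() is stable, so equal priorities keep their original order.
    for bid in sorted(bids, key=lambda b: b.priority, reverse=True):
        granted = min(bid.demand, remaining)
        grants[bid.app] = granted
        remaining -= granted
    return grants


if __name__ == "__main__":
    bids = [
        Bid("reporting", priority=1, demand=6.0),
        Bid("order-entry", priority=3, demand=5.0),
        Bid("email", priority=2, demand=4.0),
    ]
    # With 10 units available, the lowest-priority application is
    # the one left short.
    print(allocate(10.0, bids))
```

A real economic model would let applications value each resource differently and would re-price as demand shifts. This sketch captures only the ordering principle: when a resource is the bottleneck, the most important application is served first.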

However, while Platform and VMTurbo do a much better job than DRS in the critical task of allocating resources to their highest and best use, both solutions currently lack perspectives on the true response times of the applications being managed, and on how configuration changes and mismatches could negatively impact true applications performance. For these reasons, the set of vendors that could potentially fill in these gaps is included in the table below.

Service Assurance – Candidate Solutions

| Vendor | Infrastructure Response Time (IRT) | Applications Response Time (ART) | Configuration | Service Prioritization & Resource Allocation |
|---|---|---|---|---|
| Akorri | Yes – collects IRT end-to-end from the guests to the spindles and back again | No | Yes – collects configuration data from the vCenter APIs, and via detailed instrumentation of storage arrays. Maps guests to spindles. | No |
| AppFirst | No – collects resource utilization data only, not IRT | Yes – calculates end-to-end and hop-by-hop ART for every TCP/IP application across virtual and physical infrastructures | No | No |
| BlueStripe | No – collects resource utilization data only, not IRT | Yes – calculates end-to-end and hop-by-hop ART for every TCP/IP application across virtual and physical infrastructures | No | No |
| Hyper9 | Collects NFS storage I/O latency | No | Search-based, broad and deep range of configuration metrics spanning the virtual environment and its supporting physical infrastructure. | No |
| ManageIQ | No – collects resource utilization data only, not IRT | No | Extremely broad and deep range of configuration metrics spanning the virtual environment and its supporting physical infrastructure. | No |
| NetQos/CA | Yes – collects IRT on a per-application basis | View of applications performance limited to port and protocol | No | No |
| Platform Computing (Platform ISF) | No – collects resource utilization data only, not IRT | No | No | Yes – has a sophisticated resource scheduling system which allows resource guarantees to be made to critical applications. |
| Reflex Systems | No – collects resource utilization data only, not IRT | No | Extremely broad and deep range of configuration metrics spanning the virtual environment and its supporting physical infrastructure. Also able to identify applications via VMSafe Deep Packet Inspection. | No |
| VMTurbo | No – collects resource utilization data only, not IRT | No direct collection of ART, but can leverage ART collected by existing APM products | Gets configuration events from vCenter APIs. Can also take configuration-related actions. | Yes – understands memory, CPU, network and storage resource constraints. Understands relative applications priority and value, and makes the optimal decision. |
| VMware AppSpeed | No – collects resource utilization data only, not IRT | Yes – calculates end-to-end and hop-by-hop ART for HTTP/Java/.Net/Database applications | No | No |
| VMware DRS | No – collects resource utilization data only, not IRT | No | No | Yes – makes dynamic load balancing decisions based upon memory and CPU utilization metrics. No knowledge of relative applications priority or value. |

Enterprise Recommendations

Virtualizing business-critical, performance-critical tier 1 applications requires that mechanisms be put in place to allow the performance (response time) of these applications to be assured (guaranteed) to the business constituents of the applications and the users of the applications. This requires an integration of Infrastructure Performance Management, Applications Performance Management, and Configuration Management with a highly intelligent resource allocation algorithm and guest placement automation. Enterprises seeking to implement this level of management of their virtualized environments will likely need to integrate new service assurance solutions with either existing or new applications performance management solutions and configuration management solutions, as no complete out-of-the-box solution exists today. However, such custom integration should be well worth the effort for leading-edge, sophisticated enterprises that seek to drive virtualization to the greatest level of adoption possible.
