What exactly is the point of monitoring your infrastructure and your applications? Hopefully your answer to that question is not to just ensure that your network latency is low, that your servers are up, and that you are not about to run out of memory on a server, or hard disk space on an array. Hopefully your answer is that the end goal of monitoring your environment and your applications is to ensure that the applications that comprise your critical business services (to your internal and external constituents) are performing within the expectations of those constituents.
The Service Assurance Goal
One of the very nice things that VMware did with vSphere was create a virtualization platform that not only allowed workloads to be moved under programmatic control, but also allowed resources to be added to or removed from workloads under programmatic control. It has been possible for quite some time to add virtual CPU and virtual memory to a workload while it was running in a vSphere environment. vSphere 4 added storage I/O control and network I/O control to the mix, meaning that it was possible to programmatically manage all four of the key resources (CPU, memory, storage I/O rate, and network I/O rate) that the performance of an application depends upon.
The end goal of service assurance, then, is quite attractive and quite difficult to attain. The end goal is to be able to say that application “X” needs to be able to run 1000 transactions per minute at an average response time of 500 ms, with 99.9% of those response times being less than 700 ms. The role of automated service assurance in this is to automatically adjust what runs where, and to increase or decrease the resources given to workloads based upon their priority, so that service levels are automatically maintained.
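A target like the one just described can be written down as a small, checkable object. The following is a minimal Python sketch, not taken from any product: the names `ServiceLevelObjective` and `check_slo` are hypothetical, and it assumes response times are collected as a plain list of milliseconds over a known measurement window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevelObjective:
    """Hypothetical container for the three targets named in the text."""
    min_tpm: float          # minimum transactions per minute
    max_mean_ms: float      # maximum average response time in ms
    tail_limit_ms: float    # response time bound for the "fast" share
    tail_fraction: float    # share of requests that must beat the bound


def check_slo(slo, response_times_ms, window_minutes):
    """Return which parts of the objective were met in one window."""
    if not response_times_ms or window_minutes <= 0:
        # Nothing measured: treat every target as missed.
        return {"throughput_ok": False, "mean_ok": False, "tail_ok": False}

    count = len(response_times_ms)
    throughput = count / window_minutes
    mean_ms = sum(response_times_ms) / count
    # Share of requests strictly faster than the tail limit.
    fast_share = sum(1 for t in response_times_ms if t < slo.tail_limit_ms) / count

    return {
        "throughput_ok": throughput >= slo.min_tpm,
        "mean_ok": mean_ms <= slo.max_mean_ms,
        "tail_ok": fast_share >= slo.tail_fraction,
    }


# The example objective from the text: 1000 transactions per minute,
# 500 ms average, 99.9% of responses under 700 ms.
app_x = ServiceLevelObjective(
    min_tpm=1000, max_mean_ms=500, tail_limit_ms=700, tail_fraction=0.999
)

healthy = check_slo(app_x, [400] * 999 + [650], window_minutes=1)
slow_tail = check_slo(app_x, [400] * 998 + [900] * 2, window_minutes=1)
```

In the second window the average is still well under 500 ms, yet two slow requests out of 1000 are enough to miss the 99.9% target, which is why the tail has to be tracked separately from the mean.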
Why Is Automated Service Assurance Such a Hard Problem?
Being able to guarantee that the applications that matter always deliver the response time that users expect is so obviously valuable that one would think we would be swimming in a sea of contending solutions to this problem. But this is not the case, because while writing about this is easy, doing it is hard, for the following reasons:
- If the response time for an application degrades and becomes a problem, there are a thousand different reasons why this might be the case. Some of those reasons might be related to resource constraints in the virtual environment. Some might be related to resource constraints in the physical environment (contention on a storage array between a virtualized application and a physical database server). Some might be related to a configuration issue. Some might be related to a code issue in the application.
- It is extremely hard to tie degradation in response time to a particular issue in the infrastructure and know for sure that this one issue is the cause of the problem. Yes, you can do this statistically with products like Netuitive and the analytics in vCenter Operations, but doing this statistically will tell you that X is correlated with Y, not necessarily that X is causing Y.
- If you do not want to do this statistically, you can try to do this deterministically. But that means writing rules that say that if X happens, fix Y, and there are just too many cases of X and Y, leading to the need for too many rules for this approach to be practical.
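The gap between correlation and causation mentioned in the second point can be made concrete with synthetic data. The sketch below is purely illustrative and assumes a made-up setup: a hidden `load` variable drives both a storage latency metric and the application response time, so the two correlate strongly even though neither causes the other. All names and coefficients are invented for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hidden common cause (assumption): user load on the application.
load = [rng.uniform(0, 100) for _ in range(500)]

# Both metrics depend on load plus independent noise; neither depends
# on the other.
storage_latency = [0.5 * l + rng.gauss(0, 3) for l in load]
response_time = [2.0 * l + rng.gauss(0, 10) for l in load]

raw_correlation = pearson(storage_latency, response_time)

# Remove the contribution of load (known here only because the data is
# synthetic) and correlate what is left.
latency_residual = [s - 0.5 * l for s, l in zip(storage_latency, load)]
response_residual = [r - 2.0 * l for r, l in zip(response_time, load)]
residual_correlation = pearson(latency_residual, response_residual)
```

The raw correlation is high, while the correlation of the residuals is close to zero. A purely statistical tool looking only at the two metrics would flag storage latency as a suspect, even though in this setup fixing the storage would not change response time at all.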
VMTurbo entered the VMware market with a performance and capacity monitoring solution a couple of years ago. The core technology of the company is unique and takes a completely different approach to managing performance, capacity, and resources than any other vendor. The technology is based upon the concept that workloads constitute demand for resources, that the environment constitutes the supply of those resources, and that an economic market equilibrium model can be used to balance supply and demand on behalf of workloads based upon their priority. Basically, resources get priced based upon their scarcity (just like in the real world), workloads get a budget based upon their priority, and the “market” takes care of the allocations.
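To make the supply-and-demand idea concrete, here is a toy market for a single resource. It is a sketch under stated assumptions and not VMTurbo's actual algorithm, which the description above does not detail: the function name `clear_market`, the bisection search for a clearing price, and the rule that each workload buys `min(demand, budget / price)` are all choices made for this illustration.

```python
def clear_market(capacity, workloads):
    """Toy single-resource market (illustrative only).

    workloads maps a name to (demand, budget). Budgets stand in for
    priority: a higher budget buys more of a scarce resource.
    Returns (price, allocation by name).
    """
    total_demand = sum(demand for demand, _ in workloads.values())
    if total_demand <= capacity:
        # No scarcity: the resource is free and everyone gets their demand.
        return 0.0, {name: demand for name, (demand, _) in workloads.items()}

    def purchases(price):
        # Each workload buys what it wants, limited by what it can afford.
        return {
            name: min(demand, budget / price)
            for name, (demand, budget) in workloads.items()
        }

    # At price total_budget / capacity, purchases cannot exceed capacity,
    # so the clearing price lies between `low` and `high`.
    low = 1e-9
    high = sum(budget for _, budget in workloads.values()) / capacity
    for _ in range(100):
        mid = (low + high) / 2
        if sum(purchases(mid).values()) > capacity:
            low = mid   # too cheap: demand still exceeds supply
        else:
            high = mid  # affordable demand fits within supply
    return high, purchases(high)


# Two workloads compete for 100 units; "critical" has three times the budget.
price, allocation = clear_market(
    capacity=100,
    workloads={"critical": (80, 3.0), "batch": (80, 1.0)},
)
```

With these inputs the resource is scarce (160 units demanded, 100 available), the price settles at about 0.04 per unit, and the allocation comes out at roughly 75 units for `critical` and 25 for `batch`. No rule ever says "take from batch when critical is starved"; the split falls out of the budgets and the price.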
In previous versions of the product, VMTurbo has focused upon allocating resources based upon the priority of the application, but has not had any formal awareness of either the response time requirements or the throughput requirements of the application. With this most recent release, this has changed. Now VMTurbo is formally aware of the throughput requirements of Windows applications running in guests on VMware, Microsoft, and Citrix virtualization platforms.
This means that VMTurbo has taken a groundbreaking first step in the industry by tying a metric that the owner of an application actually cares about (its throughput) to how the resources of that application are managed relative to all other applications of greater or lesser priority.
The Future of Service Assurance
VMTurbo is further down the road of automated application-aware service assurance than any other performance and capacity management vendor in the virtualization and cloud ecosystems. But a lot of work remains to be done. The key pieces are obviously support for more than Windows-based applications, and then adding the notion of response time as a key metric to the existing notion of throughput. In “The Rise of Application Operations and the Role for Next Generation APM Solutions”, we discussed the need for a set of next generation APM solutions that would be the core tools for an “Applications Operations” function in the enterprise, a function responsible for supporting all of the applications running on next generation dynamic infrastructures. The obvious answer is for response time data from these solutions to be fed into the VMTurbo model so that both response time and throughput can be automatically assured.
VMTurbo has broken new ground by delivering the first application-aware automated service assurance solution for VMware vSphere, Microsoft Hyper-V, and Citrix XenServer. This is the first solution that takes advantage of the dynamic nature of these platforms and their control APIs to actually ensure something (throughput) that application owners care about. Preemptively assuring throughput (and hopefully response time in the future) may be a more effective approach than waiting for something to go wrong and then trying to pick the one root cause out of the hundreds or thousands of potential candidates.