Reinventing Infrastructure Performance Monitoring for the Cloud

Virtualization (the underlying foundation of a cloud) introduces a set of challenges to how one should monitor the performance of the infrastructure and the applications. These challenges are:

  1. Unlike physical systems that are mostly dedicated to specific applications, virtual systems are both shared and dynamic. As a result, resource utilization statistics are no longer an effective way to infer how the infrastructure is performing. Leading-edge Infrastructure Performance Management vendors like AppFirst, Akorri, CA|NetQos, Virtual Instruments and Xangati are taking an Infrastructure Response Time approach to this problem that avoids the issues of relying upon resource metrics and provides a true picture of how the infrastructure is actually performing.
  2. The dynamic and shared nature of virtualized systems impacts Applications Performance Management as well. Looking at how applications are using resources is also no longer as useful as it was in the physical world. Applications move around, and therefore APM products need to track them and map them as a part of assessing their performance. Virtualization places a premium on understanding the actual response time of the applications, something that has been difficult to achieve across a broad range of applications. Vendors like AppDynamics, BlueStripe, Coradiant, and New Relic are all focused upon these problems and are making great progress. Putting an application in a cloud places a premium upon having good response time data for that application. These vendors all have excellent solutions in this space, but care needs to be taken to match the monitoring solution to the cloud type and the application type.

Now along comes the public cloud which is built upon virtualization, but which itself introduces a new set of challenges:

  1. Cloud providers have to assume massively scaled environments (larger than those in even the largest enterprise data centers), and have to attend to cost pressures greater than those present in most enterprises (because the cloud vendor has to be less expensive than the enterprise alternative). For these reasons, cloud providers are using a new class of infrastructure monitoring solutions built around open source economics and scale, from vendors like Nimsoft (now part of CA), SolarWinds, Nagios, and Zenoss.
  2. While these new infrastructure management solutions do a great job of telling the cloud vendor how the entire infrastructure is performing from an availability and resource utilization perspective, they cannot (at least today) provide a customer of the cloud an accurate perspective of how that customer’s slice of the cloud is performing. In other words, if you have an application in a cloud and you think the performance problem you are seeing is the fault of the cloud vendor, the cloud vendor has no way to provide you an accurate assessment of how your slice of its infrastructure is actually performing for you. For this problem to be addressed, infrastructure performance management for the cloud needs to go in the same direction that it is going for virtualization. This means that IPM for the cloud needs to focus upon Infrastructure Response Time, and that the IPM vendors who want to focus upon the cloud need to figure out how to provide IRT in a multi-tenant manner. This would allow cloud vendors to provide a specific IRT number to each customer via an API. The customer could in turn consume that number in an APM solution, and thereby figure out what part of total response time is caused by delays in the infrastructure vs delays in the application.
  3. Applications Performance Management can be very different in the cloud depending upon whose cloud you are using and what type of cloud it is. Let’s assume that your existing APM solution relies upon a physical appliance attached to a mirror port on your switch in your data center. Good luck getting that installed in any cloud. How about a virtual appliance? Well, if your cloud is VMware based then you are in luck. If you are on Amazon EC2, virtual appliances are not supported. If your APM solution is based upon an OS level agent and you are using a PaaS cloud like VMforce, you are out of luck – you have no access to the OS underlying the Java layer in the PaaS cloud. If you have a Java based application then you are in luck. AppDynamics and New Relic both have agents that you insert into your application, and that travel with your application into and out of the clouds (New Relic also supports Ruby on Rails). But if your application does not fall into the narrow categories of applications supported by these modern cloud-aware APM solutions, then your best choice will likely be to use an IaaS cloud where you have access to the OS, and BlueStripe, which can provide you with accurate response time information for any TCP/IP application.
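
To make the idea in item 2 concrete, here is a minimal sketch of how an APM tool might use a per-customer IRT number if a cloud vendor exposed one. No such API exists today, so the function name, field names, and the millisecond figures below are all hypothetical; the only thing taken from the discussion above is the idea that total response time can be split into an infrastructure part and an application part.

```python
from dataclasses import dataclass


@dataclass
class ResponseBreakdown:
    total_ms: float
    infrastructure_ms: float
    application_ms: float
    infrastructure_share: float  # fraction of total spent in the infrastructure


def split_response_time(total_ms: float, irt_ms: float) -> ResponseBreakdown:
    """Attribute a measured total response time to infrastructure vs application.

    total_ms: end-to-end response time as measured by the APM tool.
    irt_ms:   the tenant's Infrastructure Response Time for the same interval,
              as it might be reported by a (hypothetical) cloud vendor API.
    """
    if total_ms <= 0:
        raise ValueError("total_ms must be positive")
    # Clamp so that measurement skew cannot produce a negative application time.
    infra = min(max(irt_ms, 0.0), total_ms)
    app = total_ms - infra
    return ResponseBreakdown(total_ms, infra, app, infra / total_ms)


# Illustrative numbers only: 800 ms end to end, 200 ms of it in the infrastructure.
breakdown = split_response_time(total_ms=800.0, irt_ms=200.0)
print(
    f"infrastructure: {breakdown.infrastructure_ms} ms "
    f"({breakdown.infrastructure_share:.0%}), "
    f"application: {breakdown.application_ms} ms"
)
```

The simple subtraction assumes that the two measurements cover the same requests over the same interval; in practice the hard part is the one described above, namely getting an IRT figure that is scoped to a single tenant.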

In summary, in an internal IT environment, the infrastructure management group can be held responsible for the performance of that infrastructure – and the tools exist from leading-edge IPM vendors like AppFirst, Akorri, CA|NetQos, Virtual Instruments, and Xangati to allow infrastructure managers to actually assess the performance of their infrastructure based upon real Infrastructure Response Time data. In the cloud, the tools presently used by cloud vendors do not provide an IRT number, nor do they at present support multi-tenancy, and therefore cannot provide relevant per-customer performance data. The infrastructure manager at the cloud vendor therefore cannot yet make the same kinds of performance guarantees that an internal IT manager can make around an internal implementation of virtualization.
