In “A Perfect Storm in Availability and Performance Monitoring”, we proposed that legacy products from the physical environment should not be brought over into your new virtualized environment, and that you should in fact start over with a horizontally layered approach, choosing a scaled-out, highly flexible product at each layer that can integrate with products at adjacent layers. In this post we propose a Reference Architecture that can be used to accomplish this.
First of All, Why Start Over?
Why, you ask, should I not bring the products that I have been using for years in my physical environment over to my new virtualized environment? Well, all of the changes discussed in the Perfect Storm post are one set of reasons. But there are many others:
- You are moving from dedicated silos of applications, operating systems, servers, and in some cases even switches, SANs, and storage arrays, to a set of hardware which is shared among many applications, application frameworks, and operating systems. This level of sharing creates numerous new opportunities for things to conflict with one another, and it is enabled by a virtualization platform which, as the new layer in the stack, itself needs to be aggressively monitored.
- This new environment is dynamic. A new server can be created by copying a file. Features of the virtualization platform like vMotion cause workloads to move between hosts – changing the usage patterns of server, network, and storage resources.
- IT as a Service (ITaaS) initiatives create an environment where requests by users are provisioned in a fully automated manner – making workloads a function of the completely unpredictable requirements of your business constituents.
- Many of the “procedures” that you used in the past to enhance availability and diminish the likelihood of performance hits are no longer available to you. In particular, siloing hardware and over-provisioning hardware are taken away as options once you virtualize, since the ROI from virtualization depends upon provisioning more aggressively.
- While looking at resource utilization metrics is important, due to the dynamic nature of the environment it is impossible to know for sure, from these metrics alone, that your applications are performing well in a virtualized environment. Therefore the understanding of both infrastructure performance and application performance needs to move away from a resource utilization approach to an approach focused upon latency (for the infrastructure) and response times (for the applications).
- The dynamic nature of the environment, plus the fact that a large number of components are packed into ever more dense configurations (like a Cisco UCS), means that a larger number of elements need to be monitored more frequently than was the case with previous physical architectures. This impacts every aspect of monitoring, from watching configuration changes to understanding application response time.
- The dynamic nature of the infrastructure, plus the likelihood of user-driven workloads (ITaaS), means that monitoring solutions need to be self-configuring, and need to automatically discover and keep up with the changes in their layer of the environment.
- If you throw public clouds into the mix, you introduce the idea that applications can be spread across organizational boundaries, and that there is (in the cloud) an organizational separation between the ownership of the infrastructure (the cloud provider) and the application (the customer of the cloud provider).
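The point above about resource utilization metrics can be made concrete with basic queueing theory: under the standard M/M/1 model, mean response time is the service time divided by (1 − utilization), so latency grows nonlinearly as utilization climbs, and a resource that looks comfortably busy can already be hurting applications. A minimal Python sketch (the formula is standard; the numbers are purely illustrative):

```python
def mm1_response_time(service_time_ms: float, utilization: float) -> float:
    """Mean response time for an M/M/1 queue: W = S / (1 - rho).

    service_time_ms: mean time to service one request with no queueing
    utilization:     fraction of time the resource is busy (0 <= rho < 1)
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

# A disk with a 5 ms service time has already doubled its latency at
# 50% busy, and delivers 10x the service time at 90% busy -- even
# though a utilization dashboard shows nothing alarming.
for rho in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"{rho:.0%} busy -> {mm1_response_time(5.0, rho):.1f} ms")
```

This is exactly why the latency itself, not the utilization number, is the metric worth watching.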
The above reasons (combined with many others) mean that yesterday’s monitoring solutions are not built to handle these new requirements. But as we pointed out in the Perfect Storm article, just buying a new set of monitoring products without rethinking the entire architecture of your monitoring solution stack is a mistake. For this reason we are proposing a new Reference Architecture for virtualization and cloud based monitoring.
The Virtualization and Cloud Monitoring Reference Architecture
The key hypothesis of the Reference Architecture below is that in a typical enterprise each layer of the stack will be large (lots of devices), complex, and multi-vendor. Therefore the most important thing is to take an approach that ensures that you are on top of each layer, rather than taking an approach that starts at the top of the stack and tries to monitor everything that pertains to one application all of the way from the code to the spindle in the array.
The reason for this approach is simple. The average large enterprise has over 1,000 business critical applications. If you buy a silo-oriented monitoring product for each application, that is 1,000 monitoring products. Even if you buy products only for specific applications, you will end up with too many overlapping monitoring solutions, which is exactly the problem you have today in your physical environment, and exactly the problem you want to get away from.
Let’s work our way from the bottom up and discuss solutions that address each layer of this stack. Note that there are solutions that can be used to address multiple layers of the stack – and these should be pursued as long as you are not giving up breadth of device support and scalability of the solution in the process.
The SAN and Storage Layers
It is fair to say that a great deal of the performance issues that arise in virtualized environments have something to do with storage. It is also fair to say that, for the most part, the teams that support the virtualized environment have little to no visibility into how storage performance (latency) is impacting the performance of the system and the applications in the environment. In the Perfect Storm post we discussed the need for vendors of infrastructure components like storage devices and network switches to step up and improve both the instrumentation of their own devices and the accessibility of this information to third parties. This is a particular problem in the storage arena. NetApp has done an exemplary job of instrumenting its own storage arrays and, through the acquisition of Akorri, of doing something useful with that data. On the other hand, EMC has purposely made good storage performance information difficult and expensive to get out of its products, in an attempt to force customers to buy EMC-specific storage performance management products from EMC.
The SAN is a similar black hole. The SNMP data available for most SAN switches is completely useless when it comes to assessing how long it is taking the SAN to execute storage network transactions, which ports are congested, and how the SAN-attached storage is performing.
If you define infrastructure performance as latency or infrastructure response time, then there are only three products that can manage this layer for you. NetApp (Akorri) BalancePoint measures the latency of requests to the storage arrays, from the servers that house the HBAs to the spindles in the array and back, and provides a map of which guests are accessing which spindles. Virtual Instruments Virtual Wisdom uses a TAP in the SAN to see all of the data flowing through the SAN, in the same manner that a network analyzer uses a spanned or mirrored port on an IP switch to see all of the TCP/IP traffic. Virtual Wisdom is unique in that it is the only product that can calculate the exchange completion time for every transaction that flows through the SAN, all of the time, irrespective of which vendors’ products are on either side of the SAN switch. Quest vFoglight Storage also calculates storage latency and its impact upon hosts and guests, but as the newest of these three products it only supports a limited set of storage arrays at this time.
The Physical Server, LAN, Switch, Router, and Virtualization Platform Layer
Despite the fact that this layer of the IT infrastructure has more monitoring products available than any other layer in the stack, finding the right products here is an extremely challenging task. The reason is that you have to make trade-offs between products that are specific to the virtualization platform (like Veeam Monitor, Quest vFoglight, SolarWinds Virtualization Manager, VMTurbo, and vKernel) and products (like CA Virtual Performance, Zenoss, Xangati, Netuitive, and Quest Foglight) that have support for both the physical infrastructure and the vSphere virtualization platform. If you buy into infrastructure latency as a key criterion, then the field narrows down quite a bit, to Xangati and CA Virtual Performance. If you have a really large enterprise-class network, you could take the approach of buying the ultimately scalable network monitoring solution from SevOne and combining it with one of the vSphere-specific solutions listed above.
The Application Layer
This is the layer that will determine whether or not you will be able to successfully virtualize business critical applications. If you want to virtualize them, you must be able to guarantee their performance to their constituents and end users. And in order to guarantee that performance, you must be able to measure the response times that these applications are delivering to those constituents and end users.
There are several key requirements here. They include the ability to automatically detect new applications as they come up for the first time, and to automatically instantiate monitoring for these applications without any pre-configuration or admin intervention. The product should also automatically discover the topology of the application system and keep that topology map up to date as layers of the application get scaled out, or as VMs get moved around. Finally, the product should be able to calculate hop-by-hop and end-to-end response time for the applications of interest in the environment.
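As a sketch of what self-configuring, auto-discovering monitoring means in practice, the loop below polls an inventory source for new VMs and instantiates a response-time probe for each one it has not seen before. Both `list_vms` and `attach_probe` are hypothetical placeholders for whatever your platform’s API actually provides (for example, the vSphere SDK’s inventory calls):

```python
import time

def discover_and_monitor(list_vms, attach_probe, poll_seconds=60, max_cycles=None):
    """Continuously discover new VMs and instantiate monitoring for them.

    list_vms:     callable returning the current set of VM identifiers
                  (hypothetical; stands in for a real inventory API)
    attach_probe: callable that starts response-time monitoring for one VM
    max_cycles:   stop after this many polls (None means run forever)
    """
    monitored = set()
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        for vm in list_vms():
            if vm not in monitored:   # first time we have seen this VM
                attach_probe(vm)      # no pre-configuration required
                monitored.add(vm)
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(poll_seconds)
    return monitored
```

The essential property is that a VM created five minutes ago by copying a file is picked up on the next poll without anyone editing a monitoring configuration.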
It is in the applications of interest that the trade-offs will need to be made. This boils down to a depth vs. breadth trade-off. Products like Quest Foglight (Java and .NET), dynaTrace (Java, .NET, C++), and New Relic (Ruby, Java, .NET, PHP) provide deep code-level diagnostics, but at the expense of only supporting applications written on the platforms listed above. Vendors like BlueStripe, Correlsense, and OpTier support a far broader range of applications, but do not provide the kind of in-depth code-level diagnostics provided by vendors who instrument the application server with byte code instrumentation.
Integrating your New Monitoring Stack
One thing you can be pretty sure of: whatever three products you select, they will probably not be integrated with one another out of the box. Most of these products have easy-to-use APIs, and ad-hoc integration is usually easily done. But the fact of the matter is that integrating these products deeply, so that, for example, a response time problem found in the application layer product is traced to its root cause in a server or storage layer product, is quite hard to do.
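To illustrate how thin such ad-hoc glue can be (and why deep root-cause integration remains hard), here is a minimal sketch that pulls latency samples from one product’s REST API and pushes them into another’s ingest endpoint. The URLs, payload shapes, and field names are all hypothetical; real products differ in authentication and format:

```python
import json
import urllib.request

def to_ingest_payload(samples):
    """Wrap raw latency samples in the (hypothetical) sink's JSON envelope."""
    return json.dumps({"metrics": samples}).encode("utf-8")

def forward_latency_metrics(source_url, sink_url):
    """Pull latency samples from one monitoring product's REST API and
    push them into another product's ingest endpoint.

    Both endpoints are hypothetical stand-ins; real products will need
    their own auth headers and payload formats.
    """
    with urllib.request.urlopen(source_url) as resp:
        samples = json.load(resp)  # e.g. [{"lun": "lun0", "latency_ms": 12.4}]
    req = urllib.request.Request(
        sink_url,
        data=to_ingest_payload(samples),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Glue like this moves data between products easily enough; what it does not do is tell you that a spike in application response time was caused by a particular LUN, which is where a cross-product correlation engine earns its keep.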
For this reason, enterprises should strongly consider Netuitive. Netuitive has connectors to a wide variety of monitoring solutions (and can quickly add more). Netuitive has a self-learning performance analysis engine that continuously correlates data across multiple products with performance degradations, and can even predict degradations far enough into the future for a portion of them to be averted. Netuitive also directly instruments the vSphere APIs, so you can complement Netuitive with a product that collects data about the physical infrastructure that supports the VMware environment, plus some application-level data, and you are done.
VMware will also shortly be releasing its first product in this area, rumored to be vCenter Operations Manager, which is rumored to include the Integrien self-learning technology. It remains to be seen whether vCOM will simply integrate the data that vCenter already collects, or whether it will be open to data from third parties (as Microsoft SCOM is).
Building a new availability and performance management stack for your virtualized environment should be done by picking best of breed solutions at the storage, SAN, server, network, virtualization platform and applications layers. Data from these layered products should then be integrated through a self-learning performance analysis solution like Netuitive in order to automate the interpretation and root cause analysis process.