A Perfect Storm in Availability and Performance Monitoring

Monitoring computing infrastructure and applications for capacity, availability, and performance is a business that has been around for a long time; in fact, for just about as long as computers have been used for business-critical applications (since the mainframe-led era of the 1960s). Since that time several waves of change have swept through the computer industry, and each wave has brought new computing architectures, new applications, new monitoring requirements, and new monitoring approaches. Those waves have included mini-computers, personal computers, LAN-based file sharing, client/server computing, Internet (browser) based computing, and N-tier SOA-based applications, and now include agile development, virtualization, cloud-based computing, and the proliferation of mobile applications.

It is the combination of agile development (which puts new code into production at very frequent intervals), virtualization (which commingles previously dedicated systems into a shared infrastructure), cloud computing (which creates commercial separation between the infrastructure and the application), and mobile applications (which are driving yet another end-user-fueled set of requirements: “I want it on this device now”) that is creating a perfect storm for the monitoring industry.

For these reasons it is time for most enterprises to engage in a serious re-evaluation of their monitoring strategies. The first task should be to perform an inventory of the currently licensed monitoring solutions to find out how many you have, and what you are paying per year in licensing, subscription and maintenance for these products. Many enterprises will find that they have over 200 different monitoring solutions each of which was sold to a department or group to solve a specific problem, and within which there is massive overlap in functionality.

The evaluation of how to move forward with the removal of old legacy and overlapping monitoring solutions and their replacement with solutions that can prosper in the perfect storm described above should be based upon the principles listed below.

Systems Management Frameworks Cannot Move Forward into this new World

Almost every enterprise has a legacy systems management framework from IBM, CA, HP or BMC. Notice that none of the companies that started with highly scaled out cloud and virtualized data centers (Amazon, Facebook, Google, Netflix, and all of the cloud based service providers) use these products. They are too heavyweight, too difficult to configure, too hard to maintain, require too many people to customize and operate, and are too expensive to be brought forward into your new virtualized data center or your private cloud (AKA your IT as a Service initiative). It will likely be too hard to remove these products from your existing physical environment, but they should not be brought forward into your virtualized data centers and the IT as a Service initiatives built on top of them. As each physical server becomes a VM, these solutions should be uninstalled from it. Of the four major vendors in this space only CA understands how new these requirements are, and only CA is on an aggressive path to refresh its technology portfolio through acquisition.

A “Top-Down” Purchase of a Monitoring Suite Should be Avoided

If you are an enterprise with 200 different monitoring solutions, this is how you got here. Someone high up in your organization signed an ELA with a major monitoring vendor who promised to meet all of your needs. You probably found that this product could monitor your networks and your servers, but that when teams with special requirements tried the other products in the portfolio, those products were found lacking. So the point products included in the “big buy” failed, and your teams went out and bought point products that met their needs. Do not repeat this mistake. No single vendor is going to be able to monitor your environment from the storage array to the end user, so do not believe any vendor that claims to be able to do so.

A Franken-Monitor is to be Avoided At all Costs

If a vendor claims to have a monitoring suite that looks on paper like it can meet all of your needs, then it is highly likely that a lot of the products on that sheet of paper have been acquired over the last few years. When these products were acquired they came with their own databases, data models, analytics, and consoles. It takes years to integrate these products into cohesive solutions. For most vendors with broad product lines, this integration task never gets completed. Band-aids are introduced that tactically integrate things so that it looks good in a demo. In fact what results from this exercise is a Franken-Monitor – something that looks and works like it was assembled after the fact – instead of a true solution designed from the ground up to solve a problem.

A New Monitoring Product for Every New Thing that is Invented is a Dead Model

It is one thing to buy new monitoring products because virtualization and ITaaS introduce new requirements that legacy solutions cannot meet. However, when doing so you should avoid falling back into the very trap you are trying to escape in the physical world. Specifically, you should avoid the situation where, every time a vendor introduces a new element that needs to be monitored, you have to buy a new monitoring product to monitor it. You should organize your environment into horizontal layers, and then get a product from a vendor who has a proven track record and strategy for staying on top of that layer. For example, Zenoss does a great job with the network and server layer of the infrastructure, and has open-sourced the data acquisition part of its product. So if you buy something new and want to feed data from it into Zenoss, you do not have to wait for Zenoss to do a product release to support it. NetApp/Akorri BalancePoint can provide infrastructure response time metrics across virtual and physical infrastructures, from the servers that house the HBAs to the spindles in the arrays. Virtual Instruments can monitor the round trip response times of transactions flowing through the SAN switch, irrespective of what servers and storage are on either side of the SAN. Xangati can monitor your virtual and physical IP network from end to end. New Relic supports Java, .NET, PHP, and Ruby at the applications layer, and dynaTrace supports Java, .NET and even applications written in C++, proving that APM solutions are expanding in breadth while retaining the ability to do deep diagnostics. Coradiant supports every web-based application automatically, across physical, virtual, and cloud environments. BlueStripe is distinguished in that it is the only vendor that can provide end-to-end and hop-by-hop response time information for every TCP/IP application in your environment without any manual configuration (everything is automatically discovered).

Applications Monitoring is becoming more Valuable

Infrastructure monitoring (see below) is and will remain important. However since the business functionality is delivered by the applications, ensuring acceptable transaction performance and end user experience is becoming extremely important. This is especially true in dynamic, virtualized, and cloud based environments where the application is much more abstracted from the infrastructure than in the previous dedicated physical deployment. It is also clear that the old practice of trying to infer the performance of an application by looking at resource utilization metrics no longer works in this new world – putting increasing emphasis and value on approaches by vendors like dynaTrace, AppDynamics, and New Relic that actually understand what is happening with the application itself.

Infrastructure Monitoring is becoming less Valuable, but no less Important

It is not that monitoring a virtual or cloud based infrastructure is of no value. It is just that vendors like Quest (Vizioncore vFoglight), and Veeam (Monitor) have done a great job of providing basic VMware infrastructure monitoring with products that are so much easier to use, implement and afford than prior legacy approaches that the market price for these types of solutions has been reset by the success of these vendors. The presence of open source vendors like Zenoss and vendors who focus on easy to purchase software at affordable price points like Solarwinds will continue to accelerate this trend. CA bought Nimsoft specifically to address this opportunity, and is again the only one of the big four to have taken an appropriate product action to address these trends.

Virtualization and the Cloud Demand an Understanding of End-to-End Infrastructure Latency

The number one thing that the manager of a virtual infrastructure needs is an understanding of how long the infrastructure is taking to respond to requests for work from workloads (applications). The dynamic and shared nature of these modern infrastructures makes it impossible to infer the performance of the infrastructure from resource utilization statistics. A new category of vendors has emerged that focuses upon measuring infrastructure latency (also called infrastructure response time); it includes Akorri (acquired by NetApp), Virtual Instruments, Xangati, and CA (through the acquisition of NetQoS). However, although each of these four vendors measures a critical piece of that end-to-end latency, no vendor has yet emerged that can measure it from end to end. So there is a substantial amount of work to do, and a great opportunity for someone to step up and solve this problem.

Monitoring Infrastructure at Scale with Frequency and Low Cost is Critical

Imagine an environment with 40,000 servers supporting 500,000 virtual machines, which is itself supported by a set of physical switches and storage arrays. Such a system could very easily have over 1 million elements (VMs, servers, switch ports, router ports, SAN ports, and storage arrays) to manage. This environment is also highly dynamic, meaning that due to the rate of change, monitoring needs to be done much more frequently than once every 1 to 15 minutes as has been the norm in the past. Furthermore, due to the cost pressures in such environments, the required solution must be affordable to purchase, easy to implement, and inexpensive to operate. A good example of a solution that is built from the ground up to scale along with the infrastructure is SevOne, which uses a coordinated grid of management appliances that linearly increases the compute and storage power of the management solution as the managed environment expands in size.

Monitoring is Becoming a “Big Data” Problem

The environment described above, if monitored every 15 seconds, will produce 1 million writes to a database every 15 seconds, or roughly 66,667 writes per second. This is outside of what many products can handle in a single RDBMS, which has led to sharding of databases in these environments. However, sharding carries with it its own issues, which has led some innovative companies like Cloudkick and Evident Software to build their products on Cassandra, a NoSQL data store that is optimized for high-write environments. If it turns out that the scale and the frequency of data collection rise to these kinds of levels, it will force a re-architecture of most of the monitoring solutions on the market today.
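The arithmetic behind that write rate is easy to reproduce for other environment sizes and polling intervals. The sketch below assumes one stored sample per element per polling interval; the function name and the `metrics_per_element` parameter are illustrative and not taken from any product.

```python
def writes_per_second(elements: int, interval_seconds: float,
                      metrics_per_element: int = 1) -> float:
    """Sustained database write rate if every element reports each interval.

    Assumes one write per metric per element per interval (an assumption;
    real products may batch or aggregate samples before writing).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return elements * metrics_per_element / interval_seconds


if __name__ == "__main__":
    # Compare the traditional 15-minute and 1-minute intervals with 15 seconds
    # for the 1-million-element environment described above.
    for interval in (900, 60, 15):
        rate = writes_per_second(1_000_000, interval)
        print(f"{interval:>4}s interval: {rate:,.0f} writes/sec")
```

Moving from a 15-minute interval to a 15-second interval multiplies the write rate by 60, which is why the collection frequency matters as much as the element count.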

Low Cost of Ownership is Becoming of Paramount Importance

Amazon and Google have proven that it is possible to have highly scaled out environments with server-to-admin ratios once thought unattainable by enterprise IT managers. While Google and Amazon have the good fortune to have many nearly identical servers, the pressure to increase server-to-admin ratios will only increase over time. This pressure will manifest itself in the monitoring industry in the form of requirements for products to be easy for admins to learn and use, so that they can be productively used to manage large numbers of physical and virtual servers. This is yet another reason why the legacy frameworks will not make the leap into this new environment.

Try Before You Buy is Driving the Purchase Process

In the vast majority of cases today, when you buy a monitoring solution you will be able to (and should insist upon being able to) try the product in production in your environment before committing to a full-scale purchase. In most cases you will be able to get a free download that addresses a subset of your environment. In some cases you may need to buy a pilot installation before committing to an ELA. But in no case should you buy a monitoring solution unless the product has been proven to meet your needs in your environment. In the case of products that are suites that meet the needs of multiple constituents, you should insist upon the right to try each module that is part of the package to determine that it is in fact best of class for the problem it targets. Veeam, Quest, Solarwinds, AppDynamics, BlueStripe, dynaTrace, VKernel, NetApp, New Relic, and VMTurbo all offer some sort of free version or trial version of their products via download.

Early Adopters of Breakthrough Technology Still Demand Vendor Partnering

Some products simply contain breakthrough technology that in some cases solves a problem that most customers do not even know they have yet. These kinds of products are well suited to early adopter customers in sophisticated industries like financial services and high technology. However, a free download of products like this is not enough to convince these kinds of customers to trust a startup to solve a hard problem with breakthrough technology. Vendors with solutions like this will have to partner with these customers, and prove that they will be responsive to these customers in their development priorities, in order to win this kind of business.

IT as a Service Initiatives will Require Cloud Scale Monitoring Solutions

ITaaS will cause enterprises to have clouds that serve multiple internal constituents, with dynamic and unpredictable workloads across these constituents. This will create a cloud-scale and cloud-complexity monitoring problem for enterprises pursuing ITaaS, and will require monitoring solutions built for this use case. Products that handle large environments, frequent data collection intervals, that self-configure as the environment grows and changes, and that are inherently multi-tenant will be required to address these issues.

The Cloud Breaks Infrastructure Monitoring as We Know It

It is easy for a cloud vendor to buy a product that collects resource utilization statistics about the physical infrastructure that supports their cloud. It is even possible to buy products that show each of the cloud vendors’ customers how much of their allocated virtual resources they are using. But this is not very helpful information – since the question of true resource constraints at the physical level is not being exposed to the cloud customer. Given today’s infrastructure, it is not clear that it is even possible to know how the capacity of a physical server, or switch is being sliced up between the various customers of the cloud running in virtualized environments on this infrastructure. For this reason we need to focus upon infrastructure latency or infrastructure response time as discussed above. Clouds will not be effective platforms for performance critical applications until cloud vendors can provide accurate end-to-end infrastructure latency information on a multi-tenant basis to their customers.

The Infrastructure will have to become Self-Instrumenting

To address the issue raised above, much of the physical infrastructure (routers, networks, and servers) will have to be reinvented to become self-instrumenting and to provide end-to-end latency on a multi-tenant basis as discussed above. A switch will have to know to which customer a stream of packets belongs, and be able to report back how long it took to process that stream on behalf of the customer. The same is true of storage arrays, SANs, routers and servers.
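To make the idea concrete, here is a minimal sketch of what per-tenant latency reporting could look like if each infrastructure element tagged its measurements with a customer identifier. Everything in it is hypothetical: the `TenantLatencyRecorder` class, the element names, and the simplification of estimating end-to-end latency as the sum of mean per-element latencies along a path. No real device exposes this interface.

```python
from collections import defaultdict


class TenantLatencyRecorder:
    """Collects latency samples keyed by (tenant, infrastructure element)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def record(self, tenant_id: str, element: str, latency_ms: float) -> None:
        # Called by a (hypothetical) self-instrumenting switch, array, or server.
        self._samples[(tenant_id, element)].append(latency_ms)

    def mean_latency_ms(self, tenant_id: str, element: str) -> float:
        samples = self._samples.get((tenant_id, element))
        if not samples:
            raise KeyError(f"no samples for {tenant_id!r} on {element!r}")
        return sum(samples) / len(samples)

    def end_to_end_ms(self, tenant_id: str, path: list) -> float:
        # Simplification: add up the mean latency at each hop on the path.
        return sum(self.mean_latency_ms(tenant_id, element) for element in path)


if __name__ == "__main__":
    recorder = TenantLatencyRecorder()
    recorder.record("acme", "switch-1", 0.2)
    recorder.record("acme", "switch-1", 0.4)
    recorder.record("acme", "array-1", 5.0)
    recorder.record("globex", "switch-1", 3.0)  # another tenant, same switch
    print(recorder.end_to_end_ms("acme", ["switch-1", "array-1"]))
```

The point of the sketch is the key: latency is attributed to a tenant at every hop, so one customer's view never includes another customer's traffic on the same shared device.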

The Relationship between Infrastructure Monitoring and Application Monitoring has to be Fixed to allow Cloud Adoption

In a physical environment, an APM solution will often be able to collect resource utilization metrics from the physical infrastructure and use time-based correlation to point out which resource utilization anomalies might be the cause of an application's response time issue. This is not possible in public clouds today, since the underlying physical resource utilization metrics are not available to APM products running in public clouds. To fix this, multi-tenant end-to-end infrastructure latency (as discussed above) will have to be combined with APM products running in cloud hosted applications in order to allow monitoring solutions to actually provide performance assurance.
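As an illustration of the time-based correlation described above, the following sketch ranks infrastructure metrics by how strongly they move with application response time over the same sample window. It uses a plain Pearson correlation, which is only one of many possible techniques and is not claimed to be what any APM product actually does; the function names and sample data are made up for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is flat)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_suspects(response_ms, metrics):
    """Sort metric names by absolute correlation with response time, highest first."""
    scores = {name: pearson(response_ms, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Samples taken at the same timestamps (illustrative numbers only).
    response_ms = [100, 100, 100, 400, 420, 100]
    metrics = {
        "cpu_percent": [30, 32, 31, 30, 33, 31],
        "disk_latency_ms": [5, 5, 6, 40, 42, 5],
    }
    for name, score in rank_suspects(response_ms, metrics):
        print(f"{name}: {score:+.2f}")
```

Correlation only suggests where to look; it does not prove cause, and it depends on having the infrastructure series at all, which is exactly what a public cloud does not expose.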

The Experience of the End User will become of Paramount Importance

Both virtualization and the cloud make it far more difficult to infer either applications performance or end user experience from granular metrics collected in the infrastructure. Therefore, in order for virtualization and public clouds to succeed, the solutions that monitor them will have to evolve to be able to directly measure end user experience for every application running in these dynamic environments. Mobile applications that are not browser based, that do not use HTTP, and that run on devices that are “all about the experience” will exacerbate this trend. These requirements will likely usher in an entirely new set of monitoring vendors who are focused upon this problem.

Monitoring as a Service will be a Force to be Reckoned With

The rate of change in dynamic infrastructures and the rate of change in applications (due to Agile Development techniques) mean that monitoring vendors will have to become much more agile in the development and release of new functionality than they have been in the past. This tips the debate towards Monitoring as a Service (MaaS), where just the data collection agents live at the customer site and there is only one back end for every customer of the MaaS vendor, one managed by the MaaS vendor. This avoids all of the complexity of releasing and shipping software, and of supporting multiple versions on a backward-compatible basis. Leading edge MaaS vendors include New Relic (by far the market leader), AppDynamics, AppFirst, and LogicMonitor. Cloudkick also fell into this category and has been acquired by Rackspace.

Horizontally Layered and Integrated Products are the Future

How do you tackle this problem? Start by identifying each layer of your environment (storage, SAN, LAN, WAN, physical server, virtualization platform, ITaaS platform, OS, middleware and application) and then pick a solution that can scale out and address one or more layers. Pick these solutions from the perspective of their ability to address ALL of at least one layer, and their ability to integrate with adjacent layers. By all means move away from managing by silo, where one product is used for one application and all of its supporting virtual and physical infrastructure. Note that this is going to be hard, as we are at the dawn of a new age and many products have a long way to go to catch up. CA Virtual Assurance, SevOne, Zenoss, Xangati, Coradiant, New Relic, AppDynamics, and BlueStripe all do a good job of monitoring their respective layers.
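One way to keep this exercise honest is to write the layer map down and check it mechanically. The sketch below takes the layers named above and a mapping from product to the layers it fully covers, then reports gaps and overlaps. The product names (`tool-a` and so on) and their coverage are placeholders, not claims about any real vendor.

```python
# Layers as listed in the text above.
LAYERS = [
    "storage", "SAN", "LAN", "WAN", "physical server",
    "virtualization platform", "ITaaS platform", "OS",
    "middleware", "application",
]


def uncovered_layers(coverage):
    """Return the layers (in LAYERS order) that no product covers."""
    covered = set()
    for layers in coverage.values():
        covered |= set(layers)
    return [layer for layer in LAYERS if layer not in covered]


def overlapping_layers(coverage):
    """Return {layer: [products]} for layers covered by more than one product."""
    owners = {}
    for product, layers in coverage.items():
        for layer in layers:
            owners.setdefault(layer, []).append(product)
    return {layer: sorted(products)
            for layer, products in owners.items() if len(products) > 1}


if __name__ == "__main__":
    # Hypothetical inventory: product name -> layers it covers completely.
    coverage = {
        "tool-a": {"storage", "SAN"},
        "tool-b": {"LAN", "WAN", "physical server"},
        "tool-c": {"physical server", "virtualization platform", "OS"},
    }
    print("Gaps:", uncovered_layers(coverage))
    print("Overlaps:", overlapping_layers(coverage))
```

The same table doubles as the inventory of what is already licensed: every overlap is a candidate for consolidation, and every gap is a purchase to evaluate.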

Self-Learning Analytics are Required to be able to Keep up with Complexity and Rate of Change

You are going to have more than one monitoring product in your environment. If you start over when you start to buy products for your virtualized or ITaaS environment, you will be lucky if you end up with fewer than 20 (much better than 200). However no matter how well you pick, these products will not be integrated in a manner that allows for deterministic root cause analysis through all of the layers (an application is slow, now where in the storage array is that problem?). To solve this problem you need a real-time self-learning performance analysis layer on top of everything else you have. VMware bought Integrien for this reason – and the independent market leader in this space is Netuitive.
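To give a feel for what "self-learning" means at its simplest, the sketch below learns a running baseline (mean and standard deviation) for a single metric and flags values that fall far outside it, with no manually set threshold in the metric's own units. This is a deliberately minimal illustration using Welford's online algorithm; it is not a description of how Integrien, Netuitive, or any other product works, and the class name, warm-up length, and sigma threshold are all assumptions.

```python
from math import sqrt


class LearnedBaseline:
    """Online mean/variance baseline for one metric (Welford's algorithm)."""

    def __init__(self, threshold_sigmas: float = 3.0, warmup: int = 10):
        self.threshold_sigmas = threshold_sigmas
        self.warmup = warmup      # samples to learn from before flagging anything
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0            # running sum of squared deviations from the mean

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous versus the baseline, then learn it."""
        anomalous = False
        if self.count >= self.warmup and self.count > 1:
            std = sqrt(self._m2 / (self.count - 1))
            anomalous = std > 0 and abs(value - self.mean) > self.threshold_sigmas * std
        # Welford update: fold the new value into the baseline.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        return anomalous


if __name__ == "__main__":
    baseline = LearnedBaseline()
    for sample in [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 100, 250]:
        if baseline.observe(sample):
            print(f"anomaly: {sample}")
```

A real analytics layer would also have to handle seasonality, relate anomalies across metrics from different products, and avoid learning from the anomalies themselves, none of which this sketch attempts.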

Service Assurance is the End Game

The ultimate goal of availability and performance monitoring is to ensure that the applications running on the environment deliver an adequate level of performance (response time) to the end users of the application. In the physical world, there was little that could be done to adjust the infrastructure in real time to ensure adequate performance, so the infrastructure was massively over-provisioned to keep capacity from being an issue. In a virtual environment, it is both possible and highly desirable to achieve higher levels of resource utilization than was the norm in the physical world, but not at the expense of applications performance. Service Assurance is about measuring applications performance (response time) and then automatically adjusting the resources for the VMs that comprise the most important applications, and their placement, to guarantee those response times. There is no single solution on the market that meets these needs today. There are products that measure response time (AppDynamics, dynaTrace, New Relic, Coradiant, BlueStripe), and there are products that can adjust the composition of the infrastructure (NetApp/Akorri BalancePoint, VMTurbo, Platform Computing, DynamicOps, and Embotics), but the right combination has not been made yet.
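The decision half of such a loop can be sketched in a few lines: given measured response times, targets, and a pool of spare capacity, grant extra resources to the applications that are missing their targets, most important first. The sketch is purely illustrative. The `AppState` fields, the convention that a lower `priority` number means a more important application, and the one-vCPU step are all assumptions, and it only produces a plan; actually resizing or moving VMs would require a real platform API that is not shown.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    name: str
    priority: int        # assumption: lower number = more important
    response_ms: float   # measured response time
    target_ms: float     # response time the business requires
    vcpus: int           # current allocation


def plan_adjustments(apps, spare_vcpus: int, step: int = 1) -> dict:
    """Return {app name: new vCPU count} for apps missing their target.

    Spare capacity is handed out in priority order until it runs out.
    """
    plan = {}
    for app in sorted(apps, key=lambda a: a.priority):
        if spare_vcpus < step:
            break
        if app.response_ms > app.target_ms:
            plan[app.name] = app.vcpus + step
            spare_vcpus -= step
    return plan


if __name__ == "__main__":
    apps = [
        AppState("billing", priority=1, response_ms=900, target_ms=500, vcpus=4),
        AppState("wiki", priority=3, response_ms=1200, target_ms=800, vcpus=2),
        AppState("orders", priority=2, response_ms=300, target_ms=500, vcpus=4),
    ]
    print(plan_adjustments(apps, spare_vcpus=1))
```

With only one spare vCPU, the plan favors the highest-priority application that is out of compliance, which is the behavior the paragraph above describes: resources follow response time, not utilization.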


The right approach to monitoring a virtual or cloud based environment is to start with a clean sheet of paper, determine your requirements, and assemble a horizontally layered solution out of best of class vendor solutions that address each layer. Vendors should be evaluated on their mastery of one or more layers, their ability to keep up with the change in that layer, and their ability to integrate with adjacent layers.