Real-Time Monitoring: Almost Always a Lie

In Understanding the Value of Unique Management Data, we pointed out that tools that collect unique data about the performance of infrastructure and applications are more likely to provide the value you want than tools that rely solely on commodity data. In this post, we expose the most frequent marketing lie in the management software industry.

Why “Real-Time Monitoring” Is Almost Always a Lie

The Free Dictionary gives one definition of real-time as, “Of or relating to computer systems that update information at the same rate as they receive data.” This is a particularly useful definition as it pertains to the phrase “real-time monitoring.” Let’s take the most common metric of all, CPU utilization from a Windows server, and look under the hood at what is really happening:

  • A Windows server contains its own management agent (which means that when it comes to monitoring server metrics from a Windows server, there is no such thing as “agentless monitoring”). That agent looks at the clock rate of the processor (say 4 GHz), and if, over a one-second period, the processor is doing 2 GHz worth of work, the CPU utilization is 50%.
  • Therefore, at the point where the individual metric is collected, the process is not real-time. It is based on samples of data taken at one-second intervals. So, any company that claims a product is a “real-time performance monitor” but uses CPU metrics from Windows servers as examples is lying even before we get to how its product works, because the data at its source is not being made available in real-time.
  • Now, the architecture of the particular monitoring product comes into play. Almost every monitoring solution samples the data itself. In the example above, there are few, if any, monitoring solutions that capture every single one-second CPU value for every single core of every single CPU of every single server. To do so would be to produce a flood of worthless management data. Instead, these products all sample this data. Modern solutions most commonly send data like this up to their management systems every 10 seconds; many products work at the 20-second or one-minute level. By way of example, VMware vSphere collects its core metrics every 20 seconds and then averages fifteen of those 20-second data points to give you a five-minute rollup (see the sketch after this list).
  • The processing done by the monitoring solution also comes into play. If you read the Unique Management Data post, you will recall that a lot of value is added through the way vendors process data to produce useful information. Well, that processing takes time. For example, Splunk (properly configured) can consume and index data as fast as you can send it, but it still takes a couple of seconds for the most recently arrived data to become available in a query, which means it takes a couple of seconds before it can be displayed on a dashboard.
  • The update frequency of the monitoring solution’s dashboard comes into play. Splunk is probably as close to a real-time system for collecting and displaying management data as you can get, but even with Splunk, the dashboard re-runs its queries every couple of seconds against data that is itself subject to delays in arriving.
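
By way of illustration, here is a minimal Python sketch (not any vendor’s actual code) of the layered averaging described in the list above: per-second samples at the source, a 20-second collection interval, and a vSphere-style rollup that averages fifteen 20-second points into one five-minute value. Notice how a one-second spike can all but disappear by the time it reaches the rollup.

    # Minimal sketch of layered averaging over a 5-minute window of
    # hypothetical per-second CPU utilization samples.
    import random

    per_second = [random.uniform(0, 100) for _ in range(300)]   # agent-side 1-second samples

    # The monitoring product only ships one value every 20 seconds.
    collected_20s = per_second[::20]                             # 15 values kept, 285 never seen

    # A vSphere-style rollup averages fifteen 20-second points into one 5-minute value.
    five_minute_rollup = sum(collected_20s) / len(collected_20s)

    print(f"peak seen per second : {max(per_second):.1f}%")
    print(f"peak seen at 20s     : {max(collected_20s):.1f}%")
    print(f"5-minute rollup      : {five_minute_rollup:.1f}%")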

What all of this means is that you have to factor in the sampling of the data itself in addition to the collection interval, processing, and display update frequency of the monitoring solution to understand how close to a real-time picture you really are getting. Legacy solutions could easily be showing you data that is an hour or two out of date. The very best modern management solutions that get as close to real time as possible (like Splunk) are still going to be a couple of seconds behind real-time.
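
As a back-of-the-envelope illustration, the sketch below simply adds up the delays described above. The individual numbers are assumptions chosen for illustration, not measurements of any particular product.

    # Rough, worst-case estimate of how stale a "real-time" dashboard value can be.
    # All of these numbers are illustrative assumptions.
    delays_seconds = {
        "agent sampling interval": 1,      # one-second counters at the source
        "collection/upload interval": 10,  # agent ships data every 10 seconds
        "indexing/processing delay": 2,    # time before arriving data is queryable
        "dashboard refresh interval": 5,   # how often the chart re-runs its query
    }

    worst_case_age = sum(delays_seconds.values())
    print(f"Worst-case age of the freshest point on screen: ~{worst_case_age}s")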

How Close to Real-Time Monitoring Should We Get?

This is a complicated question that, by its very nature, is going to lead to a complicated answer. We recommend the following process:

  • Some data is worth having in real-time, and some is not. Getting any data stream in real-time will be expensive, if for no other reason than that you are going to have to invest in the software and hardware to capture and store what will likely amount to a lot of data:
    • Resource utilization data (which is NOT performance data) is in general not worth capturing in real-time. Modern processor architectures, and the degree to which processors are shared and abstracted by virtualization, make resource utilization data important for longer-term capacity planning but essentially useless for real-time performance management.
    • Data about the latency of your infrastructure (which IS real performance data) is definitely worth capturing in as close to real-time as is practical. What is important here is to combine the concept of real-time with the concept of comprehensive: what you want is something that misses nothing of importance and, when it sees something of importance, gets it to you as quickly as possible. Virtual Instruments does this for Fibre Channel-attached storage latency, and ExtraHop does this for network-attached storage latency.
    • Data about the response time (which IS, again, real performance data) of your business-critical applications is also definitely worth capturing in as close to real-time as possible. Again, you want something that sees everything, figures out what is important, and gets it to you with the minimum amount of delay. AppDynamics, AppNeta, Compuware, and New Relic do a great job of this for custom-developed applications. AppEnsure, AppFirst, BlueStripe, Correlsense, and ExtraHop do a great job of this for every application you have in production (both purchased and custom-developed).
  • The more data you collect in near real-time, the more data your monitoring product is going to have to process in near real-time, and the more you will need a big data back end and a scale-out processing architecture like the one that Splunk has. In Replacing Franken-Monitors and Legacy Frameworks with the Splunk Ecosystem, we highlighted how a variety of near real-time data collection sources from different vendors could be combined into one data store with one query mechanism to solve, in near real-time, a breadth of monitoring problems that cannot be solved any other way (a simple sketch of the idea follows this list).
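
As a rough sketch of that “one data store, one query mechanism” idea, the Python fragment below buckets two hypothetical near real-time feeds (infrastructure storage latency and application response time) into shared 10-second windows so they can be queried together. The feeds and field names are illustrative assumptions, not any vendor’s schema.

    # Bucket two near real-time feeds into shared 10-second windows so one
    # lookup can correlate them. Feeds and field names are hypothetical.
    from collections import defaultdict

    def bucket(ts: float, span: int = 10) -> int:
        """Round a UNIX timestamp down to the start of its window."""
        return int(ts // span) * span

    store = defaultdict(dict)   # window start -> {host -> merged metrics}

    def ingest_latency(ts: float, host: str, latency_ms: float) -> None:
        store[bucket(ts)].setdefault(host, {})["storage_latency_ms"] = latency_ms

    def ingest_response(ts: float, host: str, response_ms: float) -> None:
        store[bucket(ts)].setdefault(host, {})["app_response_ms"] = response_ms

    # Both feeds land in the same window, so one lookup answers the question
    # "was slow application response time accompanied by slow storage?"
    ingest_latency(1388448000.4, "db01", 18.0)
    ingest_response(1388448003.9, "db01", 640.0)
    print(store[1388448000]["db01"])
    # {'storage_latency_ms': 18.0, 'app_response_ms': 640.0}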

Summary

If you are hosting rapidly changing, business-critical, performance-critical applications on a virtualized, distributed, or cloud-based infrastructure, you need to move beyond commodity monitoring solutions that lie to you about important things like valuable data and real-time data collection. Understanding the performance of important applications on modern infrastructures requires monitoring that gets as close to real-time as possible: near real-time monitoring of both the application for response time and the infrastructure for latency.

Bernd Harzog

Bernd Harzog is the Analyst at The Virtualization Practice for Performance and Capacity Management and IT as a Service (Private Cloud). Bernd is also the CEO and founder of APM Experts, a company that provides strategic marketing services to vendors in the virtualization performance management and application performance management markets. Prior to these two companies, Bernd was the CEO of RTO Software, the VP of Products at Netuitive, a General Manager at Xcellenet, and Research Director for Systems Software at Gartner Group. Bernd has an MBA in Marketing from the University of Chicago.


5 Responses to Real-Time Monitoring: Almost Always a Lie

  1. December 30, 2013 at 12:32 PM

    At AppNeta we get asked a lot if our data is “real time”, and this is an excellent writeup of why that’s not the right question to ask.

    However, one minor point: our server side monitoring agent (TraceView) can also work with vendor applications (e.g. Oracle, SAP, Atlassian, etc.) as long as they are running on servers you control. Even if they aren’t, though, there’s the option of AppView Web synthetic monitoring for cloud-hosted web apps like Salesforce, Outlook 365, and Google Apps.

  2. GP
    January 7, 2014 at 4:03 PM

    Bernd – this is a great post, and thanks for highlighting an increasingly important and critical issue around IT operations. The ability to perform monitoring and logging in a timely, accurate, and cost-effective manner is essential for enterprise systems. I wonder if you have looked at some of the new-generation architectures put out by Netflix and LinkedIn?

    There is a great post from the LinkedIn team on Apache Kafka and a nice architectural background on a unified logging framework.

    Netflix Suro was just released as well.

    There is a slew of new products coming out around real-time stream processing, including Apache Storm, Samza, and Spark. These tools, coupled with collectors like Apache Flume, allow the rapid filtering of important and significant signals (through Interceptors).
