In Understanding the Value of Unique Management Data, we explored the difference between unique data and commodity data as it pertains to the value of a monitoring solution. In Real-Time Monitoring: Almost Always a Lie, we explored the difference between real-time data collection and near real-time processing of non real-time data. In this post, we take a step back and explore what data we need and how we need to collect it to manage the software-defined data center (SDDC) and the cloud.
The History of Management Data Collection
If you look at how most management data has been collected in the past and is still being collected, it falls roughly into two buckets:
- Infrastructure data about how resources are being consumed by software and hardware infrastructures is largely collected by agents built into those infrastructures. For example, the WMI agent built into the Windows operating system collects data from Windows Servers; the SNMP and NetFlow agents built into routers and switches collect data from those routers and switches; and SMI-S agents built into storage devices collect data from those storage devices. All of these collectors make data available in various intervals ranging from a couple of seconds to tens of minutes via standard APIs.
- True application performance management solutions (products that monitor applications themselves and not the infrastructure for the applications) embed agents in the run-time for the application or the OS to collect true response time and throughput data about the application system. This data is never available via any standard interface; therefore, any product that collects data only from standard interfaces is not really an application performance management product.
These methods of collecting data were perfectly fine when applications changed once a year, when there was a one-to-one mapping among application run-times, operating systems, physical servers, network ports, and physical disks, and when all of these were running in one data center. However, the new environment consists of rapidly arriving and rapidly changing applications, built in diverse languages, running in application run-times abstracted from the operating systems, running in operating systems abstracted from the hardware, running on networks abstracted from the physical switches, and running on storage abstracted from the actual physical disks. The combination of rapidly changing applications with dynamic, abstracted, and distributed infrastructure is depicted below.
How to Collect Management Data for the SDDC and the Cloud
In order to address how we manage the software-defined data center, we have to rethink how we collect the data with which we manage it and the cloud. Much of the data that we collect today is either not the right data, not collected the right way, not stored the right way, or not analyzed the right way. The wrong data, collected and stored the wrong way and fed into legacy analytics, leads to the worst of all possible garbage in, garbage out scenarios. To be clear, commodity resource utilization data collected at five-minute intervals from standard management interfaces and stored in many different databases results in a Franken-Monitor. You can read more about Franken-Monitors in Beware of the Franken-Monitor and Getting Rid of Your Franken-Monitor.
In order to get data collection right for the SDDC and the cloud, we need to keep the following in mind when collecting data:
- Relevance to task at hand: “Is it relevant to the task at hand?” is one of the most important questions to ask when thinking about your monitoring strategy. Other important questions include “What problems am I trying to solve or prevent?” and, therefore, “What data do I need?” In some cases, you can figure out what data you need for specific problems. In other cases, it is impossible to know what data you are going to need, and therefore you need to collect as much data as you can, as frequently as you can.
- Nearness to real-time: In Real-Time Monitoring: Almost Always a Lie, we explained that any product that relies on data collected five minutes after the occurrence of the problem you are interested in is not actually real-time. We also explained that real-time monitoring in the computer science sense of the term is almost impossible, as it would produce so much overhead that it would probably cause the very problems that you are trying to prevent via monitoring. Therefore, when it comes to collecting monitoring data, “real-time” is a relative concept. Real-time is relative to how much time elapses between the time that the problem you are interested in starts and when the monitoring system gives you the alert that helps you start fixing the problem. For the SDDC and the cloud, data collection has to get much closer to real-time than it is now, because things are changing so quickly in those areas that you need to collect data more frequently in order to catch things before they get out of hand.
- Deterministic: One thing that is really wrong about much of the management data collected today is that much of it is based on statistical estimates of the actual values or on averages of multiple data points rolled up to less frequent measurement intervals. For example, most of the VMware vSphere management data is collected every twenty seconds (not near real-time enough), and then fifteen of these twenty-second measurements are rolled up into an average every five minutes. Therefore, the number you get every five minutes is based on an averaging process that obscures the anomalies in the data. In many cases, it is, in fact, the anomalies in the data that we are interested in.
- Comprehensiveness: The way in which management data is collected today for the most part offers no guarantees whatsoever that the specific piece of data you want is actually going to get collected. If collection is to be comprehensive, it should, without creating the problem that monitoring is in place to prevent, collect as close to everything as possible. If that is not practical, data collected must include the exceptions and the anomalies, not just the averages.
- A shared data store across vendors: When each vendor puts its data into its own data store, the result is a Franken-Monitor. It will only be possible to manage the SDDC and the cloud with a data store that contains all of the data collected by all of the management software vendors. Each vendor contributes its data to this data store and then benefits from the existence of the rest of the data in the data store. Splunk is the first (but not the last) commercially viable example of such a data store; the combination of Splunk’s offerings and those of its partners is described in Replacing Franken-Monitors and Frameworks with the Splunk Ecosystem.
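To make the point about averaging concrete, here is a minimal sketch in Python of the rollup described in the "Deterministic" item above: fifteen twenty-second samples collapsed into one five-minute number. The sample values and the `rollup` helper are hypothetical and chosen only for illustration; this is not how any particular product implements its rollups.

```python
from statistics import mean

SAMPLE_INTERVAL_S = 20
SAMPLES_PER_ROLLUP = 15  # 15 x 20 s = 300 s = 5 minutes


def rollup(samples):
    """Summarize one five-minute window of twenty-second samples.

    Returning the maximum and minimum next to the average keeps the
    anomaly visible after the rollup.
    """
    if len(samples) != SAMPLES_PER_ROLLUP:
        raise ValueError("expected exactly one five-minute window of samples")
    return {"avg": mean(samples), "max": max(samples), "min": min(samples)}


# Hypothetical utilization percentages: fourteen quiet samples and one spike.
samples = [10.0] * 14 + [100.0]
summary = rollup(samples)
print(summary)  # the average is 16.0, while the spike reached 100.0
```

A consumer that sees only the average would read this window as 16 percent and never learn that one twenty-second interval was saturated. Keeping the maximum alongside the average is one inexpensive way to make sure the exceptions survive the rollup.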
Metrics That Should Comprise the Management Data for the SDDC and the Cloud
- Response time and throughput for application performance: In static and physical environments, you can infer application performance from resource utilization metrics. In rapidly changing, abstracted, and shared environments, you can no longer make this inference. Therefore, the only way to know the performance of an application is to measure its response time across its tiers. This needs to be a measurement of the actual response times of the actual application, not a synthetic transaction. Any product that claims to be an APM solution but does not actually measure application response time and throughput is not actually an APM solution.
- Latency and throughput for infrastructure performance: Now that the infrastructure consists of physical storage, virtual storage, physical and virtual networks, and physical and virtual servers, it is no longer possible to infer the performance of the infrastructure from its utilization of its resources, either. In order to understand the performance of an infrastructure, you have to measure how long the infrastructure takes to do what you are asking it to do (latency) against how much work you are asking it to do (throughput, or in the case of storage, IOPS). Note that this is a particular challenge for public clouds, because if you are a customer of a public cloud, you cannot measure this yourself, and most public cloud vendors are not sharing this data with their customers.
- Resource consumption data: This is the data that has been with us the longest. It is also the data that is most frequently misused. Too many people still try to infer application and infrastructure performance from this data. As mentioned above, the right approach is to measure application response time and infrastructure latency, and then look at this data only as a part of the diagnostics and root cause process.
- Application topology and application dependency maps: If you or your team members get called into war room meetings and get yelled at because “the application is slow,” you had better get two pieces of information in a hurry. One is the application’s actual measured response time and throughput (see above), and the other is the application’s topology map with information on how the topology of that application maps to (is dependent on) the underlying infrastructure.
- Infrastructure topology: Again, since we are now dealing with an entire virtual infrastructure layered on top of the physical infrastructure, it is critical to understand how both the virtual and physical infrastructure are linked together, both at any moment in time, and in support of any set of applications. This is again an area that is a real problem for public clouds, as the cloud vendor is not about to tell its customers what the underlying infrastructure topology is for a particular customer environment.
- Configuration state: In a software-defined data center, and for that matter in a public cloud like Amazon, all of the resources that you need can be defined in software via software APIs. Some of the resources are actually implemented in software (in a virtual network, the software network moves the East-West bits between servers). This makes it much easier to make configuration changes quickly, which opens the door to making mistakes on a massive scale at the speed of light. This requires that the infrastructure publish configuration changes as they occur, and that these changes be stored in a near real-time CMDB, something which, for the most part, does not yet exist.
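The distinction between latency and throughput in the list above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a monitoring product: `measure` and `summarize` are hypothetical helper names, and `operation` stands in for whatever unit of work you are asking the infrastructure to do (a storage read, a network round trip, and so on).

```python
import time
from statistics import median


def measure(operation, count):
    """Run `operation` `count` times.

    Returns the list of per-call latencies (seconds) and the total
    elapsed wall-clock time (seconds).
    """
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        operation()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, elapsed


def summarize(latencies, elapsed):
    """Report how long each request took (latency) against how much
    work was completed per second (throughput)."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return {
        "throughput_ops_per_s": len(latencies) / elapsed,
        "latency_median_s": median(latencies),
        "latency_max_s": max(latencies),
    }


# Hypothetical stand-in for an infrastructure request.
latencies, elapsed = measure(lambda: sum(range(1000)), count=50)
print(summarize(latencies, elapsed))
```

Note that the summary reports the maximum latency as well as the median, for the same reason given earlier: the outliers are often the interesting part.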
Techniques for Collecting the Management Data for the SDDC and the Cloud
So, once we know what the requirements for collections are, exactly how do we collect this data, and what products collect this data in the correct way for us? There are several methods by which one can collect uniquely valuable management data:
- From the network via taps, SPAN ports, or mirror ports: This method works both for TCP/IP networks via the physical SPAN ports on IP switches (from vendors like ExtraHop, Gigamon, and Riverbed), and for Fibre Channel SAN networks if you put a TAP on the SAN courtesy of Virtual Instruments. ExtraHop also has a virtual appliance that sits on the virtual mirror port of the VMware and Microsoft vSwitches. The virtue of this approach is that it is truly “outside-in” in that the collection of the data is in fact a copy of the data, and therefore the collection process does not impose any of the risks of being inline in the data path itself. Of course, in the case of a virtual appliance, that virtual appliance is now competing for CPU resources along with the workloads running on that physical server.
- From agents inside the application run-time: If you have a custom-developed application and you want to understand where in your code a problem in response time or throughput exists, then the one and only way to understand this is to put an agent into the application run-time. In the world of Java, this is called bytecode instrumentation; a roughly analogous method exists for .NET. Similar methods now exist for Ruby, PHP, Python, Node.js, and Scala. The vendors to consider here are AppDynamics, AppNeta, Compuware, Dell (Foglight), Riverbed (the AppExpert product line that came from OPNET), and the SaaS Site24x7 offering from ManageEngine. This method of collecting data is unique in its depth and richness. The only downside to it is that it is only useful for custom-developed applications written in a specific set of supported languages.
- From agents inside the operating system: Hundreds of vendors have agents that live inside of operating systems. Ninety-nine percent of these agents do nothing more than collect the commodity metrics that the operating system itself collects. What is special are the agents that discover the names of the applications running on the servers, discover the processes that comprise those applications, discover the network topology between those processes, measure response time and throughput between those processes and, therefore, measure response time and throughput across an entire topology map. The great news about this approach to data collection is that it works across all applications (custom developed and purchased) and is particularly useful to operations teams trying to support complex mixtures of applications. This approach to data collection is used by AppEnsure, AppFirst, BlueStripe, Boundary, and Correlsense.
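The idea of measuring response time from inside the application run-time can be illustrated with a small Python sketch. This is far simpler than the bytecode instrumentation the commercial agents perform, and it is not how any of the vendors named above implement their products; it only shows the principle of wrapping application code so that every call reports its own response time. The `timed` decorator, the `response_times` store, and `handle_request` are all hypothetical names.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: function name -> list of response times (s).
response_times = defaultdict(list)


def timed(func):
    """Record the wall-clock response time of every call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Recorded even if the call raises, so failures are timed too.
            response_times[func.__name__].append(time.perf_counter() - start)
    return wrapper


@timed
def handle_request(n):
    """Stand-in for a piece of custom application code."""
    return sum(range(n))


handle_request(1000)
handle_request(10)
print(len(response_times["handle_request"]))  # two calls recorded
```

Because the measurement lives in the same process as the code, it can attribute time to individual functions, which is the depth the run-time approach offers. The trade-off is the one noted above: the code has to be yours, and it has to be in a language the agent supports.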
Special Challenges in Public Cloud Computing
If you try to follow the approaches outlined in this post to monitor your environment at Amazon, you are going to be able to monitor your applications and operating systems in a manner consistent with the approaches. However, you are not going to be able to monitor the Amazon infrastructure in the manner recommended in this post, because Amazon will not let you have the necessary access to its infrastructure in the required manner. This constitutes the biggest barrier to the use of the cloud for performance-critical enterprise applications.
We are not going to be able to manage the SDDC and the cloud in the same way that we managed physical data centers. Therefore, we need to rethink management data collection for the SDDC and the cloud.