Hundreds of companies and products monitor and manage various elements of your data center and your clouds. But most of these products rely on commonly available management data that is accessed via either industry-standard APIs or management APIs provided by various vendors. A few products do the extra work to collect unique data, and these products will be the focus of this article.
Commodity Management Data
Commodity management data comes in many forms and from many sources. Here are some examples of the most common types of commodity management data:
- Resource utilization data from Windows servers and desktops: accessed via WMI (Windows Management Instrumentation) or the lower-level performance counters exposed through Perfmon
- Network utilization data from the Simple Network Management Protocol (SNMP): generally available from network devices like routers and switches, but also available from some operating systems
- Storage utilization data from the Storage Management Initiative Specification (SMI-S): generally available from storage arrays
- Resource utilization data from the VMware vSphere virtualization platform: available from the vSphere API
- Resource utilization data from Amazon Web Services (AWS): available via the CloudWatch APIs
Several points need to be made about selecting tools that draw on these commodity data sources:
- When you pick a tool, you want that tool to provide value to you.
- You want to pick the best tool, which often means the one that provides the most value to you.
- If all a tool does is use the same data that everyone else uses, then it cannot use unique data as a source of value.
- That does not mean that all tools that just use commodity data are bad. Tools can create great value through how they process the data and through the analytics they apply to the data. For example, CloudPhysics, Cirba, Netuitive, Prelert, VMTurbo, and VMware vCenter Operations rely on just commodity data. But they apply such valuable and unique analytics to the data that they are able to solve problems with their tools that other vendors cannot.
- Virtualization and cloud computing are changing the role of this commodity data. In a purely physical environment with a one-to-one mapping between applications, operating systems, and hardware, you can infer a great deal about the performance of an application from resource utilization data. Once you time slice, abstract, and share resources in a manner common in virtualization and clouds, the relationship between resource utilization and performance no longer holds.
- Given that point, it is essential to use unique data that is not available from common management APIs to solve the important system and application performance problems that pervade virtualized data centers and clouds.
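The point about shared resources can be illustrated with a toy model. The sketch below assumes a hypothetical VM whose requests each need a fixed amount of CPU time, and a hypervisor that leaves the VM waiting for a physical CPU for some share of wall-clock time. The function names, the `wait_fraction` parameter, and all numbers are illustrative assumptions, and the model deliberately ignores queueing and other real-world effects.

```python
def guest_utilization(cpu_ms_per_request: float, requests_per_sec: float) -> float:
    """CPU utilization (0.0 to 1.0) as seen from inside the guest.

    Simplifying assumption: the guest only accounts for CPU time it
    actually consumed, so time spent waiting to be scheduled by the
    hypervisor does not appear in this number.
    """
    return min(1.0, cpu_ms_per_request * requests_per_sec / 1000.0)


def response_time_ms(cpu_ms_per_request: float, wait_fraction: float) -> float:
    """Wall-clock time per request when the VM is descheduled part of the time.

    wait_fraction is the assumed share of wall-clock time the VM spends
    waiting for a physical CPU; only the remaining share does useful work.
    """
    if not 0.0 <= wait_fraction < 1.0:
        raise ValueError("wait_fraction must be in [0, 1)")
    return cpu_ms_per_request / (1.0 - wait_fraction)


if __name__ == "__main__":
    cpu_ms, rate = 20.0, 10.0
    for label, wait in (("dedicated host", 0.0), ("contended host", 0.5)):
        util = guest_utilization(cpu_ms, rate)
        latency = response_time_ms(cpu_ms, wait)
        print(f"{label}: utilization {util:.0%}, response time {latency:.0f} ms")
```

In this model both scenarios report the same 20% utilization, yet the response time doubles on the contended host. That gap is what utilization-only data cannot show.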
Unique Management Data
Management data can be unique in several respects. It can be unique because a vendor took the time to collect it. It can be unique due to the quantity of the data, leading to the need to architect the tool in a different manner. Most importantly, it can measure things other than resource utilization, such as latency and response time, which are better indicators of system and application performance, respectively. Here are some vendors that collect unique data, and an explanation of what they collect:
- Splunk collects log data, and there is nothing special about the data itself. But the quantity of the log data (one vSphere host generates 250 MB of data per day) and the need to query that data with great performance mean that you need a completely different back-end data architecture than what 98% of management products have. The quantity of log data makes monitoring into a big data problem, and Splunk currently leads the market in solving this problem for customers. With the recent acquisition of CloudMeter, Splunk is also now positioned to add unique data from the network to its back-end data store. Splunk is also notable for the ecosystem of third-party partners that it has recruited to add their data to the Splunk data store. This is covered in detail in Replacing Franken-Monitors and Frameworks with the Splunk Ecosystem.
- With Log Insight, VMware has a similar big data back end and can collect quantities of data similar to those Splunk can collect. VMware is also working on recruiting a network of third-party vendors who put their data into Log Insight, but it has a lot of catching up to do before its ecosystem rivals the depth and breadth of Splunk’s ecosystem.
- Virtual Instruments stands out in terms of the unique data that it collects about storage latency for Fibre Channel-attached storage. By tapping the SAN, Virtual Instruments is able to see every storage transaction at a sub-second level of granularity between every physical server and every storage array, irrespective of the vendor of the array. This is a case study in the value of unique data: five-minute storage latency data is available essentially for free as a part of vCenter, but customers are willing to pay Virtual Instruments significant amounts of money for storage latency data that is much more comprehensive, much more deterministic, and much closer to real time than the commodity latency data available from VMware.
- ExtraHop stands out in terms of both unique data about the performance of network-attached storage arrays and unique layer 7 data about the performance of applications. ExtraHop puts a physical appliance on the mirror (SPAN) port of a switch and then decodes the storage protocols (NFS, iSCSI, SMB), as well as the layer 7 application protocols. ExtraHop therefore represents two sources of unique data that vendors relying on commodity data cannot provide: data about the latency of the infrastructure, and data about the response time of applications.
- TeamQuest stands out for its ability to calculate the infrastructure latency for every server upon which TeamQuest is deployed. This again is precisely the correct way to understand infrastructure performance: by measuring how long it takes the infrastructure to do what you are asking of it, rather than by looking at resource utilization.
- Tintri is an example of a vendor of next-generation infrastructure (a storage appliance) that improves storage performance while using latency as the measurement of storage performance. This is precisely the direction in which infrastructure needs to go—where a focus upon how long it takes for the infrastructure to do its job becomes a core responsibility of the infrastructure itself.
- Vendors like AppDynamics, AppNeta, Dell (Foglight), ManageEngine, New Relic, and Compuware (dynatrace) put an agent in the Java, .NET, or other runtime (PHP, Ruby, Node.js) and deeply instrument the interaction between the runtime and the custom code that makes up the application. This is perhaps the best example of uniquely valuable data, as this data both measures the hop-by-hop performance of the application and provides diagnostics as to where in the code the slowdown or errors are occurring.
- An entirely new set of vendors instrument some combination of the operating system and the network stack in the operating system. Vendors like AppEnsure, AppFirst, BlueStripe, Boundary, and Correlsense all have agents that collect unique data at the boundary between the applications and the operating system, and/or unique data about how the applications are using the network from the network stack. This results in an ability to understand the response time and throughput of every application in production (both custom developed and purchased) no matter where that application runs (virtualized data center, private cloud, hybrid cloud, or public cloud).
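The difference between five-minute commodity latency data and sub-second data can be made concrete with a small sketch. It uses synthetic numbers (an assumption for illustration, not real vCenter or vendor output): one latency sample per second for five minutes, with a ten-second stall in the middle. The helper names are hypothetical.

```python
import statistics


def synthetic_samples() -> list[float]:
    """300 one-second latency samples: 2 ms normally, 200 ms during a 10 s stall."""
    samples = [2.0] * 300
    for second in range(120, 130):
        samples[second] = 200.0
    return samples


def five_minute_average(samples_ms: list[float]) -> float:
    """What a coarse, five-minute rollup would report for the whole interval."""
    return statistics.fmean(samples_ms)


def seconds_over(samples_ms: list[float], threshold_ms: float) -> int:
    """How many one-second samples exceeded the threshold."""
    return sum(1 for s in samples_ms if s > threshold_ms)


if __name__ == "__main__":
    samples = synthetic_samples()
    print(f"five-minute average: {five_minute_average(samples):.1f} ms")  # 8.6 ms
    print(f"worst one-second sample: {max(samples):.1f} ms")              # 200.0 ms
    print(f"seconds above 20 ms: {seconds_over(samples, 20.0)}")          # 10
```

With these numbers the five-minute rollup shows an unremarkable 8.6 ms, while the per-second view shows ten seconds at 200 ms. An application that stalled during those ten seconds would look healthy in the coarse data.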
Choosing a Management Tool for Your Virtual or Cloud-Based Environment
Virtualized data centers, private clouds, hybrid clouds, and public clouds are all so different from the legacy physical environment that you should use entirely different criteria for choosing the tools that you use to monitor and manage these new environments:
- When evaluating any monitoring tool, one of the first questions you should ask is: “What, if any, unique data does this tool collect, how does it collect that data, and what is the value of this data to the problems that I am trying to solve?” For many of the problems that people supporting modern infrastructures and applications face, commodity data equals commodity results. A tool that just collects data from commodity management interfaces is simply not going to provide you with the value that a tool that collects actual latency and response time information will.
- That said, it also matters what the tool does with the data. As mentioned above, vendors like CloudPhysics, Cirba, Netuitive, Prelert, VMTurbo, and VMware vCenter Operations all provide value through analytics, even if they start with the same data as everyone else.
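As a rough illustration of what "analytics applied to commodity data" can mean, the sketch below flags utilization samples that deviate sharply from their own recent history. It is a deliberately simple stand-in, far more basic than what the vendors named above ship, and it does not represent any of their methods. The function name, window size, and threshold are assumptions chosen for the example.

```python
import statistics


def anomalies(values: list[float], window: int = 12, threshold: float = 3.0) -> list[int]:
    """Return indexes of samples that deviate from the trailing window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the `window` samples before it.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Perfectly flat history: treat any change at all as a deviation.
            if values[i] != mean:
                flagged.append(i)
            continue
        if abs(values[i] - mean) / stdev > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Hypothetical CPU utilization percentages, one sample per interval.
    cpu = [40, 42, 41, 39, 40, 43, 41, 40, 42, 39, 41, 40, 41, 85, 42, 40]
    print(anomalies(cpu))  # [13]
```

A fixed threshold such as "alert above 90%" would never fire on this series, whereas a baseline-relative check flags the jump to 85%. The raw data is identical in both cases; the difference lies entirely in how it is processed.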
Ultimately, the single most important thing when selecting a tool is to focus on the problem you want to solve. There are hundreds of tools that can give you visibility into commodity data, assuming you have the expertise to interpret anomalies in that data and tie them back to real issues with application and infrastructure performance. Important problems like efficiently running your infrastructure and assuring the performance of all of your applications need to be solved with either unique data or unique analytics. Tools that do not provide you with either unique data or unique analytics are probably not worth your consideration.
The new IT environment is so different from its physical predecessor that a new approach to management tools is needed. Unique management data needs to be combined with unique analytics to address these issues.