We are in the midst of an analytics boom. Everywhere I look, I see analytics presented as the answer to everything from sweaty pores to security. They may even improve hair growth. That aside, analytics truly are invading everything we do. There are three types of analytics, and overreliance on any one type leaves businesses vulnerable to false positives. IT lives in the land of false positives. We disable and ignore seemingly false alerts, but are they really false? How can we gain more from our analytics?
This question has plagued IT as a Service products for years. The vendors’ answer has always been to refine their algorithms, team up with event management platforms that have their own analytics, or let the users disable the alerts and events as needed. What’s the real solution to our false positives issue? We have gone through three phases of evolution in monitoring for performance, security, and the like. Those phases are:
- Thresholds: Anything over a given threshold is considered bad, but what happens if we hover right around the threshold? Is that good or bad? What happens if we flap around the threshold?
- Machine Learning: We use an algorithm to sample the data for issues. Generally, the more data we sample, the more accurate the algorithm becomes. Each algorithm is created with a basis or goal in mind. Some aim for a green data center by powering off nodes. Others move workloads around until they are spread out evenly. Still others aim for the highest density of workloads. Some tools layer their algorithms: a first layer analyzes the data, and a second layer works on those results to be more predictive.
- Artificial Intelligence: AI attempts to mimic what humans do, but without the mistakes we make as humans. AI aims for a repeatable process and function. AI needs to be trained with millions of pieces of data before it is ready to learn your environment and to predict and react appropriately.
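The flapping problem raised under Thresholds can be made concrete with a small sketch. It assumes a hypothetical `HysteresisAlert` class with separate raise and clear levels plus a dwell count. These names and numbers are illustrative and do not come from any particular product. The idea is that a value hovering around a single threshold toggles a naive alert repeatedly, while a band with a dwell requirement changes state only on sustained movement.

```python
from dataclasses import dataclass


@dataclass
class HysteresisAlert:
    """Hypothetical alert with separate raise/clear levels and a dwell count."""
    raise_at: float   # samples at or above this count toward raising the alert
    clear_at: float   # samples at or below this count toward clearing it
    dwell: int = 3    # consecutive qualifying samples needed to change state
    active: bool = False
    _streak: int = 0

    def update(self, value: float) -> bool:
        """Feed one sample and return the current alert state."""
        if not self.active:
            self._streak = self._streak + 1 if value >= self.raise_at else 0
            if self._streak >= self.dwell:
                self.active, self._streak = True, 0
        else:
            self._streak = self._streak + 1 if value <= self.clear_at else 0
            if self._streak >= self.dwell:
                self.active, self._streak = False, 0
        return self.active


def count_transitions(states):
    """Count how many times the alert state changes across the series."""
    return sum(1 for a, b in zip(states, states[1:]) if a != b)


# Made-up CPU percentages that flap around 80 before a sustained rise and fall.
samples = [79, 81, 79, 82, 78, 81, 85, 86, 88, 90, 84, 79, 72, 68, 65, 60]

naive = [v >= 80 for v in samples]
alert = HysteresisAlert(raise_at=80, clear_at=70, dwell=3)
smoothed = [alert.update(v) for v in samples]

print("naive transitions:", count_transitions(naive))
print("hysteresis transitions:", count_transitions(smoothed))
```

The trade-off is latency: the dwell count delays the alert by a few samples, so the values chosen here would need tuning against how quickly a real issue must be seen.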
Each of the last two approaches requires a huge investment in data. The more data, the better. In some cases, the more sources of data, the better. However, after a certain point, it is the type of data rather than the sheer quantity that matters. At the moment, we need visibility into the data more than we need to simply handle more of it. What we do with event data becomes important. Should we overlay the events from our management tool on top of the data to get a better handle on what we just saw? One example is a sudden spike in memory usage. Is this issue predicted from previous sets of data, or is it related to something entirely new? If we overlay at least the administrator events on the data, we may see that the spike is related to a known event. What if that event was not known? Is that not a security issue? Once you have visibility, you can think of new ways to analyze the data, new groupings and algorithms, and even new ways to train your systems.
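As a rough sketch of the overlay idea, the snippet below flags memory spikes against a short rolling baseline and then looks for administrator events shortly before each spike. The function names (`find_spikes`, `overlay_events`), the spike rule, the ten-minute lookback, and all of the sample data are assumptions made for illustration. A real tool would pull both series from its own stores.

```python
from datetime import datetime, timedelta


def find_spikes(samples, factor=1.5, window=3):
    """Return timestamps whose value exceeds factor x the mean of the previous `window` samples."""
    spikes = []
    for i in range(window, len(samples)):
        baseline = sum(v for _, v in samples[i - window:i]) / window
        ts, value = samples[i]
        if value > factor * baseline:
            spikes.append(ts)
    return spikes


def overlay_events(spikes, events, lookback=timedelta(minutes=10)):
    """Map each spike to the admin events that occurred within `lookback` before it."""
    return {
        ts: [name for when, name in events if timedelta(0) <= ts - when <= lookback]
        for ts in spikes
    }


# Made-up data: memory usage (percent) sampled every five minutes.
t0 = datetime(2017, 3, 1, 9, 0)
usage = [40, 42, 41, 43, 80, 44, 42, 41, 90]
samples = [(t0 + timedelta(minutes=5 * i), v) for i, v in enumerate(usage)]

# Made-up administrator events exported from a management tool.
events = [(t0 + timedelta(minutes=15), "patch rollout on host-01")]

overlay = overlay_events(find_spikes(samples), events)
unexplained = [ts for ts, names in overlay.items() if not names]

for ts, names in overlay.items():
    print(ts.time(), names or "no known event: investigate")
```

In this made-up data, the first spike lines up with a known change, while the second has no matching event and is the one worth a closer look from a security standpoint.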
In the following table, we look at analytics within several platforms:
|Rollout Logs to group like lines/events| | | | |
|Root-Cause Analysis (H)ardware/(S)oftware| |S| |H/S|
|Data Augmentation (Raw Data + Raw Events + …)| | | | |
|Common Repository (for all Products)| | | | |
Here are our definitions for the above table:
- Productivity Analytics look at how productive people or tools are being. This is the most basic of questions about return on investment. Do I know if productivity is being impacted by a tool, a person, a location, the weather, etc.?
- Basis is the fundamental aspect of each algorithm for the tool. Is the basis cost, density, power, consumption, or more than one of these factors? The basis underlies each algorithm and needs to be considered as you look at each tool. If you want to save money, a cost- or power-basis tool may be best. If you want to run dense, then you need a density-based tool or perhaps a consumption-based tool.
- Capacity Analytics is not just about predicting when you need to buy more hardware. It also concerns how to pack the hardware you currently have to get the most out of it. It is about asking “what if” questions. This is one of the most basic aspects of many tools.
- Cost Planning pulls the cost of all aspects of IT into the algorithms to help decide how to proceed. This covers not just the cost of new hardware for your data center, but also the costs of potentially going to the cloud.
- Storage Analytics look at the underlying storage infrastructure and try to make heads or tails out of the IOPS, latency, and capacity needs. More advanced analytics will move workloads around to ensure IOPS or latency goals are achieved.
- Hadoop, Splunk, Elasticsearch, and Dump are all methods for getting data out of the system to be used by other tools or for deeper investigation. Can the data be sent to a Hadoop instance, Splunk, or Elasticsearch, or can it be exported as a dump of data? Many people like the search capabilities of these tools and want to use Hadoop to ask even deeper questions.
- Thresholds, Machine Learning, or AI identifies the type of platform we are discussing. Many start out with thresholds and quickly transition into machine learning once enough data is available. Others are trained before you get them, and their training is refined with your data.
- Rollout Logs to group like lines or events is the ability to not just look at a log one line at a time but to find the actual events in each log and group them together as one block of log data. Most logs are interleaved, and part of the problem we have is that logs are hard to interpret unless like things are grouped together.
- Predictive is whether or not the tool can predict what will happen at, say, the end of the month, based on the history it has seen over previous months and years. Until there is enough data, prediction is limited.
- Service Impact implies the tool can predict or tell you about service impacts to other objects within the system once one object goes haywire. For example, if storage is impacted because of one workload, can the tool tell me what else is impacted? Service impacts are not limited to the typical “find the impact from the noisy neighbor” use case.
- Root Cause Analysis is about finding the root cause of a failure or issue. Many systems claim to give the root cause, but they can only do so to the level of the data they have. If the system does not have hardware data, then all failures look like software failures. Even in this age, hardware is a requirement, and the problem often lies not in the software but in the underlying hardware.
- Data Augmentation allows users to augment the data with their own findings, tags, or other information. You could use this aspect to layer on those events we discussed earlier. It is also a way for teams to communicate by providing a single interface to augment data.
- Common Repository is a requirement when using multi-tool systems. If they use disparate repositories, maintaining your data could be difficult. You need a deeper understanding of the tool to know where the particular type of data is housed. If there is a common repository or common interface to all data, then that need disappears.
- Data Investigation is a deeper view of the data. It is a way to use the data for forensic and other analysis, a way to ask your own questions about your environment.
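The log grouping described under Rollout Logs can be sketched in a few lines. This example assumes a made-up log format in which every line carries a bracketed correlation key, such as a job or session ID. Real logs rarely make it this easy, and the `roll_up` function and the `LINE` pattern are hypothetical.

```python
import re

# Assumed line format: "<timestamp> [<correlation key>] <message>"
LINE = re.compile(r"^(?P<ts>\S+) \[(?P<key>[^\]]+)\] (?P<msg>.*)$")


def roll_up(lines):
    """Group interleaved log lines into one block per correlation key."""
    groups = {}
    for line in lines:
        m = LINE.match(line)
        if m is None:
            # Keep lines that do not fit the assumed format instead of dropping them.
            groups.setdefault("unparsed", []).append(line)
            continue
        groups.setdefault(m["key"], []).append(f'{m["ts"]} {m["msg"]}')
    return groups


# Made-up interleaved log: a backup job and a login session writing to the same file.
log = [
    "09:00:01 [backup-17] job started",
    "09:00:02 [login-42] user alice authenticated",
    "09:00:03 [backup-17] copying volume vol1",
    "09:00:04 [login-42] session opened",
    "09:00:09 [backup-17] job finished",
]

groups = roll_up(log)
for key, block in groups.items():
    print(key)
    for entry in block:
        print("   ", entry)
```

Grouping on an explicit key is the simplest case. Tools that do this without such a key have to infer the grouping from timing and message similarity, which is where the analytics come in.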
There is quite a lot to think about when you look at the IT as a Service space, whether it is for a management tool, performance monitoring, or event management. Analytics is just one part of the whole, but it is becoming an important part. It is a part that merits serious thought and questions about a product’s suitability for your needs. The ultimate goal is to use a combination of algorithms and data sources to remove false positives from the output.
By Edward Haletky