One of the main points of the SDDC is to move the configuration of the environment from the hardware that supports it into a layer of software that can collectively manage all of the storage, networking, compute, and memory resources of the environment. Once all of the configuration of the data center is moved into software, and some of the execution of the work is moved into software, SDDC Data Center Analytics will play a critical role in keeping your SDDC up and running with acceptable performance.
The Role of Software Defined Data Center Analytics
In “The Big Data Back End for the SDDC Management Stack”, we proposed that the amount of data collected and generated by the different management products that will comprise the SDDC management stack will require a big data back end. This is critical, since only with a common big data back end data store will there be one place where the customer can go to find the information that they are looking for, and one place where each vendor can find complementary information from other vendors. For example, if an APM vendor puts their application response time information in the big data back end, they can in turn query the big data back end for the operations and security data that may explain a response time degradation.
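As a rough illustration of that APM example, the sketch below uses an in-memory list as a stand-in for the shared back end. The `Record` type, the `candidate_causes` function, the source labels, and the 15-minute lookback window are all hypothetical choices made for illustration; a real back end would expose its own query interface.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    """One entry in the (hypothetical) shared data store."""
    timestamp: datetime
    source: str   # e.g. "apm", "operations", "security"
    message: str


def candidate_causes(
    store: Iterable[Record],
    degraded_at: datetime,
    lookback: timedelta = timedelta(minutes=15),
    sources: Tuple[str, ...] = ("operations", "security"),
) -> List[Record]:
    """Return records from other vendors' sources that fall in the
    lookback window ending at the moment response time degraded."""
    window_start = degraded_at - lookback
    matches = [
        r for r in store
        if r.source in sources and window_start <= r.timestamp <= degraded_at
    ]
    return sorted(matches, key=lambda r: r.timestamp)


# Made-up sample data for illustration only.
store = [
    Record(datetime(2024, 1, 1, 9, 40), "operations", "nightly backup finished"),
    Record(datetime(2024, 1, 1, 10, 2), "operations", "storage latency above threshold"),
    Record(datetime(2024, 1, 1, 10, 5), "security", "port scan detected"),
    Record(datetime(2024, 1, 1, 10, 10), "apm", "response time degraded"),
]

for record in candidate_causes(store, datetime(2024, 1, 1, 10, 10)):
    print(record.timestamp, record.source, record.message)
```

The query only narrows the search by time and source; deciding which of the returned records actually caused the degradation is the job of the analytics discussed in the rest of this post.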
Deterministic vs Analytic Based Approaches
If you step back and think about how problems are solved in your data center today, you’ll find that the approaches roughly sort out into two groups. For some problems a deterministic approach is possible. For example, if you want to know the source of a network outage, you can use dependency mapping and outage reports from various nodes to figure out where the actual fault in the network lies. Vendors like EMC Smarts, IBM with Netcool (acquired via MicroMuse), and Zenoss have all had the ability to find faults in networks and systems via a variety of deterministic approaches for years.
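A minimal sketch of the dependency-mapping idea follows, assuming a map in which each node lists the nodes it depends on. The topology, the node names, and the `find_root_faults` function are hypothetical and are not taken from any of the products named above.

```python
from typing import Dict, List, Set


def find_root_faults(depends_on: Dict[str, List[str]], down: Set[str]) -> Set[str]:
    """Return the down nodes whose own dependencies are all still up.

    A node that is down while something it depends on is also down is
    treated as a symptom rather than a root fault.
    """
    return {
        node
        for node in down
        if not any(dep in down for dep in depends_on.get(node, []))
    }


# Hypothetical map of the environment: node -> nodes it depends on.
topology = {
    "web-01": ["switch-a"],
    "web-02": ["switch-a"],
    "db-01": ["switch-b"],
    "switch-a": ["core-router"],
    "switch-b": ["core-router"],
    "core-router": [],
}

# Outage reports received from the monitoring system.
outage_reports = {"web-01", "web-02", "switch-a"}

print(find_root_faults(topology, outage_reports))  # {'switch-a'}
```

The answer is only as good as the `topology` map: if a workload moves to a different switch and the map is not updated, the same logic will point at the wrong device.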
The problem with almost all deterministic approaches is that they require a rule set, a map of the environment, or both. But rule sets and comprehensive maps are almost impossible to keep up to date in a modern virtualized data center, and they will be completely impossible to keep up to date in an SDDC.
Analytic-based approaches can be more adaptable to rapidly changing environments, provided that data is collected about the environment on a near real-time basis (which is what leads to the need for the big data back end in the first place) and that the analytics can themselves adapt to changes in the data. This leads to a crucial requirement for analytics in the SDDC: to be effective and usable, they need to self-adapt to, or self-learn, the changing conditions reflected in what counts as normal data about the SDDC.
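One simple way to picture a self-adapting baseline is an exponentially weighted moving mean and variance, sketched below. This is an illustrative technique chosen for brevity, not the algorithm used by any vendor discussed here; the class name and the `alpha`, `threshold`, and `warmup` parameters are assumptions.

```python
import math


class AdaptiveBaseline:
    """Learns what "normal" looks like from the data itself and keeps
    adjusting as the data changes (exponentially weighted statistics)."""

    def __init__(self, alpha: float = 0.1, threshold: float = 3.0, warmup: int = 10):
        self.alpha = alpha          # how quickly old data is forgotten
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.warmup = warmup        # samples to observe before flagging anything
        self.mean = None
        self.var = 0.0
        self.count = 0

    def observe(self, value: float) -> bool:
        """Return True if value deviates from the learned baseline,
        then fold the value into the baseline."""
        self.count += 1
        if self.mean is None:
            self.mean = float(value)
            return False
        deviation = value - self.mean
        std = math.sqrt(self.var)
        anomalous = bool(
            self.count > self.warmup
            and std > 0
            and abs(deviation) > self.threshold * std
        )
        # Incremental update of the exponentially weighted mean and variance.
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous


baseline = AdaptiveBaseline()
# Made-up metric: steady around 100-102, then a jump to a new level of 200.
samples = [100 + 2 * (i % 2) for i in range(50)] + [200] * 100
flags = [baseline.observe(s) for s in samples]
print("first flagged sample index:", flags.index(True))
print("still flagged at the end:", flags[-1])
```

No training step or rule set is involved: the jump is flagged when it first appears, and if the new level persists the baseline absorbs it and stops alerting.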
Attributes of Software Defined Data Center Analytics
Big data analytics and analytics in support of data center management software are both rapidly evolving areas. There is a great deal of unsettled debate as to what the best approaches should be to particular kinds of problems. Therefore, when evaluating standalone analytics offerings, or analytics which are features of management offerings, it is important to understand where these analytics fit along the following dimensions:
- Self-Learning vs Trained – The whole idea of using analytics to try to manage the performance of computer systems got its start back in 1990. Back then there were three companies. CA offered a product, Neugents, that was based on neural net technology that needed to be trained for each situation it was supposed to manage. The requirement to train the models made Neugents unviable, and CA killed the product. ProactiveNet offered a product based upon standard Bayesian statistics, which also required a sample set of data to build a model. ProactiveNet was never able to achieve much in the way of market traction and was ultimately acquired by BMC. The failure of Neugents and ProactiveNet has pretty much killed the idea of using pre-trained models to understand dynamic management data. Netuitive pioneered the self-learning approach and still exists today as an independent vendor with some of the largest and most sophisticated enterprises as customers. With the analytics in vCenter Operations, VMware uses a hybrid approach. There are multiple pre-trained models, and a master model that decides which of the pre-trained models to use in a particular situation.
- Generalized vs Specific to a Use Case – Netuitive and Prelert are both examples of very generalized performance analytics. You just wire up the products with the data feeds, and their correlation engines learn the normal patterns of the data and alert you to deviations. However, there are many use cases for analytics that are not best met with a generalized approach. VMTurbo has analytics in their product that specifically allocate scarce resources to the most important workloads, ensuring that the most important applications perform well. Cirba has analytics that optimize the placement of your workloads across a variety of physical and virtual infrastructure. CloudPhysics has different approaches to analytics underlying each of their “cards”. For example, there are specific analytics that tell you if you have enough headroom for HA to work and different analytics that tell you where to allocate your flash storage. AppDynamics has very sophisticated time of day and day of week statistical baselining, allowing you to get alerts when things are abnormal in the context of what should be happening at any moment in time. Veloxum has a set of very specific analytics that optimize the configuration of the environment supporting an application so as to automatically provide the best possible response time and latency.
- Time series oriented vs event stream oriented – One of the important, yet very detailed, differences between analytics is whether they are better at analog time series data (the data looks like a sine wave with peaks during the day and valleys at night) or whether they are better at event streams (where events are coded and then analyzed for their frequency). Netuitive is really good at time series data of the sine wave kind. Prelert is an example of a class of analytics that is very good at finding anomalies in event streams, which is what makes Prelert a natural addition to Splunk.
- Model Driven vs Data Driven – This is again a very technical but very important area. Some vendors start with a model and then apply that model to systems management. When Netuitive got its start, the model was being used to forecast energy demand and predict when an airplane wing would break under stress. The core technology (the algorithms) behind Prelert comes from gene sequencing. The core algorithms in vCenter Operations were adapted by the founder of Integrien from algorithms that were first used in chaos theory. In these cases, vendors start with a model that is often a significant advance upon the previous state of the art and then adapt the model over time to fit the data and the use cases encountered in the management realm. CloudPhysics takes the opposite approach. It starts with the data for a particular situation or use case and then builds a model that is most appropriate for that use case. The benefit of this approach is that you get a model that is custom-built for the problem you are trying to solve. The downside is that if CloudPhysics has not gotten around to building a model for your problem, then you have to wait until they build one.
- Analytics and Automation – Automatically figuring out what is wrong is a significant advance in the root cause processes used by most enterprises today. Any enterprise that could automate the mudslinging process that characterizes most blamestorming meetings would likely be extremely grateful. But this is not the end goal. The end goal is to translate the cause as found by the analytics into a fix that can then be automatically applied. For example, if your automated root cause analytics tell you that response time is slow because the number of incoming requests is higher than normal, you know you probably have a capacity issue. But just knowing that does not tell you where the constraint is and what kind of capacity (web server, java server, database server, server hardware, network, or storage) you need to add. We have a long way to go before automated root cause based upon self-learning analytics can drive an automated remediation process.
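To make the time of day and day of week baselining mentioned in the list above more concrete, here is a minimal sketch that keeps a separate history for each (weekday, hour) slot and compares a new value only against its own slot. It is a generic illustration and not a description of how AppDynamics or any other vendor implements baselining; the class name, the `threshold` and `min_samples` parameters, and the sample numbers are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


class SeasonalBaseline:
    """Keeps one history per (weekday, hour) slot so that a value is
    judged against what is normal for that moment of the week."""

    def __init__(self, threshold: float = 3.0, min_samples: int = 4):
        self.threshold = threshold
        self.min_samples = min_samples
        self.buckets = defaultdict(list)

    @staticmethod
    def _key(ts: datetime):
        return (ts.weekday(), ts.hour)

    def learn(self, ts: datetime, value: float) -> None:
        self.buckets[self._key(ts)].append(value)

    def is_abnormal(self, ts: datetime, value: float) -> bool:
        history = self.buckets.get(self._key(ts), [])
        if len(history) < self.min_samples:
            return False  # not enough history for this slot yet
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) > self.threshold * sigma


baseline = SeasonalBaseline()
# Made-up requests-per-minute figures for four consecutive Mondays.
for day, busy, quiet in [(1, 900, 40), (8, 950, 50), (15, 1000, 60), (22, 1050, 50)]:
    baseline.learn(datetime(2024, 1, day, 14), busy)   # Monday 14:00
    baseline.learn(datetime(2024, 1, day, 3), quiet)   # Monday 03:00

# The same value is normal at 14:00 but abnormal at 03:00.
print(baseline.is_abnormal(datetime(2024, 1, 29, 14), 900))  # False
print(baseline.is_abnormal(datetime(2024, 1, 29, 3), 900))   # True
```

A single global threshold would either miss the 03:00 spike or raise false alarms every afternoon; the per-slot history is what gives the alert its context.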
Install Splunk and feed it all of your log data, your server metrics, and your VMware metrics. Then feed it the network APM data from ExtraHop. Then feed it the data from your custom applications monitored by AppDynamics. Then take a look at something like AppEnsure to get all of the response time data and the topology for all of your purchased applications. Now you can manually issue a query and find out what is causing an issue with extraordinary speed. But you are still reliant upon a human to know what question to ask. The holy grail is turning your real-time big data management data store into an asset that helps you find problems that you do not know about yet or, better yet, find issues that, when fixed in time, can prevent problems. This will require layering self-learning analytics from vendors like Netuitive and Prelert on top of Splunk. CloudPhysics accomplishes the same thing by having their own big data back end and their own analytics in the cloud (which collects data from all of their customers). The combination of big data and analytics is where the future of operations management, application performance management, and many other aspects of managing the SDDC and the cloud lies.