Analytics as Code

We all need performance and capacity management tools to fine-tune our virtual and cloud environments, but we need them to do more than just tell us there may be problems. We need them to find the root causes of problems, whether those problems are related to code, infrastructure, or security. Both the new breed of applications designed for the cloud, à la Netflix, and older technologies instantiated within the cloud need more in order to tell us about their health. Into this breach step new tools as well as established ones.

These tools show us robust topologies, but they also use external sources of data to reduce false positives. By using sources of data other than performance data, we are able to determine which events are normal and which are not. Many tools, like VMware vRealize Operations, learn what is normal over time, using algorithms to build up a picture of normal behavior. If, for example, after twelve months of use, there is a spike every first of the month, we could consider that behavior to be normal. But the first time we see it, we may not understand that it is normal behavior. If we cannot tell, then the tools may not be able to either.
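To make the first-of-the-month example concrete, here is a minimal sketch of how a recurring spike might be recognized as normal. It is not how vRealize Operations or any other product works; the function names, the three-standard-deviation threshold, and the `min_repeats` value are all assumptions chosen for illustration.

```python
from datetime import date
from statistics import mean, pstdev


def is_spike(value, baseline, threshold=3.0):
    """Return True if value is more than `threshold` standard deviations
    above the mean of the baseline readings (threshold is an assumption)."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return value > mu
    return (value - mu) / sigma > threshold


def is_expected_spike(day, samples, min_repeats=3):
    """Return True if the same day of the month spiked in at least
    `min_repeats` earlier months.

    samples maps datetime.date -> metric value (hypothetical daily readings).
    """
    # Baseline: every reading that does not fall on this day of the month.
    baseline = [v for d, v in samples.items() if d.day != day.day]
    repeats = sum(
        1
        for past_day, value in samples.items()
        if past_day < day
        and past_day.day == day.day
        and is_spike(value, baseline)
    )
    return repeats >= min_repeats
```

With only one earlier month of history, `is_expected_spike` returns False, which mirrors the point above: the first time a spike appears, neither we nor the tool can call it normal.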

Many of these tools require time to make these decisions: months, usually. However, we need answers immediately. We need to find the root cause of a problem. And increasingly, we need tools that work together, that perhaps use each other as data sources. Why? Because there is no one tool for all root cause analysis. There is a host of tools, such as ExtraHop, vRealize Operations, VMTurbo, Aternity, SolarWinds, and SIOS, that will gather data and push out alerts, graphs, charts, etc. Each of these tools uses its own algorithms, and some concentrate heavily on one aspect of the environment, such as end-user experience, over the others.

However, each of these tools needs to correlate events against performance data. If a tool does not use events to check the performance data, the data alone can lead to false positives. The more false positives a tool produces, the less likely it is to be used. If the event is, for example, a backup job of 4,000 virtual machines, you will expect to see performance issues with underlying storage subsystems. If this is normal behavior, then it should be logged but not alerted upon. However, if this is an abnormal event, you may want to alert upon it. Someone may have changed the backup job, time, number of systems to back up, etc.
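The backup example can be sketched as a small correlation rule. This is an illustrative sketch only: the `Event` and `Anomaly` types, the `expected` flag, and the five-minute slack window are assumptions, not part of any product named here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    name: str
    start: datetime
    end: datetime
    expected: bool  # True if the event matches its known schedule and scope


@dataclass(frozen=True)
class Anomaly:
    metric: str
    at: datetime


def classify(anomaly, events, slack=timedelta(minutes=5)):
    """Return "log" when the anomaly is explained by expected events only,
    otherwise "alert"."""
    overlapping = [
        e for e in events
        if e.start - slack <= anomaly.at <= e.end + slack
    ]
    if not overlapping:
        return "alert"  # nothing explains the performance anomaly
    if all(e.expected for e in overlapping):
        return "log"  # normal behavior: record it, do not page anyone
    return "alert"  # an event changed (job, time, scope), so raise it
```

A storage latency anomaly at 02:00 during the regular nightly backup would be logged, while the same anomaly during a backup whose schedule or scope had changed would raise an alert.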

How does this make it Analytics as Code? A good tool will have an API that can be called to input data, such as events outside the norm, and can be queried for topological and other data. In essence, a part of deployment needs to be able to tell the management tool that there are new items to manage, correlate, investigate, and analyze. You can write your own tools or work with a company like Intigua, which provides a framework for deploying the necessary agents as well as informing the management tool that there are new objects and events within the system.
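As a rough illustration of a deployment step telling the management tool about a new object, the sketch below builds an HTTP request without sending it. The endpoint URL, the payload fields, and the function name are hypothetical; a real tool defines its own API, schema, and authentication.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; substitute whatever the management tool documents.
EVENTS_URL = "https://mgmt.example.invalid/api/v1/events"


def build_event_request(object_id, kind, details, url=EVENTS_URL):
    """Build (but do not send) a POST request announcing a new object or event."""
    payload = {"object_id": object_id, "kind": kind, "details": details}
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: a deployment pipeline announcing a newly deployed virtual machine.
request = build_event_request("vm-42", "deployed", {"app": "billing"})
```

Passing the request to `urllib.request.urlopen` would send it; that call is left out here because the endpoint does not exist.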

More importantly, we need to use those same analytics to determine whether there have been any changes between the initial blueprint developed and the code actually deployed. In a containerized world using Docker, for example, the container definition could be a blueprint, but to me that is just a configuration of part of an application. There could be multiple containers involved, and in that case, you need one blueprint for the entire application.

If the blueprint could be output as a TOSCA graph, and the analytics could output topology as a TOSCA graph, then we would have the ability to compare, contrast, and even alert if they are wildly different, somewhat different, or the same. Why is such an alert important? We design applications to behave in certain ways. Often deployment differs from architecture and design by quite a lot, which leads to mistakes and other changes to the environment. If we can keep all our documentation in sync, then the next time we go to architect a change to the application, we have exactly what is in production.
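The comparison described above can be sketched once both sides are reduced to nodes and relationships. This does not parse TOSCA; it assumes the blueprint and the discovered topology have already been converted into the hypothetical `Topology` structure below, and the 0.5 cutoff between "somewhat" and "wildly" different is an arbitrary assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)  # (source, target) pairs


def diff_topology(blueprint, deployed):
    """Return what the deployment lacks or adds relative to the blueprint."""
    return {
        "missing_nodes": blueprint.nodes - deployed.nodes,
        "extra_nodes": deployed.nodes - blueprint.nodes,
        "missing_edges": blueprint.edges - deployed.edges,
        "extra_edges": deployed.edges - blueprint.edges,
    }


def drift_level(blueprint, deployed, wild_ratio=0.5):
    """Classify drift as "same", "somewhat different", or "wildly different"."""
    diff = diff_topology(blueprint, deployed)
    changed = sum(len(items) for items in diff.values())
    if changed == 0:
        return "same"
    size = max(len(blueprint.nodes) + len(blueprint.edges), 1)
    if changed / size > wild_ratio:
        return "wildly different"
    return "somewhat different"
```

The dictionary returned by `diff_topology` is what would feed an alert or a documentation update, since it names exactly which components and relationships diverged.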

Furthermore, we can deploy those blueprints easily using tools such as Ravello, so we can create a test bed for security and other testing.

It all starts with a blueprint, but if we have already deployed our application, we need to pull the deployed production environment into a new blueprint automatically. SIOS, Virtual Infrastructure Navigator, and other tools create topologies that we can use to further our goal of having an automated deployment into new environments. If we can do this, the by-hand drawing of application architectures will be a thing of the past. That is a worthy goal. It means that part of our application deployment is to use Analytics as Code to grab the existing deployment first, then compare where we are to where we want to go, and go there with automation in mind.

A tool that knows what should be there, correlates events from normal and external sources, and matches an existing blueprint to what is real will give deep root cause analysis and reduce false positives.
