Monitoring for Agile Operations and DevOps


In "Agile without Ops Is Not Really Agile," The Virtualization Practice analyst Mike Kavis did an excellent job of pointing out that making development agile does not, by itself, make the entire IT organization more responsive to the business, because it ignores all of the things that can and do go wrong in production. Making production support agile requires addressing the processes and tools that are used in Operations. A huge part of agile operations involves monitoring and the processes that use the monitoring tools.

IT Operations Requirements for Agile Operations (DevOps and AppOps)

Moving to an Agile Development methodology requires a huge change in the culture and practice of software development as well as a new set of tools that are suited for this new process. It also requires that Development be reorganized to support this new development process.

The changes that IT Operations must go through are similar in scope and profound in nature:

  1. Silos must be broken down and eliminated. There can be no more storage team, network team, server team, OS team(s), and virtualization team. As VMware rolls out its software defined data center, and as customers bring public clouds like Amazon (which already is a software defined data center) into production, IT Operations teams must be seamlessly cross-functional.
  2. IT Operations must assume responsibility for the performance of every business critical and performance critical application in production. This includes not just the applications developed in house (via agile methodologies), but also all of the purchased applications and the compound applications that are mixtures of custom-developed and purchased applications.
  3. Application performance needs to be defined as response time and throughput, not resource utilization. In other words, IT Operations needs to assume responsibility for the response time and throughput of every business critical and performance critical application in production.
  4. Capacity needs to be defined as the ability to deliver the response times and throughput rates that the business requires for each application, not as the quantity of resources those applications consume.
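
To make items 3 and 4 concrete, here is a minimal Python sketch of a capacity check expressed in terms of response time and throughput rather than resource utilization. The names (`ServiceTarget`, `has_capacity`), the choice of the 95th percentile, and the target values are illustrative assumptions, not something prescribed by any particular tool.

```python
import math
from dataclasses import dataclass


@dataclass
class ServiceTarget:
    """Hypothetical business requirement for one application."""
    max_p95_response_ms: float   # slowest acceptable 95th-percentile response time
    min_throughput_tps: float    # transactions per second the business needs


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def has_capacity(response_times_ms, window_seconds, target):
    """True when the observed window meets both targets.

    response_times_ms holds one entry per completed transaction in the window,
    so its length divided by the window length is the throughput.
    """
    p95 = percentile(response_times_ms, 95)
    throughput = len(response_times_ms) / window_seconds
    return (p95 <= target.max_p95_response_ms
            and throughput >= target.min_throughput_tps)


# Assumed example values: a 10-second window with 100 transactions.
target = ServiceTarget(max_p95_response_ms=200.0, min_throughput_tps=5.0)
healthy = [100.0] * 95 + [900.0] * 5     # 5% slow transactions
degraded = [100.0] * 90 + [900.0] * 10   # 10% slow transactions
print(has_capacity(healthy, 10, target))
print(has_capacity(degraded, 10, target))
```

Note that CPU, memory, and disk utilization do not appear anywhere in the check. A check like this is only meaningful when the window was measured under the load the business actually expects.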

Monitoring Requirements for Agile Operations (DevOps and AppOps)

In order for IT Operations to meet the requirements listed above, the monitoring tools used by IT Operations and the way in which those tools are used must change dramatically. The key changes are:

  1. Monitoring must become comprehensive. That means that everything that could have an impact upon the availability or performance of an application needs to be monitored.
  2. Monitoring must be done with high fidelity and granularity. Coarse-grained polls of data will no longer suffice. The devil will often be in the details, so if the details are not monitored, the devil will not be found.
  3. Monitoring must become more frequent and therefore near real-time. Collecting data every 5 minutes no longer suffices. Even every 1 minute is not frequent enough. Infrastructure latency information must be collected at 1-second intervals. Application response time metrics must be collected at 5- or 10-second intervals.
  4. Monitoring must become deterministic. This means that averaging or rolling up of metrics needs to be taken out of the data collection process, because averaging inevitably hides the exceptions that matter most.
  5. Commodity data equals commodity results. What this means is that if all your monitoring tools do is collect data from management APIs like WMI, SNMP, or SMI-S, then all you have is the same dubious data that produces the same set of false alarms that everyone else ignores until a human being picks up the phone and screams about something.
  6. Monitoring must collect a diverse data set. Just logs alone are not enough. Just the time series data from WMI and the vSphere API are not enough. Just the configuration change events are not enough. Just the topology maps of the infrastructure and the applications are not enough. All of these things must be collected into one data store.
  7. Monitoring becomes a big data problem. The quantity, frequency, and diversity of the data that must be collected will overwhelm any traditional relational database.
  8. The big data back end for monitoring must be a multivendor back end into which every monitoring vendor puts its data, and from which every vendor can query data in order to provide enhanced root cause analyses.
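
The point in item 4 about averaging is easy to demonstrate. The sketch below uses made-up numbers: 300 one-second latency samples (five minutes of data) in which five consecutive seconds spike to 500 ms. The 20 ms alert threshold and the function names are assumptions chosen for illustration.

```python
from statistics import mean


def rollup_average(samples_ms):
    """What a 5-minute rollup would store: a single averaged value."""
    return mean(samples_ms)


def breaches(samples_ms, threshold_ms):
    """What per-sample data reveals: every (second, value) over the threshold."""
    return [(second, value)
            for second, value in enumerate(samples_ms)
            if value > threshold_ms]


# Assumed data: a steady 2 ms latency with a 5-second spike to 500 ms.
samples_ms = [2.0] * 300
for second in (120, 121, 122, 123, 124):
    samples_ms[second] = 500.0

THRESHOLD_MS = 20.0  # hypothetical alert threshold
average = rollup_average(samples_ms)
spikes = breaches(samples_ms, THRESHOLD_MS)

print(f"5-minute average: {average:.1f} ms")
print(f"seconds over threshold: {[s for s, _ in spikes]}")
```

The rolled-up average comes out at about 10.3 ms, comfortably under the threshold, while the raw samples show five seconds at 25 times the threshold. A tool that stores only the average never sees the event that the application's users felt.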

Building Your Monitoring Stack for Agile Operations (DevOps and AppOps)

The single most important decision that you need to make when embarking upon building a management stack for a new Agile DevOps and AppOps environment is to assume that none of the monitoring tools that you own will make the cut and make the transition into your new world. After you go through the process outlined below, you may conclude that some of them will make the transition. But you should confer no advantage of incumbency to any tool that you currently own. That said, the suggested process for building your new management stack is:

  • Start with your applications. Find out how many business critical and performance critical applications you have. Find out how many are custom developed, how many are purchased, and how many are a mixture of the two.
  • If you have both custom-developed and purchased applications, plan on buying at least two tools: a developer-focused tool like New Relic or AppDynamics, which is focused upon finding bugs in custom code in production, and an operations-focused tool like AppEnsure, AppFirst, BlueStripe, Correlsense, or ExtraHop that will work for every application in production.
  • Define infrastructure performance as end-to-end infrastructure latency, and purchase monitoring tools that get you that data. For Fibre Channel environments, Virtual Instruments has the only real-time, comprehensive, and deterministic solution available. For network-attached storage, look at ExtraHop or Riverbed. Once you get out of the storage realm, take a look at solutions like Boundary or TeamQuest.
  • Pick a big data back end. This is the foundation of your management stack architecture, so you have to get this one right. You want a big data back end that can technically handle the flood of incoming data, one that makes it easy to query across many different data sources, and one where the vendor is partnering with many adjacent vendors who all feed their data into the back end. Right now, the clear leader in this area is Splunk. But you should also consider that the right place for your big data back end to live might be in the cloud, which should cause you to take a look at either the new Splunk Cloud offering or CloudPhysics.
  • If you are a VMware shop, take a hard look at vCenter Operations Manager and Log Insight. While these two tools are just at the start of the process of getting fully integrated, together they make a powerful combination for collecting operations data and log data.
  • Once you have this flood of data coming into your multivendor big data back end, put in place a strategy for automatically analyzing it. This requires a self-learning analytics solution that is integrated with your big data back end, like those from Prelert or Netuitive.
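
To give a feel for what "self-learning" means in the last step, here is a deliberately simple Python sketch: a baseline that learns the normal range of a metric from its own recent history and flags values that fall far outside it. This is a generic rolling mean and standard deviation approach chosen for illustration. It is not how Prelert, Netuitive, or any other vendor implements its analytics, and the class name and parameters are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Learns a metric's normal range from its most recent samples."""

    def __init__(self, window=60, sigmas=3.0, min_samples=10, min_spread=1e-9):
        self.history = deque(maxlen=window)  # oldest samples fall off automatically
        self.sigmas = sigmas                 # how many deviations counts as abnormal
        self.min_samples = min_samples       # do not judge until enough is learned
        self.min_spread = min_spread         # floor so a flat history still works

    def observe(self, value):
        """Return True if value is abnormal against the baseline, then learn it."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            center = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            anomalous = abs(value - center) > self.sigmas * spread
        self.history.append(value)
        return anomalous


# Assumed data: response times hovering around 100 ms, then a jump to 150 ms.
baseline = RollingBaseline()
for value in [98.0, 102.0] * 10:
    baseline.observe(value)

print(baseline.observe(101.0))  # within the learned range
print(baseline.observe(150.0))  # far outside the learned range
```

No threshold is configured by hand here: the acceptable range comes from the data itself, which is the property that matters once the volume of metrics is too large for anyone to set thresholds manually.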

For more details on how to pick your monitoring tools at each layer of the monitoring stack, please read the following posts:


Building an Agile Operations process will require changes in organization, processes, and tools. This is an essential journey for any organization that wishes to remain relevant in the age of Agile Development, the Software Defined Data Center, and the Cloud.
