Monitoring for Agile Operations

In “Agile Without Ops Is Not Really Agile,” Mike Kavis points out that Agile Development and DevOps must culminate in operations having the tools and processes it needs to be agile itself. Agile Operations should therefore be the natural consequence of agility in development and support, but often it is not. This post is about how the right monitoring tools can help operations become agile.

Agile Development, DevOps, and Agile Operations

Mike Kavis has written an excellent series of posts on the correct (and incorrect) ways to implement Agile Development and DevOps processes. All of these posts may be found in our Agile Cloud Development topic. While it is impossible to summarize Mike’s body of work in a few sentences, the key points are depicted in the diagram below. The most important of them is that agility and responsibility must exist at each stage in the process. Each stage must iterate until the criteria for promotion to the next stage are met. The process culminates with the promotion of the new release into production, at which point operations must be able both to compare the behavior of the new release with that of its predecessors and to assess its impact on the virtual and physical infrastructure in light of the requirements of the other applications that share that infrastructure.

The Agile Development → DevOps → Agile Operations Process 

Challenges Implementing Agile Operations

Even after the development and support teams fully embrace Agile Development and DevOps, there is no guarantee that operations will magically adapt to this new rate of change. In fact, operations has its own set of alligators to wrestle, the main one being the management of dynamic and distributed environments (data center virtualization and various types of clouds). Layering rapidly changing applications on top of a dynamic and distributed environment often puts operations in the unenviable position of being guilty until proven innocent. This leads to the behavior shown in the cartoon below (courtesy of AppDynamics).

AppDynamics Agile Operations

Why Does Operations Resist Rapidly Changing Applications?

The single biggest reason that operations resists rapidly changing applications is that, in its experience, any change causes problems. In fact, you could argue that ITIL (Information Technology Infrastructure Library) was invented to slow the rate of change by forcing changes to be documented and approved in change control committee meetings. This was a natural reaction on the part of operations to its lack of comprehensive, real-time visibility into the environment.

The problem that most operations groups have with rapid rates of change is rooted in how they acquired their monitoring tools: over the years, in an uncoordinated and ad hoc manner. Some organizations bought frameworks and then added point tools to fill in the gaps. Others simply bought a point tool for every constituency in the company that needed to monitor something. The result is a large number of unrelated and unintegrated monitoring tools. Such a collection of disparate tools is best referred to as a “Franken-Monitor.” The dangers of a Franken-Monitor were covered in “Beware of the Franken-Monitor.” The Franken-Monitor in all of its glory is depicted in the image below.

The problems with Franken-Monitors are:

  • They do not monitor frequently enough to keep up with rapidly changing environments and applications.
  • They do not monitor comprehensively enough to capture the data needed to quickly troubleshoot problems.
  • They monitor the wrong things. They focus on resource utilization and uptime instead of response time, latency, and throughput.
  • Data is not integrated across tools, resulting in the need for a war room or a bridge call every time a big problem occurs.
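
To make the third point concrete, here is a minimal Python sketch of what measuring response time and throughput (rather than resource utilization) might look like. The Request record, its field names, and the summarize function are hypothetical and are not taken from any product mentioned in this post; the percentile uses the simple nearest-rank method.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """One completed request (hypothetical record shape)."""
    start: float        # when the request began, in seconds
    duration_ms: float  # how long it took to answer, in milliseconds


def percentile(values, pct):
    """Nearest-rank percentile of values, for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_s):
    """Throughput and response-time summary for one window of requests."""
    durations = [r.duration_ms for r in requests]
    return {
        "throughput_rps": len(requests) / window_s,
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
    }
```

Computing the same summary for the window before and the window after a release gives operations a like-for-like comparison of the new release against its predecessor, which CPU and uptime figures alone cannot provide.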

Monitoring Tools for Agile Operations

So, if you want to do Agile Operations, you need a set of monitoring tools appropriate for rapidly changing applications running in dynamic and distributed environments. To build such a tool set, follow this process:

  • If you have a framework, put in place a plan to get rid of it. This will likely take years, but even a long journey must start with one step.
  • Start the process of getting rid of your framework by constraining it to “old” environments and refusing to use it in new environments. For example, you should categorically refuse to use legacy frameworks to manage data center virtualization environments like VMware and Hyper-V and your instances in clouds like Amazon and Azure.
  • Start looking for point tools to shoot. When it comes time to renew the maintenance for any monitoring tool, ask the question, “Can we get rid of this?”
  • If you narrow the footprint of the framework in your environment and shoot point tools, you will free up budget that should allow you to build your new management stack simply by redirecting funds.
  • Put in place a monitoring architecture. In “Building a Management Stack for Your Software-Defined Data Center,” we presented a reference architecture appropriate for managing the forthcoming software-defined data center. It turns out that this architecture is also the right one for Agile Operations. This reference architecture is shown below. Note that in this architecture, all of the monitoring tools put their data into one common big data back end. This is the only way that operations will be able to keep up with the rate of change driven by the Agile Development and DevOps processes.
Agile Operations Management Stack
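
The idea of every tool writing into one common back end can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the Event schema, the CommonStore class, the field maps, and the sample records are all hypothetical, and an in-memory list stands in for a real big data store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Shared schema that every tool's data is normalized into (hypothetical)."""
    timestamp: float  # epoch seconds
    source: str       # which monitoring tool produced the record
    entity: str       # host, VM, or application the record is about
    metric: str
    value: float


class CommonStore:
    """In-memory stand-in for a common back end."""

    def __init__(self):
        self._events = []

    def ingest(self, source, raw, field_map):
        """Normalize one tool-specific record.

        field_map maps the shared field names to the tool's own keys.
        """
        self._events.append(Event(
            timestamp=float(raw[field_map["timestamp"]]),
            source=source,
            entity=str(raw[field_map["entity"]]),
            metric=str(raw[field_map["metric"]]),
            value=float(raw[field_map["value"]]),
        ))

    def query(self, entity, start, end):
        """Events for one entity in [start, end] from every tool, in time order."""
        hits = [e for e in self._events
                if e.entity == entity and start <= e.timestamp <= end]
        return sorted(hits, key=lambda e: e.timestamp)


# Two tools with different record shapes describe the same server.
store = CommonStore()
store.ingest("apm", {"ts": 105, "app": "web01", "name": "response_ms", "v": 870},
             {"timestamp": "ts", "entity": "app", "metric": "name", "value": "v"})
store.ingest("infra", {"time": 100, "host": "web01", "counter": "cpu_ready_pct", "val": 12.5},
             {"timestamp": "time", "entity": "host", "metric": "counter", "value": "val"})
```

With both records in one place, a single query such as store.query("web01", 90, 110) returns the infrastructure and application views of the same incident side by side, which is the kind of correlation that otherwise requires a war room.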

Choosing Monitoring Tools for Agile Operations

The reality is that the management software industry has not fully caught up with the requirements for monitoring and managing performance in virtualized data centers and clouds while also meeting the requirements of Agile Operations. Therefore, what we have today is a bunch of really good starts that can be built upon as these vendors build out their offerings:

  • Splunk got its start collecting logs and built a big data back end in order to be able to cope with the arrival rate of the data and the size of the subsequent database. Splunk has also built out an impressive ecosystem of partners, all of whom put some of their data into Splunk or make their own data stores queryable from within Splunk. The process for constructing a management stack out of Splunk and its partner solutions was explained in “Replacing Franken-Monitors and Frameworks with the Splunk Ecosystem.” Splunk also offers its own operations management solution for VMware environments, Splunk App for VMware.
  • If you are an agile shop, then you are building applications. You need an APM tool that is suited for rapidly changing applications. The three requirements here are that (1) the tool be able to discover as much of the application system as possible, leading to the requirement that (2) the amount of configuration required be as close to zero as possible and that (3) the tool tell you where the problem is in the code as quickly as possible. AppDynamics, New Relic, AppNeta, Compuware, and Riverbed all have excellent tools focused on custom applications built in a variety of languages. If you are building mobile applications, you should pay particular attention to Crittercism, New Relic, and Compuware.
  • As great as APM tools are, they are built to meet the needs of the developers and support teams that are supporting custom applications in production. They are not built to meet the needs of the operations teams that are trying to be agile in support of Agile Development, DevOps, and every purchased application in the environment. Operations teams should therefore look to AppEnsure, AppFirst, Boundary, Correlsense, ExtraHop, and Virtual Instruments, all of which collect unique data across a broad set of interconnected applications, data that operations needs in order to support rapidly changing applications in its environment. The point is that operations teams have completely different needs from those of developers, and therefore they need a completely different tool.
  • Operations management itself needs to be transformed for Agile Operations. VMware has built vCenter Operations specifically to meet the needs of operations groups running VMware. Zenoss represents the most credible replacement for an entire systems management framework on the market. VMTurbo and CiRBA both allow for resource allocation decisions to be automated in the pursuit of service assurance. CloudPhysics combines sophisticated analytics with operations management data leading to insights and answers that are not available from anyone else. The first step in Agile Operations is having an up-to-date map of your environment, something that is provided by Neebula.
  • Being operationally agile across multiple hypervisors and clouds is extremely challenging. HotLink has a unique solution that allows you to manage all of your virtualization platforms and clouds from within VMware vCenter—while also migrating workloads and their storage between all of these environments as needs dictate.
  • VMware has now successfully implemented software-defined compute, software-defined networking, and software-defined storage. Intigua deserves credit for focusing on software-defined management—meaning the abstraction, pooling and sharing, and centralization of the configuration of all monitoring, backup, and other management solutions that need to be deployed in a modern environment.
  • Cloud management, automation, and orchestration all are a crucial part of the ability to automatically provision and deploy virtual machines and their contents in a managed but automated manner. This is where solutions like VMware vCloud Automation Center, Virtustream, CloudBolt Software, and Puppet Labs all come into play.
  • Lastly, the flood of management data that results from near real-time instrumentation across this management stack defies human comprehension (it turns monitoring into a big data analytics problem). Netuitive and Prelert both have advanced self-learning algorithms that automatically learn the normal behavior of your monitoring data and then alert you when anomalies occur.
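
To give a feel for what "learning normal behavior and flagging anomalies" means in the last point, here is a deliberately simple Python sketch. It is not the algorithm used by Netuitive, Prelert, or any other vendor; it swaps in a basic rolling mean and standard deviation check, and the class name, window size, and threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


class BaselineDetector:
    """Flags values that deviate strongly from a rolling baseline."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # recent values considered normal
        self.threshold = threshold           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # samples needed before judging (at least 2)

    def observe(self, value):
        """Return True if value is anomalous relative to the learned baseline."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0:
                anomalous = abs(value - mu) > self.threshold * sigma
            else:
                # A perfectly flat baseline: any different value is unusual.
                anomalous = value != mu
        # Only normal points update the baseline, so an incident
        # does not quietly become the new normal.
        if not anomalous:
            self.history.append(value)
        return anomalous
```

Feeding the detector a response-time series that hovers around 100 ms and then jumps to 500 ms would flag the jump without anyone having set a static threshold in advance. The trade-off in this sketch is that a legitimate, permanent shift in behavior keeps being flagged until the baseline is reset; real products handle that case with far more sophistication.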

We are far enough down the road in data center virtualization, the software-defined data center, and the cloud to know that legacy management tools are not going to make the transition to these new environments. We know that frameworks are dead. We know that frameworks need to be replaced with ecosystems of cooperating vendors like the one run by Splunk. What we do not know yet is who else is going to form an ecosystem of cooperating vendors, as Splunk has done.


Agile Operations requires that the entire stack of monitoring and management tools be rebuilt from scratch. Legacy frameworks and Franken-Monitors should be uninstalled and replaced with modern tools that can keep up with the rate of change in the environment.