Recently it has become abundantly clear that there is great turmoil in the business of Application Performance Management (APM). CA sponsored a study by IDG Research Services that concluded that “Most Enterprises are Approaching APM SaaS Cautiously”. Separately, Information Week asked “What’s Killing APM”, and concluded that “App performance management is seen as less important than it was two years ago, partly because vendors haven’t kept up”. These problems are being caused by first and second generation APM solutions and are driving the market for third generation APM solutions.
The State of APM in Most Enterprises
To put these changes in context, it is useful to start with an understanding of where the APM industry is. There are three distinct phases or generations of APM solutions:
- The first generation of APM was based upon the concept of measuring how the processes that comprised the application used resources and attempted to infer the performance (response time) of the application from resource utilization. This worked reasonably well for applications running on dedicated physical hardware as spikes in resource utilization were often correlated with issues in application performance. But this approach did not measure the performance of the application directly, which opened the door to the second phase.
- The second generation of APM was pioneered by Wily Technology in 1996 (subsequently sold to CA in 2006), which figured out a way to inject an agent into a Java run time. This allowed for the direct measurement of how long it took for each function (object or method) in the application to do its job, and for the direct observation of errors in code, crashes in code, and stalled transactions. The business of managing code quality in production was born. Along with CA, which bought Wily, IBM, Quest, and Compuware all bought Java agent based startups, and the second phase of APM became a big company management software business as of 2006.
- We are now living through a third generation of APM which is reinventing the category along several dimensions. Those include supporting more language platforms, delivering APM as a service, supporting distributed and scaled out deployment models, and doing APM via the network (Network based APM). These will be discussed in detail below.
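The difference between inferring performance from resource utilization and measuring it directly can be illustrated with a small sketch. Second generation agents do this through byte code instrumentation inside the Java run time; the Python decorator below is only an analogy for the same idea, and the names `instrumented`, `timings`, and `lookup_order` are hypothetical.

```python
import functools
import time
from collections import defaultdict

# Hypothetical in-memory store: function name -> list of (seconds, succeeded)
timings = defaultdict(list)

def instrumented(func):
    """Record the wall-clock duration and outcome of every call to func."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        ok = False
        try:
            result = func(*args, **kwargs)
            ok = True
            return result
        finally:
            # Runs on both normal return and exception, so errors are counted too.
            timings[func.__qualname__].append((time.perf_counter() - start, ok))
    return wrapper

@instrumented
def lookup_order(order_id):
    time.sleep(0.01)  # stand-in for real application work
    if order_id < 0:
        raise ValueError("invalid order id")
    return {"id": order_id}

lookup_order(1)
try:
    lookup_order(-1)
except ValueError:
    pass

for name, calls in timings.items():
    errors = sum(1 for _, ok in calls if not ok)
    slowest = max(seconds for seconds, _ in calls)
    print(f"{name}: calls={len(calls)} errors={errors} slowest={slowest:.3f}s")
```

The point of the sketch is that response time and errors are observed per function call, rather than guessed at from CPU or memory spikes.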
Changes in Application Development and Deployment
- Enterprises have learned that the demand to implement business functionality in software is infinite. Once implemented, the demands to change it are infinite. The software development backlog is infinitely long (the existence of the backlog discourages new requests).
- Enterprises have discovered that implementing business processes in software is of very high value. It equates to competitive advantage, or to such a competitive necessity that failure to do it right leads to extinction. Witness the physical retailers taking on the Internet retailers on the web this Christmas season. Target realizes full well that Amazon and not some other chain of stores is its biggest competitor, and that failure to compete on the web leads to extinction.
- The imperative to deliver software and updates to software quickly is driving many changes. The first of them is the adoption of Agile Development as a method of getting software into production more quickly, and evolving it continuously once it is in production.
- The second change is driven by the unrelenting pressure to be able to do more software development with less expensive people. This has been going on since Microsoft popularized Visual Basic. Recent innovations in this area include Ruby, PHP, Python, and Node.js. Tool innovation and proliferation is driven by the economics of software development. This results in an inexorable push towards tools that are easier to use, and that abstract the developer further away from any knowledge of and responsibility for what is actually happening inside of the application.
- Agile Development works best when an application is broken into components, and each component is then managed by a small team. This results in what used to be monolithic applications being broken up into highly distributed applications with many tiers.
- The continued improvement in the price/performance of Intel servers along with cheap operating systems (Linux) and cheap application runtimes (Tomcat or vFabric) has replaced a few large J2EE application servers with many smaller commodity servers. The combination of the componentization of the application outlined in the previous point with the scaling out of applications across large farms of commodity servers means that an N-layer application is often spread across hundreds and even thousands of servers.
- Many enterprises are finding that mobile platforms are critical end user devices that need to be supported with locally installed applications. This has added a new layer to what is already an N-layer architecture for many of these applications.
- These scaled out applications are being run in not just one data center, but data centers distributed across the world, and distributed across ones the enterprise owns, and ones rented from public cloud providers of various stripes.
- Enterprises are now virtualizing business critical applications with a vengeance. This is leading many important purchased and custom developed applications which used to run on dedicated hardware to now run in shared and dynamic execution environments. This is in turn leading the owners of these applications to demand that the IT organization ensure the performance of these applications in production. This is directly driving the need for APM solutions that can monitor every application in a virtualized environment, something that neither first nor second generation solutions can do.
Factors Driving the Third Generation of the APM Market
- Focus on Response Time and Throughput: The single most important metric when measuring application performance, and especially for applications running in virtual or cloud environments, is application response time. The reason for this is that this metric accurately reflects the service level that the application is delivering to its users, and it is therefore a metric that the application owners will easily buy into as one that represents their concerns. It is therefore essential that you choose an APM solution that can measure response time for your applications in the most realistic (as close as possible to what the user actually sees) and comprehensive (across all tiers of the application system) manner possible. All of the solutions profiled in this article focus upon response time, so there is no column for this criterion in the tables below, as it is met by all of the vendors. For certain applications, throughput is also an extremely important metric. This is so for applications that do a large number of transactions, where throughput within a response time threshold is how servers should be sized. It is also true for applications that run very large transactions (batch jobs) where the completion of the batch job within a certain period of time is a critical issue.
- Deployment Method. This is where you have to make some difficult tradeoffs. The first tradeoff is an on premise solution vs a SaaS delivered solution (Monitoring as a Service). The advantage of MaaS is that you do not have to maintain the back end, and as the vendor adds features to the product, they just upgrade the back end and you get the new features. The advantage of an on-premise solution is that data about the performance of your business critical applications is not sent over the Internet to someone else’s data center.
- Data Collection Method. This and Supported Application Types (directly below) is where you make your tradeoff in the breadth of the applications that you can manage with your APM solution vs the depth of the analysis. You basically have three sources of data to choose from in a modern APM solution. The first choice is to collect the data from the network via a physical or virtual appliance that sits on a physical or virtual mirror port. The virtue of this approach is that it works for every application that you have – irrespective of how it was built, or whether it was built or purchased. The next choice is a modern transaction oriented agent inside the operating system. These are very different agents than the legacy agents that just capture resource utilization statistics. These agents capture the detail of how the applications interact with the OS, and how the processes that comprise the application communicate over the network that connects all of the servers that host the application. The last choice is to use an agent that lives in the application run time environment. This provides for the deepest level of diagnostics and transaction tracing, but only works for applications that are written to the specific run times supported by the APM vendor (you get depth, but you give up breadth).
- Breadth of Supported Application Types: The APM solution has to work with and support your application architectures. If you just have web based applications with Java middle tiers and database back ends there are many good tools to choose from. The more you diverge from HTTP/.NET/Java/SQL as the application architecture the fewer tools there are to choose from. If your application has a proprietary front end (a Win32 client), a proprietary middle tier (maybe something written in COM+, or C++ ten years ago) and a database that no one supports then you need to look at a tool that operates at the TCP/IP layer since instrumenting the application itself will likely be impossible. However, in so doing you will give up the insights into the business logic that Java and .NET aware tools provide. The tradeoff between depth of code analysis and breadth of applications supported is what differentiates the “DevOps” category of APM tools from the “AppOps” category of APM tools.
- Application Topology Discovery: As your application will now be “dynamic” you will need a tool that can keep up with your application and its topology no matter how it is scaled out, and no matter where a component of the application is moved. This means that if the APM tool relies upon an agent, then that agent must travel with (inside of) the application so that as the application is replicated and moved, the agent comes up and finds its management system. It is also critical that these agents map the topology of the application system from the perspective of what is talking to what. Otherwise it will be impossible to troubleshoot a system with so many moving parts.
- Private/Hybrid/Public Cloud Ready: If you are thinking about putting all or a part of an application in a public cloud, then you need an APM solution that works when there is no LAN connection or VPN between the agent in the application and the management system. Polling agents that live in clouds will not work, as you cannot assume the existence of an inbound port to poll through. Therefore the agent needs to initiate the connection by opening an outbound port back to the management system, which in turn needs to be able to catch the incoming traffic in your DMZ. You also need a system that is able to map the topology of your application system across the data centers that it executes in.
- Zero Configuration Required: If you are an Agile Development shop, then it is essential that you choose an APM solution that can keep up with your rate of enhancement and change in the application. Essentially this means that you need a “zero-config” APM tool, as with a rapid rate of change in the application you will have no time to update the tool every time you do a release into production.
- Deep-Dive Code Diagnostics: For some really performance critical applications, being able to trace transactions through the layers of an application system can be invaluable when it comes to understanding end-to-end performance, but this capability is traded off against breadth of platform support. The bottom line is that you cannot have both the deepest possible visibility, and the broadest possible support for applications architectures in one product.
- Support for New Application Run-Times and Languages. You need to carefully marry your application development strategy with your APM strategy. If it is appropriate for you to specify a standard language and run-time for your organization, then pick a tool that supports that language or run time. If on the other hand the requirements are so diverse and the needs for responsiveness are so demanding that you are ending up with not just Java and .NET, but also some combination of PHP, Python, Ruby, and Node.js, then look either for a solution that supports your set of languages or one that is agnostic as to how the application is built. Recognize that if you go the agnostic route you will lose line of code visibility as a result.
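To make the first factor concrete, here is a minimal sketch of how response time and "throughput within a response time threshold" might be computed from a window of completed transactions. The sample data, the one second threshold, and the function names `percentile` and `summarize` are assumptions made for illustration, not taken from any vendor's product.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(response_times, window_seconds, threshold_seconds):
    """Summarize one observation window of completed transactions."""
    within = [rt for rt in response_times if rt <= threshold_seconds]
    return {
        "p95_response_time": percentile(response_times, 95),
        "throughput_per_sec": len(response_times) / window_seconds,
        # Only transactions that met the response time threshold count here.
        "throughput_within_threshold_per_sec": len(within) / window_seconds,
    }

# Made-up sample: ten transactions completed in a five second window.
sample = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 1.5, 2.0]
summary = summarize(sample, window_seconds=5, threshold_seconds=1.0)
print(summary)
```

Sizing servers against the third number rather than raw throughput avoids counting work that was completed too slowly to be useful to the user.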
The “DevOps” Category of APM Tools
The DevOps category of APM tools, listed in the table below, focuses upon helping the portion of the development team that is supporting a custom developed application in production rapidly resolve issues with the application itself. After making sure that the tool supports your application and its execution environment, the single most important criterion for selecting a tool in this category is “time to bug”. Time to bug means, from the time that the problem occurred, how long did it take the tool to notice the problem (how close to real time is the instrumentation), and then once the alert is raised, how many “clicks to resolution” are there. Notice that the target audience of these tools is a developer who can and probably will need to make a code change to resolve the problem. Therefore these tools are targeted specifically at developers and at applications where the enterprise owns the code and has the responsibility for maintaining it. If you are currently using a second generation APM product from a legacy vendor that is expensive, difficult to use, difficult to maintain, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
Third Generation DevOps Tools
|Vendor/Product||Product Focus||Deployment Method||Data Collection Method||Supported App Types||Application Topology Discovery||Cloud Ready||“Zero-Config”||Deep Code Diagnostics|
|AppDynamics||Monitor custom developed Java and .NET applications across internal and external (cloud) deployments||On Premise/SaaS||Agent inside of the Java JVM or the .NET CLR||Java/.NET|
|dynaTrace (Compuware)||Monitoring of complex enterprise applications that are based on Java or .NET but which may include complex enterprise middleware like IBM MQ and CICS||On Premise||Agent inside of the Java JVM or the .NET CLR||Java/.NET, Websphere Message Broker, CICS, C/C++|
|New Relic RPM||Monitor custom developed Java, .NET, Ruby, Python, and PHP applications across internal and external (cloud) deployments||SaaS||Agent inside of the Java JVM, .NET CLR, or the PHP/Python runtime||Ruby/Java/.NET/PHP/Python|
|VMware vFabric APM||Monitor custom developed Java applications in production. Strong integration with the rest of the VMware product line including automated remediation and scaling.||On Premise||Mirror port on the vSphere vSwitch and an agent inside the Java JVM||HTTP/Java/.NET/SQL|
AppDynamics is a Java/.NET APM solution based upon an agent that does byte code instrumentation for Java and .NET based applications. AppDynamics is different from the first generation of Java APM solutions in that it installs and works out of the box, it is designed and priced for applications scaled out across a large number of commodity servers, and it includes cloud orchestration features designed to automate the process of adding instances of the application in a public cloud based upon sophisticated and intelligent rules. AppDynamics is offered on both a SaaS and on-premise basis.
Compuware dynaTrace is a Java and .NET APM solution that is differentiated in its ability to trace individual transactions through complex systems of servers and applications. This is a different level of tracing than just understanding which process on a server is talking to which process on another server. It truly means that individual transactions can be traced from when they hit the first Java or .NET server until they leave the last one in the system (usually to hit the database server). This tracing combined with in depth code level diagnostics via byte code instrumentation is what distinguishes dynaTrace. dynaTrace is also the only vendor that can trace individual transactions from their inception in the user’s browser through the entire application system.
New Relic pioneered the Monitoring as a Service category by being the first APM vendor to offer robust APM functionality on a SaaS (or more accurately MaaS) basis. The product is truly plug and play: all you do is sign up, install the agent in your Ruby, Java, .NET or PHP application, and then log onto a web console that points back to New Relic’s hosted back end of the monitoring system.
VMware vFabric APM is based upon a virtual appliance that collects data from a promiscuous port on the vSwitch (or Nexus 1000v) in the VMware host and a new agent that lives inside of the Java virtual machine that hosts your web/Java/database application. vFabric APM is therefore a combination of some breadth in application support, as with the virtual appliance approach it can see all TCP/IP traffic on the virtual networks, and some depth, as with the Java agent it can see deeply into the performance of the actual applications. VMware will also be building automatic remediation into vFabric APM so that when issues occur they can be automatically addressed. The issue with vFabric APM is that it only works for applications running on the vSphere platform, which means of course that it does not support applications running on physical hardware.
The AppOps Category of APM Tools
The AppOps category of APM tools, listed in the table below, focuses upon helping the team that supports every application in production. Every means both custom developed applications (no matter in what, how, or when they were built) and purchased applications. Since the scope of supported applications includes purchased applications (think SAP, Oracle Financial, Microsoft Exchange and SharePoint), the objective of the tool is not line of code analysis, as this would be useless for a purchased application. Rather the objective of the tool is to tell you if the problem (slow response time, slow throughput or errors) is in the application or in the infrastructure, and if it is in the infrastructure, to provide the best guidance possible as to where. Notice that the target audience of these tools is someone who may be in IT Operations, or who may be on an application support team, but not someone whose job it is to make a code change. If you are currently using a first generation APM product from a legacy vendor that is expensive, difficult to use, difficult to maintain, that probably just measures resource utilization and does not measure response time and throughput, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
Third Generation AppOps Tools
|Vendor/Product||Product Focus||Deployment Method||Data Collection Method||Supported App Types||Application Topology Discovery||Cloud Ready||“Zero-Config”||Deep Code Diagnostics|
|AppEnsure||Monitor every application in production irrespective of source and deployment||SaaS||Agent inside of the Windows or Linux Operating System||All TCP/IP on Windows or Linux|
|AppFirst||Monitor every application in production irrespective of source and deployment||SaaS||Agent inside of the Windows or Linux Operating System||All TCP/IP on Windows or Linux|
|BlueStripe FactFinder||Monitor every application in production irrespective of source and deployment||On Premise||Agent inside the Windows, Linux, AIX or Sun Operating System||All TCP/IP on Windows, Linux, AIX, or Sun OS|
|Boundary||Monitor the impacts of network flows upon the application||SaaS||Agent inside of the Windows or Linux operating system||All Linux TCP/IP applications|
|Confio Software IgniteVM||Monitor database performance especially in conjunction with the performance of the underlying storage||On Premise||Agentless collection of detailed database data and storage latency data from vSphere||DB2, Oracle, and SQL Server Database applications running on vSphere|
|Correlsense||Monitoring transactions of complex enterprise architectures that are based on a large variety of platforms, languages and middle-tiers||On Premise||Agent inside the Windows, Linux, AIX or Sun Operating System||All TCP/IP on Windows, Linux, AIX, or Sun OS|
|ExtraHop Networks||Monitor every application in production irrespective of source and deployment||On Premise||From a mirror port on a physical switch or the vSphere vSwitch||All TCP/IP regardless of platform|
|Splunk||Collection of logs and many other metrics into an easily searchable “big data” database||On Premise/ SaaS||A wide variety of collectors that interface to log sources and other sources of data||An application for which a log of some type is generated|
AppEnsure is a SaaS delivered APM solution focused on providing application discovery, application topology discovery, end-to-end response time monitoring and automated root cause analysis across all applications (custom developed or purchased) deployed across any mixture of physical, virtual, or cloud based environments.
AppFirst is a SaaS delivered APM solution that is most frequently used by SaaS software vendors or delivered through cloud vendors to customers. It is a perfect complement to New Relic as it monitors all of the layers of the application system which an agent in the application run-time cannot see.
BlueStripe FactFinder is based upon an agent that lives in the Windows or Linux OS that supports the application. This agent watches the network flow between that OS and everything that it is talking to. Through this process FactFinder discovers the topology map of the applications running on each set of monitored servers, and calculates an end-to-end and hop-by-hop response time metric for each application. Since FactFinder is in the OS and not in the application, FactFinder is able to calculate these response time metrics for any application that runs on a physical or virtual instance of Windows or Linux. This makes FactFinder into the only product that provides this level of out of the box functionality for such a breadth of applications.
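The general idea of deriving a topology map and hop-by-hop response times from observed network flows can be sketched in a few lines. This is not how FactFinder is implemented; the flow record format, host names, and the functions `build_topology` and `callees` are all hypothetical and exist only to illustrate the "what is talking to what" view.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical flow records as an OS-level agent might report them:
# (source host, destination host, destination port, response time in seconds)
flows = [
    ("web01", "app01", 8080, 0.120),
    ("web01", "app01", 8080, 0.180),
    ("app01", "db01", 5432, 0.040),
    ("app01", "db01", 5432, 0.060),
    ("app01", "cache01", 6379, 0.002),
]

def build_topology(flows):
    """Group flows into edges and average the response time seen on each hop."""
    edges = defaultdict(list)
    for src, dst, port, response_time in flows:
        edges[(src, dst, port)].append(response_time)
    return {edge: mean(times) for edge, times in edges.items()}

def callees(topology, host):
    """Return the hosts that the given host talks to, sorted by name."""
    return sorted({dst for (src, dst, _port) in topology if src == host})

topology = build_topology(flows)
for (src, dst, port), avg in sorted(topology.items()):
    print(f"{src} -> {dst}:{port}  avg {avg * 1000:.1f} ms")
print("app01 depends on:", callees(topology, "app01"))
```

Because the edges are discovered from traffic rather than configured by hand, the map keeps up as components are added, replicated, or moved.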
Boundary is a SaaS delivered APM solution that focuses on using deep and near-real-time analysis of the network to understand how infrastructure issues are impacting application performance. Boundary is also a perfect complement to New Relic as it collects the precise set of data that leads to infrastructure impacts upon applications.
Confio Software IgniteVM is a tool that combines deep visibility into the performance (response time) of database queries with a time correlated view of the storage latency of the arrays that support that database in a VMware vSphere environment. Since a very high percentage of the problems in application performance are tied to database performance and the underlying storage performance issues, IgniteVM is extremely useful for customers running performance critical database applications on a vSphere platform.
Correlsense also makes use of agents that live in the operating system, and which use interactions between the application and the OS to map application topologies and time transaction performance. This is another solution that provides excellent end-to-end and hop-by-hop application response time and transaction response time across a very wide range of applications and runtime environments.
ExtraHop Networks uses a mirror port on either the physical network or a mirror port on the VMware vSwitch to see all of the network traffic that flows between physical and virtual servers. This source of data means that ExtraHop can see application topologies and measure end-to-end response time for every TCP/IP based application on your network without requiring the installation of any agents in applications, JVM, virtual servers, or physical servers.
Splunk is the leader in collecting a large variety of log and other monitoring data and putting that into an infinitely scalable and easily searchable big data back end. While most people think of Splunk as a log analysis solution, a substantial portion of Splunk’s customers tie the logs from components of the application system together into a dashboard, which is then used to provide an end-to-end view of the behavior of the application system.
How to Choose a Third Generation APM Tool
The most important thing to do when choosing an APM tool is to focus upon what problem you are trying to solve, and whose problem you are trying to solve. This leads to the following process:
- Is this a custom developed application and the job of the tool is to support the development team of the application by finding bugs in production? If so, focus upon a DevOps tool. Is the problem supporting the application in production and is the job of the tool to support an application support person who does not own the code? If so, focus upon an AppOps tool.
- Seriously think about the tradeoffs of an on-premise solution vs an APM as a Service solution. Those tradeoffs were outlined in this post about CA’s Reaction to APM as a Service. Take a look at what other SaaS services your company is using. If you are already using a SaaS delivered CRM solution like SalesForce.com, then trusting an APM vendor with the data about your application is not a steep hill to climb.
- Do not consider any APM solution that does not measure response time and throughput for the target set of applications. Inferring application performance from resource utilization is a dead idea that needs to be put to rest.
- Realize that, per the Information Week article referenced at the start of this post, the legacy APM vendors have failed to keep their products up to date with modern use cases and requirements. If you are using one of these products, you may find that licensing a third generation APM solution is less expensive than paying the maintenance on a first or second generation legacy solution.
- Despite some of the marketing and sales messaging that goes back and forth between the vendors, the DevOps category of tools does not compete with the AppOps category of tools. In fact you might be well served buying one of each. AppEnsure, AppFirst and Boundary are all perfectly sensible complements to SaaS delivered New Relic or AppDynamics. ExtraHop, BlueStripe, and Correlsense are perfectly sensible complements to an on-premise installation of AppDynamics or dynaTrace.
- Pursue a strategy of instrumenting every important application in production with an appropriate APM solution. If you are virtualizing business critical applications (for many organizations that is all that is left to virtualize), then baselining the performance of the application with an APM solution on physical hardware and then using that baseline as the SLA for its virtualized instance is really the only sensible way to overcome objections from application owners to the virtualization process. Therefore choosing the right set of APM solutions is a critical part of your virtualization initiatives for 2012 and 2013.
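The baselining step in the last point can be sketched as follows. This is a minimal illustration under stated assumptions: the 95th percentile as the baseline statistic, the 10% tolerance, the sample response times, and the names `build_sla` and `meets_sla` are all choices made for this example rather than a recommendation from any vendor.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def build_sla(physical_response_times, pct=95, tolerance=1.10):
    """Derive an SLA limit from response times measured on physical hardware."""
    baseline = percentile(physical_response_times, pct)
    # Assumed policy: the virtualized instance may be up to 10% slower.
    return {"pct": pct, "baseline": baseline, "limit": baseline * tolerance}

def meets_sla(virtual_response_times, sla):
    """Check the same percentile on the virtualized instance against the limit."""
    return percentile(virtual_response_times, sla["pct"]) <= sla["limit"]

# Made-up response times in seconds.
physical = [0.20, 0.22, 0.25, 0.21, 0.30, 0.24, 0.23, 0.26, 0.22, 0.28]
virtual_ok = [0.21, 0.23, 0.26, 0.22, 0.31, 0.25, 0.24, 0.27, 0.23, 0.29]
virtual_slow = [0.21, 0.23, 0.26, 0.22, 0.45, 0.25, 0.24, 0.27, 0.23, 0.29]

sla = build_sla(physical)
print("SLA:", sla)
print("virtual_ok meets SLA:", meets_sla(virtual_ok, sla))
print("virtual_slow meets SLA:", meets_sla(virtual_slow, sla))
```

The useful property is that the application owner and the virtualization team agree on a number before the migration, and the same measurement is applied on both sides.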
Legacy first and second generation APM tools have failed to keep up with customer requirements. These tools need to be replaced with third generation APM tools that have been built for today’s development and deployment paradigms. Third generation tools are also a critical part of the process of virtualizing business critical applications.