Applications that are changing rapidly due to agile development and DevOps, and that are running on dynamic and distributed infrastructures such as virtualized data centers, software-defined data centers, and private, hybrid, and public clouds, present new challenges in managing application performance. It is imperative to measure application performance with modern application performance management (APM) or application-aware infrastructure performance management tools.
The Need for SDDC and Cloud Application Performance Management
Consider the following scenario. You have one hundred business-critical applications in your company. Eighty percent of these are applications purchased from vendors. You have to support these applications in production and are held responsible for their operation. But you obviously do not own the code, nor do you have access to it. If there is a bug in the application, you report the symptoms of that bug to the application’s vendor, which is responsible for fixing it. However, most of the problems with these applications are not bugs in code, but rather issues in the infrastructure that impact the applications’ operation and performance.
The other 20% of your applications are custom developed. They are developed and maintained in production by your development teams. When a bug impacts the operation of these applications in production, your development team has to find and fix that bug. You use agile development to build these applications, and you use DevOps to support them. You are under tremendous pressure from the business to bring new application functionality to market, yet you are finding that rapid rates of change in applications and the fact that your developers have no idea what consequences their coding practices have on performance conspire against rapid innovation and stability.
The above two scenarios were in place prior to the arrival of virtualization, and even with virtualization they are still in place now, prior to the arrival of the software-defined data center (SDDC) and the cloud. The SDDC and the cloud will make APM into even more of an imperative, for the following reasons:
- Most tools that claim to monitor “application performance” in fact do nothing of the kind. Most only monitor the resources used by the application. When end users of an application complain about its performance, they are complaining about the application’s response time. Therefore, in order to be able to have a meaningful conversation with application owners and business constituents about application performance, IT operations is going to need tools that measure the response time of every application in production. The bottom line is that the definition of performance is not resource utilization, but rather response time.
- The injection of the new layers of software that comprise the SDDC into the stack of software that supports the applications is going to be viewed with great suspicion by application owners and business constituents. Just as Citrix was always blamed for the performance of applications delivered over Citrix, and just as VMware was (and is) always blamed for the performance of newly virtualized applications, the SDDC (and its owners, IT operations) is going to be guilty until proven innocent when it comes to the response time of applications running on it. This will, again, require that IT operations have tools that monitor the response time, throughput, and error rate of every application in production.
- Running response-time-critical applications in public clouds brings up a whole new set of issues. The most important is that what is actually going on in the data center of a cloud provider is completely hidden from you, the customer of the cloud provider. If you want to read about a nightmare that can come from not knowing what is happening in the software infrastructure layers that support your application, read this story about the travails of Rap Genius.
- While first-generation APM tools targeted only custom-developed applications, the SDDC will be running every business-critical application that you own. Therefore, you will need tools appropriate for custom-developed applications and tools appropriate for purchased applications. APM tools are needed to find problems in the code you own and are responsible for. Application-Aware Infrastructure Performance Management (AA-IPM) tools are needed to determine how your infrastructure is impacting your applications and how your applications are impacting your infrastructure.
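To make the distinction between resource utilization and application-level metrics concrete, here is a minimal sketch of how response time, throughput, and error rate could be derived from a window of completed requests. The `Request` record, the `summarize` function, and the sample values are hypothetical illustrations, not the data model of any tool discussed in this article; the percentile uses the simple nearest-rank method.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time from request received to response sent
    ok: bool            # False if the request ended in an error


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(requests, window_s):
    """Summarize one observation window (assumes at least one request)."""
    durations = [r.duration_ms for r in requests]
    errors = sum(1 for r in requests if not r.ok)
    return {
        "median_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "throughput_per_s": len(requests) / window_s,
        "error_rate": errors / len(requests),
    }


# Hypothetical sample: five requests observed over a 10-second window.
window = [
    Request(100, True),
    Request(120, True),
    Request(110, True),
    Request(900, False),
    Request(130, True),
]
print(summarize(window, window_s=10))
```

Note that none of these numbers can be read off a CPU or memory graph: the single 900 ms failed request dominates the 95th percentile even though average resource usage might look healthy.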
Attributes of Modern SDDC and Cloud-Aware APM and AA-IPM Solutions
- Response Time and Throughput: The single most important metric when measuring application performance, especially for applications running in virtual or cloud environments, is application response time. Why? Because this metric accurately reflects the service level that the application is delivering to its users; thus, it is a metric that the applications’ owners will easily buy into as one that represents their concerns. It is therefore essential that you choose performance management solutions that can measure response time for your applications in the most realistic (as close as possible to what the user actually sees) and comprehensive (across all tiers of an application’s system) manner possible. All of the solutions profiled in this article focus on response time, so there is no column for this criterion in the tables below, as it is met by all of the vendors.
For certain applications, throughput is also an extremely important metric. This is so for applications that do a large number of transactions, where throughput within a response time threshold is how servers should be sized. It is also true for applications that run very large transactions (batch jobs); for these applications, the completion of a batch job within a certain period of time is a critical issue.
- Deployment Method: This is where you have to make some difficult tradeoffs. The first tradeoff is an on-premises solution vs. a MaaS-delivered solution (Monitoring as a Service). The advantage of MaaS is that you do not have to maintain the back end; as a vendor adds features to a product, it just upgrades the back end, and you get the new features. The advantage of an on-premises solution is that data about the performance of your business-critical applications is not sent over the Internet to someone else’s data center.
- Data Collection Method: The data collection method and supported application types (directly below) are where you make your tradeoff in the breadth of the applications that you can manage with your APM solution vs. the depth of the analysis. You basically have three sources of data to choose from in a modern APM solution. The first choice is to collect data from the network via a physical or virtual appliance that sits on a physical or virtual mirror port. The virtue of this approach is that it works for every application that you have, regardless of how it was built or whether it was built or purchased. The next choice is a modern, transaction-oriented agent inside the operating system. These agents are very different from the legacy agents that just capture resource utilization statistics. These agents capture the detail about how the applications interact with the OS and how their processes communicate over the network that connects all of the servers that host the application. The last choice is to use an agent that lives in the application runtime environment. This provides for the deepest level of diagnostics and transaction tracing, but it only works for applications that are written to the specific runtimes supported by the APM vendor (you get depth, but you give up breadth).
- Breadth of Supported Applications: Gartner has always defined APM as including deep code diagnostics. Therefore, Gartner has defined APM as being only for custom applications that you have built yourself. Earlier this year, Gartner woke up and realized that 80% of the applications that customers run are purchased, and it created a new category of tools: Application-Aware Infrastructure Performance Management tools, which address all of the applications that a typical enterprise has to support (both custom-developed and purchased).
- Application Topology Discovery: As your application will now be “dynamic,” you will need a tool that can keep up with it and its topology, no matter how it is scaled out or where any component of the application is moved. This means that if it relies on an agent, then that agent must travel with (inside of) the application or the supporting operating system, so that as the application is replicated and moved, the agent comes up and finds its management system. It is also critical that these agents map the topology of the application system from the perspective of what is talking to what. Otherwise, it will be impossible to troubleshoot a system with so many moving parts.
- Private/Hybrid/Public Cloud Ready: If you are thinking about putting all or part of an application in a public cloud, then you need a performance management solution that works when there is no LAN connection or VPN between the agent in the application and the management system. Polling agents that live in clouds will not work, as you cannot assume the existence of an inbound port to poll through. Therefore, the agent needs to initiate the connection and open an outbound port back to the management system, which then needs to be able to catch the incoming traffic in your DMZ. You also need a system that is able to map the topology of your application system across the data centers that it executes in.
- Zero Configuration Required: If you are an agile development shop, then it is essential that you choose an APM solution that can keep up with your rate of enhancement and change in the application. Essentially, this means that you need a “zero-config” APM tool, as with a rapid rate of change in the application you will have no time to update the tool every time you do a release into production.
- Deep-Dive Code Diagnostics: For custom-developed applications, the ability to find the object or method in the code that is the cause of an application slowdown is a critical requirement. The ability to do this in production, with acceptable overhead and minimal configuration, is what distinguishes modern APM tools from the legacy offerings of IBM, BMC, HP, and CA. It is also important that the tool you pick support code diagnostics for the languages you use to develop your applications. Legacy tools tend to support only Java and .NET. Modern tools tend to support PHP, Python, Ruby, and Node.js as well.
- Understanding Application and Infrastructure Impacts: This is what the new category of Application-Aware Infrastructure Performance Management (AA-IPM) tools is all about. If you do not own the code, what you care about is how the application is affecting your infrastructure and how the infrastructure is affecting your application. Understanding this problem in the context of application response time and throughput is such a new problem that only leading-edge and very innovative startup companies do a good job of addressing it. See our detailed post on AA-IPM solutions.
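The topology discovery attribute above (mapping the application system "from the perspective of what is talking to what") can be illustrated with a small sketch. It assumes that some agent or network tap has already produced observations of the form (source component, destination component, response time); the component names, the `build_topology` function, and the `depends_on` helper are all hypothetical and do not represent any vendor's API.

```python
from collections import defaultdict


def build_topology(observations):
    """Group observed calls into edges annotated with call count and average response time."""
    samples = defaultdict(list)
    for source, destination, response_ms in observations:
        samples[(source, destination)].append(response_ms)
    return {
        edge: {"calls": len(times), "avg_ms": sum(times) / len(times)}
        for edge, times in samples.items()
    }


def depends_on(topology, component):
    """Return the components that `component` was observed calling directly."""
    return sorted(dst for (src, dst) in topology if src == component)


# Hypothetical observations: (source, destination, response time in ms).
observations = [
    ("web-1", "app-1", 12.0),
    ("web-1", "app-1", 18.0),
    ("web-1", "app-2", 20.0),
    ("app-1", "db-1", 5.0),
]
topology = build_topology(observations)
print(topology[("web-1", "app-1")])
print(depends_on(topology, "web-1"))
```

Because the map is rebuilt from observed traffic rather than from a static configuration, a newly scaled-out instance such as `app-2` shows up as soon as something talks to it, which is the property a dynamic environment needs.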
APM Tools for Custom-Developed Applications
APM tools, listed in the table below, focus on helping the portion of the development team that supports a custom-developed application in production to rapidly resolve issues with the custom application itself. After making sure that the tool in question supports your application and its execution environment, the single most important criterion for selecting a tool in this category is “time to bug.” Time to bug indicates how long it takes a tool, starting from the time a problem occurs, to notice the problem (how close to real time the instrumentation is), and then, once the alert is raised, how many “clicks to resolution” there are. Notice that the target audience for these tools is developers who can and probably will need to make a code change to resolve such problems. Therefore, these tools are targeted specifically at developers and at applications for which the enterprise owns and maintains the code. If you are currently using a second-generation APM product from a legacy vendor, if it is expensive and difficult to use and maintain, and if it does not fit into your scaled-out and distributed deployment strategy, then you should replace that tool with one of those below.
|Vendor/Product||Product Focus||Deployment Method||Data Collection Method||Supported App Types||Application Topology Discovery||Cloud Ready||“Zero- Config”||Deep Code Diagnostics|
|AppDynamics||Monitor custom-developed Java and .NET applications across internal and external (cloud) deployments||On-Premises/SaaS||Agent inside of the Java JVM or the .NET CLR||Java/.NET/|
|AppNeta (TraceView)||Monitor custom-developed Java and .NET applications across internal and external (cloud) deployments||SaaS||Agent inside of the Java JVM or the .NET CLR||Ruby/Java/ PHP/Python|
|Dynatrace||Monitoring of complex enterprise applications that are based on Java or .NET but may include complex enterprise middleware like IBM MQ and CICS||On-Premises||Agent inside of the Java JVM or the .NET CLR||Java/.NET, Websphere Message Broker CICS, C/C++|
|New Relic (RPM)||Monitor custom-developed Java, .NET, Ruby, Python, and PHP applications across internal and external (cloud) deployments||SaaS||Agent inside of the Java JVM, .NET CLR, or PHP/Python runtime||Ruby/Java/ .NET/PHP/Python|
AppDynamics is a Java/.NET APM solution based on an agent that does byte code instrumentation for Java and .NET based applications. AppDynamics is different from the first generation of Java APM solutions in that it installs and works out of the box, is designed and priced for applications scaled out across a large number of commodity servers, and includes cloud orchestration features designed to automate the process of adding instances of the application in a public cloud, based on sophisticated and intelligent rules. AppDynamics is offered on both a SaaS and an on-premises basis.
AppNeta TraceView is a SaaS-delivered APM service that supports applications written in PHP, Ruby, Python, and Java. Since it is a cloud-delivered service, it is particularly appropriate for new applications written in new languages (like PHP, Python, and Ruby) and deployed to new deployment environments, such as public clouds. TraceView includes the ability to trace transactions across multiple tiers of an application system, including tiers that reside in different data centers.
Dynatrace is a Java and .NET APM solution that is differentiated in its ability to trace individual transactions through complex systems of servers and applications. This is a different level of tracing than just understanding which process on a server is talking with which process on another server: it truly means that individual transactions can be traced from when they hit the first Java or .NET server in the system until they leave the last one (usually to hit the database server). This tracing, combined with in-depth, code-level diagnostics via byte code instrumentation, is what distinguishes Dynatrace. Dynatrace is also the only vendor that can trace individual transactions from inception in a user’s browser through the entire application system.
New Relic pioneered the Monitoring as a Service category by being the first APM vendor to offer robust APM functionality on a SaaS (or, more accurately, MaaS) basis. The product is truly plug-and-play; all you do is sign up, install the agent in your Ruby, Java, .NET, or PHP application, and then log onto a web console that points back to New Relic’s hosted back end of the monitoring system.
Application-Aware Infrastructure Performance Management Tools
Application-Aware Infrastructure Performance Management tools, listed in the table below, focus on helping the operations team that supports every application in production. “Every” means both custom-developed (regardless of what, how, or when) and purchased applications. Since the scope of supported applications includes purchased applications (think SAP, Oracle Financials, Microsoft Exchange, and SharePoint), the objective of the tool is not line-of-code analysis, as this would be useless for a purchased application. Rather, the objective of the tool is to tell IT operations if the problem (slow response time, low throughput, or errors) is in the application or in the infrastructure, and if it is in the infrastructure, to provide the best guidance possible as to where. Notice that the target audience for these tools is someone who may be in IT operations, or who may be on an application support team, but not someone whose job it is to make a code change. If you are currently using a first-generation product from a legacy vendor that is expensive, difficult to use, and difficult to maintain; that probably just measures resource utilization and not response time or throughput; and that does not fit into your scaled-out and distributed deployment strategy, then you should replace that tool with one of the ones below.
|Vendor/Product||Product Focus||Deployment Method||Data Collection Method||Supported App Types||Application Identification||Application Topology Discovery||Cloud Ready||“Zero- Config”|
|AppEnsure||Manage response time and throughput of every Windows and Linux application, whether purchased or custom-developed, physical or virtual, or remote or local, and whether in hybrid or public clouds. Includes automated application discovery, topology mapping, and root cause analysis.||On-premises/ SaaS||Agent inside of the Windows or Linux operating system||All TCP/IP on Windows or Linux|
|AppFirst||Monitor every application in production irrespective of source and deployment||On-premises/ SaaS||Agent inside of the Windows or Linux operating system||All TCP/IP on Windows or Linux|
|BlueStripe FactFinder||Monitor every application in production irrespective of source and deployment||On-premises||Agent inside the Windows, Linux, AIX, or Sun operating system||All TCP/IP on Windows, Linux, AIX, or Sun OS|
|Boundary||Monitor the impacts of network flows on the application||SaaS||Agent inside of the Windows or Linux operating system||All Linux TCP/IP applications|
|Correlsense||Monitor transactions of complex enterprise architecture that are based on a large variety of platforms, languages, and middle tiers||On-premises||Agent inside the Windows, Linux, AIX, or Sun operating system||All TCP/IP on Windows, Linux, AIX, or Sun OS|
|ExtraHop Networks||Monitor every application in production irrespective of source and deployment||On-premises||From a mirror port on a physical switch or the vSphere vSwitch||All TCP/IP regardless of platform|
|Virtual Instruments||Collects sub-second storage latency data and throughput data for Fibre Channel–attached storage||On-premises||A TAP on the SAN, allowing every Fibre Channel transaction to be observed from the outside in||All applications that are dependent on Fibre Channel–attached storage|
AppEnsure is an on-premises and SaaS-delivered APM solution that focuses on identification of each application by name, application topology discovery, end-to-end response time monitoring, and automated root cause analysis across all applications (custom developed or purchased) deployed across any mixture of physical, virtual, and cloud-based environments. AppEnsure is based on an agent in the Windows and Linux OSes that sees the interaction of processes with the operating system, and then sees the interaction, over the network, of those processes with the adjacent components of the application system.
AppFirst is an on-premises and SaaS-delivered APM solution that is most frequently used by SaaS software vendors or delivered through cloud vendors to customers. AppFirst focuses on the collection of a comprehensive set of metrics including OS metrics, performance metrics, log files, and StatsD metrics. It is a perfect complement to New Relic, as it monitors all of the layers of the application system that an agent in the application runtime cannot see.
BlueStripe FactFinder is based on an agent that lives in the Windows, Linux, AIX, and Solaris operating systems and supports all applications running on them. This agent watches the network flow between the OS and everything that it is talking to. Through this process, FactFinder discovers the topology of the applications running on each set of monitored servers and calculates an end-to-end and hop-by-hop response-time metric for each application. Since FactFinder is in the OS and not in the application, FactFinder is able to calculate these response-time metrics for any application that runs on a physical or virtual instance of Windows or Linux. This makes FactFinder the only product that provides this level of out-of-box functionality for such a breadth of applications.
Boundary is a SaaS-delivered solution for both public and private cloud deployments that focuses on using deep and real-time analysis of the infrastructure and network to understand how issues are affecting application performance. Boundary provides DevOps and IT operations teams with a tool for matching operations visibility with modern application change frequency. This is complementary to code-focused APM solutions deployed to the cloud and on premises, since these deployments tend to result in the kinds of complicated interactions between the components of the application system that is Boundary’s focus.
Correlsense uses an agent that lives in the operating system. It observes the interactions between the application and the OS, maps application topologies, and measures transaction timing. This is another solution that provides excellent end-to-end and hop-by-hop application response time and transaction response time across a very wide range of applications and runtime environments.
ExtraHop Networks uses a mirror port on either the physical network or the VMware vSwitch to see all of the network traffic that flows between physical and virtual servers. This source of data means that ExtraHop can see application topologies and measure end-to-end response time for every TCP/IP-based application on the network, without requiring installation of any agents in applications, JVMs, virtual servers, or physical servers. ExtraHop collects layer 2 through 7 TCP/IP data from physical taps on physical switches, from virtual taps on the virtual mirror port of virtual switches, or via agents inserted into the network stack of the Windows or Linux operating system. ExtraHop includes layer 7 decodes of popular protocols like HTTP, which allows it to identify applications that use unique ports and protocols and to measure their end-to-end response time.
Virtual Instruments VirtualWisdom focuses on collecting real-time (sub-second) storage latency and transaction completion information for Fibre Channel–attached storage arrays. This data is collected via a tap that is inserted into the Fibre Channel SAN and sees every transaction that is flowing to and from Fibre Channel–attached storage arrays. VirtualWisdom provides real-time, comprehensive, and deterministic latency information on every storage transaction for a Fibre Channel–attached storage array.
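Several of the tools above report both end-to-end and hop-by-hop response time. As a rough illustration of how the two relate (not a description of how any of these products compute them), the sketch below assumes a simple linear chain of tiers in which each tier's observed time includes the time spent waiting on the next tier. The tier names, the timings, and the `hop_breakdown` function are hypothetical.

```python
def hop_breakdown(hops):
    """Split nested per-tier timings into the time attributable to each tier itself.

    `hops` is ordered outermost tier first; each observed time is assumed to
    include the time spent waiting on the next tier in the chain.
    """
    breakdown = []
    for index, (tier, observed_ms) in enumerate(hops):
        downstream_ms = hops[index + 1][1] if index + 1 < len(hops) else 0.0
        breakdown.append((tier, observed_ms - downstream_ms))
    return breakdown


# Hypothetical transaction: web tier saw 200 ms, of which 150 ms was spent
# waiting on the app tier, of which 90 ms was spent waiting on the database.
transaction = [("web", 200.0), ("app", 150.0), ("db", 90.0)]

end_to_end_ms = transaction[0][1]
per_hop = hop_breakdown(transaction)
slowest = max(per_hop, key=lambda item: item[1])

print(end_to_end_ms, per_hop, slowest)
```

The end-to-end number tells you whether users are affected; the per-hop breakdown tells you which tier to look at first. Real application systems fan out and call tiers in parallel, so a production tool needs considerably more than this linear model.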
How to Choose an SDDC Application Performance Management Solution
The most important thing to do when choosing an APM tool is to focus on what problem you are trying to solve and whose problem you are trying to solve. This leads to the following process:
- Is this a custom-developed application, and is the job of the tool to support its development team by finding bugs in production? If so, focus on a DevOps tool (one of the APM tools for custom-developed applications above). Is the problem supporting the application in production, and is the job of the tool to support an application support person who does not own the code? If so, focus on an AppOps tool (one of the AA-IPM tools above).
- Seriously think about the tradeoffs of an on-premises solution vs. an APM as a Service solution. Those tradeoffs were outlined in this post about CA’s Reaction to APM as a Service. Take a look at what other SaaS services your company is using. If you are already using a SaaS-delivered CRM solution like Salesforce.com, then trusting an APM vendor with the data about your application is not a steep hill to climb.
- Do not consider any APM solution that does not measure response time and throughput for the target set of applications. Inferring application performance from resource utilization is a dead idea that needs to be put to rest.
- Despite some of the marketing and sales messaging that goes back and forth between vendors, the DevOps category of tools does not compete with the AppOps category of tools. In fact, you might be well served buying one of each. AppEnsure, AppFirst, and Boundary are all perfectly sensible complements to SaaS-delivered AppNeta, New Relic, or AppDynamics. AppEnsure, ExtraHop, INETCO, BlueStripe, and Correlsense are perfectly sensible complements to an on-premises installation of AppDynamics or Dynatrace.
- Pursue a strategy of instrumenting every important application in production with an appropriate APM solution. If you are virtualizing business-critical applications (for many organizations, that is all that is left to virtualize), then baselining the performance of the application with an APM solution on physical hardware and then using that baseline as the SLA for its virtualized instance is really the only sensible way to overcome objections from application owners to the virtualization process. Therefore, choosing the right set of APM solutions is a critical part of your virtualization initiatives for 2014 and an absolutely essential part of your strategy for migrating to an SDDC.
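The baselining approach in the last point can be sketched as follows. This is a minimal illustration, assuming you have response-time samples (in milliseconds) exported from whatever APM tool you use, first on physical hardware and then on the virtualized instance. The function names, the choice of the 95th percentile, and the 10% tolerance are assumptions made for the example, not recommendations from any vendor.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_against_baseline(baseline_ms, virtualized_ms, pct=95, tolerance=0.10):
    """Compare a virtualized instance's response time to the physical baseline.

    The baseline percentile plus a tolerance acts as the SLA limit.
    """
    baseline = percentile(baseline_ms, pct)
    virtualized = percentile(virtualized_ms, pct)
    limit = baseline * (1 + tolerance)
    return {
        "baseline_ms": baseline,
        "virtualized_ms": virtualized,
        "limit_ms": limit,
        "within_sla": virtualized <= limit,
    }


# Hypothetical samples collected before and after virtualization.
physical = [100, 110, 120, 130, 140]
virtual_ok = [105, 115, 125, 135, 150]
virtual_slow = [105, 115, 125, 135, 170]

print(check_against_baseline(physical, virtual_ok))
print(check_against_baseline(physical, virtual_slow))
```

Agreeing on the percentile and tolerance with the application owner before the migration is what turns the baseline into an SLA rather than an after-the-fact argument.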
Application Performance Management will be a critical part of ensuring that the applications that matter to your business are highly available and perform well in your software-defined data center and clouds. Running rapidly changing applications on a highly dynamic software infrastructure will lead to intractable problems unless proper APM tools are deployed in your SDDC.