The most frequently encountered barrier to virtualizing business critical applications is a “concern” on the part of the application owners that the applications will “not run as well”, or “not perform as well” in a shared and dynamic virtual environment as it does in a dedicated physical environment. Depending upon who has what political power, these concerns can stop the project to virtualize these applications dead in its tracks.
The Current State of Monitoring Business Critical Applications for Performance
The first step in addressing the performance concerns of someone who owns or supports a business critical application is to understand how (if at all) the application is being monitored today. Less than 5% of the business critical and performance critical applications in large enterprises are being monitored by something that actually measures the response time and throughput of the application. Therefore chances are that even though that application might be very important, what is in place to monitor it will be one of the options below:
- Basic Resource Utilization Monitoring – a lot of monitoring software has been sold through years with an “application” label on it that does not monitor the application at all. Much of this software simply monitors either the hardware that supports the application, or how the hardware resources are being used. The most common products simply collect resource utilization statistics from the operating system that supports the application. This approach is useful in a physical environment where a server is dedicated to an application, but it is not terribly useful in a virtualized environment. This is because in a virtualized environment resources are so shared and abstracted that you really cannot infer how an application is actually performing by looking at resource utilization.
- Middleware Monitoring – Again a lot of monitoring software has been sold with an application performance label on it that monitors the web servers, java servers, .NET servers, and database servers that underlie an application. Much of this software uses agentless techniques like JMX and WMI to collect data from the environment. While these solutions can often tell you if the monitored middleware layer is having a problem, they cannot tell you if the application itself is providing an acceptable level of service to its users.
- Synthetic Transaction Monitoring – For many applications this is the only type of true application performance monitoring being used. This might in fact be an entirely acceptable solution for an application that changes infrequently, and is deployed in a static environment. But if you throw Agile development and dynamic platforms (like virtualization and clouds) into the mix, then any approach that requires that a script be updated whenever the application or its environment changes simply falls down.
The Application Performance Management Disconnect
The bottom line is that they way in which applications are being monitored in the physical world does not lead to an ability on the part of either the application owner or the owner of the virtual infrastructure to be able to measure and ensure what everyone really cares about – the end user experience of the application. End user experience is tightly tied to the concepts of response time and throughput, and the three legacy approaches listed above simply cannot measure response time and throughput for hundreds of different kinds of applications as they start migrating into the virtualized data center.
Therefore the fact is that the application owner is going to demand of the virtualization team that the virtualization team provide a higher level of performance assurance for the application then the application owner was willing and able to provide himself. Modern virtualization aware and cloud aware APM solutions that can easily address large numbers of different kinds of applications with little to no manual configuration are therefore essential to the process of virtualizing business critical applications.
Attributes of Modern Virtualization and Cloud Aware APM Solutions
- Response Time and Throughput: The single most important metric when measuring applications performance, and especially applications performance for applications running in virtual or cloud environments is applications response time. The reason for this is that this metric accurately reflects the service level that the application is delivering to its users, and it is therefore a metric that the applications owners will easily buy into as one that represents their concerns. It is therefore essential that you choose an APM solution that can measure response time for your applications in the most realistic (as close as possible to what the user actually sees) and comprehensive (across all tiers of the applications system) as possible. All of the solutions profiled in this article focus upon response time, so there is not column for this criteria in the table below, as it is met by all of the vendors. For certain applications, throughput is also an extremely important metric. This is so for applications that do a large number of transactions, where throughput within a response time threshold is how servers should be sized. It is also true for applications that run very large transactions (batch jobs) where the completion of the batch job within a certain period of time is a critical issue.
- Deployment Method. This is where you have to make some difficult tradeoffs. The first tradeoff is an on premise solution vs a SaaS delivered solution (Monitoring as a Service). The advantage of MaaS is that you do not have to maintain the back end, and as the vendor adds features to the product, they just upgrade the back end and you get the new features. The advantage of an on-premise solution is that data about the performance of your business critical applications is not sent over the Internet to someone else’s data center.
- Data Collection Method. This and Supported Application Types (directly below) is where you make your tradeoff in the breadth of the applications that you can manage with your APM solution vs the depth of the analysis. You basically have three sources of data to choose from in a modern APM solution. The first choice is to collect the data from the network via a physical or virtual appliance that sits on a physical or virtual mirror port. The virtue of this approach is that it works for every application that you have – irrespective of how it was built, or whether it was built or purchased. The next choice is a modern transaction oriented agent inside the operating system. These are very different agents than the legacy agents that just capture resource utilization statistics. These agents capture the detail of how the applications interact with the OS, and how the processes that comprise the application communicate over the network that connects all of the servers that host the application. The last choice is to use an agent that lives in the application run time environment. This provides for the deepest level of diagnostics and transaction tracing, but only works for applications that are written to the specific run times supported by the APM vendor (you get depth, but you give up breadth).
- Breadth of Supported Application Types: The APM solution has to work with and support your applications architectures. If you just have web based applications with Java middle tiers and database back ends there are many good tools to choose from. The more you diverge from HTTP/.NET/Java/SQL as the applications architecture the fewer tools there are to choose from. If your application has a proprietary front end (a Win32 client), a proprietary middle tier (maybe something written in COM+, or C++ ten years ago) and a database that no one supports then you need to look for at a tool that operates at the TCP/IP layer since instrumenting the application itself will likely be impossible. However, in so doing you will give up the insights into the business logic that Java and .Net aware tools provide. The tradeoff between depth of code analysis and breadth of applications supported is what differentiates the “DevOps” category of APM tools from the “AppOps” category of APM tools.
- Application Topology Discovery: As your application will now be “dynamic” you will need a tool that can keep up with your application and its topology no matter how it is scaled out, and no matter where a component of the application is moved. This means that if the APM tool relies upon an agent, then that agent must travel with (inside) of the application so that as the application is replicated and moved, the agent comes up and finds its management system. It is also critical that these agents map the topology of the application system from the perspective of what is talking to what. Otherwise it will be impossible to troubleshoot a system with so many moving parts.
- Private/Hybrid/Public Cloud Ready: If you are thinking about putting all or a part of an application in a public cloud, then you need an APM solution that works when there is no LAN connection or VPN between the agent in the application and the management system. Polling agents that live in clouds will not work, as you cannot assume the existence of an inbound port to poll through. Therefore the agent needs to initiate the connection, open an outbound port back to the management system, and which then needs to be able to catch the incoming traffic in your DMZ. You also need a system that is able to map the topology of your application system across the data centers that it executes in.
- Zero Configuration Required: If you are an Agile Development shop, then it is essential that you choose an APM solution that can keep up with your rate of enhancement and change in the application. Essentially this means that you need a “zero-config” APM tool, as with a rapid rate of change in the application you will have no time to update the tool every time you do a release into production.
- Deep-Dive Code Diagnostics: For some really performance critical applications, being able to trace transactions through the layers of an application system can be invaluable when it comes to understanding end-to-end performance, but this capability is traded off against breadth of platform support. The bottom line is that you cannot have both the deepest possible visibility, and the broadest possible support for applications architectures in one product.
- Support for New Application Run-Times and Languages. You need to carefully marry your application development strategy with your APM strategy. If it is appropriate for you to specify a standard language and run-time for your organization, then pick a tool that supports that language or run time. If on the other hand the requirements are so diverse and the needs for responsiveness are so demanding that you are ending up with not just Java and .NET,but also some combination of PHP, Python, Ruby, and Node-JS then look either for a solution that supports your set of languages or one that is agnostic as to how the application is built. Recognize that if you go the agnostic route you will lose line of code visibility as a result.
The “DevOps” Category of APM Tools
The DeveOps category of APM tools, listed in the table below, focus upon helping the portion of the development team that is supporting a custom developed application in production, rapidly resolve issues with the application itself. After making sure that the tools supports your application and its execution environment, the single most important criteria for selecting a tool in this category is “time to bug”. Time to bug means from the time that the problem occurred, how long did it take the tool to notice the problem (how close to real time is the instrumentation), and then once the alert is raised, how many “clicks to resolution” are there. Notice that the target audience of these tools is a developer who can and probably will need to make a code change to resolve the problem. Therefore these tools are targeted specifically at developers and at applications where the enterprise owns the code and has the responsibility for maintaining it. If you are currently using a second generation APM product from a legacy vendor that is expensive, difficult to use, difficult to maintain, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
|Vendor/Product||Product Focus||Deployment Method||Data Collection Method||Supported App Types||Application Topology Discovery||Cloud Ready||“Zero- Config”||Deep Code Diagnostics|
|AppDynamics||Monitor custom developed Java and .NET applications across internal and external (cloud) deployments||On Premise/SaaS||Agent inside of the Java JVM or the .NET CLR||Java/.NET|
|dynaTrace (Compuware)||Monitoring of complex enteprise applicatons that are based on Java or .NET but which may include complex enterprise middleware like IBM MQ and CICS||On Premise||Agent inside of the Java JVM or the .NET CLR||Java/.NET, Websphere Message Broker CICS, C/C++|
|New Relic RPM||Monitor custom developed Java, .NET, Ruby, Python, and PHP applications across internal and external (cloud) deployments||SaaS||Agent inside of the Java JVM, NET CLR, or the PHP/Python runtime||Ruby/Java/ .NET/PHP/Python|
|Quest Foglight||Monitor custom developed Java and .NET applications and trace transactions across all physical and virtual tiers of the application||On-Premise||Agent inside of the Java JVM or the .NET CLR||Java/.NET|
|VMware vFabric APM||Monitor custom developed Java applications in production. Strong integration with the rest of the VMware product line including automated remediation and scaling.||On Premise||Mirror port on the vSphere vSwitch and an agent inside the Java JVM||HTTP/Java/.NET/SQL|
AppDynamics is a Java/.NET APM solution based upon an agent that does byte code instrumentation for Java and .Net based applications. AppDynamics is different from the first generation of Java APM solutions in that it installs and works out of the box, it is designed and priced for applications scaled out across a large number of commodity servers, and it includes cloud orchestration features designed to automate the process of adding instances of the application in a public cloud based upon sophisticated and intelligent rules. AppDynamics is offered on both a SaaS and on-premise basis.
Compuware dynatrace is a Java and .NET APM solution that is differentiated in its ability to trace individual transactions through complex systems of servers and applications. This is a different level of tracing than just understanding which process on a server is talking which process on another server – it truely means that individual transactions can be traced from when they hit the first Java or .NET server until they leave the last one in the system (usually to hit the database server). This tracing combined with in depth code code level diagnostics via byte code instrumentation is what distinguishes dynatrace. Dynatrace is also the only vendor that can trace individual transactions from their inception in the user’s browser through the entire application system.
New Relic pioneered the Monitoring as a Service category by being the first APM vendor to offer robust APM functionality on a SaaS (or more accurately MaaS) basis. The product is truly plug and play, all you do is sign up, install the agent in your Ruby, Java, .NET or PHP application and then log onto a web console that points back to New Relic’s hosted back end of the monitoring system.
VMware vFabric APMis based upon a virtual appliance that collects data from a promiscuous port on the vSwitch (or Nexus 1000v) in the VMware host and a new agent that lives inside of the Java virtual machine that hosts your web/java/database application. vFabric APM is therefore a combination of some breadth in application support, as with the virtual appliance approach it can see all TCP/IP traffic on the virtual networks, and with the Java agent it can see deeply into the performance of the actual applications. VMware will also be buidling automatic remediation into vFabric APM so that when issues occur they can be automatically addressed. The issues with vFabric APM is that is only works for applications written to the vSphere platform, which means of course that it does not support applications running on physical hardware either.
The AppOps Category of APM Tools
The AppOps category of APM tools, listed in the table below, focus upon helping the team that supports every application in production. Every means both custom developed (and if so no matter in what, how or when), and purchased applications. Since the scope of supported applications includes purchased applications (think SAP, Oracle Financial, Microsoft Exchange and SharePoint), the objective of the tool is not line of code analysis as this would be useless for a purchased application. Rather the objective of the tool is to tell you if the problem (slow response time, slow throughput or errors) is in the application or in the infrastructure, and if it is in the infrastructure provide the best guidance possible as to where.. Notice that the target audience of these tools is a someone is who may be in IT Operations, or who many be on an application support team, but not someone whose job it is to make a code change. If you are currently using a first generation APP product from a legacy vendor that is expensive, difficult to use, difficult to maintain, that probably just measures resource utilization and does not measure response time and throughput, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
AppEnsure is a SaaS delivered APM solution focused on providing application discovery, application topology discovery, end-to-end response time monitoring and automated root cause analysis across all applications (custom developed or purchased) deployed across any mixture of physical, virtual, or cloud based environments.
AppFirst is a SaaS delivered APM solution that is most frequently used by SaaS software vendors or delivered through cloud vendors to customers. It is a perfect complement to New Relic as it monitors all of the layers of the application system which an agent in the application run-time is cannot see.
BlueStripe FactFinder is based upon an agent that lives in the Windows or Linux OS that supports the application. This agent watches the network flow between that OS and everything that it is talking to. Through this process FactFinder discovers the topology map of the applications running on each set of monitored servers, and calculates an end-to-end and hop-by-hop response time metric for each application. Since FactFinder is in the OS and not in the application, FactFinder is able to calculate these response time metrics for any application that runs on a physical or virtual instance of Windows or Linux. This makes FactFinder into the only product that provides this level of out of the box functionality for such a breadth of applications.
Boundary is a SaaS delivered APM solution that focuses on using deep and near-real analysis of the network to understand how infrastructure issues are impacting application performance. Boundary is also a perfect complement to New Relic as it collects the precise set of data that leads to infrastructure impacts upon applications.
Correlsense also makes use of agents that live in the operating system, and which use interactions between the application and the OS to map application topologies and time transaction performance. This is another solution that provides excellent end-to-end and hop-by-hop application response time and transaction response time across a very wide range of applications and run-time environments.
ExtraHop Networks uses a mirror port on either the physical network or a mirror port on the VMware vSwitch to see all of the network traffic that flows between physical and virtual servers. This source of data means that ExtraHop can see application topologies and measure end-to-end response time for every TCP/IP based application on your network without requiring the installation of any agents in applications, JVM, virtual servers, or physical servers.
Splunk is the leader in collecting a large variety of log and other monitoring data and putting that into an infinitely scalable and easily searchable big data back end. While most people think of Splunk as a log analysis solution, a substantial portion of Splunk’s customers tie the logs from components of the application system together into a dashboard, which is then used to provide an end-to-end view of the behavior of the application system.
How to Choose an APM Tool for Virtualizing Business Critical Applications
The most important thing to do when choosing an APM tool is to focus upon what problem you are trying to solve, and whose problem you are trying to solve. This leads to the following process:
- Is this a custom developed application and the job of the tool is to support the development team of the application by finding bugs in production? If so, focus upon a DevOps tool. Is the problem supporting the application in production and is the job of the tool to support an application support person who does not own the code? If so, focus upon and AppOps tool.
- Seriously think about the tradeoffs of an on-premise solution vs an APM as a Service solution. Those tradeoffs were outlined in this post about CA’s Reaction to APM as a Service. Take a look at what other SaaS services your company is using. If you are already using a SaaS delivered CRM solution like SalesForce.com, then trusting an APM vendor with the data about your application is not a steep hill to climb.
- Do not consider any APM solution that does not measure response time and throughput for the target set of applications. Inferring application performance from resource utilization is a dead idea that needs to be put to rest.
- Despite some of the marketing and sales messaging that goes back and forth between the vendors, the DevOps category of tools does not compete with the AppOps category of tools. In fact you might be well served buying one of each. AppEnsure, AppFirst and Boundary are all perfectly sensible complements to SaaS delivered New Relic or AppDynamics. Extrahop, BlueStripe, and Correlsense are perfectly sensible complements on an on-premise installation of AppDynamics or dynaTrace.
- Pursue a strategy of instrumenting every important application in production with an appropriate APM solution. If you are virtualizing business critical applications (for many organizations that is all that is left to virtualize), then base-lining the performance of the application with an APM solution on physical hardware and then using that baseline as the SLA for its virtualized instance is really the only sensible way to overcome objections from application owners to the virtualization process. Therefore choosing the right set of APM solutions is a critical part of your virtualization initiatives for 2012 and 2013.
Modern virtualization aware and cloud aware APM tools are a critical part of successfully ensuring the performance and availability of business critical applications and performance critical applications in shared and dynamic data centers created by either virtualization or the use of a cloud.