The Software Defined Data Center (SDDC) will be a highly dynamic environment, with constantly changing configuration and resource allocation settings driven by various forms of automation: the provisioning of workloads from service catalogs, the scaling of workloads in response to demand, and the migration of workloads across hosts for balancing and prioritization. At the same time, Agile development means that applications are changing more quickly than ever before. We are therefore going to have rapidly changing applications running on a rapidly changing software infrastructure, and this will drive the need for SDDC Application Performance Management.
The SDDC Management Stack Reference Architecture
The SDDC is going to require an entirely new management stack. Managing an SDDC will be a fundamentally different exercise from running applications that changed once a year on dedicated physical hardware that was refreshed on a three-year cycle. The complete rationale for why you will need an entirely new management stack can be found in “Building a Management Stack for Your Software Defined Data Center”.
The need for SDDC Application Performance Management
Consider the following scenario. You have 100 business critical applications in your company. 80% of these are applications purchased from vendors. You have to support these applications in production and are held responsible for their operation. But you obviously do not own the code, nor do you have access to the code. If there is a bug in the application, you report the symptoms of that bug to the vendor of the application, who is responsible for fixing it. However, most of the problems with these applications are not bugs in code, but rather issues in the infrastructure that impact the operation and performance of these applications.
The other 20% of your applications are custom developed. They are developed by your development teams and maintained in production by your development teams. When there is a bug that impacts the operation of these applications in production, your development team has to find and fix that bug.
The above two scenarios were in place prior to the arrival of virtualization, and even with virtualization are in place now prior to the arrival of the SDDC. The SDDC will make APM into even more of an imperative for the following reasons:
- Most tools that claim to monitor “application performance” in fact do nothing of the kind; they only monitor the resources used by the application. When an end user complains about the performance of an application, they are complaining about its response time. Therefore, in order to have a meaningful conversation with application owners and business constituents about application performance, IT Operations is going to need tools that measure the response time of every application in production.
- The injection of the new layers of software that comprise the SDDC into the stack of software that supports the applications is going to be viewed with great suspicion by application owners and business constituents. Just like Citrix was always blamed for the performance of applications delivered over Citrix, and just like VMware was (and is) always blamed for the performance of newly virtualized applications, the SDDC (and its owners, IT Operations) are going to be guilty until proven innocent when it comes to the response time of applications running on the SDDC. This will again require that IT Operations have tools that monitor the response time, throughput, and error rate of every application in production.
- While first generation APM tools targeted only custom developed applications, the SDDC will be running every business critical application that you own. Therefore you will need tools appropriate for custom developed applications (DevOps) and tools appropriate for purchased applications (AppOps). DevOps tools are needed to find problems in the code you own and are responsible for. AppOps tools are needed to find issues in the software and hardware infrastructure that impact all of your important applications.
Attributes of Modern Virtualization and Cloud Aware APM Solutions
- Response Time and Throughput: The single most important metric when measuring application performance, especially for applications running in virtual or cloud environments, is application response time. This metric accurately reflects the service level that the application is delivering to its users, and it is therefore a metric that application owners will readily accept as one that represents their concerns. It is essential that you choose an APM solution that measures response time for your applications in a way that is as realistic (as close as possible to what the user actually sees) and as comprehensive (across all tiers of the application system) as possible. All of the solutions profiled in this article focus upon response time, so there is no column for this criterion in the tables below; it is met by all of the vendors. For certain applications, throughput is also an extremely important metric. This is true for applications that process a large number of transactions, where throughput within a response time threshold is how servers should be sized, and for applications that run very large transactions (batch jobs), where completion of the batch job within a certain period of time is the critical issue.
- Deployment Method: This is where you have to make some difficult tradeoffs. The first tradeoff is an on-premise solution vs. a SaaS delivered solution (Monitoring as a Service). The advantage of MaaS is that you do not have to maintain the back end, and as the vendor adds features to the product, they just upgrade the back end and you get the new features. The advantage of an on-premise solution is that data about the performance of your business critical applications is not sent over the Internet to someone else’s data center.
- Data Collection Method: This and Supported Application Types (directly below) are where you trade off the breadth of the applications that you can manage with your APM solution against the depth of the analysis. You basically have three sources of data to choose from in a modern APM solution. The first choice is to collect the data from the network via a physical or virtual appliance that sits on a physical or virtual mirror port. The virtue of this approach is that it works for every application that you have – irrespective of how it was built, or whether it was built or purchased. The next choice is a modern transaction oriented agent inside the operating system. These are very different agents than the legacy agents that just capture resource utilization statistics. These agents capture the detail of how the application interacts with the OS, and how the processes that comprise the application communicate over the network that connects all of the servers that host the application. The last choice is to use an agent that lives in the application run time environment. This provides the deepest level of diagnostics and transaction tracing, but only works for applications that are written to the specific run times supported by the APM vendor (you get depth, but you give up breadth).
- Breadth of Supported Application Types: The APM solution has to work with and support your application architectures. If you just have web based applications with Java middle tiers and database back ends, there are many good tools to choose from. The more you diverge from HTTP/.NET/Java/SQL as the application architecture, the fewer tools there are to choose from. If your application has a proprietary front end (a Win32 client), a proprietary middle tier (maybe something written in COM+ or C++ ten years ago), and a database that no one supports, then you need to look for a tool that operates at the TCP/IP layer, since instrumenting the application itself will likely be impossible. However, in so doing you will give up the insights into the business logic that Java and .NET aware tools provide. The tradeoff between depth of code analysis and breadth of applications supported is what differentiates the “DevOps” category of APM tools from the “AppOps” category of APM tools.
- Application Topology Discovery: As your application will now be “dynamic”, you will need a tool that can keep up with your application and its topology no matter how it is scaled out, and no matter where a component of the application is moved. This means that if the APM tool relies upon an agent, then that agent must travel with (inside of) the application, so that as the application is replicated and moved, the agent comes up and finds its management system. It is also critical that these agents map the topology of the application system from the perspective of what is talking to what. Otherwise it will be impossible to troubleshoot a system with so many moving parts.
- Private/Hybrid/Public Cloud Ready: If you are thinking about putting all or part of an application in a public cloud, then you need an APM solution that works when there is no LAN connection or VPN between the agent in the application and the management system. Polling agents that live in clouds will not work, as you cannot assume the existence of an inbound port to poll through. Therefore the agent needs to initiate the connection, opening an outbound port back to the management system, which in turn needs to be able to catch the incoming traffic in your DMZ. You also need a system that is able to map the topology of your application system across the data centers in which it executes.
- Zero Configuration Required: If you are an Agile Development shop, then it is essential that you choose an APM solution that can keep up with your rate of enhancement and change in the application. Essentially this means that you need a “zero-config” APM tool, as with a rapid rate of change in the application you will have no time to update the tool every time you do a release into production.
- Deep-Dive Code Diagnostics: For some really performance critical applications, being able to trace transactions through the layers of an application system can be invaluable when it comes to understanding end-to-end performance, but this capability is traded off against breadth of platform support. The bottom line is that you cannot have both the deepest possible visibility, and the broadest possible support for applications architectures in one product.
- Support for New Application Run-Times and Languages: You need to carefully marry your application development strategy with your APM strategy. If it is appropriate for you to specify a standard language and run-time for your organization, then pick a tool that supports that language or run time. If, on the other hand, the requirements are so diverse and the needs for responsiveness are so demanding that you are ending up with not just Java and .NET, but also some combination of PHP, Python, Ruby, and Node.js, then look either for a solution that supports your set of languages or one that is agnostic as to how the application is built. Recognize that if you go the agnostic route you will lose line of code visibility as a result.
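The response-time-first approach that runs through these attributes can be sketched in a few lines. The following is a minimal, hypothetical illustration (not any vendor's agent) of in-process instrumentation that records the three metrics the article argues matter: response time, throughput, and error rate.

```python
import time
from statistics import median


class ResponseTimeMonitor:
    """Records per-request latency and derives response time percentiles,
    throughput, and error rate -- the metrics application owners care about."""

    def __init__(self):
        self.latencies = []          # seconds, one entry per observed request
        self.errors = 0
        self.started = time.monotonic()

    def observe(self, func, *args, **kwargs):
        """Time a single request; count exceptions as errors."""
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        except Exception:
            self.errors += 1
            raise
        finally:
            self.latencies.append(time.monotonic() - start)

    def summary(self):
        n = len(self.latencies)
        ordered = sorted(self.latencies)
        p95 = ordered[max(0, int(0.95 * n) - 1)] if n else 0.0
        elapsed = time.monotonic() - self.started
        return {
            "requests": n,
            "median_s": median(ordered) if n else 0.0,
            "p95_s": p95,
            "throughput_rps": n / elapsed if elapsed > 0 else 0.0,
            "error_rate": self.errors / n if n else 0.0,
        }
```

A real agent would, of course, capture this transparently (via byte code instrumentation or OS-level interception) rather than requiring an explicit wrapper, but the metrics it must ultimately produce are the same.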
The “DevOps” Category of APM Tools
The DevOps category of APM tools, listed in the table below, focuses upon helping the portion of the development team that supports a custom developed application in production rapidly resolve issues with the application itself. After making sure that the tool supports your application and its execution environment, the single most important criterion for selecting a tool in this category is “time to bug”: from the time that the problem occurred, how long did it take the tool to notice the problem (how close to real time is the instrumentation), and once the alert is raised, how many “clicks to resolution” are there? Notice that the target audience of these tools is a developer who can, and probably will, need to make a code change to resolve the problem. Therefore these tools are targeted specifically at developers and at applications where the enterprise owns the code and has the responsibility for maintaining it. If you are currently using a second generation APM product from a legacy vendor that is expensive, difficult to use, difficult to maintain, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
| Vendor/Product | Product Focus | Deployment Method | Data Collection Method | Supported App Types | Application Topology Discovery | Cloud Ready | “Zero-Config” | Deep Code Diagnostics |
|---|---|---|---|---|---|---|---|---|
| AppDynamics | Monitor custom developed Java and .NET applications across internal and external (cloud) deployments | On Premise/SaaS | Agent inside of the Java JVM or the .NET CLR | Java/.NET | | | | |
| AppNeta (TraceView) | Monitor custom developed Java and .NET applications across internal and external (cloud) deployments | SaaS | Agent inside of the Java JVM or the .NET CLR | Ruby/Java/Python/PHP | | | | |
| dynaTrace (Compuware) | Monitoring of complex enterprise applications that are based on Java or .NET but which may include complex enterprise middleware like IBM MQ and CICS | On Premise | Agent inside of the Java JVM or the .NET CLR | Java/.NET, WebSphere Message Broker, CICS, C/C++ | | | | |
| New Relic (RPM) | Monitor custom developed Java, .NET, Ruby, Python, and PHP applications across internal and external (cloud) deployments | SaaS | Agent inside of the Java JVM, .NET CLR, or the PHP/Python runtime | Ruby/Java/.NET/PHP/Python | | | | |
| Quest (Foglight) | Monitor custom developed Java and .NET applications and trace transactions across all physical and virtual tiers of the application | On-Premise | Agent inside of the Java JVM or the .NET CLR | Java/.NET | | | | |
AppDynamics is a Java/.NET APM solution based upon an agent that does byte code instrumentation for Java and .Net based applications. AppDynamics is different from the first generation of Java APM solutions in that it installs and works out of the box, it is designed and priced for applications scaled out across a large number of commodity servers, and it includes cloud orchestration features designed to automate the process of adding instances of the application in a public cloud based upon sophisticated and intelligent rules. AppDynamics is offered on both a SaaS and on-premise basis.
AppNeta TraceView is a SaaS delivered APM service that supports applications written in PHP, Ruby, Python, and Java. Since it is a cloud delivered service, it is particularly appropriate for new applications written in new languages like PHP, Python, and Ruby, deployed to new deployment environments like public clouds. TraceView includes the ability to trace transactions across multiple tiers of an application system, including tiers that reside in different data centers.
Compuware dynaTrace is a Java and .NET APM solution that is differentiated by its ability to trace individual transactions through complex systems of servers and applications. This is a different level of tracing than just understanding which process on a server is talking to which process on another server – it truly means that individual transactions can be traced from when they hit the first Java or .NET server until they leave the last one in the system (usually to hit the database server). This tracing, combined with in-depth code level diagnostics via byte code instrumentation, is what distinguishes dynaTrace. dynaTrace is also the only vendor that can trace individual transactions from their inception in the user’s browser through the entire application system.
New Relic pioneered the Monitoring as a Service category by being the first APM vendor to offer robust APM functionality on a SaaS (or more accurately MaaS) basis. The product is truly plug and play, all you do is sign up, install the agent in your Ruby, Java, .NET or PHP application and then log onto a web console that points back to New Relic’s hosted back end of the monitoring system.
Quest Foglight deeply monitors J2EE and .Net applications servers allowing for visibility into the transaction layer of web based applications systems. Foglight also traces these transactions across all of the tiers of the application system. For these reasons, Foglight is an excellent choice for enterprises looking to virtualize line of business applications or deploy these applications in distributed and scaled out environments.
The AppOps Category of APM Tools
The AppOps category of APM tools, listed in the table below, focuses upon helping the team that supports every application in production. Every means both custom developed (no matter in what, how, or when) and purchased applications. Since the scope of supported applications includes purchased applications (think SAP, Oracle Financials, Microsoft Exchange, and SharePoint), the objective of the tool is not line of code analysis, as this would be useless for a purchased application. Rather, the objective of the tool is to tell you whether the problem (slow response time, slow throughput, or errors) is in the application or in the infrastructure, and if it is in the infrastructure, to provide the best guidance possible as to where. Notice that the target audience of these tools is someone who may be in IT Operations, or who may be on an application support team, but not someone whose job it is to make a code change. If you are currently using a first generation APM product from a legacy vendor that is expensive, difficult to use, difficult to maintain, that probably just measures resource utilization and does not measure response time and throughput, and that does not fit into your scaled out and distributed deployment strategy, then you should replace that tool with one of the ones below.
| Vendor/Product | Product Focus | Deployment Method | Data Collection Method | Supported App Types | Application Topology Discovery | Cloud Ready | “Zero-Config” | Deep Code Diagnostics |
|---|---|---|---|---|---|---|---|---|
| AppEnsure | Manage response time & throughput of every Windows & Linux application; purchased & custom developed, physical & virtual, remote & local, private, hybrid & public cloud. Includes automated application discovery, topology mapping & root cause analysis. | On Premise/SaaS | Agent inside of the Windows or Linux operating system | All TCP/IP on Windows or Linux | | | | |
| AppFirst | Monitor every application in production irrespective of source and deployment | SaaS | Agent inside of the Windows or Linux operating system | All TCP/IP on Windows or Linux | | | | |
| BlueStripe FactFinder | Monitor every application in production irrespective of source and deployment | On Premise | Agent inside the Windows, Linux, AIX, or Sun operating system | All TCP/IP on Windows, Linux, AIX, or Sun OS | | | | |
| Boundary | Monitor the impacts of network flows upon the application | SaaS | Agent inside of the Windows or Linux operating system | All Linux TCP/IP applications | | | | |
| Confio Software IgniteVM | Monitor database performance, especially in conjunction with the performance of the underlying storage | On Premise | Agentless collection of detailed database data and storage latency data from vSphere | DB2, Oracle, and SQL Server database applications running on vSphere | | | | |
| Correlsense | Monitor transactions of complex enterprise architectures that are based on a large variety of platforms, languages, and middle tiers | On Premise | Agent inside the Windows, Linux, AIX, or Sun operating system | All TCP/IP on Windows, Linux, AIX, or Sun OS | | | | |
| ExtraHop Networks | Monitor every application in production irrespective of source and deployment | On Premise | From a mirror port on a physical switch or the vSphere vSwitch | All TCP/IP regardless of platform | | | | |
| INETCO Insight | Monitor every application in production irrespective of source and deployment | On Premise | From a mirror port on a physical switch | All TCP/IP regardless of platform | | | | |
| Riverbed Cascade | Monitor application level packet data and flow data to determine application performance from the perspective of the network | On-Premise | Flow Collector, physical appliance on a physical mirror port or virtual appliance on the VMware vSwitch | All TCP/IP regardless of platform | | | | |
| Splunk | Collection of logs and many other metrics into an easily searchable “big data” database | On Premise/SaaS | A wide variety of collectors that interface to log sources and other sources of data | Any application for which a log of some type is generated | | | | |
AppEnsure is a SaaS delivered APM solution focused on providing application discovery, application topology discovery, end-to-end response time monitoring and automated root cause analysis across all applications (custom developed or purchased) deployed across any mixture of physical, virtual, or cloud based environments.
AppFirst is a SaaS delivered APM solution that is most frequently used by SaaS software vendors or delivered through cloud vendors to customers. It is a perfect complement to New Relic, as it monitors all of the layers of the application system which an agent in the application run-time cannot see.
BlueStripe FactFinder is based upon an agent that lives in the Windows or Linux OS that supports the application. This agent watches the network flow between that OS and everything that it is talking to. Through this process, FactFinder discovers the topology map of the applications running on each set of monitored servers, and calculates end-to-end and hop-by-hop response time metrics for each application. Since FactFinder is in the OS and not in the application, FactFinder is able to calculate these response time metrics for any application that runs on a physical or virtual instance of Windows or Linux. This makes FactFinder the only product that provides this level of out of the box functionality for such a breadth of applications.
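The flow-watching idea behind FactFinder and the other OS-agent and wire-data tools in this category can be illustrated with a small sketch. This is a hypothetical simplification, assuming the agent has already reduced its observations to (source, destination, latency) tuples:

```python
from collections import defaultdict


def build_topology(flows):
    """Derive an application topology map from observed network flows.

    Each flow is (source_host, dest_host, latency_seconds); repeated
    observations of the same hop are averaged into one edge weight."""
    samples = defaultdict(list)
    for src, dst, latency in flows:
        samples[(src, dst)].append(latency)
    return {hop: sum(vals) / len(vals) for hop, vals in samples.items()}


def end_to_end_latency(topology, path):
    """Sum per-hop latencies along an ordered list of hosts, giving the
    end-to-end response time contribution of the infrastructure path."""
    return sum(topology[(a, b)] for a, b in zip(path, path[1:]))
```

Because the observations come from the OS or the wire rather than from inside the application, this approach works for any TCP/IP application, which is exactly the breadth-over-depth tradeoff described earlier.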
Boundary is a SaaS delivered APM solution that focuses on using deep, near-real-time analysis of the network to understand how infrastructure issues are impacting application performance. Boundary is also a perfect complement to New Relic, as it collects the precise set of data that reveals infrastructure impacts upon applications.
Confio Software IgniteVM is a tool that combines deep visibility into the performance (response time) of database queries with a time correlated view of the storage latency of the arrays that support that database in a VMware vSphere environment. Since a very high percentage of application performance problems are tied to database performance and underlying storage performance issues, IgniteVM is extremely useful for customers running performance critical database applications on a vSphere platform.
Correlsense also makes use of agents that live in the operating system, and which use interactions between the application and the OS to map application topologies and time transaction performance. This is another solution that provides excellent end-to-end and hop-by-hop application response time and transaction response time across a very wide range of applications and run-time environments.
ExtraHop Networks uses a mirror port on either the physical network or a mirror port on the VMware vSwitch to see all of the network traffic that flows between physical and virtual servers. This source of data means that ExtraHop can see application topologies and measure end-to-end response time for every TCP/IP based application on your network without requiring the installation of any agents in applications, JVM, virtual servers, or physical servers.
INETCO Insight uses a mirror or span port on the physical switches to collect detailed data about the flows between the components of your applications. INETCO Insight relies on a Unified Transaction Model framework to re-construct multi-tier and multi-hop transactions, mining relevant transaction information and business context from the decoded fields. INETCO Insight provides network performance data, application payload intelligence, and detailed transaction response times and completion metrics all in one view, for every transaction.
Splunk is the leader in collecting a large variety of log and other monitoring data and putting that into an infinitely scalable and easily searchable big data back end. While most people think of Splunk as a log analysis solution, a substantial portion of Splunk’s customers tie the logs from components of the application system together into a dashboard, which is then used to provide an end-to-end view of the behavior of the application system.
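The pattern of tying per-tier logs together into an end-to-end view can be sketched as follows. The log format and the transaction-ID convention here are assumptions made for illustration, not Splunk's actual search language:

```python
from collections import defaultdict


def correlate_logs(log_lines):
    """Group log events from different tiers by a shared transaction ID,
    yielding a time-ordered trail per transaction.

    Each line is assumed (for this sketch) to look like:
        '<epoch_seconds> <tier> txn=<id> <message>'"""
    trails = defaultdict(list)
    for line in log_lines:
        ts, tier, txn, message = line.split(" ", 3)
        txn_id = txn.split("=", 1)[1]
        trails[txn_id].append((float(ts), tier, message))
    # Sorting by timestamp reconstructs the path each transaction took.
    return {txn_id: sorted(events) for txn_id, events in trails.items()}
```

The value of the approach is that any component that emits a log with a correlating identifier can participate in the end-to-end view, regardless of how that component was built or purchased.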
How to Choose an SDDC Application Performance Management Solution
The most important thing to do when choosing an APM tool is to focus upon what problem you are trying to solve, and whose problem you are trying to solve. This leads to the following process:
- Is this a custom developed application, and is the job of the tool to support the development team of the application by finding bugs in production? If so, focus upon a DevOps tool. Is the problem supporting the application in production, and is the job of the tool to support an application support person who does not own the code? If so, focus upon an AppOps tool.
- Seriously think about the tradeoffs of an on-premise solution vs an APM as a Service solution. Those tradeoffs were outlined in this post about CA’s Reaction to APM as a Service. Take a look at what other SaaS services your company is using. If you are already using a SaaS delivered CRM solution like SalesForce.com, then trusting an APM vendor with the data about your application is not a steep hill to climb.
- Do not consider any APM solution that does not measure response time and throughput for the target set of applications. Inferring application performance from resource utilization is a dead idea that needs to be put to rest.
- Despite some of the marketing and sales messaging that goes back and forth between the vendors, the DevOps category of tools does not compete with the AppOps category of tools. In fact, you might be well served buying one of each. AppEnsure, AppFirst, and Boundary are all perfectly sensible complements to SaaS delivered AppNeta, New Relic, or AppDynamics. AppEnsure, ExtraHop, INETCO, BlueStripe, and Correlsense are perfectly sensible complements to an on-premise installation of AppDynamics or dynaTrace.
- Pursue a strategy of instrumenting every important application in production with an appropriate APM solution. If you are virtualizing business critical applications (for many organizations that is all that is left to virtualize), then base-lining the performance of the application with an APM solution on physical hardware and then using that baseline as the SLA for its virtualized instance is really the only sensible way to overcome objections from application owners to the virtualization process. Therefore choosing the right set of APM solutions is a critical part of your virtualization initiatives for 2013 and an absolutely essential part of your strategy of migrating to an SDDC.
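The baseline-as-SLA idea in the last point can be made concrete with a short sketch. This is a hypothetical illustration; the choice of the 95th percentile and the 20% headroom are assumptions, not a prescription:

```python
def sla_from_baseline(baseline_latencies, headroom=1.2):
    """Turn response times measured on physical hardware into an SLA
    threshold: the 95th-percentile baseline latency plus some headroom."""
    ordered = sorted(baseline_latencies)
    p95 = ordered[max(0, int(0.95 * len(ordered)) - 1)]
    return p95 * headroom


def meets_sla(virtualized_latencies, sla_threshold):
    """The virtualized instance meets its SLA when its own p95 response
    time stays at or below the baseline-derived threshold."""
    ordered = sorted(virtualized_latencies)
    p95 = ordered[max(0, int(0.95 * len(ordered)) - 1)]
    return p95 <= sla_threshold
```

Measuring both sides with the same APM solution is what makes the comparison defensible to application owners: the metric, the measurement point, and the percentile are identical before and after virtualization.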
SDDC Application Performance Management will be a critical part of ensuring that the applications that matter to your business are highly available and perform well in your software defined data center. Running rapidly changing applications on a highly dynamic software infrastructure will lead to intractable problems unless proper APM tools are deployed in your SDDC.