New Relic is a company founded and run by executives, managers and technical staff who were previously involved with Wily Technology. Wily created and led the market for enterprise J2EE application management and was ultimately acquired by CA. The team at New Relic therefore has extremely deep experience in what it takes to monitor the performance and reliability of enterprise applications, and is bringing this experience to bear in a completely new approach with New Relic RPM.
What’s so Different about New Relic RPM?
Before we delve into the differences, let’s set the stage for how things are normally done in the application management (or application performance management) industry. A vendor produces an agent that gets installed where the significant pieces of the application run. This agent collects detailed data about how the various objects and methods that comprise the application work. This data is then forwarded to a server running the back end management components for the monitoring product. The back end server collects the data, stores it in a database, provides a real time console to the application, provides alerts on exceptions, diagnostics on exceptions, and trending/analytical reports. These products have traditionally been complicated, expensive, time consuming to install and configure, and have required a substantial IT infrastructure footprint at the customer site in order to support the software.
World class application performance management solutions have been implemented inside of the four walls of the data center, on top of an expensive infrastructure and were typically deployed against very expensive and business critical online/transactional applications. These products typically cost between $100,000 and multiple millions of dollars to buy, and had commensurately high total costs of ongoing maintenance and ownership.
Against this backdrop, the team at New Relic decided to reinvent application performance management and focus their solution upon the following concepts:
1. Application Performance Management should be trivially simple to deploy, and should work instantly without requiring customization or configuration.
2. Application Performance Management should be trivially easy to buy, easily affordable and provide almost instant time to value.
3. Application Performance Management should be “cloud aware.” This means that if all or portions of your application live in one or more clouds, the APM system should easily handle this form of distributed execution.
Issues Unique to Monitoring Cloud Hosted Applications
Since RPM is designed to monitor applications that live in one or more clouds, we should explore exactly what it means to deal with the unique aspects of APM in cloud environments. The first set of challenges which must be addressed when performing APM for cloud hosted applications are the same challenges that must be addressed when monitoring applications that live on a virtual infrastructure – since we can safely assume that most clouds either today already live on a virtual infrastructure or will do so shortly. These issues are explored in detail in the white paper available for download at the end of this review, and are summarized below:
- Dynamic capacity. In virtual environments, capacity can be added automatically, and in many cases while the application is running. Therefore inferring application performance as a reciprocal of capacity utilization no longer works once an application is virtualized.
- Shared capacity. Virtualization puts guests into resource pools which share and pool CPU and memory capacity. Furthermore some virtualization platforms (like VMware) actually share memory across guests. Therefore whatever the number is that gets reported as the amount of a resource that is being used by the application can be warped, or made irrelevant by the degree of sharing that occurs in virtualized environments.
- Timekeeping issues. Virtualizing a guest causes its perception of elapsed time to warp as a function of how much that guest gets scheduled out by the hypervisor. This impacts time based metrics (like CPU utilization) collected by the guest OS and makes these metrics suspect and of dramatically reduced value.
- Dynamic configuration. In a virtual infrastructure, a guest may move between physical hosts, creating new “maps” of how the application is constructed. These moves may be driven by automated management solutions like VMware DRS. They may be driven by a decision to move an application from an internal cloud to an external cloud. APM solutions need to keep working as these moves are made, and if they include application topology mapping features, need to automatically update these maps to reflect changes in the deployment architecture of these applications.
The net effect of these issues is that when an application is hosted on a virtual infrastructure the old method of inferring performance as a reciprocal of resource utilization no longer works. A functional approach must start with an understanding of response time on a per transaction and user action basis within the application. This approach is essential not only because it is the only one that will work, but because it is the one that users and applications teams will insist upon in order to feel comfortable about “their” application residing on a shared/virtualized platform.
The New Relic RPM Architecture
New Relic RPM is an APM (Application Performance Management) solution focused upon applications written in Ruby-on-Rails and Java.
RPM is a unique APM solution because it is delivered on a SaaS (Software as a Service) basis in the same manner that Salesforce.com delivers its CRM system. This means that all of the complex back end pieces (application servers, reporting servers, analysis servers, database servers, etc.) are hosted by New Relic and are not run or installed on site at the customer.
Installing the RPM agent is also trivially easy. It is a download from the company’s web site and is easily included as a .jar file in the application that you wish to monitor. The agent initiates the communications from itself to the RPM back end over internet friendly ports and protocols (SSL:443). This means that there is no polling of the agent by the back end, and therefore no need for ports to be opened inbound to the monitored application system.
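The outbound-only design described above can be sketched roughly as follows. This is an illustration of the pattern, not New Relic's agent: the `PushAgent` class, the collector URL, and the JSON payload format are all hypothetical.

```python
import json
import urllib.request

# Hypothetical collector endpoint (assumption); not a real New Relic URL.
COLLECTOR_URL = "https://collector.example.com:443/metrics"


class PushAgent:
    """Buffers metrics locally and pushes them outbound; it never listens on a port."""

    def __init__(self, url=COLLECTOR_URL):
        self.url = url
        self.buffer = []

    def record(self, name, value):
        """Store one metric sample until the next push."""
        self.buffer.append({"name": name, "value": value})

    def build_request(self):
        """Package the buffered metrics as one HTTPS POST and clear the buffer."""
        body = json.dumps(self.buffer).encode("utf-8")
        self.buffer = []
        return urllib.request.Request(
            self.url,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )

    def flush(self, timeout=5.0):
        """Send the request; the agent is always the side that opens the connection."""
        with urllib.request.urlopen(self.build_request(), timeout=timeout) as resp:
            return resp.status
```

Because the monitored host only ever makes outbound HTTPS calls, a firewall that allows ordinary web traffic out is enough, and nothing has to be opened inbound.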
The RPM console is fully web based and is accessible simply by logging onto the customer access section of the New Relic web site. Again, since the entire back end infrastructure for RPM is hosted by New Relic, the web servers are owned and maintained by New Relic and all you need is a browser and access to the public Internet to get to your RPM Management Console.
Finally, buying RPM is a matter of creating an online account with a credit card. Pricing ranges from $50 to $200 per month per monitored host (physical server or virtualized guest).
Supported Applications and Platforms
RPM supports two broad classes of applications, those written with Ruby-on-Rails, and those written in Java. RPM also has support for the following platforms:
- Tomcat, Jetty, JBoss
- GlassFish, the open source application server which is an implementation of Java EE 5
- JRuby, the Java implementation of the Ruby programming language
- The LiteSpeed web server
- The Mongrel HTTP library and server for Ruby
- Phusion Passenger for easy deployment of Ruby on Rails applications
- The Thin Ruby HTTP Daemon
RPM, End User Experience and Apdex
Given that the vast majority of clouds already run on a virtual infrastructure (or will shortly do so), and given the challenges that resource utilization based approaches face, it is a credit to RPM that it starts the application performance management process with a sophisticated understanding of application response time. RPM uses Apdex, an emerging standard for characterizing response time. Apdex is interesting in that it gives a quick and easy to understand picture not just of average response time, but also of the variation around that average. This makes Apdex an easy to calculate and easy to understand index that can in turn be the basis of a simple Service Level Agreement.
The notion of variation around an average response time is critical to linking response time to user experience. Let’s look at two scenarios. In both cases the average response time is 1 second. In scenario one there is little variation around that average and almost no one ever gets a response time of more than 1.5 seconds. In the second scenario the average is again 1 second, but there are many cases of response times that are 2 seconds or even 3 seconds (the variation around the average is higher).
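To make the two scenarios concrete, here is a small sketch with made-up numbers. Both sample lists are hypothetical and were chosen only so that each averages 1 second. The helper `share_slower_than` is likewise just an illustration.

```python
from statistics import mean

# Hypothetical response times in seconds; both samples average 1.0.
steady = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2]
spiky = [0.2, 0.3, 0.5, 0.5, 0.5, 1.0, 2.0, 3.0]


def share_slower_than(samples, limit):
    """Fraction of samples strictly slower than `limit` seconds."""
    return sum(1 for s in samples if s > limit) / len(samples)


for label, samples in (("steady", steady), ("spiky", spiky)):
    # Same average, very different worst case and share of slow responses.
    print(label, round(mean(samples), 3), max(samples), share_slower_than(samples, 1.5))
```

An average alone cannot tell these two applications apart, which is why a measure that also reflects the slow tail is needed.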
The way that Apdex works is that you and the business constituents define a target response time for the application (T). In the example below, T is 1 second. Apdex then characterizes all response times of less than 1 second as Satisfied, all response times between 1 second and 4 seconds (4X the target) as Tolerating, and all response times greater than 4 seconds as Frustrated.
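The classification rule above can be sketched in a few lines. The function name `classify` is hypothetical, and counting a response of exactly T as Satisfied and exactly 4T as Tolerating is an assumption, since the description above does not say how ties are handled.

```python
def classify(response_time, t):
    """Return the Apdex zone for one response time, given the target T (both in seconds)."""
    if response_time <= t:
        return "satisfied"
    if response_time <= 4 * t:
        return "tolerating"
    return "frustrated"


# With T = 1 second, as in the example in the text:
for seconds in (0.5, 2.0, 4.5):
    print(seconds, classify(seconds, 1.0))
```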
RPM reports Apdex as a set of five numbers (in the example below, .96, .5, 94%, 3.4%, and 2.6%). Those numbers mean:
- .96 is the composite Apdex score. It ranges from 0 to 1, with scores closer to 1 being better.
- .5 is the response time threshold (T) in seconds.
- 94% is the percentage of response times that are less than .5 seconds (Satisfied).
- 3.4% are more than .5 seconds and less than 2 seconds, which Apdex deems the Tolerating range (Apdex automatically sets the upper limit of this range at 4X the configured threshold).
- 2.6% are Frustrated, meaning slower than the 2 second Tolerating limit.
The formula for the Apdex score is ((satisfied count)+(tolerating count/2))/total samples. For more information on Apdex and how it works see Apdex.org.
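As a quick check of the formula, the sketch below applies it to the 94%, 3.4%, and 2.6% figures quoted in the example, scaled to a hypothetical 1,000 samples. The function name `apdex_score` is an assumption for illustration.

```python
def apdex_score(satisfied, tolerating, frustrated):
    """Apdex score: (satisfied + tolerating / 2) / total samples."""
    total = satisfied + tolerating + frustrated
    if total == 0:
        raise ValueError("at least one sample is required")
    return (satisfied + tolerating / 2) / total


# 94% satisfied, 3.4% tolerating, 2.6% frustrated out of 1,000 samples.
score = apdex_score(940, 34, 26)
print(round(score, 2))  # prints 0.96
```

Frustrated samples contribute nothing to the numerator, so they pull the score down only by enlarging the total.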
RPM therefore uses Apdex to present the average response time and the variation around the average (the level of tolerance or frustration) as a single overall index. This is a highly useful way to get a snapshot understanding of how application response time is impacting end user experience: a high Apdex score translates directly into a good end user experience, and a degraded Apdex score translates directly into a degraded or poor end user experience.
RPM Summary Metrics
For each application that you choose to monitor with RPM, RPM provides you with a set of easy to understand and highly useful summary metrics.
These summary metrics consist of:
- The Apdex score (described above)
- The average response time over the last measurement interval
- The percentage of transactions that had errors
- The Throughput (requests per minute) along with an indication of the direction of change in Throughput in the last 24 hours and 7 days.
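A rough sketch of how summary figures like these could be derived from raw request records follows. The `Request` record and the `summarize` function are hypothetical and say nothing about how RPM computes its own numbers.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration: float  # response time in seconds
    error: bool      # True if the request ended in an error


def summarize(requests, window_minutes):
    """Average response time, error percentage, and throughput for one time window."""
    if not requests or window_minutes <= 0:
        raise ValueError("need at least one request and a positive window")
    count = len(requests)
    return {
        "avg_response_time": sum(r.duration for r in requests) / count,
        "error_pct": 100.0 * sum(1 for r in requests if r.error) / count,
        "throughput_rpm": count / window_minutes,
    }


# Four hypothetical requests observed over a two minute window.
window = [Request(0.2, False), Request(0.4, False), Request(0.6, True), Request(0.8, False)]
print(summarize(window, window_minutes=2))
```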
When you select an application and drill down into its performance you are presented with the high level troubleshooting dashboard shown below. This dashboard shows the response time by application tier (broken out by time in the JVM and the Database for the selected time period), the Apdex score over that time period, the Throughput in requests per minute, Recent Slow Transactions, Recent Errors, and Recent Events (which can be Notes that have user interface elements attached to them, Deployments (changes to the application), and Alerts).
From this top level overview of the performance and health of your application, you can triage the issue by easily finding out in which sub-system the problem resides, and then diagnose the issue by drilling deeply into that sub-system.
Managing the Real Time Performance of Your Application
Obviously if response time is degrading, you want to know why. The first place to look is in the portions of your application that are the slowest. The combination of which transactions are the slowest, and which slow transactions are occurring the most frequently is the most fertile place to look for performance issues. As an example, the third transaction in the Recent Slow Transactions list above was selected. This transaction took 63 seconds to run. The drill-down into this transaction is displayed below.
You can see that RPM does a really nice job of breaking the transaction into its composite methods and calls.
If you are interested in seeing which transactions are consuming the most time across your application, RPM gives you that list. This is really useful for seeing the overall impact of slow transactions upon your application since it tells you where time is being spent, not just how slow an isolated transaction might be.
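The difference between "slowest" and "most time consuming" is easy to see in a sketch. The transaction names and timings below are invented, and `rank_by_total_time` is a hypothetical helper, not part of RPM.

```python
from collections import defaultdict


def rank_by_total_time(samples):
    """Rank transactions by total time consumed.

    samples: iterable of (transaction_name, duration_seconds).
    Returns a list of (name, total_seconds, call_count), largest total first.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for name, duration in samples:
        totals[name][0] += duration
        totals[name][1] += 1
    rows = [(name, total, count) for name, (total, count) in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# One very slow report versus a moderately slow search that runs 40 times.
samples = [("/report", 63.0)] + [("/search", 2.0)] * 40
for name, total, count in rank_by_total_time(samples):
    print(name, total, count)
```

Here the search page outranks the single 63 second report because its 40 calls add up to more total time, which is the kind of insight a per-transaction view alone would miss.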
Since database activity often includes physical writes and reads from the I/O subsystem, RPM does a really nice job of profiling the performance of the database from the perspective of the application. You can easily get a list of the most time consuming database calls in your application.
RPM also gives you nice graphs of the top database operations, database throughput and database response time, all broken out by type of database statement (select, delete, insert, update).
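A breakdown by statement type can be approximated by looking at the first keyword of each SQL statement. The sketch below is a simplification under that assumption. The function name and the sample statements are hypothetical.

```python
from collections import Counter

STATEMENT_TYPES = ("select", "insert", "update", "delete")


def time_by_statement_type(calls):
    """Sum database time per statement type.

    calls: iterable of (sql_text, duration_seconds).
    Statements whose first keyword is not recognized are grouped under "other".
    """
    totals = Counter()
    for sql, duration in calls:
        words = sql.split()
        verb = words[0].lower() if words else ""
        totals[verb if verb in STATEMENT_TYPES else "other"] += duration
    return dict(totals)


calls = [
    ("SELECT * FROM users", 0.5),
    ("select id from orders", 0.25),
    ("UPDATE users SET name = 'a'", 1.0),
]
print(time_by_statement_type(calls))
```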
Finding Problems in your Code
Finding issues with code in production is one of the hardest aspects of Application Performance Management. Most of these problems did not show up during the test and pilot phases of the development process – so they tend to be the “hard” problems that are related to load.
RPM does a really nice job of capturing transaction traces for an individual transaction whenever the duration of that transaction is longer than four times your application’s Apdex T (configurable). This lets you determine where in your code the time is being spent – which greatly speeds up the problem resolution process.
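The trigger condition described above amounts to a single comparison. In this sketch the function name and the `multiplier` parameter are hypothetical, and the 0.5 second T is reused from the earlier Apdex example purely for illustration.

```python
def should_capture_trace(duration, apdex_t, multiplier=4.0):
    """True when a transaction ran longer than multiplier x Apdex T (all in seconds)."""
    return duration > multiplier * apdex_t


# With T = 0.5 seconds the trace threshold is 2 seconds.
print(should_capture_trace(63.0, 0.5))  # a 63 second transaction is traced
print(should_capture_trace(1.9, 0.5))   # a 1.9 second transaction is not
```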
RPM does not just stop with transactions that take too long; it also provides detailed diagnostics of errors that occur in your application. When an error occurs, RPM provides the parameters of the error and the stack trace associated with the error.
It is one thing to have a tool that provides you with real time information to rapidly diagnose and fix problems, but it is another thing entirely to have the analytical information needed to be able to take actions ahead of issues and to thereby prevent them from occurring.
One of the first things you always want to know in this regard is how your application is scaling in response to load. RPM gives you two very nice charts that tell you exactly this piece of information. The first of these plots response time as a function of load, and also tells you what time of day your load is heaviest (lighter colors indicate daytime).
The second of these shows how fast the database is responding to requests as throughput rises. The gradual slope of this line suggests that the database will become the source of slow response time although clearly this has not occurred yet.
Another one of the vexing problems that strikes applications in production is whether or not a change to the application is the cause of a performance problem or the cause of an increase in error rates. RPM indicates new deployments with a blue line at the date/time point where the deployment occurred.
Clicking on the blue line gets you to a deployment analysis that shows how the performance of each of the key pages in your site has been impacted by this new deployment. In the case below, the new deployment improved the response times for the most frequently used pages in the site.
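A before and after comparison of this kind can be sketched as a per-page average on either side of the deployment time. The data layout and the `deployment_impact` function are assumptions for illustration and do not describe how RPM builds its report.

```python
from statistics import mean


def deployment_impact(samples, deployed_at):
    """Compare mean response time per page before and after a deployment.

    samples: iterable of (timestamp, page, duration_seconds).
    Returns {page: (mean_before, mean_after)} for pages seen on both sides.
    """
    before, after = {}, {}
    for timestamp, page, duration in samples:
        side = before if timestamp < deployed_at else after
        side.setdefault(page, []).append(duration)
    return {
        page: (mean(before[page]), mean(after[page]))
        for page in before
        if page in after
    }


# Hypothetical samples with a deployment at timestamp 10.
samples = [
    (1, "/home", 1.0), (2, "/home", 3.0),
    (11, "/home", 1.0), (12, "/home", 1.0),
    (3, "/old", 2.0),
]
print(deployment_impact(samples, deployed_at=10))
```

Pages that appear on only one side of the deployment are left out, since there is nothing to compare them against.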
If you are wondering how your application looks from the perspective of the database (which calls are using the most time), RPM contains a nice report that tells you exactly that.
Finally, RPM does a really nice job of tracking the high level metrics you really care about from a Service Level perspective in its SLA report.
RPM is a uniquely simple and useful Application Performance Management solution. The strengths of RPM are:
- The ease with which you can start monitoring the performance of your applications, no matter where they (or parts of them) are hosted.
- The focus upon response time (with Apdex as a convenient way to measure the consistency of response time as opposed to just averages). This is critical for a number of reasons, all of which are covered in detail in the Virtualization Performance and Capacity Management White Paper linked at the end of this review.
- Coverage of Ruby-on-Rails and Java applications. RPM now covers two extremely popular deployment frameworks for applications. The focus of RPM upon deployment frameworks (as opposed to specific middleware products, or underlying operating systems) puts RPM in a strong position to increase value to applications owners over time and to add support for other deployment frameworks.
- Deep diagnostics into code and database transaction performance. It is one thing to know that your application or a specific transaction is slow. It is quite another thing to know why. RPM does a great job of finding out what part of your application is the cause of the issues you are experiencing.
- The SaaS (Software as a Service) deployment model. This is part of making the product easy to get up and running (as you do not need to install the back end pieces in your environment, or in your cloud) but the effects are more profound. As you start to distribute applications and pieces of applications into clouds, you will discover that the “heavyweight” monitoring tools that live within the four walls of your data center are not architected to cope with this level of distributed execution of your applications. Public cloud computing, or even Public/Private hybrid cloud computing, requires a new architecture for Application Performance Monitoring. RPM is an example of the kind of new architecture required.
- RPM’s pricing models. RPM is offered in three pricing models, a by the month per server model, an annual plan, and a by the hour of usage model. All models make buying RPM as easy as implementing it.
- Because New Relic controls the RPM application, New Relic can fix bugs in the product within hours and rapidly deploy new features and capabilities. The customer is always running on the “latest” product.
As with all products, RPM is not the perfect solution for all problems and use cases. Some places where RPM is not a good fit are:
- RPM supports applications written in Ruby-on-Rails and Java. If your application is not written in one of these languages for deployment on the associated platforms then RPM is not going to work for you.
- RPM is an applications performance monitoring solution, not an infrastructure performance monitoring solution. Forward thinking cloud vendors are likely to provide infrastructure performance monitoring as a value added service to their cloud offerings. As this occurs, the potential exists for integration between the monitoring of the application and the infrastructure in the cloud leading to a true end-to-end solution.
- The RPM agent requires the ability to open an outbound Internet connection to the RPM back end. If your application is so locked down in your internal data center that an agent running on it cannot open an outbound port, then RPM is probably not the right solution for you.
In summary, RPM is a groundbreaking combination of monitoring functionality, ease of deployment, and ease of purchase. Understanding the performance of applications that you place in clouds is a significant barrier to the deployment of applications in these clouds. RPM overcomes this issue with a solution that is a unique fit for these cloud hosted applications.