There has recently been an interesting three-party back-and-forth between Heroku (a PaaS cloud vendor), Rap Genius (a customer of Heroku), and New Relic (an APM solution resold by Heroku to Heroku’s customer, Rap Genius). All three parties have been very forthcoming about their points of view in public blog posts, so we have an extraordinary opportunity to “see inside” a real-world problem and how the two vendors dealt with it.
The Sequence of Events
- Rap Genius noticed that their web site seemed slower than they wanted it to be, and slower than New Relic’s application response time numbers indicated.
- Rap Genius reported the problem to Heroku. Heroku seemed unable to resolve the problem to Rap Genius’ satisfaction.
- Heroku then reported the problem to New Relic.
- The two vendors worked together so that Heroku could provide New Relic with more information about where time was spent. Specifically, New Relic made an improvement to more accurately depict the time spent in Request Queuing.
A good summary of this chain of events is in the post up on VentureBeat. The important thing is that New Relic made an enhancement to their product within one week of the problem being reported to them and rolled that enhancement into production. The difference in what New Relic showed as time spent in Request Queuing is shown in the two screen shots below.
New Relic Response Time Breakdown Before Request Queuing Change
New Relic Response Time Breakdown After Request Queuing Change
Now the differences are stark and obvious. The response time went from an average of 236 ms to an average of 1,330 ms once the new Request Queuing data was included in the response time total. The vast majority of the new difference in response time is in fact attributable to Request Queuing, which means time spent in the Heroku infrastructure, not time spent in the application.
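To make the mechanics concrete, here is a minimal sketch (in Ruby, since Rap Genius runs a Rails application) of how an agent living inside the application can estimate request queue time. It assumes the front-end router stamps each incoming request with an `X-Request-Start: t=<microseconds since epoch>` header; the exact header name and timestamp format vary by platform, so treat this as an illustration of the technique rather than New Relic’s actual implementation.

```ruby
# Sketch: estimate time a request spent queued in front of the app server.
# Assumes the router set "X-Request-Start: t=<microseconds since epoch>"
# (a common nginx-style convention); header format varies by platform.
def queue_time_ms(rack_env, now = Time.now)
  raw = rack_env['HTTP_X_REQUEST_START'] or return nil
  start_us = raw.sub(/^t=/, '').to_f          # router timestamp, microseconds
  elapsed_us = now.to_f * 1_000_000 - start_us # time until the app saw it
  [elapsed_us / 1000.0, 0.0].max               # clamp negatives (clock skew)
end

# Example: a request the router stamped 120 ms before the app received it.
stamped = ((Time.now.to_f - 0.120) * 1_000_000).round
puts queue_time_ms('HTTP_X_REQUEST_START' => "t=#{stamped}")
```

Note that the measurement is only as good as the router’s willingness to stamp the header in the first place, which is exactly the transparency Heroku had to add.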
So Who is Really Responsible for PaaS Cloud Application Performance?
If you read the VentureBeat post linked above, it is pretty clear that the customer blames Heroku. They blame Heroku for not being transparent about application performance, they blame Heroku for data missing from New Relic, and they even allege that Heroku was delivering less horsepower to Rap Genius’ applications than the customer was paying for. Let’s look at these allegations one at a time and sort this out:
- Rap Genius being misled about the performance of their application. It is always difficult to blame the customer, because “the customer is always right”, but some analysis of the facts is in order. New Relic tracks the response time of the application from the perspective of the application server where the Ruby code and the New Relic agent are installed, and it also tracks response time from the perspective of the end user. Even if the application server numbers were off due to missing data from Heroku (as was the case in the first graph above), the New Relic end user response time numbers should have shown that end users were seeing a slower web site than a response time of 236 ms at the edge of the application server would have indicated. So Rap Genius should have been able to see the problem in the tool that they bought from Heroku. They might not have been able to see where the problem was, but they had no excuse for being unaware of the problem.
- To further flesh out the point above, if response time is as important to Rap Genius’ business as is alleged in the VentureBeat post, then Rap Genius should have been cross checking the New Relic numbers on an ongoing basis. There are several very affordable services that will load the pages of your web site from various places in the world and report those numbers back to you. While any APM vendor is always going to strive to give you the best possible information about the response time of your applications, the quality of their data is sometimes subject to the environment in which the application is running (more about this in the next bullet below).
- When New Relic updated its agent with the new Request Queuing metrics, it posted a blog about what it had done, why it had done it, and most importantly was very up front about the limitations of the approach used to collect the data. The blog post is here. The key statement in the post is “Since the Ruby agent lives in your Rails or Ruby application we can’t measure queue time directly. There’s no way for the agent to intercept a request before the application (and thus the agent) has received it”. This brings up an important characteristic of PaaS clouds that cannot be overlooked. If you are running a Java application in a JVM, the New Relic agent is doing byte code instrumentation of that JVM. There is little that can go on in that JVM that the New Relic Java agent cannot see.
- In contrast to the JVM example above, in a PaaS Cloud like Heroku, the New Relic agent is not running in the PaaS layer itself, it is running inside of the customer’s application. So the New Relic agent is abstracted from the underlying PaaS layer to the same extent that the actual Ruby or Rails application is abstracted from that layer.
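The cross-checking suggested above need not involve a commercial service; even a small script run from outside the platform can expose a gap between end-user latency and what an in-app agent reports. Below is a hedged sketch that times a full HTTP fetch with a monotonic clock; the URL is a placeholder, and a real check would run from several geographic locations on a schedule.

```ruby
require 'net/http'
require 'uri'

# Time an arbitrary block in milliseconds using a monotonic clock,
# so the measurement is immune to wall-clock adjustments.
def time_ms
  t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  t1 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  [result, ((t1 - t0) * 1000).round(1)]
end

# Fetch a page and report status plus total wall time as an end user sees it,
# including any queuing inside the hosting platform.
def timed_fetch(url)
  response, ms = time_ms { Net::HTTP.get_response(URI(url)) }
  { status: response.code.to_i, total_ms: ms }
end

# Usage (placeholder URL): timed_fetch('https://www.example.com/')
```

If numbers from a script like this consistently exceed the APM dashboard’s totals, time is being lost somewhere the agent cannot see, which is precisely what happened here.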
The bottom line of this analysis is that the ball is entirely in Heroku’s court to provide more transparency into the operation of its PaaS cloud layer, both to APM vendors like New Relic and to its customers. The CTO of Heroku, to his and Heroku’s credit, has published a blog post acknowledging where Heroku had fallen short and explaining what Heroku was doing to fix it. The lesson to the community is that just as it is completely unacceptable for Amazon to hide its latency and the composition of that latency from its customers, it is completely unacceptable for a PaaS cloud vendor to do the same. Both Heroku and New Relic deserve a tremendous amount of praise and respect for recognizing the situation (perhaps a bit belatedly in the case of Heroku) and then acting quickly to address it in the most aggressive and transparent manner possible. At the risk of pointing out the obvious, Amazon remains guilty of steadfastly clinging to its worse than useless SLA.
The Future of PaaS Cloud Application Performance
At the end of the day, this is all about transparency and trust. The customer of the cloud is going to have to trust that the cloud vendor is being transparent about how the various layers of the cloud service are performing, and that the data being provided by the cloud vendor and its APM partners is as accurate as possible. A really good next step for a PaaS cloud vendor like Heroku would be to do something like what Red Hat has done through its partnership with Correlsense for OpenShift. Similar agent-based solutions are available from AppEnsure, AppFirst, BlueStripe, and Boundary. For a PaaS vendor like Heroku, working with an APM vendor like Correlsense means embedding an agent into the guts of the Linux operating system that underlies the PaaS platform. Such an agent, due to where it sits in the stack, would have visibility into exactly where every layer of the PaaS stack is spending time executing customer requests. When PaaS vendors step up to that kind of transparency and visibility for their customers, then customers will trust PaaS clouds to run their most performance sensitive workloads.
Customers using PaaS Cloud offerings like Heroku are clearly reliant upon both Heroku and partnering monitoring vendors like New Relic to provide complete information about PaaS Cloud Application Performance. Being fully transparent in this regard is likely to prove to be both a technical and a business challenge for the PaaS cloud vendors.