The Virtualization Practice is in the middle of physically moving our data center servers from one city to another. Since it proved impractical to keep them online while they were on the moving van, we moved the web site to a public cloud until we could rebuild our data center in our new location. We encountered several interesting issues with public cloud computing in this process.
Public Cloud Reality
As documented in Public Cloud Reality is Much Different than the Hype, there is much to think about when moving applications that must work all of the time, and must work well all of the time, into a public cloud. The bottom line of that post is that this is far from the “look Ma, no hands” exercise with which public cloud computing is often promoted. In Public Cloud Reality: Application Security is in your Hands we documented where the line falls between what the cloud provider should be responsible for and where the cloud customer has to take responsibility. The bottom line of that post is that the security of your application is your problem, not the cloud provider's problem.
Cloud Application Performance Management
Just as the security of your application is your responsibility in the cloud, the performance of your application is your responsibility in the cloud. The bottom line is that if you are running an application that you care about in a public cloud, you are crazy if you do not monitor that application for response time on a continuous basis. We were fortunate enough to have chosen New Relic to monitor our site some time ago. We chose New Relic because our site is based upon WordPress, WordPress is written in PHP, and New Relic happened to be an easy-to-use APM as a Service offering with a PHP agent.
Since we have been using New Relic for quite some time, we have a very good feel for what constitutes “normal” performance. We define normal performance as a response time, from the perspective of the WordPress application server, of less than 500 ms. That means that when requests come in from the Internet for pages, it takes the app server and the database server together less than 500 ms to respond to those requests. This is a number that we watch very closely. Upgrades to WordPress have affected this number. Upgrades to WordPress plugins have affected this number. Changes in the external services that we link to have affected this number. And guess what, moving to a public cloud affected this number.
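To make the 500 ms rule concrete, here is a minimal sketch of the kind of check an APM tool performs on a window of response-time samples. It is not New Relic's API: the function name `summarize_response_times` is hypothetical, and the sample values are made up for illustration rather than taken from our site.

```python
from statistics import mean

# Threshold from the text: "normal" is an app-server response time under 500 ms.
NORMAL_RESPONSE_MS = 500.0


def summarize_response_times(samples_ms, threshold_ms=NORMAL_RESPONSE_MS):
    """Summarize a window of response-time samples (in ms) against a threshold."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    average = mean(samples_ms)
    # Count individual requests that met or exceeded the threshold.
    slow_count = sum(1 for s in samples_ms if s >= threshold_ms)
    return {
        "average_ms": average,
        "slow_fraction": slow_count / len(samples_ms),
        "within_normal": average < threshold_ms,
    }


if __name__ == "__main__":
    # Illustrative, made-up samples; not measurements from the article.
    before_move = [420.0, 380.0, 450.0, 410.0]
    after_move = [980.0, 1040.0, 950.0, 1010.0]
    print(summarize_response_times(before_move))
    print(summarize_response_times(after_move))
```

Reporting the fraction of slow requests alongside the average is a deliberate choice: an average can stay under the threshold while a meaningful share of individual requests are slow.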
If you look at the lower right corner of the image below, you will see an area highlighted in red. That shows what happened to our Application Response Time when we moved our web site into the cloud. Notice that the response times are roughly double the desired 500 ms. What this means is that we are not going to be in this particular public cloud for very long, and we are pretty unlikely to be in any public cloud for very long.
The Cloud SLA Issue
This problem highlights the issue with cloud SLAs. While our web site was unacceptably slow, the cloud provider's SLAs were not being violated. This is because the SLAs of our cloud provider, along with those of Amazon and just about everyone else, are worse than useless. They take time to read, but mean nothing. The underlying problem is that the cloud vendor is not even measuring what they need to measure to be a meaningful part of the process of ensuring application performance. What the cloud vendor needs to measure is their infrastructure latency: from our app server to our database server over their network, to their storage and back again. If that number were, for example, less than 50 ms, then the cloud vendor could credibly say to us, “looks like it is not our infrastructure, but rather it is your application”. But measuring infrastructure latency on a multi-tenant basis (doing it individually for each tenant across the shared infrastructure of the cloud) is today an unsolved problem.
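As a rough illustration of the reasoning above, a tenant can at least probe one leg of that path from the inside. The sketch below times a TCP connection from the app server to the database server and applies the 50 ms example figure from the text. It is only an approximation under stated assumptions: a connection setup measures network round trip, not storage latency, and it is not the per-tenant measurement across shared infrastructure that only the provider could make. The function names are hypothetical.

```python
import socket
import time

# Example figure from the text: infrastructure latency under 50 ms
# would point at the application rather than the infrastructure.
INFRA_LATENCY_BUDGET_MS = 50.0


def tcp_connect_latency_ms(host, port, timeout=2.0):
    """Time one TCP connection setup to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0


def likely_culprit(infra_latency_ms, budget_ms=INFRA_LATENCY_BUDGET_MS):
    """Apply the text's rule of thumb to a single latency measurement."""
    if infra_latency_ms < budget_ms:
        return "application"
    return "infrastructure"
```

In use, you would call `tcp_connect_latency_ms` with your own database host and port (for MySQL behind WordPress, typically port 3306), repeat it many times, and look at the distribution rather than a single reading before passing a value to `likely_culprit`.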
Public cloud vendors do not measure what they need to measure in order to provide infrastructure performance assurances to their customers. This is why their SLAs are useless. This is why, if you put your application in a public cloud, you absolutely need to measure its response time with a cloud application performance management solution. And when you have application response time issues and are absolutely convinced (by your APM tool) that the problem is not in your application, good luck having a meaningful conversation with your cloud provider about it being their problem.