So What Should a Cloud SLA Look Like?

In “Cloud SLAs Are Worthless But Does this Matter?”, we concluded that there are significant differences between how SLAs are perceived inside an IT organization and how the SLAs offered by a public cloud vendor are perceived. The principal difference appears to be that an IT SLA is an agreement that an IT organization strives to honor, whereas a public cloud SLA is more of a marketing statement designed so that the cloud vendor can never violate it.

This may be splitting hairs, but it does bring up a fundamental point. In the eyes of a business owner or an applications owner, when business functionality is implemented in an application, that application and all of its supporting infrastructure must perform well. And “perform” does not mean be available, and it does not mean use some normal set of resources; it means respond within acceptable time frames when users initiate actions or transactions in the application. If this is in fact the standard that applications owners expect SLAs to enforce, then 99.9% of all existing public cloud SLAs are 1) worthless, and 2) so worthless that applications with critical response time thresholds will never migrate to public clouds until this changes.

Let’s start figuring out how public cloud vendors should structure their SLAs by comparing the likely end-to-end environment of a user of an internal application with that of a user of an application in the cloud. Let’s pick the most starkly contrasting case as a start. A user in an enterprise running a performance-critical internal application may well be on a corporate PC, connected over the corporate LAN to the data center, where the entire IT infrastructure, from the hardware up through all of the systems software and middleware to the application itself, is corporately owned.

A user of the same application hosted on a public cloud could use the very same PC, go first through a bit of the corporate LAN to a router, then over the public internet (owned by N different providers) to the cloud vendor, where the company owns the application and the cloud vendor (in the case of an IaaS cloud) owns the operating system and everything beneath it.

So the first important point is this. Unless you pay for a dedicated network between your users and your public cloud vendor (which would defeat one of the economic points of using a public cloud in the first place), you will not be able to guarantee end user experience for users using applications hosted at a public cloud.

If we set aside the end user’s actual environment and the network between the user and the cloud, then what we are left with is measuring the response time of the application from the edge of the application system (the web servers) at the public cloud vendor. But this brings up the issue of the organizational boundary between the owner of the application (the customer of the cloud provider) and the cloud provider itself. The application owner can and should measure the response time of their application via cloud-aware applications response time management tools like New Relic, AppDynamics, dynaTrace, BlueStripe, and Coradiant (now part of BMC Software). But the cloud vendor is not going to sign up to deliver an average applications response time of 0.5 seconds from the perspective of the customer’s web server, because the cloud vendor is not responsible for the application itself (unless we are talking about SaaS, which will be the subject of an entirely different post).

Is a Meaningful SLA from a Public Cloud Vendor Possible?

If the cloud provider cannot sign up for applications response time because the cloud provider does not own the application, then what can and should the cloud provider sign up for? To reiterate the points above, availability, while essential, is useless as a measure of performance. Providing the customer with resource utilization metrics is generally useless as well, because the cloud provider is not going to expose the constraints in its physical infrastructure to its customers. The cloud provider is just going to tell the customer what percentage of the purchased resources the customer is using, which leaves hidden the true denominator (the physical constraint) of the calculation.
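To make the hidden denominator concrete, here is a minimal Python sketch. Every number and variable name in it is a hypothetical chosen for illustration; none of it comes from a real provider's reporting. It only shows that the percentage a customer is shown can look healthy while the physical host, which the customer cannot see, is oversubscribed.

```python
def utilization(used: float, capacity: float) -> float:
    """Return used / capacity as a percentage."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used / capacity


# Hypothetical figures, for illustration only.
purchased_ghz = 4.0       # CPU the customer bought
used_ghz = 2.0            # CPU the customer's workload is consuming
host_capacity_ghz = 32.0  # physical host capacity (not exposed to the customer)
host_demand_ghz = 40.0    # combined demand from all tenants on that host

# What the customer is shown: usage against what was purchased.
reported = utilization(used_ghz, purchased_ghz)

# What stays hidden: total demand against the physical constraint.
host_pressure = utilization(host_demand_ghz, host_capacity_ghz)

print(f"reported to customer: {reported:.0f}%")
print(f"physical host demand: {host_pressure:.0f}%")
```

With these made-up inputs the customer sees 50% while the host is being asked for 125% of what it has, which is why the reported figure says little about how the application will actually respond.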

So what is needed is a measure of how responsive the layer of software owned by the cloud provider is to the layer of software owned by the cloud customer. In the case of an IaaS cloud, this means how responsive the OS provided by the cloud vendor is to requests for work from the application owned by the cloud customer. In the case of a PaaS cloud, this means how responsive the Ruby, Java, .NET, or other application framework layer is to requests for work from the applications running on it.

In the case of an IaaS cloud, this means that the measurement point will have to sit between the application and the operating system. Currently only two vendors live at this layer in the stack: BlueStripe and AppFirst. In the case of a PaaS cloud, this means that vendors who monitor applications written to specific application frameworks will have to enhance their products to distinguish between time spent in the application itself and time spent in the application framework (Ruby, Java, .NET, etc.).
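As a rough illustration of that split, the Python sketch below times the calls that cross into the provider-owned layer separately from the rest of the request. It is not how BlueStripe, AppFirst, or any other product works; `LayerTimer`, `platform_call`, and `handle_request` are hypothetical names, and the sleeps are stand-ins for real application and platform work.

```python
import time
from contextlib import contextmanager


class LayerTimer:
    """Accumulates time spent inside calls to the provider-owned layer."""

    def __init__(self):
        self.platform_seconds = 0.0

    @contextmanager
    def platform_call(self):
        # Wrap each call that crosses from customer code into the platform.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.platform_seconds += time.perf_counter() - start


def handle_request(app_work, platform_work):
    """Run one request and split its response time by owning layer."""
    timer = LayerTimer()
    start = time.perf_counter()
    app_work()                      # customer-owned application code
    with timer.platform_call():
        platform_work()             # provider-owned OS or framework call
    total = time.perf_counter() - start
    return {
        "total": total,
        "platform": timer.platform_seconds,
        "application": total - timer.platform_seconds,
    }


# Stand-ins for real work: sleeps simulate application code and a platform call.
breakdown = handle_request(lambda: time.sleep(0.01), lambda: time.sleep(0.05))
print(breakdown)
```

The "platform" figure is the part a provider could plausibly be held to, since it excludes time spent in code the customer owns.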

A New Definition of Cloud Applications Performance Management

We have posted on many occasions about the importance of using applications response time to measure and ensure the performance of applications in virtualized and IT as a Service environments. Response time is the only reliable metric of applications performance in a virtualized system, because inferring applications performance by any other means does not work in a dynamic and shared environment like VMware vSphere, nor does it work in any kind of public cloud.

The utter worthlessness of Amazon’s SLAs (which were not violated during their outage, demonstrating their utter worthlessness, at least in the eyes of their customers) now creates the demand for a new set of response time metrics, implemented in a new set of APM tools (or perhaps as enhancements to leaders in the virtualization-aware APM space like dynaTrace, New Relic, AppDynamics, BlueStripe, AppFirst, and Coradiant). Once the ability to understand the responsiveness of the layer of software owned by the cloud provider to the layer owned by the cloud customer exists, then we can build reasonable and relevant SLAs around those metrics. Until then, public cloud provider SLAs will remain so worthless that all of the IT professionals who take the “over my dead body” position with respect to public clouds will remain absolutely correct.
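One possible shape for such an SLA, sketched in Python under stated assumptions: the provider commits that some percentile of platform-layer response times stays under a threshold. The percentile, the 10 ms threshold, the sample values, and the function names are all hypothetical; no cloud vendor's actual SLA terms are being described.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def sla_met(samples_ms, pct, threshold_ms):
    """True if the given percentile of response times is within the threshold."""
    return percentile(samples_ms, pct) <= threshold_ms


# Hypothetical platform-layer response times in milliseconds, one slow outlier.
platform_ms = [2, 3, 2, 4, 3, 2, 50, 3, 2, 3]

print("median within 10 ms:", sla_met(platform_ms, 50, 10))
print("95th percentile within 10 ms:", sla_met(platform_ms, 95, 10))
```

A percentile is used here rather than an average because a single slow platform call is exactly what a user notices, and an average would smooth it away.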