Infrastructure as a Service (IaaS) clouds allow you to quickly provision and scale operating system images that you can then do with what you want. However, the nature of IaaS offerings is that the cloud provider purposely obscures what is really going on in its hardware environment from its customers. This leads to the “noisy neighbor” problem, among many others, and leaves the cloud provider’s customers guessing as to what is really going on.
The Cloud Customer’s View of IaaS Cloud Performance
Using Amazon as an example, the only real view into “performance” that an Amazon customer has is through the CloudWatch service. At first glance, CloudWatch seems to provide a wealth of information. The basic list of what is provided is below:
- Basic Monitoring for Amazon EC2 instances: seven pre-selected metrics at five-minute frequency, free of charge.
- Detailed Monitoring for Amazon EC2 instances: seven pre-selected metrics at one-minute frequency, for an additional charge.
- Amazon EBS volumes: eight pre-selected metrics at five-minute frequency, free of charge.
- Elastic Load Balancers: ten pre-selected metrics at one-minute frequency, free of charge.
- Amazon RDS DB instances: thirteen pre-selected metrics at one-minute frequency, free of charge.
- Amazon SQS queues: eight pre-selected metrics at five-minute frequency, free of charge.
- Amazon SNS topics: four pre-selected metrics at five-minute frequency, free of charge.
- Amazon ElastiCache nodes: twenty-nine pre-selected metrics at one-minute frequency, free of charge.
- Amazon DynamoDB tables: seven pre-selected metrics at five-minute frequency, free of charge.
- AWS Storage Gateways: eleven pre-selected gateway metrics and five pre-selected storage volume metrics at five-minute frequency, free of charge.
- Amazon Elastic MapReduce job flows: twenty-three pre-selected metrics at five-minute frequency, free of charge.
- Auto Scaling groups: seven pre-selected metrics at one-minute frequency, optional and charged at standard pricing.
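To make the tenant's side of this concrete, here is a minimal sketch of how a customer would pull one of these metrics with boto3. The function name and instance ID are illustrative placeholders, not anything defined in this article; the request shape itself is the standard CloudWatch `get_metric_statistics` call:

```python
# Sketch: build a CloudWatch get_metric_statistics request for the free
# 5-minute CPUUtilization metric of one EC2 instance. The function name
# and instance ID below are placeholders chosen for illustration.
from datetime import datetime, timedelta, timezone

def cpu_stats_request(instance_id, minutes=60, period=300):
    """Parameters for basic (5-minute granularity) CPU monitoring."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": period,                 # 300 s = the free 5-minute tier
        "Statistics": ["Average", "Maximum"],
    }

# With AWS credentials configured, the actual call would be:
# import boto3
# datapoints = boto3.client("cloudwatch").get_metric_statistics(
#     **cpu_stats_request("i-0123456789abcdef0"))["Datapoints"]
```

Note that `Period` is the knob being discussed throughout this article: 300 seconds for the free tier, 60 seconds if you pay for Detailed Monitoring, and nothing finer.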
In general, what CloudWatch provides is a view into the resource consumption and activity of your instances. This view is virtual in nature, abstracted from the underlying physical reality supporting the environment. In other words, when CloudWatch tells you how much CPU your instance is using, that figure is the amount of CPU the instance is using divided by the amount that has been allocated to the instance. The actual amount of CPU available on the physical server your instance runs on never factors into the equation at all. The same is true for all of the CloudWatch metrics – they are collected from a virtual perspective, and do not surface any contention that may be occurring at the physical layer of the infrastructure.
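To make that division concrete, here is some illustrative arithmetic. All the numbers are invented for the example; the point is only that the same guest activity produces two very different percentages depending on the denominator:

```python
# Illustrative arithmetic only (numbers are invented, not measured):
# a guest-reported CPU% is relative to the instance's allocation,
# not to the physical host's total capacity.
allocated_vcpus  = 2      # what the instance was granted
host_cores       = 32     # the physical reality, invisible to the tenant
guest_busy_vcpus = 1.8    # how much CPU the guest is actually burning

guest_cpu_pct = 100 * guest_busy_vcpus / allocated_vcpus  # what CloudWatch shows
host_cpu_pct  = 100 * guest_busy_vcpus / host_cores       # share of the host
print(guest_cpu_pct, host_cpu_pct)   # 90.0 vs 5.625
```

The tenant sees an instance running at 90%; whether the host those vCPUs are scheduled on is idle or saturated by other tenants never appears anywhere in the metric.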
As you can see above, you can get some metrics for free at five-minute intervals, and more metrics at one-minute intervals if you pay for them. In The Real-Time Big Data vSphere Management Problem, we discussed the need for real-time metrics, not one-minute or five-minute metrics (far too much can go wrong in 59 seconds, or in 4 minutes and 59 seconds). The same need presents itself here. Running performance-critical applications in a shared-tenant public cloud with, at best, visibility once every minute is simply not going to work for a lot of people.
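The problem with coarse intervals can be sketched with a toy example (the numbers are made up for illustration): a 30-second CPU saturation spike inside an otherwise quiet 5-minute window simply vanishes into the average that CloudWatch reports for that period:

```python
# Toy example: one simulated CPU reading per second over a 5-minute window.
# A 30-second saturation spike (100%) followed by 270 seconds near idle (5%).
samples = [100.0] * 30 + [5.0] * 270

five_min_avg = sum(samples) / len(samples)
print(round(five_min_avg, 1))   # 14.5 -- the spike disappears into the mean
```

Users whose requests landed inside those 30 seconds saw a saturated instance; the 5-minute datapoint says 14.5% and looks perfectly healthy.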
The focus upon resource utilization metrics as a proxy for infrastructure performance is fatally flawed. In Timekeeping in VMware Virtual Machines, VMware gets credit for being brutally honest about what happens to time-based metrics collected from the perspective of the virtual machine (hint: opening Task Manager in a virtualized instance of Windows Server is an exercise in futility). The only way around this is to measure end-to-end Infrastructure Latency and to surface it as the metric that definitively demonstrates the Quality of Service the cloud provider is delivering to its customers. The absence of such an approach among cloud vendors to date (including Amazon) will limit the adoption of public cloud services, as the metrics provided by CloudWatch are a poor substitute for measuring what is really going on and how long it is taking.
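Until vendors surface that metric themselves, about the best a tenant can do is sample latency from inside the guest. The sketch below times small synchronous writes to local storage as one rough proxy for storage-path latency; it is a tenant-side approximation under stated assumptions (path, sample count, and payload size are arbitrary choices), not the per-tenant, provider-side measurement this article argues for:

```python
# Rough tenant-side proxy: time small synchronous writes to local storage.
# This measures one component of infrastructure latency from inside the
# guest only; payload size and sample count are arbitrary illustrative values.
import os
import tempfile
import time

def sample_write_latency_ms(samples=20, size=4096):
    payload = b"\0" * size
    times_ms = []
    fd, path = tempfile.mkstemp()
    try:
        for _ in range(samples):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)               # force the I/O through to the device
            times_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    times_ms.sort()
    return times_ms[len(times_ms) // 2]   # median latency in milliseconds

print(f"median 4 KiB write+fsync latency: {sample_write_latency_ms():.2f} ms")
```

A sudden jump in this number, with no change in the guest's own workload, is exactly the kind of noisy-neighbor contention that CloudWatch's virtual-perspective metrics never show.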
The Cloud Provider’s View of IaaS Cloud Performance
Now, from the perspective of the cloud vendor, all is well and good. Supposedly the cloud vendor is making sure that resource bottlenecks are not impacting workload performance, but we really do not know whether that is the case. There is no SLA that commits the cloud provider to this, and no data is forthcoming as to how well it is being done, if at all. Furthermore, it is well understood that the way the cloud provider makes money is by sharing its underlying physical hardware to a greater degree (without disclosing this fact) than the enterprise customer would be comfortable with.
The IaaS Cloud Performance Management Problem will continue to be one of the two major factors impeding the adoption of public cloud services (multi-tenant security being the other). Inferring performance from resource utilization metrics does not work even in a simple single-tenant virtualized environment (vSphere in your own data center). It is worse than useless in multi-tenant public cloud environments built upon a virtualization platform. The only known fix for this issue is for cloud vendors to embrace end-to-end infrastructure latency as the quality of service metric and to surface it on a per-tenant and per-image basis to their customers.