On July 2nd 2012, Dell announced that it had entered into a definitive agreement to buy Quest Software. Quest will become part of the Dell Software Group, which is run by John Swainson, formerly the CEO of CA. In “Dell a Virtualization Management Leader?” posted almost a year ago, we explored how Dell might combine the product assets that it has licensed from DynamicOps (sold by Dell as VIS Creator – see the product review here). The basic idea was that monitoring of the virtualized environment would be combined with VIS Creator’s ability to dynamically provision services, so that dynamically provisioned services could be offered with performance and availability assurances. The prospect of Dell bringing the entire portfolio of Quest assets to bear fundamentally transforms both the notion of automated service assurance for dynamically provisioned services, and the entire systems management business.
Articles Tagged with Virtualization Performance Management
Microsoft threw down the gauntlet today, right at the feet of Amazon’s AWS – launching a revamped PaaS offering, a brand new IaaS offering (run whatever you want in an Azure-hosted image), and significant partnerships with ecosystem vendors that will add value to Azure and round out its value for Azure customers.
Big changes are afoot in the management software business. First Quest Software agrees to be acquired by a private equity firm – usually a sign that cuts need to be made that would trash a company’s stock if it were publicly held. Then rumors crop up that Dell is going to acquire Quest, which would certainly transform the virtualization management business as discussed in this post. Now comes news from Bloomberg that the Dell/Quest deal is off, at least for the time being. So what is really going on here?
Infrastructure as a Service (IaaS) clouds allow you to quickly provision and scale up operating system images (which you can then do with as you please). However, the nature of IaaS offerings is that the cloud provider purposely obscures what is really going on in its hardware environment from its customers. This leads to the “noisy neighbor” problem, among many others, and leaves the customer of the cloud provider guessing as to what is really going on.
The Cloud Customer’s View of IaaS Cloud Performance
Using Amazon as an example, the only real view into “performance” that an Amazon customer has is through the CloudWatch service. At first glance, CloudWatch seems to provide a wealth of information. The basic list of what is provided is below:
- Basic Monitoring for Amazon EC2 instances: seven pre-selected metrics at five-minute frequency, free of charge.
- Detailed Monitoring for Amazon EC2 instances: seven pre-selected metrics at one-minute frequency, for an additional charge.
- Amazon EBS volumes: eight pre-selected metrics at five-minute frequency, free of charge.
- Elastic Load Balancers: ten pre-selected metrics at one-minute frequency, free of charge.
- Amazon RDS DB instances: thirteen pre-selected metrics at one-minute frequency, free of charge.
- Amazon SQS queues: eight pre-selected metrics at five-minute frequency, free of charge.
- Amazon SNS topics: four pre-selected metrics at five-minute frequency, free of charge.
- Amazon ElastiCache nodes: twenty-nine pre-selected metrics at one-minute frequency, free of charge.
- Amazon DynamoDB tables: seven pre-selected metrics at five-minute frequency, free of charge.
- AWS Storage Gateways: eleven pre-selected gateway metrics and five pre-selected storage volume metrics at five-minute frequency, free of charge.
- Amazon Elastic MapReduce job flows: twenty-three pre-selected metrics at five-minute frequency, free of charge.
- Auto Scaling groups: seven pre-selected metrics at one-minute frequency, optional and charged at standard pricing.
In general, what CloudWatch provides is a view into the resource consumption and activity of your instances. This view is virtual in nature, and is abstracted from the underlying physical reality supporting the environment. In other words, when CloudWatch tells you how much CPU your instance is using, that number is the instance’s busy time divided by the capacity allocated to that instance. The actual amount of CPU available on the physical server that your instance is running on never factors into the equation at all. The same is true for all of the CloudWatch metrics – they are from a virtual perspective, and do not surface any contention that may be occurring at the physical layer in the infrastructure.
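The point above can be sketched with a toy calculation. The numbers below are hypothetical and the “ready time” term is borrowed from hypervisor terminology; CloudWatch does not expose host-level scheduling data, which is precisely the problem.

```python
# Sketch of why instance-level CPU% hides physical contention.
# All numbers are hypothetical illustrations, not CloudWatch output.

def instance_cpu_percent(busy_seconds, allocated_seconds):
    """CPU% as the guest reports it: busy time over time allocated to the instance."""
    return 100.0 * busy_seconds / allocated_seconds

# Two instances report identical utilization...
a = instance_cpu_percent(30, 60)   # 50.0
b = instance_cpu_percent(30, 60)   # 50.0

# ...but instance B's host is oversubscribed: the hypervisor made it wait
# ("ready time") before its allocated slices were actually scheduled.
ready_seconds_b = 45               # invisible in the instance-level metric

# The wall-clock view the application actually experiences differs sharply:
effective_b = 100.0 * 30 / (60 + ready_seconds_b)  # ~28.6% of real time
print(a, b, round(effective_b, 1))
```

Both instances look healthy at 50% utilization, yet one of them is getting barely half the real-time CPU service of the other.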
As you can see above, you can get some metrics for free at 5-minute intervals, and more metrics at 1-minute intervals if you pay for them. In The Real-Time Big Data vSphere Management Problem, we discussed the need for real-time metrics, not one-minute or five-minute metrics (as way too much can go wrong in 59 seconds, or 4 minutes and 59 seconds). This same need presents itself here. Running performance-critical applications in a shared tenant public cloud and only getting, in the best case, visibility every one minute is just not going to work for a lot of people.
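A toy example, with made-up per-second samples, shows how a 5-minute average can completely mask a sub-minute saturation event:

```python
# Sketch: a 5-minute average hides a 30-second saturation event.
# Hypothetical per-second CPU samples over a 300-second window.

samples = [20.0] * 300          # steady 20% load
for s in range(100, 130):       # a 30-second spike to 100% (saturation)
    samples[s] = 100.0

five_min_avg = sum(samples) / len(samples)
print(five_min_avg)             # 28.0 -- looks perfectly healthy
print(max(samples))             # 100.0 -- the event the average hid
```

At 28% average utilization nothing appears wrong, yet the application spent 30 seconds pegged at 100%, which is exactly the kind of event a user would feel and a 5-minute metric would never show.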
The focus upon resource utilization metrics as a proxy for infrastructure performance is fatally flawed. In Timekeeping in VMware Virtual Machines, VMware gets credit for being brutally honest about what happens to time-based metrics collected from the perspective of the virtual machines (hint: opening Task Manager in a virtualized instance of Windows Server is an exercise in futility). The only way around this is to measure end-to-end Infrastructure Latency and to surface it as the metric that definitively demonstrates the Quality of Service that the cloud provider is delivering to its customers. The absence of such an approach by cloud vendors to date (including Amazon) will limit the adoption of public cloud services, as the metrics provided by CloudWatch are a poor substitute for measuring what is really going on and how long it is taking.
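The latency-first approach described above can be sketched as follows. This is a minimal, hypothetical illustration: it times a small synchronous disk write as a stand-in for an infrastructure-touching operation; in a real cloud image the operation measured would be the actual storage or network round trips the application performs.

```python
# Sketch: measure what the application experiences (latency), rather than
# inferring health from utilization counters. Hypothetical example only.
import os
import tempfile
import time

def measure_latency_ms(op):
    """Wall-clock latency of a single operation, in milliseconds."""
    start = time.perf_counter()
    op()
    return (time.perf_counter() - start) * 1000.0

def small_write():
    # Stand-in for an infrastructure operation: a 4 KB synchronous write.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        os.fsync(fd)
    finally:
        os.close(fd)
        os.remove(path)

latencies = sorted(measure_latency_ms(small_write) for _ in range(20))
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"p95 write latency: {p95:.2f} ms")
```

A rising p95 here reflects contention on the shared physical layer directly, whether or not any utilization counter ever looks alarming.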
The Cloud Provider’s View of IaaS Cloud Performance
Now, from the perspective of the cloud vendor, all is well and good. Supposedly the cloud vendor is making sure that resource bottlenecks are not impacting workload performance, but we really do not know whether that is the case. There is no SLA that commits the cloud provider to this, and no data is forthcoming as to how well it is being done, if at all. Furthermore, it is well understood that the way the cloud provider makes money is by sharing its underlying physical hardware to a greater degree than the enterprise customer would be comfortable with (and not disclosing this fact).
The IaaS Cloud Performance Management Problem will continue to be one of two major factors impeding the adoption of public cloud services (multi-tenant security being the other). Inferring performance from resource utilization metrics does not work in a simple single-tenant virtualized environment (vSphere in your data center). It is worse than useless in multi-tenant public cloud environments that are built upon a virtualization platform. The only known fix for this issue is for cloud vendors to embrace end-to-end infrastructure latency as the quality of service metric and to surface this metric on a per-tenant and per-image basis to their customers.
We all pretty much know that we can buy Infrastructure as a Service (IaaS), Development/Runtime Platforms as a Service (PaaS), Software as a Service (SaaS), Security as a Service, and Cloud Storage as a Service, among other things – but we can also buy monitoring as a service, at both the infrastructure level and the application level. This is an intriguing idea, and one that is rapidly gaining traction. Monitoring as a Service (MaaS) carries with it some unique benefits, but it also carries some trade-offs, especially when evaluated against on-premise solutions.
There has recently been a spate of articles and blogs attempting to create a contest between “Network Performance Management” tools and “Application Performance Management” tools. This includes a Network Computing survey that finds fault with APM solutions, and a SOA World Magazine comparison that tries to pit the two types of solutions against each other. This is silly and unproductive. It is far more productive to approach the problem from the perspective of what your needs and applications actually look like.