VMware has recently posted an article on its Communities site, in the Business Critical Applications section, about monitoring SQL Servers that support a business critical application running in a vSphere environment. If this is how VMware thinks critical infrastructure services (database servers, application servers, web servers, messaging servers, etc.) that support business critical applications should be monitored, then it is no wonder that so many customers are struggling to get their business critical applications virtualized.

Before we get around to eviscerating VMware on this issue, let's step back and look at the situation as it regards virtualizing business critical applications. IT has virtualized the low hanging fruit: applications that IT owns and can virtualize without getting anyone's permission. Business critical applications have teams that own them and support them, and those teams answer to vocal and powerful constituents (sometimes including business unit executives, a CFO, and a CEO if the application is revenue generating). These teams, in general, see no benefit and only risk in putting an application that runs on its own dedicated hardware into a shared services environment.
The Approach Advocated in the VMware Post
The approach taken by VMware in this post is encapsulated in the following statement: "The methodologies for monitoring SQL Server on vSphere (our italics added for clarity) stay the same as performance monitoring SQL Server in a physical environment. The approach starts with using the common SQL Server performance counters to find resource constraints in CPU, memory, network, and disk resources, and then supplements this approach with ESXTop metrics."
There are three problems with this approach:
- You will find it extremely challenging to set thresholds for all of these resource counters that neither generate lots of false alarms (because you set the threshold too low) nor miss problems (because you set the threshold too high).
- If you get data from the Windows Server Perfmon counters, all rate-over-time data (which includes all of the CPU, memory swap rate, network I/O rate, and disk I/O rate metrics) will be warped by the timekeeping problem in virtual machines documented in this VMware White Paper. In other words, you cannot use the same approach to performance management in a virtual world as you used in a physical world.
- Even if you set the thresholds perfectly and diligently follow up on every contention issue as it is reported (having now set yourself up with a new full-time job), you will still miss performance problems in the application caused by the database. This is because not every application performance problem that is caused by the database will show up in these resource contention and utilization counters.
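The timekeeping issue in the second point above is worth making concrete. A rate-over-time counter is computed as a counter delta divided by an elapsed-time delta, and the elapsed time comes from the guest's own clock. The numbers below are purely hypothetical, but they sketch how lost clock ticks inflate the reported rate even when the raw counter is accurate:

```python
# Sketch with hypothetical numbers: how guest clock skew warps a
# rate-over-time metric such as disk I/O operations per second.
# A Perfmon-style rate is (counter delta) / (elapsed time), where the
# elapsed time is read from the guest's own clock.

real_elapsed = 12.0    # seconds of wall-clock time that actually passed
guest_elapsed = 10.0   # seconds the guest clock advanced (it lost ticks
                       # while the hypervisor had the VM descheduled)
ios_completed = 1200   # raw counter delta -- this part is accurate

true_rate = ios_completed / real_elapsed       # what really happened
observed_rate = ios_completed / guest_elapsed  # what the guest reports

# The guest reports 120 IO/s when the true rate was 100 IO/s: a 20%
# error introduced purely by timekeeping, not by the workload.
print(f"true: {true_rate:.0f} IO/s, guest-observed: {observed_rate:.0f} IO/s")
```

The same arithmetic applies to every per-second counter the guest computes, which is why the counter values themselves cannot be trusted the way they could on physical hardware.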
All of which basically means that if you take the approach advocated by VMware in this post, you will do nothing to help the owner of the application feel comfortable about hosting that application in your vSphere environment. In fact, you will leave yourself defenseless in "blamestorming" meetings: trying to explain that the database is not at fault by saying that these counters are in their normal ranges is both unconvincing and incomprehensible to the application owner. This puts you in the position of saying the wrong thing in a language that the application owner does not understand (it is hard to imagine why anyone would expect this to work). The fundamental premise of the VMware post is therefore wrong: you cannot monitor anything in the virtual world in the same way that you did in the physical world.
The Right way to Monitor SQL Server (or any Database) Performance
One of our fundamental tenets about performance management for virtualized systems and applications is that you cannot infer the performance of either a virtualized system or an application running on it by looking at resource utilization metrics. The reasons for this are numerous and are explained at length in our Performance and Capacity Management White Paper, but the short version of the story is that resources are so shared and so abstracted in a vSphere environment that the numbers describing their utilization are essentially useless.
If resource utilization metrics are not the key performance indicators for a database server, then what would that metric be? The key metric to judge both the performance of an application and an infrastructure element like a database server is response time, or the latency incurred in the requests made of the database server.
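As a minimal sketch of what "measure the latency of the requests" means in practice, the snippet below times each database request at the point where it is issued. It uses Python's built-in sqlite3 as a stand-in for SQL Server (the table and query are invented for illustration); the principle carries over to any database client:

```python
import sqlite3
import time

def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed_seconds).

    The elapsed time is the request latency: the metric that actually
    tells you whether the database is holding up the application.
    """
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return rows, time.perf_counter() - start

# sqlite3 stands in for SQL Server here; the table is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 9.99) for i in range(1000)])

rows, elapsed = timed_query(conn, "SELECT SUM(total) FROM orders")
print(f"query returned {rows[0][0]:.2f} in {elapsed * 1000:.3f} ms")
```

A commercial APM product does the equivalent of this for every request, without requiring you to touch the application code, and aggregates the results into latency distributions you can alert on.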
What is most ironic here is that VMware has a product that measures this latency perfectly, and does so with little to no effort for the administrator. That product is AppSpeed. A screen shot of AppSpeed measuring database latency is shown below. Note that in the case of AppSpeed and all of the other products shown below, database latency is within acceptable bounds, and therefore whatever the resource counters are showing is irrelevant.
Database Latency Measured by VMware AppSpeed
AppSpeed is a virtual appliance that works by decoding the HTTP, Java, and SQL protocols as they flow through the vSwitch in the vSphere host. AppSpeed is therefore a vSphere-only solution, unlike the agent-based solutions described below, which work whether the application is virtualized or not.
Why are Latency and/or Response Time So Important?
There are both technical and business/perception reasons why latency is such an important metric. The technical reason is that in a dynamic and shared infrastructure, the only way to really know the performance of an element of that infrastructure is to measure how long it takes to do what you are asking it to do. You simply cannot infer how long things are taking; you have to measure it directly.

The business/perception reason is that when the application owner says that the application is slow, they mean slow in response time terms. That means your answer as to where the problem is or is not needs to be given in terms that match their description of the problem, which means in response time or latency terms. In other words, if the application owner says that action X in the application is taking too long, your explanation of why the database server is not at fault should be based upon latency, not CPU Ready, memory swap rates, network I/O rates, or disk queue lengths. Note that it is not that these metrics are useless; it is that you want to start with latency, get alerted when it rises to an unacceptable level, and only then use these metrics to help figure out the root cause of the rise in latency. Doing it the other way around would require running ESXTop every couple of seconds, which is generally considered to be a bad idea.
Finally, latency and response time are the ultimate "canary in the coal mine" metrics: if latency is bad, then there is a problem; and if latency is not bad, then no matter how screwed up the underlying resource utilization metrics might be, there is not really a problem at all. So by starting with latency you stop yourself from spending a huge amount of time going down rabbit holes suggested by the resource metrics when in fact there is no real problem.
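The latency-first triage described above can be sketched in a few lines. Everything here is hypothetical (the SLO threshold, the counter names, and the "normal" values); the point is the ordering, where resource counters are consulted only after latency shows there is actually a problem:

```python
# Sketch of latency-first triage. The threshold, counter names, and
# "normal" values below are hypothetical illustrations.

LATENCY_SLO_MS = 250  # hypothetical acceptable response-time bound

def triage(latency_ms, resource_counters):
    """Diagnose latency first; use resource counters only for root cause."""
    if latency_ms <= LATENCY_SLO_MS:
        # Latency is fine, so whatever the counters show is irrelevant:
        # do not go down the rabbit hole.
        return "healthy"
    # Latency breached the SLO: now the resource metrics earn their keep.
    suspects = [name for name, value in resource_counters.items()
                if value["observed"] > value["normal"]]
    return f"slow ({latency_ms} ms); investigate: {suspects or 'unknown'}"

counters = {
    "cpu_ready_pct": {"observed": 14.0, "normal": 5.0},
    "disk_queue_len": {"observed": 2.0, "normal": 4.0},
}
print(triage(120, counters))  # healthy, despite the elevated CPU Ready
print(triage(600, counters))  # slow; CPU Ready is flagged as a suspect
```

Note the first call: CPU Ready is well above its "normal" value, yet because latency is within bounds, no alarm is raised. That is exactly the false-alarm class of problem that pure threshold-on-counters monitoring generates.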
Where Can You Get Database Latency Information?
The good news is that you can get information as to how long it is taking your database to respond to requests for work from a variety of excellent APM solutions in addition to VMware AppSpeed. Here are some good examples:
New Relic RPM (Monitoring as a Service)
New Relic is the only hosted APM solution that supports Java, .NET, Ruby and PHP applications and that offers out of the box (instant time to value) monitoring from the end user’s browser to the database and back. New Relic has just added Real User Monitoring (RUM) as a free additional feature of its service – all you have to do to use it is to turn on the new feature in the New Relic web dashboard.
AppDynamics (Available on Site or as a Service)
AppDynamics is the only Java transaction monitoring solution that is affordable for scaled-out open source environments such as hundreds or thousands of Tomcat servers, and that automatically discovers each application and the end-to-end transactions within those applications.
DynaTrace (an on-premise solution)
dynaTrace is the only APM solution that can track individual transactions from the user's browser through the web layer, the Java/.NET layer, and the database layer, and back to the browser in a real-time and deterministic manner. dynaTrace supports Java and .NET as well as an extensive set of enterprise middleware, making it well suited for highly complex multi-tier service oriented applications.
BlueStripe (an on premise solution)
BlueStripe is the only APM solution that provides end-to-end and hop-by-hop response time for every Windows and Linux application in your environment, irrespective of the language and deployment architecture of that application.
Attempting to infer the performance of a database server by looking at resource utilization metrics will fail technically, organizationally, and politically. In a shared and dynamic environment, the only way to truly assess the performance of an element of the infrastructure or an application is to measure how long it is actually taking to do its job. This makes response time and its companion metric, latency, the two most important metrics for virtualized and cloud-based systems. You should not attempt to virtualize a performance critical application unless you use a product like one of those listed above to first baseline the response time profile of the application while it is still on physical hardware, and then continue to monitor the response time of that application as you progressively virtualize pieces of it.