CPU Contention in a VMware Virtual Environment

Virtualization gives system administrators a lot of tools that make day-to-day operation of our environments easier and speed up many administration tasks. Take, for example, the ability to add resources to a virtual machine. You can add processors or memory, or increase disk space, within a matter of minutes and with very little downtime. On a physical host you would need to purchase the hardware, wait for it to arrive, and then schedule downtime to install it. This speed and power can be both a blessing and a curse. Once application owners understand how easy it is to add resources to a virtual machine, the requests start coming in any time they think there is the slightest need for more.

In this post I am going to focus on CPU usage, and especially on working with multi-processor virtual machines. In VMware ESX and ESXi the vmkernel handles the scheduling of CPU resources for the virtual machines. With multi-processor virtual machines there is a catch. If you have a dual-CPU virtual machine, the scheduler must have two processors available at the same time for that virtual machine, or the virtual machine will wait until enough processors are free. This time spent waiting is called %ready. The %ready numbers can be monitored in real time using esxtop from inside the service console, resxtop from the vMA appliance, or tools such as vKernel’s Capacity Analyzer and Optimization Pack and Vizioncore’s vFoglight. The higher the %ready number, the greater the contention for CPU resources in the environment. Let’s look at a practical example.
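If you capture esxtop in batch mode (esxtop -b) to a CSV file, the %ready columns can be pulled out with a short script. What follows is only a rough sketch in Python. It assumes the column headers end in "% Ready" and carry the virtual machine name inside "Group Cpu(id:name)"; the sample data, host name and virtual machine names are made up for illustration, and the headers on your version may look different.

```python
import csv
import io
import re

# Hypothetical sample shaped like esxtop batch output; real files have many more columns.
SAMPLE = r'''"(PDH-CSV 4.0) (UTC)(0)","\\esx01\Group Cpu(101:web01)\% Ready","\\esx01\Group Cpu(102:db01)\% Ready"
"01/01/2010 10:00:00","2.0","11.0"
"01/01/2010 10:00:05","4.0","13.0"
'''

# Assumed header layout: ...\Group Cpu(<id>:<name>)\% Ready
READY_COLUMN = re.compile(r"Group Cpu\(\d+:(?P<name>[^)]+)\)\\% Ready$")


def average_ready(csv_text):
    """Return the average %ready per virtual machine across all samples."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Map column index -> virtual machine name for every %ready column found.
    columns = {}
    for index, title in enumerate(header):
        match = READY_COLUMN.search(title)
        if match:
            columns[index] = match.group("name")

    totals = {name: 0.0 for name in columns.values()}
    samples = 0
    for row in reader:
        if not row:
            continue  # skip blank lines
        samples += 1
        for index, name in columns.items():
            totals[name] += float(row[index])

    if samples == 0:
        return {}
    return {name: total / samples for name, total in totals.items()}


if __name__ == "__main__":
    for vm, ready in average_ready(SAMPLE).items():
        print(f"{vm}: {ready:.1f}% ready")
```

Averaging over several samples smooths out short spikes, which makes it easier to see which virtual machines are waiting on the scheduler consistently.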

You have a dual-socket, quad-core host, which gives you a total of eight CPU cores for your virtual machines. In theory, you can run eight single-processor virtual machines, four dual-processor virtual machines, or two quad-processor virtual machines before any contention happens and the %ready numbers start to climb. That would be a safe assumption in theory, but we also need to take into account the hypervisor's own management layer (the service console on ESX), which is always pinned to CPU0 and needs CPU time as well.
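The arithmetic in this example can be written down as a quick check. This is only a back-of-the-envelope sketch in Python: the function name and the reserved_cores parameter are my own, and setting aside a whole core for the hypervisor is a deliberate simplification, not how the scheduler actually accounts for it.

```python
def fits_without_contention(physical_cores, vm_vcpus, reserved_cores=0):
    """Return True if every vCPU can have its own physical core at once.

    physical_cores -- total cores in the host
    vm_vcpus       -- list with the vCPU count of each virtual machine
    reserved_cores -- hypothetical allowance for the hypervisor itself
    """
    usable_cores = physical_cores - reserved_cores
    return sum(vm_vcpus) <= usable_cores


host_cores = 2 * 4  # two sockets, four cores each

layouts = {
    "eight single-processor VMs": [1] * 8,
    "four dual-processor VMs": [2] * 4,
    "two quad-processor VMs": [4] * 2,
}

if __name__ == "__main__":
    for label, vcpus in layouts.items():
        ideal = fits_without_contention(host_cores, vcpus)
        with_overhead = fits_without_contention(host_cores, vcpus, reserved_cores=1)
        print(f"{label}: fits in theory={ideal}, fits with one core set aside={with_overhead}")
```

All three layouts fit exactly when the full eight cores are counted, and none of them fit once a core is set aside, which is why the theoretical numbers are a ceiling and not a target.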

Even in the best designed environments there will be some CPU contention, and that is okay. Any %ready number below 5% is considered the optimal area to be in. Once your %ready number climbs to between five and ten percent, you need to start paying attention when adding more virtual machines or more CPU cores to existing virtual machines. We can call this the warning area. Once the %ready numbers climb higher than ten percent, you have reached the danger area, where performance will be impacted for those virtual machines. Your host could show 50% overall CPU utilization and still have CPU contention affecting the performance of your virtual machines.
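These three bands are easy to turn into a small helper for a report or an alert. Here is a minimal sketch in Python; the function name and labels are my own, and treating exactly 5% as the start of the warning area is a choice I made for the example.

```python
def classify_ready(ready_percent):
    """Map a %ready value to the optimal, warning or danger area."""
    if ready_percent < 0:
        raise ValueError("%ready cannot be negative")
    if ready_percent < 5:
        return "optimal"
    if ready_percent <= 10:
        return "warning"
    return "danger"


if __name__ == "__main__":
    # Made-up readings for illustration only.
    readings = {"web01": 2.3, "app01": 7.5, "db01": 14.0}
    for vm, ready in readings.items():
        print(f"{vm}: {ready}% ready -> {classify_ready(ready)}")
```

Note that the check looks only at %ready and not at overall CPU utilization, since a host can look half idle and still have virtual machines in the danger area.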

To review, CPU contention is one of the hidden issues in your environment that you will not find unless you know where to look. The best tools for spotting it are esxtop from inside the service console of the host, resxtop from the vMA appliance, or products such as Akorri BalancePoint, Zenoss, vKernel’s Capacity Analyzer and Optimization Pack, and Vizioncore’s vFoglight. The best defense against CPU contention is understanding how the scheduler interacts with multi-processor virtual machines and taking that into account whenever you use multi-processor systems.

In my opinion, it is better to scale out than to scale up, and this will give you the best overall ROI in the virtual environment. When watching your %ready, any numbers below five percent are good numbers. When the numbers get to between five and ten percent you are in the yellow or warning area, and you should keep an eye on those specific virtual machines when adding any CPU resources to the environment. Once the %ready number climbs higher than ten percent you are in the red or danger area, where performance problems will become more and more prevalent.

A good whitepaper on performance analysis methods can be found at http://www.vmware.com/files/pdf/perf_analysis_methods_tn.pdf. It is worth reading if you want to dig deeper into any performance-related issues.
