Exploring a Limitation of VMware DRS

I have been a big fan of VMware’s Distributed Resource Scheduler (DRS). VMware DRS is a feature that dynamically allocates and balances computing resources across the hosts in a cluster. In all of the environments I have worked with so far, DRS has been a fantastic tool for achieving and maintaining that balance. Recently, though, I came across a limitation of DRS that is worth mentioning.

I have been working with virtualization since long before the introduction of multi-core processors. In the beginning, when multiprocessor virtual machines were in their infancy, we were very careful and very selective about how many of them we ran in our environment.

Fast forward to today and we can deploy hosts with six cores per processor. That gives us a lot of processing power to take advantage of, and with great power comes great responsibility. In one specific instance, the infrastructure I was working with had ageing hosts that were slated for a server refresh, but the client had big plans for high-powered virtual machines in the meantime. When we originally built the cluster we were deploying single-processor virtual machines and had plenty of horsepower available. The client then switched gears and we started deploying two-processor virtual machines. The cluster continued to perform well, and so did the overall CPU performance of the hosts and the virtual machines.

The cluster in question was an eight-node cluster, with each node having four processors with four cores each and 128 GB of RAM. With the continued success and stability of the cluster, the client moved on to adding four-processor virtual machines. Although the CPU and memory of the hosts appeared to be in great shape, I started to get calls about the performance of the newer four-processor virtual machines that we had deployed.
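To put numbers on that hardware, here is a minimal Python sketch of the arithmetic. The host layout comes from the cluster described above; the virtual machine mix on the example host is purely hypothetical, and the function names are my own for illustration.

```python
from typing import Sequence

# Host layout from the post: 8 hosts, 4 sockets per host, 4 cores per socket.
HOSTS = 8
SOCKETS_PER_HOST = 4
CORES_PER_SOCKET = 4


def cores_per_host(sockets: int = SOCKETS_PER_HOST,
                   cores: int = CORES_PER_SOCKET) -> int:
    """Physical cores available to the CPU scheduler on one host."""
    return sockets * cores


def vcpu_to_core_ratio(vm_vcpu_counts: Sequence[int], physical_cores: int) -> float:
    """Total vCPUs placed on a host divided by that host's physical cores."""
    return sum(vm_vcpu_counts) / physical_cores


if __name__ == "__main__":
    per_host = cores_per_host()
    print("cores per host:", per_host)
    print("cores in cluster:", per_host * HOSTS)

    # Hypothetical mix on one host (not taken from the real environment):
    # six 4-vCPU VMs, ten 2-vCPU VMs and four 1-vCPU VMs.
    mix = [4] * 6 + [2] * 10 + [1] * 4
    print("vCPU to core ratio:", vcpu_to_core_ratio(mix, per_host))
```

The ratio alone does not say whether a host is in trouble, but it is a quick way to see how much the scheduler is being asked to juggle before looking at ready time.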

The DRS limitation I mentioned at the start of this post had now come into view. DRS was doing its job, and the hosts were evenly balanced across the cluster in both CPU and memory. The problem was that DRS had loaded a host with too many of the four-processor virtual machines. When I examined the output from esxtop, I could see that the %RDY of the four-processor virtual machines was well over 100 and in some cases topped 200.
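A note on reading those numbers: as I understand it, the %RDY that esxtop shows for a virtual machine is summed across the VM's worlds, so a four-processor virtual machine can report up to roughly 400. Dividing by the vCPU count gives an approximate per-vCPU figure that is easier to compare between machines of different sizes. The small Python sketch below does that division; the function name is hypothetical, and the result is an approximation because a VM group also contains a few non-vCPU worlds.

```python
def ready_per_vcpu(group_rdy_percent: float, vcpus: int) -> float:
    """Approximate per-vCPU ready time from a VM's group-level %RDY.

    Assumes the group value is dominated by the vCPU worlds, which is
    an approximation rather than an exact accounting.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return group_rdy_percent / vcpus


if __name__ == "__main__":
    # The readings mentioned in the post, applied to a 4-vCPU VM.
    for reading in (100.0, 200.0):
        print(reading, "->", ready_per_vcpu(reading, 4), "per vCPU")
```

Even on a per-vCPU basis, the readings we saw mean each virtual CPU was spending a large share of its time waiting to be scheduled, far above the single-digit values that are often quoted as a rule of thumb for a healthy virtual machine.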

Based on the results we were seeing, the question we asked ourselves was: is the problem with the virtual machines, or are we oversubscribed? This is where things got really interesting, and we could not believe the results of our next test. We changed the VMware DRS setting to manual so we could control the placement of the virtual machines on the hosts in the cluster and monitor the %RDY as we did so. Our first test was to spread the four-processor virtual machines as evenly as we could across all the hosts in the cluster. We had more dual-processor virtual machines than single-processor ones, and we tried different combinations of spreading the load. We were still not getting the results we expected.

To try something different, we placed all the four-processor virtual machines on a single host and moved every other virtual machine off it. Our thinking was that the CPU scheduler needed four processors available at the same time for each four-processor virtual machine, and that by grouping those machines together the scheduler would have an easier time allocating CPU cycles to them. This actually paid off: the %RDY times were much better and the performance of the four-processor machines improved dramatically.

DRS works well, but it does have its limitations, since it only considers overall CPU and memory utilization when deciding where to place virtual machines across the hosts in the cluster. In a previous post, A Look at VMTurbo Monitoring, I started to examine what VMTurbo brings to the table. As far as I know, it is the only third-party product that truly expands on VMware’s DRS by taking into account other factors, like %RDY, when deciding where to place virtual machines. That matters most when you are working with a mix of virtual machines of many different shapes and sizes.
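To illustrate the gap, here is a small Python sketch with made-up placement data. It is not how DRS or VMTurbo works internally; it only shows that two hosts can carry the same total number of vCPUs while one of them holds every wide virtual machine. The host names, the placement and the function names are all hypothetical.

```python
from typing import Dict, List


def total_vcpus(placement: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum of vCPUs per host, a crude stand-in for an aggregate load view."""
    return {host: sum(vcpus) for host, vcpus in placement.items()}


def wide_vms_per_host(placement: Dict[str, List[int]],
                      min_vcpus: int = 4) -> Dict[str, int]:
    """Count the VMs on each host that have at least min_vcpus vCPUs."""
    return {
        host: sum(1 for v in vcpus if v >= min_vcpus)
        for host, vcpus in placement.items()
    }


if __name__ == "__main__":
    # Hypothetical placement: each list holds the vCPU count of every VM on a host.
    placement = {
        "esx01": [4, 4, 4, 4, 2],
        "esx02": [2, 2, 2, 2, 2, 2, 2, 2, 1, 1],
    }
    print(total_vcpus(placement))        # same total on both hosts
    print(wide_vms_per_host(placement))  # very different wide-VM counts
```

A view based only on aggregate load treats these two hosts as equivalent, which is why a per-machine signal such as %RDY is worth watching alongside it.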

I think that as we move forward and the technology keeps improving at such an incredible pace, a pace that company budgets will in many cases never be able to keep up with, we must continue to evaluate the true capabilities of the environments that we support, even if that means slowing down or scaling back the pace and scope of the projects we are asked to deploy. When that is not an option, we have to be willing to examine and push for the proper third-party tools that we will need to help maintain the balance and performance of the infrastructure.
