In my last post I was Exploring a Limitation of VMware DRS, and I have since encountered another situation with similar symptoms but a quite different resolution. The problem occurred on a VMware ESX 3.5 cluster and specifically affected Windows 2008 R2 64-bit virtual machines configured with four processors and eight gigabytes of RAM. These virtual machines were taking an extremely long time to reboot. During the reboot, ESXTOP showed insane %RDY values, with spikes climbing over 200. When the reboot finally finished, several services had failed to start.
During the troubleshooting process, we wanted to see if there was any difference in boot speed when we performed a cold boot. The cold boot completed faster than a reboot, but we still saw extremely high %RDY values, and services still failed to start during the virtual machine's boot process.
Considering the size and amount of resources these virtual machines were using, my original thought was that there was contention or a CPU scheduler issue that would have to be dealt with. Further examination of ESXTOP did not indicate any real contention issues. Each host in the cluster had 16 cores and 128GB of RAM, which should have been able to handle these beefy virtual machines without any trouble.
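To put the %RDY spikes in perspective, here is a minimal sketch of how a VM-level figure can be split across vCPUs. It assumes that ESXTOP reports %RDY for a virtual machine as a sum across its vCPUs, so a four-processor virtual machine can show values well above 100. The function name and the 10 percent per-vCPU threshold are my own illustration (the threshold is a commonly quoted rule of thumb, not something from the KB article).

```python
# Assumption: a commonly quoted rule of thumb, not a value from the KB article.
RULE_OF_THUMB_PERCENT = 10.0


def ready_per_vcpu(vm_rdy_percent: float, vcpus: int) -> float:
    """Split a VM-level %RDY figure evenly across its vCPUs."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return vm_rdy_percent / vcpus


# A four-vCPU virtual machine spiking to 200 %RDY, as seen during the reboots.
per_vcpu = ready_per_vcpu(200.0, 4)
print(f"{per_vcpu:.1f}% ready per vCPU")  # prints "50.0% ready per vCPU"
print("worth investigating:", per_vcpu > RULE_OF_THUMB_PERCENT)
```

Under those assumptions, a spike of 200 works out to roughly half of each vCPU's time spent waiting to be scheduled, which is why the numbers looked so alarming even though the hosts were not obviously overloaded.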
Even though Windows 2008 R2 is a supported operating system on the VMware ESX 3.5 infrastructure, I really began to think that this issue could be a misconfiguration of the guest operating system, or at least something to do with Windows 2008 R2 when virtualized in general.
I had not noticed any of these issues on the other Windows 2008 R2 virtual machines that were deployed with single or dual processors. There were four of these four-processor virtual machines in total, in matched pairs running on separate clusters. Only the pair running together on one cluster was experiencing slow reboots. The other pair, configured in exactly the same way, showed none of these symptoms. A Google search turned up the VMware KB article Slow reboot of vSMP virtual machines on ESX when a lot of guest memory is page-shared.
Per the article, this problem occurs because of changes in the architecture of certain CPUs. These changes affect the way that ESX hosts perform COW (Copy-on-Write) memory operations when a virtual machine uses vSMP.
The solution, as it turns out, was to disable page-sharing at either the host level or the virtual machine level. You will need to be careful if you plan on making this change, because the virtual machine will then allocate all of the memory assigned to it. This behavior can cause memory paging and memory over-subscription, which will slow down the overall performance of the host and its virtual machines.
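Before turning page-sharing off, it is worth doing some quick arithmetic on memory headroom. The sketch below assumes the worst case, where every virtual machine on the host ends up backing its full configured memory with physical RAM. The function name, the overhead parameter, and the count of twenty virtual machines are hypothetical; only the 128GB host and the 8GB virtual machine size come from the environment described here.

```python
def worst_case_headroom_gb(host_ram_gb, vm_sizes_gb, overhead_gb=0.0):
    """Return host RAM left over if every VM uses all of its configured memory.

    A negative result means the host would be over-subscribed in the worst case.
    overhead_gb is a hypothetical allowance for hypervisor and per-VM overhead.
    """
    demand_gb = sum(vm_sizes_gb) + overhead_gb
    return host_ram_gb - demand_gb


# Hypothetical example: a 128GB host running twenty 8GB virtual machines.
headroom = worst_case_headroom_gb(128, [8] * 20)
if headroom < 0:
    print(f"Over-subscribed by {-headroom:.0f} GB without page-sharing")
else:
    print(f"{headroom:.0f} GB of headroom without page-sharing")
```

If the result is negative or close to zero, disabling page-sharing only on the affected virtual machines is probably the safer of the two options.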
To disable page-sharing on the ESX host:
- Log in to VirtualCenter (or the ESX host directly) with an administrative account using the VMware Infrastructure (VI) Client.
- Click on the ESX host on which you want to disable page-sharing.
- Click the Configuration tab.
- Click the Advanced Settings link.
- Click Mem in the Advanced Settings window.
- Look for the Mem.ShareScanGHz option and set the value to 0. Note: By default, Mem.ShareScanGHz is set to 4.
- Click OK.
- Reboot the ESX host.
If disabling page-sharing for the entire ESX host is not an option, you can disable page-sharing for an individual virtual machine instead.
To disable page-sharing in a virtual machine:
- Right-click on the virtual machine in the VI Client Inventory and choose Edit Settings.
- Click Options and click Advanced > General.
- Click Configuration Parameters.
- In the dialog box that appears, click Add Row.
- Enter sched.mem.pshare.enable and set its value to False.
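A row added through Configuration Parameters is stored in the virtual machine's .vmx file, so one way to double-check the change is to look for it there. The following is a minimal sketch that parses .vmx-style text and reports whether page-sharing has been disabled. It assumes the usual key = "value" line format; the helper names and the sample contents are hypothetical, and it treats a missing key as page-sharing still being enabled.

```python
def parse_vmx(text: str) -> dict:
    """Parse key = "value" lines from .vmx-style text into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip().lower()] = value.strip().strip('"')
    return settings


def page_sharing_disabled(vmx_text: str) -> bool:
    """Return True only if sched.mem.pshare.enable is explicitly false."""
    value = parse_vmx(vmx_text).get("sched.mem.pshare.enable", "true")
    return value.lower() == "false"


# Hypothetical .vmx excerpt for a four-processor, 8GB virtual machine.
sample = '''
numvcpus = "4"
memsize = "8192"
sched.mem.pshare.enable = "FALSE"
'''
print(page_sharing_disabled(sample))  # prints "True"
```

In practice you would read the text from a copy of the real .vmx file rather than from a string literal.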
One last thing to note: this issue and its fix only need to be considered on the VMware ESX 3.5 platform. The problem was addressed and resolved in vSphere. See also the VMware KB article Slow performance of virtual machines that use more than one vCPU on an ESX host when using certain hardware.
Caution: The most noticeable symptom is your virtual machine taking a significant amount of time to reboot, while a fresh power-on does not take a significant amount of time. If you are not seeing slow reboot times on your virtual machines, this article does not apply to you. Do not turn off page-sharing if you are not experiencing these symptoms.