I saw a question posted on Twitter that intrigued me. The question was pretty straightforward: “How many virtual machines should I be able to run on a host?” That is a fair question in itself, but what I find intriguing is that it was the first question he asked. Is this really the first thing administrators think to ask when designing their environment? After all, there is no set formula for how many virtual machines you can run on a host. You can be a little more exact when working with VDI, because for the most part all the virtual machines are set up the same way and the numbers are more predictable. That is not the case with server virtualization. You are going to have servers with different configurations and different amounts of resources provisioned to each virtual machine. This variation is what changes your slot count and the number of virtual machines you can run on the host.
If we look at the limits of VMware ESX server, you can run up to 300 virtual machines per host. Just because you can do something does not mean you should. In this specific case the hosts were going to be HP DL580s, each with twenty-four processors and 256GB of RAM. Those boxes are pretty beefy and should be able to run somewhere between sixty and one hundred virtual machines each without any issues. Your mileage may vary, but let’s just say that you can get one hundred virtual machines per host.
In theory, if we had five of those HP hosts in a cluster, we should be able to run a total of 500 virtual machines, but we would have no room for any failure. Let’s change the number a little and say we have eighty virtual machines per host, for a total of 400 virtual machines in the cluster. That is exactly what the four surviving hosts can hold at one hundred each if one host is lost. Now here is my point. When a host failure happens, and this is when, not if, how long is it going to take to start eighty virtual machines on the four remaining hosts? From my point of view it would take way too long. One of the environments I worked on used smaller hosts and larger clusters. With this customer we had around twenty or so virtual machines running on each host, in a cluster of eight hosts. We had a host crash during the night, and all the virtual machines were able to recover, for the most part without any alerts being sent out. That means that all virtual machines on the crashed host were able to recover within five minutes.
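To put numbers on that comparison, here is a small Python sketch of the arithmetic. It is only a rough illustration: the function and field names are made up for this post, and it assumes virtual machines are spread evenly across hosts and restarted evenly across the survivors, which a real cluster will not do exactly.

```python
import math
from dataclasses import dataclass


@dataclass
class FailoverImpact:
    total_vms: int               # VMs in the whole cluster
    vms_to_restart: int          # VMs that were on the failed host
    restarts_per_survivor: int   # restarts each surviving host takes on (rounded up)
    vms_per_survivor_after: int  # VMs per surviving host after recovery (rounded up)


def host_failure_impact(hosts: int, vms_per_host: int) -> FailoverImpact:
    """Estimate the impact of losing one host (hypothetical helper).

    Assumes an even spread of VMs before and after the failure.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to survive a failure")
    survivors = hosts - 1
    total = hosts * vms_per_host
    return FailoverImpact(
        total_vms=total,
        vms_to_restart=vms_per_host,
        restarts_per_survivor=math.ceil(vms_per_host / survivors),
        vms_per_survivor_after=math.ceil(total / survivors),
    )


# Five large hosts at eighty VMs each, versus eight smaller hosts at twenty each.
big_hosts = host_failure_impact(hosts=5, vms_per_host=80)
small_hosts = host_failure_impact(hosts=8, vms_per_host=20)

print(big_hosts)
print(small_hosts)
```

With the five large hosts, a single failure means eighty restarts, twenty per surviving host, and every survivor ends up at the one hundred virtual machine ceiling. With the eight smaller hosts, the same failure means twenty restarts, about three per surviving host. The sketch says nothing about how long each restart takes; that you have to measure in your own environment.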
It is important to know the limits and to understand what your infrastructure will be able to run. I believe it is just as important, when planning and building your environment, to keep recovery in mind, so that the time it takes for all systems to recover from a failed host matches the Service Level Agreement (SLA) and the expectations of your management and company.
It is better to think about this during the design phase than to have to explain why recovery is taking so long during an actual outage. How many of you know, right now, how long it takes to recover from a host failure in your environment?