A customer recently asked me: can we virtualize our Tier 1 app that receives 7 billion requests per day? My initial response was: on how many servers? Their answer was 15. That is quite a shocking set of numbers to consider. Add in 150K sessions per second, the need for a firewall, and sub-second response times, and you end up with a few more shocking numbers. So could such a workload be virtualized, or is it too big for virtualization? In effect, this becomes an architectural challenge, one worthy of those who choose to become a VCDX, and the answer is not simple. Consider their current configuration:
- 15 Dual Quad-Core nodes with 24G of memory in each and 300G of SAS drive space.
- They constantly run low on memory and drive space (such that most of their tooling is there to remove unneeded data after rolling it up)
- 4 network adapters per node (all in use)
- Inter-machine communication rivaled only by High Performance Computing (HPC) systems.
So if we were to virtualize as is, we would need the following resources:
- 15*8 or 120 vCPUs
- 360G of Memory
- 4.5TB of Disk (15 x 300G)
- Extremely low latency network with 4 vNICs per VM each using VMDirectPath
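The totals above are simple multiplication over the per-node figures, and a short sketch makes the arithmetic easy to recheck. The constant and function names here are illustrative only, and the sketch assumes every node is identical and that disk is counted in decimal terabytes. Note that 15 nodes at 300G each works out to 4.5TB.

```python
# Per-node figures taken from the current physical configuration above.
NODES = 15
CORES_PER_NODE = 2 * 4       # dual quad-core
MEMORY_GB_PER_NODE = 24
DISK_GB_PER_NODE = 300
NICS_PER_NODE = 4


def aggregate_requirements(nodes: int = NODES) -> dict:
    """Sum per-node resources for a one-to-one physical-to-virtual move."""
    return {
        "vcpus": nodes * CORES_PER_NODE,
        "memory_gb": nodes * MEMORY_GB_PER_NODE,
        "disk_tb": nodes * DISK_GB_PER_NODE / 1000,  # decimal TB (assumption)
        "vnics": nodes * NICS_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in aggregate_requirements().items():
        print(f"{name}: {value}")
```

Changing `NODES` or any per-node constant shows how quickly the totals grow if the workload is scaled out rather than consolidated.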
The low latency network components will limit the number of VMs per host quite a bit, as the VMs in effect need to bypass as much of the hypervisor as possible to achieve the required levels of network I/O. In addition, we need to be cognizant of the security requirements: the current firewall handles up to 150K sessions per second. So we are going to have to either continue to use a single physical firewall or use multiple virtual firewalls. Virtual firewalls are restricted to handling around 2K sessions per second (sometimes more, sometimes less), which means that to handle this workload we may need 75 load-balanced virtual firewalls to keep up with 7 billion requests per day.
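The firewall estimate can be sketched the same way. This is a back-of-the-envelope calculation, not a sizing tool: the 2K sessions per second figure is the rough number quoted above, the function names are made up for illustration, and perfectly even load balancing is assumed. It also shows the average request rate implied by 7 billion requests per day, which is only an average and says nothing about peaks.

```python
import math

REQUESTS_PER_DAY = 7_000_000_000
PEAK_SESSIONS_PER_SEC = 150_000
VIRTUAL_FW_SESSIONS_PER_SEC = 2_000  # rough figure; real firewalls vary
SECONDS_PER_DAY = 24 * 60 * 60


def average_requests_per_second(requests_per_day: int) -> float:
    """Average rate over a full day; peak traffic will be higher."""
    return requests_per_day / SECONDS_PER_DAY


def virtual_firewalls_needed(peak_sessions: int, per_firewall: int) -> int:
    """Firewalls required, assuming load is spread perfectly evenly."""
    return math.ceil(peak_sessions / per_firewall)


if __name__ == "__main__":
    rate = average_requests_per_second(REQUESTS_PER_DAY)
    count = virtual_firewalls_needed(
        PEAK_SESSIONS_PER_SEC, VIRTUAL_FW_SESSIONS_PER_SEC
    )
    print(f"average requests/sec: {rate:,.0f}")
    print(f"virtual firewalls needed: {count}")
```

Because the per-firewall figure is "sometimes more, sometimes less", it is worth rerunning the calculation with a lower value to see how much headroom a real deployment would need.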
So how could one design such a configuration today with modern hypervisor technology? Well, for starters, our limiting factors appear to be networking and the fact that we never want to overcommit memory or CPU. This is in effect an HPC application, so we need to remove any function that would impede CPU performance. That includes some of the major features of virtualization: CPU, network, storage, and memory overcommit. So what would we need to do? This type of solution is all about the workload, and for this particular workload we know that each VM needs the following resources:
- 8 Cores
- 24G of Memory
- 4 low latency vNICs
- 300G of low latency storage
This implies that in order to virtualize exactly what we have in the physical world, we would need a host that could handle at least 1 of these VMs, but ideally 2-4 of them, so hardware choices come into play. In addition, we have to consider the base requirements of running a hypervisor, namely that one pCPU is dedicated to the hypervisor. This leads us to the following conclusion about the hardware required per host, just to match the current physical environment:
- Dual Hex Core CPUs.
- 32G of memory
- 4 Intel VT-d/SRIOV pNICs for the VM
- 2 pNICs for the hypervisor (redundant management)
This would leave 3 cores for other uses, such as a per-hypervisor firewall. If we wanted to double up VMs per node or move to 4 VMs per node, our hardware requirements change drastically and increase in cost. The ideal configuration for this type of workload would be:
- Quad 12-core CPUs
- 144G of Memory
- 20 Intel VT-d/SRIOV pNICs for the VM
- 2 pNICs for the hypervisor (redundant management)
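To check what each of the two host configurations leaves over, here is a minimal sketch. The `Host` class and `headroom` function are hypothetical names for illustration. It assumes no overcommit, one core reserved for the hypervisor as stated above, and 4 VMs on the larger host. Hypervisor memory overhead is not modeled.

```python
from dataclasses import dataclass

# Per-VM requirements from the workload list above.
VM_CORES = 8
VM_MEMORY_GB = 24
VM_NICS = 4
HYPERVISOR_CORES = 1  # one pCPU dedicated to the hypervisor


@dataclass
class Host:
    sockets: int
    cores_per_socket: int
    memory_gb: int
    vm_pnics: int  # pNICs available to VMs, excluding management

    @property
    def cores(self) -> int:
        return self.sockets * self.cores_per_socket


def headroom(host: Host, vms: int) -> dict:
    """Resources left after placing `vms` VMs with no overcommit."""
    return {
        "cores": host.cores - HYPERVISOR_CORES - vms * VM_CORES,
        "memory_gb": host.memory_gb - vms * VM_MEMORY_GB,
        "pnics": host.vm_pnics - vms * VM_NICS,
    }


if __name__ == "__main__":
    small = Host(sockets=2, cores_per_socket=6, memory_gb=32, vm_pnics=4)
    large = Host(sockets=4, cores_per_socket=12, memory_gb=144, vm_pnics=20)
    print("small host, 1 VM:", headroom(small, 1))
    print("large host, 4 VMs:", headroom(large, 4))
```

A negative value in any field means the host cannot carry that many VMs without overcommitting, which this workload rules out.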
Given these numbers, it may be better to use Cisco VM-FEX technology or 4-port InfiniBand cards with EoIB configurations, which, as I discovered at the OpenStack Conference, is how researchers are approaching HPC loads. However, this has gotten me thinking more about the workload than about the technology required to virtualize it. Would it not be better to build a SaaS solution and design the code specifically to run within a cloud? Granted, given current technology, it would require many more boxes than 15 or even 30 to run.
So the real question is: why are we shoehorning our physical definition of a workload into virtual and cloud environments? Could we not, instead, redesign the workloads for the cloud? At the same time, could we redefine the concept of HPC to be more cloud-aware? That will require us to understand how other tenants within a cloud impact latency for HPC.
Even so, the limiting factor for virtualizing such workloads or placing them into the cloud is ultimately the cost to do so. Anything that impacts the bottom line will impact the ability to place such workloads into the cloud.