Distributed Virtual Switch Failures: Failing-Safe

In my virtual environment recently, I experienced two major failures. The first involved the VMware vNetwork Distributed Switch and the second involved VMware vShield. Both led to catastrophic failures that could easily have been avoided if these two subsystems failed safe instead of failing closed. VMware vSphere is all about availability, but when critical systems like these fail, not even VMware HA can assist in recovery. You have to fix the problems yourself, usually by hand. Now that the problem has been solved and should not recur, I began to wonder how I missed this, which led me to the total lack of information on how these subsystems actually work. So without further ado, here is how they work and what I consider to be the definition of fail-safe.

There are three failure modes available for security tools:

  • Fail-Closed – This is generally the state you want for firewalls, and many security devices when they fail. This mode implies that no traffic will route through the device until the failure is corrected.
  • Fail-Open – This is generally the state you want for network switches. This mode implies that all traffic will be routed through the device. How it routes depends on what actually failed.
  • Fail-Safe – This is a growing concept that implies that should a device fail, another device or mechanism within the device picks up the traffic and sends it on to its destination with the proper checking and routing.
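The three modes above can be contrasted in a small sketch. This is only an illustration of the definitions, not code from any product: the `FailureMode` enum, the `inspect` policy check, and the `handle` function are all hypothetical names invented for this example.

```python
from enum import Enum


class FailureMode(Enum):
    FAIL_CLOSED = "fail-closed"
    FAIL_OPEN = "fail-open"
    FAIL_SAFE = "fail-safe"


def inspect(packet):
    """Stand-in policy check (assumption): allow everything except 'bad'."""
    return packet != "bad"


def handle(packet, mode, primary_up, standby_up=True):
    """Return 'forward' or 'drop' for one packet under the given failure mode."""
    if primary_up:
        # Normal operation: the primary device checks the traffic.
        return "forward" if inspect(packet) else "drop"
    if mode is FailureMode.FAIL_CLOSED:
        return "drop"        # nothing passes until the failure is corrected
    if mode is FailureMode.FAIL_OPEN:
        return "forward"     # everything passes, with no checking at all
    # FAIL_SAFE: a standby device or mechanism takes over and still checks.
    if standby_up:
        return "forward" if inspect(packet) else "drop"
    return "drop"            # no healthy path is left


if __name__ == "__main__":
    for mode in FailureMode:
        print(mode.value, handle("ok", mode, primary_up=False),
              handle("bad", mode, primary_up=False))
```

With the primary device down, only the fail-safe mode both forwards the good packet and still drops the bad one.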

Fail-Safe is the holy grail of security tools. They all want to fail-safe. This is achieved in hardware by having redundant systems that can take over the workload when the primary device fails. Any good security and networking plan builds for such failures. This is why we often see network switches in pairs, why load balancers are in use, why clustering technology is in use, etc.


Figure 1: Network Control Planes

However, when we enter a virtual environment our network is flattened, and no matter how we try to change this, the network remains flat. I have described in detail how the network stack works in several other articles (Blade Physical-Virtual Networking and Virtualization Security, Rethinking vNetwork Security), but now it is time to consider VMware vNetwork Distributed Switches in more detail. In all of my other diagrams they are nothing more than a layer within the stack, but in reality they are much more than that.

In reality, the vNetwork Distributed Switch (vDS) control plane extends from the hypervisor into VMware vCenter, as shown in Figure 1. While the traditional VMware vSwitch's control plane lives wholly within the hypervisor, the vDS control plane does not. This implies that for the vDS to function properly, VMware vCenter must ALWAYS be running. This is a case of a Fail-Closed system. The following catch-22 problem can occur.


If the host running vCenter dies, and the vCenter VM is itself connected to a vDS, HA will fail because the vCenter VM cannot reboot on a new host. Why? Because VMware vCenter is not running, and VMware vCenter assigns VMs to ports on the vDS in use; only vCenter knows whether or not a port on the vDS is available.


The solution is a Fail-Safe style of solution. The vDS code within vCenter should download the current port assignments to each host so that HA can work. It should also assign each host participating in the vDS a port range that can be used for new VMs until vCenter comes up, including a certain number of In Case of Emergency (ICE) ports over the normal allotment for a given host. So if you set up the vDS to allow only 10 ports and you have a 2-node cluster, the split may be 5 per host, but you may need 10 on a single host if the other node does not boot. In this case the ICE allotment would be 2x the normal allotment, or 10 ports per node. Yes, this could exceed the normal 10 ports allowed in this vDS, but it would maintain a running virtual environment. The number of ICE ports should be a configurable option.
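The arithmetic in that proposal can be sketched as follows. This models the idea only; it is not how vDS works today, and `ice_ranges`, `HostPortCache`, and the host and VM names are hypothetical names made up for the illustration.

```python
def ice_ranges(total_ports, hosts, ice_multiplier=2):
    """Give each host a private port range sized at its normal share times
    the ICE multiplier (hypothetical scheme, following the proposal above)."""
    normal = total_ports // len(hosts)   # e.g. 10 ports / 2 hosts = 5 each
    size = normal * ice_multiplier       # e.g. 5 * 2 = 10 with ICE ports
    return {h: range(i * size, (i + 1) * size) for i, h in enumerate(hosts)}


class HostPortCache:
    """Host-local copy of port assignments, usable while vCenter is down."""

    def __init__(self, port_range, known=None):
        self.port_range = port_range
        self.assigned = dict(known or {})   # VM name -> port number

    def power_on(self, vm):
        """Assign a port locally; raise once even the ICE ports are used up."""
        if vm in self.assigned:
            return self.assigned[vm]
        used = set(self.assigned.values())
        for port in self.port_range:
            if port not in used:
                self.assigned[vm] = port
                return port
        raise RuntimeError("no free ports left, including ICE ports")


if __name__ == "__main__":
    ranges = ice_ranges(10, ["esx1", "esx2"])
    # esx2 already runs its normal 5 VMs; esx1 has died and vCenter is down.
    esx2 = HostPortCache(ranges["esx2"])
    for n in range(5):
        esx2.power_on(f"local-vm{n}")
    # HA restarts esx1's 5 VMs on esx2 using the ICE ports.
    for n in range(5):
        esx2.power_on(f"failover-vm{n}")
    print(len(esx2.assigned), "VMs connected without vCenter")
```

In the 10-port, 2-node example from the text, the surviving host can connect all 10 VMs from its own range; an eleventh VM would be refused until vCenter returns.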

If HA does not work, then we run into a serious catch-22 issue. Currently the only way to fix this problem is to ensure VMware vCenter is running on something besides a vDS, and that in an HA scenario it boots first, followed by all other VMs, so that the vDS control plane is available to perform port assignments. In my particular case, I had to go into the host and, at the management console (the Service Console in this case), reconfigure a VMware vSwitch to connect to the uplink used by the vDS and then move the VMware vCenter VM to this vSwitch. Thankfully, management appliance access was still available, so I was using both the vSphere Client and console access to fix this issue. The reason is that when you play around with the ports for the management console, you absolutely need to do this from the console, as you temporarily disrupt communication outside the host.


VMware vShield Drivers
Figure 2: VMware vShield in the Stack

The other issue I had was another availability problem, this time related to vMotion slowing to a crawl: it took 15-20 minutes to perform a single vMotion. The reason for this was interesting, but once more it came down to a Fail-Closed system. The security devices I was using, VMware vShield Manager and the VMware vShield Endpoint Security module, were both on networks not seen by or available to the service console. Because of this, the vShield Endpoint vSCSI filter (VFILE) was not available. As you can see in Figure 2, VMware vShield Endpoint sends its data through the SCSI driver using the VFILE filter to the EPSec Transport layer and the EPSec Driver. The transport layer then routes the traffic to the EPSec virtual appliance (marked as DSVA) for consideration. The path of which I speak is the dashed red line.

If for any reason the DSVA (EPSec virtual appliance) is unreachable (or VMware vShield Manager, for that matter), the VFILE filter times out and retries several times before giving up. Either this timeout is set fairly high or the number of retries is set very high by default.


The problem occurs if virtual networking has an issue (such as the vDS Fail-Closed behavior discussed above), if the appliance dies for some reason, or even if VMware vShield Manager becomes unreachable while it checks licensing. The culprit is that during a vMotion, the virtual machine needs to be quiesced, and that requires data to be written from memory to disk. Any file that is opened or closed within the virtual disk gets sent over the EPSec Transport by the VFILE filter, which caused all the timeouts I had seen.


The solution to this problem is to use a different networking structure for the transport so that the EPSec virtual appliance does not need to be reachable over the network. VMsafe-Net does this for firewalls, and EPSec should do this as well. It should also have an administratively settable timeout, and if the EPSec virtual appliance is not reachable, it should not retry in the midst of a vMotion: just sync the memory to disk, mark the files to be checked once the EPSec virtual appliance is reachable, and proceed. EPSec should NOT get in the middle of critical actions or impact them. This solution would make EPSec and vShield Manager failures Fail-Safe instead of Fail-Closed. Furthermore, licensing for these and other functions should be stored within the hypervisor for a short period of time. Once more, make use of ICE-style licensing, as you never know if the EPSec virtual appliance will be running on the host to which VMs have been moved.
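A rough sketch of the behavior I am asking for follows. It does not reflect the actual EPSec or VFILE implementation; `DeferringScanner`, its parameters, and the `scan_fn` callback are hypothetical names used only to show a settable timeout, no retries during a critical action, and deferred checking.

```python
from collections import deque


class DeferringScanner:
    """Hypothetical filter: scan through the appliance when it is reachable,
    otherwise queue the file for a later check instead of blocking."""

    def __init__(self, scan_fn, timeout_s=0.5, max_retries=1):
        # scan_fn(path, timeout_s) -> bool (True means clean); it is assumed
        # to raise TimeoutError when the appliance cannot be reached.
        self.scan_fn = scan_fn
        self.timeout_s = timeout_s        # administratively settable
        self.max_retries = max_retries    # administratively settable
        self.pending = deque()            # files marked "check later"

    def on_file_event(self, path, critical=False):
        """Handle one file open/close; never retry during a critical action."""
        attempts = 1 if critical else 1 + self.max_retries
        for _ in range(attempts):
            try:
                clean = self.scan_fn(path, self.timeout_s)
                return "clean" if clean else "infected"
            except TimeoutError:
                continue
        self.pending.append(path)
        return "deferred"

    def drain(self):
        """Check deferred files once the appliance is reachable again."""
        results = {}
        while self.pending:
            path = self.pending[0]
            clean = self.scan_fn(path, self.timeout_s)  # may raise; stays queued
            self.pending.popleft()
            results[path] = "clean" if clean else "infected"
        return results
```

During a vMotion the caller would pass `critical=True`, so an unreachable appliance costs at most one timeout per file and the file is checked later through `drain()`.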

Actually, when you use EPSec, VMware will insist that the best practice is to have EPSec installed on every host in a cluster, but this does not account for disaster recovery situations where you JUST need to get things running. When I was finally able to remove VMware vShield from my hosts for further testing, I found that the VMs would not boot because the VFILE filter was no longer available. This is a Fail-Closed system. Once more, we need it to Fail-Safe: allow the boot, but start collecting files to check within the EPSec Driver installed within each VM.

Trend Micro has a Fail-Safe model within their Deep Security 7.5 product such that if EPSec is not available, it switches off that aspect and uses the Deep Security Agent within the VM for all other aspects of security. Eventually Trend Micro will have a Fail-Safe mode for Anti-Malware as well. However, Deep Security makes use of vShield Endpoint and so has all the limitations that we have discussed above. This is not a Trend Micro issue, but a VMware vShield Endpoint issue. VMware vShield Endpoint must Fail-Safe.

More Networking

Since Cisco is pushing more of the vNetwork into hardware, Fail-Safe within the virtual network becomes all-important. No one has said much about how this technology will work, other than that it is a paravirtualized driver based on VMXNET3. But the questions remain:

  • Does this technology make use of the VMware Introspection APIs (VMsafe-Net)?
  • Will it be secured much the same way as virtualized components are today, will this truly be a physical-only security model, or will it be some combination?

I do not see this technology moving to a physical-only security model, considering VMware's investment in VMware vShield Edge for use with vCloud Director. I expect more of a hybrid model, as the vNIC still connects to the Portgroup and from there may go direct to hardware, passing through the Cisco Nexus 1000V. The purple dashed line in Figure 2 shows this path. The VMsafe-Net introspection APIs would still come into play. What happens if the VMsafe-Net virtual appliance also dies? The third-party vendors solved this issue with a Fail-Safe model: they download the majority of the firewall rules into the VMsafe-Net driver installed on every host in a cluster rather than keeping them only in the firewall virtual appliance. vShield Edge, App, and Zones do not do this. So when the vShield Edge or App virtual appliance dies, we are once more Fail-Closed, locking you out of the system. Perhaps with this change coming from Cisco, Cisco can work with VMware to create a hybrid hardware/software device that Fails-Safe as needed.
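The idea of keeping the rules in the host driver can be sketched like this. It is a simplified model rather than the VMsafe-Net API or any vendor's code; `HostFirewallDriver`, its methods, and the rule tuple layout are hypothetical.

```python
class HostFirewallDriver:
    """Hypothetical host-side driver that keeps a cached copy of the rules,
    so enforcement continues when the firewall appliance is down."""

    def __init__(self, default_action="drop"):
        self.rules = []                    # list of (source, port, action)
        self.default_action = default_action

    def sync(self, rules):
        """Called by the appliance while it is up; replaces the cached rules."""
        self.rules = list(rules)

    def decide(self, source, port):
        """First matching cached rule wins; the appliance is never consulted."""
        for rule_source, rule_port, action in self.rules:
            if rule_source in (source, "any") and rule_port in (port, "any"):
                return action
        return self.default_action


if __name__ == "__main__":
    driver = HostFirewallDriver()
    # The appliance pushes its rules once, then (in this scenario) dies.
    driver.sync([("10.0.0.5", 22, "allow"), ("any", 443, "allow")])
    print(driver.decide("10.0.0.5", 22))   # still allowed from the cache
    print(driver.decide("10.0.0.9", 22))   # still dropped by default
```

Because `decide` reads only the cached rules, traffic keeps being checked and routed after the appliance fails, which is the Fail-Safe behavior described above.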

Putting it Together

Critical systems such as networking and security within virtual and cloud environments need to Fail-Safe! Any product that does not have a fail-safe mode should be considered incomplete, which unfortunately includes many of the virtualized components that make up modern hypervisors, such as the VMware vNetwork Distributed Switch. I call on each vendor to improve their products to account for all failure modes. In a disaster, Fail-Safe could save you quite a bit of time and energy.
