At The Virtualization Practice, we have systems running in the cloud as well as on-premises. We run a 100% virtualized environment, with plenty of data protection, backup, and recovery options. These are all stitched together using one architecture: an architecture developed through painful personal experiences. We just had an interesting failure—nothing catastrophic, but it could have been, without the proper mindset and architecture around data protection. Data protection these days does not just mean backup and recovery, but also prevention and redundancy.
Articles Tagged with Failure Modes
In my virtual environment recently, I experienced two major failures. The first was with VMware vNetwork Distributed Switch and the second was related to the use of a VMware vShield. Both led to catastrophic failures, that could have easily been avoided if these two subsystems failed-safe instead of failing-closed. VMware vSphere is all about availability, but when critical systems fail like these, not even VMware HA can assist in recovery. You have to fix the problems yourself and usually by hand. Now after, the problem has been solved, and should not recur again, I began to wonder how I missed this and this led me to the total lack of information on how these subsystems actually work. So without further todo, here is how they work and what I consider to be the definition for fail-safe.