Data protection techniques should be implemented and tested long before they are needed; this is a necessary part of any IT organization. However, the most recent VMware Communities podcast brought to light two problems with how data protection, and disaster recovery (DR) in particular, is actually implemented: organizations still do not test their DR plans, and organizations are waiting for a hardware refresh before implementing a DR plan.
The first issue is about services that are deemed too important to fail, and the second is about planning. Both, however, boil down to dollars and fear: either the fear of losing money, or the fear that data protection is too expensive to implement properly without the latest hardware. In both cases, it is a political debate.
Many forms of data protection and redundancy are required within an organization's compute environment, from redundant and meshed switch fabric and multiple blade or other compute node chassis to backing up or replicating data elsewhere. In a virtual environment, many of these aspects are abstracted away by higher-order functions such as virtual machines, virtual networking, high availability, and other features of modern hypervisors. In a highly agile environment, there is a belief that the hardware matters only for getting the environment up and running, and that once it is running the hardware is no longer much of a concern. With SRM and the backup and replication tools designed specifically for hypervisor environments, the thinking goes, all we need to do is provide the hardware and go.
This is not really the case. We often need to pick our hardware with redundancy in mind, hence the best practice when ordering something new: always get N+1 of the devices, where N is the number needed. In some shops this is N+2, with the extra unit placed in a closet for emergency use. The same logic applies to the switch fabric needed for networking and storage access. Redundancy needs to run through the entire environment.
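The N+1 and N+2 arithmetic is simple, but writing it down makes it easy to check whether a given order still covers the workload after a failure. The following is a minimal sketch; the function names `units_to_order` and `survives_failures` are hypothetical and chosen only for illustration.

```python
def units_to_order(needed: int, spares: int = 1) -> int:
    """Return N + spares, where N is the number of devices the workload needs.

    spares=1 is the N+1 rule; spares=2 is the N+2 variant with a closet spare.
    """
    if needed < 1:
        raise ValueError("needed must be at least 1")
    if spares < 0:
        raise ValueError("spares cannot be negative")
    return needed + spares


def survives_failures(installed: int, needed: int, failed: int) -> bool:
    """True if enough devices remain to carry the workload after failures."""
    return installed - failed >= needed


if __name__ == "__main__":
    hosts = units_to_order(needed=4)            # N+1 for a 4-host workload
    print(hosts)
    print(survives_failures(hosts, needed=4, failed=1))
    print(survives_failures(hosts, needed=4, failed=2))
```

With four hosts needed, N+1 means ordering five; that covers one failed host but not two, which is the gap an N+2 shop is paying to close.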
Redundancy must also be considered for any hot site, ensuring that the hot site matches your primary environment. Even so, how we move data from site A to site B must also come under scrutiny. We can either:
- Copy the data using stretched Layer 2 and storage clustering (à la EMC VPLEX)
- Replicate the data using continuous data protection (CDP) (Zerto, DataGardens, FalconStor), near-CDP (Veeam, Quest vRanger Pro), or point-in-time mechanisms (or backup mechanisms)
- Back up the data to a hot site (Quantum, Veeam, Quest vReplicator, VMware VDR, etc.)
The goal is to move the data to a safe location. Once it is there, it is time to test your disaster recovery mechanisms. So how can these tests be done safely, with minimal political impact? Workloads deemed too sensitive to fail are the major issue, yet they are precisely the workloads that require the best data protection and for which DR testing matters most.
There are several layers of data protection available with several testing scenarios:
- Back up your data and then use an automated backup testing tool that restores the data either to your hot site or, more importantly, into a sandbox that cannot interfere with your running, business-critical workloads. Tools include Veeam SureBackup, VirtualSharp, or your own homegrown mechanism.
- Set up a sandbox and run your replicated data in it by hand, which works with VMware SRM and all the other replication and backup tools.
- Push the Big Red Button on your production environment (the political ‘never do’ option).
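As a rough illustration of the "homegrown mechanism" in the first scenario, the sketch below restores a backup archive into a throwaway sandbox directory and compares each restored file against a checksum manifest taken at backup time. It is a file-level stand-in under stated assumptions, not how any of the named products work: real tools restore and boot whole virtual machines on an isolated network. The names `build_manifest` and `verify_backup`, the zip archive format, and the `dr-sandbox-` prefix are all hypothetical.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_dir: Path) -> dict[str, str]:
    """Record a checksum for every file under source_dir at backup time."""
    return {
        p.relative_to(source_dir).as_posix(): sha256_of(p)
        for p in sorted(source_dir.rglob("*"))
        if p.is_file()
    }


def verify_backup(archive: Path, manifest: dict[str, str]) -> list[str]:
    """Restore the archive into a temporary sandbox and list any problems.

    The sandbox is deleted afterwards, so production data is never touched.
    An empty list means every file in the manifest was restored intact.
    """
    problems = []
    with tempfile.TemporaryDirectory(prefix="dr-sandbox-") as sandbox:
        sandbox_dir = Path(sandbox)
        with zipfile.ZipFile(archive) as backup:
            backup.extractall(sandbox_dir)
        for rel_path, expected in manifest.items():
            restored = sandbox_dir / rel_path
            if not restored.is_file():
                problems.append(f"missing: {rel_path}")
            elif sha256_of(restored) != expected:
                problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Run on a schedule, a check like this turns "we have backups" into "we restored last night's backup and it matched", which is the point of automated sandbox testing.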
While sandbox testing of data protection mechanisms is a must, the real key is good documentation on what to do when there is a disaster. Your organization should have easy-to-understand, up-to-date disaster recovery documentation that covers all cases of failure and how to correct each one. In addition, you should have a disaster recovery testing plan that includes not only sandbox testing but also actual live failure of components. Often there is only one way to determine whether your existing redundancy is working properly: force its use.
If applications that are too big to fail exist within a virtual environment, they are often moved from node to node using Storage vMotion or other forms of live migration, without the application owner even being notified. They are moved because administrators still need to upgrade hardware, patch hypervisors, replace hardware, and so on. This makes a good starting point for discussing redundancy methods and DR testing. In many ways, live migration and Storage vMotion are the lowest form of data protection: administrators use them to keep workloads running while they manage the underlying hardware. In addition, these too-big-to-fail applications have in many cases already experienced a high availability event and required a reboot, which is another low-level form of data protection.
Now it is time to test the higher-order forms of data protection, building upon what already works. Take the next step: sandbox testing. Once that proves out, move on to real-world testing. But first, ensure all your documentation is in order. Plan, then plan some more. Good planning is required for good testing and to combat politics, fear, and the bottom line. But testing is a must! How many companies realized their DR plans were not adequate after the recent spurt of natural and man-made disasters such as Katrina?