The Virtualization Practice was recently offline for two days, and we thank you for coming back to us after this failure. The reason was a simple fibre cut that would have taken the proper people no more than 15 minutes to fix, but we were far down the list because of the storm that hit New England and took 3M people off the grid. Even our backup mechanisms were out of power. While our datacenter had power, the rest of our immediate vicinity did not. So not only were we unable to reach any clouds, but we also could not be reached from outside our own datacenter. The usual solution to such isolation is remote sites and locating services in other regions of a country, but this gets relatively expensive for small and medium businesses. Can the Hybrid Cloud help here?
Actually, we were never 100% isolated. Our 3G/4GLTE phones all worked, so each of us could individually reach the outside. I was on Twitter throughout these dog days, keeping everyone informed and trying hard not to complain too much about the slowness of the telcos in the area. At the end of the first day I was coming up with other plans to come back online, but they either required me to buy hardware not available in my area or to use services that were not available in the area until power was restored. By that time the emergency was over and we were in good shape.
- 1:40AM Sunday – 80′ of a several-hundred-year-old tree falls on the power lines, snapping the fibre cables. Generator pops on. Datacenter still has power.
- 1:43AM Sunday – Power is cut to the area because the 80′ of tree dropped onto, but did not break, the primary and secondary lines. There was a split in the neutral cable for the service lines but nowhere else. Cutting the power was a decision made by the fire department.
- 2:00AM Sunday – Power off test nodes of the vSphere 5 Migration test cluster
- 8AM Sunday – Enter the datacenter and vMotion all VMs onto the absolute minimum number of vSphere 4.1 nodes, which makes it possible to power off 2 more nodes. In addition, power off all other sundry equipment such as scanners and printers.
- Afternoon Sunday – Department of Public Works comes out to remove the 80′ of tree on the wires. These guys are really good at this. Using a backhoe loader, they were able to lift the tree while cutting off the limbs and pieces. I have never seen a backhoe wielded like a scalpel before.
- First call to Verizon brings out a technician who watches all this and decides that fixing the fibre is too dangerous. The technician leaves without discussing other solutions. Later, National Grid tells me that it was not too dangerous.
- Monday Morning – National Grid shows up and stops all traffic when they see the high-voltage lines hanging over the road within reach of anyone. But by then a school bus, a trash truck, and many cars have already passed underneath. No one knew whether those wires were live, and whoever drove a school bus under them should be let go.
- Monday Afternoon – Calls to Verizon and Comcast lead to dead ends: no power, no network.
- Tuesday Morning – National Grid comes out with a line truck to fix the damage to the poles, brace a damaged pole, and get the lines back where they belong while fixing the line.
- 3:30PM Tuesday – Verizon comes out and reattaches the fibre.
- 4:30PM Tuesday – Datacenter connected once more
Plan, then Plan Some More: It Is a Learning Experience
So what does this all mean? It means that even though we planned for several points of failure, we ended up with a single point of failure: the power to the entire area. Even though we had power, we ended up with no primary network connection, and the secondary network connection was also down due to lack of power on the streets surrounding the datacenter. So do we need a tertiary source of networking? Perhaps. Is there a better solution? There may be, and it is one of:
- Use some form of Hybrid Cloud mechanism where you have everything in the cloud on standby and pay for storage and utilization when the systems are in use.
- Use some form of 3G/4G/4GLTE connection during emergencies
There are problems with each of these solutions that need to be overcome.
- Cost issues: bandwidth costs, plus the knowledge that some vCloud providers charge more than $500 per VM. That is a bit pricey given that a comparable box inside an enterprise may cost only 2x that amount, paid for out of pocket rather than amortized over years.
- How do I get the data from the local vCloud to the remote vCloud safely in a timely and continual manner?
- How do we access our own systems in an outage if we are not using a vCloud?
Solution 1: Extend Wireless Networking to the Datacenter
Let me handle the last solution first. If we were to get a CradlePoint device with a series of MiFi cards, we could provide expensive but workable connectivity to our own environment. This is not a clear win, as the ongoing data plan costs can add up to a sizable bill during a long-term outage. So while this is under consideration, it is not a very good long-term solution.
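To get a rough feel for that bill, a back-of-the-envelope estimate helps. The sketch below is only an illustration: the function name, the sustained throughput, and the per-gigabyte price are all assumptions, to be replaced with real numbers from a carrier's data plan.

```python
def outage_data_cost(avg_mbps, hours, price_per_gb):
    """Estimate the cellular data bill for an outage.

    avg_mbps      -- assumed average sustained throughput in megabits/second
    hours         -- length of the outage in hours
    price_per_gb  -- assumed price per (decimal) gigabyte on the data plan
    """
    if avg_mbps < 0 or hours < 0 or price_per_gb < 0:
        raise ValueError("inputs must be non-negative")
    megabytes_per_second = avg_mbps / 8            # 8 bits per byte
    gigabytes = megabytes_per_second * 3600 * hours / 1000
    return gigabytes * price_per_gb


if __name__ == "__main__":
    # Hypothetical figures: 2 Mbit/s average for a two-day outage at $10/GB.
    print(f"Estimated bill: ${outage_data_cost(2, 48, 10):.2f}")
```

With those made-up figures, a two-day outage works out to roughly 43 GB of traffic and a bill in the low hundreds of dollars, which shows how quickly a longer outage could become expensive.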
Solution 2: Extend into a Hybrid Cloud
Cost and data movement are the two major considerations when evaluating a Hybrid Cloud solution. If you have 100s of VMs, vCloud providers can price themselves out of the market. For small businesses, some have already priced themselves out of the market. The other consideration, data movement, can be solved with various technologies such as connector software and replication software, which the provider must also support in order for migration to be successful. This means we add more to our evaluation list for vClouds.
We first need to get the data to the cloud, and there are a few possible solutions:
- Ship the Data by disk
- Ship the Data over a secure connection
- Make use of VMware vCloud Connector
- If you want more security, use AFORE Solutions Cloudlink, which also supports vCloud Connector
- Replication Tools from Quest, Veeam, Data Gardens, Zerto, VMware, etc.
The first option is time consuming, and if you have time constraints this becomes a tough solution. In addition, can you be sure that your data has not been copied in transit? The second is much more palatable from a time perspective, but you also have to worry about endpoint security and which of the cloud administrators can look at your data. For these solutions, disk or data encryption is a must before you transfer the data, even if you transfer it using a secure tunnel. The reason is that while the secure tunnel protects the data in motion, we also have to protect the data at rest, even from the cloud provider administrators. In-VM encryption reduces this risk but does not eliminate it.
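Whether shipping over a connection really fits a time constraint is simple arithmetic on data size and usable bandwidth. Here is a minimal sketch of that estimate; the function name, the `efficiency` factor (a guess at protocol and encryption overhead), and the example figures are assumptions for illustration, not measurements from our environment.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Hours needed to push data_gb (decimal gigabytes) over a link.

    link_mbps   -- nominal uplink speed in megabits/second
    efficiency  -- assumed fraction of the link usable for payload (0 < e <= 1)
    """
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000                  # GB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical: 500 GB of VM disks over a 20 Mbit/s uplink.
    print(f"{transfer_hours(500, 20):.1f} hours")
```

If the result runs to days rather than hours, shipping the initial copy by disk and replicating only the changes afterwards may be the more practical combination.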
AFORE Solutions Cloudlink, if supported by your chosen vCloud, provides encryption for both data in motion and data at rest. Even so, once the data is in the cloud a vSphere administrator could look at it, so we also need to maintain data or in-VM disk encryption. This lowers our overall risk while increasing the integrity and confidentiality of our data, even from the cloud provider administrators.
The other class of tools that could get our data into a cloud is replication tools provided by third parties. We would in essence use a Replication Receiver Cloud that would also be able to run our VMs. Ongoing replication is a must going forward, because when a power outage or fibre cut happens we can easily log in to the cloud portal via our 3G/4G/4GLTE devices and power on the VMs. If such replication is near real time, we should have minimal data loss. So a cloud that supports VMware's SRM Replication, Zerto, Veeam, Quest, or Data Gardens (or other) replication is also a must.
At this time all we have is a list of requirements for each vCloud provider, and which one we choose depends on the answers to these questions. However, the security of our data is our own responsibility, so we have a list of new security protocols and procedures to implement before we go to the cloud.
Our list of security issues is:
- Ensure we have global DNS capability outside our environment either as a backup name server or as a primary (either solution requires this functionality)
- Ensure in-VM disk and data encryption is in use to lower our overall risk
- Ensure any link to the cloud from our 3G/4G/4GLTE devices is over an encrypted tunnel (a VPN which is what we use now)
- Determine which VMs we need to place in a vCloud and what networking we will have available to protect them.
Our list of vCloud requirements is:
- Support for vSphere 5, vCloud 1.5, vCloud Connector
- Support for our own set of vNetworks (an External, a DMZ network, a management network, and an internal network)
- Support for AFORE Solutions Cloudlink (a very nice to have)
- Support for one form of Replication software such as Veeam, Quest, VMware, Zerto, or Data Gardens.
- Support the ability to failback when the emergency is over
- Low cost, as the primary use is as a means to back up/replicate and run ONLY during an emergency.
Cost will be the largest issue for a small business, and depending on the number of VMs to put in a vCloud, cost could be an issue for enterprises as well. The other issues around security and replication are a matter of what a vCloud supports for moving data to its environment. In our case we would want to move whole VMs initially and replicate data from then on.
vCloud providers, tell us: do you meet these requirements?
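One way to keep provider comparisons honest is to turn the requirements list into a simple checklist. The sketch below is a hypothetical illustration: the feature labels, the `Provider` class, and the `evaluate` function are names made up for this example, and cost is left out because it needs a real quote rather than a yes/no answer.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the must-have items in our requirements list.
REQUIRED = {
    "vsphere5_vcloud15_connector",   # vSphere 5, vCloud 1.5, vCloud Connector
    "custom_vnetworks",              # External, DMZ, management, internal
    "replication_software",          # Veeam, Quest, VMware, Zerto, Data Gardens
    "failback",                      # return home once the emergency is over
}
NICE_TO_HAVE = {"afore_cloudlink"}


@dataclass
class Provider:
    name: str
    features: set = field(default_factory=set)


def evaluate(provider):
    """Report which must-haves a provider is missing and which extras it has."""
    missing = sorted(REQUIRED - provider.features)
    extras = sorted(NICE_TO_HAVE & provider.features)
    return {
        "meets_requirements": not missing,
        "missing": missing,
        "nice_to_have": extras,
    }


if __name__ == "__main__":
    # "ExampleCloud" is a made-up provider, not a real offering.
    candidate = Provider(
        "ExampleCloud",
        {"vsphere5_vcloud15_connector", "custom_vnetworks", "replication_software"},
    )
    print(evaluate(candidate))
```

A provider that fails any must-have drops off the list before cost is even discussed, which keeps the pricing conversation limited to candidates that could actually work.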