A typo report on Twitter has led me to a set of thoughts about data. Where are your datasores? What is a datasore? Unlike a data store, which holds data, a datasore is a place where data becomes painful to manage or protect, or where the data exceeds your capability to handle it. A datasore should never happen, but with the explosion of data being moved, protected, managed, and mined, we have exceeded certain limits of our existing set of tools. How do we find datasores and alleviate them? Does alleviating them require us to re-architect our entire data usage and storage mechanisms?
What defines a datasore? Once we have defined it, how do we recognize and alleviate it? Here is my definition of a datasore:
A place within your data architecture where data becomes painful to manage or protect.
While that is a rather generic definition, let us look deeper into one aspect of the datasore: protection.
The problem with data protection these days is the sheer volume of data. In the past we had night-long backup windows and less than a TB of data. Now our backup windows are becoming shorter, with many TBs to back up. So full backups become datasores, as they could impact the entire environment.
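To see why full backups no longer fit the window, here is a rough back-of-the-envelope calculation. The data sizes and throughput figures are illustrative assumptions, not measurements from any particular environment:

```python
# Rough illustration: how long does a full backup take at a given
# effective throughput, and does it still fit a nightly window?

def backup_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to move data_tb terabytes at throughput_mb_s MB/s."""
    total_mb = data_tb * 1024 * 1024  # TB -> MB
    return total_mb / throughput_mb_s / 3600

# Yesterday: under 1 TB over a long night window.
print(round(backup_hours(1, 100), 1))   # 2.9 hours at 100 MB/s -- fits easily

# Today: tens of TB into a shorter window.
print(round(backup_hours(50, 500), 1))  # 29.1 hours at 500 MB/s -- no window fits
```

Even with a fivefold throughput improvement, the full backup in the second case cannot finish overnight, which is exactly why the incremental techniques below exist.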
The remedies to the data protection datasore include items such as:
- Change File Tracking (checking whether whole files have changed, regardless of which blocks changed).
- Change Block Tracking (tracking data that has changed at the block level).
- Active Block Tracking (finding unused data within a filesystem at the block level, so it need not be sent).
- Source Deduplication (deduplicating data before sending it over the wire).
- Source Compression (compressing data before sending it over the wire).
- Target Deduplication (deduplicating data after receiving it over the wire, either within one backup file or across multiple backups).
Some of these remedies exist in nearly all virtual and cloud backup tools. But what do these remedies solve? They all reduce either the amount of data to back up or the amount of data to be stored somewhere else as part of a backup, perhaps on tape.
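To make the source-side remedies concrete, here is a minimal sketch of source deduplication combined with source compression: split the data into chunks, hash each chunk, and compress and send only the chunks the target has not already seen. Real products use variable-size chunking and persistent hash indexes; the fixed 4 KiB chunks and in-memory set here are illustrative simplifications, not any vendor's actual implementation:

```python
import hashlib
import zlib

CHUNK_SIZE = 4096
seen = set()  # chunk hashes the backup target already holds

def backup_pass(data: bytes) -> list[bytes]:
    """Return only the compressed chunks that actually need to cross the wire."""
    to_send = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:                    # deduplicate before sending
            seen.add(digest)
            to_send.append(zlib.compress(chunk))  # compress before sending
    return to_send

first = backup_pass(b"A" * 8192 + b"B" * 4096)   # three chunks, two unique
second = backup_pass(b"A" * 8192 + b"C" * 4096)  # only the "C" chunk is new
print(len(first), len(second))  # 2 1
```

The second pass ships a single chunk because everything else was already stored, which is the whole point: the wire carries changes, not copies.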
Even so, these tool features handle only one aspect of this particular datasore. The other aspect is determining what data should actually be protected more rigorously than, say, the entire system. Perhaps we need to tie the business risk of data loss or theft into our data protection scenarios, firmly making a tie between security and data protection based on risk.
For example, if we look at virtual desktops (VDI), we know that the OS and base configuration will rarely change, but the user data within those desktops will change constantly. Given this, should we be backing up every VDI session? I would hope not. Perhaps we should back up only the master template or golden image, plus the user profiles and data sources. We can assume that this type of data management is already taking place in many shops, but it also requires a VDI architecture that makes use of user profiles and remote data sources. For large installations, I can see this being the norm, but what about smaller installations?
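The VDI policy above can be sketched as a simple inventory filter: protect templates and user-data sources, skip the disposable session clones. The inventory entries and category names here are hypothetical, purely to illustrate the selection logic:

```python
# Hypothetical VDI estate: protect the golden image and user-data shares,
# skip the session clones that can be rebuilt from the template.
inventory = [
    {"name": "win10-golden",   "kind": "template"},
    {"name": "vdi-session-01", "kind": "linked-clone"},
    {"name": "vdi-session-02", "kind": "linked-clone"},
    {"name": "profiles-share", "kind": "user-data"},
]

PROTECT = {"template", "user-data"}
to_protect = [vm["name"] for vm in inventory if vm["kind"] in PROTECT]
print(to_protect)  # ['win10-golden', 'profiles-share']
```

Two of four objects get backed up; the two session clones fall out of the backup set entirely, shrinking the datasore instead of deduplicating around it.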
Looking purely at data protection as a datasore, for large environments and perhaps for smaller ones as well, we know the following tools provide many of our remedies:
- Symantec NetBackup and Backup Exec, as well as nearly all agent-full backup solutions, provide incremental backups based on change file tracking.
- Veeam, Symantec, Zerto, Quantum vmPRO, Quest vRanger, PHD Virtual, and many others use change block tracking as well as active block tracking to reduce the overall data sent over the wire.
- Veeam, Zerto, Quest vRanger, and PHD Virtual, amongst others, perform source deduplication.
- Quantum, Symantec, and many others perform target deduplication in one form or another.
All data protection tools for the virtual and cloud environments use deduplication and compression for storage.
As data sizes increase, we will start to get datasores wherever data congregates; as such, we need to improve our tools to handle this inevitability. While we just looked at specific remedies for data protection and some of the tools that provide those remedies, are there other remedies and newer ideas required to alleviate just this one form of datasore?
At the moment there are only incomplete thoughts on how to handle the datasores that appear in big data. Do we just need a new way of thinking about data, or better tools?
I would like to thank Mike Laverick for reporting this typo and spurring these thoughts on datasores. So what are your datasores? Corporate policies about data or data protection? The sheer quantity of data? Something else?