At the recent Misti Big Data Security conference, many approaches to securing big data were discussed, from encrypting the entire big data pool to encrypting only the critical bits of data within it. Several of the talks included general discussion of securing Hadoop as well as access to the pool of data. These security measures include RBAC, encryption of data in motion between Hadoop nodes, and tokenization or encryption of data on ingest. What was missing was finer-grained control of who can access specific data once that data is in the pool. How could role-based access controls be applied per datum? Is such protection too expensive given the time-critical nature of analytics, or are there other ways to implement per-datum security?

Figure: Encryption up and down the stack

Securing the pool of data used for security analytics is crucial to the security of your environment, because that data could give an attacker a map to your entire virtual and hybrid cloud environment. Yet we still need to act on that data. So how can we access the data securely while protecting the pool? The common answer is encryption at rest, but as we discussed in Virtualizing Business Critical Applications – Integrity & Confidentiality, where you put encryption is just as important as the type of encryption. As we see in the figure, encrypt too low in the stack (encrypting fabric controllers, encrypting HBAs, encrypting storage controllers, and self-encrypting drives) and you are only protecting against data loss in the case where a failed drive is not destroyed on site but elsewhere. Yet encrypt too high (within the application) and you could over-encrypt and hurt performance.

Since it is nearly impossible to encrypt an entire pool of big data (due to its volume, velocity, and variety), how can we protect the individual datum within the pool? If that data is PII, tools such as those from Dataguise can encrypt, tokenize, or redact the data both as it enters the pool and retroactively. Perhaps we can use something similar on ingest of critical data, encrypting part or all of it so that the analytics tools can properly decrypt it as needed. But then we run into the very definition of "critical," since derived data can also be considered critical. Is this even possible, given how Hadoop works?
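To make the ingest idea concrete, here is a minimal sketch of tokenizing PII fields as records enter the pool. This is not Dataguise's actual method; the field names, key, and HMAC-based deterministic tokens are all assumptions for illustration. Deterministic tokens preserve equality, so joins and group-bys inside the pool still work on the tokenized values.

```python
import hmac
import hashlib

# Hypothetical master key; in practice this lives in a key-management system.
TOKEN_KEY = b"example-master-key"

# Fields we treat as "critical"/PII -- an assumption for illustration.
PII_FIELDS = {"ssn", "email"}

def tokenize(value: str) -> str:
    """Deterministic, keyed token: equal inputs map to equal tokens."""
    return hmac.new(TOKEN_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def tokenize_record(record: dict) -> dict:
    """Tokenize PII fields on ingest, leaving everything else untouched."""
    return {k: (tokenize(v) if k in PII_FIELDS else v) for k, v in record.items()}

record = {"user": "alice", "email": "alice@example.com", "bytes_sent": 1024}
safe = tokenize_record(record)
```

The same function applied retroactively over existing records would cover data already in the pool.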

Hadoop works by having a central management system distribute chunks of data to each Hadoop node. Each node then works on its chunk to produce new data through the MapReduce process. In essence, we may end up with a data explosion, where the datum we want to protect is spread across many Hadoop nodes in a wide variety of forms. Does that imply that each Hadoop node must encrypt all this data but decrypt it to perform further analysis, then re-encrypt the output during MapReduce, and decrypt on presentation according to the level of the person viewing the data?

This would make for a very complex set of encryption and decryption rules. Not only that, we would most likely want each stream of data, or each type of datum, to have its own encryption or tokenization, much the way we handle PII. In this way, the encryption keys themselves become the mechanism that keeps data, and data derived from it, from being viewed or even acted upon by those without the rights to do so.
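One way to get a key per stream without managing thousands of independent keys is to derive each from a master key, HKDF-style. The sketch below is a simplified assumption of how that could look; handing a role only the derived keys it is entitled to would then enforce per-datum access through key possession.

```python
import hmac
import hashlib

MASTER_KEY = b"example-master-key"  # hypothetical; from a KMS in practice

def stream_key(stream_name: str) -> bytes:
    """Derive a distinct key per data stream (simplified HKDF-like step):
    access rights become a question of which derived keys a role holds."""
    return hmac.new(MASTER_KEY, stream_name.encode(), hashlib.sha256).digest()

# A role's entitlements expressed as a set of derived keys -- illustrative only.
analyst_keys = {name: stream_key(name) for name in ("firewall_logs", "web_access")}
```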

To me, this seems clumsy at best. So for now, our best bet is to tokenize PII and other critical data on ingest and on output, but to leave the pool itself untouched. This implies the pool of data must still be protected properly: control access to the data, control access to the output derived from that data, encrypt or tokenize on ingest, and apply some form of encryption of data at rest, though not necessarily in motion.

The critical component that has yet to be built is control of the output from big data analytics. I see filters on ingest, but where are the similar filters on output? Are there people who need access to all the data? Sure. But does everyone need access to all the data? I would say not. Nor do I think humans need direct access to the data within the pool; if we limit access to applications alone, the pool itself becomes access-controlled. If we go further and limit access to properly signed programs, then every program must be blessed (signed) before it can be used, and we gain one more control over access to our data.
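The "blessed programs" idea could be sketched as a gate in front of the job scheduler. The example below uses an HMAC for brevity; a real deployment would use public-key signatures (RSA or Ed25519) so that nodes verifying jobs never hold the signing key. All names here are hypothetical.

```python
import hmac
import hashlib

SIGNING_KEY = b"ops-team-signing-key"  # hypothetical; asymmetric keys in practice

def sign_program(program: bytes) -> str:
    """'Bless' a program: compute its signature (HMAC stand-in for a
    public-key signature scheme)."""
    return hmac.new(SIGNING_KEY, program, hashlib.sha256).hexdigest()

def run_if_signed(program: bytes, signature: str) -> str:
    """Refuse to execute any job whose signature does not verify."""
    expected = sign_program(program)
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("unsigned or tampered program; refusing to run")
    # ...hand the program to the job scheduler here...
    return "scheduled"

job = b"count failed logins per user"
sig = sign_program(job)
```

Note the use of a constant-time comparison (`hmac.compare_digest`) so the check itself does not leak timing information.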

The tools we need then are:

  • Protection of PII and other critical data on input/ingest (DataGuise)
  • Protection of PII and other critical data retroactively for whatever is already in the pool (DataGuise)
  • Protection of PII and other critical data on output
  • Protection of PII and other critical data in motion (Requires implementing IPsec/tunnels within Hadoop)
  • Limit access to the pool of data to properly signed applications (Requires Hadoop to check signatures of programs before running them)
  • Encryption of data at rest either in the storage fabric (Brocade Fabric encrypting switches) or at the drive level (Self Encrypting Drives)

Some of these measures exist today (ingest and retroactive protection, encrypted links between Hadoop nodes, and encryption at rest), but the others (program signatures and output-level redaction) do not. Perhaps that is the direction we should take to increase the security of our big data environments. Big data pools are becoming repositories not just of user access data for websites but of security data and data about the inner workings of a business; as such, we must protect that data in some fashion. How do you protect that data today?

Edward Haletky

Edward L. Haletky, aka Texiwill, is the author of VMware vSphere(TM) and Virtual Infrastructure Security: Securing the Virtual Environment as well as VMware ESX and ESXi in the Enterprise: Planning Deployment of Virtualization Servers, 2nd Edition. Edward owns AstroArch Consulting, Inc., providing virtualization, security, network consulting and development and The Virtualization Practice where he is also an Analyst. Edward is the Moderator and Host of the Virtualization Security Podcast as well as a guru and moderator for the VMware Communities Forums, providing answers to security and configuration questions. Edward is working on new books on Virtualization.


