Big Data Security

At the recent Misti Big Data Security conference, many ways of securing big data were discussed, from encrypting the entire big data pool to encrypting just the critical bits of data within the pool. Several of the talks included general discussion of securing Hadoop as well as access to the pool of data. These security measures include role-based access control (RBAC), encryption of data in motion between Hadoop nodes, and tokenization or encryption of data on ingest. What was missing was greater control over who can access specific data once that data is in the pool. How could role-based access controls be put into effect per datum? Is such protection too expensive given the time-critical nature of analytics, or are there other ways to implement datum security?

Encryption up and down the stack

Securing the pool of data used for security analytics is crucial to the security of your environment, because that data could give an attacker a map to your entire virtual and hybrid cloud environment. Yet we still need to act on that data. So how can we access the data securely while protecting the pool of data? The common answer is encryption at rest, but as we discussed in Virtualizing Business Critical Applications – Integrity & Confidentiality, where you put encryption is just as important as the type of encryption. As we see in the figure, if you encrypt too low in the stack (encrypting fabric controllers, encrypting HBAs, encrypting storage controllers, and self-encrypting drives), you are only protecting against data loss when a failed drive is not destroyed on site but elsewhere. Yet if you encrypt too high (within the application), you could over-encrypt and impact performance.

Since it is nearly impossible to encrypt an entire pool of big data (due to volume, velocity, and variety), how can we protect the individual datum within the pool? If that data is PII, tools such as those from Dataguise can encrypt, tokenize, redact, or otherwise protect the data as it enters the pool, and can do so retroactively as well. Perhaps we can use something similar on ingest of critical data to encrypt part or all of it in such a way that the tools can properly decrypt it as needed. But then we get into the entire definition of critical, since derived data can also be considered critical. Is this even possible given how Hadoop works?

Hadoop works by having a central management system that distributes chunks of data to each Hadoop node. Each node then works on that data to produce new data through the MapReduce process. In essence, we may end up with a data explosion, where the datum we want to protect is spread across many Hadoop nodes in a wide variety of forms. Does that imply that each Hadoop node must encrypt all this data, decrypt it to perform further analysis, and then re-encrypt the output during MapReduce? Or should the data be decrypted only on presentation, according to the level of the person viewing it?

This would make for a very complex set of encryption and decryption rules. Not only that, we would most likely want each stream of data or type of datum to have its own encryption or tokenization, much the same way we deal with PII. In this way we can use the encryption keys to protect data from being viewed, or even acted upon, by those without the rights to that data or to data derived from it.

To me, this seems clumsy at best. So for now, our best bet is to tokenize PII and other critical data on ingest and on output, but to leave the pool untouched. This implies the pool of data itself must be protected properly. That protection comes down to controlling access to the data and to the output derived from it, performing encryption or tokenization on ingest, and applying some form of encryption to data at rest, though not necessarily to data in motion.
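To make tokenization on ingest concrete, here is a minimal sketch in Python. It is not how Dataguise or any other product works. The field names, the demo keys, and the helpers `tokenize` and `tokenize_on_ingest` are all assumptions made for illustration. Each sensitive field gets its own key, so the same value in two different fields yields two different tokens. The tokens are deterministic, so analytics can still group and join on them, but they are one-way: a real deployment that needs to recover the original value would also need a protected lookup vault, which is not shown.

```python
import hashlib
import hmac

# Hypothetical per-field secrets. In practice these would live in a key
# manager, never in source code.
FIELD_KEYS = {
    "ssn": b"demo-key-for-ssn",
    "email": b"demo-key-for-email",
}


def tokenize(field, value):
    """Return a deterministic, one-way token for a value, keyed per field."""
    digest = hmac.new(FIELD_KEYS[field], value.encode("utf-8"), hashlib.sha256)
    # Truncated for readability only; a real system would keep more bits.
    return "tok_" + digest.hexdigest()[:16]


def tokenize_on_ingest(record):
    """Replace sensitive fields with tokens and pass other fields through."""
    return {
        name: tokenize(name, value) if name in FIELD_KEYS else value
        for name, value in record.items()
    }


record = {
    "user": "alice",
    "ssn": "123-45-6789",
    "email": "alice@example.com",
    "bytes_sent": 5120,
}
safe = tokenize_on_ingest(record)
```

Only `safe` would be written to the pool. The `user` and `bytes_sent` fields pass through unchanged, while `ssn` and `email` never enter the pool in clear form.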

The critical component that has yet to be built is control of the output from big data analytics. I see filters on ingest, but where are the similar filters on output? Are there people who need access to all the data? Sure. But does everyone need access to all the data? I would say not. Nor do I think humans need direct access to the data within the pool. If we limit access to just applications, we have a pool of data with far fewer points of access. If we go further and limit access to properly signed programs, then every program must be blessed (signed) before it can be used, and we have one more control over access to our data.
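An output filter of the kind I am asking about could be as simple as the following Python sketch. It assumes analytics results arrive as a list of dictionaries and that each role is mapped to the fields it may see in the clear. The role names, the field names, and the `filter_output` helper are hypothetical and not part of any existing tool.

```python
# Hypothetical mapping of roles to the fields each may see in clear form.
ROLE_VISIBLE_FIELDS = {
    "security_analyst": {"user", "src_ip", "event"},
    "business_analyst": {"event"},
}

REDACTED = "[REDACTED]"


def filter_output(rows, role):
    """Redact every field the given role is not cleared to see."""
    # Unknown roles get an empty set, so everything is redacted by default.
    visible = ROLE_VISIBLE_FIELDS.get(role, set())
    return [
        {name: (value if name in visible else REDACTED)
         for name, value in row.items()}
        for row in rows
    ]


rows = [{"user": "alice", "src_ip": "10.0.0.5", "event": "login_failed"}]

analyst_view = filter_output(rows, "security_analyst")
business_view = filter_output(rows, "business_analyst")
```

Here `analyst_view` keeps all three fields, while `business_view` keeps only `event`. The design choice worth noting is the default: a role that is not listed sees nothing in the clear.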

The tools we need then are:

  • Protection of PII and other critical data on input/ingest (Dataguise)
  • Protection of PII and other critical data retroactively for whatever is already in the pool (Dataguise)
  • Protection of PII and other critical data on output
  • Protection of PII and other critical data in motion (Requires implementing IPsec/tunnels within Hadoop)
  • Limit access to the pool of data to properly signed applications (Requires Hadoop to check signatures of programs before running them)
  • Encryption of data at rest either in the storage fabric (Brocade Fabric encrypting switches) or at the drive level (Self Encrypting Drives)
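The signed-application item in the list above is the least familiar, so here is a rough Python sketch of the gate I have in mind. It is an assumption about how such a check could look, not a description of anything Hadoop does. To stay self-contained it uses an HMAC with a shared secret as a stand-in for a real public-key signature, and the names `sign_program`, `is_blessed`, and `submit_job` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical signing secret held by whoever blesses programs. A real
# scheme would use public-key signatures so that the party verifying a
# program cannot also forge a signature.
SIGNING_KEY = b"demo-signing-key"


def sign_program(program_bytes):
    """Produce a signature over the exact bytes of a program."""
    return hmac.new(SIGNING_KEY, program_bytes, hashlib.sha256).hexdigest()


def is_blessed(program_bytes, signature):
    """Check the signature using a constant-time comparison."""
    return hmac.compare_digest(sign_program(program_bytes), signature)


def submit_job(program_bytes, signature):
    """Accept a job only if its signature matches its bytes."""
    if not is_blessed(program_bytes, signature):
        raise PermissionError("job rejected: signature does not match program")
    return "accepted"


job = b"count failed logins per source address"
signature = sign_program(job)
result = submit_job(job, signature)
```

Any change to the program after it was blessed, even a single byte, makes `submit_job` raise `PermissionError`, which is the extra control over access to the pool.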

Some of these measures exist today (ingest, retroactive protection, encrypted links between Hadoop nodes, and encryption at rest), but the others (signatures and output-level redaction) do not. Perhaps that is the direction we should go to increase the security of our big data environments. Big data pools are becoming repositories not just of user access data for websites but also of security data and data related to the inner workings of a business. As such, we must protect that data in some fashion. How do you protect that data today?