De-Risk Your Data to Accelerate Your Cloud Journey: Part 2 — Design and Potential Pitfalls

Eric McCarty
7 min read · Jul 12, 2021

Reducing the risk of your data while moving it to the cloud can help you get early wins in your cloud journey without adding unnecessary risk to your business.

Photo by NASA on Unsplash

Part 1 of this series can be found here.

In part 1, we discussed some of the history that has caused highly regulated and/or risk-averse (from here, HRRA) companies to be gun-shy about moving large sets of data to the public cloud, and the three traditional options most of these companies take:

  • “Wall Off” Data Environment On-Prem
  • Wait for Perfect InfoSec/Governance in the Cloud
  • Increase Risk Appetite

While most HRRA companies seem to take the middle option, there is a 4th option: the de-risk data pipeline.

Designing Your Acceleration

Your mission: make your data as valuable as possible to your business and as worthless as possible to criminals.

Remember from Part 1: analysts have to collaborate and need access to a ton of data to be valuable. Access to large sets of data is the key component of an analyst’s success (Data Democratization).

Yes but…

In most analytics use cases, specific personally identifiable information (PII) is not necessary to complete the work.

Yes but…

It is difficult to know whether the data they have access to contains sensitive elements.

With that said, the key to accelerating your cloud journey is to de-risk your data before it moves to the cloud. A managed pipeline that does this for you reduces risk and opens up your cloud analytics journey.

Fig. 2-The De-Risk Data Pipeline

The key component is a “de-risk data pipeline” that all data moving to the cloud passes through. Its output is a set of de-risked data assets that, if done correctly, significantly reduce your risk and let you leverage the public cloud while you work to solidify your IAM, firewall, encryption, logging, monitoring, and governance processes. Note that this can be done to accelerate your initial cloud journey, or to accelerate your multi-cloud journey if you are adopting a new service provider.

This pipeline should classify data into three funnels:

  • Restricted data: This is data that is high-risk/low-reward for analytics. If SSNs get exposed, you will be in the news for all the wrong reasons, and no good analytic business decision has ever been made using an SSN. While you will eventually need to figure this out to run mission-critical apps in the public cloud, do yourself a favor and irreversibly redact it before you send it to the public cloud in your analytics journey. While you may be tempted to say “the data is encrypted in the cloud,” remember that it only takes one misconfiguration on a service account to expose that data, and then you are back in “have to get that perfect” mode, slowing you down. Don’t think about it, REDACT IT.
Fig. 3-Example of Redaction
  • Confidential data: This is data where referential integrity has value to analytics, but the underlying data does not. In this funnel, the data could still be valuable to hackers and/or bring reputational risk if it were leaked, but analysts may need to know that the value is consistent across systems. A good example here is email address. Very rarely does an analyst need to know the exact value of an email address. However, knowing that the email addresses in two different systems are the same can be VERY important when trying to integrate them or to find matching email addresses across systems. In this case, the best thing to do is to replace the sensitive element with a reversible pseudonym or a token value. This provides value to the analyst without exposing the underlying value. And if required in certain use cases, the value can be reversed at a later date or on-premises in batch. (A minimal sketch of both redaction and tokenization follows this list.)
Fig. 4-An example of tokenization. Note that the replaced values in both systems are the same, maintaining consistency for any join or integration scenarios.
  • Internal Use Data: This is everything else: data that shouldn’t really be on the open internet, but information that would not bring about reputational risk if exposed.
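
To make the three funnels concrete, here is a minimal sketch in Python. The field names, the field-to-funnel mapping, and the use of keyed HMAC-SHA256 for deterministic tokenization are my own illustrative assumptions, not a prescription; a production tokenization service would typically also keep a token vault on-prem so values can be reversed when a use case truly requires it. Part 3 gets into the actual mechanics.

```python
import hashlib
import hmac

# Illustrative field-to-funnel mapping. In a real pipeline this would come
# from your data classification/governance process, not be hard-coded.
RESTRICTED = {"ssn"}        # funnel 1: irreversibly redact
CONFIDENTIAL = {"email"}    # funnel 2: tokenize, preserving referential integrity
                            # funnel 3: everything else passes through as internal-use

# Secret key for deterministic tokenization, ideally held in an on-prem
# vault/HSM so tokens can only be re-derived (or reversed) on-prem.
TOKEN_KEY = b"replace-with-a-key-from-your-on-prem-vault"


def tokenize(value: str) -> str:
    """Deterministic pseudonym: the same input always yields the same token,
    so joins across systems still work, but the raw value never leaves on-prem."""
    normalized = value.strip().lower()
    return hmac.new(TOKEN_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def de_risk(record: dict) -> dict:
    """Apply the three funnels to a single record before it moves to the cloud."""
    out = {}
    for field, value in record.items():
        if field in RESTRICTED:
            out[field] = "REDACTED"             # irreversible: the value is gone
        elif field in CONFIDENTIAL and value:
            out[field] = tokenize(str(value))   # consistent token across systems
        else:
            out[field] = value                  # internal-use: pass through as-is
    return out


if __name__ == "__main__":
    print(de_risk({
        "ssn": "123-45-6789",
        "email": "jane.doe@example.com",
        "zip_code": "78288",
        "credit_score": 742,
    }))
```

The important property is the contrast: the redacted value is unrecoverable anywhere, while the tokenized value comes out the same wherever the same email appears, so the join scenario in Fig. 4 still works.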

A word of warning: while deciding which elements should go into the three funnels, you will be tempted to overload the first two. One tends to overthink what nefarious things people could do with things like “account balance” or “credit score,” but remember that you are taking away what makes those risky: the true identifiable information that ties them to individuals. A list of balances or credit scores is just a list of numbers if it is not tied to a specific person. 95% or more of your elements should really fall into the third funnel. No company has ever been in the news because a list of zip codes was leaked on the internet. The question of which elements should go where is outside the scope of this blog, but it’s a passion of mine, so feel free to reach out if you ever want to discuss it (I have very strong opinions, but backed by experience!).

Potential Pitfalls

With a properly constructed de-risk data pipeline, your data and analytics journey to the cloud can move forward with a higher degree of confidence without sacrificing value to your analytics community. However, there are some pitfalls you must watch out for to be successful. I have experienced all of these personally.

  • Avoid Fixing All Prior Ills: There will be a temptation to see your cloud journey as an opportunity to clean up all the technical debt you have built up over the years. While this is a reasonable goal, it can also significantly slow down your cloud progress, resulting in missed timelines and frustrated users. You should always have a backlog of technical debt to tackle, but it should be independent of your cloud backlog. Avoid the classic, hard-line stance of “nothing moves to the cloud that isn’t pristine.”
  • Avoid Adding Too Much New Functionality: There will also be a temptation to use your cloud journey as an opportunity to add functionality you have always wanted: sophisticated data lineage, robust metadata, mature data governance, and more. These are all wonderful goals, but if they are requirements before you move to the cloud, you will miss out on the benefits the cloud has to offer. They are highly complex, sometimes multi-year journeys that can be programs of their own. Keep your cloud backlog hyper-focused on moving high-value, de-risked data assets to a place where they are easily accessible by your analysts, so you bring value sooner.
  • Avoid Over-Engineering Your Pipeline: I know what you are thinking: de-risking and moving each data asset is a distinct unit of work, so this is a perfect use case for containers or a serverless model. This may be a controversial take, but unless you have a very sophisticated data engineering practice and a sophisticated operations team, keep it simple. You will likely have a mixture of data assets arriving at all hours of the day in many different formats, and orchestrating and scaling a container platform across those distinct units of work becomes a heavy burden if your engineering team is new to that model. I’m not saying you necessarily need legacy ETL tools to handle this work. But big, beefy processing servers that can handle file sizes from 5KB to 5TB, while not in style, are well-understood and mature platforms for data processing. (A simple batch-driver sketch follows this list.)
  • Avoid Adding Extra Burden to Moving Cloud Data Assets: Most companies have some kind of process for assessing workloads that go to the cloud. Make the de-risk data pipeline itself absorb that burden: get the pipeline reviewed once so that it meets the needs of whatever governance committee or review board oversees the data assets that pass through it. If every individual asset faces an extra burden that an on-prem pipeline does not, time-strapped data engineering teams will opt for the on-prem solution every time, hampering your cloud data progress. In other words, do the hard work up front so that data flowing through the de-risk data pipeline to your cloud data lake/data warehouse carries no more burden than your on-prem data pipelines do.
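
For what it’s worth, “keep it simple” can be as plain as a scheduled script on one of those beefy processing servers that walks a landing directory and runs every new file through the de-risk step before staging it for upload. The paths, CSV format, and upload stub below are hypothetical placeholders of mine, just to show the shape of the approach.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical directories on a single large processing server.
LANDING = Path("/data/landing")      # files arrive here from source systems
OUTBOUND = Path("/data/outbound")    # de-risked files staged for cloud upload
PROCESSED = Path("/data/processed")  # originals archived after processing


def de_risk(row: dict) -> dict:
    """Placeholder for the classification/redaction/tokenization step sketched earlier."""
    return row


def upload_to_cloud(path: Path) -> None:
    """Placeholder: swap in your cloud provider's storage upload client."""
    print(f"would upload {path}")


def process_file(src: Path) -> Path:
    """Run one CSV file through the de-risk step, row by row."""
    dst = OUTBOUND / src.name
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(de_risk(row))
    return dst


def main() -> None:
    for src in sorted(LANDING.glob("*.csv")):
        staged = process_file(src)
        upload_to_cloud(staged)
        shutil.move(str(src), str(PROCESSED / src.name))  # archive the original


if __name__ == "__main__":
    main()
```

A cron entry or a basic scheduler run is enough to start; you can always graduate to containers or serverless once the pattern, and the team operating it, are mature.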

Now all of this is well and good, but what about the nuts and bolts of getting this done? How do you accurately classify data? How do you redact and tokenize in a consistent way? In part 3 of this series, I will dive deeper into the technical mechanics of creating a de-risk data pipeline so you can get more value out of your cloud analytics journey.

Part 3 can be found here.


Eric McCarty

Data Specialist Engineer at Google and former Technical Architect at USAA and Walgreens. Opinions are my own and not those of Google.