What is Data Obfuscation?
Data privacy has never been more of a concern than it is today. Not only does the world run on data, but data breaches keep growing in frequency and scale. Privacy Rights Clearinghouse’s Chronology of Data Breaches lists more than 9,000 data breaches made public since 2005. That's over 10 billion data records breached. According to IBM, data breaches are also getting more costly. Data obfuscation could have prevented the disclosure of many of those records, even if the breaches were successful.
Data obfuscation is a process that obscures the meaning of data as an added layer of data protection. If a breach occurs, properly obfuscated sensitive data will be useless to attackers. The organization, and any individuals represented in the data, will remain uncompromised. Organizations should prioritize obfuscating sensitive information in their data.
Top data obfuscation methods
If you ask ten people the definition of data obfuscation, you'll get 12 different answers. That's because there are many different methods, each designed for specific purposes. Obfuscation is an umbrella term for a variety of processes that transform data into another form in order to protect sensitive information or personal data. Three of the most common techniques used to obfuscate data are encryption, tokenization, and data masking.
Encryption, tokenization, and data masking work in different ways. Encryption and tokenization are reversible in that the original values can be derived from the obfuscated data. Data masking, on the other hand, is irreversible if done correctly. Let's take a brief dive into these three main types of data obfuscation:
- Encryption is very secure, but you lose the ability to work with or analyze the data while it’s encrypted. The stronger the encryption algorithm and the better protected its keys, the safer the data will be from unauthorized access. Encryption is a good obfuscation method if you need to store or transfer sensitive data securely.
- Tokenization substitutes sensitive data with a value that is meaningless. Unlike ciphertext, a token typically has no mathematical relationship to the original value, so it can't be reversed on its own. However, a system that holds the mapping can look up the original data from the token. Tokenized data supports operations like running a credit card payment without revealing the credit card number. The real data never leaves the organization, and can't be seen or decrypted by a third-party processor.
- Data masking substitutes realistic but false data for original data to ensure privacy. Using masked out data, testing, training, development, or support teams can work with a dataset without putting real data at risk. Data masking goes by many names. You may have heard of it as data scrambling, data blinding, or data shuffling. The process of permanently stripping personally identifiable information (PII) from sensitive data is also known as data anonymization or data sanitization. Whatever you call it, fake data replaces real data. There is no algorithm to recover the original values of masked data.
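The tokenization approach described above can be illustrated with a minimal sketch. It assumes a simple in-memory lookup table; the `TokenVault` class and its method names are hypothetical and not taken from any particular product. A real vault would be a hardened, access-controlled data store.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = secrets.token_hex(8)
        while token in self._token_to_value:
            token = secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a party with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
card = "4111111111111111"  # sample value, not a real account
token = vault.tokenize(card)
```

The token can be passed to downstream systems in place of the card number. Without the vault, there is nothing in the token to decrypt.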
Data masking vs data obfuscation in other forms
Data masking is the most common data obfuscation method. The fact that data masking is not reversible makes this type of data obfuscation very secure and less expensive than encryption.
A unique benefit of data masking is that you can maintain data integrity. For example, testers and application developers can use datasets populated with realistic data. Minimizing use of real production data protects the organization from unnecessary risk.
How can fake data have data integrity? In the case of obfuscated data, integrity doesn't mean accurate data. Rather, it means that the dataset maintains its functionality in spite of data anonymization. For example, a credit card number can be replaced by a different 16-digit numerical value that will pass the checksum for a valid credit card number. If it fails the checksum, it does not have data integrity. Any references to other fields must remain functional to maintain integrity, as well.
In short, there are two major differences between data masking and other data obfuscation methods such as encryption or tokenization:
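As a rough sketch of the checksum example, the snippet below assumes the checksum in question is the Luhn algorithm, which is the check commonly applied to payment card numbers. The function names are illustrative only.

```python
import secrets


def luhn_checksum_ok(number: str) -> bool:
    """Return True if the digit string passes the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(payload: str) -> str:
    """Compute the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:  # these positions are doubled once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def mask_card_number(original: str) -> str:
    """Replace a card number with random digits of the same length that pass Luhn."""
    payload = "".join(str(secrets.randbelow(10)) for _ in range(len(original) - 1))
    return payload + luhn_check_digit(payload)


masked = mask_card_number("4111111111111111")  # sample value, not a real account
```

The masked value has the same length as the original and passes the same validation, so applications that check card numbers keep working against the masked dataset.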
- Masked out data is still usable in its obfuscated form
- Once data is masked, the original values cannot be recovered
Benefits of data obfuscation
The most obvious and essential benefit of data obfuscation is hiding sensitive data from those who are not authorized to see it. There are benefits beyond simple data protection:
- Risk and regulatory compliance: Privacy regulations including GDPR require minimization of personal data. With data obfuscation, you can store and disclose minimal personal data. Obfuscation reduces risk of fines, and protects data even if breached.
- Data sharing: With data sharing growing in importance, data masking is the way forward. You can share with third parties, or even make datasets public, when you mask sensitive information.
- Data governance: Data obfuscation is a key component of controlling data access. If you think about it, many business operations don't need unrestricted access to real data. If non-production environments don't require personal data, don't expose sensitive information. That only opens your organization to risk. An obfuscation plan should be part of your data governance framework. And while static data masking creates one masked dataset, dynamic masking offers granular controls. With dynamic data masking, permissions can be granted or denied at multiple levels. Those with a business need can have access to real data, while others will only see what they need to see.
- Flexibility: Data masking also benefits from being highly customizable. You can select which data fields get masked and exactly how to select and format each substitute value. For example, U.S. Social Security numbers have the format nnn-nn-nnnn, where n is a digit from 0 to 9. You can opt to substitute the first five digits with the letter x. You could substitute all nine digits with random numbers. Any substitution is possible; it only depends on what best suits your use case.
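The two Social Security number substitutions mentioned under Flexibility might look like the following sketch. The function names are hypothetical, and the random variant makes no attempt to avoid number ranges that are never issued.

```python
import re
import secrets

SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")


def mask_ssn_partial(ssn: str) -> str:
    """Replace the first five digits with 'x' and keep the last four."""
    if not SSN_PATTERN.fullmatch(ssn):
        raise ValueError("expected format nnn-nn-nnnn")
    return "xxx-xx-" + ssn[-4:]


def mask_ssn_random(ssn: str) -> str:
    """Replace all nine digits with random digits, preserving the dashes."""
    if not SSN_PATTERN.fullmatch(ssn):
        raise ValueError("expected format nnn-nn-nnnn")
    return "".join(
        str(secrets.randbelow(10)) if ch.isdigit() else ch for ch in ssn
    )


print(mask_ssn_partial("123-45-6789"))  # prints xxx-xx-6789
```

Which variant fits depends on the use case: the partial mask keeps the last four digits recognizable for support staff, while the random variant keeps the format but reveals nothing.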
Different data obfuscation techniques yield different benefits. The best method will depend on the data sources and your use case. At a health clinic, a patient's health information may need to be temporarily obscured in transit. A research study may want to strip PII altogether.
Challenges of data obfuscation
Just as data obfuscation has its benefits, it also has its challenges. The biggest challenge is planning, which can eat up a lot of time and resources. Data management is always an enterprise-wide effort. Data owners, data stewards, and users of the data should all be involved in planning data obfuscation efforts. Even selecting which data needs to be obfuscated may take more effort than you imagine. If your organization struggles with data health, you may not have a clear understanding of where all sensitive data is stored.
Let's look at challenges for each obfuscation method:
- Encryption can obfuscate structured and unstructured data, but format-preserving encryption schemes offer less protection.
- Tokenization is strictly used for structured data fields such as credit card numbers or Social Security numbers. As a database increases in size, it becomes difficult to scale tokenization while maintaining performance and security.
- Data masking implementation can demand significant effort. Data masking’s great customizability has a downside: you'll need to customize each field to your specifications.
Data masking and the cloud
Organizations of all sizes and industries are turning to cloud technologies. Cloud-based services speed up data delivery and offer more flexibility than on-premises solutions. While cloud computing has proven to be as safe as, if not safer than, keeping data on premises, some still have security concerns.
Data obfuscation can mitigate these concerns. If data is obfuscated before being ingested into a cloud-native data repository, it will be useless to an attacker even if breached. The stolen data would contain only fake data substituted by data masking. Using a cloud-native data service with data masking tools built into extract, transform, and load (ETL) processes simplifies implementation.
Data obfuscation best practices
Measure twice, cut once — the old carpenter’s adage applies just as well to data obfuscation planning. Successful data obfuscation is best achieved by following best practices. Include these steps in your data obfuscation plan:
- Get buy-in and support from your data owners, data stewards, and management
- Identify sensitive data by collaborating with your organization’s departmental data stewards
- Include data privacy regulations, policies, and standards that your organization must comply with
- Determine the data masking techniques, rules, and formats for each piece of sensitive data. Organizing data into groups with common characteristics can simplify this process
- Select a tool to automate as much as possible
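One way to picture the step of determining techniques, rules, and formats per field is a small rule table that maps each sensitive field to a masking function. This is only a sketch under assumed names: the field names, the `MASKING_RULES` table, and the helper functions are hypothetical, and a dedicated masking tool would replace most of this.

```python
import secrets


def redact(value: str) -> str:
    """Replace the whole value with a fixed placeholder."""
    return "REDACTED"


def keep_last4(value: str) -> str:
    """Hide everything except the last four characters."""
    if len(value) <= 4:
        return "x" * len(value)
    return "x" * (len(value) - 4) + value[-4:]


def random_digits(value: str) -> str:
    """Swap each digit for a random digit, leaving separators in place."""
    return "".join(
        str(secrets.randbelow(10)) if ch.isdigit() else ch for ch in value
    )


# One rule per sensitive field; fields without a rule pass through unchanged.
MASKING_RULES = {
    "name": redact,
    "ssn": random_digits,
    "card_number": keep_last4,
}


def mask_record(record: dict) -> dict:
    """Apply the configured rule to each field of a record."""
    return {
        field: MASKING_RULES[field](value) if field in MASKING_RULES else value
        for field, value in record.items()
    }


record = {
    "name": "Ada Lovelace",
    "ssn": "123-45-6789",
    "card_number": "4111111111111111",
    "plan": "gold",
}
masked_record = mask_record(record)
```

Grouping fields that share a rule, as the list above suggests, keeps the table short and makes it easier to review with data stewards.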
Unless there is a specific need for your obfuscation technique to be reversible, use irreversible data masking. It is the surest way to protect sensitive data, and the masked dataset will be equally useful as test data.
For data masking to be done right, you must ensure that data integrity is maintained. Data integrity is essential so that the masked data can be used as effectively as the original data. For example, suppose you plan to analyze credit card usage in the future and want to know how many credit card numbers in your dataset were issued by each bank. Since the first six digits of a credit card number are the bank identification number (BIN), that's all you need to see. If you obfuscate the other digits, you'll get the information you need, maintain integrity, and protect sensitive data.
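The BIN example can be sketched as follows, assuming a six-digit BIN as described above. The card numbers are made-up sample values, and `mask_keep_bin` is a hypothetical helper. For brevity, this sketch does not recompute the card checksum after masking.

```python
import secrets
from collections import Counter

BIN_LENGTH = 6  # assumption taken from the text: the first six digits identify the bank


def mask_keep_bin(card_number: str) -> str:
    """Keep the BIN and replace every remaining digit with a random digit."""
    bin_part = card_number[:BIN_LENGTH]
    rest = "".join(
        str(secrets.randbelow(10)) for _ in card_number[BIN_LENGTH:]
    )
    return bin_part + rest


# Made-up sample values, not real accounts.
cards = ["4111111111111111", "4111119999999999", "5500000000000004"]
masked_cards = [mask_keep_bin(c) for c in cards]

# The per-bank count still works on the masked data.
counts = Counter(m[:BIN_LENGTH] for m in masked_cards)
```

Here `counts` reports two cards for BIN 411111 and one for BIN 550000, the same answer the original data would give, while the account-specific digits are gone.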
How to make data obfuscation work for you
There are several types of data obfuscation, and the right method depends on the task at hand. The most common use cases are testing, training, application development, and support. These call for data masking — permanently replacing sensitive data with realistic fake data. Masked data can maintain the integrity of the original dataset. It can't be decrypted. You can customize it to meet your specific needs.
Data masking has many benefits for data governance, risk, and compliance. That said, be aware that doing the job right may consume time and resources. Using best practices will make the process much more efficient. The best way to cut costs and effort is to start with a solid plan and automate data masking processes wherever possible.
Talend Data Fabric helps you simplify the data masking process. Talend's comprehensive suite of apps focuses on data integration and data integrity. Talend Data Fabric empowers companies to collect, govern, transform, and share healthy data.
Are you ready to reduce your regulatory footprint, realize savings, and reduce risk? Share quality data across your organization without exposing sensitive information. Try Talend Data Fabric today for data you can trust.