Data Warehouse Testing (vs ETL Testing)
For most companies, bad data eats into 15% to 25% of overall business revenue. On a macro level, poor data quality costs the U.S. economy as a whole $3.1 trillion per year. Along with this financial hit, one in three business leaders do not trust their own company’s data. Given such drastic consequences of unreliable data, it is easy to see why ensuring the veracity and integrity of data has quickly become a high priority.
The most logical first step to testing data quality? Starting in your data warehouse, where all the pieces of the puzzle come together. In this article, we aim to understand what data warehouse testing is and how it’s essential for decision-making in organizations.
What is data warehouse testing?
Data warehouse testing is the process of building and executing comprehensive test cases to ensure that data in a warehouse has integrity and is reliable, accurate, and consistent with the organization’s data framework. This process is crucial for modern businesses because of the increasing emphasis on data analytics: complex business insights are derived on the assumption that the underlying data is trustworthy.
The process of data warehouse testing is not simply executed after data from multiple sources lands inside the warehouse. Rather, it addresses the complete data pipeline, including data that is in flight during extract, transform, and load (ETL) operations. By validating data at intermediate stages, it becomes possible to isolate and resolve problem areas quickly.
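As a minimal illustration of validating data at an intermediate stage, the sketch below uses Python with the standard sqlite3 module standing in for real source and staging databases; the table and column names are hypothetical. It reconciles row counts and a simple amount checksum between a source table and its staging copy.

```python
import sqlite3

# Hypothetical example: an in-memory SQLite database stands in for the
# source and staging databases; in practice these would be separate systems.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL);
    CREATE TABLE stg_orders (order_id INTEGER, amount REAL);
    INSERT INTO src_orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
    INSERT INTO stg_orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
""")

def reconcile(table_a, table_b):
    """Compare row counts and an amount checksum between two pipeline stages."""
    count_a, sum_a = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table_a}").fetchone()
    count_b, sum_b = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table_b}").fetchone()
    assert count_a == count_b, f"row count mismatch: {count_a} vs {count_b}"
    assert abs(sum_a - sum_b) < 1e-9, f"checksum mismatch: {sum_a} vs {sum_b}"

reconcile("src_orders", "stg_orders")
print("source and staging agree on counts and totals")
```

Running the same reconciliation at each stage of the pipeline makes it much easier to see where a discrepancy was first introduced.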
The testing also covers business intelligence (BI) reports and dashboards that use the consolidated data as their source. Introducing this additional layer of validation confirms the quality of data after all ETL operations are complete.
In essence, data warehouse testing encompasses both ETL testing and BI testing, two important aspects of any warehouse.
Data warehouse testing vs. ETL testing
The terminology of data warehouse testing is often used interchangeably with ETL testing. However, it’s important to recognize that ETL testing is only one part of data warehouse testing.
The objective of data warehouse testing is to make sure that the consolidated data inside a warehouse is reliable enough for the organization to base its decisions on and that no information has been lost during the transformations. It also aims to confirm that BI applications and reports point to the right data columns and pull out exactly the information that users are looking for, thereby increasing the confidence level in such apps and reports.
On the other hand, the goal of ETL testing is to ensure that the transfer of data from heterogeneous sources to the central data warehouse occurs with strict adherence to transformation rules and is in compliance with all validity checks.
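To make the distinction concrete, a minimal ETL-style test might re-apply a transformation rule to the source rows and compare the result with what landed in the warehouse. The sketch below assumes a hypothetical rule that converts order amounts from cents to dollars; the rows and field names are illustrative only.

```python
# Hypothetical transformation rule: the ETL job converts amounts in cents
# (source) into dollars (warehouse). The test re-applies the rule and compares.
source_rows = [{"order_id": 1, "amount_cents": 1000},
               {"order_id": 2, "amount_cents": 2550}]
warehouse_rows = [{"order_id": 1, "amount_usd": 10.00},
                  {"order_id": 2, "amount_usd": 25.50}]

def transform(row):
    return {"order_id": row["order_id"], "amount_usd": row["amount_cents"] / 100}

expected = {r["order_id"]: transform(r)["amount_usd"] for r in source_rows}
actual = {r["order_id"]: r["amount_usd"] for r in warehouse_rows}

assert expected == actual, f"transformation mismatch: {expected} vs {actual}"
print("all rows conform to the cents-to-dollars rule")
```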
Since ETL is the most common method of moving data from source to destination, ETL testing is an integral component of data warehouse testing.
Benefits of data warehouse testing
Although the primary benefit of data warehouse testing is the ability to verify data integrity and consistency, there are many advantages to instituting a reliable process. For example, data warehouse testing extends the rigorous testing mindset that IT teams already apply to development and deployment activities.
Let’s look at a few top benefits that businesses gain through data warehouse testing:
- High confidence in data quality, the key building block of a superlative analytical model
- Data acquisition errors are spotted early on rather than by data analysts or domain experts at a later point, when the cost of fixing issues may be high
- Subpar data quality carries a heavy financial impact; one estimate pegs the average loss at $15 million
- Reputational loss due to bad data can prove costly. In one case, systemic weaknesses in a company’s reporting systems damaged its reputation and drew fines to the tune of 5.9 million pounds.
- Businesses spend huge resources in setting up a data warehouse, and testing data veracity is integral to justifying this dollar spend
- Non-compliance with regulatory acts can incur heavy penalties. The average cost of a data breach, a possible consequence of not testing data thoroughly, has been estimated at around $8.19 million in the US, according to one recent report.
Challenges to data warehouse testing
While there is no doubt that data warehouse testing is important to a successful business, the execution of the process is not often straightforward due to the involved complexities and challenges. Let’s discuss the common obstacles:
Heterogeneity of data
As we need to validate data from heterogeneous sources, the testing scenarios can get unwieldy. For example, the source could be an application, an OLTP database such as Oracle or DB2, or even flat files in different formats. It could reside on premises or in the cloud. The destination, in turn, could be an in-house data warehouse or a cloud data warehouse such as Amazon Redshift or Microsoft Azure.
The testing framework needs to be able to plug into these diverse systems. Additionally, the testing team members need the expertise to switch deftly between different platforms.
High volumes and scalability
The number of datasets can get overwhelming as organizations aim to assimilate data into a common repository. The testing framework needs to be able to execute stress and regression tests across large volumes of data. In addition, it needs to scale as new sources come into the picture or as the data load increases.
The pressure of having to deliver test results quickly sometimes leads to shortcut methods such as sampling, in which only part of each dataset is tested. Although sampling avoids handling the full data volume, it increases the probability that errors slip through undetected.
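The trade-off can be made explicit in the test code itself. The sketch below (plain Python; the dataset and the planted defect are hypothetical) runs the same validity check on a 1% random sample and on the full dataset, showing how a sampled run can easily miss a defect that a full scan catches.

```python
import random

# Hypothetical dataset with a single bad row buried among valid ones.
rows = [{"id": i, "email": f"user{i}@example.com"} for i in range(100_000)]
rows[54_321]["email"] = None  # the defect a full scan would catch

def has_invalid_email(batch):
    return any(r["email"] is None or "@" not in r["email"] for r in batch)

sample = random.sample(rows, 1_000)  # 1% sample: fast, but may miss the defect
print("defect found in sample:", has_invalid_email(sample))
print("defect found in full scan:", has_invalid_email(rows))
```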
Data mapping and transformations
Data is often represented differently across systems, and data types and column names tend to vary. Hence, one of the preliminary requirements for activating data warehouse tests is to clearly map data between source and destination databases. This is a difficult task and prone to errors when done manually.
Additionally, checks have to be done for data completeness and quality, accounting for intermediate transformations, duplicates, and other redundancies.
The solution? An automated data integration tool
An automated data integration solution is the answer to achieve end-to-end data warehouse testing. A comprehensive solution should contain the following features:
- Data connectors
- Automatic charting of data columns and types
- Mapping of associations between one schema and another
- Visual representations
- Automatic code generation to resolve duplicate, orphan, and missing records with minimal intervention (a minimal sketch of this idea follows the list)
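As a rough idea of the code-generation feature in the last item, the sketch below builds a SQL query that finds orphan records (child rows whose parent key is missing) from a small, hypothetical description of two tables. A real tool would generate far richer checks, but the principle is the same.

```python
def orphan_check_sql(child_table, child_key, parent_table, parent_key):
    """Generate SQL that lists child rows with no matching parent (orphans)."""
    return (
        f"SELECT c.* FROM {child_table} c "
        f"LEFT JOIN {parent_table} p ON c.{child_key} = p.{parent_key} "
        f"WHERE p.{parent_key} IS NULL"
    )

# Hypothetical tables: order lines should always reference an existing order.
print(orphan_check_sql("order_lines", "order_id", "orders", "order_id"))
```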
The data warehouse testing process
Testing a data warehouse is a multi-step process that involves activities such as identifying business requirements, designing test cases, setting up a test framework, executing the test cases, and validating data.
1. Identify the various entry points
As loading data into a warehouse involves multiple stages, it’s essential to identify the various entry points so that data can be tested at each of those stages. If testing is done only at the destination, it becomes much harder to determine the root cause when errors are found.
Examples of entry points include the sources themselves, intermediate points in the ETL flow such as before and after transformation, the staging database that often serves as a temporary store before loading into the warehouse, and the BI engine where reports read from warehouse data.
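One lightweight way to keep these entry points explicit is a declarative test plan that names each stage and the checks to run there. The structure below is purely illustrative; the stage names and checks would come from your own pipeline.

```python
# Illustrative test plan: each entry point is listed with the checks to run there.
test_plan = [
    {"stage": "source extracts",     "checks": ["row counts", "schema drift"]},
    {"stage": "pre-transformation",  "checks": ["null rates", "duplicates"]},
    {"stage": "post-transformation", "checks": ["business rules", "data types"]},
    {"stage": "staging database",    "checks": ["row counts vs source", "referential integrity"]},
    {"stage": "warehouse load",      "checks": ["completeness", "no data loss"]},
    {"stage": "BI reports",          "checks": ["aggregates match warehouse queries"]},
]

for step in test_plan:
    print(f"{step['stage']}: {', '.join(step['checks'])}")
```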
2. Prepare the required collaterals
Two fundamental collaterals required for the testing process are database schema representation and a mapping document.
The mapping document is usually a spreadsheet that maps each column in the source database to its counterpart in the destination database. It also includes complex SQL queries (which may require multiple table joins) that compare the two columns to evaluate whether the data has landed correctly in the destination.
A data integration solution can help generate the mapping document, which is then used as an input to design test cases.
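A minimal sketch of how a mapping document can drive test cases: read a CSV export of the spreadsheet and emit one comparison query per column pair. The table and column names are hypothetical, and a real mapping would also carry transformation logic that the queries would need to account for.

```python
import csv
import io

# Hypothetical mapping document exported as CSV: one row per column mapping.
mapping_csv = """source_table,source_column,target_table,target_column
crm_customers,cust_name,dim_customer,customer_name
crm_customers,cust_email,dim_customer,email_address
"""

def comparison_sql(row):
    """Emit a query that counts source values missing from the target column."""
    return (
        f"SELECT COUNT(*) AS missing FROM {row['source_table']} s "
        f"WHERE s.{row['source_column']} NOT IN "
        f"(SELECT t.{row['target_column']} FROM {row['target_table']} t)"
    )

for row in csv.DictReader(io.StringIO(mapping_csv)):
    print(comparison_sql(row))
```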
3. Design an elastic, automated, and integrated testing framework
ETL is not a one-time activity. While some data is loaded all at once and some in batches, new updates may trickle in through streaming queues. The testing framework therefore has to be generic and architecturally flexible: it should accommodate new and diverse data sources and types, handle growing volumes, and work seamlessly with both cloud and on-premises databases.
Also, integrating the test framework with an automated data solution (that contains features as discussed in the previous section) increases the efficiency of the testing process.
4. Adopt a comprehensive testing approach
The testing framework needs to aim for 100% coverage of the data warehousing process. For instance, although the primary focus here is on the data itself, application components such as ETL tools, reporting engines, or GUI applications need to be included in the testing framework. Also, it’s important to design multiple testing approaches such as unit, integration, functional, and performance testing.
The data itself has to be scrutinized through many checks, including looking for duplicates, matching record counts, and verifying completeness, accuracy, absence of data loss, and correctness of transformations. Additionally, data on BI reports needs to be evaluated for adherence to business rules along with quality checks. Even the metadata and the mapping document need to be validated for accuracy.
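The sketch below strings a few of these checks together in plain Python (duplicate keys, completeness of a required column, and a record-count match); the rows are hypothetical, and a real suite would typically run equivalent checks as SQL against the warehouse itself.

```python
# Hypothetical warehouse extract used to demonstrate a few of the checks above.
warehouse_rows = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},             # completeness failure
    {"customer_id": 2, "email": "b@example.com"},  # duplicate key
]
expected_source_count = 3

ids = [r["customer_id"] for r in warehouse_rows]
failures = []
if len(ids) != len(set(ids)):
    failures.append("duplicate customer_id values")
if any(r["email"] is None for r in warehouse_rows):
    failures.append("missing email values")
if len(warehouse_rows) != expected_source_count:
    failures.append("record count does not match source")

print("failures:", failures or "none")
```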
The cloud and the future of data warehouse testing
The growth in cloud adoption has been one of the pathbreaking trends for the technology industry and for the data revolution in particular. The Future of the Cloud Study by LogicMonitor estimated that 83% of enterprise workloads would be in the cloud by 2020.
This has specific repercussions for data warehouse testing. For one, it expands the scope of data sources to external ones, over which organizations do not have any control. For instance, the source could be an XML file, a video, or a series of tweets. This means that, beyond conventional data types, the testing framework needs to handle semi-structured and unstructured data formats robustly.
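For semi-structured inputs, a common pattern is to flatten each record into the columns the warehouse expects before running the usual checks. The sketch below uses Python’s standard json module on a hypothetical tweet-like payload; the field names are assumptions for illustration only.

```python
import json

# Hypothetical semi-structured payload (for example, a tweet-like record).
raw = '{"id": "123", "user": {"name": "sample"}, "text": "hello", "lang": "en"}'

def flatten(payload):
    """Project a nested JSON record onto the flat columns the warehouse expects."""
    doc = json.loads(payload)
    return {
        "tweet_id": doc.get("id"),
        "user_name": doc.get("user", {}).get("name"),
        "text": doc.get("text"),
        "lang": doc.get("lang"),
    }

row = flatten(raw)
# The flattened row can now go through the same completeness checks as
# structured sources, e.g. no required column should end up empty.
assert all(v is not None for v in row.values()), f"incomplete record: {row}"
print(row)
```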
The warehouse itself could be a data lake in the cloud managed by a third-party provider. The hosting platform may not affect data sanity, but it does mean that the testing framework needs to interface easily with cloud-native applications as well as manage data security.
However, the rapid growth of the cloud promises a brighter future for data warehouse testing because it makes it possible to explore and experiment with advanced testing tools and integration solutions available in the cloud. For instance, machine learning libraries in the cloud can help address data quality issues.
Getting started with data warehouse testing
The complexity of data warehouse testing, with its various use cases, makes a compelling case for bringing sophisticated automation into the framework. By using a reliable ETL tool, for instance, the focus of testing remains on the data rather than on the peripheral infrastructure.
Talend Data Fabric offers a single suite of self-service apps for data integration and integrity. With a centralized solution, Talend Data Fabric collects data across systems, performs governance, transforms data to new formats, and improves overall data quality — aiding the data warehouse testing process. Ensure your data reliability by establishing an easy-to-use but effective data warehouse testing process. Give Talend a try today.