The activities of most companies, regardless of their size, specifics, or industry, revolve around the most critical element – data. Managing it is quite a challenge because it is often distributed. Imagine a company that stores its data in multiple sources, such as physical computer disks, the cloud, CRM systems, or HR tools. In addition, these files are stored in various formats, like XML, CSV, or TXT. The result is an extensive data resource that does not form a coherent whole and is therefore difficult to manage and to view holistically from a business perspective. Such data does not translate into efficient and intelligent applications; it prolongs processes and requires manual, time-consuming work.
The question arises – how do you make it all consistent so you can reap the benefits of your data? This is where the ETL process comes in.
ETL – Definition
ETL stands for Extract, Transform, and Load. Each of these words represents one of the stages of the data integration process.
- Extract – exporting data from different sources.
- Transform – transforming data to meet the requirements of the target system.
- Load – loading transformed data into the target system like a data warehouse.
ETL thus enables data from many separate sources to be aggregated, transformed into a consistent format, and transferred to a target system, such as a data warehouse or an ERP platform, to prepare it for analysis and enable organizations to extract valuable information.
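To make the three stages concrete, below is a minimal sketch of an ETL pipeline in Python. The file name, table, and columns (sales.csv, sales, customer_id, amount) are illustrative assumptions, and SQLite stands in for a real data warehouse:

```python
import csv
import sqlite3

def extract(csv_path):
    """Extract: read raw records from a hypothetical flat-file source."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: unify formats and drop records that fail basic checks."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):                  # skip incomplete records
            continue
        row["amount"] = round(float(row["amount"]), 2)  # normalize the numeric format
        cleaned.append(row)
    return cleaned

def load(rows, db_path="warehouse.db"):
    """Load: write the cleaned records into the target table."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL)")
    con.executemany(
        "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
        [(r["customer_id"], r["amount"]) for r in rows],
    )
    con.commit()
    con.close()

# The whole pipeline is just the three stages composed in order.
load(transform(extract("sales.csv")))
```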
Now that we know the definition of ETL, let’s take a closer look at each step in the process.
ETL – Process
Data Extraction
The first stage of the ETL process, already mentioned, is Extract, which means pulling data from all available sources. These can be databases, files in various formats (including flat files), CRM or ERP systems, applications and tools, or transactional systems such as Oracle or Microsoft SQL Server. Data can also come from streams maintained in real time.
At this stage, the data is placed in temporary storage (a staging area), where the next stages will take place. This is also the right time to look at your data – where it comes from, who owns it, and what problems exist in how it is created, stored, and used. That is why it is not uncommon at this stage to conduct a data audit to determine the data’s quality, size, increments, or formats – all so that specific errors are not carried over into subsequent stages, and so the data ends up as consistent as possible and ready for further analysis.
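As an illustration of this stage, here is a sketch that pulls data from two hypothetical sources – a flat-file export and an operational database – and lands it unchanged in a staging directory. All paths, table names, and source names are assumptions:

```python
import csv
import json
import sqlite3
from pathlib import Path

STAGING = Path("staging")      # temporary storage for the raw extracts
STAGING.mkdir(exist_ok=True)

def extract_csv(path):
    """Pull records from a flat-file export (e.g., an HR tool)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def extract_db(db_path, query):
    """Pull records from an operational database (e.g., a CRM backend)."""
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    rows = [dict(r) for r in con.execute(query)]
    con.close()
    return rows

# Land each source as-is in the staging area, keeping provenance in the file name.
for name, rows in {
    "hr_employees": extract_csv("hr_export.csv"),
    "crm_customers": extract_db("crm.db", "SELECT * FROM customers"),
}.items():
    (STAGING / f"{name}.json").write_text(json.dumps(rows))
```

Keeping the raw extracts untouched in staging makes it easier to audit the data before any transformation happens.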
Data Transformation
The collected data is now ready to be transformed to meet both the requirements of the target system it will end up in and – equally important – its business purpose. This is not, after all, a process carried out just to keep the data in order. The most desirable result is integrated, high-quality data whose analysis will provide the information the organization needs to make accurate decisions.
At this stage, the data aggregated in temporary storage is cleaned, repaired, and supplemented. In this way, inaccuracies and errors – duplicated records, incomplete or erroneous information – are reduced. This is also the moment to fill in missing data (e.g., by adding new columns), break down the contents of fields, and group and aggregate data where necessary. You can then test the data using available tools or by writing your own tests.
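A brief sketch of such transformation steps using pandas; the columns and cleaning rules are illustrative assumptions rather than a fixed recipe:

```python
import pandas as pd

# A hypothetical staged extract with the usual defects: duplicates,
# missing values, and inconsistent formats.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, None],
    "country":     ["PL", "PL", "de", "DE", "PL"],
    "amount":      [100.0, 100.0, 250.5, None, 80.0],
})

clean = (
    raw
    .dropna(subset=["customer_id"])   # drop incomplete records
    .drop_duplicates()                # remove duplicated records
    .assign(
        country=lambda df: df["country"].str.upper(),  # unify formats
        amount=lambda df: df["amount"].fillna(0.0),    # fill missing values
    )
)

# Aggregate to the grain the target system expects (here: per-country totals).
per_country = clean.groupby("country", as_index=False)["amount"].sum()
print(per_country)
```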
Data Loading
Has our data been transformed and cleaned, and is it ready for further action? Then it is time to move it to the target system. This is where it will be stored and made available for further use in analysis, reports, or business applications. A key element of this stage is ensuring that the data has been loaded correctly and that its structure meets both the needs of users and the requirements of the system.
Data can be loaded in its entirety (full load) or in batches, incrementally (incremental load). If you opt for the first option, i.e., transferring the data in its entirety, the volume may grow over time and slow the process down. The second option allows for faster data processing.
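A minimal sketch of an incremental load using a watermark – the highest timestamp already present in the target table – so that each run inserts only the records that are newer. The table and column names are assumptions, and SQLite again stands in for the warehouse:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical target system
con.execute(
    "CREATE TABLE IF NOT EXISTS sales (customer_id TEXT, amount REAL, loaded_at TEXT)"
)

def incremental_load(source_rows):
    """Insert only the records newer than the current watermark."""
    # Watermark: the most recent timestamp already present in the target.
    (watermark,) = con.execute(
        "SELECT COALESCE(MAX(loaded_at), '1970-01-01') FROM sales"
    ).fetchone()

    new_rows = [r for r in source_rows if r["loaded_at"] > watermark]
    con.executemany(
        "INSERT INTO sales (customer_id, amount, loaded_at) VALUES (?, ?, ?)",
        [(r["customer_id"], r["amount"], r["loaded_at"]) for r in new_rows],
    )
    con.commit()
    return len(new_rows)

# Records at or before the watermark are skipped on each run.
print(incremental_load([
    {"customer_id": "1", "amount": 100.0, "loaded_at": "2024-01-02"},
]))
```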
What could our target system be? Most often, it is a data warehouse, but other options are data hubs or data lakes – here, we have a choice between cloud and on-premises. Once the data has been loaded, it is ready for analysis in BI or other data analytics tools.
ETL – Testing
Testing is one of the necessary activities in the ETL process. ETL testing ensures that data is correctly processed and loaded into the target system at every stage.
ETL testing is performed before the data is moved to the production environment. It differs from database testing in both its scope and the steps that should be considered.
The main goal of ETL testing is to identify and mitigate data defects, as well as general errors, before the data is transformed for analytical reporting purposes.
Here is a list of typical tasks connected with ETL testing:
- Understanding data used in reporting
- Data model overview
- Source-to-target mapping
- Checking source data
- Verifying data in the target system
- Verifying data transformation calculations and the aggregation rules
- Sample data comparison between the source and target systems
Both ETL testing and database testing include data validation; however, it does not look the same in both cases. ETL testing is mainly performed on data in data warehouse systems, while database testing is commonly executed on transactional systems, where data flows into the transactional database from different applications.
ETL testing includes:
- Validating data transferred from the source database to the target system.
- Verifying the amount of data in the source and target systems.
- Verifying that the data extraction and transformation follow the requirements and specifications.
- Checking the relations between tables: whether joins and keys are preserved after transformation.
Database testing puts more emphasis on data accuracy, data correctness, and valid values. It includes:
- Checking if primary and foreign keys are maintained.
- Checking if the table columns have correct values.
- Verifying data accuracy in the columns – e.g., a “Month” column should not contain values greater than 12 (a few such checks are sketched after this list).
- Verifying missing data in the columns, e.g., checking for empty values in columns that should contain valid data.
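To illustrate such checks, here is a sketch that runs a few of the validations above as SQL queries against a hypothetical warehouse with sales and customers tables; each query should return zero offending rows:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical target database

# Each check is a query that should return zero offending rows.
checks = {
    "month out of range":
        "SELECT COUNT(*) FROM sales WHERE month < 1 OR month > 12",
    "missing customer_id":
        "SELECT COUNT(*) FROM sales WHERE customer_id IS NULL",
    "orphaned foreign keys":
        """SELECT COUNT(*) FROM sales s
           LEFT JOIN customers c ON c.customer_id = s.customer_id
           WHERE c.customer_id IS NULL""",
}

for name, query in checks.items():
    (violations,) = con.execute(query).fetchone()
    status = "OK" if violations == 0 else f"FAILED ({violations} rows)"
    print(f"{name}: {status}")
```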
ETL Testing – Types of Tests
Categorization of ETL tests is based on the testing and reporting goals. Test categories differ depending on the organization’s standards or the client’s requirements. Generally speaking, ETL testing falls into the following categories:
- Testing the number of records – matching the number of records in the source and target systems (see the sketch after this list).
- Testing source data – includes checking data correctness between the source and target systems. It also covers data integration, checking threshold values, and checking for duplicated data in the target system.
- Data mapping or transformation testing – verifies object mapping in source and target systems. Includes checking data functionality in the target system.
- Testing end data – generating reports for end users to verify that the data meets their expectations. It also means finding discrepancies in the reports and checking the data in the target system to confirm that the reports are correct.
- Testing system integration – testing each system separately and joining the results to find any discrepancies. Three approaches can be used for this type of testing: top-down, bottom-up, and hybrid.
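As an example, here is a sketch of the record count test from the first bullet above, assuming both systems are reachable as databases and share a hypothetical orders table:

```python
import sqlite3

# Hypothetical connections; in practice these would point at the actual
# source system and the data warehouse.
source = sqlite3.connect("source.db")
target = sqlite3.connect("warehouse.db")

def count_rows(con, table):
    """Return the number of records in the given table."""
    (n,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return n

src_count = count_rows(source, "orders")
tgt_count = count_rows(target, "orders")

assert src_count == tgt_count, (
    f"Record count mismatch: source={src_count}, target={tgt_count}"
)
```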
Based on the system structure, the ETL testing (regardless of the tool used) can be divided into the following categories:
- Testing a new system – in this type of test, a new system is built and verified. The input data is taken from clients or end users (or other sources), and a new database (data warehouse) is created. Next, the data is verified in the new system using ETL tools.
- Migration testing – in migration testing, the client has an existing database and looks for a new ETL tool to improve efficiency. The testing includes migrating the data from the existing system using the new ETL tool.
- Change testing – in the change testing, new data from different sources are added to the existing system. The clients can change or add new ETL rules.
- Report testing – creating reports to test data correctness. The reports are the “final result” of every system. When testing the reports, their layout, data, and calculated values are within the scope of the tests.
ETL Testing – Challenges
ETL testing differs from database testing and other conventional tests, so facing various problems may be unavoidable. Here are some typical issues that can be encountered in ETL testing:
- Loss of data during the ETL process
- Incorrect, incomplete, or duplicated data
- The system contains historical data that is too large and too complex to run ETL tests on in the target system
- ETL testers usually do not have access to the ETL tool’s job schedules; sometimes they do not even have access to the BI reporting tools to see the final layout of the reports and the data in them
- Difficulties in creating test cases when the data is too large and complex, or when there is a lack of data
- No information on end users’ requirements and no flow of business information
- ETL testing incorporates various complex SQL concepts used for data validation in the target system
- No documentation
- Unstable testing environment.
ETL Testing – Testing Methods
Choosing the proper testing method before starting the testing is particularly important. Many approaches can be used; below are brief descriptions of a few testing approaches used in BI projects:
- Comparing the number of records between the source and target databases:
This test takes less time than other testing techniques. It compares the amount of data in the source and target systems without checking the values of the data itself. It also does not cover aggregations or checking the ascending or descending order defined by the data mapping.
- Testing target-source data:
In this type of testing, the subject is the values of the data transferred from the source to the target system. It checks the data values after transformation and is used in, e.g., financial and banking projects.
- Data integration:
The tester checks the data ranges – all data in the target system should be valid.
Example: an ‘Age’ attribute should not contain values greater than 100, and no value in a ‘Month’ column should be greater than 12.
- Data checking and testing restrictions:
This includes carrying out different testing activities, such as checking data types, data lengths, and indexes. Here, a tester covers scenarios such as primary key, foreign key, NOT NULL, NULL, and UNIQUE constraints.
- Testing duplicates:
This testing means checking that there are no data duplicates in the target system. When the target system contains massive amounts of data, duplicated data may also be found in the production system, producing incorrect data in the analytical reports. Duplicated data can be detected using SQL instructions (see the sketch at the end of this section). The duplicates exist due to:
- Lack of a defined primary key
- Inaccurate mapping or environmental issues
- Manual errors during data transfer between the source and target systems.
- Data transformation testing:
This type of testing cannot be executed by running a single SQL instruction. It is time-consuming, as it requires running multiple SQL queries for each row to verify the transformation rules: the tester must run the queries for each row and then compare the input data with the target data.
- Data quality testing:
It means checking dates, figures, null values, etc. A tester performs syntax testing to report invalid characters and incorrect capitalization, and reference testing to check that the data is compatible with the data model.
- Incremental testing:
It is performed to verify that the Insert and Update instructions produce the expected results. The testing takes place step by step with the old and new data.
- Regression testing:
Changes in data transformation and aggregation rules are implemented to help testers find new errors. The errors found in this process are called regressions.
- Retests:
Testing is performed after the code is fixed.
- System integration testing:
Testing the individual system components separately and, later on, integrating the modules.
- Delta/Data loading:
It means testing the differences that appear during data loading. These tests check whether the processes responsible for maintaining data integrity when loading multiple databases are performed correctly.
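As mentioned under ‘Testing duplicates’ above, duplicates are usually detected with a grouping query. Below is a sketch against a hypothetical orders table, where rows sharing the same business-key columns more than once count as duplicates:

```python
import sqlite3

con = sqlite3.connect("warehouse.db")  # hypothetical target database

# Rows that share all business-key columns more than once are duplicates.
duplicates = con.execute(
    """SELECT customer_id, order_date, COUNT(*) AS occurrences
       FROM orders
       GROUP BY customer_id, order_date
       HAVING COUNT(*) > 1"""
).fetchall()

for customer_id, order_date, occurrences in duplicates:
    print(f"duplicate: customer={customer_id}, date={order_date}, x{occurrences}")
```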
Summary
A properly executed ETL process, with an ongoing testing component and a clear business objective, greatly benefits companies. It provides access to arranged, cleaned, and consistent data in one place, so you can forget about manually digging through thousands of files in different formats from multiple sources. Instead, you can freely analyze the data, turn it into valuable reports, and, based on that, streamline internal processes and make beneficial business decisions.
If you are interested in ETL services for your company, contact us – we have a team of experts who are experienced in carrying out this process comprehensively for organizations in the pharmaceutical industry, among others. If you want to get an overview of how we do it, read the success story of one of our clients.