Week 3 - Extract, transform, and load (ETL)
This week, we will practice the ETL process, particularly data extraction and transformation (Layer 2 in the OLAP architecture below).
The lab sheet contains two parts.
Part 1: Self-reading. The aim of Part 1 is to help you understand the basic concepts of ETL and data quality.
Part 2: Practice ETL skills on a simple dataset.
Extract
The process of fetching relevant data from the source systems.
Transform
The process in which we apply transformations such as aggregation, joining with other datasets, applying rules, splitting or merging results, lookups, pivots, data reduction techniques, and many others (see the sketch below).
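For concreteness, here is a minimal sketch of a few of these transformations using pandas in Python. The library choice and all column names are assumptions; the lab does not prescribe any particular tooling.

```python
# A minimal sketch of common transformations (join, aggregation, pivot)
# using pandas. The data and column names are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "store_id": [10, 20, 10, 20],
    "amount": [100.0, 150.0, 80.0, 120.0],
})
products = pd.DataFrame({
    "product_id": [1, 2],
    "category": ["Electronics", "Clothing"],
})

# Join (lookup) against another dataset
joined = sales.merge(products, on="product_id", how="left")

# Aggregation: total amount per category
totals = joined.groupby("category", as_index=False)["amount"].sum()

# Pivot: categories as rows, stores as columns
pivoted = joined.pivot_table(index="category", columns="store_id",
                             values="amount", aggfunc="sum")

print(totals)
print(pivoted)
```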
Load
The process of loading data into data warehouse destination tables such as fact and dimension tables.
Building additional structures to improve system performance.
The source data should be loaded into the data warehouse in a consistent and reliable way.
Architecture of data warehousing:
Data sources → Data staging area → Data warehouse
Data transfer or ETL is an important component of the BI and data warehousing system.
Data extraction is the process of fetching relevant data from the source systems. Data might come from different sources such as an Excel file, a .csv file, a relational database, or a website. The extract step should be able to pick up data from all of these kinds of data sources.
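As an illustration, the extract step might pull data from these kinds of sources roughly as follows. This is a minimal sketch using pandas and sqlite3; the file names, table name, and URL are hypothetical placeholders.

```python
# A minimal sketch of extracting data from several kinds of sources.
# The file names, table name, and URL below are hypothetical.
import sqlite3
import pandas as pd

# From a .csv file
csv_df = pd.read_csv("transactions.csv")

# From an Excel file (an engine such as openpyxl must be installed)
excel_df = pd.read_excel("transactions.xlsx")

# From a relational database
with sqlite3.connect("source.db") as conn:
    db_df = pd.read_sql_query("SELECT * FROM transactions", conn)

# From a website exposing tabular data (requires an HTML parser such as lxml)
web_tables = pd.read_html("https://example.com/transactions")
```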
The transformation step should ensure that this data is consistent.
Build keys; apply data-cleansing rules on the dataset.
Fetch the data incrementally.
Data in the source system might be removed or replaced by newer data records.
Some data rows may appear only once in the data source before being replaced with new data.
Determine the data you need and don't need.
Determine all the data sources, both internal and external.
Establish comprehensive data extraction, cleansing, and transformation rules.
Plan for aggregated tables.
Organise data staging area and test tools.
Write a procedure for all data loads.
Load dimension tables and fact tables.
A staging area ensures that:
We don't perform any extra conversion or transformation work during the extract step, and
The data won't be loaded into memory all at once.
Extracting data from sources into an integrated database comes with some challenges.
For very big datasets, fetching data from sources shouldn't be combined with any kind of conversion or transformation.
The main reason is that it will reduce the performance and slow down the whole ETL process.
Loading a very large dataset into memory will require a high amount of server resources that are not available at all times.
The logical data map describes the relationship between the extreme starting points and the extreme ending points of the ETL system.
The logical data map is usually presented in a table or spreadsheet format and includes the following specific components:
Target table name: The physical name of the table as it appears in a data warehouse.
Target column name: The name of the column in a data warehouse table.
Table type: Indicates if the table is a fact or dimension.
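For illustration, a logical data map can be captured as a small table, for example with pandas. Only the three components above come from the lab sheet; the source_column field and the sample rows below are hypothetical additions.

```python
# A minimal sketch of a logical data map held as a table.
# The source_column values and sample rows are hypothetical.
import pandas as pd

logical_data_map = pd.DataFrame([
    {"target_table": "DimProduct", "target_column": "Product_Name",
     "table_type": "dimension", "source_column": "Product"},
    {"target_table": "FactSales", "target_column": "Sales_Amount",
     "table_type": "fact", "source_column": "Unit_Price * Quantity - Discount"},
])
print(logical_data_map)
```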
Have a plan: the foundation of the metadata
Identify data source candidates: Identify the likely candidate data sources you believe will support the decisions needed by the business community.
Analyse source systems with a data-profiling tool: Detected data anomalies must be documented, and best efforts must be made to apply appropriate business rules to rectify data before it is loaded into the data warehouse.
Context: The analysis of the source system is usually divided into two major phases:
The data discovery phase
The anomaly detection phase
Data discovery phase: A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.
Once you understand what the target needs to look like, you need to identify and examine the data sources.
Online Extraction
Data is extracted directly from the source system.
May access source tables through an intermediate system.
The intermediate system is usually similar to the source system.
Offline Extraction
Data is NOT extracted directly from the source system; instead, it is staged explicitly outside the original source system.
Data is either already structured or was created by an extraction routine.
Full Extraction
The data is extracted completely from the source system.
No need to keep track of changes.
Source data is made available as-is, with no additional information required.
Incremental Extraction
Data is extracted after a well-defined point/event in time.
A mechanism (a column or a table) is used to reflect/record the temporal changes in the data.
Sometimes entire tables are off-loaded from a source system into the data warehouse.
Can have significant performance impacts on the data warehouse server.
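A minimal sketch of incremental extraction is shown below, assuming a hypothetical last_updated column on the source table and using sqlite3 as a stand-in for the real source system.

```python
# A minimal sketch of incremental extraction: only rows changed after a
# well-defined point in time are fetched. The table name, column name,
# and timestamp are hypothetical.
import sqlite3

last_extracted = "2021-03-01 00:00:00"  # high-water mark recorded after the previous run

with sqlite3.connect("source.db") as conn:
    rows = conn.execute(
        "SELECT * FROM transactions WHERE last_updated > ?",
        (last_extracted,),
    ).fetchall()

# After a successful load, store the new high-water mark for the next run.
```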
The data after transformation should be
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places:
after extraction and
after cleaning and confirming
Anomaly Detection
E.g. a row significantly different from the rest
Column Property Enforcement
Null Values in required columns
Numeric values that fall outside of expected highs and lows
The length of an attribute is exceptionally short/long
Columns with certain values outside of discrete valid value sets
Adherence to a required pattern/membership of a set of patterns
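These column property checks could be sketched with pandas as follows. The column names, ranges, lengths, and the Transaction_ID pattern are hypothetical examples loosely based on the transaction dataset used later in this lab.

```python
# A minimal sketch of column property enforcement with pandas.
# Column names, bounds, and the pattern are hypothetical.
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical staging extract

# Null values in required columns
missing_ids = df["Transaction_ID"].isna()

# Numeric values outside expected highs and lows
bad_price = ~df["Unit_Price"].between(0, 10_000)

# Exceptionally short/long attribute values
bad_length = ~df["Customer_Name"].str.len().between(2, 100)

# Values outside a discrete valid value set
bad_gender = ~df["Gender"].isin(["M", "F"])

# Adherence to a required pattern (e.g. "TXN" followed by digits)
bad_pattern = ~df["Transaction_ID"].astype(str).str.match(r"^TXN\d+$")

violations = df[missing_ids | bad_price | bad_length | bad_gender | bad_pattern]
print(violations)
```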
Confirmation can be done in various ways
Structure Enforcement
Tables have proper primary and foreign keys
Obey referential integrity
Data and rule value enforcement
Simple business rules
Logical data checks
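A minimal sketch of structure enforcement follows, checking primary-key uniqueness and referential integrity between a hypothetical fact table and customer dimension.

```python
# A minimal sketch of structure enforcement: the fact table's foreign keys
# must refer to existing dimension rows. Data and column names are hypothetical.
import pandas as pd

dim_customer = pd.DataFrame({"Customer_Key": [1, 2, 3]})
fact_sales = pd.DataFrame({"Customer_Key": [1, 2, 4], "Sales_Amount": [10, 20, 30]})

# Primary key check: dimension keys must be unique and non-null
assert dim_customer["Customer_Key"].is_unique
assert dim_customer["Customer_Key"].notna().all()

# Referential integrity: fact rows whose key has no matching dimension row
orphans = fact_sales[~fact_sales["Customer_Key"].isin(dim_customer["Customer_Key"])]
print(orphans)  # the row with Customer_Key 4 violates referential integrity
```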
Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context.
Data quality is described by several dimensions
Correctness/Accuracy
Consistency
Completeness
Timeliness
Integrity/Validity
Integrity ensures that all data in a database can be traced and connected to other data.
Data cleansing is the process of identifying and correcting dirty data. Dirty data means incomplete, wrong, duplicate, or out-of-date data. The data warehousing community often uses the words cleanse or scrub rather than clean.
Data matching determines whether one data item is the same as another data item. It is used to identify duplicate records. Matching is particularly relevant for character-based data types.
Numeric Data Types
Datetime Data Types
For example, is 03/01/2021 the same as 01/03/2021? Is it the same as 2021-03-01T00:00:00Z+06?
We need some logic here to match them, such as by comparing the components or by comparing the date, month, and year.
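One possible way to implement this logic is with Python's standard datetime module; the parsing formats below are assumptions about how each value should be interpreted.

```python
# A minimal sketch of matching datetime values by parsing them into a
# common representation before comparing them.
from datetime import datetime, timezone, timedelta

# The same calendar date written in two ambiguous formats
a = datetime.strptime("03/01/2021", "%m/%d/%Y")   # month-first: 1 March 2021
b = datetime.strptime("01/03/2021", "%d/%m/%Y")   # day-first: 1 March 2021
print(a == b)  # True once both are parsed with the correct format

# A timestamp with a +06:00 offset is a different instant in time, so
# comparing only the date/month/year components may be what is needed.
c = datetime(2021, 3, 1, tzinfo=timezone(timedelta(hours=6)))
print(c.date() == a.date())  # component-wise comparison of the date parts
```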
Character-based Data Types
For example, a customer named “Mr. Aleck Stevenson”. We need to match/recognise that “Mr. Aleck Stevenson” is the same as “Mr. S Aleck”, “Mr. Stevenson Aleck” and “Mr. Alec Stevenson”.
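A minimal sketch of such matching using difflib from Python's standard library is shown below; real pipelines often use dedicated record-linkage or fuzzy-matching tools, and the 0.8 threshold is an arbitrary assumption.

```python
# A minimal sketch of character-based (fuzzy) matching of customer names.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case, punctuation, and word order."""
    na = " ".join(sorted(a.lower().replace(".", "").split()))
    nb = " ".join(sorted(b.lower().replace(".", "").split()))
    return SequenceMatcher(None, na, nb).ratio()

reference = "Mr. Aleck Stevenson"
for candidate in ["Mr. S Aleck", "Mr. Stevenson Aleck", "Mr. Alec Stevenson"]:
    score = similarity(reference, candidate)
    print(candidate, round(score, 2), "match" if score > 0.8 else "review")
```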
Data quality rules (data validation) are filters to prevent dirty data from getting into the warehouse. Based on the data location, there are three kinds of validation:
Incoming data validation
Checking the incoming data only on its own, without referring to the data already in the warehouse.
These rules verify that the data from the source systems is valid, e.g. within the expected range and in the right format
Cross-Reference Validation
Cross-reference validation is where we check the incoming data against the data in the warehouse.
To make sure that the value of the incoming data is within a certain range that is calculated based on data in the warehouse
Like incoming data validation, cross-reference validation is performed on the fly when the data is being loaded into the warehouse.
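For example, cross-reference validation might compare an incoming value against a range derived from the data already in the warehouse. The mean-plus/minus-three-standard-deviations rule and the figures below are hypothetical.

```python
# A minimal sketch of cross-reference validation: the incoming value is
# checked against a range calculated from data already in the warehouse.
import pandas as pd

warehouse_sales = pd.Series([100.0, 120.0, 95.0, 110.0])   # existing fact rows
incoming_amount = 480.0                                     # value being loaded

# Accept the row only if it lies within mean +/- 3 standard deviations
mean, std = warehouse_sales.mean(), warehouse_sales.std()
lower, upper = mean - 3 * std, mean + 3 * std
if lower <= incoming_amount <= upper:
    print("load the row")
else:
    print("reject or flag the row")  # 480.0 falls outside the expected range
```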
Data Warehouse Internal Validation
Data warehouse internal validation is where we check the data already in the warehouse.
We don't check the incoming data
Unlike the previous two, data warehouse internal validation is performed after the incoming data is fully loaded into the warehouse.
The purpose of doing this is to verify the quality of the data in the warehouse at the aggregate level.
This can be done by comparing the totals over a period of time against a known standard value.
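A minimal sketch of such an aggregate check with pandas is shown below; the monthly total, the "known standard value", and the 1% tolerance are hypothetical.

```python
# A minimal sketch of internal validation: after the load, an aggregate
# total over a period is compared against a known standard value.
import pandas as pd

fact_sales = pd.DataFrame({
    "Date": pd.to_datetime(["2021-03-01", "2021-03-15", "2021-03-30"]),
    "Sales_Amount": [1200.0, 950.0, 1100.0],
})

march_total = fact_sales.loc[
    fact_sales["Date"].dt.strftime("%Y-%m") == "2021-03", "Sales_Amount"
].sum()

expected_total = 3250.0          # known standard value, e.g. from the source system
tolerance = 0.01 * expected_total
if abs(march_total - expected_total) > tolerance:
    print("aggregate check failed:", march_total, "vs", expected_total)
else:
    print("aggregate check passed")
```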
Consider the workflow and try to practice the solution for the scenario below.
The dataset below is an original transaction dataset. The file contains 20 transactions with the following columns:
Transaction_ID: Unique identifier for each transaction
Date: Date of purchase
Customer_Name: Name of the customer
Gender: Gender of the customer (M/F)
Customer_Type: Type of customer (Regular/Premium)
Product: Name of the product purchased
Product_Category: Category of the product
Unit_Price: Price per unit of the product
Quantity: Number of units purchased
Discount: Discount amount applied to the transaction
City: The city where the purchase was made
State: State/province where the purchase was made
Country: The country where the purchase was made
The dimension tables and fact table below were processed from the transaction dataset above.
Can you reproduce these dimension tables and the fact table from the original transaction dataset?
A demonstration video link will be posted in a subpage in Week 4.
You can use any programming language, such as Python or R, or Excel to complete the above task.
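As a starting point (not the official solution, which will be shown in the demonstration video), here is a rough sketch of how dimension and fact tables might be derived with pandas. The surrogate-key scheme, the Sales_Amount formula, and the input file name are assumptions, and the exact structure of the tables in the lab may differ.

```python
# A rough sketch (under stated assumptions) of building dimension and
# fact tables from the transaction dataset.
import pandas as pd

tx = pd.read_csv("transactions.csv")   # hypothetical name of the original dataset

# Customer dimension: one row per distinct customer
dim_customer = (tx[["Customer_Name", "Gender", "Customer_Type"]]
                .drop_duplicates()
                .reset_index(drop=True))
dim_customer["Customer_Key"] = dim_customer.index + 1

# Product dimension
dim_product = (tx[["Product", "Product_Category", "Unit_Price"]]
               .drop_duplicates()
               .reset_index(drop=True))
dim_product["Product_Key"] = dim_product.index + 1

# Location dimension
dim_location = (tx[["City", "State", "Country"]]
                .drop_duplicates()
                .reset_index(drop=True))
dim_location["Location_Key"] = dim_location.index + 1

# Fact table: replace descriptive columns with surrogate keys and add a measure
fact = (tx
        .merge(dim_customer, on=["Customer_Name", "Gender", "Customer_Type"])
        .merge(dim_product, on=["Product", "Product_Category", "Unit_Price"])
        .merge(dim_location, on=["City", "State", "Country"]))
fact["Sales_Amount"] = fact["Unit_Price"] * fact["Quantity"] - fact["Discount"]
fact = fact[["Transaction_ID", "Date", "Customer_Key", "Product_Key",
             "Location_Key", "Quantity", "Discount", "Sales_Amount"]]

print(dim_customer.head())
print(fact.head())
```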