Week 3 - Extract, transform, and load (ETL)

This week, we will practice the the ETL process, particularly data extraction and transformation (Layer 2 in the OLAP Architecture below).

This week, the lab sheet contains 2 parts.

Part 1: Self-reading. The aim of Part 1 is to help you understand the basic concepts of ETL and data quality.
Part 2: Practice ETL skills on a simple dataset.

OLAP Architecture

Part 1: Theoretical Concepts

What is ETL?

Extract
- The process of fetching relevant data from the source systems.
Transform
- The process in which we apply transformation such as
  - aggregation, joining with other datasets, applying rules, splitting or merging results, lookup, join, pivot, and
  - applying the data reduction techniques, and many others.
Load
- The process of loading data into data warehouse destination tables such as fact and dimensions.
- Building additional structures to improve system performance.
The source data should be loaded into the data warehouse in a consistent way and be reliable.

ETL Overview

Architecture of data warehousing:

Data sources $\rightarrow$ Data staging area $\rightarrow$ Data warehouse

ETL Lifecycle

Extract and Transform in ETL

Data transfer or ETL is an important component of the BI and data warehousing system.

The extract step should be able to pick data from all these kinds of data sources.
- Data extraction is a process of fetching relevant data from the source systems. Data extraction is a process of fetching relevant data from the source systems. Data might come from different sources such as an Excel file, a .csv file, a relational database, or a website.
- The transformation step should be careful to ensure this data is consistent.
  - Build keys; apply data-cleansing rules on the dataset.
- Fetch the data in an incremental method.
  - Data in the source system might be removed or replaced by newer data records.
  - Some data rows may appear only once in the data source and be replaced with new data.

Loading Data to the Data Warehouse

ETL Major Steps

Determine the data you need and don't need.
Determine all the data sources, both internal and external.
Establish comprehensive data extraction, cleansing rules and transformation rules.
Plan for aggregated tables.
Organise data staging area and test tools.
Write a procedure for all data loads.
Loading dimension tables and fact tables.

Data Staging

A staging area ensures that

We don't spend any extra activities in the extract step to convert or transform data, and also.
The data won't be loaded all at once into the memory.

Extracting data from sources into an integrated database comes with some challenges.

For very big datasets, fetching data from sources shouldn't be combined with any kind of conversion or transformation.
- The main reason is that it will reduce the performance and slow down the whole ETL process.
- Loading a very large dataset into memory will require a high amount of server resources that are not available at all times

Logical Data Map

The logical data map describes the relationship between the extreme starting points and the extreme ending points of the ETL system usually presented in a table or spreadsheet.

Components of Logical Data Map

The logical data map is usually presented in a table or spreadsheet format and includes the following specific components:

Target table name: The physical name of the table as it appears in a data warehouse.
Target column name: The name of the column in a data warehouse table.
Table type: Indicates if the table is a fact or dimension.

The Logical Data Map Example

Hints of Logical Data Map

Have a plan: the foundation of the metadata
Identify data source candidates: Identify the likely candidate data sources you believe will support the decisions needed by the business community
Analyse source systems with a data-profiling tool: Detected data anomalies must be documented, and best efforts must be made to apply appropriate business rules to rectify data before it is loaded into the data warehouse.

Data Extraction: Data Discovery Phase

Context: The analysis of the source system is usually divided into two major phases:

The data discovery phase
The anomaly detection phase

Data Discovery Phase key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.

Once you understand what the target needs to look like, you need to identify and examine the data sources.

Physical Data Extraction

Online Extraction

Data is extracted directly from the source system.
May access source tables through an intermediate system.
The intermediate system is usually similar to the source system.

Offline Extraction

Data NOT extracted directly from the source system, instead staged explicitly outside the original source system.
Data is either already structured or was created by an extraction routine.

Logical Data Extraction

Full Extraction

The data is extracted completely from the source system.
No need to keep track of changes.
Source data is made available as-is with any additional information.

Incremental Extraction

Data is extracted after a well-defined point/event in time.
The mechanism used to reflect/record the temporal changes in data (column or table).
Sometimes entire tables off-loaded from a source system into the Data Warehouse.
Can have significant performance impacts on the data warehouse server.

Data Transformation

The data after transformation should be

Correct
Unambiguous
Consistent
Complete

Data quality checks are run at 2 places

after extraction and
after cleaning and confirming

Data Transformation: Cleaning Data

Anomaly Detection

E.g. a row significantly different from the rest

Column Property Enforcement

Null Values in required columns
Numeric values that fall outside of expected highs and lows
The length of an attribute is exceptionally short/long
Columns with certain values outside of discrete valid value sets
Adherence to a required pattern/member of a set of pattern

Transformation - Confirming

Confirmation can be done in various ways

Structure Enforcement
Tables have proper primary and foreign keys
Obey referential integrity
Data and rule value enforcement
Simple business rules
Logical data checks

What is data quality?

Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context.

Data quality is described by several dimensions

Correctness/Accuracy
Consistency
Completeness
Timeliness
Integrity/Validity
- Integrity ensures that all data in a database can be traced and connected to other data

Data Cleansing and Matching

It is the process of identifying and correcting dirty data. Dirty data means incomplete, wrong, duplicate, or out-of-date data. The data warehousing community often uses the words cleanse or scrub rather than clean

Data matching is to determine that one data item is the same as another data item. Data matching is used to identify duplicate records. Matching is particularly relevant for character-based data types.

Numeric Data Types
Datetime Data Types
- For example, is 03/01/2021 the same as 01/03/2021? Is it the same as 2021-03-01T00:00:00Z+06?
- We need some logic here to match them, such as by comparing the components or by comparing the date, month, and year.
Character-based Data Types
- For example, a customer named “Mr. Aleck Stevenson”. We need to match/recognise that “Mr. Aleck Stevenson” is the same as “Mr. S Aleck”, “Mr. Stevenson Aleck” and “Mr. Alec Stevenson”.

Data Quality Validation

Data quality rules (data validation) are filters to prevent dirty data from getting into the warehouse. Based on the data location, there are three kinds of validation

Incoming data validation
- Checking the incoming data only on its own, without referring to the data already in the warehouse.
- These rules verify that the data from the source systems is valid, e.g. within the expected range and in the right format
Cross-Reference Validation
- Cross-reference validation is where we check the incoming data against the data in the warehouse.
- To make sure that the value of the incoming data is within a certain range that is calculated based on data in the warehouse
- Like incoming data validation, cross-reference validation is performed on the fly when the data is being loaded into the warehouse.
Data Warehouse Internal Validation
- Data warehouse internal validation is where we check the data already in the warehouse.
  - We don't check the incoming data
- Unlike the previous two, data warehouse internal validation is performed after the incoming data is fully loaded into the warehouse.
- The purpose of doing this is to verify the quality of the data in the warehouse at the aggregate level.
  - This can be done by comparing the totals over a period of time against a known standard value.

Part 2: ETL Practices

Consider the workflow and try to practice the solution for the scenario below.

Description

The dataset below is an original transaction dataset. The file contains 20 transactions with the following columns:

Transaction_ID: Unique identifier for each transaction
Date: Date of purchase
Customer_Name: Name of the customer
Gender: Gender of the customer (M/F)
Customer_Type: Type of customer (Regular/Premium)
Product: Name of the product purchased
Product_Category: Category of the product
Unit_Price: Price per unit of the product
Quantity: Number of units purchased
Discount: Discount amount applied to the transaction
City: The city where the purchase was made
State: State/province where the purchase was made
Country: The country where the purchase was made

The dimension tables and fact table below were processed from the transaction dataset above.

Task

Can you reproduce these dimension tables and the fact table from the original transaction dataset?

Task Demonstration

A demonstration video link will be posted in a subpage in Week 4.

You can use any programming language, such as Python or R, or Excel to complete the above task.

PreviousWeek 2 - Software Setup NextTask Demonstration Video

Last updated 3 months ago