
Week 3 - Extract, transform, and load (ETL)



This week, we will practice the ETL process, particularly data extraction and transformation (Layer 2 in the OLAP Architecture below).

The lab sheet contains two parts:

  • Part 1: Self-reading. The aim of Part 1 is to help you understand the basic concepts of ETL and data quality.

  • Part 2: Practice ETL skills on a simple dataset.

OLAP Architecture

Part 1: Theoretical Concepts

What is ETL?

  • Extract

    • The process of fetching relevant data from the source systems.

  • Transform

    • The process in which we apply transformations such as

      • aggregation, joining with other datasets, applying rules, splitting or merging results, lookup, join, pivot, and

      • applying data reduction techniques, and many others.

  • Load

    • The process of loading data into data warehouse destination tables such as fact and dimension tables.

    • Building additional structures to improve system performance.

    The source data should be loaded into the data warehouse in a consistent and reliable way.
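
To make the three steps concrete, here is a minimal sketch in Python with pandas. The file name, column names, and warehouse table below are hypothetical; they only illustrate the shape of an ETL job, not a prescribed implementation.

```python
import sqlite3

import pandas as pd

# Extract: fetch relevant data from a source system (here, a hypothetical CSV export).
sales = pd.read_csv("sales_source.csv")

# Transform: derive a measure and aggregate it by day.
sales["Revenue"] = sales["Unit_Price"] * sales["Quantity"] - sales["Discount"]
daily_revenue = sales.groupby("Date", as_index=False)["Revenue"].sum()

# Load: write the result into a warehouse destination table (a local SQLite file here).
with sqlite3.connect("warehouse.db") as conn:
    daily_revenue.to_sql("Fact_Daily_Revenue", conn, if_exists="replace", index=False)
```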

ETL Overview

Architecture of data warehousing:

Data sources → Data staging area → Data warehouse

ETL Lifecycle

Extract and Transform in ETL

Data transfer or ETL is an important component of the BI and data warehousing system.

  • The extract step should be able to pick data from many kinds of data sources.

    • Data extraction is the process of fetching relevant data from the source systems. Data might come from different sources such as an Excel file, a .csv file, a relational database, or a website.

    • The transformation step should be careful to ensure this data is consistent.

      • Build keys; apply data-cleansing rules on the dataset.

    • Fetch the data in an incremental way.

      • Data in the source system might be removed or replaced by newer data records.

      • Some data rows may appear only once in the data source before being replaced with new data.
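
For instance, with pandas each of these source types has its own reader. The file names, sheet name, table name, and URL below are placeholders, and the Excel and HTML readers assume the optional openpyxl and lxml packages are installed.

```python
import sqlite3

import pandas as pd

excel_df = pd.read_excel("orders.xlsx", sheet_name="Sheet1")  # Excel workbook
csv_df = pd.read_csv("customers.csv")                         # flat .csv file
with sqlite3.connect("erp.db") as conn:                       # relational database
    db_df = pd.read_sql("SELECT * FROM invoices", conn)
web_tables = pd.read_html("https://example.com/rates")        # tables scraped from a web page
```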

Loading Data to the Data Warehouse

ETL Major Steps

  • Determine the data you need and don't need.

  • Determine all the data sources, both internal and external.

  • Establish comprehensive data extraction, cleansing, and transformation rules.

  • Plan for aggregated tables.

  • Organise data staging area and test tools.

  • Write a procedure for all data loads.

  • Load dimension tables and fact tables.

Data Staging

A staging area ensures that

  • We don't perform any extra conversion or transformation work in the extract step, and

  • The data won't be loaded all at once into memory.

Extracting data from sources into an integrated database comes with some challenges.

  • For very big datasets, fetching data from sources shouldn't be combined with any kind of conversion or transformation.

    • The main reason is that it would reduce performance and slow down the whole ETL process.

    • Loading a very large dataset into memory would require a large amount of server resources, which are not available at all times.
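
One way to respect both points is to stream the raw extract into a staging database in fixed-size chunks and leave every conversion or transformation for a later step. The file name, chunk size, and staging table in this sketch are illustrative assumptions.

```python
import sqlite3

import pandas as pd

with sqlite3.connect("staging.db") as conn:
    # Read the (potentially huge) extract 50,000 rows at a time so it never
    # sits in memory in full, and copy it into the staging area untouched.
    for chunk in pd.read_csv("extract.csv", chunksize=50_000):
        chunk.to_sql("stg_extract", conn, if_exists="append", index=False)
```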

Logical Data Map

The logical data map describes the relationship between the extreme starting points and the extreme ending points of the ETL system.

Components of Logical Data Map

The logical data map is usually presented in a table or spreadsheet format and includes the following specific components:

  • Target table name: The physical name of the table as it appears in a data warehouse.

  • Target column name: The name of the column in a data warehouse table.

  • Table type: Indicates if the table is a fact or dimension.
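
To make these components concrete, the snippet below sketches a few plausible logical-data-map rows for the Part 2 transaction dataset. The Source_Column field is an extra field often recorded alongside the components above, and the specific rows are assumptions for illustration, not the lab's own example.

```python
import pandas as pd

# Illustrative rows only; a real logical data map would cover every target column.
logical_data_map = pd.DataFrame([
    {"Target_Table": "Dim_Customer", "Target_Column": "Customer_Name",
     "Table_Type": "Dimension", "Source_Column": "Customer_Name"},
    {"Target_Table": "Dim_Product", "Target_Column": "Product_Category",
     "Table_Type": "Dimension", "Source_Column": "Product_Category"},
    {"Target_Table": "Fact_Sales", "Target_Column": "Quantity",
     "Table_Type": "Fact", "Source_Column": "Quantity"},
])
print(logical_data_map.to_string(index=False))
```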

The Logical Data Map Example

Hints of Logical Data Map

  • Have a plan: the logical data map forms the foundation of the ETL metadata

  • Identify data source candidates: Identify the likely candidate data sources you believe will support the decisions needed by the business community

  • Analyse source systems with a data-profiling tool: Detected data anomalies must be documented, and best efforts must be made to apply appropriate business rules to rectify data before it is loaded into the data warehouse.

Data Extraction: Data Discovery Phase

Context: The analysis of the source system is usually divided into two major phases:

  • The data discovery phase

  • The anomaly detection phase

Data Discovery Phase

A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it.

Once you understand what the target needs to look like, you need to identify and examine the data sources.

Physical Data Extraction

Online Extraction

  • Data is extracted directly from the source system.

  • May access source tables through an intermediate system.

  • The intermediate system is usually similar to the source system.

Offline Extraction

  • Data is NOT extracted directly from the source system; instead, it is staged explicitly outside the original source system.

  • Data is either already structured or was created by an extraction routine.

Logical Data Extraction

Full Extraction

  • The data is extracted completely from the source system.

  • No need to keep track of changes.

  • Source data is made available as-is, without needing any additional information.

Incremental Extraction

  • Data is extracted after a well-defined point/event in time.

  • A mechanism is used to reflect/record the temporal changes in the data (a column or a table).

  • Sometimes entire tables are off-loaded from the source system into the data warehouse.

  • Can have significant performance impacts on the data warehouse server.
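
A common way to implement incremental extraction is to keep a watermark of the last extraction point and fetch only the rows changed after it. The table name, the last_modified column, and the SQLite source below are assumptions for illustration.

```python
import sqlite3

import pandas as pd

# Watermark recorded at the end of the previous ETL run.
last_extracted = "2024-01-31 23:59:59"

with sqlite3.connect("erp.db") as conn:
    # Full extraction would simply be: SELECT * FROM sales
    # Incremental extraction relies on a column that records temporal changes.
    new_rows = pd.read_sql(
        "SELECT * FROM sales WHERE last_modified > ?",
        conn,
        params=(last_extracted,),
    )
print(len(new_rows), "new or changed row(s) since the last run")
```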

Data Transformation

The data after transformation should be

  • Correct

  • Unambiguous

  • Consistent

  • Complete

Data quality checks are run at two places:

  • after extraction and

  • after cleaning and confirming

Data Transformation: Cleaning Data

Anomaly Detection

  • E.g. a row significantly different from the rest

Column Property Enforcement

  • Null Values in required columns

  • Numeric values that fall outside of expected highs and lows

  • The length of an attribute is exceptionally short/long

  • Columns with certain values outside of discrete valid value sets

  • Adherence to a required pattern or membership of a set of patterns
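
These checks map directly onto simple rules over a DataFrame. The sketch below runs them against the Part 2 transaction dataset; the thresholds, valid value set, and expected date pattern are assumptions and may need adjusting to the actual file.

```python
import pandas as pd

df = pd.read_csv("Transaction Dataset.csv")

checks = {
    # Null values in required columns
    "missing_customer_name": df["Customer_Name"].isna(),
    # Numeric values outside expected highs and lows
    "quantity_out_of_range": ~df["Quantity"].between(1, 1000),
    # Exceptionally short or long attribute values
    "odd_name_length": ~df["Customer_Name"].astype(str).str.len().between(2, 60),
    # Values outside a discrete valid value set
    "invalid_gender": ~df["Gender"].isin(["M", "F"]),
    # Adherence to a required pattern (assumed YYYY-MM-DD dates)
    "bad_date_pattern": ~df["Date"].astype(str).str.match(r"\d{4}-\d{2}-\d{2}$"),
}
for rule, mask in checks.items():
    print(f"{rule}: {int(mask.sum())} suspect row(s)")
```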

Transformation - Confirming

Confirmation can be done in various ways:

  • Structure Enforcement

    • Tables have proper primary and foreign keys

    • Obey referential integrity

  • Data and rule value enforcement

    • Simple business rules

    • Logical data checks
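
As a hedged example of structure and rule enforcement, the snippet below anti-joins the Part 2 fact table against one of its dimensions and applies one simple business rule. The key column name Product_Key is an assumption about the sample files' layout.

```python
import pandas as pd

fact = pd.read_csv("Fact_Sales.csv")
dim_product = pd.read_csv("Dim_Product.csv")

# Structure enforcement / referential integrity: every foreign key in the fact
# table must exist as a primary key in the corresponding dimension table.
orphans = fact[~fact["Product_Key"].isin(dim_product["Product_Key"])]
print(len(orphans), "fact row(s) with no matching product dimension row")

# Data and rule value enforcement: a simple business rule.
violations = fact[fact["Quantity"] <= 0]
print(len(violations), "fact row(s) violating the rule Quantity > 0")
```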

What is data quality?

Data quality is a perception or an assessment of data’s fitness to serve its purpose in a given context.

Data quality is described by several dimensions

  • Correctness/Accuracy

  • Consistency

  • Completeness

  • Timeliness

  • Integrity/Validity

    • Integrity ensures that all data in a database can be traced and connected to other data

Data Cleansing and Matching

Data cleansing is the process of identifying and correcting dirty data. Dirty data means incomplete, wrong, duplicate, or out-of-date data. The data warehousing community often uses the words cleanse or scrub rather than clean.

Data matching determines whether one data item is the same as another data item; it is used to identify duplicate records. Matching is particularly relevant for character-based data types.

  • Numeric Data Types

  • Datetime Data Types

    • For example, is 03/01/2021 the same as 01/03/2021? Is it the same as 2021-03-01T00:00:00Z+06?

    • We need some logic here to match them, such as by comparing the components or by comparing the date, month, and year.

  • Character-based Data Types

    • For example, a customer named “Mr. Aleck Stevenson”. We need to match/recognise that “Mr. Aleck Stevenson” is the same as “Mr. S Aleck”, “Mr. Stevenson Aleck” and “Mr. Alec Stevenson”.
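
A hedged sketch of both ideas: normalise datetimes into a comparable form before testing equality, and score name similarity after a crude normalisation. Real cleansing tools are far more sophisticated; the day-first parsing and the 0.8 threshold here are assumptions.

```python
from datetime import datetime
from difflib import SequenceMatcher

# Datetime matching: reduce both representations to a comparable date.
a = datetime.strptime("01/03/2021", "%d/%m/%Y").date()          # assuming day-first
b = datetime.fromisoformat("2021-03-01T00:00:00+06:00").date()  # ISO form with offset
print("same date:", a == b)

# Character-based matching: normalise the names, then score their similarity.
def normalise(name):
    return " ".join(sorted(name.replace("Mr.", "").lower().split()))

score = SequenceMatcher(None, normalise("Mr. Aleck Stevenson"),
                        normalise("Mr. Stevenson Aleck")).ratio()
print("same customer:", score >= 0.8)
```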

Data Quality Validation

Data quality rules (data validation) are filters to prevent dirty data from getting into the warehouse. Based on the data location, there are three kinds of validation:

  • Incoming data validation

    • Checking the incoming data only on its own, without referring to the data already in the warehouse.

    • These rules verify that the data from the source systems is valid, e.g. within the expected range and in the right format

  • Cross-Reference Validation

    • Cross-reference validation is where we check the incoming data against the data in the warehouse.

    • To make sure that the value of the incoming data is within a certain range that is calculated based on data in the warehouse

    • Like incoming data validation, cross-reference validation is performed on the fly when the data is being loaded into the warehouse.

  • Data Warehouse Internal Validation

    • Data warehouse internal validation is where we check the data already in the warehouse.

      • We don't check the incoming data

    • Unlike the previous two, data warehouse internal validation is performed after the incoming data is fully loaded into the warehouse.

    • The purpose of doing this is to verify the quality of the data in the warehouse at the aggregate level.

      • This can be done by comparing the totals over a period of time against a known standard value.
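
For example (with an assumed warehouse table and an assumed benchmark), an internal validation rule might recompute an aggregate from the loaded fact table and compare it against a known standard value within a tolerance:

```python
import sqlite3

import pandas as pd

with sqlite3.connect("warehouse.db") as conn:    # hypothetical warehouse database
    result = pd.read_sql(
        "SELECT SUM(Unit_Price * Quantity - Discount) AS total_sales FROM Fact_Sales",
        conn,
    )

total_sales = result.loc[0, "total_sales"]
known_standard = 50_000.0                        # assumed benchmark for the period
if not 0.9 * known_standard <= total_sales <= 1.1 * known_standard:
    print(f"Internal validation failed: {total_sales} is outside the expected band")
```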

Part 2: ETL Practices

Consider the workflow and try to work out a solution for the scenario below.

Description

The dataset below is an original transaction dataset. The file contains 20 transactions with the following columns:

  • Transaction_ID: Unique identifier for each transaction

  • Date: Date of purchase

  • Customer_Name: Name of the customer

  • Gender: Gender of the customer (M/F)

  • Customer_Type: Type of customer (Regular/Premium)

  • Product: Name of the product purchased

  • Product_Category: Category of the product

  • Unit_Price: Price per unit of the product

  • Quantity: Number of units purchased

  • Discount: Discount amount applied to the transaction

  • City: The city where the purchase was made

  • State: State/province where the purchase was made

  • Country: The country where the purchase was made

The dimension tables and fact table below were processed from the transaction dataset above.

Task

Can you reproduce these dimension tables and the fact table from the original transaction dataset?

Task Demonstration

A demonstration video link will be posted in a subpage in Week 4.

You can use any programming language (such as Python or R) or Excel to complete the above task.
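
As a starting point only, here is one way the tables might be rebuilt with pandas. The surrogate-key scheme and the exact columns of the provided sample CSVs may differ from this sketch, so compare your output against the supplied files rather than treating this as the expected solution.

```python
import pandas as pd

df = pd.read_csv("Transaction Dataset.csv")

def build_dim(data, cols, key):
    """Deduplicate the chosen columns and assign a simple surrogate key."""
    dim = data[cols].drop_duplicates().reset_index(drop=True)
    dim.insert(0, key, range(1, len(dim) + 1))
    return dim

dim_customer = build_dim(df, ["Customer_Name", "Gender", "Customer_Type"], "Customer_Key")
dim_product = build_dim(df, ["Product", "Product_Category"], "Product_Key")
dim_location = build_dim(df, ["City", "State", "Country"], "Location_Key")
dim_date = build_dim(df, ["Date"], "Date_Key")

# Fact table: swap the descriptive columns for surrogate keys and keep the measures.
fact_sales = (
    df.merge(dim_customer).merge(dim_product).merge(dim_location).merge(dim_date)
      [["Transaction_ID", "Date_Key", "Customer_Key", "Product_Key",
        "Location_Key", "Unit_Price", "Quantity", "Discount"]]
)

for name, table in [("Dim_Customer", dim_customer), ("Dim_Date", dim_date),
                    ("Dim_Location", dim_location), ("Dim_Product", dim_product),
                    ("Fact_Sales", fact_sales)]:
    table.to_csv(f"{name}_rebuilt.csv", index=False)
```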

Attachments:

  • Transaction Dataset.csv (2KB): Transaction Dataset
  • Dim_Customer.csv (163B): Customer Dimension Table
  • Dim_Date.csv (103B): Date Dimension Table
  • Dim_Location.csv (109B): Location Dimension Table
  • Dim_Product.csv (132B): Product Dimension Table
  • Fact_Sales.csv (766B): Sales Fact Table