Week 4 - Designing a New Data Warehouse

1. Designing a Data Warehouse and Populating Tables

This part aims to familiarise you with using PostgreSQL to build a data warehouse. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:

  • how to design a data warehouse

  • how to create a database with dimension and fact tables

  • how to populate and bulk insert data into the data warehouse

1.1 Data Warehouse Creation

1.1.1 Create a PostgreSQL Database

There are two options for creating a database.

  • Option 1: Using the pgAdmin 4 GUI tool

    • Right-click the server name to expand the list.

    • Select "Create" then "Database...".

    Create a Database
    • Enter your preferred database name under the 'General' tab. Click 'Save' to save the changes and close the window.

    Database Name
    • You can check the Object Explorer panel to find the created database.

    Created Database
  • Option 2: Using the pgAdmin 4 query tool

    • Click 'Query Tool Workspace' to open the PostgreSQL editor.

    Open Query Tool
    • Select an existing server from the list, then click 'Connect & Open Query Tool'. Leave all other parameters at their default values.

    Connect the Server

You can use DDL commands, such as CREATE DATABASE, to create a database in the PostgreSQL editor.

CREATE DATABASE AdventureWorksDW;

To view all databases on this server, you can use the following command:

SELECT datname FROM pg_database;
Created Database

The command below will show you which database you're currently using.

SELECT current_database();

To change databases using pgAdmin 4's Query Tool:

  • Find and click the database dropdown menu at the top of your Query Tool window, then click '<New Connection...>'

Change Current Database
  • Select your server and the database you want to use, then click 'Save'.

Connect a New Database

1.1.2 Create dimension tables

The next step is to create database tables. First, you should analyse the CSV file 'DimAccount.csv' from the data folder (download the .zip file below). Based on the CSV data, we need to create a table schema that specifies:

  1. Column names

  2. Data types for each column

  3. Whether NULL values are allowed for each column

It's important to examine the actual data in each column to check if it contains any NULL values before determining the NULL constraint.

CREATE TABLE DimAccount (     
    AccountKey INT PRIMARY KEY,
    ParentAccountKey INT NULL,
    AccountCodeAlternateKey INT NULL,
    ParentAccountCodeAlternateKey INT NULL,
    AccountDescription VARCHAR(50) NULL,
    AccountType VARCHAR(50) NULL,
    Operator VARCHAR(50) NULL,
    CustomMembers VARCHAR(300) NULL,
    ValueType VARCHAR(50) NULL,
    CustomMemberOptions VARCHAR(200) NULL
);
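
Before committing to the constraints above, you can check the raw file for empty values directly in SQL. The sketch below (the /tmp/AdventureWorksDW path is an assumption borrowed from the Docker setup later in this lab) loads the file into an all-text staging table and counts missing values per column:

```sql
-- All-text staging copy of DimAccount.csv ('|'-delimited, no header)
CREATE TABLE DimAccount_staging (
    AccountKey TEXT,
    ParentAccountKey TEXT,
    AccountCodeAlternateKey TEXT,
    ParentAccountCodeAlternateKey TEXT,
    AccountDescription TEXT,
    AccountType TEXT,
    Operator TEXT,
    CustomMembers TEXT,
    ValueType TEXT,
    CustomMemberOptions TEXT
);

COPY DimAccount_staging
FROM '/tmp/AdventureWorksDW/DimAccount.csv'
WITH (FORMAT csv, DELIMITER '|', HEADER false);

-- Count missing values per column to decide the NULL constraints
SELECT
    count(*) FILTER (WHERE ParentAccountKey IS NULL OR ParentAccountKey = '') AS parentaccountkey_missing,
    count(*) FILTER (WHERE CustomMembers IS NULL OR CustomMembers = '')       AS custommembers_missing
FROM DimAccount_staging;

DROP TABLE DimAccount_staging;
```

Repeat the FILTER expression for any column you are unsure about; a count of zero suggests a NOT NULL constraint is safe.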

Question: Can you create a table for another file?
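
As a hint, some of the other files are much simpler than DimAccount. For instance, DimScenario.csv holds only a surrogate key and a scenario name, so a plausible schema (the column width of 50 is an assumption; check the file first) would be:

```sql
CREATE TABLE DimScenario (
    ScenarioKey INT PRIMARY KEY,
    ScenarioName VARCHAR(50) NULL
);
```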

1.1.3 Create fact tables

In the same way, create a fact table and set the foreign key constraint:

CREATE TABLE FactFinance (
    FinanceKey INT PRIMARY KEY,
    DateKey INTEGER NOT NULL,
    OrganizationKey INTEGER NOT NULL,
    DepartmentGroupKey INTEGER NOT NULL,
    ScenarioKey INTEGER NOT NULL,
    AccountKey INTEGER NOT NULL,
    Amount FLOAT NOT NULL,
    Date TIMESTAMP,
    FOREIGN KEY (accountkey) REFERENCES DimAccount(accountkey)
);
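
To confirm that both the primary key and the foreign key were created, you can query the information schema. Note that PostgreSQL folds unquoted identifiers to lower case, so the table name is stored as 'factfinance':

```sql
SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'factfinance'
  AND constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY');
```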

1.1.4 Create a schema

We can generate the schema for the fact and dimension tables using pgAdmin 4 GUI by following these steps:

  1. Click on the database name to expand its contents

  2. Expand the 'Schemas' section, then right-click on 'public'

  3. Select 'ERD For Schema' to generate the schema

Generate a Schema

Below is the expected database schema. You can modify both the table details and their layout as needed.

Schema

Since we have not created all the dimension tables yet, the remaining foreign key constraints can be added later, as each dimension table is created.

For example:

ALTER TABLE FactFinance 
ADD CONSTRAINT FK_FactFinance_DateKey FOREIGN KEY (DateKey) 
REFERENCES DimDate(DateKey);
A Schema

The schema for the data warehouse is complex.

Question: What is the schema type?

1.2 Populate Data

1.2.1 INSERT VALUES

With the tables created, the next step is to insert data into them. We can take a single row from the file and insert it by running the INSERT INTO command. Note that you only need to supply values for the columns that are not NULL.

INSERT INTO dimaccount (AccountKey,
    AccountCodeAlternateKey,AccountDescription,Operator,
    ValueType)
VALUES (1,1, 'Balance Sheet', '~', 'Currency');

Then you can check the added record with a SQL query and see the result.

SELECT * FROM dimaccount;
Result

1.2.2 Bulk insert

As we saw in the last section, the INSERT command adds a single record to the table. But when we want to add many rows at once, or load a whole CSV file into the database, INSERT is not efficient. Instead, we can bulk insert by running the COPY command.

We first delete the manually added record to avoid a primary key conflict.

DELETE FROM DimAccount;

Use the CSV files in the downloaded .zip file in section 1.1.2 for bulk inserting.

  • Move the files into a directory that PostgreSQL can access (local installation).

Show the Invisible Files

If the psql command is not found in your Terminal, add its directory to your PATH first:

export PATH=$PATH:<PATH TO YOUR PSQL COMMAND DIRECTORY> 
export PATH=$PATH:/Library/PostgreSQL/17/bin # this is an example

Then, to check where the server looks for temporary files, run:

psql -U postgres -c "SHOW temp_tablespaces;"

If the command returns an empty result, PostgreSQL is likely using the default location /tmp.

  • Move the files into a directory that PostgreSQL can access (Docker).

If PostgreSQL is running in a Docker container, you can run the following command in the Terminal to copy the files:

docker cp '<PATH OF YOUR DOWNLOADED FILES>/AdventureWorksDW' '<NAME OF YOUR CONTAINER>':/tmp/AdventureWorksDW

If you have Docker Desktop installed, you can also drag the downloaded files into the /tmp directory by following these steps.

View Files in Docker

Locate the tmp directory in the container's file browser, then select the 'AdventureWorksDW' directory and drag it into tmp.

tmp directory
  • To bulk load data into the DimAccount table, run this command in your PostgreSQL query tool:

COPY DimAccount 
FROM '<REPLACE YOUR PATH HERE>DimAccount.csv' 
WITH (FORMAT csv, DELIMITER '|', HEADER false);
  • Check your results

Result
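
As an aside: COPY reads the file on the server's filesystem, which is why the files had to be moved above. If the CSV files instead live only on your client machine, psql's \copy meta-command performs the same load client-side (it works in psql, but not in pgAdmin's Query Tool):

```sql
-- Run inside psql; the path is resolved on the client machine
\copy DimAccount FROM '<PATH OF YOUR DOWNLOADED FILES>/DimAccount.csv' WITH (FORMAT csv, DELIMITER '|', HEADER false)
```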

2. Connecting to Power BI/Tableau for Visualization

In this task, we will learn how to connect our database to Tableau or Power BI and how to visualize our data.

2.1 Download Driver

You can download the PostgreSQL JDBC driver (pgJDBC) from the pgJDBC website. Then copy the .jar file to the following folder (you may need to create it manually):

Mac: /Library/JDBC or ~/Library/JDBC

Windows: C:\Program Files\Tableau\Drivers

2.2 Connecting Tableau to PostgreSQL

  • After you set up the driver, restart Tableau, then select PostgreSQL in the Connect panel.

Connect Tableau to PostgreSQL
  1. Enter the name of the server that hosts the database that you want to connect to.

  2. Enter the name of the database.

  3. Select how you want to sign in to the server. Specify whether to use Integrated Authentication or User Name and Password. If the server is password protected, and you aren’t in a Kerberos environment, you must enter the username and password.

  4. Select Sign In.

Connection Details

2.3 Visualise the multidimension solution with Tableau

In this section, you will need to visualise your data to help with analysis and gain insights. After signing in, you can select any table you have created from the left panel and click 'refresh' at the bottom. You will then be able to see the details at the bottom and the relationships between tables in the center.

Data Visualisation

Here we select a fact table and a dimension table, 'factinternetsales' and 'dimcustomer', to analyze how commute distance impacts sales.

  • First, open a worksheet from the bottom left (Sheet1 in the following figure).

  • Drag 'CommuteDistance' from the 'dimcustomer' table on the left to the 'Columns' field at the top center.

  • Similarly, drag 'SalesAmount' to the field.

  • The metric will be summed across all dimensions except 'CommuteDistance', and you can also change the aggregation method to 'Average' or 'Count'.

Data Visualisation
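
You can cross-check the aggregation Tableau computes with a SQL query. This is a sketch: the join on CustomerKey matches the AdventureWorksDW sample schema, but verify the key names in your own tables.

```sql
SELECT c.CommuteDistance, SUM(f.SalesAmount) AS total_sales
FROM factinternetsales f
JOIN dimcustomer c ON f.CustomerKey = c.CustomerKey
GROUP BY c.CommuteDistance
ORDER BY total_sales DESC;
```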

Next, let's analyse how territories influence sales by joining 'DimSalesTerritory' and 'FactInternetSales'. Following the same process discussed earlier, drag 'SalesTerritoryCountry' to the 'Columns' field. You can use the sort button in the top toolbar to order the data in descending order.

Data Visualisation
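
Again, the same sorted result can be reproduced in SQL (a sketch; the SalesTerritoryKey join column is assumed from the sample schema):

```sql
SELECT t.SalesTerritoryCountry, SUM(f.SalesAmount) AS total_sales
FROM factinternetsales f
JOIN dimsalesterritory t ON f.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY t.SalesTerritoryCountry
ORDER BY total_sales DESC;
```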

Click the 'Show Me' button at the top right to see the available chart types. Let's show the data as a map.

Data Visualisation

2.4 Connecting Power BI to PostgreSQL

Next, open Power BI and connect it to the PostgreSQL database with the following steps. First, click the 'Get Data' button and select 'More' to find PostgreSQL.

Connect Power BI to PostgreSQL
Data Source

Click 'PostgreSQL' to proceed. Then enter the server name and port number (in the format localhost:5432) and the database name, and click 'OK'.

On the next page, enter your username and password to complete the authentication.

Connection Details

2.5 Visualise the multidimension solution with Power BI

Once the database is imported into Power BI, you will find the 'Data' section on the right side, where you can choose the tables and columns you need to visualise. In this lab, we select 'AddressLine1' and 'YearlyIncome' from the 'dimcustomer' table.

Data Visualisation

In the Visualizations section, we can choose the chart type to use. In this lab, we choose 'Map' to visualise the addresses.

Data Visualisation

Note: for map visualisation to work, you will need to enable the Map and Filled Map visuals through the Power BI menu:

File -> Options and Settings -> Options -> Global -> Security

Security Tips

Refresh, and you will see YearlyIncome by AddressLine1 on the map. Click the button at the top right to enter Focus Mode.

Data Visualisation
