Week 4 - Designing a New Data Warehouse

1. Designing a Data Warehouse and Populating Tables

This part aims to familiarise you with using PostgreSQL to build a data warehouse. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:

  • how to design a data warehouse

  • how to create a database with dimension and fact tables

  • how to populate and bulk insert data into the data warehouse

1.1 Data Warehouse Creation

1.1.1 Create a PostgreSQL Database

There are two options for creating a database.

  • Option 1: Using the pgAdmin 4 GUI tool

    • Right-click the server name to expand the list.

    • Select "Create" then "Database...".

    Create a Database
    • Enter your preferred database name under the 'General' tab. Click 'Save' to save the changes and close the window.

    Database Name
    • You can check the Object Explorer panel to find the created database.

    Created Database
  • Option 2: Using the pgAdmin 4 query tool

    • Click 'Query Tool Workspace' to open the PostgreSQL editor.

    Open Query Tool
    • Select an existing server from the list, then click 'Connect & Open Query Tool'. Leave all other parameters at their default values.

    Connect the Server

You can use DDL commands, such as CREATE DATABASE, to create a database in the PostgreSQL editor.

CREATE DATABASE AdventureWorksDW;

To view all databases on this server, you can use the following command:

SELECT datname FROM pg_database;
Created Database

The command below will show you which database you're currently using.

SELECT current_database();

To change databases using pgAdmin 4's Query Tool:

  • Find and click the database dropdown menu at the top of your Query Tool window, then click '<New Connection...>'

Change Current Database
  • Select your server and the database you want to use, then click 'Save'.

Connect a New Database

1.1.2 Create dimension tables

The next step is to create database tables. First, you should analyse the CSV file 'DimAccount.csv' from the data folder (download the .zip file below). Based on the CSV data, we need to create a table schema that specifies:

  1. Column names

  2. Data types for each column

  3. Whether NULL values are allowed for each column

It's important to examine the actual data in each column to check if it contains any NULL values before determining the NULL constraint.

CREATE TABLE DimAccount (     
    AccountKey INT PRIMARY KEY,
    ParentAccountKey INT NULL,
    AccountCodeAlternateKey INT NULL,
    ParentAccountCodeAlternateKey INT NULL,
    AccountDescription VARCHAR(50) NULL,
    AccountType VARCHAR(50) NULL,
    Operator VARCHAR(50) NULL,
    CustomMembers VARCHAR(300) NULL,
    ValueType VARCHAR(50) NULL,
    CustomMemberOptions VARCHAR(200) NULL
);
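
Before committing to the constraints above, you can check the raw file for empty values directly in SQL. The sketch below (the /tmp/AdventureWorksDW path is an assumption borrowed from the Docker setup later in this lab) loads the file into an all-text staging table and counts missing values per column:

```sql
-- All-text staging copy of DimAccount.csv ('|'-delimited, no header)
CREATE TABLE DimAccount_staging (
    AccountKey TEXT,
    ParentAccountKey TEXT,
    AccountCodeAlternateKey TEXT,
    ParentAccountCodeAlternateKey TEXT,
    AccountDescription TEXT,
    AccountType TEXT,
    Operator TEXT,
    CustomMembers TEXT,
    ValueType TEXT,
    CustomMemberOptions TEXT
);

COPY DimAccount_staging
FROM '/tmp/AdventureWorksDW/DimAccount.csv'
WITH (FORMAT csv, DELIMITER '|', HEADER false);

-- Count missing values per column to decide the NULL constraints
SELECT
    count(*) FILTER (WHERE ParentAccountKey IS NULL OR ParentAccountKey = '') AS parentaccountkey_missing,
    count(*) FILTER (WHERE CustomMembers IS NULL OR CustomMembers = '')       AS custommembers_missing
FROM DimAccount_staging;

DROP TABLE DimAccount_staging;
```

Repeat the FILTER expression for any column you are unsure about; a count of zero suggests a NOT NULL constraint is safe.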

Question: Can you create a table for another file?
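
As a hint, some of the other files are much simpler than DimAccount. For instance, DimScenario.csv holds only a surrogate key and a scenario name, so a plausible schema (the column width of 50 is an assumption; check the file first) would be:

```sql
CREATE TABLE DimScenario (
    ScenarioKey INT PRIMARY KEY,
    ScenarioName VARCHAR(50) NULL
);
```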

1.1.3 Create fact tables

In the same way, create a fact table and set the foreign key constraint:

CREATE TABLE FactFinance (
    FinanceKey INT PRIMARY KEY,
    DateKey INTEGER NOT NULL,
    OrganizationKey INTEGER NOT NULL,
    DepartmentGroupKey INTEGER NOT NULL,
    ScenarioKey INTEGER NOT NULL,
    AccountKey INTEGER NOT NULL,
    Amount FLOAT NOT NULL,
    Date TIMESTAMP,
    FOREIGN KEY (accountkey) REFERENCES DimAccount(accountkey)
);
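
To confirm that both the primary key and the foreign key were created, you can query the information schema. Note that PostgreSQL folds unquoted identifiers to lower case, so the table name is stored as 'factfinance':

```sql
SELECT constraint_name, constraint_type
FROM information_schema.table_constraints
WHERE table_name = 'factfinance'
  AND constraint_type IN ('PRIMARY KEY', 'FOREIGN KEY');
```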

1.1.4 Create a schema

We can generate the schema for the fact and dimension tables using pgAdmin 4 GUI by following these steps:

  1. Click on the database name to expand its contents

  2. Expand the 'Schemas' section, then right-click on 'public'

  3. Select 'ERD For Schema' to generate the schema

Generate a Schema

Below is the expected database schema. You can modify both the table details and their layout as needed.

Schema

Since we have not created all the dimension tables yet, the remaining foreign key constraints can be added later, as each dimension table is created.

For example:

ALTER TABLE FactFinance 
ADD CONSTRAINT FK_FactFinance_DateKey FOREIGN KEY (DateKey) 
REFERENCES DimDate(DateKey);
A Schema

The schema for the data warehouse is complex.

Question: What is the schema type?

1.2 Populate Data

1.2.1 INSERT VALUES

With the tables created, the next step is to insert data into them. We can take a single row from the file and insert it by running the INSERT INTO command. Note that you only need to supply values for the columns that are not NULL.

INSERT INTO dimaccount (AccountKey,
    AccountCodeAlternateKey,AccountDescription,Operator,
    ValueType)
VALUES (1,1, 'Balance Sheet', '~', 'Currency');

Then you can check the added record with a SQL query and see the result.

SELECT * FROM dimaccount;
Result

1.2.2 Bulk insert

As we saw in the last section, the INSERT command adds a single record to the table. But when we want to add many rows at once, or load a whole CSV file into the database, INSERT is not efficient. Instead, we can bulk insert by running the COPY command.

We first delete the manually added record to avoid a primary key conflict.

DELETE FROM DimAccount;

Use the CSV files in the downloaded .zip file in section 1.1.2 for bulk inserting.

  • Move the files into a directory that PostgreSQL can access (local installation).

Show the Invisible Files

If the psql command is not found in your Terminal, add its directory to your PATH first:

export PATH=$PATH:<PATH TO YOUR PSQL COMMAND DIRECTORY> 
export PATH=$PATH:/Library/PostgreSQL/17/bin # this is an example

Then, to check where the server looks for temporary files, run:

psql -U postgres -c "SHOW temp_tablespaces;"

If the command returns an empty result, PostgreSQL is likely using the default location /tmp.

  • Move the files into a directory that PostgreSQL can access (Docker).

If PostgreSQL is running in a Docker container, you can run the following command in the Terminal to copy the files:

docker cp '<PATH OF YOUR DOWNLOADED FILES>/AdventureWorksDW' '<NAME OF YOUR CONTAINER>':/tmp/AdventureWorksDW

If you have Docker Desktop installed, you can also drag the downloaded files into the /tmp directory by following these steps.

View Files in Docker

Locate the tmp directory in the container's file browser, then select the 'AdventureWorksDW' directory and drag it into tmp.

tmp directory
  • To bulk load data into the DimAccount table, run this command in your PostgreSQL query tool:

COPY DimAccount 
FROM '<REPLACE YOUR PATH HERE>DimAccount.csv' 
WITH (FORMAT csv, DELIMITER '|', HEADER false);
  • Check your results

Result
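
As an aside: COPY reads the file on the server's filesystem, which is why the files had to be moved above. If the CSV files instead live only on your client machine, psql's \copy meta-command performs the same load client-side (it works in psql, but not in pgAdmin's Query Tool):

```sql
-- Run inside psql; the path is resolved on the client machine
\copy DimAccount FROM '<PATH OF YOUR DOWNLOADED FILES>/DimAccount.csv' WITH (FORMAT csv, DELIMITER '|', HEADER false)
```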

2. Connecting to Power BI/Tableau for Visualization

In this task, we will learn how to connect our database to Tableau or Power BI and how to visualize our data.

2.1 Download Driver

You can download the PostgreSQL JDBC driver (pgJDBC) from the pgJDBC website. Then copy the .jar file to the following folder (you may need to create it manually):

Mac: /Library/JDBC or ~/Library/JDBC

Windows: C:\Program Files\Tableau\Drivers

2.2 Connecting Tableau to PostgreSQL

  • After you set up the driver, restart Tableau, then select PostgreSQL in the Connect panel.

Connect Tableau to PostgreSQL
  1. Enter the name of the server that hosts the database that you want to connect to.

  2. Enter the name of the database.

  3. Select how you want to sign in to the server. Specify whether to use Integrated Authentication or User Name and Password. If the server is password protected, and you aren’t in a Kerberos environment, you must enter the username and password.

  4. Select Sign In.

Connection Details

2.3 Visualise the multidimension solution with Tableau

In this section, you will need to visualise your data to help with analysis and gain insights. After signing in, you can select any table you have created from the left panel and click 'refresh' at the bottom. You will then be able to see the details at the bottom and the relationships between tables in the center.

Data Visualisation

Here we select a fact table and a dimension table, 'factinternetsales' and 'dimcustomer', to analyze how commute distance impacts sales.

  • First, open a worksheet from the bottom left (Sheet1 in the following figure).

  • Drag 'CommuteDistance' from the 'dimcustomer' table on the left to the 'Columns' field at the top center.

  • Similarly, drag 'SalesAmount' to the field.

  • The metric will be summed across all dimensions except 'CommuteDistance', and you can also change the aggregation method to 'Average' or 'Count'.

Data Visualisation
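
You can cross-check the aggregation Tableau computes with a SQL query. This is a sketch: the join on CustomerKey matches the AdventureWorksDW sample schema, but verify the key names in your own tables.

```sql
SELECT c.CommuteDistance, SUM(f.SalesAmount) AS total_sales
FROM factinternetsales f
JOIN dimcustomer c ON f.CustomerKey = c.CustomerKey
GROUP BY c.CommuteDistance
ORDER BY total_sales DESC;
```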

Next, let's analyse how territories influence sales by joining 'DimSalesTerritory' and 'FactInternetSales'. Following the same process discussed earlier, drag 'SalesTerritoryCountry' to the 'Columns' field. You can use the sort button in the top toolbar to order the data in descending order.

Data Visualisation
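
Again, the same sorted result can be reproduced in SQL (a sketch; the SalesTerritoryKey join column is assumed from the sample schema):

```sql
SELECT t.SalesTerritoryCountry, SUM(f.SalesAmount) AS total_sales
FROM factinternetsales f
JOIN dimsalesterritory t ON f.SalesTerritoryKey = t.SalesTerritoryKey
GROUP BY t.SalesTerritoryCountry
ORDER BY total_sales DESC;
```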

Click the 'Show Me' button at the top right to see the available chart types. Let's show the data as a map.

Data Visualisation

2.4 Connecting Power BI to PostgreSQL

Next, open Power BI and connect it to the PostgreSQL database with the following steps. First, click the 'Get Data' button and select 'More' to find PostgreSQL.

Connect Power BI to PostgreSQL
Data Source

Click 'PostgreSQL' to proceed. Then enter the server name and port number (in the format localhost:5432) and the database name, and click 'OK'.

On the next page, enter your username and password to complete the authentication.

Connection Details

2.5 Visualise the multidimension solution with Power BI

Once the database is imported into Power BI, you will find the 'Data' section on the right side, where you can choose the tables and columns you need to visualise. In this lab, we select 'AddressLine1' and 'YearlyIncome' from the 'dimcustomer' table.

Data Visualisation

In the Visualizations section, we can choose the chart type to use. In this lab, we choose 'Map' to visualise the addresses.

Data Visualisation

Note: for map visualisation to work, you will need to enable the Map and Filled Map visuals through the Power BI menu:

File -> Options and Settings -> Options -> Global -> Security

Security Tips

Refresh, and you will see YearlyIncome by AddressLine1 on the map. Click the button at the top right to enter Focus Mode.

Data Visualisation
