Week 4 - Designing a New Data Warehouse
Last updated
Last updated
This part aims to help you become familiar with PostgreSQL in creating a data warehouse. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:
how to design a data warehouse
how to create a database with dimension and fact tables
how to populate and bulk insert data into the data warehouse
There are two options for creating a database.
Option 1: Using the pgAdmin 4 GUI tool
Right-click the server name to expand the list.
Select "Create" then "Database...".
Enter your preferred database name under the 'General' tab. Click 'Save' to save the changes and close the window.
You can check the Object Explorer panel to find the created database.
Option 2: Using the pgAdmin 4 query tool
Click 'Query Tool Workspace' to open the PostgreSQL editor.
Select an existing server from the list, then click 'Connect & Open Query Tool'. Leave all other parameters at their default values.
You can use DDL commands, such as CREATE DATABASE
, to create a database in the PostgreSQL editor.
To view all databases on this server, you can use the following command:
If you don’t use double quotation marks, then PostgreSQL creates objects with lowercase names. To create, query, or manage PostgreSQL database objects with names in uppercase or mixed case, use double quotation marks.
The command below will show you which database you're currently using.
To change databases using pgAdmin 4's Query Tool:
Find and click the database dropdown menu at the top of your Query Tool window, then click '<New Connection...>'
Select your server and the database you want to use, then click 'Save'.
The next step is to create database tables. First, you should analyse the CSV file 'DimAccount.csv
' from the data folder (download the .zip file below). Based on the CSV data, we need to create a table schema that specifies:
Column names
Data types for each column
Whether NULL values are allowed for each column
It's important to examine the actual data in each column to check if it contains any NULL values before determining the NULL constraint.
Question: Can you create a table for another file?
Use the same way to create a fact table and set the foreign key constraint:
We can generate the schema for the fact and dimension tables using pgAdmin 4 GUI by following these steps:
Click on the database name to expand its contents
Expand the 'Schemas' section, then right-click on 'public'
Select 'ERD For Schema' to generate the schema
Below is the expected database schema. You can modify both the table details and their layout as needed.
Since we have not created all the dimension tables yet, we can add foreign key constraints later when we create other dimension tables.
When you add foreign key constraints, you need to create the DimDate table first.
For example:
With the tables we created, next step is inserting data into the tables. We can select an instance from the file and insert the data into the table by running INSERT INTO command. Note that you only need to type in the values that are not NULL.
Then, you can check your added record by SQL query and you would able to see the result.
As we seen in the last section, we can add single record to the table by using the INSERT
command, but if we want to add more instances to the table at the same time, or load the whole csv file into the database, using INSERT
command is not an efficient way. We can use bulk insert by running COPY
command instead.
We first delete the manually added record to avoid a primary key conflict.
Move the files into the diecrtory that PostgreSQL can access(Local).
(Windows) If PostgreSQL is running locally, locate its data directory. By default, it is C:\Program Files\PostgreSQL\<version>\data.
You can then drag the 'AvdentureWorksDW' directory into the 'data' directory.
(Mac) If PostgreSQL is running locally, locate itstmp
directory. If you didn't specify the path when you install, the default path would be /tmp
at the root directory. To move the files , run the following command in your Terminal:
Also, you can open the file in Finder and drag the downloaded file into the tmp
directory.
You should go to the root directory, and press COMMAND + SHIFT + .
to show the invisible files (Mac users).
If you cannot find the location, you can run the code in the Terminal to find the path:
If the 'command not found:psql' error occurs, you should find your psql command and add it to your PATH by the following command.
If the command returns an empty result, PostgreSQL is likely using the default location /tmp
.
Move the files into the directory that PostgreSQL can access(Docker).
If PostgreSQL is running in a Docker container. You can run the following command in the Terminal to move the files:
If you have Docker Desktop installed. You can also drag the downloaded files into the /tmp
directory by following the steps.
Now, you can find the tmp
directory. Then you can select the 'AdventureWorksDW' directory and drag it into the tmp
directory in the Docker.
To bulk load data into the DimAccount
table, run this command in your PostgreSQL query tool:
Replace <REPLACE YOUR PATH HERE>
with the actual path to your CSV file. (If you followed the previous step, the path would be /tmp/AdventureWorkDW/
for MAC and C:/AdventureWorksDW/
for Windows)
Check your results
In this task, we will learn how to connect our database to Tableau or Power BI and how to visualize our data.
If you ensure you have downloaded and installed the driver for the PostgreSQL connector, please skip this part. Since December 2019, Power BI Desktop ships with NpgSQL, and no other installation is required.
Mac: /Library/JDBC
or ~/Library/JDBC
Windows: C:\Program Files\Tableau\Drivers
After you set up the driver, restart Tableau, then select PostgreSQL in the Connect panel.
Enter the name of the server that hosts the database that you want to connect to.
Enter the name of the database.
Select how you want to sign in to the server. Specify whether to use Integrated Authentication or User Name and Password. If the server is password protected, and you aren’t in a Kerberos environment, you must enter the username and password.
Select Sign In.
In this section, you will need to visualise your data to help with analysis and gain insights. After signing in, you can select any table you have created from the left panel and click 'refresh' at the bottom. You will then be able to see the details at the bottom and the relationships between tables in the center.
Here we select a fact table and a dimension table, 'factinternetsales' and 'dimcustomer', to analyze how commute distance impacts sales.
First, open a worksheet from the bottom left (Sheet1 in the following figure).
Drag 'CommuteDistance' from the 'dimcustomer' table on the left to the 'Columns' field at the top center.
Similarly, drag 'SalesAmount' to the field.
The metric will be summed across all dimensions except 'CommuteDistance', and you can also change the aggregation method to 'Average' or 'Count'.
Next, let's analyse how territories influence sales by joining 'DimSalesTerritory' and 'FactInternetSales'. Following the same process discussed earlier, drag 'SalesTerritoryCountry' to the 'Columns' field. You can use the sort button in the top toolbar to order the data in descending order.
Click the right top 'Show Me' button, you will see the options of different figure types. Let's show it as a map.
Then you can open PowerBI and connect it to PostgreSQL database by the following steps. First, you can click 'Get Data'. button and select 'More' to find PostgreSQL.
Click 'PostgreSQL' to proceed. Then enter the name of server and the port number, the format is localhost:5432
and database and click 'OK'.
In the next page, you can enter your username and password to finish the authentication.
When the database is imported to PowerBI, you can find 'Data' section on the right side, and you can choose the tables and columns you need to visualize. In this lab, we select 'Addressline1' and 'YearlyIncome' from 'dimcustomer' table.
At the Visualizations section, we can choose the figure we will use. In this lab, we choose 'Map' to visualize the address
Note: for map visualization to work, you will need to enable the Map and Filled Map visuals
through the PowerBI Menu:
File -> Options and Settings -> Options -> Global -> Security
Refresh and you will see the YearlyIncome by Addressline1 on map. Click the button on the right top side to enter the Focus Mode.
Use the CSV files in the downloaded .zip file in for bulk inserting.
If a "Permission denied" error occurs, follow these to resolve the issue
You can download pgJDBC by clicking . Then copy the .jar file to the following folder (you may need to create it manually):
tmp
directory