Week 4 - Designing a New Data Warehouse Using Databricks

1. Designing a Data Warehouse and Populating Tables

This part aims to familiarise you with creating a data warehouse in Databricks. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:

  • how to design a data warehouse

  • how to create a database with dimension and fact tables

  • how to populate and bulk insert data into the data warehouse

Attachment: AdventureWorksDW CSV archive (18 MB)

1.1 Data Warehouse Creation

1.1.1 Create a Databricks Database

Navigate to + New -> Add or upload data.

Click on Upload files to a volume (the icon with a folder).

Drag or select the CSV files from the unzipped AdventureWorksDW folder.

  1. Click Create Volume.

  2. Enter a name for the volume (this is similar to naming a folder).

  3. Select workspace as the catalog.

  4. Under Schema, click Create new schema.

  5. Enter a name for the new schema and click Create.

After creating the new schema, click Create to finish creating the volume.

The database (schema) will now be created, and the files will be stored within the volume.
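If you prefer working in a notebook, the same objects can also be created with SQL. This is a sketch that assumes the schema is named week_4_lab and the volume data (the names used later in this lab); adjust them to match your own.

```sql
-- Create the schema (database) in the workspace catalog
CREATE SCHEMA IF NOT EXISTS workspace.week_4_lab;

-- Create a managed volume to hold the uploaded CSV files
CREATE VOLUME IF NOT EXISTS workspace.week_4_lab.data;
```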

1.2 Entering Data into Database

From the +New menu, create a new notebook.

As your account grows and contains multiple databases, it is important to ensure you are working in the correct one. You can run the command below to view all available databases in your workspace.
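For example, in a notebook SQL cell:

```sql
-- List all databases (schemas) available in the current catalog
SHOW DATABASES;
```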

To select the correct database, use the code below. This acts similarly to setting the correct working directory.
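For example, assuming the schema created earlier is named week_4_lab:

```sql
-- Make week_4_lab the default schema for subsequent queries
USE workspace.week_4_lab;
```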

1.3 Creating Schema for tables

1.3.1 Create dimension tables

The next step is to create the database tables. Start by analysing the CSV file DimAccount.csv from the dataset.

Based on the CSV data, you need to define a table schema that specifies:

  • Column names

  • Data types for each column

  • Whether NULL values are allowed for each column

Use the following CREATE TABLE statement as an example:
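A sketch of such a statement for DimAccount, based on the standard AdventureWorksDW column layout (verify the names and types against your CSV file before running it):

```sql
CREATE TABLE IF NOT EXISTS dimaccount (
  AccountKey                    INT NOT NULL,
  ParentAccountKey              INT,
  AccountCodeAlternateKey       INT,
  ParentAccountCodeAlternateKey INT,
  AccountDescription            STRING,
  AccountType                   STRING,
  Operator                      STRING,
  CustomMembers                 STRING,
  ValueType                     STRING,
  CustomMemberOptions           STRING,
  CONSTRAINT pk_dimaccount PRIMARY KEY (AccountKey)
);
```

The primary key constraint is informational in Databricks, but declaring it allows fact tables to reference this table later.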


1.3.2 Create fact tables

Follow the same approach to create a fact table and set its foreign key constraints.

Please note that foreign key constraints in Databricks are informational only; unlike in SQL Server, they are not enforced.

https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
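As a sketch, a factfinance table with an informational foreign key to dimaccount might look like this (column names follow the standard AdventureWorksDW FactFinance layout; foreign keys to the other dimension tables can be added once those tables exist):

```sql
CREATE TABLE IF NOT EXISTS factfinance (
  FinanceKey         INT NOT NULL,
  DateKey            INT NOT NULL,
  OrganizationKey    INT NOT NULL,
  DepartmentGroupKey INT NOT NULL,
  ScenarioKey        INT NOT NULL,
  AccountKey         INT NOT NULL,
  Amount             DOUBLE,
  Date               TIMESTAMP,
  CONSTRAINT pk_factfinance PRIMARY KEY (FinanceKey),
  CONSTRAINT fk_factfinance_account
    FOREIGN KEY (AccountKey) REFERENCES dimaccount (AccountKey)
);
```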

1.3.3 Populating the tables

We will now populate the tables using the uploaded data files. First, we need to retrieve the file path.

To find the folder path:

  1. Click Catalog

  2. Select Workspace

  3. Open week_4_lab

  4. Click data

Hover over one of the files; three vertical dots will appear. Click them and select Copy Path.

We will now read the file as a DataFrame. Use the actual path you copied and remember that the delimiter here is "|".
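A sketch in PySpark (in a Databricks notebook, spark is already defined; the path below is an example, so replace it with the path you copied):

```python
# Read the pipe-delimited CSV; the file has no header row
df = spark.read.csv(
    "/Volumes/workspace/week_4_lab/data/DimAccount.csv",  # example path
    sep="|",
    header=False,
)
```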

Since the data file does not have headers, we need to add column names that match the columns defined earlier when creating the schema:

We also need to ensure that the data type for each column matches the ones defined in the schema:
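Together, the renaming and casting steps might look like this, assuming the DimAccount columns defined earlier:

```python
from pyspark.sql.types import IntegerType

# Assign column names matching the table schema defined earlier
df = df.toDF(
    "AccountKey", "ParentAccountKey", "AccountCodeAlternateKey",
    "ParentAccountCodeAlternateKey", "AccountDescription", "AccountType",
    "Operator", "CustomMembers", "ValueType", "CustomMemberOptions",
)

# Cast the key columns to integers; the remaining columns stay as strings
for col_name in ["AccountKey", "ParentAccountKey",
                 "AccountCodeAlternateKey", "ParentAccountCodeAlternateKey"]:
    df = df.withColumn(col_name, df[col_name].cast(IntegerType()))
```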

Now we can create the table from the DataFrame:
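One way to load the DataFrame into the dimaccount table created earlier (assuming the column order matches the table definition):

```python
# Append the rows into the existing dimaccount table
df.write.mode("append").saveAsTable("dimaccount")
```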

To verify that the table has been created successfully, run a query:
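For example:

```sql
-- Preview a few rows to confirm the load succeeded
SELECT * FROM dimaccount LIMIT 10;
```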

We will now repeat the same process for the fact table.
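As a sketch, the same sequence for FactFinance (the file name and columns assume the standard AdventureWorksDW layout; adjust to match your data):

```python
from pyspark.sql.types import IntegerType, DoubleType, TimestampType

# Read the pipe-delimited fact data (example path) and name the columns
fact_df = spark.read.csv(
    "/Volumes/workspace/week_4_lab/data/FactFinance.csv",
    sep="|",
    header=False,
).toDF(
    "FinanceKey", "DateKey", "OrganizationKey", "DepartmentGroupKey",
    "ScenarioKey", "AccountKey", "Amount", "Date",
)

# Cast columns to the types declared in the factfinance table
for col_name in ["FinanceKey", "DateKey", "OrganizationKey",
                 "DepartmentGroupKey", "ScenarioKey", "AccountKey"]:
    fact_df = fact_df.withColumn(col_name, fact_df[col_name].cast(IntegerType()))
fact_df = fact_df.withColumn("Amount", fact_df["Amount"].cast(DoubleType()))
fact_df = fact_df.withColumn("Date", fact_df["Date"].cast(TimestampType()))

# Append the rows into the existing fact table
fact_df.write.mode("append").saveAsTable("factfinance")
```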

1.3.4 ERD diagram

We can visualise the relationships of the star schema by following these steps:

  1. Click Catalog

  2. Expand Workspace

  3. Expand week_4_lab

  4. Expand Tables

  5. Select factfinance

  6. Click View relationships

Since we have not yet created all the dimension tables, the remaining foreign key constraints can be added later, as each dimension table is created.


Foreign key constraints can only be added after the corresponding dimension table is created.

Try adding the DimDate dimension table:

  1. Create the DimDate dimension table (using the same pattern as DimAccount).

  2. Use the following code to add a foreign key constraint in the fact table:
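A sketch of the constraint, assuming dimdate was created with a primary key on DateKey:

```sql
ALTER TABLE factfinance
  ADD CONSTRAINT fk_factfinance_date
  FOREIGN KEY (DateKey) REFERENCES dimdate (DateKey);
```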

Now go to your ERD and check whether it matches the diagram shown below.

The schema for this data warehouse is complex and will be built step by step as you add and connect more dimension tables.


What schema type is used here?

2. Connecting to Power BI/Tableau for Visualisation

Please note that the following tables have already been loaded into the database:

  • DimCustomer

  • FactInternetSales

  • DimSalesTerritory

As an exercise, you are encouraged to try adding these tables yourself.

However, you can still connect Databricks to Power BI or Tableau even if you do not manually add these tables.

2.1 Connecting Power BI

Download Power BI Desktop from the Microsoft Store, then follow the steps below in Databricks:

  1. Marketplace

  2. Power BI Desktop Integration

After downloading the file, open it. You should see the image below.

We now need a Personal Access Token to access our database. Go back to Databricks, open Settings, select Developer, and next to Access tokens click Manage, then Generate new token.

A Generate New Token form will appear. Enter a name for the token, select its lifetime, and then click Generate. A personal access token will be created.


Be sure to copy and securely store it, as it will only be shown once and cannot be retrieved again.

After generating your personal access token, enter it in Power BI and click Connect.

Once connected, the Navigator window will appear. Locate the tables you want to import, select them, and click Load to bring the data into Power BI.

In the Visualizations section, you can choose the chart type you want to use. In this lab, select Map to visualise the addresses.

For map visualizations to work, you need to enable the Map and Filled Map visuals through the Power BI menu: File → Options and settings → Options → Global → Security.

Refresh, and you will see YearlyIncome by AddressLine1 on the map. Click the button at the top right to enter Focus mode.

2.2 Connecting Tableau

Refer to the Week 2 labsheet to install Tableau, then follow the steps below in Databricks:

  1. Marketplace

  2. Tableau Desktop

The file should download automatically. After it finishes downloading, follow the instructions and ensure that you have installed the Databricks ODBC driver.

After installing Tableau Desktop, click File → Open and select the Tableau file downloaded from Databricks.

Next, navigate to the Data Source tab (located at the bottom left). Select the appropriate Catalog and Database (Schema), then choose the tables you want to use.

Double-click the desired table or drag it into the Data Source canvas to add it to the data model.

2.3 Visualise the multidimensional solution with Tableau

In this section, you will need to visualise your data to assist with analysis and gain insights. After signing in, select any table you have created from the left panel and click Refresh at the bottom. You will then be able to view the details at the bottom and the relationships between tables in the centre.

Here, we select a fact table and a dimension table, factinternetsales and dimcustomer, to analyse how commute distance affects sales.

  1. First, open a worksheet from the bottom left (Sheet1, as shown in the figure below).

  2. Drag CommuteDistance from the dimcustomer table on the left to the Columns field at the top centre.

  3. Next, drag SalesAmount from the factinternetsales table to the Rows field.

The metric will be aggregated (summed) across all dimensions except CommuteDistance. You can also change the aggregation method to Average or Count as needed.

Next, let’s analyse how sales territories influence sales by joining DimSalesTerritory and FactInternetSales. Following the same process described earlier, drag SalesTerritoryCountry to the Columns field. You can then use the sort button in the top toolbar to order the data in descending order.

Click the Show Me button in the top‑right corner to see the available chart types. Let’s display the data as a map.
