Week 4 - Designing a New Data Warehouse Using Databricks
1. Designing a Data Warehouse and Populating Tables
This part aims to familiarise you with creating a data warehouse in Databricks. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:
how to design a data warehouse
how to create a database with dimension and fact tables
how to populate and bulk insert data into the data warehouse
1.1 Data Warehouse Creation
1.1.1 Create a Databricks Database
Navigate to + New → Add or upload data.
Click on Upload files to a volume (the icon with a folder).

Drag in or select the CSV files from the unzipped AdventureWorksDW dataset.

Click Create Volume.
Enter a name for the volume (this is similar to naming a folder).
Select workspace as the catalog.
Under Schema, click Create new schema.
Enter a name for the new schema and click Create.


After creating the new schema, click Create to finish creating the volume.
The database (schema) will now be created, and the files will be stored within the volume.
1.2 Entering Data into Database
From the +New menu, create a new notebook.
As your account grows and contains multiple databases, it is important to ensure you are working in the correct one. You can run the command below to view all available databases in your workspace.

To select the correct database, use the code below. This acts much like setting the working directory.
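For example, assuming the schema was named week_4_lab (as in the screenshots later in this lab), the two commands might look like this:

```sql
-- List all databases (schemas) available in the workspace
SHOW DATABASES;

-- Switch to the lab schema (replace week_4_lab with your schema name)
USE week_4_lab;
```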
1.3 Creating Schema for tables
1.3.1 Create dimension tables
The next step is to create the database tables. Start by analysing the CSV file DimAccount.csv from the dataset.
Based on the CSV data, you need to define a table schema that specifies:
Column names
Data types for each column
Whether NULL values are allowed for each column
Use the following CREATE TABLE statement as an example:
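A sketch of such a statement is given below. The column list follows the standard AdventureWorksDW DimAccount layout, so verify the names and types against your own CSV before running it:

```sql
-- Possible schema for DimAccount (check columns against DimAccount.csv).
-- The primary key constraint is informational in Databricks and requires
-- the key column to be declared NOT NULL.
CREATE TABLE IF NOT EXISTS DimAccount (
  AccountKey INT NOT NULL,
  ParentAccountKey INT,
  AccountCodeAlternateKey INT,
  ParentAccountCodeAlternateKey INT,
  AccountDescription STRING,
  AccountType STRING,
  Operator STRING,
  CustomMembers STRING,
  ValueType STRING,
  CustomMemberOptions STRING,
  CONSTRAINT pk_dimaccount PRIMARY KEY (AccountKey)
);
```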
Can you create a similar table for another file?
1.3.2 Create fact tables
Create a fact table in the same way and define its foreign key constraints.
Please note that foreign keys in Databricks are informational only; they are not enforced as they are in SQL Server.
https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
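A minimal sketch for FactFinance, again based on the standard AdventureWorksDW layout (verify against your CSV), with an informational foreign key referencing the DimAccount table created above:

```sql
-- FactFinance with an informational foreign key to DimAccount.
-- Constraint names are illustrative.
CREATE TABLE IF NOT EXISTS FactFinance (
  FinanceKey INT NOT NULL,
  DateKey INT,
  OrganizationKey INT,
  DepartmentGroupKey INT,
  ScenarioKey INT,
  AccountKey INT,
  Amount DOUBLE,
  Date TIMESTAMP,
  CONSTRAINT pk_factfinance PRIMARY KEY (FinanceKey),
  CONSTRAINT fk_factfinance_account
    FOREIGN KEY (AccountKey) REFERENCES DimAccount (AccountKey)
);
```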
1.3.3 Populating the tables
We will now populate the tables using the uploaded data files. First, we need to retrieve the file path.
To find the folder path:
Click Catalog
Select Workspace
Open week_4_lab
Click data

Hover over one of the files; three vertical dots will appear. Click them and select Copy Path.

We will now read the file as a DataFrame. Use the actual path you copied and remember that the delimiter here is "|".
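A sketch of this step, run in a Databricks notebook where the `spark` session is predefined. The path below assumes the volume layout from earlier in this lab (/Volumes/workspace/week_4_lab/data/); substitute the path you copied:

```python
# Read DimAccount.csv as a DataFrame.
# The file has no header row and uses "|" as the delimiter.
df = (spark.read
      .option("header", "false")
      .option("delimiter", "|")
      .csv("/Volumes/workspace/week_4_lab/data/DimAccount.csv"))

df.show(5)
```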
Since the data file does not have headers, we need to add column names that match the columns defined earlier when creating the schema:
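For DimAccount, this could look like the following; the names must match the CREATE TABLE statement used earlier, and the list assumes the standard AdventureWorksDW column order:

```python
# Assign column names matching the DimAccount table schema
column_names = [
    "AccountKey", "ParentAccountKey", "AccountCodeAlternateKey",
    "ParentAccountCodeAlternateKey", "AccountDescription", "AccountType",
    "Operator", "CustomMembers", "ValueType", "CustomMemberOptions",
]
df = df.toDF(*column_names)
```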
We also need to ensure that the data type for each column matches the ones defined in the schema:
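A sketch of the casting step for the integer key columns (the remaining columns can stay as strings, matching the STRING types in the schema):

```python
from pyspark.sql.functions import col

# Cast each key column to the type declared in the CREATE TABLE statement
int_columns = ["AccountKey", "ParentAccountKey",
               "AccountCodeAlternateKey", "ParentAccountCodeAlternateKey"]
for c in int_columns:
    df = df.withColumn(c, col(c).cast("int"))

df.printSchema()
```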
Now we can create the table from the DataFrame:
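One way to do this is to append the DataFrame into the table created earlier; `saveAsTable` with append mode assumes the table already exists with a matching schema:

```python
# Append the DataFrame rows into the DimAccount table created earlier
df.write.mode("append").saveAsTable("week_4_lab.DimAccount")
```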
To verify that the table has been created successfully, run a query:
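For example:

```sql
-- Check that the rows were loaded
SELECT * FROM DimAccount LIMIT 10;
```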

We will now repeat the same process for the fact table.

1.3.4 ERD diagram
We can visualise the relationships of the star schema by following these steps:
Click Catalog
Expand Workspace
Expand week_4_lab
Expand Tables
Select factfinance
Click View relationships


Foreign key constraints can only be added after the corresponding dimension table has been created. Since we have not yet created all the dimension tables, we will add the remaining constraints as we create them.
Try adding the DimDate dimension table:
Create the DimDate dimension table (using the same pattern as DimAccount).
Use the following code to add a foreign key constraint in the fact table:
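A sketch of the constraint, assuming DimDate has been created with DateKey declared as its primary key (the constraint name is illustrative):

```sql
-- Add an informational foreign key from FactFinance to DimDate
ALTER TABLE FactFinance
  ADD CONSTRAINT fk_factfinance_date
  FOREIGN KEY (DateKey) REFERENCES DimDate (DateKey);
```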
Now go to your ERD and check whether it matches the diagram shown below.

The schema for this data warehouse is complex and will be built step by step as you add and connect more dimension tables.

What schema type is used here?
2. Connecting to Power BI/Tableau for Visualisation
Please note that the following tables have already been loaded into the database:
DimCustomer, FactInternetSales, DimSalesTerritory
As an exercise, you are encouraged to try adding these tables yourself.
However, you can still connect Databricks to Power BI or Tableau even if you do not manually add these tables.
2.1 Connecting PowerBI
Download Power BI Desktop from the Microsoft Store and follow the steps below:
Marketplace
Power BI Desktop Integration



After downloading the file, open it. You should see the image below.

We now need a Personal Access Token to access our database. Go back to Databricks and open Settings.



A Generate New Token form will appear. Enter a name for the token, select its lifetime, and then click Generate. A personal access token will be created.
Be sure to copy and securely store it, as it will only be shown once and cannot be retrieved again.

After generating your personal access token, enter it in Power BI and click Connect.
Once connected, the Navigator window will appear. Locate the tables you want to import, select them, and click Load to bring the data into Power BI.


In the Visualizations section, you can choose the chart type you want to use. In this lab, select Map to visualise the addresses.

For map visualizations to work, you need to enable the Map and Filled Map visuals through the Power BI menu: File → Options and settings → Options → Global → Security.

Refresh, and you will see YearlyIncome by AddressLine1 on the map. Click the button in the top-right corner to enter Focus mode.


2.2 Connecting Tableau
Refer to the Week 2 lab sheet to install Tableau, then follow the steps below:
Marketplace
Tableau Desktop



The file should download automatically. Once it has finished downloading, follow the instructions and ensure that the Databricks ODBC driver is installed.

After installing Tableau Desktop, click File → Open and select the Tableau file downloaded from Databricks.
Next, navigate to the Data Source tab (located at the bottom left). Select the appropriate Catalog and Database (Schema), then choose the tables you want to use.
Double-click the desired table or drag it into the Data Source canvas to add it to the data model.

2.3 Visualise the multidimension solution with Tableau
In this section, you will need to visualise your data to assist with analysis and gain insights. After signing in, select any table you have created from the left panel and click Refresh at the bottom. You will then be able to view the details at the bottom and the relationships between tables in the centre.

Here, we select a fact table and a dimension table, factinternetsales and dimcustomer, to analyse how commute distance affects sales.
First, open a worksheet from the bottom left (Sheet1, as shown in the figure below).
Drag CommuteDistance from the dimcustomer table on the left to the Columns field at the top centre.
Next, drag SalesAmount from the factinternetsales table to the Rows field.
The metric will be aggregated (summed) across all dimensions except CommuteDistance. You can also change the aggregation method to Average or Count as needed.

Next, let’s analyse how sales territories influence sales by joining DimSalesTerritory and FactInternetSales. Following the same process described earlier, drag SalesTerritoryCountry to the Columns field. You can then use the sort button in the top toolbar to order the data in descending order.

Click the Show Me button in the top‑right corner to see the available chart types. Let’s display the data as a map.
