Week 4 - Designing a New Data Warehouse Using Databricks

1. Designing a Data Warehouse and Populating Tables

This part aims to familiarise you with creating a data warehouse in Databricks. In this lab, we will use the sample data warehouse AdventureWorksDW to learn:

  • how to design a data warehouse

  • how to create a database with dimension and fact tables

  • how to populate and bulk insert data into the data warehouse

Attachment: AdventureWorksDW CSV archive (18 MB)

1.1 Data Warehouse Creation

1.1.1 Create a Databricks Database

Navigate to + New -> Add or upload data.

Click on Upload files to a volume (the icon with a folder).

Drag or select the CSV files from the unzipped AdventureWorksDW folder.

  1. Click Create Volume.

  2. Enter a name for the volume (this is similar to naming a folder).

  3. Select workspace as the catalog.

  4. Under Schema, click Create new schema.

  5. Enter a name for the new schema and click Create.

After creating the new schema, click Create to finish creating the volume.

The database (schema) will now be created, and the files will be stored within the volume.
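If you prefer working in a notebook, the same objects can also be created with SQL. This is a sketch that assumes the schema is named week_4_lab and the volume data (the names used later in this lab); adjust them to match your own.

```sql
-- Create the schema (database) in the workspace catalog
CREATE SCHEMA IF NOT EXISTS workspace.week_4_lab;

-- Create a managed volume to hold the uploaded CSV files
CREATE VOLUME IF NOT EXISTS workspace.week_4_lab.data;
```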

1.2 Entering Data into Database

From the +New menu, create a new notebook.

As your account grows and contains multiple databases, it is important to ensure you are working in the correct one. You can run the command below to view all available databases in your workspace.
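For example, in a notebook SQL cell:

```sql
-- List all databases (schemas) available in the current catalog
SHOW DATABASES;
```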

To select the correct database, use the code below. This acts similarly to setting the correct working directory.
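For example, assuming the schema created earlier is named week_4_lab:

```sql
-- Make week_4_lab the default schema for subsequent queries
USE workspace.week_4_lab;
```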

1.3 Creating Schema for tables

1.3.1 Create dimension tables

The next step is to create the database tables. Start by analysing the CSV file DimAccount.csv from the dataset.

Based on the CSV data, you need to define a table schema that specifies:

  • Column names

  • Data types for each column

  • Whether NULL values are allowed for each column

Use the following CREATE TABLE statement as an example:
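A sketch of such a statement for DimAccount, based on the standard AdventureWorksDW column layout (verify the names and types against your CSV file before running it):

```sql
CREATE TABLE IF NOT EXISTS dimaccount (
  AccountKey                    INT NOT NULL,
  ParentAccountKey              INT,
  AccountCodeAlternateKey       INT,
  ParentAccountCodeAlternateKey INT,
  AccountDescription            STRING,
  AccountType                   STRING,
  Operator                      STRING,
  CustomMembers                 STRING,
  ValueType                     STRING,
  CustomMemberOptions           STRING,
  CONSTRAINT pk_dimaccount PRIMARY KEY (AccountKey)
);
```

The primary key constraint is informational in Databricks, but declaring it allows fact tables to reference this table later.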


1.3.2 Create fact tables

Follow the same approach to create a fact table and set its foreign key constraints.

Please note that foreign key constraints in Databricks are informational only; unlike in SQL Server, they are not enforced.

https://docs.databricks.com/aws/en/tables/constraints#declare-primary-key-and-foreign-key-relationships
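As a sketch, a factfinance table with an informational foreign key to dimaccount might look like this (column names follow the standard AdventureWorksDW FactFinance layout; foreign keys to the other dimension tables can be added once those tables exist):

```sql
CREATE TABLE IF NOT EXISTS factfinance (
  FinanceKey         INT NOT NULL,
  DateKey            INT NOT NULL,
  OrganizationKey    INT NOT NULL,
  DepartmentGroupKey INT NOT NULL,
  ScenarioKey        INT NOT NULL,
  AccountKey         INT NOT NULL,
  Amount             DOUBLE,
  Date               TIMESTAMP,
  CONSTRAINT pk_factfinance PRIMARY KEY (FinanceKey),
  CONSTRAINT fk_factfinance_account
    FOREIGN KEY (AccountKey) REFERENCES dimaccount (AccountKey)
);
```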

1.3.3 Populating the tables

We will now populate the tables using the uploaded data files. First, we need to retrieve the file path.

To find the folder path:

  1. Click Catalog

  2. Select Workspace

  3. Open week_4_lab

  4. Click data

Hover over one of the files; three vertical dots will appear. Click them and select Copy Path.

We will now read the file as a DataFrame. Use the actual path you copied and remember that the delimiter here is "|".
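A sketch in PySpark (in a Databricks notebook, spark is already defined; the path below is an example, so replace it with the path you copied):

```python
# Read the pipe-delimited CSV; the file has no header row
df = spark.read.csv(
    "/Volumes/workspace/week_4_lab/data/DimAccount.csv",  # example path
    sep="|",
    header=False,
)
```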

Since the data file does not have headers, we need to add column names that match the columns defined earlier when creating the schema:

We also need to ensure that the data type for each column matches the ones defined in the schema:
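Together, the renaming and casting steps might look like this, assuming the DimAccount columns defined earlier:

```python
from pyspark.sql.types import IntegerType

# Assign column names matching the table schema defined earlier
df = df.toDF(
    "AccountKey", "ParentAccountKey", "AccountCodeAlternateKey",
    "ParentAccountCodeAlternateKey", "AccountDescription", "AccountType",
    "Operator", "CustomMembers", "ValueType", "CustomMemberOptions",
)

# Cast the key columns to integers; the remaining columns stay as strings
for col_name in ["AccountKey", "ParentAccountKey",
                 "AccountCodeAlternateKey", "ParentAccountCodeAlternateKey"]:
    df = df.withColumn(col_name, df[col_name].cast(IntegerType()))
```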

Now we can create the table from the DataFrame:
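One way to load the DataFrame into the dimaccount table created earlier (assuming the column order matches the table definition):

```python
# Append the rows into the existing dimaccount table
df.write.mode("append").saveAsTable("dimaccount")
```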

To verify that the table has been created successfully, run a query:
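For example:

```sql
-- Preview a few rows to confirm the load succeeded
SELECT * FROM dimaccount LIMIT 10;
```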

We will now repeat the same process for the fact table.
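As a sketch, the same sequence for FactFinance (the file name and columns assume the standard AdventureWorksDW layout; adjust to match your data):

```python
from pyspark.sql.types import IntegerType, DoubleType, TimestampType

# Read the pipe-delimited fact data (example path) and name the columns
fact_df = spark.read.csv(
    "/Volumes/workspace/week_4_lab/data/FactFinance.csv",
    sep="|",
    header=False,
).toDF(
    "FinanceKey", "DateKey", "OrganizationKey", "DepartmentGroupKey",
    "ScenarioKey", "AccountKey", "Amount", "Date",
)

# Cast columns to the types declared in the factfinance table
for col_name in ["FinanceKey", "DateKey", "OrganizationKey",
                 "DepartmentGroupKey", "ScenarioKey", "AccountKey"]:
    fact_df = fact_df.withColumn(col_name, fact_df[col_name].cast(IntegerType()))
fact_df = fact_df.withColumn("Amount", fact_df["Amount"].cast(DoubleType()))
fact_df = fact_df.withColumn("Date", fact_df["Date"].cast(TimestampType()))

# Append the rows into the existing fact table
fact_df.write.mode("append").saveAsTable("factfinance")
```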

1.3.4 ERD diagram

We can visualise the relationships of the star schema by following these steps:

  1. Click Catalog

  2. Expand Workspace

  3. Expand week_4_lab

  4. Expand Tables

  5. Select factfinance

  6. Click View relationships

Since we have not yet created all the dimension tables, the remaining foreign key constraints can be added later, as each dimension table is created.


Foreign key constraints can only be added after the corresponding dimension table is created.

Try adding the DimDate dimension table:

  1. Create the DimDate dimension table (using the same pattern as DimAccount).

  2. Use the following code to add a foreign key constraint in the fact table:
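A sketch of the constraint, assuming dimdate was created with a primary key on DateKey:

```sql
ALTER TABLE factfinance
  ADD CONSTRAINT fk_factfinance_date
  FOREIGN KEY (DateKey) REFERENCES dimdate (DateKey);
```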

Now go to your ERD and check whether it matches the diagram shown below.

The schema for this data warehouse is complex and will be built step by step as you add and connect more dimension tables.


What schema type is used here?

2. Connecting to Power BI/Tableau for Visualisation

Please note that the following tables have already been loaded into the database:

  • DimCustomer

  • FactInternetSales

  • DimSalesTerritory

As an exercise, you are encouraged to try adding these tables yourself.

However, you can still connect Databricks to Power BI or Tableau even if you do not manually add these tables.

2.1 Connecting Power BI

Download Power BI Desktop from the Microsoft Store, then follow the steps below in Databricks:

  1. Marketplace

  2. Power BI Desktop Integration

After downloading the file, open it. You should see the image below.

We now need a Personal Access Token to access our database. Go back to Databricks, open Settings, select Developer, and next to Access tokens click Manage, then Generate new token.

A Generate New Token form will appear. Enter a name for the token, select its lifetime, and then click Generate. A personal access token will be created.


Be sure to copy and securely store it, as it will only be shown once and cannot be retrieved again.

After generating your personal access token, enter it in Power BI and click Connect.

Once connected, the Navigator window will appear. Locate the tables you want to import, select them, and click Load to bring the data into Power BI.

In the Visualizations section, you can choose the chart type you want to use. In this lab, select Map to visualise the addresses.

For map visualizations to work, you need to enable the Map and Filled Map visuals through the Power BI menu: File → Options and settings → Options → Global → Security.

Refresh, and you will see YearlyIncome by AddressLine1 on the map. Click the button at the top right to enter Focus mode.

2.2 Connecting Tableau

Refer to the Week 2 labsheet to install Tableau, then follow the steps below in Databricks:

  1. Marketplace

  2. Tableau Desktop

The file should download automatically. After it finishes downloading, follow the instructions and ensure that you have installed the Databricks ODBC driver.

After installing Tableau Desktop, click File → Open and select the Tableau file downloaded from Databricks.

Next, navigate to the Data Source tab (located at the bottom left). Select the appropriate Catalog and Database (Schema), then choose the tables you want to use.

Double-click the desired table or drag it into the Data Source canvas to add it to the data model.

2.3 Visualise the multidimensional solution with Tableau

In this section, you will need to visualise your data to assist with analysis and gain insights. After signing in, select any table you have created from the left panel and click Refresh at the bottom. You will then be able to view the details at the bottom and the relationships between tables in the centre.

Here, we select a fact table and a dimension table, factinternetsales and dimcustomer, to analyse how commute distance affects sales.

  1. First, open a worksheet from the bottom left (Sheet1, as shown in the figure below).

  2. Drag CommuteDistance from the dimcustomer table on the left to the Columns field at the top centre.

  3. Next, drag SalesAmount from the factinternetsales table to the Rows field.

The metric will be aggregated (summed) across all dimensions except CommuteDistance. You can also change the aggregation method to Average or Count as needed.

Next, let’s analyse how sales territories influence sales by joining DimSalesTerritory and FactInternetSales. Following the same process described earlier, drag SalesTerritoryCountry to the Columns field. You can then use the sort button in the top toolbar to order the data in descending order.

Click the Show Me button in the top‑right corner to see the available chart types. Let’s display the data as a map.
