Project 1 - Data Warehouse Design

Aim to showcase this in your portfolio! (How to create a portfolio)

NOTE: For this project, you can work individually or in a team of a maximum of two people. CITS3401 students should form groups with other CITS3401 students. CITS5504 students should form groups with other CITS5504 students. Cross-unit grouping is only allowed on rare occasions, evaluated case by case. Project 1 contributes 25% to the total assessment of this unit. The submission deadline is April 11 at 11:59 PM on LMS.

The overall objectives of this project are to build a data warehouse using real-world datasets and to carry out a basic data mining activity, in this case, association rule mining.

Datasets and Problem Domain

Every nation in the world is working towards lowering road toll for safer transportation networks. According to the most recently available 2023 data from the European Union, unfortunately, Australia is a long way behind the top performers for road safety with an average fatality rate of 4.8 deaths per 100,000 people; while Iceland has the world's safest roads – with an average fatality rate of 2.1 deaths per 100,000 people [1].

As of July 2024, Australia's national road toll is sitting at 761 lives lost so far this year, while the 12-month total of 1327 road deaths is up 10% on the year prior [2].

According to data from the Department of Infrastructure, Transport, Regional Development and Communications, the key statistics of Australia for December 2024 are below:

In order to help the government and public understand the importance of road safety and lower traffic risks, you are required to:

Build a data warehouse to store historical data on fatal crashes.
Use the created data warehouse to present key insights with a dashboard.
Support decision-making with data mining techniques.
List and describe a few suggestions for improving road safety for the government.

Data Sources

Australian Road Deaths Database - ARDD
- ARDD: Fatal crashes—December 2024—XLSX
- ARDD: Fatalities—December 2024—XLSX
- The data dictionary for the above datasets
GeoJSON is an open standard geospatial data interchange format that represents simple geographic features and their nonspatial attributes.
- GeoJSON can be seamlessly integrated with web mapping libraries, making it effortless to display and interact with geospatial data on web applications.
- These GeoJSON files can help you visualise your query results with map charts. Using excellent and meaningful map charts in your project is one of the compulsory requirements for full marks (5 marks) for the results visualisation criteria.

The dwelling count data used in this analysis is from 2021 and was extracted from the Australian Bureau of Statistics (ABS) TableBuilder tool, which provides access to official 2021 census data.

The population data used in this analysis was downloaded from the Australian Bureau of Statistics (ABS). The ABS provides demographic data broken down by various geographic area types, including:
- Local Government Areas (LGAs)
- Urban centres and localities
- Remoteness areas
- Commonwealth electoral divisions
- State electoral divisions

Special Requirements for CITS5504 and CITS3401

For CITS3401 Students: You must use both the Fatal Crashes - December 2024 dataset and the Fatalities - December 2024 dataset for your project.
For CITS5504 Students: You must use the Fatal Crashes - December 2024 dataset and the Fatalities - December 2024 dataset. Additionally,
- you are required to utilise at least one of the following datasets: Dwelling Count Data or Population Data. You may choose to incorporate both of these additional datasets if desired.
OR
- you may choose to merge your own additional data sources with the Fatal Crashes - December 2024 and Fatalities - December 2024.
For both CITS5504 and CITS3401 students, you are encouraged to identify at least 8 dimensions in your project.

Data Warehousing Design and Implementation

Design a data warehouse for the dataset(s) by following the four steps below of dimensional modelling (i.e., Kimball's four steps).

Identify the process being modelled.
Determine the grain at which facts can be stored.
Choose the dimensions
Identify the numeric measures for the facts.

To realise the four steps, we can start by drawing and refining a StarNet with the above four questions in mind.

Think about a few business questions that your data warehouse could help answer.
Draw a StarNet to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level of information you have access to.
Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.
Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under Azure Education) or a drawing program/website of your own choice. A Hand-printed StarNet diagram is also welcome!
Implement a star or snowflake schema using PostgreSQL. For the fact table and dimension tables, clearly state which ones are measures and dimensions, and indicate the dimension references.
Use PostgreSQL to build a multi-dimensional analysis service solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design.
Use Power BI/Tableau to visualise the data returned from your business queries.

Association Rule Mining

NOTE: you are not allowed to use Microsoft Visual Studio, R, or Weka to do association rule mining in this project. Only Python code will be considered (for association rule mining) in this project.

Make sure you complete the relevant lab before attempting this task. The lab content may help you complete this task.

You are expected to select a subset (or all) of the attributes to mine interesting patterns. To rank the interestingness of the extracted rules, use support, confidence, and lift.

In the submitted PDF, you need to:

Explain and discuss which association rules mining algorithms were used with references.
Explain the top k rules (where k ≥ 1) that have "Road User" on the right-hand side, ranked by lift and confidence.
Explain the meaning of these k rules in plain English.
Share insights derived from the mining results. If no meaningful rules are discovered, explore potential reasons for this outcome.
Based on the rules, provide and explain at least THREE (3) recommendations to the government on how to improve road safety for road users.

What to Submit

The following are the files needed for Project 1 submission.

NOTE:

In-text citations and end-text references are required in this project report, following the IEEE referencing style.
Include team members' information in the report, such as team members' names and student IDs.

A comprehensive PDF report that includes the following key components:
- Design, implementation, and usage of the data warehouse: Explain and discuss the design, implementation, and usage of the data warehouse to answer your queries, such as the fact table, dimension tables, and concept hierarchies design. When designing your data warehouse, you may select any number of columns/attributes as dimensions and measures. Please justify your design by explaining your assumptions or reasoning behind your choices with references.
NOTE: The data warehouse design should demonstrate concept hierarchies and good dimension modelling. An overly simple design will result in low marks for the project.
- Schema, Starnet and query footprints: Provide and explain your StarNet diagrams and the query footprints. Discuss which schema you used and justify your choice with references.
- Data cleaning, preprocessing, and ETL process: Describe your ETL process in detail. Include descriptions of the techniques used, discuss your ETL principles, explain the reasoning behind the key steps, and provide screenshots illustrating the process flow with references.
NOTE: During the ETL process, you may remove some rows from the dataset. However, the number of dropped rows should not exceed 5% of the total dataset. Removing too many rows will result in a low mark for the ETL section.
- Visualisation of query results: Present the findings from your business queries using appropriate charts, graphs, and other visualisations. Ensure the insights are clearly communicated and easily understandable to stakeholders.
- Association rules mining: See the Association Rules Mining section above.
All scripts you used (e.g. Python, PostgreSQL) with clear and well-structured comments.
The Power BI/Tableau file (.pdf file).
All CSV files you used for building and populating your database.

NOTE: All files need to be zipped up in a single zip file and submitted to LMS (Under the Projects channel that can be found in the left navigation panel).

Marking Scheme

Scheme

Marks

Schema of each dimension and concept hierarchies for each dimension

Corresponding StarNet to illustrate query capabilities of the data warehouse

At least 5 types of business queries that the StarNet can answer

Data warehouse schema and fact table design

Data cleaning/pre-processing/ETL process for data transformation (include code, screenshots, and detailed explanations)

SQL Script file for building the database and loading the datasets

Visualise corresponding to the 5 business queries with appropriate charts

Association rule mining meaningful set-up

Interpretation of top rules and suggestions

Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution

Overall report quality

Total

Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:

5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)

4 - Proficient (correct application of the taught concepts with clear understandings demonstrated)

3 - Satisfactory (managed to meet most of the project requirements)

2 - Developing (some skills are demonstrated but need revision)

1 - Not yet Satisfactory (minimal effort)

0 - Not attempted.

Can I Use Generation AI?

Informed and educated use of generative AI tools is encouraged and should be declared. If you choose to use generative AI, please document its suggestions in your report and provide a clear rationale for accepting or rejecting each recommendation. While using generative AI is optional, as a data science professional, you are strongly encouraged to develop critical evaluation skills and build familiarity with modern tools that could enhance your productivity.

NOTE: Using generative AI without proper citation and documentation is unacceptable.

The following example is a BAD example that may risk receiving a 0 mark.

In this case, k>1 means the student should explain more than 1 top association mining rule. However, the student misunderstood the k value and the concept of lift. This is a common error when using GenAI, which sometimes generates non-existent theories - a phenomenon called 'AI hallucination'.

The explanations below are correct understandings.

Reference

[1] European Commission. "2023 figures show stalling progress in reducing road fatalities in too many countries". Mobility and Transport. https://transport.ec.europa.eu/news-events/news/2023-figures-show-stalling-progress-reducing-road-fatalities-too-many-countries-2024-03-08_en (accessed Feb. 24, 2025).

[2]S. Guthrie. "The country with the safest roads in the world". Drive. https://www-drive-com-au.webpkgcache.com/doc/-/s/www.drive.com.au/caradvice/the-country-with-the-safest-roads-in-the-world/ (accessed Feb. 24, 2025).

Last updated 3 months ago