Project 1 - Data Warehouse Design
Last updated
Last updated
Aim to showcase this in your portfolio! ()
NOTE: For this project, you can work individually or in a team of a maximum of two people. CITS3401 students should form groups with other CITS3401 students. CITS5504 students should form groups with other CITS5504 students. Cross-unit grouping is only allowed on rare occasions, evaluated case by case. Project 1 contributes 25% to the total assessment of this unit. The submission deadline is April 11 at 11:59 PM on LMS.
The overall objectives of this project are to build a data warehouse using real-world datasets and to carry out a basic data mining activity, in this case, association rule mining.
As of July 2024, Australia's national road toll is sitting at 761 lives lost so far this year, while the 12-month total of 1327 road deaths is up 10% on the year prior [2].
According to data from the Department of Infrastructure, Transport, Regional Development and Communications, the key statistics of Australia for December 2024 are below:
In order to help the government and public understand the importance of road safety and lower traffic risks, you are required to:
Build a data warehouse to store historical data on fatal crashes.
Use the created data warehouse to present key insights with a dashboard.
Support decision-making with data mining techniques.
List and describe a few suggestions for improving road safety for the government.
GeoJSON can be seamlessly integrated with web mapping libraries, making it effortless to display and interact with geospatial data on web applications.
These GeoJSON files can help you visualise your query results with map charts. Using excellent and meaningful map charts in your project is one of the compulsory requirements for full marks (5 marks) for the results visualisation criteria.
The population data used in this analysis was downloaded from the Australian Bureau of Statistics (ABS). The ABS provides demographic data broken down by various geographic area types, including:
Local Government Areas (LGAs)
Urban centres and localities
Remoteness areas
Commonwealth electoral divisions
State electoral divisions
you are required to utilise at least one of the following datasets: Dwelling Count Data or Population Data. You may choose to incorporate both of these additional datasets if desired.
OR
For both CITS5504 and CITS3401 students, you are encouraged to identify at least 8 dimensions in your project.
Design a data warehouse for the dataset(s) by following the four steps below of dimensional modelling (i.e., Kimball's four steps).
Identify the process being modelled.
Determine the grain at which facts can be stored.
Choose the dimensions
Identify the numeric measures for the facts.
To realise the four steps, we can start by drawing and refining a StarNet with the above four questions in mind.
Think about a few business questions that your data warehouse could help answer.
Draw a StarNet to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level of information you have access to.
Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.
Implement a star or snowflake schema using PostgreSQL. For the fact table and dimension tables, clearly state which ones are measures and dimensions, and indicate the dimension references.
Use PostgreSQL to build a multi-dimensional analysis service solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design.
Use Power BI/Tableau to visualise the data returned from your business queries.
NOTE: you are not allowed to use Microsoft Visual Studio, R, or Weka to do association rule mining in this project. Only Python code will be considered (for association rule mining) in this project.
Make sure you complete the relevant lab before attempting this task. The lab content may help you complete this task.
You are expected to select a subset (or all) of the attributes to mine interesting patterns. To rank the interestingness of the extracted rules, use support, confidence, and lift.
In the submitted PDF, you need to:
Explain and discuss which association rules mining algorithms were used with references.
Explain the top k rules (where k ≥ 1) that have "Road User" on the right-hand side, ranked by lift and confidence.
Explain the meaning of these k rules in plain English.
Share insights derived from the mining results. If no meaningful rules are discovered, explore potential reasons for this outcome.
Based on the rules, provide and explain at least THREE (3) recommendations to the government on how to improve road safety for road users.
The following are the files needed for Project 1 submission.
NOTE:
Include team members' information in the report, such as team members' names and student IDs.
A comprehensive PDF report that includes the following key components:
Design, implementation, and usage of the data warehouse: Explain and discuss the design, implementation, and usage of the data warehouse to answer your queries, such as the fact table, dimension tables, and concept hierarchies design. When designing your data warehouse, you may select any number of columns/attributes as dimensions and measures. Please justify your design by explaining your assumptions or reasoning behind your choices with references.
NOTE: The data warehouse design should demonstrate concept hierarchies and good dimension modelling. An overly simple design will result in low marks for the project.
Schema, Starnet and query footprints: Provide and explain your StarNet diagrams and the query footprints. Discuss which schema you used and justify your choice with references.
Data cleaning, preprocessing, and ETL process: Describe your ETL process in detail. Include descriptions of the techniques used, discuss your ETL principles, explain the reasoning behind the key steps, and provide screenshots illustrating the process flow with references.
NOTE: During the ETL process, you may remove some rows from the dataset. However, the number of dropped rows should not exceed 5% of the total dataset. Removing too many rows will result in a low mark for the ETL section.
Visualisation of query results: Present the findings from your business queries using appropriate charts, graphs, and other visualisations. Ensure the insights are clearly communicated and easily understandable to stakeholders.
All scripts you used (e.g. Python, PostgreSQL) with clear and well-structured comments.
The Power BI/Tableau file (.pdf file).
All CSV files you used for building and populating your database.
Schema of each dimension and concept hierarchies for each dimension
5
Corresponding StarNet to illustrate query capabilities of the data warehouse
5
At least 5 types of business queries that the StarNet can answer
5
Data warehouse schema and fact table design
5
Data cleaning/pre-processing/ETL process for data transformation (include code, screenshots, and detailed explanations)
10
SQL Script file for building the database and loading the datasets
5
Visualise corresponding to the 5 business queries with appropriate charts
5
Association rule mining meaningful set-up
5
Interpretation of top rules and suggestions
5
Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution
5
Overall report quality
5
Total
60
Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:
5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)
4 - Proficient (correct application of the taught concepts with clear understandings demonstrated)
3 - Satisfactory (managed to meet most of the project requirements)
2 - Developing (some skills are demonstrated but need revision)
1 - Not yet Satisfactory (minimal effort)
0 - Not attempted.
Informed and educated use of generative AI tools is encouraged and should be declared. If you choose to use generative AI, please document its suggestions in your report and provide a clear rationale for accepting or rejecting each recommendation. While using generative AI is optional, as a data science professional, you are strongly encouraged to develop critical evaluation skills and build familiarity with modern tools that could enhance your productivity.
NOTE: Using generative AI without proper citation and documentation is unacceptable.
The explanations below are correct understandings.
Every nation in the world is working towards lowering road toll for safer transportation networks. According to the most recently available from the European Union, unfortunately, Australia is a long way behind the top performers for road safety with an average fatality rate of 4.8 deaths per 100,000 people; while Iceland has the world's safest roads – with an average fatality rate of 2.1 deaths per 100,000 people [1].
The for the above datasets
is an open standard geospatial data interchange format that represents simple geographic features and their nonspatial attributes.
The dwelling count data used in this analysis is from 2021 and was extracted from the Australian Bureau of Statistics (ABS) tool, which provides access to official 2021 census data.
For CITS3401 Students: You must use both the dataset and the dataset for your project.
For CITS5504 Students: You must use the dataset and the dataset. Additionally,
you may choose to merge your own additional data sources with the and .
Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under ) or a drawing program/ of your own choice. A Hand-printed StarNet diagram is also welcome!
In-text citations and end-text references are required in this project report, following the .
Association rules mining: See the above.
NOTE: All files need to be zipped up in a single zip file and submitted to (Under the Projects channel that can be found in the left navigation panel).
In this case, k>1
means the student should explain more than 1 top association mining rule. However, the student misunderstood the k
value and the concept of lift. This is a common error when using GenAI, which sometimes generates non-existent theories - a phenomenon called ''.
[1] European Commission. "2023 figures show stalling progress in reducing road fatalities in too many countries". Mobility and Transport. (accessed Feb. 24, 2025).
[2]S. Guthrie. "The country with the safest roads in the world". Drive. (accessed Feb. 24, 2025).