Project 1 - Data Warehouse Design
Aim to showcase this in your portfolio! (How to create a portfolio)
NOTE: For this project, you can work individually or in a team of a maximum of two people. CITS3401 students should form groups with other CITS3401 students. CITS5504 students should form groups with other CITS5504 students. Cross-unit grouping is only allowed on rare occasions, evaluated case by case. Project 1 contributes 25% to the total assessment of this unit. The submission deadline is 17 April at 11:59 PM on LMS.
The overall objectives of this project are to build a data warehouse using real-world datasets and to carry out a basic data mining activity, in this case, association rule mining.
Datasets and Problem Domain
According to IATA's analysis, the worldwide aviation industry experienced moderate growth in 2025. Passenger demand, measured by Revenue Passenger Kilometres, grew 5.3% compared to the previous year, slightly outpacing the 5.2% increase in airline capacity (Available Seat Kilometres). This imbalance drove the Passenger Load Factor to an unprecedented 83.6%, marking the highest annual utilisation rate ever recorded.
The growth trajectory showed divergence between market segments. International routes demonstrated robust expansion at 7.1% year-on-year, whilst domestic markets grew at a more modest 2.4% rate.
Regional Performance Highlights
Asia Pacific Region Carriers in the Asia Pacific region achieved the second-highest growth rate globally, with passenger traffic expanding 7.8% year-on-year.
Market-Specific Trends
Australia: The Australian domestic aviation market showed growth of 2.9% in passenger traffic (note: an earlier figure in the report cited 2.2% growth). However, airlines aggressively expanded capacity, resulting in a utilisation decline of 2.8 percentage points, bringing the load factor down to 82.2%.
The contrasting approaches between China and Australia highlight different strategic responses to post-pandemic recovery, with China prioritising load factors whilst Australia emphasised capacity restoration.
Assuming you are a business analyst, use the given data sources within a data warehouse to deliver insights to stakeholders, including governments, airline companies, or passengers.
Data Sources
The dataset is from Kaggle.
Data dictionary
Passenger ID - Unique identifier for each passenger
First Name - First name of the passenger
Last Name - Last name of the passenger
Gender - Gender of the passenger
Age - Age of the passenger
Nationality - Nationality of the passenger
Airport Name - Name of the airport where the passenger boarded
Airport Country Code - Country code of the airport's location
Country Name - Name of the country the airport is located in
Airport Continent - Continent where the airport is situated
Continents - Continents involved in the flight route
Departure Date - Date when the flight departed
Arrival Airport - Destination airport of the flight
Pilot Name - Name of the pilot operating the flight
Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)
These four datasets are sourced from OurAirports. Visit the website for more details.
Special Requirements for CITS5504 and CITS3401
For CITS3401 students, you can use the Airline Dataset only. You are encouraged to use one or more of the additional provided datasets.
For CITS5504 students, you must use the Airline Dataset along with more than one of the additional provided datasets.
All students are encouraged to use external datasets associated with the Airline Dataset.
Data Warehousing Design and Implementation
Design a data warehouse for the dataset(s) by following the four steps below of dimensional modelling (i.e., Kimball's four steps).
Identify the process being modelled.
Determine the grain at which facts can be stored.
Choose the dimensions
Identify the numeric measures for the facts.
To realise the four steps, we can start by drawing and refining a StarNet with the above four questions in mind.
Think about a few business questions that your data warehouse could help answer.
Draw a StarNet to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level of information you have access to.
Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.
Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under Azure Education) or a drawing program/website of your own choice. A Hand-printed StarNet diagram is also welcome!
Implement a star or snowflake schema using PostgreSQL. For the fact table and dimension tables, clearly state which ones are measures and dimensions, and indicate the dimension references.
Use PostgreSQL to build a multi-dimensional analysis service solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design.
Use Power BI/Tableau to visualise the data returned from your business queries.
Association Rule Mining
NOTE: you are not allowed to use Microsoft Visual Studio, R, or Weka to do association rule mining in this project. Only Python code will be considered (for association rule mining) in this project.
Make sure you complete the relevant lab before attempting this task. The lab content may help you complete this task.
You are expected to select a subset (or all) of the attributes to mine interesting patterns. To rank the interestingness of the extracted rules, use support, confidence, and lift.
In the submitted PDF, you need to:
Explain and discuss which association rules mining algorithms were used with references.
Explain the top k rules (where k ≥ 1) that have "Flight Status" on the right-hand side only, ranked by lift and confidence.
Explain the meaning of these k rules in plain English.
Share insights derived from the mining results. If no meaningful rules are discovered, explore potential reasons for this outcome.
Based on the rules, provide and explain at least THREE (3) recommendations to the government on civil aviation industry development。
What to Submit
The following are the files needed for Project 1 submission.
NOTE:
In-text citations and end-text references are required in this project report, following the IEEE referencing style.
Include team members' information in the report, such as team members' names and student IDs.
A comprehensive PDF report that includes the following key components:
Design, implementation, and usage of the data warehouse: Explain and discuss the design, implementation, and usage of the data warehouse to answer your queries, such as the fact table, dimension tables, and concept hierarchies design. When designing your data warehouse, you may select any number of columns/attributes as dimensions and measures. Please justify your design by explaining your assumptions or reasoning behind your choices with references.
NOTE: The data warehouse design should demonstrate concept hierarchies and good dimension modelling. An overly simple design will result in low marks for the project.
Schema, Starnet and query footprints: Provide and explain your StarNet diagrams and the query footprints. Discuss which schema you used and justify your choice with references.
Data cleaning, preprocessing, and ETL process: Describe your ETL process in detail. Include descriptions of the techniques used, discuss your ETL principles, explain the reasoning behind the key steps, and provide screenshots illustrating the process flow with references.
NOTE: During the ETL process, you may remove some rows from the dataset. However, the number of dropped rows should not exceed 5% of the Airline dataset. Removing too many rows will result in a low mark for the ETL section.
Visualisation of query results: Present the findings from your business queries using appropriate charts, graphs, and other visualisations. Ensure the insights are clearly communicated and easily understandable to stakeholders.
Association rules mining: See the Association Rules Mining section above.
All scripts you used (e.g. Python, PostgreSQL) with clear and well-structured comments.
The Power BI/Tableau file (.pdf file).
All CSV files you used for building and populating your database.
NOTE: All files need to be zipped up in a single zip file and submitted to LMS (Under the Projects channel that can be found in the left navigation panel).
Marking Scheme
Schema of each dimension and concept hierarchies for each dimension
5
Corresponding StarNet to illustrate query capabilities of the data warehouse
5
At least 5 types of business queries that the StarNet can answer
5
Data warehouse schema and fact table design
5
Data cleaning/pre-processing/ETL process for data transformation (include code, screenshots, and detailed explanations)
10
SQL Script file for building the database and loading the datasets
5
Visualise corresponding to the 5 business queries with appropriate charts
5
Association rule mining meaningful set-up
5
Interpretation of top rules and suggestions
5
Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution
5
Overall report quality
5
Total
60
Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:
5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)
4 - Proficient (correct application of the taught concepts with clear understandings demonstrated)
3 - Satisfactory (managed to meet most of the project requirements)
2 - Developing (some skills are demonstrated but need revision)
1 - Not yet Satisfactory (minimal effort)
0 - Not attempted.
Can I Use Generation AI?
Informed and educated use of generative AI tools is encouraged and should be declared. If you choose to use generative AI, please document its suggestions in your report and provide a clear rationale for accepting or rejecting each recommendation. While using generative AI is optional, as a data science professional, you are strongly encouraged to develop critical evaluation skills and build familiarity with modern tools that could enhance your productivity.
NOTE: Using generative AI without proper citation and documentation is unacceptable.
The following example is a BAD example that may risk receiving a 0 mark.

In this case, k>1 means the student should explain more than 1 top association mining rule. However, the student misunderstood the k value and the concept of lift. This is a common error when using GenAI, which sometimes generates non-existent theories - a phenomenon called 'AI hallucination'.
The explanations below are correct understandings.


Last updated