Project 1 - Data Warehouse Design

Aim to showcase this in your portfolio! (How to create a portfolioarrow-up-right)

circle-exclamation

The overall objectives of this project are to build a data warehouse using real-world datasets and to carry out a basic data mining activity, in this case, association rule mining.

Datasets and Problem Domain

According to IATA's analysis, the worldwide aviation industry experienced moderate growth in 2025. Passenger demand, measured by Revenue Passenger Kilometres, grew 5.3% compared to the previous year, slightly outpacing the 5.2% increase in airline capacity (Available Seat Kilometres). This imbalance drove the Passenger Load Factor to an unprecedented 83.6%, marking the highest annual utilisation rate ever recorded.

The growth trajectory showed divergence between market segments. International routes demonstrated robust expansion at 7.1% year-on-year, whilst domestic markets grew at a more modest 2.4% rate.

Regional Performance Highlights

Asia Pacific Region Carriers in the Asia Pacific region achieved the second-highest growth rate globally, with passenger traffic expanding 7.8% year-on-year.

Market-Specific Trends

Australia: The Australian domestic aviation market showed growth of 2.9% in passenger traffic (note: an earlier figure in the report cited 2.2% growth). However, airlines aggressively expanded capacity, resulting in a utilisation decline of 2.8 percentage points, bringing the load factor down to 82.2%.

The contrasting approaches between China and Australia highlight different strategic responses to post-pandemic recovery, with China prioritising load factors whilst Australia emphasised capacity restoration.

Assuming you are a business analyst, use the given data sources within a data warehouse to deliver insights to stakeholders, including governments, airline companies, or passengers.

Data Sources

Airline Dataset

The dataset is from Kagglearrow-up-right.

Data dictionary

  • Passenger ID - Unique identifier for each passenger

  • First Name - First name of the passenger

  • Last Name - Last name of the passenger

  • Gender - Gender of the passenger

  • Age - Age of the passenger

  • Nationality - Nationality of the passenger

  • Airport Name - Name of the airport where the passenger boarded

  • Airport Country Code - Country code of the airport's location

  • Country Name - Name of the country the airport is located in

  • Airport Continent - Continent where the airport is situated

  • Continents - Continents involved in the flight route

  • Departure Date - Date when the flight departed

  • Arrival Airport - Destination airport of the flight

  • Pilot Name - Name of the pilot operating the flight

  • Flight Status - Current status of the flight (e.g., on-time, delayed, canceled)

file-download
12MB
Airports
file-download
24KB
Counties
file-download
1MB
Navaids
file-download
468KB
Regions

These four datasets are sourced from OurAirportsarrow-up-right. Visit the website for more details.

Special Requirements for CITS5504 and CITS3401

  • For CITS3401 students, you can use the Airline Dataset only. You are encouraged to use one or more of the additional provided datasets.

  • For CITS5504 students, you must use the Airline Dataset along with more than one of the additional provided datasets.

  • All students are encouraged to use external datasets associated with the Airline Dataset.

Data Warehousing Design and Implementation

Design a data warehouse for the dataset(s) by following the four steps below of dimensional modelling (i.e., Kimball's four steps).

  1. Identify the process being modelled.

  2. Determine the grain at which facts can be stored.

  3. Choose the dimensions

  4. Identify the numeric measures for the facts.

To realise the four steps, we can start by drawing and refining a StarNet with the above four questions in mind.

  1. Think about a few business questions that your data warehouse could help answer.

  2. Draw a StarNet to identify the dimensions and concept hierarchies for each dimension. This should be based on the lowest level of information you have access to.

  3. Use the StarNet footprints to illustrate how the business queries can be answered with your design. Refine the StarNet if the desired queries cannot be answered, for example, by adding more dimensions or concept hierarchies.

  4. Once the StarNet diagram is completed, draw it using software such as Microsoft Visio (free to download under Azure Educationarrow-up-right) or a drawing program/websitearrow-up-right of your own choice. A Hand-printed StarNet diagram is also welcome!

  5. Implement a star or snowflake schema using PostgreSQL. For the fact table and dimension tables, clearly state which ones are measures and dimensions, and indicate the dimension references.

  6. Use PostgreSQL to build a multi-dimensional analysis service solution, with a cube designed to answer your business queries. Make sure the concept hierarchies match your StarNet design.

  7. Use Power BI/Tableau to visualise the data returned from your business queries.

Association Rule Mining

circle-exclamation

Make sure you complete the relevant lab before attempting this task. The lab content may help you complete this task.

You are expected to select a subset (or all) of the attributes to mine interesting patterns. To rank the interestingness of the extracted rules, use support, confidence, and lift.

In the submitted PDF, you need to:

  • Explain and discuss which association rules mining algorithms were used with references.

  • Explain the top k rules (where k ≥ 1) that have "Flight Status" on the right-hand side only, ranked by lift and confidence.

  • Explain the meaning of these k rules in plain English.

  • Share insights derived from the mining results. If no meaningful rules are discovered, explore potential reasons for this outcome.

  • Based on the rules, provide and explain at least THREE (3) recommendations to the government on civil aviation industry development。

What to Submit

The following are the files needed for Project 1 submission.

triangle-exclamation
  1. A comprehensive PDF report that includes the following key components:

    • Design, implementation, and usage of the data warehouse: Explain and discuss the design, implementation, and usage of the data warehouse to answer your queries, such as the fact table, dimension tables, and concept hierarchies design. When designing your data warehouse, you may select any number of columns/attributes as dimensions and measures. Please justify your design by explaining your assumptions or reasoning behind your choices with references.

    NOTE: The data warehouse design should demonstrate concept hierarchies and good dimension modelling. An overly simple design will result in low marks for the project.

    • Schema, Starnet and query footprints: Provide and explain your StarNet diagrams and the query footprints. Discuss which schema you used and justify your choice with references.

    • Data cleaning, preprocessing, and ETL process: Describe your ETL process in detail. Include descriptions of the techniques used, discuss your ETL principles, explain the reasoning behind the key steps, and provide screenshots illustrating the process flow with references.

    NOTE: During the ETL process, you may remove some rows from the dataset. However, the number of dropped rows should not exceed 5% of the Airline dataset. Removing too many rows will result in a low mark for the ETL section.

    • Visualisation of query results: Present the findings from your business queries using appropriate charts, graphs, and other visualisations. Ensure the insights are clearly communicated and easily understandable to stakeholders.

    • Association rules mining: See the Association Rules Mining section above.

  2. All scripts you used (e.g. Python, PostgreSQL) with clear and well-structured comments.

  3. The Power BI/Tableau file (.pdf file).

  4. All CSV files you used for building and populating your database.

circle-exclamation

Marking Scheme

Scheme
Marks

Schema of each dimension and concept hierarchies for each dimension

5

Corresponding StarNet to illustrate query capabilities of the data warehouse

5

At least 5 types of business queries that the StarNet can answer

5

Data warehouse schema and fact table design

5

Data cleaning/pre-processing/ETL process for data transformation (include code, screenshots, and detailed explanations)

10

SQL Script file for building the database and loading the datasets

5

Visualise corresponding to the 5 business queries with appropriate charts

5

Association rule mining meaningful set-up

5

Interpretation of top rules and suggestions

5

Coherence between the design and implementation, quality and complexity of the solution, reproducibility of the solution

5

Overall report quality

5

Total

60

Data warehousing exercises are often open-ended. In other words, there is almost always a better solution. You can interpret the scale of marks as:

5 - Exemplary (comprehensive solution demonstrating professional application of the knowledge taught in the class with initiative beyond just meeting the project requirement)

4 - Proficient (correct application of the taught concepts with clear understandings demonstrated)

3 - Satisfactory (managed to meet most of the project requirements)

2 - Developing (some skills are demonstrated but need revision)

1 - Not yet Satisfactory (minimal effort)

0 - Not attempted.

Can I Use Generation AI?

Informed and educated use of generative AI tools is encouraged and should be declared. If you choose to use generative AI, please document its suggestions in your report and provide a clear rationale for accepting or rejecting each recommendation. While using generative AI is optional, as a data science professional, you are strongly encouraged to develop critical evaluation skills and build familiarity with modern tools that could enhance your productivity.

triangle-exclamation

The following example is a BAD example that may risk receiving a 0 mark.

Hallucinated wrong answer generated from LLM

In this case, k>1 means the student should explain more than 1 top association mining rule. However, the student misunderstood the k value and the concept of lift. This is a common error when using GenAI, which sometimes generates non-existent theories - a phenomenon called 'AI hallucinationarrow-up-right'.

The explanations below are correct understandings.

Correct Explanation
Correct Explanation

Last updated