Week 1 - Introduction (Self-Reading)
There is no scheduled lab for CITS3401/5505 in Week 1. This page introduces a few of the basic concepts and tools used in our labs and data warehouse projects, such as Docker, PostgreSQL, Tableau, and Power BI. We explain, at a high level, what they are and what they are used for, so that you have a big picture of what we will do in the coming semester. Please read this page carefully before the first lab.
The following image illustrates a basic data warehouse project. While real-world applications may vary, they generally adhere to this framework. Our demo project and Project 1 will also fit into this framework.
Understanding concepts is often best achieved by delving into their history.
IBM researchers Barry Devlin and Paul Murphy developed the "business data warehouse" concept in the late 1980s. The intention was to provide an architectural model for the flow of data from operational systems to support business decision-making.
At the beginning, compute and storage resources were costly, leading to relatively simple data warehouse architectures, primarily focused on a company's sales data. However, with the advancement of cloud computing, the costs of compute and storage have decreased significantly. This reduction in cost has enabled the collection, storage, and analysis of a much larger volume of data, offering deeper and more comprehensive insights.
Technology is evolving rapidly, but the fundamental component concepts remain constant:
Data: data sources, data storage, data management, and data efficiency
ETL flow: Extract, Transform, Load
Report: visualization, e.g. a Power BI report
A data-science-enabled possible future: what-if analysis, machine learning, artificial intelligence, and more
After the ETL (Extract, Transform, Load) process, the data is loaded into the data warehouse engine, where it becomes structured and organised for efficient querying and analysis. There are different ways to implement a data warehouse engine; we will use PostgreSQL to illustrate the core concepts of a data warehouse.
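To make the querying side concrete, here is a minimal sketch in Python using psycopg2, assuming a local PostgreSQL instance with a hypothetical sales_fact table; the connection details and column names are illustrative only, not the unit's actual demo schema.

```python
# A minimal sketch of querying a warehouse table from Python.
# The connection details and the sales_fact table/columns are hypothetical
# placeholders, not the unit's actual demo schema.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="warehouse", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    # Aggregate sales by year and region, including subtotals and a grand
    # total, using PostgreSQL's CUBE grouping.
    cur.execute("""
        SELECT year, region, SUM(amount) AS total_sales
        FROM sales_fact
        GROUP BY CUBE (year, region)
        ORDER BY year NULLS LAST, region NULLS LAST;
    """)
    for year, region, total in cur.fetchall():
        print(year, region, total)

conn.close()
```

GROUP BY CUBE computes the aggregate for every combination of the listed columns (per year and region, per year only, per region only, and overall), which is the kind of multi-dimensional summary an OLAP cube provides.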
The evolution of the data warehouse comes from three perspectives:
Data is increasingly originating from a variety of sources and continues to grow in volume. For instance, real-time data is now being received from multiple sources in various formats. This trend results in increasingly complex ETL (Extract, Transform, Load) processes.
Furthermore, with OLAP (Online Analytical Processing) systems now handling data ranging from megabytes to terabytes, efficiently querying and analyzing this data presents a new set of challenges for data warehouse engines.
Additionally, there's a growing demand to extract deeper insights from data utilizing the latest advancements in Machine Learning (ML) and Artificial Intelligence (AI) technologies.
These key concepts will not be assessed in the final exam. However, they are critical, foundational requirements for your career, especially if you choose data engineering as your career direction. We hope our labs and projects help you gain practical insight for your future career.
The tools we choose to use in labs and projects may include:
Docker: Orchestrate the tools locally
PostgreSQL: data storage and management [Data] and deployment of the cube [Data warehouse engine]
Python (or other programming language) or Excel: ETL process [ETL]
Power BI or Tableau: Report presentation [Report]
Containerisation is a type of virtualisation in which all the components of an application are bundled into a single container image and can be run in isolated user space on the same shared operating system. Containers are lightweight, portable, and highly conducive to automation.
It solves the classic "it works on my machine" problem: if an application has been containerised, it should be able to run on any machine that can run containers.
Docker is a leading implementation of containerisation technology. It is a popular platform used for developing, shipping, and running applications. It allows you to separate your applications from your infrastructure, enabling rapid software delivery. Some key components of Docker include:
Docker Image: A Docker image is a lightweight, standalone, executable package that includes everything needed to run a piece of software, including the code, runtime, libraries, environment variables, and config files.
Docker Container: A container is a runnable instance of an image. You can think of it as the execution of an image, the runtime environment. Containers isolate software from its surroundings, for example, differences between development and staging environments.
Docker Compose: Docker Compose is a tool for defining and running multi-container Docker applications. With Compose, you use a YAML file to configure your application’s services, networks, and volumes. Then, with a single command, you create and start all the services from your configuration.
Other important Docker concepts include volumes and networks: a volume is used to persist a container's data, and a network manages the communication between multiple containers.
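As a rough illustration of how Compose ties these pieces together, a file that runs PostgreSQL with a named volume and a user-defined network might look like the sketch below; the service name, credentials, and ports are placeholders rather than the unit's official lab configuration.

```yaml
# A rough sketch only: service name, credentials and ports are placeholders,
# not the unit's official lab configuration.
services:
  db:
    image: postgres:16                    # official PostgreSQL image
    environment:
      POSTGRES_USER: student
      POSTGRES_PASSWORD: student
      POSTGRES_DB: warehouse
    ports:
      - "5432:5432"                       # expose PostgreSQL on the host
    volumes:
      - pgdata:/var/lib/postgresql/data   # volume: persist data across restarts
    networks:
      - labnet

volumes:
  pgdata:       # named volume managed by Docker

networks:
  labnet:       # user-defined network for container-to-container traffic
```

Running `docker compose up -d` in the directory containing this file starts the service in the background, and `docker compose down` stops it; because `pgdata` is a named volume, the database files survive container restarts.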
As of February 2025, relational DBMSs still hold a dominant position, and they are likely to keep it for a long time, especially in critical systems.
PostgreSQL is the DBMS of the Year 2023.
Data comes in diverse formats and from multiple sources, such as relational database systems, flat files, and .csv files. The main purpose of ETL is to transform data from transactional data sources into analytical data sources. In our labs, we use the most popular open-source relational database, PostgreSQL, as our tool to store both transactional and analytical data.
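Below is a small, illustrative Python sketch of that idea, assuming a flat sales_export.csv file and a local PostgreSQL instance; the file, table, and column names are hypothetical placeholders rather than the unit's real project data.

```python
# A small illustrative ETL sketch: extract rows from a flat .csv file,
# apply a simple transformation, and load them into a PostgreSQL table.
# The file, table, and column names are made-up placeholders.
import csv
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="warehouse", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    # Target table on the analytical (warehouse) side.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS sales_fact (
            sale_date date,
            region    text,
            amount    numeric
        );
    """)

    # Extract: read the transactional export row by row.
    with open("sales_export.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Transform: normalise the region name and cast the amount.
            region = row["region"].strip().upper()
            amount = float(row["amount"])

            # Load: insert the cleaned row into the warehouse table.
            cur.execute(
                "INSERT INTO sales_fact (sale_date, region, amount) "
                "VALUES (%s, %s, %s)",
                (row["sale_date"], region, amount),
            )

conn.close()
```

A real pipeline would usually batch the inserts (for example with executemany or PostgreSQL's COPY) rather than loading row by row, but the extract-transform-load structure stays the same.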
Data visualization is the representation of data through common graphics, such as charts, plots, infographics, and even animations. It helps you gain insight from data and supports business decision-making. There are quite a lot of tools on the market; here we pick the following for you to choose from:
Power BI
Tableau Desktop
Unfortunately, there is no native version of Power BI Desktop for macOS, though one is reportedly under consideration. Power BI is a powerful BI tool, but it does not currently run natively on Mac devices.
You can install a local virtual machine tool such as Parallels Desktop or UTM on your Mac, load a Windows 10 ISO, install Windows, and then install Power BI Desktop inside that virtual machine.
Power BI and Tableau are both quite popular in industry.
ALL questions and answers are based on the content of this page, including the linked material.
10 questions (MCQs, True/False questions, and a matching question)
Each question is worth 1 mark. The quiz contributes 2% to the final grade.
More details:
This test can be taken up to THREE times. The highest mark will be used to calculate the final grade.
This test can be saved and resumed later.
Remember to save your answers.
Due date/time: 28 March, Friday, at 11:59 PM.
This unit will use Docker as the tool to set up your local development environment (you can also choose the local server solution or the shared server solution). We will assume you have some basic knowledge of it; if not, you can check the documentation. Docker-related knowledge will not be assessed in the final exam. The usage of Docker within this unit is basic, and we will give you workable examples, so you do not need to worry about it. We do encourage you to develop a deeper understanding of Docker, as it will help your future career: the large majority of applications in industry today are containerised or dockerised.
PostgreSQL is an advanced, enterprise-class open-source relational database that supports both SQL (relational) and JSON (non-relational) querying. It is a highly stable database management system, backed by more than 20 years of community development, which has contributed to its high levels of resilience, integrity, and correctness. PostgreSQL is used as the primary data store or data warehouse for many web, mobile, geospatial, and analytics applications, and it is widely used for general-purpose OLTP databases. Amazon Redshift, the cloud data warehouse engine provided by AWS, is built upon PostgreSQL, and cloud warehouses such as Google BigQuery offer similar capabilities as managed services.
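As a quick, hypothetical illustration of the relational-plus-JSON point, the sketch below stores a JSON document in a jsonb column and filters it with ordinary SQL; the events table and its contents are made up for the example.

```python
# A tiny illustration of PostgreSQL's combined relational + JSON querying,
# using a hypothetical "events" table with a jsonb "payload" column.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5432,
    dbname="warehouse", user="postgres", password="postgres"
)

with conn, conn.cursor() as cur:
    cur.execute(
        "CREATE TABLE IF NOT EXISTS events (id serial PRIMARY KEY, payload jsonb);"
    )
    cur.execute(
        "INSERT INTO events (payload) VALUES (%s::jsonb)",
        ('{"type": "page_view", "user": "u42"}',),
    )
    # ->> extracts a JSON field as text, so JSON documents can be filtered
    # with ordinary SQL predicates.
    cur.execute(
        "SELECT payload ->> 'user' FROM events WHERE payload ->> 'type' = 'page_view'"
    )
    print(cur.fetchall())

conn.close()
```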
Stephan Schmidt points out that many backend technologies can be replaced by Postgres, making development simpler and systems easier to scale and operate. For more detail, please read the linked post.
Power BI is a business analytics solution that lets you visualise data and share insights throughout your organisation, or embed them in your app or website. It connects to hundreds of data sources and brings your data to life with live dashboards and reports. It is a great tool that allows businesses to analyse more data in more ways than ever before; regardless of your business, if it is data-intensive, Power BI can help you make sense of the data.
Tableau is a visual analytics platform transforming the way we use data to solve problems, empowering people and organisations to make the most of their data.
Tableau Desktop can run on Windows and macOS, and it also works with PostgreSQL, but it is not free. You can use it at no cost this semester.
This is an individual online open-book quiz (on ).