Week 6 - Association Rules Mining
Lab Objectives
Find frequently occurring itemsets using the Apriori algorithm.
Compute the support of a frequent itemset.
Compute the confidence and lift of an association rule.
Technologies Covered
Python
mlxtend
pandas
numpy
Jupyter Notebook
Algorithms for Association Rules Mining
The FP-Growth and ECLAT algorithms are not assessed in the final exam.
The Apriori algorithm works by first identifying the frequent itemsets in a dataset, using a bottom-up approach: it starts with individual items and gradually builds up to larger itemsets (Ali, 2023). It relies on the downward-closure property, which states that any subset of a frequent itemset must also be frequent, and it terminates when no further frequent or candidate itemsets can be generated.
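To illustrate the bottom-up, level-wise idea, below is a minimal sketch (not the full algorithm) of how length-(k+1) candidates can be generated from length-k frequent itemsets and pruned using the downward-closure property. The itemsets shown are hypothetical examples.

```python
from itertools import combinations

def generate_candidates(frequent_k_itemsets, k):
    """Join frequent k-itemsets to form (k+1)-candidates, then prune
    any candidate that has an infrequent k-subset (downward closure)."""
    candidates = set()
    for a in frequent_k_itemsets:
        for b in frequent_k_itemsets:
            union = a | b
            if len(union) == k + 1:
                candidates.add(union)
    # Prune: every k-subset of a surviving candidate must be frequent.
    return {c for c in candidates
            if all(frozenset(s) in frequent_k_itemsets
                   for s in combinations(c, k))}

# Hypothetical frequent 2-itemsets:
frequent_2 = {frozenset({'bread', 'milk'}), frozenset({'bread', 'butter'}),
              frozenset({'milk', 'butter'})}
print(generate_candidates(frequent_2, 2))
# -> {frozenset({'bread', 'milk', 'butter'})}
```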
Comparison of These Three Algorithms
This part will not be assessed in the final exam.

| Algorithm | Memory usage | Runtime | Implementation |
| --- | --- | --- | --- |
| Apriori | Intensive | Can be slow, especially for large datasets | Easy to implement |
| FP-Growth | Less memory compared to Apriori | Generally faster than Apriori | More complex to implement |
| ECLAT | Significantly lower than Apriori | Usually between Apriori and FP-Growth* | Moderate complexity |

*: Some research studies suggest that ECLAT is generally faster than FP-Growth; the performance can vary depending on the characteristics of the datasets being analysed.
In this lab, we will use the Apriori algorithm to mine rules.
Key Measures for Association Rule Mining
Support: the probability that a transaction contains X ∪ Y, i.e. support(X ⇒ Y) = P(X ∪ Y).
Confidence: the conditional probability that a transaction containing X also contains Y, i.e. confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X).
Lift: the ratio of the observed co-occurrence of X and Y to what would be expected if they were independent, i.e. lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y).
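As an illustrative worked example (the numbers are made up): suppose a dataset contains 100 transactions, 40 contain X, 30 contain Y, and 20 contain both X and Y. Then support(X ⇒ Y) = 20/100 = 0.20, confidence(X ⇒ Y) = 20/40 = 0.50, and lift(X ⇒ Y) = 0.50/0.30 ≈ 1.67, meaning X and Y co-occur about 1.67 times more often than expected if they were independent.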
Association Rules Mining with Python
There are various software tools that can be used to perform association rules mining, such as Python, R, Weka, or Microsoft SQL Server Analysis Services (MS SSAS). In this lab, we use Jupyter Notebook and Python libraries to mine rules.
Downloading the Dataset
Download the dataset for this lab. Note: save the dataset to your working directory. In this dataset, the column "pep" indicates whether the customer purchased a Personal Equity Plan after the most recent promotional campaign.
Code Execution
For this lab, you have two options to execute the code: a local solution or a Docker solution. You can choose either option.
Local solution
Run Jupyter Notebook locally
Docker Solution
If you run this lab inside Docker, please save the dataset in the "notebooks" directory within our GitHub repository.
Data Pre-processing
If you haven't installed mlxtend, you need to install mlxtend first.
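A common way to install it, assuming pip is available (in a Jupyter Notebook cell, the leading "!" runs a shell command):

```python
# Run in a Jupyter Notebook cell; the "!" passes the command to the shell.
!pip install mlxtend
```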
Import necessary libraries
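A typical set of imports for this lab might look like the following (the exact set you need may differ):

```python
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
```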
Import the dataset to Jupyter Notebook
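A minimal sketch, assuming the downloaded file is named bank-data.csv (a hypothetical filename; replace it with the actual name of your dataset file) and sits in your working directory:

```python
df = pd.read_csv('bank-data.csv')  # hypothetical filename; adjust as needed
df.head()                          # preview the first few rows
```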
Descriptive statistics
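For example, pandas provides:

```python
df.describe()  # summary statistics for numerical columns
df.info()      # column types and non-null counts
```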
Check missing values in the dataframe
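One common way to check:

```python
df.isnull().sum()  # number of missing values per column
```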
We can see that there are no missing values in the dataframe.
The column "id" is not useful for association rules mining; therefore, it needs to be dropped.
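For example:

```python
df = df.drop(columns=['id'])  # remove the identifier column
```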
Data transformation
Is the provided data transformation solution good? If not, could you propose a better solution for data transformation?
How are numerical attributes handled in your Project 1 for association rule mining? Could you justify your solution?
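As a starting point for your discussion, below is a minimal sketch of one possible transformation: bin numerical attributes into categories, then one-hot encode everything into the boolean format that mlxtend's apriori expects. The column names age and income and the bin edges are hypothetical; adapt them to the actual attributes in the dataset, and treat this as one option to critique rather than the recommended answer.

```python
# Hypothetical numerical columns; replace with the dataset's actual attributes.
df['age'] = pd.cut(df['age'], bins=[0, 30, 50, 120],
                   labels=['young', 'middle-aged', 'senior'])
df['income'] = pd.qcut(df['income'], q=3,
                       labels=['low', 'medium', 'high'])

# One-hot encode into booleans; assumes all remaining columns are
# categorical (e.g. yes/no strings) after binning.
df_encoded = pd.get_dummies(df).astype(bool)
```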
Association Rules Mining with the mlxtend Library
Note: The thresholds for confidence, lift, and support used in this part are not discussed in detail. This section serves as an example to help you become familiar with mining association rules using the mlxtend library. The thresholds mentioned here are not intended as references for your Project 1; in Project 1, you may need to discuss the selection of appropriate thresholds.
We don't explain rules/results/outputs in the lab sheet, as explaining rules is a task in Project 1. Please research independently to understand how to explain rules in plain English.
min_support: a minimum support threshold, used to filter out itemsets that don't occur frequently enough.
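For example, assuming the one-hot encoded dataframe from the transformation step is named df_encoded, and using an illustrative threshold (not a recommendation for Project 1):

```python
frequent_itemsets = apriori(df_encoded, min_support=0.1, use_colnames=True)
frequent_itemsets.head()
```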
Use confidence to filter out association rules that are not strong enough.
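For example:

```python
rules = association_rules(frequent_itemsets, metric="confidence",
                          min_threshold=0.6)  # illustrative threshold
```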
Use lift to filter out association rules that are not interesting enough.
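For example, a lift greater than 1 suggests a positive association between the antecedent and the consequent. As an alternative to the confidence-based call above:

```python
rules = association_rules(frequent_itemsets, metric="lift",
                          min_threshold=1.0)  # keep positively associated rules
```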
Filter the rules whose confidence ≥ 0.7
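Since association_rules returns a pandas DataFrame, this is ordinary boolean indexing:

```python
strong_rules = rules[rules['confidence'] >= 0.7]
```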
[Optional] Association Rules Mining without Python Libraries
Recall the Apriori algorithm
Scan the database once to get frequent 1-itemset; k=1
Repeat
Generate length (k+1) candidate itemsets from length k frequent itemsets
Test the candidates against the database to find frequent (k+1) itemsets
Set k=k+1
Terminate when no frequent or candidate set can be generated
Return all the frequent itemsets
Pseudocode
You may need to use for and while loops in your code. You need to write your own code to compute the confidence, support, and lift of your rules.
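As a starting point, here is a minimal sketch of computing support, confidence, and lift for a single rule X ⇒ Y without any libraries, over a small hypothetical transaction database. Extending it into a full Apriori loop over candidate itemsets is the exercise.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

def confidence(X, Y, transactions):
    """P(Y | X) = support(X ∪ Y) / support(X)."""
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    """confidence(X ⇒ Y) / support(Y)."""
    return confidence(X, Y, transactions) / support(Y, transactions)

X, Y = {'bread'}, {'milk'}
print(support(X | Y, transactions))    # 0.5
print(confidence(X, Y, transactions))  # 2/3 ≈ 0.667
print(lift(X, Y, transactions))        # (2/3) / 0.75 ≈ 0.889
```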
References
Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. Y. (2008, October). PFP: Parallel FP-growth for query recommendation. In Proceedings of the 2008 ACM Conference on Recommender Systems (pp. 107-114).
Jumaah, A. K., Al-Janabi, S., & Ali, N. A. (2014). Hiding Sensitive Association Rules Over Privacy-Preserving Distributed Data Mining. Kirkuk University Journal-Scientific Studies, 9(1), 59-72.
Wang, L., Guo, Y., Guo, Y., Xia, X., Zhang, Z., & Cao, J. (2023). An Improved Eclat Algorithm based Association Rules Mining Method for Failure Status Information and Remanufacturing Machining Schemes of Retired Products. Procedia CIRP, 118, 572-577.