# Week 6 - Association Rules Mining

{% hint style="info" %} <mark style="color:red;">**The deadline for the Week 5 lab assessment is Thursday, 2nd April. You are still welcome to present your Week 5 lab results to**</mark><mark style="color:red;">**&#x20;**</mark>*<mark style="color:red;">**ANY**</mark>*<mark style="color:red;">**&#x20;**</mark><mark style="color:red;">**lab facilitator at**</mark><mark style="color:red;">**&#x20;**</mark>*<mark style="color:red;">**ANY**</mark>*<mark style="color:red;">**&#x20;**</mark><mark style="color:red;">**lab session in Week 6 before the deadline.**</mark>
{% endhint %}

## Lab Objectives

* Find frequently occurring itemsets using the Apriori algorithm.
* Compute the support of the frequent itemset.
* Compute the confidence and lift of an association rule.

## Technologies Covered

* Python `mlxtend`
* Python `pandas`
* Python `numpy`
* Jupyter Notebook

## Algothrims for Association Rules Mining

**FP-Growth and ECLAT algorithms are not assessed in the final exam.**

{% tabs %}
{% tab title="Apriori algorithm" %}
The Apriori algorithm works by first identifying the frequent itemsets in a dataset. Any subset of a frequent itemset must also be frequent. Terminate when no frequent or candidate set can be generated. The Apriori algorithm uses a bottom-up approach, starting with individual items and gradually building up to more complex itemsets (Ali, 2023).
{% endtab %}

{% tab title="FP-Growth algorithm" %}
"FP-Growth works in a divide and conquer way. It requires 2 scans on a database. FP-Growth first computes a list of frequent items sorted by frequency in descending order during its first database scan. In its second scan, the database is compressed into an FP-tree. Then FP-Growth starts to mine the FP-tree for each item whose support is larger than $$\xi$$ by recursively building its conditional FP-tree." (Li et al., 2008)
{% endtab %}

{% tab title="ECLAT algorithm" %}
"Apriori and FP-Growth algorithms are based on horizontally-formatted datasets, but the ECLAT algorithm is based on vertically-formatted datasets. ECLAT algorithm generates candidate frequent k-item sets by the intersection operation of the previous frequent k-1 item sets, then prune to generate candidate frequent item sets. After converting to a vertical data structure, the Eclat algorithm only needs to scan the dataset once." (Wang et al., 2023)
{% endtab %}
{% endtabs %}

### Comparison of These Three Algorithms

**This part will not be accessed in the final exam.**

<table data-full-width="true"><thead><tr><th>Algorithm</th><th>Memory Usage</th><th>Runtime Efficiency*</th><th>Implementation Complexity</th></tr></thead><tbody><tr><td><strong>Apriori</strong></td><td>Intensive</td><td>Can be slow, especially for large datasets.</td><td>Easy to implementation</td></tr><tr><td><strong>FP-Growth</strong></td><td>Less memory compared to Apriori</td><td>Generally faster than Apriori.</td><td>More complex to implement.</td></tr><tr><td><strong>ECLAT</strong></td><td>Memory usage is significantly lower than Apriori.</td><td>Runtime efficiency is usually between Apriori.</td><td>Implementation complexity is moderate.</td></tr></tbody></table>

\*: Some research studies suggest that ECLAT is generally faster than FP-Growth, and the performance can vary depending on the characteristics of the datasets being analysed.

**In this lab, we will use the&#x20;*****Apriori algorithm*****&#x20;to mine rules.**

## Key Measures for Association Rule Mining

{% tabs %}
{% tab title="Support" %}
The probability that a transaction contains 𝑋 ∪ 𝑌.

$$Support (X  \Rightarrow Y)=P(X \cup Y)$$

<figure><img src="/files/H5lFmowmEDFNH6qW3iT5" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Confidence" %}
The conditional probability that a transaction having 𝑋 also contains 𝑌.

$$Confidence(X \implies Y) = P(Y|X) =\frac{Support(X \bigcup Y)}{Support(X)}$$

<figure><img src="/files/RowOzWoMQWoOE7yvCXaU" alt=""><figcaption></figcaption></figure>
{% endtab %}

{% tab title="Lift" %}
Assesses the degree to which the occurrence of one “lifts” the occurrence of the other.

$$Lift  (X \implies Y) = \frac{P(X \bigcup Y)}{P(X) \times P(Y)}$$

<figure><img src="/files/plqaACdex73eoWdyOF9i" alt=""><figcaption></figcaption></figure>
{% endtab %}
{% endtabs %}

## Association Rules Mining with Python

There are various software tools that can be used to perform association rules mining, such as Python, R, Weka, or Microsoft SQL Server Analysis Services (MS SSAS). In this lab, we use Jupyter Notebook and Python libraries to mine rules.

### Download of the Dataset

Download the dataset for this lab. Note: download the dataset to your working directory. In this dataset, the column, "pep" indicates whether the customer purchased a Personal Equity Plan after the most recent promotional campaign.

{% file src="/files/3sasEV1p13czEUt40N1Q" %}
The dataset for this lab
{% endfile %}

### Code Execution

For this lab, you have two options to execute the code: a local solution or a Docker solution. You can choose **either** option.

* **Local solution**
  * Run Jupyter Notebook locally

```python
#Check your current working directory
import os
os.getcwd()

#If you need to change your current working directory, you can
os.chdir('working directory path')
```

* **Docker Solution**
  * If you run this lab inside Docker, please **save the dataset in the "notebooks" directory within our GitHub repository.**

### Data Pre-process

* If you haven't installed `mlxtend`, you need to install `mlxtend` first.

```python
#Install mlxtend library.
pip install mlxtend
```

* Import necessary libraries

```python
#Import necessary libraries, pandas, numpy and mlxtend
import pandas as pd
import numpy as np
import mlxtend
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
```

* Import the dataset to Jupyter Notebook

```python
#Read the csv file
df = pd.read_csv('bank-data.csv')

#print the df
print(df)
```

* Descriptive statistics

```python
#Show the data head
df.head()

#Check the data types in the dataframe
df.dtypes

#The columns of the dataframe
df.columns

#Check the number of rows and columns
df.shape
```

* Check missing values in the dataframe

```python
#Check missing values in the dataframe
df.isna()

#Count total missing values at each column in the dataframe
df.isna().sum()
```

<mark style="color:blue;">**We can see, that there are no missing values in the data frame.**</mark>

* The column, "id", is not useful for association rules mining, therefore, the column needs to be dropped.

```python
#Drop the column, 'id'.
new_df = df.drop('id',axis = 'columns')
```

* Data transformation

```python
#Data transformation
#TransactionEncoder() function only can handle string type
new_df = new_df.astype(str)

#TransactionEncoder() was designed to covert lists to array
list = new_df.values.tolist()

#Covert the list to one-hot encoded boolean numpy array. 
#Apriori function allows boolean data type only, such as 1 and 0, or FALSE and TRUE.
te = TransactionEncoder()
array_te = te.fit(list).transform(list)

#Check the array
array_te

#Check the colunms
te.columns_

#Apriori function can handle dataframe only, covert the array to a dataframe
arm_df = pd.DataFrame(array_te, columns = te.columns_)
```

* **Is the provided data transformation solution good? If not, could you propose a better solution for data transformation?**
* **How are numerical attributes handled in your Project 1 for association rule mining? Could you justify your solution?**

### Association Rules Mining with `mlxtend` Library

<mark style="color:orange;">**Note: All thresholds for confidence, lift, and support were not discussed in this part. This section serves as an example to help you become familiar with mining association rules using the**</mark><mark style="color:orange;">**&#x20;**</mark><mark style="color:orange;">**`mlxtend`**</mark><mark style="color:orange;">**&#x20;**</mark><mark style="color:orange;">**library. The thresholds mentioned here are not intended as references for your Project 1. In Project 1, you may need to discuss the selection of appropriate thresholds.**</mark>

*We don't explain rules/results/outputs in the lab sheet, as explanating rules is a task in Project 1. Please research independently to understand how to explain rules in plain English.*

* `min_support`: a minimum support threshold, used to filter out itemset that don't occur frequently enough.

```python
#Find the frequent itemsets
frequent_itemsets = apriori(arm_df,min_support=0.2,use_colnames =True)

#Check the length of rules
frequent_itemsets['length']=frequent_itemsets['itemsets'].apply(lambda x: len(x))

#Assume the length is 2 and the min support is >= 0.3
frequent_itemsets[ (frequent_itemsets['length']==2) & 
                  (frequent_itemsets['support']>=0.3)]
```

* Use confidence to filter out association rules that are not strong enough.

```python
#Assume the min confidence is 0.5
rules_con = association_rules(frequent_itemsets, metric="confidence",min_threshold=0.5)
```

* Use lift to filter out association rules

```python
#Assume the min lift is 1
rules_lift = association_rules(frequent_itemsets, metric="lift",min_threshold=1)

#Based on min confidence (=0.5), 
#output antecedents, consequents, support, confidence and lift.
result_arm = rules_con[['antecedents','consequents','support','confidence','lift']]
```

* Filter the rules whose confidence ≥ 0.7

```python
#Find the rules whose confidence >= 0.7
new_result_arm = result_arm[result_arm['confidence']>=0.7]
```

## \[Optional] Association Rules Mining without Python Libraries

<img src="/files/c5NoWmUnxCkI2s2Xc7ze" alt="" data-size="line">You can also mine association rules without Python libraries, except `pandas` and `numpy`.

<img src="/files/5zLT2wtZSY1FduWApJqF" alt="" data-size="line">Hints:

* Recall apriori algorithm
  1. Scan the database once to get frequent 1-itemset; k=1
  2. Repeat
     1. Generate length (k+1) candidate itemsets from length k frequent itemsets
     2. Test the candidates against the database to find frequent (k+1) itemsets
     3. Set k=k+1
  3. Terminate when no frequent or candidate set can be generated
  4. Return all the frequent itemsets
* Pseudocode

<figure><img src="/files/5N70AGM0eyAQ5gCLLKjG" alt=""><figcaption><p>From: <em>Pesudo code for Apriori algorithm,</em> by Jumaah, AI-Janabi, and Ali, 2014, Flickr <a href="https://www.iasj.net/iasj/download/ea6e857c5fed9a37">https://www.iasj.net/iasj/download/ea6e857c5fed9a37</a></p></figcaption></figure>

* You may need to use `for` and `while` loops in your code.
* You need to write your code to compute confidence, support and lift for your rules.

## References

Ali, M. (2023, January). *Association Rule Mining in Python Tutorial.* datacamp. <https://www.datacamp.com/tutorial/association-rule-mining-python>

Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. Y. (2008, October). Pfp: parallel fp-growth for query recommendation. In *Proceedings of the 2008 ACM conference on Recommender systems*, 107-114.

Jumaah, A. K., Al-Janabi, S., & Ali, N. A. (2014). Hiding Sensitive Association Rules Over Privacy-Preserving Distributed Data Mining. *Kirkuk University Journal-Scientific Studies*, *9*(1), 59-72.

Wang, L., Guo, Y., Guo, Y., Xia, X., Zhang, Z., & Cao, J. (2023). An Improved Eclat Algorithm based Association Rules Mining Method for Failure Status Information and Remanufacturing Machining Schemes of Retired Products. *Procedia CIRP*, *118*, 572-577.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://csse-uwa.gitbook.io/data-warehousing-lab-sheets/week-6-association-rules-mining.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
