-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsongjiang_liu_midtermproj.py
302 lines (228 loc) · 12.5 KB
/
songjiang_liu_midtermproj.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
#!/usr/bin/env python
# coding: utf-8
# # FA24-CS634101 Data Mining
# ## Midterm Project Report
# Professor: Yasser Abduallah
# Student: Songjiang Liu
# UCID: sl947
# GitHub Repository: https://github.com/youjustlook/CS634_songjiang_liu_midtermproj
# ### Introduction
# This report shows the step by step implementation of algorithms used for generating request items sets and association rules for transactions from 5 databases. The databases are modified from given sample records or generated by ChatGPT as demonstrative data. These algorithms are useful for generating shopping recommendations, finding concurrent patterns, etc. for many industry applications.
#
# #### About Support and Confidence
# Support is the proportion of transactions in a dataset that contain a particular itemset, calculated as as the ratio of the number of transactions containing the itemset to the total number of transactions:
# **Support**:
# \\[
# \text{Support}(A) = \frac{\text{Transactions containing } A}{\text{Total transactions}}
# \\]
# Confidence is Confidence is the likelihood that an item appears in a transaction given that another item or itemset is already present, calculated as the ratio of the number of transactions containing both the antecedent and the consequent to the number of transactions containing only the antecedent:
# **Confidence**:
# \\[
# \text{Confidence}(A \rightarrow B) = \frac{\text{Transactions containing both } A \text{ and } B}{\text{Transactions containing } A}
# \\]
#
# #### About Algorithms Implemented/Used
# ##### self-implemented brute force algorithm
# The first algorithm in this project is a self-implemented brute force algorithm to find frequent item set and generate association rules from transactions at given support and confidence with enumeration for all possible combinations from item set of 1 to k items in the set.
# ##### built-in Apriori algorithm
# The second algorithm is the built-in Apriori algorithm in apriori_python package. It does the same thing as the self-implemented brute force algorithm but is optimized for efficiency too.
# ##### built-in FP-Growth algorithm
# The third algorithm is the built-in FP-Growth algorithm in mlxtend package. It first constructs an FP-tree to represent the transaction database, and then recursively mining the tree to extract frequent itemsets by identifying conditional patterns and combining them to generate larger frequent itemsets without the need for multiple database scans.
#
# ### Code Implementation
# It takes selection of the databases, self-determined support level and confidence level, and finds the frequent items sets and association rules under the aforementioned requirement. In the end it present the execution time of three algorithms: student-built brute force algorithm, built-in Apriori algorithm, and built-in FP-Growth algorithm
#
#
# Run the following code (delete # in the front first for the codes to be effective) to install necessary package if not found in environment.
# In[1]:
# !pip install pandas
# !pip install numpy
# !pip install apriori_python
# !pip install mlxtend
# #### Step 1: Import and Load Dataset
# In[2]:
# import libraries
import pandas as pd
import numpy as np
from apriori_python.apriori import apriori
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder
import itertools
import time
print("Hello, welcome to the Apriori Algorithm. Version 1.0")
def load_database(choice):
databases = {
1: "Amazon.csv",
2: "BestBuy.csv",
3: "HomeDepot.csv",
4: "K-mart.csv",
5: "Nike.csv"
}
try:
return pd.read_csv(databases[choice])
except KeyError:
print("Invalid choice. Please enter a number between 1 and 5.")
return None
except FileNotFoundError:
print(f"File {databases[choice]} not found.")
return None
def get_item_set(transactions_data):
item_set = set()
for transaction in transactions_data.iloc[:, 1]:
items = transaction.split(', ')
item_set.update(items)
return item_set
# Load the database at choice
databases = ["Amazon", "BestBuy", "HomeDepot", "K-mart", "Nike"]
while True:
print("Please select a database to load:")
for i, db in enumerate(databases, 1):
print(f"{i}. {db}")
try:
choice = int(input("Enter the number of your choice: "))
if choice < 1 or choice > len(databases):
raise ValueError("Invalid input. Please enter a number between 1 and 5.")
transactions_data = load_database(choice)
if transactions_data is None:
continue # if loading the database fails, ask for input again
transactions_data_each = transactions_data.iloc[:, 1]
print(f"Loaded {databases[choice-1]} database successfully.")
# Display number of transactions
num_transactions = len(transactions_data)
print(f"Number of transactions: {num_transactions}")
# Display item set information
item_set = get_item_set(transactions_data)
num_items = len(item_set)
print(f"Number of items in the item set: {num_items}")
print("Items in the item set:")
for item in item_set:
print(item)
break # exit the loop if everything is successful
except ValueError as e:
print(e)
except Exception as e:
print(f"An unexpected error occurred: {e}")
# #### Step 2: User Input for Support and Confidence
# In[3]:
# Get user input for minimum support and minimum confidence
while True:
try:
min_support = float(input("Enter the minimum support level in % (larger than 0 and lower than 100%): "))
if 0 < min_support < 100:
break
else:
print("Invalid input. Please enter a value between 0 and 100.")
except ValueError:
print("Invalid input. Please enter a numeric value.")
while True:
try:
min_confidence = float(input("Enter the minimum confidence level in % (larger than 0 and lower than 100%): "))
if 0 < min_confidence < 100:
break
else:
print("Invalid input. Please enter a value between 0 and 100.")
except ValueError:
print("Invalid input. Please enter a numeric value.")
# #### Step 3: Brute Force Method
# In[4]:
# Brute force method
def generate_candidates(item_set, length):
return [set(item) for item in itertools.combinations(item_set, length)]
def is_frequent(candidate, transactions, min_support_count):
count = sum(1 for transaction in transactions if candidate.issubset(transaction))
return count >= min_support_count
def count_frequent(candidate, transactions):
count = sum(1 for transaction in transactions if candidate.issubset(transaction))
return count
def brute_force_apriori(transactions, item_set, min_support):
min_support_count = len(transactions) * min_support
frequent_itemsets = []
k = 1
current_itemsets = [set([item]) for item in item_set]
while current_itemsets:
next_itemsets = []
for itemset in current_itemsets:
if is_frequent(itemset, transactions, min_support_count):
frequent_itemsets.append(itemset)
k += 1
current_itemsets = generate_candidates(set(itertools.chain.from_iterable(frequent_itemsets)), k)
return frequent_itemsets
def find_association_rules(frequent_collections, data_samples, threshold_confidence):
association_rules = []
for collection in frequent_collections:
for num_elements in range(1, len(collection)):
for precursor in itertools.combinations(collection, num_elements):
precursor = set(precursor)
outcome = collection - precursor
if outcome:
support = sum(1 for sample in data_samples if collection.issubset(sample)) / len(data_samples)
confidence = support / (sum(1 for sample in data_samples if precursor.issubset(sample)) / len(data_samples))
if confidence >= threshold_confidence:
association_rules.append((precursor, outcome, support, confidence))
return association_rules
# Running the brute force method
transactions_data_processed = []
order = sorted(item_set)
for lines in transactions_data_each:
trans = list(lines.strip().split(', '))
trans_l = list(np.unique(trans))
trans_l.sort(key=lambda x: order.index(x))
transactions_data_processed.append(sorted(trans_l))
start_time_brute_force_freq_itemset = time.time_ns()
frequent_itemsets = brute_force_apriori(transactions_data_processed, item_set, min_support / 100)
brute_force_time_freq_itemset = time.time_ns() - start_time_brute_force_freq_itemset
print("--------------------\nFrequent item sets using brute force method: item | support")
for itemset in frequent_itemsets:
print(str(itemset) + " | " + str(count_frequent(itemset, transactions_data_processed)/len(transactions_data_processed)))
print("--------------------\nAssociation rules using brute force method:")
start_time_self_association_rules = time.time_ns()
rules = find_association_rules(frequent_itemsets, transactions_data_processed, min_confidence / 100)
find_rules_time = time.time_ns() - start_time_self_association_rules
for i, rule in enumerate(rules):
antecedent, consequent, support, confidence = rule
print(f"Rule {i+1}: {antecedent} -> {consequent} (support: {support}, confidence: {confidence})")
# #### Step 4: Built-in Apriori Algorithm
# In[5]:
# Check with built-in Apriori algorithm
print("--------------------\nCheck with Built-in Apriori Algorithm:")
start_apriori_time_freq_itemset_and_rules = time.time_ns()
freq_item_set, rules = apriori(transactions_data_processed, min_support/100, min_confidence/100)
apriori_time_freq_itemset_and_rules = time.time_ns() - start_apriori_time_freq_itemset_and_rules
print("--------------------\nFrequent item sets with Built-in Apriori algorithm:")
print(freq_item_set)
print("--------------------\nAssociation rules with Built-in Apriori algorithm:")
for i, rule in enumerate(rules):
print(f"Rule {i+1}: {rule}")
# The self-built brute force algorithm generates same frequent item sets andd association rules as the built-in Apriori algotithm.
# #### Step 5: FP-Growth Algorithm
# In[6]:
# Check with fp-growth algorithm
te = TransactionEncoder()
te_ary = te.fit(transactions_data_processed).transform(transactions_data_processed)
df_encoded = pd.DataFrame(te_ary, columns=te.columns_)
start_time_fpgrowth = time.time_ns()
freq_item_set_fp = fpgrowth(df_encoded, min_support/100, use_colnames=True)
fpgrowth_time = time.time_ns() - start_time_fpgrowth
print("--------------------\nCheck with Built-in FP-Growth algorithm:")
print(freq_item_set_fp)
start_time_association_rules = time.time_ns()
rules_fp = association_rules(freq_item_set_fp, metric="confidence", min_threshold=min_confidence / 100)
association_rules_time = time.time_ns() - start_time_association_rules
print("--------------------\nAssociation rules with FP-Growth algorithm: rule, support, confidence")
for idx, row in rules_fp.iterrows():
antecedent = set(row['antecedents'])
consequent = set(row['consequents'])
support = row['support']
confidence = row['confidence']
print(f"Rule {idx + 1}: {antecedent} -> {consequent} (support: {support:.2f}, confidence: {confidence:.2f})")
# The the results from brute force, apriori, and fp-growth are the same.
# In[12]:
# Print combined execution times
print("\nExecution Times (in nanoseconds):")
print(f"Self-Built Brute Force: {brute_force_time_freq_itemset + find_rules_time} ns")
print(f"Built-in Apriori: {apriori_time_freq_itemset_and_rules} ns")
print(f"Built-In FP-Growth: {fpgrowth_time + association_rules_time} ns")
print("---------------------\nFootnote:\n Built-in Apriori returns frequent item sets and rules in one go, while the other two generate it separately. \n And the databases only have 20 transactions. \n Therefore, sometimes Built-in Apriori took the least time, and sometimes FP-Growth is even slower than self-built brute force. ")
# ### Conclusion
# This project guides you to understand the use and functioning steps of typical association rule algorithms: one self-built brute force algorithm, which is a de facto Apriori algorithm, one optimized-efficient built-in Apriori algorithm, and one more scale-efficient FP-Growth algorim. The frequent items sets and association rules generated are matched with each other; however, their running time are diffent and can vary depending on scale of dataset and their own methodology behind.