-
Notifications
You must be signed in to change notification settings - Fork 0
/
BakeryAnalysis.Rmd
241 lines (183 loc) · 7.71 KB
/
BakeryAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
---
title: "R Notebook"
output: html_notebook
editor_options:
chunk_output_type: inline
---
This R-markdown entails the analysis i conducted of the data set "Transactions from a bakery" provided by the kaggle user Xavier (https://www.kaggle.com/xvivancos).
The data set is loaded regularly for the Experimental Data Analysis (EDA) and also read in as transaction data using the read.transactions() function from the R-package "arules".
```{r, warning=F}
library(ggplot2)
library(tidyverse)
library(ggpubr)
library(lubridate)
library(arules)
library(arulesViz)
library(knitr)
library(gridExtra)
data <- read.csv("/Users/julianblau/Downloads/BreadBasket_DMS.csv")
transaction.data <- read.transactions("/Users/julianblau/Downloads/BreadBasket_DMS.csv",
format="single", cols=c(3,4), sep=",",
rm.duplicates=TRUE)
```
A quick overview of the data set at hand.
```{r}
str(data)
head(data)
```
Converting the "Date" and "Time" formats for further analysis.
```{r}
data <- data %>%
mutate(
Date = as.Date(Date),
Time = hms(Time)
)
data$Weekdays <- weekdays(data$Date)
```
Analyzing most frequently purchased products.
```{r}
data.items <- data %>% group_by(Item) %>% summarise(count=n())
data.items <- data.items[order(-data.items$count),]
data.items
ggplot(data.items[1:10,], aes(x = reorder(Item, -count),
y = count, fill=count))+
geom_bar(stat = "identity")+
labs(title="Most purchased items", x="Item",
y="count")+
theme(axis.title = element_text(size=12),
axis.text=element_text(size=8, angle = 35))
```
Coffee is the most purchased item at the bakery, followed by bread, tea and other products. The item "NONE" is unknown but will not be excluded as the information will be updated once clarified in the kaggle discussion thread.
A quick look at the items sold at the bakery. Here the item "Adjustment" appears, which seems to be some input from the cashier machine.
```{r}
items <- unique(data$Item)
items <- sort(items, decreasing = F)
head(items)
```
Analysis of the transactions per month.
```{r}
data.days <- data %>% group_by(Date) %>% summarise(count=n())
data.days$month <- as.Date(cut(data.days$Date, breaks = "month"))
data.days$Weekdays <- weekdays(data.days$Date)
ggplot(data.days, aes(month, count, fill=Weekdays))+
geom_bar(stat = "identity")+
scale_x_date(date_labels = "%B", date_breaks = "1 month")+
labs(title="Transaction frequency per month", x="Month")+
theme(axis.title = element_text(size=12),
axis.text.x=element_text(size=8, angle = 35))
```
October falls short as the data collected is of the 30th of the month and April only includes the first 9 days. This said, November seems to be the month with the most transactions followed by March and February.
Taking a closer look at the transaction distribution for each weekday, it becomes clear that Saturdays are the highest turnover days. The top 10 days by accumulated transactions are all Saturdays.
```{r}
data.days.desc <- data.days[order(-data.days$count), ]
data.days.desc
data.weekday.rev <- aggregate(data.days.desc$count, by=list(Category=data.days.desc$Weekdays), FUN=sum)
data.weekday.rev <- data.weekday.rev[order(-data.weekday.rev$x), ]
data.weekday.rev
data.weekday.rev$percent <- (prop.table(data.weekday.rev$x)*100)
data.weekday.rev
```
Combined, all Saturdays make up for almost 23% of all transactions, followed by Fridays and Saturday.
To ensure that these percentages aren't falsified by a more Saturdays then other days in the data set, the weekdays are summed up.
```{r}
sales.days <- unique(data$Date)
sales.days <- as.data.frame(sales.days)
colnames(sales.days) <- "Date"
sales.days$weekday <- weekdays(as.Date(sales.days$Date))
sales.days
sales.days.sum <- sales.days %>% group_by(weekday) %>% summarise(count=n())
sales.days.sum
```
The amount of weekdays are almost equal, solely Mondays are missing two days.
```{r}
ggplot(data.days.desc, aes(x="", y=count, fill=Weekdays))+
geom_bar(width = 1, stat = "identity")+
coord_polar("y", start=0)+
labs(title="Most purchased items", x="Item",
y="count")
ggplot(data.days.desc, aes(x=reorder(Weekdays, -count), y=count, fill=Weekdays))+
geom_bar(stat = "identity")+
labs(title="Accumulated transactions per weekday", x="Item",
y="count")+
theme(axis.title = element_text(size=12),
axis.text.x=element_text(size=8, angle = 35))
```
These two figures visualize the previous findings of Saturday being the day with the most transactions.
As Mondays can get quite rough, coffee can be a life saver. Lets look if Mondays show higher coffee sales then other days.
```{r}
coffee.sales <- aggregate(data$Item == "Coffee", by=list(Category=data$Weekdays), FUN=sum)
coffee.sales$percent <- (prop.table(coffee.sales$x)*100)
coffee.sales
```
Surprisingly, the proportions of coffee sales are fairly similar to the overall transaction proportions for the weekdays.
Knowing the time and amount of transaction frequencies can help the bakery management to plan their staff schedule accordingly.
```{r, warning=F, echo=F}
a <- data %>%
mutate(
Hour = as.factor(hour(Time))
) %>%
group_by(Hour, Transaction) %>%
summarize(
Transactions = n_distinct(Item)
) %>%
ggplot(
aes(x = Hour, fill = Hour)
) +
geom_histogram(stat="count") +
theme(
legend.position="none") +
labs(x = "Hour", y = "Transactions", title = "Transactions per hour")
a
b <- data %>%
mutate(
Hour = as.factor(hour(Time))
) %>%
group_by(Hour, Transaction) %>%
summarize(
Transactions = n_distinct(Item)
) %>%
ggplot(
aes(x = Hour, y = Transactions, fill = Hour)
) +
geom_boxplot() +
theme(
legend.position="none"
) +
labs(x = "Hour", y = "Items / transaction", title = "Items per transaction (per hour)")
b
```
The busiest time for the bakery is between 9 am and 4 pm with a peak around lunch time at 11-12 am. Opening hours seem to be 7 am to 8 pm with exceptions where the bakery was open till 1 am the next day. Might have been a big event.
```{r}
frequentItems <- eclat(transaction.data, parameter = list(supp=0.07, maxlen=15))
itemFrequencyPlot(transaction.data, topN=20, type="absolute", main= "Item Frequency")
```
Now we will take a look at association rules.
```{r}
rules <- apriori(transaction.data, parameter = list(supp= 0.001, conf= 0.5))
ruleConf <- sort(rules, by="confidence", decreasing = T)
inspect(head(ruleConf))
plot(rules)
head(quality(rules))
plot(rules, method = "grouped")
subrules2 <- head(rules, n = 10, by = "lift")
plot(subrules2, method = "graph")
plot(subrules2, method = "paracoord")
```
Lift is the factor by which, the co-occurrence of A and B exceeds the expected probability of A and B co-occurring, had they been independent. So, higher the lift, higher the chance of A and B occurring together.
```{r}
rulesHighLift <- sort(rules, by="lift", decreasing = T)
inspect(head(rulesHighLift))
```
This said, Coke and Juice have a 10 time higher chance of being bought with a sandwich rather then separately.
Now we will look at products that were purchased after/along with Tea.
```{r}
rulesItemSpecific <- apriori (transaction.data, parameter=list (supp=0.001,conf = 0.08), appearance = list (default="lhs",rhs="Tea"), control = list (verbose=F))
rulesItemSpecificConf <- sort(rulesItemSpecific, by="confidence", decreasing = T)
inspect(head(rulesItemSpecificConf))
```
Lets find out what items where purchased along with Coffee.
```{r}
rulesPurchase <- apriori(transaction.data, parameter = list(supp=0.001, conf=0.005, minlen=2), appearance = list(default="rhs", lhs="Coffee"), control = list(verbose=F))
rulesPurchaseConf <- sort(rulesPurchase, by="confidence", decreasing = T)
inspect(head(rulesPurchaseConf))
```