Question

1. Objective

This dataset records sales information for a bakery shop. Association rule mining can help improve sales by uncovering relationships between the items sold. For instance, we can discover which items are usually sold together and make business decisions based on these associations.

2. Dataset description

The dataset contains customer transactions. Each transaction records the item(s) sold and the quantity of each item.

Data preprocessing done:

  • The Quantity column is left out of the mining input, because the quantity purchased does not affect the outcome of association rule mining.
  • Each integer code in the Food column is mapped to its more meaningful text representation.
    • This makes both the mined rules and the visualizations easier to read.
  • Header names are added to the columns of the dataset so that the columns can be told apart.
  • The data is converted into basket format so that it can be passed to apriori.
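
As a rough illustration of what "basket format" means, the sketch below groups the six rows shown in the "After preprocessing" output by receipt, using base R only. The name `toy` is introduced here for illustration and is not part of the report's code.

```r
# The six rows shown in the "After preprocessing" output (Food column only)
toy <- data.frame(
  Receipt_Number = c(1, 1, 1, 1, 2, 2),
  Food = c("Coffee Eclair", "Blackberry Tart", "Single Espresso",
           "Bottled Water", "Lemon Cake", "Lemon Tart"),
  stringsAsFactors = FALSE
)

# One basket per receipt: a named list of item vectors
baskets <- split(toy$Food, toy$Receipt_Number)
print(baskets)
```

Each list element is one transaction, which is the shape that the later `as(..., "transactions")` call in the R Code section consumes.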

3. Rule mining process

Parameter Settings (Based on 1000i.csv)

Parameter       Value
Support         0.015
Confidence      0.9
Algorithm       apriori
Time required   0.20s - 0.24s
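
To make the two thresholds concrete, here is a minimal base R sketch of how support, confidence, and lift are computed for a single rule. The baskets are hypothetical toy data, not taken from 1000i.csv, and the helper `support()` is defined here for illustration only.

```r
# Hypothetical toy baskets (not from 1000i.csv)
baskets <- list(
  c("Hot Coffee", "Apple Pie", "Almond Twist"),
  c("Hot Coffee", "Apple Pie"),
  c("Green Tea", "Lemon Cookie"),
  c("Hot Coffee", "Almond Twist", "Apple Pie"),
  c("Apple Pie")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("Hot Coffee", "Almond Twist")
rhs <- "Apple Pie"

supp_rule <- support(c(lhs, rhs), baskets)      # share of baskets with lhs and rhs
conf_rule <- supp_rule / support(lhs, baskets)  # P(rhs | lhs)
lift_rule <- conf_rule / support(rhs, baskets)  # confidence relative to rhs frequency

# A rule survives mining only if it meets both thresholds from the table
keep <- supp_rule >= 0.015 && conf_rule >= 0.9
```

With the settings in the table, a rule is reported only when its itemset appears in at least 1.5% of receipts and the right-hand side follows the left-hand side at least 90% of the time.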

4. Resulting rules

The mined association rules tell us which items are normally sold together with other items.
After pruning, 28 rules remain (down from 68 before pruning).

A summary of the rules (Pruned)

Description          Value
minimum support      0.018
maximum support      0.040
minimum confidence   0.9
maximum confidence   1.0
minimum lift         11.18
maximum lift         19.61

The rules we would show to the client are those with high support, confidence, and lift values.
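
One way to pick such rules is to rank them by each measure. The sketch below does this in base R for four rules whose values are copied from the item-level output in the R Code section; the data frame `top` is an illustrative stand-in for the real rules object. On the actual rules object, arules' `sort(rules, by = "lift")` would presumably serve the same purpose.

```r
# Four rules copied from the item-level output in the R Code section
top <- data.frame(
  rule = c("{Almond Twist,Hot Coffee} => {Apple Pie}",
           "{Chocolate Tart,Walnut Cookie} => {Vanilla Frappucino}",
           "{Apricot Croissant,Hot Coffee} => {Blueberry Tart}",
           "{Apple Danish,Apple Tart} => {Apple Croissant}"),
  support    = c(0.024, 0.018, 0.032, 0.040),
  confidence = c(0.9600000, 1.0000000, 1.0000000, 0.9756098),
  lift       = c(14.11765, 13.51351, 12.34568, 10.72099),
  stringsAsFactors = FALSE
)

# Rank by lift (strongest association first) and by support (most frequent first)
by_lift    <- top[order(-top$lift), ]
by_support <- top[order(-top$support), ]
print(by_lift)
```

The two rankings can disagree: the rule with the highest lift here is not the one with the highest support, so both views are worth showing to the client.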


5. Recommendations

The client can run bundled promotions based on the rules discovered.
The flavor rules show that coffee-flavored and blackberry-flavored items tend to be bought together, which suggests that customers enjoy this combination. We therefore recommend that the client build bundles around flavor combinations on the menu. In addition, customers who buy a Vanilla Frappucino and a Walnut Cookie are likely to also buy a Chocolate Tart, so the client can sell these three items as a bundle. The client can also offer discounts and promotions on items that are frequently bought together. For instance, customers who buy a coffee drink could get a discounted price on an eclair, pie, or twist.

R Code

Imported libraries

library(arules)
library(arulesViz)
library(ggplot2)

Association Rule Mining

Data preprocessing

Import the dataset

Load the dataset and assign header names to each column

receipt_df <- read.csv("1000i.csv", header = FALSE)
names(receipt_df) <- c("Receipt_Number","Quantity","Food")

Before preprocessing

##   Receipt_Number Quantity Food
## 1              1        3    7
## 2              1        4   15
## 3              1        2   49
## 4              1        5   44
## 5              2        1    1
## 6              2        2   19

Data preprocessing

Create a dataframe containing each item and its corresponding item_ID

id <- c(0:49)
food <- c("Chocolate Cake","Lemon Cake","Casino Cake","Opera Cake", "Strawberry Cake", "Truffle Cake", "Chocolate Eclair", "Coffee Eclair", "Vanilla Eclair", "Napolean Cake", "Almond Tart", "Apple Pie", "Apple Tart","Apricot Tart", "Berry Tart", "Blackberry Tart", "Blueberry Tart", "Chocolate Tart", "Cherry Tart", "Lemon Tart", "Pecan Tart", "Ganache Cookie", "Gongolais Cookie", "Raspberry Cookie", "Lemon Cookie", "Chocolate Meringue", "Vanilla Meringue", "Marzipan Cookie", "Tuile Cookie", "Walnut Cookie", "Almond Croissant", "Apple Croissant", "Apricot Croissant", "Cheese Croissant", "Chocolate Croissant", "Apricot Danish", "Apple Danish", "Almond Twist", "Almond Bear_Claw", "Blueberry Danish", "Lemon Lemonade", "Raspberry Lemonade", "Orange Juice", "Green Tea", "Bottled Water", "Hot Coffee", "Chocolate Coffee", "Vanilla Frappucino", "Cherry Soda", "Single Espresso")
df <- data.frame(id, food)

Map item_ID to its text representation

receipt_df$Food <- df$food[match(receipt_df$Food,df$id)]

Separating food into “Flavor” and “Type” representation

ft <- matrix(unlist(strsplit(as.character(receipt_df$Food), ' ')) , ncol=2, byrow=TRUE)
receipt_df <- data.frame(receipt_df, ft)
names(receipt_df) <- c("Receipt_Number","Quantity","Food", "Flavor", "Type")
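
The split above relies on every food name consisting of exactly two space-separated words, since `matrix(..., ncol = 2)` would silently misalign the columns otherwise. A minimal sketch of a guard for that assumption, using three names from the item list (the variable names `sample_food` and `parts` are illustrative):

```r
# Three names taken from the item list; "Bear_Claw" keeps its underscore
sample_food <- c("Coffee Eclair", "Almond Bear_Claw", "Single Espresso")

parts <- strsplit(sample_food, " ", fixed = TRUE)

# Fail early if any name does not split into exactly two words
stopifnot(all(lengths(parts) == 2))

ft_check <- matrix(unlist(parts), ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("Flavor", "Type")))
print(ft_check)
```

This is presumably why "Bear_Claw" is written with an underscore in the item list: it keeps the name at two words.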

After preprocessing

head(receipt_df)
##   Receipt_Number Quantity            Food     Flavor     Type
## 1              1        3   Coffee Eclair     Coffee   Eclair
## 2              1        4 Blackberry Tart Blackberry     Tart
## 3              1        2 Single Espresso     Single Espresso
## 4              1        5   Bottled Water    Bottled    Water
## 5              2        1      Lemon Cake      Lemon     Cake
## 6              2        2      Lemon Tart      Lemon     Tart

Convert into basket format to run in apriori

test_df <- receipt_df[,c("Receipt_Number","Food", "Flavor", "Type")]
df_trans <- as(split(test_df$Food, test_df$Receipt_Number), "transactions")
df_trans_Flavor <- as(split(test_df$Flavor, test_df$Receipt_Number), "transactions")
df_trans_Type <- as(split(test_df$Type, test_df$Receipt_Number), "transactions")
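
One detail worth checking for the Flavor and Type baskets: a receipt can contain the same flavor or type more than once (receipt 2 in the preview holds Lemon Cake and Lemon Tart, so "Lemon" appears twice). How duplicated items are handled during coercion may depend on the arules version, which is an assumption to verify locally. A base R sketch of removing duplicates beforehand, using the six preview rows (`toy`, `raw_baskets`, and `flavor_baskets` are illustrative names):

```r
# Flavor column for the six rows shown in the "After preprocessing" output
toy <- data.frame(
  Receipt_Number = c(1, 1, 1, 1, 2, 2),
  Flavor = c("Coffee", "Blackberry", "Single", "Bottled", "Lemon", "Lemon"),
  stringsAsFactors = FALSE
)

raw_baskets <- split(toy$Flavor, toy$Receipt_Number)

# Keep each flavor at most once per receipt
flavor_baskets <- lapply(raw_baskets, unique)
print(flavor_baskets)
```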

Association Rule Mining (Item)

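# Start timer to measure running time
ptm <- proc.time()
rules <- apriori(df_trans,
                 control = list(verbose = FALSE),
                 parameter = list(supp = 0.015, conf = 0.9))

# Flag redundant rules: a rule is redundant if its itemset contains the itemset of an earlier rule.
# Note: recent arules versions may return a sparse matrix from is.subset(); if the
# assignment below fails, passing sparse = FALSE to is.subset() is a possible fix.
subset.matrix <- is.subset(rules, rules)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1

# Remove redundant rules
rules.pruned <- rules[!redundant]
rules <- rules.pruned

# End timer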
proc.time() - ptm
##    user  system elapsed 
##    0.06    0.00    0.06
##                                                        rules support
## 24                  {Almond Twist,Hot Coffee} => {Apple Pie}   0.024
## 5               {Green Tea,Lemon Lemonade} => {Lemon Cookie}   0.019
## 25                 {Coffee Eclair,Hot Coffee} => {Apple Pie}   0.024
## 1     {Chocolate Tart,Walnut Cookie} => {Vanilla Frappucino}   0.018
## 18          {Green Tea,Lemon Cookie} => {Raspberry Lemonade}   0.019
## 21      {Green Tea,Raspberry Cookie} => {Raspberry Lemonade}   0.019
## 7         {Green Tea,Lemon Lemonade} => {Raspberry Lemonade}   0.019
## 10     {Lemon Cookie,Lemon Lemonade} => {Raspberry Lemonade}   0.028
## 30        {Apricot Croissant,Hot Coffee} => {Blueberry Tart}   0.032
## 36                {Apple Danish,Cherry Soda} => {Apple Tart}   0.031
## 37             {Apple Croissant,Cherry Soda} => {Apple Tart}   0.031
## 32   {Lemon Cookie,Raspberry Lemonade} => {Raspberry Cookie}   0.029
## 15 {Lemon Lemonade,Raspberry Lemonade} => {Raspberry Cookie}   0.028
## 34              {Apricot Danish,Opera Cake} => {Cherry Tart}   0.038
## 19            {Green Tea,Lemon Cookie} => {Raspberry Cookie}   0.019
## 28        {Casino Cake,Chocolate Cake} => {Chocolate Coffee}   0.038
## 8           {Green Tea,Lemon Lemonade} => {Raspberry Cookie}   0.019
## 13       {Lemon Cookie,Lemon Lemonade} => {Raspberry Cookie}   0.028
## 38            {Apple Danish,Apple Tart} => {Apple Croissant}   0.040
## 41           {Apple Danish,Cherry Soda} => {Apple Croissant}   0.031
## 26              {Almond Twist,Hot Coffee} => {Coffee Eclair}   0.024
## 3       {Blackberry Tart,Single Espresso} => {Coffee Eclair}   0.023
## 22               {Almond Twist,Apple Pie} => {Coffee Eclair}   0.027
##    confidence     lift
## 24  0.9600000 14.11765
## 5   0.9047619 13.70851
## 25  0.9230769 13.57466
## 1   1.0000000 13.51351
## 18  0.9500000 13.19444
## 21  0.9500000 13.19444
## 7   0.9047619 12.56614
## 10  0.9032258 12.54480
## 30  1.0000000 12.34568
## 36  0.9393939 11.89106
## 37  0.9393939 11.89106
## 32  0.9666667 11.78862
## 15  0.9655172 11.77460
## 34  0.9743590 11.59951
## 19  0.9500000 11.58537
## 28  0.9500000 11.17647
## 8   0.9047619 11.03368
## 13  0.9032258 11.01495
## 38  0.9756098 10.72099
## 41  0.9393939 10.32301
## 26  0.9600000 10.32258
## 3   0.9583333 10.30466
## 22  0.9310345 10.01112
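
The pruning step can be hard to follow from the matrix code alone. The sketch below reproduces the same idea in base R on three hypothetical itemsets (`r1`, `r2`, `r3` are made up for illustration): a rule is flagged when its itemset contains the itemset of a rule that comes earlier in the list.

```r
# Hypothetical itemsets (lhs and rhs combined), in the order the rules are stored
itemsets <- list(
  r1 = c("A", "B"),
  r2 = c("A", "B", "C"),
  r3 = c("C", "D")
)
n <- length(itemsets)

# sub[i, j] is TRUE when itemset i is contained in itemset j
sub <- outer(seq_len(n), seq_len(n),
             Vectorize(function(i, j) all(itemsets[[i]] %in% itemsets[[j]])))

# Keep only comparisons against earlier rules (upper triangle, no diagonal)
sub[lower.tri(sub, diag = TRUE)] <- NA

# Rule j is redundant if at least one earlier rule is a subset of it
redundant <- colSums(sub, na.rm = TRUE) >= 1
names(redundant) <- names(itemsets)

kept <- itemsets[!redundant]
print(names(kept))
```

Because only earlier rules are compared, the result depends on the order of the rules, so sorting them before pruning changes which ones survive. arules also provides `is.redundant()`, which uses a different, confidence-based definition of redundancy.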

Association Rule Mining (Flavor)

# Start timer
ptm <- proc.time()
rules2 <- apriori(df_trans_Flavor,
                  control = list(verbose = FALSE),
                  parameter = list(supp = 0.005, conf = 0.7))

# Flag redundant rules
subset.matrix <- is.subset(rules2, rules2)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1

# Remove redundant rules
rules2.pruned <- rules2[!redundant]
rules2 <- rules2.pruned

# End timer
proc.time() - ptm
##    user  system elapsed 
##    0.04    0.00    0.05
##                                      rules support confidence      lift
## 1          {Blackberry,Single} => {Coffee}   0.023  0.9583333 10.304659
## 10       {Cheese,Strawberry} => {Napolean}   0.005  0.8333333  9.259259
## 11           {Marzipan,Vanilla} => {Tuile}   0.008  0.8888889  8.714597
## 21 {Apricot,Chocolate,Marzipan} => {Tuile}   0.005  0.8333333  8.169935
## 12                {Coffee,Hot} => {Almond}   0.025  0.9615385  5.623032
## 8              {Cherry,Opera} => {Apricot}   0.038  0.8636364  4.406308
## 17            {Blueberry,Hot} => {Apricot}   0.034  0.8500000  4.336735
## 14                 {Coffee,Hot} => {Apple}   0.024  0.9230769  3.978780
## 4             {Green,Raspberry} => {Lemon}   0.020  0.8333333  3.858025
## 5          {Vanilla,Walnut} => {Chocolate}   0.023  0.8846154  3.442083
## 15              {Almond,Coffee} => {Apple}   0.031  0.7948718  3.426172
## 19                 {Almond,Hot} => {Apple}   0.026  0.7878788  3.396029
## 28   {Apple,Cherry,Vanilla} => {Chocolate}   0.008  0.8000000  3.112840
## 27 {Apple,Blueberry,Cherry} => {Chocolate}   0.006  0.7500000  2.918288
## 7        {Casino,Gongolais} => {Chocolate}   0.005  0.7142857  2.779322

Association Rule Mining (Type)

# Start timer
ptm <- proc.time()
rules3 <- apriori(df_trans_Type,
                  control = list(verbose = FALSE),
                  parameter = list(supp = 0.010, conf = 0.8))

# Flag redundant rules
subset.matrix <- is.subset(rules3, rules3)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1

# Remove redundant rules
rules3.pruned <- rules3[!redundant]
rules3 <- rules3.pruned

# End timer
proc.time() - ptm
##    user  system elapsed 
##    0.03    0.00    0.03
##                           rules support confidence     lift
## 3       {Pie,Twist} => {Eclair}   0.028  0.9655172 6.034483
## 9    {Coffee,Twist} => {Eclair}   0.024  0.8888889 5.555556
## 10     {Coffee,Pie} => {Eclair}   0.024  0.8000000 5.000000
## 6       {Pie,Twist} => {Coffee}   0.024  0.8275862 4.675628
## 11 {Danish,Soda} => {Croissant}   0.034  0.8717949 2.830503
## 2    {Lemonade,Tea} => {Cookie}   0.022  0.8800000 2.162162
## 1   {Eclair,Espresso} => {Tart}   0.026  0.9285714 1.673102
## 12      {Danish,Soda} => {Tart}   0.035  0.8974359 1.617002
## 13   {Croissant,Soda} => {Tart}   0.036  0.8181818 1.474201
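
The Type rules that point to Tart have high confidence but low lift. Since lift = confidence / support(rhs), the support of the right-hand side can be recovered from the two reported columns. The sketch below does this for two rules copied from the output above; `type_rules` is an illustrative name.

```r
# Confidence and lift copied from the Type rules listed above
type_rules <- data.frame(
  rule = c("{Pie,Twist} => {Eclair}", "{Eclair,Espresso} => {Tart}"),
  confidence = c(0.9655172, 0.9285714),
  lift = c(6.034483, 1.673102),
  stringsAsFactors = FALSE
)

# lift = confidence / support(rhs), so support(rhs) = confidence / lift
type_rules$rhs_support <- round(type_rules$confidence / type_rules$lift, 3)
print(type_rules)
```

The recovered values suggest that Eclair appears in about 16% of receipts while Tart appears in about 55.5%. A rule that predicts Tart is therefore less informative than its confidence alone implies, because more than half of all receipts contain a tart anyway.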

Visualization

Some of the plots from Code.r are shown here; refer to the Shiny app for interactivity.

reorder_size <- function(x) {
  factor(x, levels = names(sort(table(x))))
}
ggplot(data = receipt_df, aes(x = reorder_size(Food), fill = as.factor(Quantity))) + geom_bar(colour = "black") + coord_flip()

ggplot(data = receipt_df, aes(x = reorder_size(Food), fill = as.factor(Quantity))) + geom_bar(colour = "black") + facet_grid(as.factor(Quantity)~.) + theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Scatter plot of the item rules: support vs. lift, shaded by confidence
plot(rules, measure=c("support","lift"), shading="confidence")

# Same plot for the flavor rules (rules2)
plot(rules2, measure=c("support","lift"), shading="confidence")