Conclusion

This project is my attempt to apply association rules to real-world data. There are many valuable tutorials, and one of them covers not only the abstract concepts but also the practical application of the arules package in R. The article is on CRAN. Besides, I use data from here. The data is bundled with a tutorial, but that article contains many mistakes. Thus, I gave up on following it and tried to understand and practice association rules on my own. Hope it works!

library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(kableExtra)
rm(list = ls())  # clear the workspace

# Read the grocery data, parsing the date column as m/d/Y
df <- read_csv("D:/R_programming/DSprogramming/groceries.csv",
               col_types = cols(
                 Member_number = col_character(),
                 Date = col_date("%m/%d/%Y"),
                 itemDescription = col_character()))
names(df) <- c("member_ID", "date", "item")  # shorter, consistent column names

Let’s take a look at the data!

glimpse(df)
## Observations: 260
## Variables: 3
## $ member_ID <chr> "1.61905E+12", "1.67903E+12", "1.63404E+12", "1.6560...
## $ date      <date> 2012-10-22, 2010-10-03, 2011-07-07, 2010-03-13, 201...
## $ item      <chr> "citrus fruit", "tropical fruit", "whole milk", "pip...

This grocery data has three columns: member_ID, date, and the item bought. Each observation contains a single item, so if a customer buys 5 items at the same time, that accounts for 5 observations. Now I am going to use group_by to check the actual number of transactions.

df %>% group_by(member_ID, date) %>%
  count(sort = T)
## # A tibble: 199 x 3
## # Groups:   member_ID, date [199]
##    member_ID   date           n
##    <chr>       <date>     <int>
##  1 1.61607E+12 2012-12-30     7
##  2 1.61201E+12 2013-09-07     6
##  3 1.69201E+12 2011-01-30     6
##  4 1.60604E+12 2010-08-19     4
##  5 1.61103E+12 2012-12-03     4
##  6 1.61708E+12 2014-11-23     4
##  7 1.62604E+12 2010-01-17     4
##  8 1.62609E+12 2010-07-20     4
##  9 1.63005E+12 2012-10-22     4
## 10 1.65111E+12 2013-11-15     4
## # ... with 189 more rows

arules needs a single primary key, but the primary key in df is a composite one (member_ID plus date). I convert it below: group_id becomes the new single primary key of df2. We can check the result with glimpse(df2) and df2 %>% group_by(group_id) %>% count(sort = T) %>% glimpse().

df2 <- df %>%
  unite(gawy, c(member_ID, date), sep = "-")  # fuse the composite key into one column
df2$group_id <- df2 %>% group_indices(gawy)   # one integer ID per (member, date) pair
df2$gawy <- NULL                              # drop the helper column
df2 <- df2 %>% select(group_id, item) %>% arrange(group_id)
glimpse(df2)
## Observations: 260
## Variables: 2
## $ group_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 13...
## $ item     <chr> "newspapers", "shopping bags", "chewing gum", "bottle...
df2 %>% group_by(group_id) %>% count(sort = T) %>% glimpse()
## Observations: 199
## Variables: 2
## $ group_id <int> 38, 27, 190, 13, 25, 39, 52, 53, 64, 113, 117, 119, 1...
## $ n        <int> 7, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3,...

After completing the transformation into df2, we need functions from the arules package to continue: read.transactions() reads the file back in as a transactions object.

# write.csv(df2, "df2.csv")  # already written to disk
# Read df2.csv back in "single" format: one (transaction ID, item) pair per row
order_trans <- read.transactions(
  file = "df2.csv",
  format = "single",
  sep = ",",
  cols = c("group_id", "item"),
  rm.duplicates = T  # drop duplicate items within a transaction
)
length(order_trans)
## [1] 199
dim(order_trans)
## [1] 199  73
size(order_trans)
##   [1] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 4 1 4 1 1 1 4 1 1 1 1 2 1 1 4
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 2 1 1 4 1 1 1 1 1 4 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 4 1 6 1 1 1 1 1 1 1 1 1 3 1 7 4 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 4 4 1 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
image(order_trans)

summary(order_trans)
## transactions as itemMatrix in sparse format with
##  199 rows (elements/itemsets/transactions) and
##  73 columns (items) and a density of 0.0176912 
## 
## most frequent items:
##       whole milk       rolls/buns other vegetables          sausage 
##               20               15               14               13 
##      frankfurter          (Other) 
##               12              183 
## 
## element (itemset/transaction) length distribution:
## sizes
##   1   2   3   4   6   7 
## 177   4   5  11   1   1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.291   1.000   7.000 
## 
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2    baking powder
## 3 bathroom cleaner
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2            10
## 3           100

order_trans is a data type specific to the arules package: a sparse matrix with transactions as rows and items as columns. If an item appears in a transaction, the corresponding value is 1, otherwise 0. size(order_trans) reports the number of items bought in each transaction, and image(order_trans) gives us a view of the big picture. It is useful for spotting conditions such as items being out of stock.
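As a quick sanity check, we can print a few transactions as itemsets; this is a minimal sketch, assuming order_trans is loaded as above.

inspect(order_trans[1:3])  # show the first three transactions as itemsets

Each printed itemset corresponds to one transactionID and should match the rows of df2 that share a group_id.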

itemFrequencyPlot(order_trans, topN=10,type="absolute")

We can take a look at the frequency plot. It is clear that whole milk is a hot item. So are rolls/buns.
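If exact counts are needed, itemFrequency() returns the same information numerically; a minimal sketch:

sort(itemFrequency(order_trans, type = "absolute"), decreasing = TRUE)[1:10]  # top ten item counts

The counts for whole milk and rolls/buns should match the 20 and 15 reported by summary(order_trans) above.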

# Mine rules with minimum support 0.01 and minimum confidence 0.5
basket_rules <- apriori(order_trans, parameter = list(sup = 0.01, conf = 0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[73 item(s), 199 transaction(s)] done [0.00s].
## sorting and recoding items ... [48 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules <- sort(basket_rules, decreasing = TRUE, by = "confidence")  # order the rules by confidence

Apriori is one of the classic algorithms for mining association rules. Using the apriori() function, order_trans is transformed into rules, another data type specific to association rule mining. Let’s take a look at it.

inspect(rules)
##      lhs                                       rhs                        
## [1]  {brown bread}                          => {other vegetables}         
## [2]  {packaged fruit/vegetables,soda}       => {whole milk}               
## [3]  {packaged fruit/vegetables,whole milk} => {soda}                     
## [4]  {soda,whole milk}                      => {packaged fruit/vegetables}
## [5]  {canned beer,yogurt}                   => {other vegetables}         
## [6]  {canned beer,other vegetables}         => {yogurt}                   
## [7]  {other vegetables,yogurt}              => {canned beer}              
## [8]  {yogurt}                               => {canned beer}              
## [9]  {yogurt}                               => {other vegetables}         
## [10] {packaged fruit/vegetables}            => {soda}                     
## [11] {packaged fruit/vegetables}            => {whole milk}               
##      support    confidence lift     count
## [1]  0.01005025 1.0000000  14.21429 2    
## [2]  0.01005025 1.0000000   9.95000 2    
## [3]  0.01005025 1.0000000  28.42857 2    
## [4]  0.01005025 1.0000000  49.75000 2    
## [5]  0.01005025 1.0000000  14.21429 2    
## [6]  0.01005025 1.0000000  66.33333 2    
## [7]  0.01005025 1.0000000  33.16667 2    
## [8]  0.01005025 0.6666667  22.11111 2    
## [9]  0.01005025 0.6666667   9.47619 2    
## [10] 0.01005025 0.5000000  14.21429 2    
## [11] 0.01005025 0.5000000   4.97500 2
summary(rules)
## set of 11 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 6 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.545   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence          lift            count  
##  Min.   :0.01005   Min.   :0.5000   Min.   : 4.975   Min.   :2  
##  1st Qu.:0.01005   1st Qu.:0.6667   1st Qu.:12.082   1st Qu.:2  
##  Median :0.01005   Median :1.0000   Median :14.214   Median :2  
##  Mean   :0.01005   Mean   :0.8485   Mean   :24.258   Mean   :2  
##  3rd Qu.:0.01005   3rd Qu.:1.0000   3rd Qu.:30.798   3rd Qu.:2  
##  Max.   :0.01005   Max.   :1.0000   Max.   :66.333   Max.   :2  
## 
## mining info:
##         data ntransactions support confidence
##  order_trans           199    0.01        0.5
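With only 11 rules everything fits on screen, but larger rule sets are usually pruned before inspection. A minimal sketch using subset() (the lift threshold here is an arbitrary choice for illustration):

inspect(subset(rules, subset = lift > 10))              # keep only high-lift rules
inspect(subset(rules, subset = rhs %in% "whole milk"))  # rules predicting whole milk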

The rules output illustrates the basic quantities underlying association rules: support (how often the whole itemset appears), confidence (how often the rule holds when the left-hand side appears), and lift (how much more often the right-hand side co-occurs with the left-hand side than expected by chance).
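To make these quantities concrete, here is a minimal sketch recomputing them by hand for rule [1], {brown bread} => {other vegetables}, assuming df2 is still in the workspace.

n_trans <- n_distinct(df2$group_id)                          # 199 transactions in total
lhs <- unique(df2$group_id[df2$item == "brown bread"])       # transactions containing the LHS
rhs <- unique(df2$group_id[df2$item == "other vegetables"])  # transactions containing the RHS
both <- length(intersect(lhs, rhs))                          # transactions containing both
c(support    = both / n_trans,                               # 2/199      = 0.01005
  confidence = both / length(lhs),                           # 2/2        = 1
  lift       = (both / length(lhs)) / (length(rhs) / n_trans))  # 1/(14/199) = 14.21

These values should match rule [1] in the inspect() output above. We can also check the rules in plot form.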

plot(basket_rules)

plot(rules, method = "grouped", control = list(k = 5))

plot(rules, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## main  =  Graph for 11 rules
## nodeColors    =  c("#66CC6680", "#9999CC80")
## nodeCol   =  c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF",  "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF",  "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol   =  c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF",  "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF",  "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha     =  0.5
## cex   =  1
## itemLabels    =  TRUE
## labelCol  =  #000000B3
## measureLabels     =  FALSE
## precision     =  3
## layout    =  NULL
## layoutParams  =  list()
## arrowSize     =  0.5
## engine    =  igraph
## plot  =  TRUE
## plot_options  =  list()
## max   =  100
## verbose   =  FALSE

plot(rules, method = "paracoord",  control = list(alpha = .5, reorder = TRUE))

To conclude, I applied the concepts I learned and tried many useful packages, such as dplyr and arules, in this project. I spent much time understanding association rules and their application in real life. I hope to handle real-world data even better next time!