Conclusion
This project is my trial to apply association rules to the real-world data. There are many precious tutorials and one of them not only talk about abstract concepts but also practical application of arules
package in R. The article is on CRAN. Besides, I use data from here. The data is bundled with a tutorial, though there are many mistakes in that article. Thus, I give up and try to understand and practice association rulses on my own. Hope it works!
library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(kableExtra)
rm(list=ls())
df <- read_csv("D:/R_programming/DSprogramming/groceries.csv",
col_types = cols(
Member_number = col_character(),
Date = col_date("%m/%d/%Y"),
itemDescription = col_character()))
names(df) <- c("member_ID", "date", "item")
Let’s take a look at the data!
glimpse(df)
## Observations: 260
## Variables: 3
## $ member_ID <chr> "1.61905E+12", "1.67903E+12", "1.63404E+12", "1.6560...
## $ date <date> 2012-10-22, 2010-10-03, 2011-07-07, 2010-03-13, 201...
## $ item <chr> "citrus fruit", "tropical fruit", "whole milk", "pip...
There are three columns of this grocery data, which are member_ID, date, and items bought. Each observation contains an item only, so if a customer buy 5 items in the same time, it will account for 5 observations. Now I am goint to use group_by
to check the real number of transaction.
df %>% group_by(member_ID, date) %>%
count(sort = T)
## # A tibble: 199 x 3
## # Groups: member_ID, date [199]
## member_ID date n
## <chr> <date> <int>
## 1 1.61607E+12 2012-12-30 7
## 2 1.61201E+12 2013-09-07 6
## 3 1.69201E+12 2011-01-30 6
## 4 1.60604E+12 2010-08-19 4
## 5 1.61103E+12 2012-12-03 4
## 6 1.61708E+12 2014-11-23 4
## 7 1.62604E+12 2010-01-17 4
## 8 1.62609E+12 2010-07-20 4
## 9 1.63005E+12 2012-10-22 4
## 10 1.65111E+12 2013-11-15 4
## # ... with 189 more rows
arules
needs a primary key, but primary key in df
is composite key. I convert it below. group_id
is the new single primary key of df2
. We can take a look at glimpse(df2)
and df2 %>% group_by(group_id) %>% count(sort = T) %>% glimpse()
df2 <- df %>%
unite(gawy, c(member_ID, date), sep = "-")
df2$group_id <- df2 %>% group_indices(gawy)
df2$gawy <- NULL
df2 <- df2 %>% select(group_id, item) %>% arrange(group_id)
glimpse(df2)
## Observations: 260
## Variables: 2
## $ group_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 13...
## $ item <chr> "newspapers", "shopping bags", "chewing gum", "bottle...
df2 %>% group_by(group_id) %>% count(sort = T) %>% glimpse()
## Observations: 199
## Variables: 2
## $ group_id <int> 38, 27, 190, 13, 25, 39, 52, 53, 64, 113, 117, 119, 1...
## $ n <int> 7, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3,...
After completing the transformation of df2, it is needed to use function in arules
to continue.
#write.csv(df2, "df2.csv") I have wrriten it.
order_trans <- read.transactions(
file = "df2.csv",
format = "single",
sep = ",",
cols=c("group_id","item"),
rm.duplicates = T
)
length(order_trans)
## [1] 199
dim(order_trans)
## [1] 199 73
size(order_trans)
## [1] 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 4 1 1 1 4 1 4 1 1 1 4 1 1 1 1 2 1 1 4
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 2 1 1 4 1 1 1 1 1 4 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 4 1 6 1 1 1 1 1 1 1 1 1 3 1 7 4 1 1 1 1 1 1 1
## [141] 1 1 1 1 1 1 1 4 4 1 3 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [176] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
image(order_trans)
summary(order_trans)
## transactions as itemMatrix in sparse format with
## 199 rows (elements/itemsets/transactions) and
## 73 columns (items) and a density of 0.0176912
##
## most frequent items:
## whole milk rolls/buns other vegetables sausage
## 20 15 14 13
## frankfurter (Other)
## 12 183
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 6 7
## 177 4 5 11 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.291 1.000 7.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 baking powder
## 3 bathroom cleaner
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 10
## 3 100
order_trans
is a specific data type in arules
package. It is a sparse matrix consists of transactions as row and items as columns. If items appear in a transcation, the value will be 1, and 0 vice versa. size(order_trans)
indicates items bought in each transcation. image(order_trans)
gives us a broader look to the big picture. It is useful to check the condition such as whether some items are out of shelter.
itemFrequencyPlot(order_trans, topN=10,type="absolute")
We can take a look at the frequency plot. It is clear that whole milk is a hot item. So are rolls/buns.
basket_rules <- apriori(order_trans, parameter = list(sup = 0.01, conf = 0.5,target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[73 item(s), 199 transaction(s)] done [0.00s].
## sorting and recoding items ... [48 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [11 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules <- sort(basket_rules, decreasing = TRUE, by = "confidence")
apriori
is one of algorithm of association rules. Using apriori
function, order_trans
are transformed into rules
, another specific data type related to association rules. Let’s take a look at it.
inspect(rules)
## lhs rhs
## [1] {brown bread} => {other vegetables}
## [2] {packaged fruit/vegetables,soda} => {whole milk}
## [3] {packaged fruit/vegetables,whole milk} => {soda}
## [4] {soda,whole milk} => {packaged fruit/vegetables}
## [5] {canned beer,yogurt} => {other vegetables}
## [6] {canned beer,other vegetables} => {yogurt}
## [7] {other vegetables,yogurt} => {canned beer}
## [8] {yogurt} => {canned beer}
## [9] {yogurt} => {other vegetables}
## [10] {packaged fruit/vegetables} => {soda}
## [11] {packaged fruit/vegetables} => {whole milk}
## support confidence lift count
## [1] 0.01005025 1.0000000 14.21429 2
## [2] 0.01005025 1.0000000 9.95000 2
## [3] 0.01005025 1.0000000 28.42857 2
## [4] 0.01005025 1.0000000 49.75000 2
## [5] 0.01005025 1.0000000 14.21429 2
## [6] 0.01005025 1.0000000 66.33333 2
## [7] 0.01005025 1.0000000 33.16667 2
## [8] 0.01005025 0.6666667 22.11111 2
## [9] 0.01005025 0.6666667 9.47619 2
## [10] 0.01005025 0.5000000 14.21429 2
## [11] 0.01005025 0.5000000 4.97500 2
summary(rules)
## set of 11 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 6
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.545 3.000 3.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.01005 Min. :0.5000 Min. : 4.975 Min. :2
## 1st Qu.:0.01005 1st Qu.:0.6667 1st Qu.:12.082 1st Qu.:2
## Median :0.01005 Median :1.0000 Median :14.214 Median :2
## Mean :0.01005 Mean :0.8485 Mean :24.258 Mean :2
## 3rd Qu.:0.01005 3rd Qu.:1.0000 3rd Qu.:30.798 3rd Qu.:2
## Max. :0.01005 Max. :1.0000 Max. :66.333 Max. :2
##
## mining info:
## data ntransactions support confidence
## order_trans 199 0.01 0.5
rules
tell about the basic concept underling association rules. We can also check that in the plot form.
plot(basket_rules)
plot(rules, method = "grouped", control = list(k = 5))
plot(rules, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## main = Graph for 11 rules
## nodeColors = c("#66CC6680", "#9999CC80")
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF", "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF", "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## edgeCol = c("#474747FF", "#494949FF", "#4B4B4BFF", "#4D4D4DFF", "#4F4F4FFF", "#515151FF", "#535353FF", "#555555FF", "#575757FF", "#595959FF", "#5B5B5BFF", "#5E5E5EFF", "#606060FF", "#626262FF", "#646464FF", "#666666FF", "#686868FF", "#6A6A6AFF", "#6C6C6CFF", "#6E6E6EFF", "#707070FF", "#727272FF", "#747474FF", "#767676FF", "#787878FF", "#7A7A7AFF", "#7C7C7CFF", "#7E7E7EFF", "#808080FF", "#828282FF", "#848484FF", "#868686FF", "#888888FF", "#8A8A8AFF", "#8C8C8CFF", "#8D8D8DFF", "#8F8F8FFF", "#919191FF", "#939393FF", "#959595FF", "#979797FF", "#999999FF", "#9A9A9AFF", "#9C9C9CFF", "#9E9E9EFF", "#A0A0A0FF", "#A2A2A2FF", "#A3A3A3FF", "#A5A5A5FF", "#A7A7A7FF", "#A9A9A9FF", "#AAAAAAFF", "#ACACACFF", "#AEAEAEFF", "#AFAFAFFF", "#B1B1B1FF", "#B3B3B3FF", "#B4B4B4FF", "#B6B6B6FF", "#B7B7B7FF", "#B9B9B9FF", "#BBBBBBFF", "#BCBCBCFF", "#BEBEBEFF", "#BFBFBFFF", "#C1C1C1FF", "#C2C2C2FF", "#C3C3C4FF", "#C5C5C5FF", "#C6C6C6FF", "#C8C8C8FF", "#C9C9C9FF", "#CACACAFF", "#CCCCCCFF", "#CDCDCDFF", "#CECECEFF", "#CFCFCFFF", "#D1D1D1FF", "#D2D2D2FF", "#D3D3D3FF", "#D4D4D4FF", "#D5D5D5FF", "#D6D6D6FF", "#D7D7D7FF", "#D8D8D8FF", "#D9D9D9FF", "#DADADAFF", "#DBDBDBFF", "#DCDCDCFF", "#DDDDDDFF", "#DEDEDEFF", "#DEDEDEFF", "#DFDFDFFF", "#E0E0E0FF", "#E0E0E0FF", "#E1E1E1FF", "#E1E1E1FF", "#E2E2E2FF", "#E2E2E2FF", "#E2E2E2FF")
## alpha = 0.5
## cex = 1
## itemLabels = TRUE
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## layout = NULL
## layoutParams = list()
## arrowSize = 0.5
## engine = igraph
## plot = TRUE
## plot_options = list()
## max = 100
## verbose = FALSE
plot(rules, method = "paracoord", control = list(alpha = .5, reorder = TRUE))
To conclude, I applied the concepts learned and tried many useful packages such as dplyr
, arules
in this porject. I spent much time in understanding association rules and application in real life. I wish I could hand the real world data next time!