I’m currently following Pattern Discovery in Data Mining on Coursera from Illinois university and was introduced to association rule mining, a method of finding frequent itemsets in large databases. I then found a chapter in Yanchang Zhao’s R and Data Mining on the subject which, despite being pretty savagely ripped apart as a whole in reviews for being impenetrable and low quality, was helpful in this instance, in combination with the lectures, in showing the practical applications. Here’s a run through of an example from Zhao using data on survivors from the sinking of the Titanic.

## Support, Confidence and Lift

First, a little background theory. An association rule of the form $A=>B$ states that there is a correlation or association between occurences of the itemset A, known as the left hand side lhs or antecedent, and the itemset B, known as the right hand side rhs or consequent.

We can define the value of a rule using a number of different measures. The simplest is the Support, which gives the frequency or number of occurences of the combined itemset ${A}\cup{B}$ in the database. The Confidence gives the relative frequency of occurences of A that also contain B:

$Conf(A=>B)=P(B|A)=\frac{P({A}\cup{B})}{P(A)}$

The final measure used here is the Lift, which gives the ratio of the confidence to the relative frequency of B:

$Lift(A=>B)=\frac{P({A}\cup{B})}{P(A)P(B)}$

In combination we can use these measures to sort and filter our rule set to find the most interesting patterns.

</div>

## Data: The Titanic

The dataset I’m going to use contains information on passengers of the Titanic when it sank, including their age, sex, the class they were travelling in, and whther they survived or not; pretty morbid stuff, but large enough to hopefully throw up some interesting patterns. It’s included in the datasets package in R, but you can download the original data from here.

str(Titanic)
##  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$Class : chr [1:4] "1st" "2nd" "3rd" "Crew" ## ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$Age : chr [1:2] "Child" "Adult" ## ..$ Survived: chr [1:2] "No" "Yes"
df<-as.data.frame(Titanic)
head(df)
##   Class    Sex   Age Survived Freq
## 1   1st   Male Child       No    0
## 2   2nd   Male Child       No    0
## 3   3rd   Male Child       No   35
## 4  Crew   Male Child       No    0
## 5   1st Female Child       No    0
## 6   2nd Female Child       No    0

After converting to a data frame we can see that R has summarised each combination of attributes in to a dense table with a frequency column. We want to break this down in to individual “transactions” for each occurence, where each row represents each passenger.

titanic.raw<-NULL
for(i in 1:4){
titanic.raw<-cbind(titanic.raw,rep(as.character(df[,i]),df\$Freq))
}

titanic.data<-as.data.frame(titanic.raw)
rm(titanic.raw)
names(titanic.data)<-names(df[1:4])
head(titanic.data)
##   Class  Sex   Age Survived
## 1   3rd Male Child       No
## 2   3rd Male Child       No
## 3   3rd Male Child       No
## 4   3rd Male Child       No
## 5   3rd Male Child       No
## 6   3rd Male Child       No

We can see a breakdown of the data using the summary function:

summary(titanic.data)
##   Class         Sex          Age       Survived
##  1st :325   Female: 470   Adult:2092   No :1490
##  2nd :285   Male  :1731   Child: 109   Yes: 711
##  3rd :706
##  Crew:885

## Rule Mining

The example uses the APRIORI algorithm, a level-wise candidate generator from the arules package. The algorithm is written in pseudocode below:

Set k = 1
Scan DB to get frequent k-itemset's
while(frequent candidate set generated)
Generate length k+1 itemsets from length-k frequent itemsets
Test candidates against DB to find frequent k+1 itemsets
k = k+1
end
return all frequent itemsets

Here I call the apriori function on the Titanic data frame, which returns 27 rules printed below along with their support, confidence and lift.

rules.all<-apriori(titanic.data)
##
## parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.8    0.1    1 none FALSE            TRUE     0.1      1     10
##  target   ext
##   rules FALSE
##
## algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules.all
## set of 27 rules
inspect(rules.all)
##    lhs               rhs             support confidence      lift
## 1  {}             => {Age=Adult}   0.9504771  0.9504771 1.0000000
## 2  {Class=2nd}    => {Age=Adult}   0.1185825  0.9157895 0.9635051
## 3  {Class=1st}    => {Age=Adult}   0.1449341  0.9815385 1.0326798
## 4  {Sex=Female}   => {Age=Adult}   0.1930940  0.9042553 0.9513700
## 5  {Class=3rd}    => {Age=Adult}   0.2848705  0.8881020 0.9343750
## 6  {Survived=Yes} => {Age=Adult}   0.2971377  0.9198312 0.9677574
## 7  {Class=Crew}   => {Sex=Male}    0.3916402  0.9740113 1.2384742
## 8  {Class=Crew}   => {Age=Adult}   0.4020900  1.0000000 1.0521033
## 9  {Survived=No}  => {Sex=Male}    0.6197183  0.9154362 1.1639949
## 10 {Survived=No}  => {Age=Adult}   0.6533394  0.9651007 1.0153856
## 11 {Sex=Male}     => {Age=Adult}   0.7573830  0.9630272 1.0132040
## 12 {Sex=Female,
##     Survived=Yes} => {Age=Adult}   0.1435711  0.9186047 0.9664669
## 13 {Class=3rd,
##     Sex=Male}     => {Survived=No} 0.1917310  0.8274510 1.2222950
## 14 {Class=3rd,
##     Survived=No}  => {Age=Adult}   0.2162653  0.9015152 0.9484870
## 15 {Class=3rd,
##     Sex=Male}     => {Age=Adult}   0.2099046  0.9058824 0.9530818
## 16 {Sex=Male,
##     Survived=Yes} => {Age=Adult}   0.1535666  0.9209809 0.9689670
## 17 {Class=Crew,
##     Survived=No}  => {Sex=Male}    0.3044071  0.9955423 1.2658514
## 18 {Class=Crew,
##     Survived=No}  => {Age=Adult}   0.3057701  1.0000000 1.0521033
## 19 {Class=Crew,
##     Sex=Male}     => {Age=Adult}   0.3916402  1.0000000 1.0521033
## 20 {Class=Crew,
##     Age=Adult}    => {Sex=Male}    0.3916402  0.9740113 1.2384742
## 21 {Sex=Male,
##     Survived=No}  => {Age=Adult}   0.6038164  0.9743402 1.0251065
##     Survived=No}  => {Sex=Male}    0.6038164  0.9242003 1.1751385
## 23 {Class=3rd,
##     Sex=Male,
##     Survived=No}  => {Age=Adult}   0.1758292  0.9170616 0.9648435
## 24 {Class=3rd,
##     Survived=No}  => {Sex=Male}    0.1758292  0.8130252 1.0337773
## 25 {Class=3rd,
##     Sex=Male,
##     Age=Adult}    => {Survived=No} 0.1758292  0.8376623 1.2373791
## 26 {Class=Crew,
##     Sex=Male,
##     Survived=No}  => {Age=Adult}   0.3044071  1.0000000 1.0521033
## 27 {Class=Crew,
##     Survived=No}  => {Sex=Male}    0.3044071  0.9955423 1.2658514

Suppose we are only interested in rules satisfying certain constraints, for example those with “Survived” in the consequent. We pass both Survived=Yes and Survived=No to the rhs in the appearance argument to filter only rules satisfying these constraints, and default equal to lhs so that all left hand side arguments (antecedents) are returned.

rules<-apriori(titanic.data,control=list(verbose=F),
parameter = list(minlen=2,supp=0.005,conf=0.8),
appearance = list(rhs=c("Survived=No","Survived=Yes"),default="lhs"))

We also set verbose to False in the control argument to prevent verbose output to console on execution, and minlen equal to two, so that rules with an empty left or right hand side (as seen in the first rule above) are excluded.

Finally we round the support, confidence and lift measures and sort by lift.

quality(rules)<-round(quality(rules),digits = 3)
rules<-sort(rules,by="lift")
inspect(rules)
##    lhs             rhs            support confidence  lift
## 1  {Class=2nd,
##     Age=Child}  => {Survived=Yes}   0.011      1.000 3.096
## 2  {Class=2nd,
##     Sex=Female,
##     Age=Child}  => {Survived=Yes}   0.006      1.000 3.096
## 3  {Class=1st,
##     Sex=Female} => {Survived=Yes}   0.064      0.972 3.010
## 4  {Class=1st,
##     Sex=Female,
##     Age=Adult}  => {Survived=Yes}   0.064      0.972 3.010
## 5  {Class=2nd,
##     Sex=Female} => {Survived=Yes}   0.042      0.877 2.716
## 6  {Class=Crew,
##     Sex=Female} => {Survived=Yes}   0.009      0.870 2.692
## 7  {Class=Crew,
##     Sex=Female,
##     Age=Adult}  => {Survived=Yes}   0.009      0.870 2.692
## 8  {Class=2nd,
##     Sex=Female,
##     Age=Adult}  => {Survived=Yes}   0.036      0.860 2.663
## 9  {Class=2nd,
##     Sex=Male,
##     Age=Adult}  => {Survived=No}    0.070      0.917 1.354
## 10 {Class=2nd,
##     Sex=Male}   => {Survived=No}    0.070      0.860 1.271
## 11 {Class=3rd,
##     Sex=Male,
##     Age=Adult}  => {Survived=No}    0.176      0.838 1.237
## 12 {Class=3rd,
##     Sex=Male}   => {Survived=No}    0.192      0.827 1.222

From this quick analysis we have already mined some interesting patterns. Children travelling second class all survived the sinking, evidenced by the hundred percent confidence in this rule. It is also the rule in our subset with the highest lift (711 of 2201 people survived the sinking, so there is a 32.3% chance of survival. $\frac{Conf(A=>B)}{P(B)} = 1/32.3 = 3.096$). However, there is also some redundancy: our second highest rule is that female children travelling in second class survived; this information is already implied explicitly in our first rule.

Generally speaking we can remove those supersets where the lift of the superset is lower than the lift of the subset. The function is.subset in arules identifies all subsets, which we use to filter our rule set.

# find all subsets in our rule set using is.subset. Note that we have already sorted our rules by descending lift.
rules.subset<-is.subset(rules)
# throw away the bottom half, including the diagonal, of our identity matrix
rules.subset[lower.tri(rules.subset,diag=T)]<-NA

# find all matching subset rules
redundant<-colSums(rules.subset,na.rm = T)>=1
# prune matching subset rules
rules.pruned<-rules[!redundant]
inspect(rules.pruned)
##   lhs             rhs            support confidence  lift
## 1 {Class=2nd,
##    Age=Child}  => {Survived=Yes}   0.011      1.000 3.096
## 2 {Class=1st,
##    Sex=Female} => {Survived=Yes}   0.064      0.972 3.010
## 3 {Class=2nd,
##    Sex=Female} => {Survived=Yes}   0.042      0.877 2.716
## 4 {Class=Crew,
##    Sex=Female} => {Survived=Yes}   0.009      0.870 2.692
## 5 {Class=2nd,
##    Sex=Male,
##    Age=Adult}  => {Survived=No}    0.070      0.917 1.354
## 6 {Class=2nd,
##    Sex=Male}   => {Survived=No}    0.070      0.860 1.271
## 7 {Class=3rd,
##    Sex=Male,
##    Age=Adult}  => {Survived=No}    0.176      0.838 1.237
## 8 {Class=3rd,
##    Sex=Male}   => {Survived=No}    0.192      0.827 1.222

The first rule states that all children travelling second class survived. How about children in other classes? We cannot from this pruned rule set make any comparisons since there are no corresponding rules for children travelling in other classes, due to the fact that the support and confidence of these rules are below the limits we set previously. To find these rules we reset our filters for both the left and right hand sides of our rules, and loosen our parameter constraints.

rules.children<-apriori(titanic.data,
parameter=list(minlen=3,supp=0.002,conf=0.2),
appearance=list(rhs=c("Survived=Yes"),
lhs=c("Class=1st","Class=2nd","Class=3rd","Age=Child"),
default="none"),
control=list(verbose=F))

rules.children<-sort(rules.children,by="confidence")
inspect(rules.children)
##   lhs            rhs                support confidence     lift
## 1 {Class=2nd,
##    Age=Child} => {Survived=Yes} 0.010904134  1.0000000 3.095640
## 2 {Class=1st,
##    Age=Child} => {Survived=Yes} 0.002726034  1.0000000 3.095640
## 3 {Class=3rd,
##    Age=Child} => {Survived=Yes} 0.012267151  0.3417722 1.058004

We can now see that all children in first and second class survived, but only 34% of children in third class, of which there were as many as in first and second class combined. Pretty miserable stuff.

## Visualising Association Rules

The arulesViz package has some interesting functions allowing you to visualise your rules.

The simplest is a scatterplot showing each rule as a point against it’s confidence and support score, where the colour of the point shows it’s lift.

plot(rules.all,method="scatterplot",measure=c("support","confidence"),shading=c("lift"))

There are also some interactive selection and filtering options available, just set the interactive argument to True (this is the case for most of the subsequent visualisations from arulesViz demoed here).

plot(rules.all,method="scatterplot",measure=c("support","confidence"),shading=c("lift"),interactive=T)

If heatmaps are your thing then the matrix method will plot each antecedent against it’s consequent with a further dimension defining the shading. The names are printed in the console but I can’t work out how to include them on the plot.

plot(rules.all,method="matrix",type="grid",measure=c("confidence"))
## Itemsets in Antecedent (LHS)
##  [1] "{}"
##  [2] "{Class=2nd}"
##  [3] "{Class=1st}"
##  [4] "{Sex=Female}"
##  [5] "{Class=3rd}"
##  [6] "{Survived=Yes}"
##  [7] "{Class=Crew}"
##  [8] "{Survived=No}"
##  [9] "{Sex=Male}"
## [10] "{Sex=Female,Survived=Yes}"
## [11] "{Class=3rd,Sex=Male}"
## [12] "{Class=3rd,Survived=No}"
## [13] "{Sex=Male,Survived=Yes}"
## [14] "{Class=Crew,Survived=No}"
## [15] "{Class=Crew,Sex=Male}"
## [17] "{Sex=Male,Survived=No}"
## [19] "{Class=3rd,Sex=Male,Survived=No}"
## [22] "{Class=Crew,Sex=Male,Survived=No}"
## Itemsets in Consequent (RHS)
## [1] "{Age=Adult}"   "{Sex=Male}"    "{Survived=No}"

You can also plot it in 3D… providing further evidence that 3D plots are in most cases useless if you want to actually learn anything from your data vis.

plot(rules.all,method="matrix3D")
## Itemsets in Antecedent (LHS)
##  [1] "{}"
##  [2] "{Class=2nd}"
##  [3] "{Class=1st}"
##  [4] "{Sex=Female}"
##  [5] "{Class=3rd}"
##  [6] "{Survived=Yes}"
##  [7] "{Class=Crew}"
##  [8] "{Survived=No}"
##  [9] "{Sex=Male}"
## [10] "{Sex=Female,Survived=Yes}"
## [11] "{Class=3rd,Sex=Male}"
## [12] "{Class=3rd,Survived=No}"
## [13] "{Sex=Male,Survived=Yes}"
## [14] "{Class=Crew,Survived=No}"
## [15] "{Class=Crew,Sex=Male}"
## [17] "{Sex=Male,Survived=No}"
## [19] "{Class=3rd,Sex=Male,Survived=No}"
## [22] "{Class=Crew,Sex=Male,Survived=No}"
## Itemsets in Consequent (RHS)
## [1] "{Age=Adult}"   "{Sex=Male}"    "{Survived=No}"

A grouped plot does show the antecedent and consequent names by default, but no colour scale. Strange.

plot(rules.all,method="grouped",measure=c("confidence"))

There is also the option to plot rules as a network, where nodes represent antecedents and consequents and edges show the relationships between them. Under the hood this calls the igraph package for plotting and calculating common network measures. Here we plot the pruned rule set to show the relationships clearer; the edge weight represents the confidence in this case.

plot(rules.pruned,method="graph",measure=c("confidence"),control=list(type="itemsets"))

We can also plot relationships between individual items in the frequent itemsets making up the rules, by setting the control type argument to items. In this case nodes are the items themselves but they are not directly related to each; instead all relationships are directed through unlabelled nodes showing the size of the aggregate relationship, here confidence. For example, in the plot below the largest dot represents the rule {2nd class,Child}=>{Survived}, where the size of the dot indicates complete confidence since the maximum of the range of possible confidence is 1.

plot(rules.pruned,method="graph",control=list(type="items"),measure=c("confidence"))

This begins to show us some interesting relationships across all rules. The crew, first class passengers and women all had a much higher chance of survival than men and third class passengers, evidenced by the clustering of these two groups in to distinct areas on the graph. Second class passengers, however, straddle the center ground: for these passengers survival depends on your age and sex.

Finally, you can create a parallel coordinate plot. I find it hard to interpret this due to the lack of grid lines on the y axis. If anyone knows how to add these easily I’d be grateful to know.

plot(rules.all,method="paracoord",control=list(reorder=T),measure=c("confidence"),lty = "dotted")

## Reflections

So there you have it: I’ve demonstrated the Apriori algorithm for mining association rules and visualised them with the arulesViz package. This is my first dabble with these kinds of techniques; there are plenty more complex implementations out there, and I’m sure there are also better packages by now for doing this kind of mining. If you have any suggestions let me know in the comments. Thanks for reading!