Pattern Mining

Library


In [1]:
source("https://raw.githubusercontent.com/eogasawara/mylibrary/master/myPreprocessing.R")
loadlibrary("arules")
loadlibrary("arulesViz")
loadlibrary("arulesSequences")


Loading required package: ggplot2

Loading required package: scales

Loading required package: ggpubr

Loading required package: reshape

Loading required package: caret

Loading required package: lattice

Loading required package: MASS

Loading required package: DMwR

Loading required package: grid

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 

Loading required package: dplyr


Attaching package: ‘dplyr’


The following object is masked from ‘package:MASS’:

    select


The following object is masked from ‘package:reshape’:

    rename


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: arules

Loading required package: Matrix


Attaching package: ‘Matrix’


The following object is masked from ‘package:reshape’:

    expand



Attaching package: ‘arules’


The following object is masked from ‘package:dplyr’:

    recode


The following objects are masked from ‘package:base’:

    abbreviate, write


Loading required package: arulesViz

Registered S3 method overwritten by 'seriation':
  method         from 
  reorder.hclust gclus

Loading required package: arulesSequences


In [2]:
data(AdultUCI)
dim(AdultUCI)
head(AdultUCI)


  1. 48842
  2. 15
A data.frame: 6 × 15
ageworkclassfnlwgteducationeducation-nummarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countryincome
<int><fct><int><ord><int><fct><fct><fct><fct><fct><int><int><int><fct><ord>
139State-gov 77516Bachelors13Never-married Adm-clerical Not-in-familyWhiteMale 2174040United-Statessmall
250Self-emp-not-inc 83311Bachelors13Married-civ-spouseExec-managerial Husband WhiteMale 0013United-Statessmall
338Private 215646HS-grad 9Divorced Handlers-cleanersNot-in-familyWhiteMale 0040United-Statessmall
453Private 23472111th 7Married-civ-spouseHandlers-cleanersHusband BlackMale 0040United-Statessmall
528Private 338409Bachelors13Married-civ-spouseProf-specialty Wife BlackFemale 0040Cuba small
637Private 284582Masters 14Married-civ-spouseExec-managerial Wife WhiteFemale 0040United-Statessmall

Removing attributes


In [3]:
AdultUCI$fnlwgt <- NULL
AdultUCI$"education-num" <- NULL

Conceptual Hierarchy and Binning


In [4]:
AdultUCI$age <- ordered(cut(AdultUCI$age, c(15,25,45,65,100)),
                              labels = c("Young", "Middle-aged", "Senior", "Old"))

AdultUCI$"hours-per-week" <- ordered(cut(AdultUCI$"hours-per-week",
                                             c(0,25,40,60,168)),
                                         labels = c("Part-time", "Full-time", "Over-time", "Workaholic"))

AdultUCI$"capital-gain" <- ordered(cut(AdultUCI$"capital-gain",
                                           c(-Inf,0,median(AdultUCI$"capital-gain"[AdultUCI$"capital-gain">0]),
                                             Inf)), labels = c("None", "Low", "High"))

AdultUCI$"capital-loss" <- ordered(cut(AdultUCI$"capital-loss",
                                           c(-Inf,0, median(AdultUCI$"capital-loss"[AdultUCI$"capital-loss">0]),
                                             Inf)), labels = c("None", "Low", "High"))

head(AdultUCI)


A data.frame: 6 × 13
ageworkclasseducationmarital-statusoccupationrelationshipracesexcapital-gaincapital-losshours-per-weeknative-countryincome
<ord><fct><ord><fct><fct><fct><fct><fct><ord><ord><ord><fct><ord>
1Middle-agedState-gov BachelorsNever-married Adm-clerical Not-in-familyWhiteMale Low NoneFull-timeUnited-Statessmall
2Senior Self-emp-not-incBachelorsMarried-civ-spouseExec-managerial Husband WhiteMale NoneNonePart-timeUnited-Statessmall
3Middle-agedPrivate HS-grad Divorced Handlers-cleanersNot-in-familyWhiteMale NoneNoneFull-timeUnited-Statessmall
4Senior Private 11th Married-civ-spouseHandlers-cleanersHusband BlackMale NoneNoneFull-timeUnited-Statessmall
5Middle-agedPrivate BachelorsMarried-civ-spouseProf-specialty Wife BlackFemaleNoneNoneFull-timeCuba small
6Middle-agedPrivate Masters Married-civ-spouseExec-managerial Wife WhiteFemaleNoneNoneFull-timeUnited-Statessmall

Convert to transactions


In [5]:
AdultTrans <- as(AdultUCI, "transactions")

A Priori


In [6]:
rules <- apriori(AdultTrans, parameter=list(supp = 0.5, conf = 0.9, minlen=2, maxlen= 10, target = "rules"), 
                 appearance=list(rhs = c("capital-gain=None"), default="lhs"), control=NULL)
inspect(rules)


Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen
        0.9    0.1    1 none FALSE            TRUE       5     0.5      2
 maxlen target  ext
     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 24421 

set item appearances ...[1 item(s)] done [0.00s].
set transactions ...[115 item(s), 48842 transaction(s)] done [0.08s].
sorting and recoding items ... [9 item(s)] done [0.01s].
creating transaction tree ... done [0.05s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [18 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].
     lhs                               rhs                   support confidence  coverage      lift count
[1]  {hours-per-week=Full-time}     => {capital-gain=None} 0.5435895  0.9290688 0.5850907 1.0127342 26550
[2]  {sex=Male}                     => {capital-gain=None} 0.6050735  0.9051455 0.6684820 0.9866565 29553
[3]  {workclass=Private}            => {capital-gain=None} 0.6413742  0.9239073 0.6941976 1.0071078 31326
[4]  {race=White}                   => {capital-gain=None} 0.7817862  0.9143240 0.8550428 0.9966616 38184
[5]  {native-country=United-States} => {capital-gain=None} 0.8219565  0.9159062 0.8974243 0.9983862 40146
[6]  {capital-loss=None}            => {capital-gain=None} 0.8706646  0.9133376 0.9532779 0.9955863 42525
[7]  {capital-loss=None,                                                                                 
      hours-per-week=Full-time}     => {capital-gain=None} 0.5191638  0.9259787 0.5606650 1.0093657 25357
[8]  {race=White,                                                                                        
      sex=Male}                     => {capital-gain=None} 0.5313050  0.9030799 0.5883256 0.9844048 25950
[9]  {sex=Male,                                                                                          
      native-country=United-States} => {capital-gain=None} 0.5406003  0.9035349 0.5983170 0.9849008 26404
[10] {workclass=Private,                                                                                 
      race=White}                   => {capital-gain=None} 0.5472339  0.9208931 0.5942427 1.0038221 26728
[11] {workclass=Private,                                                                                 
      native-country=United-States} => {capital-gain=None} 0.5689570  0.9218444 0.6171942 1.0048592 27789
[12] {workclass=Private,                                                                                 
      capital-loss=None}            => {capital-gain=None} 0.6111748  0.9204465 0.6639982 1.0033354 29851
[13] {race=White,                                                                                        
      native-country=United-States} => {capital-gain=None} 0.7194628  0.9128933 0.7881127 0.9951019 35140
[14] {race=White,                                                                                        
      capital-loss=None}            => {capital-gain=None} 0.7404283  0.9099693 0.8136849 0.9919147 36164
[15] {capital-loss=None,                                                                                 
      native-country=United-States} => {capital-gain=None} 0.7793702  0.9117168 0.8548380 0.9938195 38066
[16] {workclass=Private,                                                                                 
      race=White,                                                                                        
      capital-loss=None}            => {capital-gain=None} 0.5204742  0.9171628 0.5674829 0.9997559 25421
[17] {workclass=Private,                                                                                 
      capital-loss=None,                                                                                 
      native-country=United-States} => {capital-gain=None} 0.5414807  0.9182030 0.5897179 1.0008898 26447
[18] {race=White,                                                                                        
      capital-loss=None,                                                                                 
      native-country=United-States} => {capital-gain=None} 0.6803980  0.9083504 0.7490480 0.9901500 33232

In [7]:
rules_a <- as(rules, "data.frame")
head(rules_a)


A data.frame: 6 × 6
rulessupportconfidencecoverageliftcount
<fct><dbl><dbl><dbl><dbl><int>
1{hours-per-week=Full-time} => {capital-gain=None} 0.54358950.92906880.58509071.012734226550
2{sex=Male} => {capital-gain=None} 0.60507350.90514550.66848200.986656529553
3{workclass=Private} => {capital-gain=None} 0.64137420.92390730.69419761.007107831326
4{race=White} => {capital-gain=None} 0.78178620.91432400.85504280.996661638184
5{native-country=United-States} => {capital-gain=None}0.82195650.91590620.89742430.998386240146
6{capital-loss=None} => {capital-gain=None} 0.87066460.91333760.95327790.995586342525

Analysis of Rules


In [8]:
imrules <- interestMeasure(rules, transactions = AdultTrans)
head(imrules)


A data.frame: 6 × 47
supportconfidenceliftcountcoveragerhsSupportleveragehyperLifthyperConfidencefishersExactTestcasualConfidenceleastContradictioncenteredConfidencevaryingLiaisonyuleQyuleYlermanimplicationIndeximportancestdLift
<dbl><dbl><dbl><int><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
10.54358950.92906881.0127342265500.58509070.9173867 0.0068351211.01004341.000000e+001.121041e-280.99999760.5925413 0.011682156 0.012734168 0.1805312 0.09101332 2.0618407-6.870784 0.0133701190.29068832
20.60507350.90514550.9866565295530.66848200.9173867-0.0081830180.98444374.037659e-471.000000e+000.99999660.6595621-0.012241193-0.013343548-0.2604097-0.13249043-2.3093466 7.695561-0.0173515960.05145482
30.64137420.92390731.0071078313260.69419760.9173867 0.0045265831.00500481.000000e+004.115642e-150.99999730.6991318 0.006520598 0.007107796 0.1343742 0.06749314 1.2535725-4.177348 0.0101547650.23907273
40.78178620.91432400.9966616381840.85504280.9173867-0.0026186960.99536003.318642e-101.000000e+000.99999660.8521883-0.003062649-0.003338449-0.1518210-0.07635304-0.6534489 2.177523-0.0098736940.11325898
50.82195650.91590620.9983862401460.89742430.9173867-0.0013286260.99731701.488781e-049.998511e-010.99999660.8959761-0.001480488-0.001613810-0.1016143-0.05093898-0.3236118 1.078389-0.0067192120.08649318
60.87066460.91333760.9955863425250.95327790.9173867-0.0038598670.99487650.000000e+001.000000e+000.99999640.9490705-0.004049047-0.004413675-1.0000000-1.00000000-0.9121852 3.039724-0.0391869040.00000000

Removing redundant rules


In [9]:
nrules <- rules[!is.redundant(rules)]

In [10]:
arules::inspect(nrules)


    lhs                               rhs                 support   confidence
[1] {hours-per-week=Full-time}     => {capital-gain=None} 0.5435895 0.9290688 
[2] {sex=Male}                     => {capital-gain=None} 0.6050735 0.9051455 
[3] {workclass=Private}            => {capital-gain=None} 0.6413742 0.9239073 
[4] {race=White}                   => {capital-gain=None} 0.7817862 0.9143240 
[5] {native-country=United-States} => {capital-gain=None} 0.8219565 0.9159062 
[6] {capital-loss=None}            => {capital-gain=None} 0.8706646 0.9133376 
    coverage  lift      count
[1] 0.5850907 1.0127342 26550
[2] 0.6684820 0.9866565 29553
[3] 0.6941976 1.0071078 31326
[4] 0.8550428 0.9966616 38184
[5] 0.8974243 0.9983862 40146
[6] 0.9532779 0.9955863 42525

Showing the transactions that support the rules

In this example, we can see the transactions (trans) that support rules 1.


In [16]:
st <- supportingTransactions(nrules[1], AdultTrans)
trans <- unique(st@data@i)
length(trans)
print(c(length(trans)/length(AdultTrans), nrules[1]@quality$support))


26550
[1] 0.5435895 0.5435895

Now we can see the transactions (trans) that support rules 1 and 2. As can be observed, the support for both rules is not the sum of the support of each rule.


In [17]:
st <- supportingTransactions(nrules[1:2], AdultTrans)
trans <- unique(st@data@i)
length(trans)
print(c(length(trans)/length(AdultTrans), nrules[1:2]@quality$support))


39392
[1] 0.8065190 0.5435895 0.6050735

Rules visualization


In [18]:
options(repr.plot.width=7, repr.plot.height=4)
plot(rules)



In [19]:
options(repr.plot.width=7, repr.plot.height=4)
plot(rules, method="paracoord", control=list(reorder=TRUE))


Sequence Mining


In [20]:
x <- read_baskets(con = system.file("misc", "zaki.txt", package = "arulesSequences"), info = c("sequenceID","eventID","SIZE"))
as(x, "data.frame")


A data.frame: 10 × 4
itemssequenceIDeventIDSIZE
<fct><int><int><int>
{C,D} 1102
{A,B,C} 1153
{A,B,F} 1203
{A,C,D,F}1254
{A,B,F} 2153
{E} 2201
{A,B,F} 3103
{D,G,H} 4103
{B,F} 4202
{A,G,H} 4253

In [21]:
s1 <- cspade(x, parameter = list(support = 0.4), control = list(verbose = TRUE))
as(s1, "data.frame")


parameter specification:
support : 0.4
maxsize :  10
maxlen  :  10

algorithmic control:
bfstype  : FALSE
verbose  :  TRUE
summary  : FALSE
tidLists : FALSE

preprocessing ... 1 partition(s), 0 MB [0.046s]
mining transactions ... 0 MB [0.032s]
reading sequences ... [0.027s]

total elapsed time: 0.105s
A data.frame: 18 × 2
sequencesupport
<fct><dbl>
<{A}> 1.00
<{B}> 1.00
<{D}> 0.50
<{F}> 1.00
<{A,F}> 0.75
<{B,F}> 1.00
<{D},{F}> 0.50
<{D},{B,F}> 0.50
<{A,B,F}> 0.75
<{A,B}> 0.75
<{D},{B}> 0.50
<{B},{A}> 0.50
<{D},{A}> 0.50
<{F},{A}> 0.50
<{D},{F},{A}> 0.50
<{B,F},{A}> 0.50
<{D},{B,F},{A}>0.50
<{D},{B},{A}> 0.50

In [ ]: