This tutorial covers association rule learning and the Apriori algorithm.
The code below is a Python implementation of the Apriori algorithm for finding frequent itemsets and association rules, taken from this. Its advantage is that it requires little manipulation of the data and can create association rules directly from a .csv file.
The first part of the code defines the custom methods/functions that implement apriori, and the second part sets the support and confidence and runs the algorithm.
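Before looking at the implementation, the two thresholds the algorithm works with can be illustrated by hand. This is a minimal sketch with made-up transactions (the item names are hypothetical, not from the dataset used below):

```python
# Toy illustration of support and confidence (made-up transactions).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """support(X) = fraction of transactions containing all of X."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / float(len(transactions))

def confidence(antecedent, consequent):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))               # bread is in 3 of 4 baskets -> 0.75
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets have milk
```

Apriori keeps only itemsets whose support meets a minimum threshold, then derives rules from them and keeps only those whose confidence meets a second threshold.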
In [1]:
import sys

from itertools import chain, combinations
from collections import defaultdict
from optparse import OptionParser


def subsets(arr):
    """Returns non-empty subsets of arr"""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])


def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
    """Calculates the support for items in the itemSet and returns a subset
    of the itemSet whose elements all satisfy the minimum support"""
    _itemSet = set()
    localSet = defaultdict(int)

    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1
                localSet[item] += 1

    for item, count in localSet.items():
        support = float(count) / len(transactionList)
        if support >= minSupport:
            _itemSet.add(item)

    return _itemSet


def joinSet(itemSet, length):
    """Joins a set with itself and returns the n-element itemsets"""
    return set([i.union(j) for i in itemSet for j in itemSet
                if len(i.union(j)) == length])


def getItemSetTransactionList(data_iterator):
    transactionList = list()
    itemSet = set()
    for record in data_iterator:
        transaction = frozenset(record)
        transactionList.append(transaction)
        for item in transaction:
            itemSet.add(frozenset([item]))  # Generate 1-itemSets
    return itemSet, transactionList


def runApriori(data_iter, minSupport, minConfidence):
    """
    Runs the apriori algorithm. data_iter is a record iterator.
    Returns both:
     - items (tuple, support)
     - rules ((pretuple, posttuple), confidence)
    """
    itemSet, transactionList = getItemSetTransactionList(data_iter)

    freqSet = defaultdict(int)
    largeSet = dict()
    # Global dictionary which stores (key=n-itemSets, value=support)
    # which satisfy minSupport

    assocRules = dict()
    # Dictionary which stores Association Rules

    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)

    currentLSet = oneCSet
    k = 2
    while currentLSet != set([]):
        largeSet[k - 1] = currentLSet
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1

    def getSupport(item):
        """Local function which returns the support of an item"""
        return float(freqSet[item]) / len(transactionList)

    toRetItems = []
    for key, value in largeSet.items():
        toRetItems.extend([(tuple(item), getSupport(item))
                           for item in value])

    toRetRules = []
    # Skip the 1-itemsets (key == 1), since a rule needs at least two items.
    # Sorting by key first makes the slice deterministic; in Python 2,
    # plain dict ordering is arbitrary.
    for key, value in sorted(largeSet.items())[1:]:
        for item in value:
            _subsets = map(frozenset, [x for x in subsets(item)])
            for element in _subsets:
                remain = item.difference(element)
                if len(remain) > 0:
                    confidence = getSupport(item) / getSupport(element)
                    if confidence >= minConfidence:
                        toRetRules.append(((tuple(element), tuple(remain)),
                                           confidence))
    return toRetItems, toRetRules


def printResults(items, rules):
    """Prints the generated itemsets sorted by support and the
    rules sorted by confidence"""
    for item, support in sorted(items, key=lambda (item, support): support):
        print "item: %s , %.3f" % (str(item), support)
    print "\n------------------------ RULES:"
    for rule, confidence in sorted(rules, key=lambda (rule, confidence): confidence):
        pre, post = rule
        print "Rule: %s ==> %s , %.3f" % (str(pre), str(post), confidence)


def dataFromFile(fname):
    """Function which reads from the file and yields a generator"""
    file_iter = open(fname, 'rU')
    for line in file_iter:
        line = line.strip().rstrip(',')  # Remove trailing comma
        record = frozenset(line.split(','))
        yield record
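The heart of `runApriori` is the level-wise loop: frequent 1-itemsets are joined into 2-itemset candidates, the candidates are pruned by support, and so on until no level survives. The same flow can be sanity-checked with a small self-contained sketch (the transactions and the `frequent` helper here are made up for illustration, not part of the code above):

```python
from collections import defaultdict

# Made-up transactions; minSupport chosen so every pair survives.
transactions = [frozenset(t) for t in
                [("a", "b"), ("a", "c"), ("a", "b", "c"), ("b", "c")]]
minSupport = 0.5

def frequent(candidates):
    """Keep only candidates whose support meets minSupport."""
    counts = defaultdict(int)
    for c in candidates:
        for t in transactions:
            if c <= t:
                counts[c] += 1
    n = float(len(transactions))
    return set(c for c, cnt in counts.items() if cnt / n >= minSupport)

# Level 1: frequent single items.
L1 = frequent(set(frozenset([i]) for t in transactions for i in t))
# Level 2: join L1 with itself, keep 2-element unions, prune by support
# (this mirrors joinSet followed by returnItemsWithMinSupport).
C2 = set(i | j for i in L1 for j in L1 if len(i | j) == 2)
L2 = frequent(C2)
print(sorted(sorted(s) for s in L2))  # [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

Each pair occurs in 2 of the 4 transactions, so all three 2-itemsets meet the 0.5 support threshold and would seed the next join.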
In [6]:
inFile = dataFromFile('../datasets/INTEGRATED-DATASET.csv')
minSupport = 0.15
minConfidence = 0.6
items, rules = runApriori(inFile, minSupport, minConfidence)
printResults(items, rules)
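`dataFromFile` turns each line of the CSV into a frozenset of items. The same parsing can be checked on an in-memory line (the basket contents here are made up):

```python
# Parse one CSV basket line the way dataFromFile does (made-up line,
# with the trailing comma and newline the parser is meant to strip).
line = "bread,milk,butter,\n"
record = frozenset(line.strip().rstrip(",").split(","))
print(sorted(record))  # ['bread', 'butter', 'milk']
```

Because each record is a frozenset, duplicate items within a line collapse and the itemset-subset tests in `returnItemsWithMinSupport` work directly.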
Orange is an open-source data visualization and data analysis tool with interactive workflows and a large toolbox. As noted in the tutorial video, Orange also provides a graphical user interface which can be used to quickly prototype rule mining. For more on the GUI you can read the getting started guide and the available modules.
Since the apriori algorithm above is not as streamlined as a library implementation, Orange's implementation of Apriori can be used instead. The following code is based on the example datasets provided by Orange, given here.
In addition, Orange's add-on for enumerating frequent itemsets and mining association rules is also very powerful. The documentation here outlines how to use custom datasets.
In [3]:
import Orange
data = Orange.data.Table("market-basket.basket")
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules[:5]:
    print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)
In [4]:
import Orange
data = Orange.data.Table("market-basket.basket")
ind = Orange.associate.AssociationRulesSparseInducer(support=0.4, storeExamples = True)
itemsets = ind.get_itemsets(data)
for itemset, tids in itemsets[:5]:
    print "(%4.2f) %s" % (len(tids) / float(len(data)),
                          " ".join(data.domain[item].name for item in itemset))
In [5]:
import Orange
data = Orange.data.Table("inquisition.basket")
rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)
print "%5s %5s" % ("supp", "conf")
for r in rules:
    print "%5.3f %5.3f  %s" % (r.support, r.confidence, r)
My favourite implementation of Apriori is in the [R] programming language. This is an example of running Apriori on the Adult dataset, setting the support and confidence.
library(arules)
library(printr)
data("Adult")
rules <- apriori(Adult,
                 parameter = list(support = 0.4, confidence = 0.7),
                 appearance = list(rhs = c("race=White"), default = "lhs"))
rules.sorted <- sort(rules, by = "lift")
top5.rules <- head(rules.sorted, 5)
as(top5.rules, "data.frame")
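The `sort(rules, by = "lift")` step above ranks rules by lift, i.e. confidence divided by the consequent's support. The quantity is easy to compute by hand; a sketch in Python with made-up counts (these numbers are illustrative, not from the Adult dataset):

```python
# Lift of X => Y: confidence(X => Y) / support(Y).
# Made-up counts for illustration: 1000 transactions,
# 400 contain X, 500 contain Y, 300 contain both.
n_total, n_x, n_y, n_xy = 1000, 400, 500, 300

support_y = n_y / float(n_total)   # 0.5
confidence = n_xy / float(n_x)     # 0.75
lift = confidence / support_y      # 1.5

print(lift)  # lift > 1: X and Y co-occur more often than if independent
```

Sorting by lift rather than confidence surfaces rules whose consequent is genuinely made more likely by the antecedent, instead of rules whose consequent is simply common everywhere.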
You can read more about the implementation here:
https://cran.r-project.org/web/packages/arules/arules.pdf
https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf