This tutorial covers association rule learning and the Apriori algorithm.
The code below is a Python implementation of the Apriori algorithm for finding frequent itemsets and association rules, taken from this. Its advantage is that it requires little manipulation of the data and can create association rules directly from a .csv file.
The first part of the code defines the custom methods/functions that implement apriori, and the second part sets the support and confidence and runs the algorithm.
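Before looking at the implementation, the two thresholds the algorithm works with can be illustrated by hand. This is a minimal sketch with made-up transactions (the item names are hypothetical, not from the dataset used below):

```python
# Toy illustration of support and confidence (made-up transactions).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """support(X) = fraction of transactions containing all of X."""
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / float(len(transactions))

def confidence(antecedent, consequent):
    """confidence(X => Y) = support(X union Y) / support(X)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread"}))               # bread is in 3 of 4 baskets -> 0.75
print(confidence({"bread"}, {"milk"}))  # 2 of the 3 bread baskets have milk
```

Apriori keeps only itemsets whose support meets a minimum threshold, then derives rules from them and keeps only those whose confidence meets a second threshold.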
In [1]:
import sys

from itertools import chain, combinations
from collections import defaultdict
from optparse import OptionParser


def subsets(arr):
    """Returns non-empty subsets of arr"""
    return chain(*[combinations(arr, i + 1) for i, a in enumerate(arr)])


def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
    """Calculates the support for items in the itemSet and returns a subset
    of the itemSet whose elements all satisfy the minimum support"""
    _itemSet = set()
    localSet = defaultdict(int)

    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1
                localSet[item] += 1

    for item, count in localSet.items():
        support = float(count) / len(transactionList)
        if support >= minSupport:
            _itemSet.add(item)

    return _itemSet


def joinSet(itemSet, length):
    """Joins a set with itself and returns the n-element itemsets"""
    return set([i.union(j) for i in itemSet for j in itemSet
                if len(i.union(j)) == length])


def getItemSetTransactionList(data_iterator):
    transactionList = list()
    itemSet = set()
    for record in data_iterator:
        transaction = frozenset(record)
        transactionList.append(transaction)
        for item in transaction:
            itemSet.add(frozenset([item]))  # Generate 1-itemSets
    return itemSet, transactionList


def runApriori(data_iter, minSupport, minConfidence):
    """
    Runs the apriori algorithm. data_iter is a record iterator.
    Returns both:
     - items (tuple, support)
     - rules ((pretuple, posttuple), confidence)
    """
    itemSet, transactionList = getItemSetTransactionList(data_iter)

    freqSet = defaultdict(int)
    largeSet = dict()
    # Global dictionary which stores (key=n-itemSets, value=support)
    # which satisfy minSupport

    assocRules = dict()
    # Dictionary which stores Association Rules

    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)

    currentLSet = oneCSet
    k = 2
    while currentLSet != set([]):
        largeSet[k - 1] = currentLSet
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1

    def getSupport(item):
        """Local function which returns the support of an item"""
        return float(freqSet[item]) / len(transactionList)

    toRetItems = []
    for key, value in largeSet.items():
        toRetItems.extend([(tuple(item), getSupport(item))
                           for item in value])

    toRetRules = []
    # Skip the 1-itemsets (key == 1), since a rule needs at least two items.
    # Sorting by key first makes the slice deterministic; in Python 2,
    # plain dict ordering is arbitrary.
    for key, value in sorted(largeSet.items())[1:]:
        for item in value:
            _subsets = map(frozenset, [x for x in subsets(item)])
            for element in _subsets:
                remain = item.difference(element)
                if len(remain) > 0:
                    confidence = getSupport(item) / getSupport(element)
                    if confidence >= minConfidence:
                        toRetRules.append(((tuple(element), tuple(remain)),
                                           confidence))
    return toRetItems, toRetRules


def printResults(items, rules):
    """Prints the generated itemsets sorted by support and the
    rules sorted by confidence"""
    for item, support in sorted(items, key=lambda (item, support): support):
        print "item: %s , %.3f" % (str(item), support)
    print "\n------------------------ RULES:"
    for rule, confidence in sorted(rules, key=lambda (rule, confidence): confidence):
        pre, post = rule
        print "Rule: %s ==> %s , %.3f" % (str(pre), str(post), confidence)


def dataFromFile(fname):
    """Function which reads from the file and yields a generator"""
    file_iter = open(fname, 'rU')
    for line in file_iter:
        line = line.strip().rstrip(',')  # Remove trailing comma
        record = frozenset(line.split(','))
        yield record
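The heart of `runApriori` is the level-wise loop: frequent 1-itemsets are joined into 2-itemset candidates, the candidates are pruned by support, and so on until no level survives. The same flow can be sanity-checked with a small self-contained sketch (the transactions and the `frequent` helper here are made up for illustration, not part of the code above):

```python
from collections import defaultdict

# Made-up transactions; minSupport chosen so every pair survives.
transactions = [frozenset(t) for t in
                [("a", "b"), ("a", "c"), ("a", "b", "c"), ("b", "c")]]
minSupport = 0.5

def frequent(candidates):
    """Keep only candidates whose support meets minSupport."""
    counts = defaultdict(int)
    for c in candidates:
        for t in transactions:
            if c <= t:
                counts[c] += 1
    n = float(len(transactions))
    return set(c for c, cnt in counts.items() if cnt / n >= minSupport)

# Level 1: frequent single items.
L1 = frequent(set(frozenset([i]) for t in transactions for i in t))
# Level 2: join L1 with itself, keep 2-element unions, prune by support
# (this mirrors joinSet followed by returnItemsWithMinSupport).
C2 = set(i | j for i in L1 for j in L1 if len(i | j) == 2)
L2 = frequent(C2)
print(sorted(sorted(s) for s in L2))  # [['a', 'b'], ['a', 'c'], ['b', 'c']]
```

Each pair occurs in 2 of the 4 transactions, so all three 2-itemsets meet the 0.5 support threshold and would seed the next join.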
In [6]:
inFile = dataFromFile('../datasets/INTEGRATED-DATASET.csv')
minSupport = 0.15
minConfidence = 0.6
items, rules = runApriori(inFile, minSupport, minConfidence)
printResults(items, rules)
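`dataFromFile` turns each line of the CSV into a frozenset of items. The same parsing can be checked on an in-memory line (the basket contents here are made up):

```python
# Parse one CSV basket line the way dataFromFile does (made-up line,
# with the trailing comma and newline the parser is meant to strip).
line = "bread,milk,butter,\n"
record = frozenset(line.strip().rstrip(",").split(","))
print(sorted(record))  # ['bread', 'butter', 'milk']
```

Because each record is a frozenset, duplicate items within a line collapse and the itemset-subset tests in `returnItemsWithMinSupport` work directly.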
Orange is an open-source data visualization and data analysis tool with interactive workflows and a large toolbox. As noted in the tutorial video, Orange also provides a graphical user interface which can be used to quickly prototype rule mining. For more on the GUI you can read the getting started guide and the available modules.
Since the apriori algorithm above is not as streamlined as a library implementation, Orange's implementation of Apriori can be used instead. The following code is based on the example datasets provided by Orange, given here.
In addition, Orange's add-on for enumerating frequent itemsets and mining association rules is also very powerful. The documentation here outlines how to use custom datasets.
In [3]:
import Orange
data = Orange.data.Table("market-basket.basket")
rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.3)
print "%4s %4s %s" % ("Supp", "Conf", "Rule")
for r in rules[:5]:
    print "%4.1f %4.1f  %s" % (r.support, r.confidence, r)
In [4]:
import Orange
data = Orange.data.Table("market-basket.basket")
ind = Orange.associate.AssociationRulesSparseInducer(support=0.4, storeExamples = True)
itemsets = ind.get_itemsets(data)
for itemset, tids in itemsets[:5]:
    print "(%4.2f) %s" % (len(tids) / float(len(data)),
                          " ".join(data.domain[item].name for item in itemset))
In [5]:
import Orange
data = Orange.data.Table("inquisition.basket")
rules = Orange.associate.AssociationRulesSparseInducer(data, support = 0.5)
print "%5s %5s" % ("supp", "conf")
for r in rules:
    print "%5.3f %5.3f  %s" % (r.support, r.confidence, r)
My favourite implementation of Apriori is in the [R] programming language. This is an example of running Apriori on the Adult dataset, setting the support and confidence.
library(arules)
library(printr)
data("Adult")
rules <- apriori(Adult,
                 parameter = list(support = 0.4, confidence = 0.7),
                 appearance = list(rhs = c("race=White"), default = "lhs"))
rules.sorted <- sort(rules, by = "lift")
top5.rules <- head(rules.sorted, 5)
as(top5.rules, "data.frame")
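The `sort(rules, by = "lift")` step above ranks rules by lift, i.e. confidence divided by the consequent's support. The quantity is easy to compute by hand; a sketch in Python with made-up counts (these numbers are illustrative, not from the Adult dataset):

```python
# Lift of X => Y: confidence(X => Y) / support(Y).
# Made-up counts for illustration: 1000 transactions,
# 400 contain X, 500 contain Y, 300 contain both.
n_total, n_x, n_y, n_xy = 1000, 400, 500, 300

support_y = n_y / float(n_total)   # 0.5
confidence = n_xy / float(n_x)     # 0.75
lift = confidence / support_y      # 1.5

print(lift)  # lift > 1: X and Y co-occur more often than if independent
```

Sorting by lift rather than confidence surfaces rules whose consequent is genuinely made more likely by the antecedent, instead of rules whose consequent is simply common everywhere.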
You can read more about the implementation here:
https://cran.r-project.org/web/packages/arules/arules.pdf
https://cran.r-project.org/web/packages/arules/vignettes/arules.pdf