In [1]:
%matplotlib inline
from __future__ import division
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['axes.grid'] = False
import logging
logger = logging.getLogger()
Each basket consists of a set of items (an itemset).
The number of items in a basket is small.
The number of baskets is usually very large.
Baskets are sets, and in principle items can appear only once.
a set of items that appears in many baskets is said to be "frequent".
support: if $I$ is a set of items, the support of $I$ is the number of baskets for which $I$ is a subset.
Assume $s$ is the support threshold; then we say $I$ is frequent if its support is $s$ or more.
In [2]:
logger.setLevel('WARN')
data_raw = [
['Cat', 'and', 'dog', 'bites'],
['Yahoo', 'news', 'claims', 'a', 'cat', 'mated', 'with', 'a', 'dog', 'and', 'produced', 'viable', 'offspring'],
['Cat', 'killer', 'likely', 'is', 'a', 'big', 'dog'],
['Professional', 'free', 'advice', 'on', 'dog', 'training', 'puppy', 'training'],
['Cat', 'and', 'kitten', 'training', 'and', 'behavior'],
['Dog', '&', 'Cat', 'provides', 'dog', 'training', 'in', 'Eugene', 'Oregon'],
['Dog', 'and', 'cat', 'is', 'a', 'slang', 'term', 'used', 'by', 'police', 'officers', 'for', 'a', 'male-female', 'relationship'],
['Shop', 'for', 'your', 'show', 'dog', 'grooming', 'and', 'pet', 'supplies']
]
data = [set(map(str.lower, x)) for x in data_raw]
data = pd.Series(data)
def calc_occurrence_of_doubletons(df, data):
    """Fill each marked cell with the (1-based) indices of the baskets
    that contain the doubleton {row item, column item}."""
    logger.info('df: \n{}\n'.format(df))
    for i_ in df.index:
        for c_ in df.columns:
            if not np.isnan(df.loc[i_, c_]):
                key = {i_, c_}
                # 1-based indices of the baskets that contain both items
                ind_ = [ind + 1 for ind, is_in
                        in enumerate(data.apply(lambda x: key.issubset(x))) if is_in]
                df.loc[i_, c_] = ','.join([str(x) for x in ind_])
    return df
mask = [
[1, 1, 1, 1],
[1, 1, 1, np.nan],
[1, 1, np.nan, np.nan],
[1, np.nan, np.nan, np.nan]
]
df = pd.DataFrame(mask, index=['dog', 'cat', 'and', 'a'], columns=['training', 'a', 'and', 'cat'])
calc_occurrence_of_doubletons(df, data)
Out[2]:
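As a quick check of the support definition above, here is a minimal sketch (not part of the original cells) that counts, for a given itemset, how many of the baskets in `data` contain it; the itemset `{'cat', 'dog'}` is just an illustrative choice.
# support(I) = number of baskets of which the itemset I is a subset
def support(itemset, baskets=data):
    itemset = set(itemset)
    return sum(itemset.issubset(basket) for basket in baskets)

support({'cat', 'dog'})  # the doubleton {cat, dog} appears in baskets 1, 2, 3, 6, 7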
an association rule $I \to j$:
if all of the items in $I$ appear in some basket, then $j$ is "likely" to appear in that basket as well.
confidence: $$\text{confidence}(I \to j) = \frac{\text{support}(I \cup \{j\})}{\text{support}(I)}$$
interest: $$\text{interest}(I \to j) = \text{confidence}(I \to j) - \frac{\text{support}(\{j\})}{m}$$ where $m$ is the total number of baskets.
if $J$ is a set of $n$ items that is found to be frequent, then we have $n$ candidate rules: $J - \{j\} \to j$ for each $j$ in $J$. Their confidences can then be calculated.
cons: it is assumed that there are not too many frequent itemsets, since each one found must be acted upon.
solution: adjust the support threshold $s$ so that we do not get too many frequent itemsets.
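A minimal sketch of the two measures just defined, reusing the `support` helper sketched above and the `data` baskets; the rule $\{\text{cat}\} \to \text{dog}$ is only an example.
# confidence(I -> j) = support(I | {j}) / support(I)
def confidence(I, j, baskets=data):
    return support(set(I) | {j}, baskets) / support(I, baskets)

# interest(I -> j) = confidence(I -> j) - support({j}) / m, with m = number of baskets
def interest(I, j, baskets=data):
    return confidence(I, j, baskets) - support({j}, baskets) / len(baskets)

confidence({'cat'}, 'dog'), interest({'cat'}, 'dog')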
We assume that:
a major cost of any algorithm is the time it takes to read the baskets from disk.
Why can we ignore the time it takes to generate all the subsets of size $k$ from each basket? Because each basket is small, this in-memory work is dominated by the time needed to read the baskets from disk.
Measure: only the number of passes taken by the algorithm matters.
Each algorithm has a limit on how many items it can deal with.
In general, we need a hash table that translates items as they appear in the file to integers.
The Triangular-Matrix Method
The Triples Method
We can store counts as triples $[i,j,c]$.
eg. a hash table with $i$ and $j$ as the search key.
pros: don't store anything if a pair counts 0.
cons: store 3 integers for every pair.
use the triangular matrix if at least 1/3 of the $C_n^2$ possible pairs actually appear in some basket.
use the triples method if significantly fewer than 1/3 of the possible pairs occur.
We might be better off using the triples method, because it would be normal for the distribution of items to be sufficiently uneven even if there were ten or a hundred times as many baskets.
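A small sketch of the two storage schemes; the 0-based index formula and the sizes are my own illustration, not taken from the notes above.
import numpy as np
from collections import defaultdict

n = 5                                   # number of items, illustrative
# Triangular-matrix method: one count for every pair {i, j}, 0 <= i < j < n,
# stored in a 1-D array of length n*(n-1)/2 (space is used even if the count is 0).
tri = np.zeros(n * (n - 1) // 2, dtype=int)
def tri_index(i, j):
    # position of the pair {i, j} (with i < j) in the 1-D array
    return i * n - i * (i + 1) // 2 + (j - i - 1)
tri[tri_index(1, 3)] += 1               # count one occurrence of the pair {1, 3}

# Triples method: store [i, j, c] only for pairs that actually occur,
# e.g. a hash table keyed by (i, j); 3 integers per occurring pair.
triples = defaultdict(int)
triples[(1, 3)] += 1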
to avoid counting many triples or larger sets.
The First Pass: two tables. One translates item names to integers; the other counts the occurrences of each item.
Between the Passes
we determine the frequent items by applying the threshold $s$ (only items with count $\ge s$ are kept).
The Second Pass
We count all pairs of frequent items, as follows:
In [3]:
plt.imshow(plt.imread('./res/fig6_4.png'))
Out[3]:
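A minimal two-pass A-Priori sketch for frequent pairs over the `data` baskets above; the threshold `s = 3` is an arbitrary illustrative choice.
from collections import Counter
from itertools import combinations

s = 3                                                  # support threshold, illustrative
# 1st pass: count the items and keep the frequent ones.
item_counts = Counter(item for basket in data for item in basket)
frequent_items = {item for item, c in item_counts.items() if c >= s}
# 2nd pass: count only the pairs whose two members are both frequent.
pair_counts = Counter()
for basket in data:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] += 1
{pair: c for pair, c in pair_counts.items() if c >= s}  # the frequent pairs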
1st pass:
during the pass: besides counting the items, hash each pair $\{i,j\}$ in each basket to a bucket and add 1 to that bucket's count; at the end, buckets with count $\ge s$ (frequent buckets) are summarized as a bitmap.
$C_2$, the candidate pairs $\{i,j\}$: $i$ and $j$ are both frequent, and $\{i,j\}$ hashes to a frequent bucket.
pros: $C_2 \downarrow$.
cons: the candidate pairs cannot be renumbered $1,\dotsc,m$ $\to$ cannot use the triangular matrix $\to$ ONLY the triples method can be used.
In [4]:
plt.imshow(plt.imread('./res/fig6_5.png'))
Out[4]:
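A PCY sketch on the same `data`; the number of buckets and the hash function are arbitrary illustrative choices, and only the parts that differ from A-Priori are shown.
from collections import Counter
from itertools import combinations

s, n_buckets = 3, 11                        # threshold and bucket count, illustrative
bucket_of = lambda pair: hash(pair) % n_buckets

# 1st pass: count items AND hash every pair in every basket to a bucket.
item_counts, bucket_counts = Counter(), Counter()
for basket in data:
    item_counts.update(basket)
    for pair in combinations(sorted(basket), 2):
        bucket_counts[bucket_of(pair)] += 1
frequent_items = {i for i, c in item_counts.items() if c >= s}
bitmap = [bucket_counts[b] >= s for b in range(n_buckets)]   # frequent buckets

# 2nd pass: {i, j} is a candidate only if i, j are frequent and its bucket is frequent.
is_candidate = lambda pair: set(pair) <= frequent_items and bitmap[bucket_of(pair)]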
It improves upon PCY by using several successive hash tables to further reduce the number of candidate pairs.
The 1st pass is the same as that of PCY.
2nd pass:
We hash $\{i,j\}$ with a second hash function if and only if: $i$ and $j$ are both frequent, and $\{i,j\}$ hashed to a frequent bucket on the 1st pass.
The second hash table is then summarized as a bitmap $B_2$.
$C_2$, the candidate pairs $\{i,j\}$: $i$ and $j$ are both frequent, and $\{i,j\}$ hashed to a frequent bucket in both hash tables.
Attention:
each pass must store the bitmaps of all earlier hash tables; if too many stages are used, eventually there is not enough space left for counting.
In [5]:
plt.imshow(plt.imread('./res/fig6_6.png'))
Out[5]:
use two hash functions and two separate hash tables on the 1st pass.
The danger of using two hash tables on one pass is that each hash table has half as many buckets as the one large hash table of PCY. This works as long as the average count of a bucket for PCY is much lower than the support threshold, so that halving the number of buckets still leaves most buckets infrequent.
$C_2$: $i$ and $j$ must both be frequent, and the pair must have hashed to a frequent bucket according to both hash tables.
The risk is that should we use too many hash tables, the average count for a bucket will exceed the support threshold.
$\implies$ the probability an infrequent pair will be a candidate rises, rather than falls, if we add another hash table.
In [6]:
plt.imshow(plt.imread('./res/fig6_7.png'))
Out[6]:
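A compact sketch of the multihash first pass: two hash functions fill two half-sized bucket arrays during the same pass, and a pair becomes a candidate only if both of its buckets are frequent. The hash functions, bucket counts, and threshold are illustrative, and `frequent_items` is assumed from an A-Priori-style item count as in the PCY sketch above.
from collections import Counter
from itertools import combinations

s, n_buckets = 3, 7                                    # each table gets roughly half the memory
h1 = lambda pair: hash(pair) % n_buckets
h2 = lambda pair: hash(('salt',) + pair) % n_buckets   # a second, independent hash

counts1, counts2 = Counter(), Counter()
for basket in data:
    for pair in combinations(sorted(basket), 2):
        counts1[h1(pair)] += 1
        counts2[h2(pair)] += 1
bitmap1 = [counts1[b] >= s for b in range(n_buckets)]
bitmap2 = [counts2[b] >= s for b in range(n_buckets)]

# candidate pair: i and j frequent, and the pair hashes to a frequent bucket in BOTH tables
is_candidate = lambda pair: (set(pair) <= frequent_items
                             and bitmap1[h1(pair)] and bitmap2[h2(pair)])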
to pick a random subset of the baskets and adjust the support threshold proportionally (to $ps$ for a sample that is a fraction $p$ of the baskets).
The safe way:
read the entire dataset basket by basket,
and for each basket, select that basket for the sample with some fixed probability $p$.
eliminate False Positives: make a pass through the full dataset and count all the candidates to check which are truly frequent.
reduce False Negatives:
use a smaller threshold for the sample, such as $0.9ps$, so that more itemsets are pushed through to be checked.
cons: needs more main memory.
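A minimal sketch of this sampling scheme: keep each basket with probability $p$, mine the sample with the lowered threshold $0.9ps$ (only singletons are counted here), and optionally make a full pass to eliminate the false positives; `p` and `s` are illustrative.
import random
from collections import Counter

p, s = 0.5, 4                                   # sampling probability and full-data threshold
sample = [basket for basket in data if random.random() < p]

# mine the sample with the lowered threshold 0.9*p*s (pushes more candidates through)
sample_counts = Counter(item for basket in sample for item in basket)
candidates = {item for item, c in sample_counts.items() if c >= 0.9 * p * s}

# optional full pass: eliminate false positives by counting the candidates over all baskets
full_counts = Counter(item for basket in data for item in basket)
frequent = {item for item in candidates if full_counts[item] >= s}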
Avoid both False Negatives and False Positives, at the cost of making two full passes.
1st pass to find candidates.
2nd pass to count all the candidates and check.
First Map Function: $(F,1)$, where $F$ is a frequent itemset from the sample.
First Reduce Function: combine all the $F$ to construct the candidate itemsets.
Second Map Function: $(C,v)$, where $C$ is one of the candidate itemsets and $v$ is its support counted among the baskets of the chunk.
Second Reduce Function: sum the supports of each candidate and keep those whose total is at least $s$, i.e. the truly frequent itemsets.
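The two MapReduce rounds can be simulated in plain Python; in this sketch each chunk holds half of the baskets, only singleton itemsets are mined, and the thresholds are illustrative.
from collections import Counter

s, p = 4, 0.5                                   # global threshold; each chunk is a fraction p
chunks = [data.iloc[:4], data.iloc[4:]]

# first Map: emit (F, 1) for every itemset frequent in the chunk (threshold p*s)
def map1(chunk):
    counts = Counter(item for basket in chunk for item in basket)
    return [(F, 1) for F, c in counts.items() if c >= p * s]

# first Reduce: the candidates are the union of all locally frequent itemsets
candidates = {F for chunk in chunks for F, _ in map1(chunk)}

# second Map: (C, v) with v = support of candidate C within the chunk
def map2(chunk):
    return [(C, sum(C in basket for basket in chunk)) for C in candidates]

# second Reduce: sum the supports and keep the truly frequent itemsets
totals = Counter()
for chunk in chunks:
    for C, v in map2(chunk):
        totals[C] += v
{C for C, v in totals.items() if v >= s}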
pass over a small sample and one full pass over the data.
avoid both FN and FP, but there is a small probability that it will fail to produce any answer at all.
1st pass: find the candidate itemsets in the sample (with a lowered threshold such as $0.9ps$), and construct the negative border $N$: the itemsets that are not frequent in the sample but all of whose immediate subsets are.
2nd pass: count all itemsets in $F$ (those frequent in the sample) and in $N$, then check.
eliminate FP $\gets$ check against the full dataset.
eliminate FN (namely, show that every itemset that is truly frequent was already found in the sample) whenever no member of $N$ is frequent in the whole:
Proof:
When no member of $N$ is frequent in the whole,
there can be no itemset $S$ whatsoever that is: frequent in the whole, but neither in $N$ nor among the itemsets frequent in the sample.
Proof by Contradiction:
Suppose such an $S$ exists: $S$ is frequent in the whole, but it is neither in $N$ nor frequent in the sample, and yet the algorithm produces output because no member of $N$ is frequent in the whole.
Let $T$ be a subset of $S$ that is of the smallest possible size among all subsets of $S$ that are not frequent in the sample (such a $T$ exists because $S$ itself is not frequent in the sample).
Then every immediate subset of $T$ is frequent in the sample, so $T$ belongs to the negative border $N$; but $T$, being a subset of the frequent itemset $S$, is itself frequent in the whole, contradicting the assumption that no member of $N$ is frequent in the whole.
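The negative border $N$ used above can be constructed as follows; this is a small sketch in which `sample_frequent` (the itemsets found frequent in the sample) and `items` (the item universe) are hypothetical inputs.
from itertools import combinations

def negative_border(sample_frequent, items):
    # N = itemsets NOT frequent in the sample, all of whose immediate subsets ARE
    sample_frequent = set(sample_frequent)
    sizes = [len(F) for F in sample_frequent]
    border = set()
    for k in range(1, (max(sizes) if sizes else 0) + 2):
        for cand in map(frozenset, combinations(sorted(items), k)):
            subsets_ok = all(cand - {x} in sample_frequent for x in cand)
            if cand not in sample_frequent and (len(cand) == 1 or subsets_ok):
                border.add(cand)         # for singletons the only immediate subset is {}
    return border

# hypothetical example: {a} and {b} frequent in the sample; {c} and {a,b} are not
negative_border({frozenset('a'), frozenset('b')}, items='abc')
# -> {frozenset({'c'}), frozenset({'a', 'b'})}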
When the frequent-itemsets algorithm finishes, we have an estimate of the frequent itemsets in the stream.
Then we have several options:
Use the collection at hand, but start running another iteration immediately.
Continue to count the frequent itemsets, and
a decaying window on a stream:
record all items whose score was at least $1/2$.
unfold each basket directly into its items, $\{a,b\}, \{c,d\} \to a, b, c, d$, and score the individual items.
cons: We want to find all frequent itemsets, not just singleton itemsets.
Start scoring certain itemsets as soon as we see one instance, but be conservative about which itemsets we start scoring; otherwise there would be too many counts.
eg: Only start an itemset $I$ if all its immediate subsets are already being scored.
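A minimal decaying-window sketch for single items, treating the `data` baskets as the stream and using an illustrative decay constant $c$; extending it to itemsets would add the rule just described (start scoring $I$ only when all its immediate subsets are already being scored).
c = 1e-2                                    # decay constant, illustrative
scores = {}

def process_basket(basket):
    # multiply every current score by (1 - c), dropping those that fall below 1/2
    for item in list(scores):
        scores[item] *= (1 - c)
        if scores[item] < 0.5:
            del scores[item]
    # add 1 to the score of each item in the arriving basket
    for item in basket:
        scores[item] = scores.get(item, 0.0) + 1.0

for basket in data:                         # the stream of baskets
    process_basket(basket)
scores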
The big disadvantage of the decaying window is:
it requires us to maintain scores for every itemset whose score is at least $1/2$, and there may be many of them,
while limiting their number by increasing $c$ will force us to accept information that tracks the local fluctuations in frequency too closely.
Solution: