Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can then be used to derive association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits). Each transaction is seen as a set of items (an itemset). Given a support threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
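As a concrete illustration, here is a minimal Python sketch of support counting over a toy transaction set (the data and the `support` helper are my own, not from the text):

```python
# Toy transaction database (assumed data); each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# With a threshold C = 3, {diapers, beer} is frequent while {milk, beer} is not.
print(support({"diapers", "beer"}, transactions))  # 3
print(support({"milk", "beer"}, transactions))     # 2
```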
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
The pseudo code for the algorithm is given below for a transaction database $T$, and a support threshold of $\epsilon$. Usual set theoretic notation is employed, though note that $T$ is a multiset. $C_k$ is the candidate set for level $k$. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. $count[c]$ accesses a field of the data structure that represents candidate set $c$, which is initially assumed to be zero.
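The level-wise scheme just described can be sketched in Python: candidates for level $k$ are joined from the previous level's frequent sets and pruned with the downward closure lemma, and each level costs one scan of $T$. The function name, variable names, and toy database are my own, not from the text:

```python
from collections import defaultdict
from itertools import combinations

def apriori(T, eps):
    """Frequent itemsets of the transaction multiset T with support >= eps.

    Returns a dict mapping frozenset -> support count.
    """
    # L_1: frequent individual items.
    counts = defaultdict(int)
    for t in T:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s: c for s, c in counts.items() if c >= eps}
    result = dict(L)
    k = 2
    while L:
        # Candidate generation: join L_{k-1} with itself; downward closure
        # keeps only candidates whose (k-1)-subsets are all frequent.
        prev = set(L)
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        C = {c for c in C
             if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One scan of T counts every surviving candidate.
        count = defaultdict(int)
        for t in T:
            for c in C:
                if c <= t:
                    count[c] += 1
        L = {c: n for c, n in count.items() if n >= eps}
        result.update(L)
        k += 1
    return result

# Toy database (assumed data): with eps = 3 every pair is frequent,
# but the triple {a, b, c} only reaches support 2 and is pruned.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(T, 3)
```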
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation produces large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all $2^{|S|}-1$ of its proper subsets.
Later algorithms such as Max-Miner try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach.
In data mining, the task of finding frequent patterns in large databases is very important and has been studied extensively in the past few years.
The FP-Growth algorithm is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree).
The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the itemset association information.
In simple words, the algorithm works as follows: first it compresses the input database, creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. With this strategy, FP-Growth reduces search costs by recursively looking for short patterns and then concatenating them into longer frequent patterns, offering good selectivity.
In large databases, it may not be possible to hold the FP-tree in main memory. A strategy to cope with this problem is to first partition the database into a set of smaller databases (called projected databases), and then construct an FP-tree from each of these smaller databases.
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database.
The FP-tree is defined as follows: it consists of one root labelled "null", a set of item-prefix subtrees as children of the root, and a frequent-item-header table. Each node in an item-prefix subtree carries three fields: item-name, count (the number of transactions represented by the portion of the path reaching this node), and node-link (a pointer to the next node in the tree carrying the same item-name, or null if there is none). Each entry in the frequent-item-header table has two fields: item-name and head of node-link (a pointer to the first node in the FP-tree carrying that item-name).
Additionally, the frequent-item-header table can hold the support count for each item.
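The node structure and header table described above can be sketched in Python (field names are mine; the layout follows the description in the text):

```python
class FPNode:
    """One FP-tree node, following the structure described above."""
    def __init__(self, item, parent=None):
        self.item = item        # item name (None for the root)
        self.count = 1          # transactions sharing the path down to this node
        self.parent = parent    # parent node, for walking prefix paths upward
        self.children = {}      # item name -> child FPNode
        self.node_link = None   # next node in the tree carrying the same item

# A two-node tree and its frequent-item-header table,
# here mapping item -> [support count, head of node-links].
root = FPNode(None)
root.count = 0                  # the root represents no item
node = FPNode("milk", parent=root)
root.children["milk"] = node
header_table = {"milk": [node.count, node]}
```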
Input: A transaction database DB and a minimum support threshold $\xi$.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.
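The two-scan construction can be sketched in Python as follows (the `build_fptree` name and the toy data are assumptions of this sketch; node-links are replaced by per-item node lists for brevity):

```python
from collections import defaultdict

class FPNode:
    """Minimal FP-tree node (node-links omitted for brevity)."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(transactions, min_support):
    # Scan 1: count items, keep the frequent ones, fix a support-descending order.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    # Scan 2: insert each transaction's frequent items in that fixed order,
    # sharing prefixes with previously inserted transactions.
    root, header = FPNode(None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)   # stands in for the node-links
            child.count += 1
            node = child
    return root, header, freq

# Toy database (assumed data); with min_support = 3, item "e" is pruned.
transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
root, header, freq = build_fptree(transactions, 3)
```

The counts of all nodes carrying an item sum to that item's support, since every transaction containing the item passes through exactly one such node.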
Input: A database DB, represented by an FP-tree constructed according to Algorithm 1, and a minimum support threshold $\xi$.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, $a$){
    if Tree contains a single prefix path P then // Mining single-prefix-path FP-tree
        generate freq pattern set(P): each combination $\beta$ of the nodes in P yields pattern $\beta \cup a$, with support = minimum support of the nodes in $\beta$;
    let Q be the multipath part of Tree;
    for each item $a_i$ in Q do { // Mining multipath FP-tree
        generate pattern $\beta = a_i \cup a$ with support = $a_i$.support; construct $\beta$'s conditional pattern base and $\beta$'s conditional FP-tree Tree$_\beta$;
        if Tree$_\beta \neq \emptyset$ then call FP-growth(Tree$_\beta$, $\beta$); // the patterns found form freq pattern set(Q)
    }
    return(freq pattern set(P) $\cup$ freq pattern set(Q) $\cup$ (freq pattern set(P) × freq pattern set(Q)))
}
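The procedure above can be sketched in Python. This simplified version omits the single-prefix-path optimization and always mines through conditional pattern bases; all names and the toy data are my own:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(pattern_base, min_support):
    """Build an FP-tree from (items, count) pairs; return root, header, item counts."""
    counts = defaultdict(int)
    for items, cnt in pattern_base:
        for item in items:
            counts[item] += cnt
    freq = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    root, header = FPNode(None), defaultdict(list)
    for items, cnt in pattern_base:
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += cnt
            node = child
    return root, header, freq

def fp_growth(pattern_base, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern in the base."""
    root, header, freq = build_fptree(pattern_base, min_support)
    for item, support in freq.items():
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: the prefix path of every node carrying `item`.
        cond = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                cond.append((path, node.count))
        yield from fp_growth(cond, min_support, pattern)

# Toy database (assumed data); each transaction gets an initial count of 1.
transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
found = dict(fp_growth([(t, 1) for t in transactions], 3))
# Only the four singletons reach support 3 here: {a}:4, {b}:3, {c}:3, {d}:3.
```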