Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets, as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can then be used to derive association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.
Apriori is designed to operate on databases containing transactions (for example, collections of items bought by customers, or details of website visits). Each transaction is seen as a set of items (an itemset). Given a support threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.
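As a concrete illustration, here is a minimal Python sketch of support counting over a toy transaction set (the data and the `support` helper are my own, not from the text):

```python
# Toy transaction database (assumed data); each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# With a threshold C = 3, {diapers, beer} is frequent while {milk, beer} is not.
print(support({"diapers", "beer"}, transactions))  # 3
print(support({"milk", "beer"}, transactions))     # 2
```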
Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
The pseudo code for the algorithm is given below for a transaction database $T$, and a support threshold of $\epsilon$. Usual set theoretic notation is employed, though note that $T$ is a multiset. $C_k$ is the candidate set for level $k$. At each step, the algorithm is assumed to generate the candidate sets from the large item sets of the preceding level, heeding the downward closure lemma. $count[c]$ accesses a field of the data structure that represents candidate set $c$, which is initially assumed to be zero.
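The level-wise scheme just described can be sketched in Python: candidates for level $k$ are joined from the previous level's frequent sets and pruned with the downward closure lemma, and each level costs one scan of $T$. The function name, variable names, and toy database are my own, not from the text:

```python
from collections import defaultdict
from itertools import combinations

def apriori(T, eps):
    """Frequent itemsets of the transaction multiset T with support >= eps.

    Returns a dict mapping frozenset -> support count.
    """
    # L_1: frequent individual items.
    counts = defaultdict(int)
    for t in T:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s: c for s, c in counts.items() if c >= eps}
    result = dict(L)
    k = 2
    while L:
        # Candidate generation: join L_{k-1} with itself; downward closure
        # keeps only candidates whose (k-1)-subsets are all frequent.
        prev = set(L)
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        C = {c for c in C
             if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One scan of T counts every surviving candidate.
        count = defaultdict(int)
        for t in T:
            for c in C:
                if c <= t:
                    count[c] += 1
        L = {c: n for c, n in count.items() if n >= eps}
        result.update(L)
        k += 1
    return result

# Toy database (assumed data): with eps = 3 every pair is frequent,
# but the triple {a, b, c} only reaches support 2 and is pruned.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq = apriori(T, 3)
```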
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation produces large numbers of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all $2^{|S|}-1$ of its proper subsets.
Later algorithms such as Max-Miner try to identify the maximal frequent item sets without enumerating their subsets, and perform "jumps" in the search space rather than a purely bottom-up approach.
In data mining, the task of finding frequent patterns in large databases is very important and has been studied extensively in the past few years.
The FP-Growth algorithm is an efficient and scalable method for mining the complete set of frequent patterns by pattern fragment growth, using an extended prefix-tree structure for storing compressed and crucial information about frequent patterns named frequent-pattern tree (FP-tree).
The FP-Growth algorithm is an alternative way to find frequent itemsets without using candidate generation, thus improving performance. To do so, it uses a divide-and-conquer strategy. The core of this method is the usage of a special data structure named frequent-pattern tree (FP-tree), which retains the itemset association information.
In simple words, the algorithm works as follows: first it compresses the input database, creating an FP-tree instance to represent frequent items. After this first step it divides the compressed database into a set of conditional databases, each one associated with one frequent pattern. Finally, each such database is mined separately. With this strategy, FP-Growth reduces search costs by recursively looking for short patterns and then concatenating them into longer frequent patterns, offering good selectivity.
In large databases, it may not be possible to hold the FP-tree in main memory. A strategy to cope with this problem is to first partition the database into a set of smaller databases (called projected databases), and then construct an FP-tree from each of these smaller databases.
The frequent-pattern tree (FP-tree) is a compact structure that stores quantitative information about frequent patterns in a database.
The FP-tree is defined as follows: it consists of one root labelled "null", a set of item-prefix subtrees as children of the root, and a frequent-item-header table. Each node in an item-prefix subtree carries three fields: item-name, count (the number of transactions represented by the portion of the path reaching this node), and node-link (a pointer to the next node in the tree carrying the same item-name, or null if there is none). Each entry in the frequent-item-header table has two fields: item-name and head of node-link (a pointer to the first node in the FP-tree carrying that item-name).
Additionally, the frequent-item-header table can hold the support count for each item.
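The node structure and header table described above can be sketched in Python (field names are mine; the layout follows the description in the text):

```python
class FPNode:
    """One FP-tree node, following the structure described above."""
    def __init__(self, item, parent=None):
        self.item = item        # item name (None for the root)
        self.count = 1          # transactions sharing the path down to this node
        self.parent = parent    # parent node, for walking prefix paths upward
        self.children = {}      # item name -> child FPNode
        self.node_link = None   # next node in the tree carrying the same item

# A two-node tree and its frequent-item-header table,
# here mapping item -> [support count, head of node-links].
root = FPNode(None)
root.count = 0                  # the root represents no item
node = FPNode("milk", parent=root)
root.children["milk"] = node
header_table = {"milk": [node.count, node]}
```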
Input: A transaction database DB and a minimum support threshold $\xi$.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
By using this algorithm, the FP-tree is constructed in two scans of the database. The first scan collects and sorts the set of frequent items, and the second constructs the FP-tree.
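The two-scan construction can be sketched in Python as follows (the `build_fptree` name and the toy data are assumptions of this sketch; node-links are replaced by per-item node lists for brevity):

```python
from collections import defaultdict

class FPNode:
    """Minimal FP-tree node (node-links omitted for brevity)."""
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(transactions, min_support):
    # Scan 1: count items, keep the frequent ones, fix a support-descending order.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    # Scan 2: insert each transaction's frequent items in that fixed order,
    # sharing prefixes with previously inserted transactions.
    root, header = FPNode(None), defaultdict(list)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)   # stands in for the node-links
            child.count += 1
            node = child
    return root, header, freq

# Toy database (assumed data); with min_support = 3, item "e" is pruned.
transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
root, header, freq = build_fptree(transactions, 3)
```

The counts of all nodes carrying an item sum to that item's support, since every transaction containing the item passes through exactly one such node.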
Input: A database DB, represented by an FP-tree constructed according to Algorithm 1, and a minimum support threshold $\xi$.
Output: The complete set of frequent patterns.
Method: call FP-growth(FP-tree, null).
Procedure FP-growth(Tree, $a$){
    if Tree contains a single prefix path P then // Mining single-prefix-path FP-tree
        generate freq pattern set(P): each combination $\beta$ of the nodes in P yields pattern $\beta \cup a$, with support = minimum support of the nodes in $\beta$;
    let Q be the multipath part of Tree;
    for each item $a_i$ in Q do { // Mining multipath FP-tree
        generate pattern $\beta = a_i \cup a$ with support = $a_i$.support; construct $\beta$'s conditional pattern base and $\beta$'s conditional FP-tree Tree$_\beta$;
        if Tree$_\beta \neq \emptyset$ then call FP-growth(Tree$_\beta$, $\beta$); // the patterns found form freq pattern set(Q)
    }
    return(freq pattern set(P) $\cup$ freq pattern set(Q) $\cup$ (freq pattern set(P) × freq pattern set(Q)))
}
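The procedure above can be sketched in Python. This simplified version omits the single-prefix-path optimization and always mines through conditional pattern bases; all names and the toy data are my own:

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fptree(pattern_base, min_support):
    """Build an FP-tree from (items, count) pairs; return root, header, item counts."""
    counts = defaultdict(int)
    for items, cnt in pattern_base:
        for item in items:
            counts[item] += cnt
    freq = {i: c for i, c in counts.items() if c >= min_support}
    rank = {item: r for r, item in
            enumerate(sorted(freq, key=lambda i: (-freq[i], i)))}
    root, header = FPNode(None), defaultdict(list)
    for items, cnt in pattern_base:
        node = root
        for item in sorted((i for i in items if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += cnt
            node = child
    return root, header, freq

def fp_growth(pattern_base, min_support, suffix=frozenset()):
    """Yield (pattern, support) for every frequent pattern in the base."""
    root, header, freq = build_fptree(pattern_base, min_support)
    for item, support in freq.items():
        pattern = suffix | {item}
        yield pattern, support
        # Conditional pattern base: the prefix path of every node carrying `item`.
        cond = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                cond.append((path, node.count))
        yield from fp_growth(cond, min_support, pattern)

# Toy database (assumed data); each transaction gets an initial count of 1.
transactions = [{"a", "b"}, {"b", "c", "d"}, {"a", "c", "d", "e"},
                {"a", "d", "e"}, {"a", "b", "c"}]
found = dict(fp_growth([(t, 1) for t in transactions], 3))
# Only the four singletons reach support 3 here: {a}:4, {b}:3, {c}:3, {d}:3.
```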