Association Rule Mining

Association rule mining

Association rules are one of the major techniques of data mining. It finds frequent patterns, associations, correlations or informal structures among sets of items or objects in transactional databases and other information repositories.It is one of the most important data mining tasks, which aims at finding interesting associations and correlation relationships among large sets of data items. A typical example of association rule mining is market basket analysis which analyzes customer buying habits by finding associations between different items that customers place in their “shopping baskets”. These associations can help retailers develop better marketing strategies. The process of association rule mining contains the following two steps:

Step 1: Find all frequent itemsets. A set of items is referred to as an itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.

A frequent itemset is an itemset whose occurrence frequency is larger than a predetermined minimum support count.

Step 2: Generate strong association rules from the frequent itemsets. Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong association rules.

The techniques for discovering association rules from the data have traditionally focused on identifying relationships between items, which show some aspect of human behavior e.g. buying behavior for determining items that customers buy together. Therefore, the association rules of one type describe a particular local pattern which can be easily interpreted and communicated Analysis of market basket analysis is shown in figure below:

TID Items
1

2

3

4

{Shoes, Skirt, Jacket}

{Shoes, Jacket}

{Shoes, Jeans}

{Shirt, Sweatshirt}

Figure: Analysis of market basket transaction

 

The two important basic measures of association rules are support and confidence; If there are two items then support is defined as the ratio of occurrence of that two items and a total number of transactions. Minimum support is that if rules which have supported greater than a user-defined support. Minimum confidence is one in which rules have a confidence greater than a user-defined confidence.

 

Apriori Algorithm

The Apriori algorithm was proposed by Agarwal and Srikant in 1994. Apriori is designed to operate on databases containing transactions. Other algorithms are designed for finding association rules in data having no transactions, or having no timestamps. Each transaction is seen as a set of items. Given a threshold C, the Apriori algorithm identifies the item sets which are subsets of at least C transactions in the database.

Apriori is a breadth-first search, candidate set generation-and-test algorithm. It is a Bottom-up generation of Frequent item set combinations. It needs l database scans if the maximal length of frequent itemsets is l. Support counting is expensive, Multiple database scans (I/O), breath first search algorithm to be used. It has two steps: Find all itemsets that have minimum support frequent itemsets. Use frequent itemsets to generate rules. The Apriori property follows a two-step process:

  1. Join step: Ck is generated by joining Lk-1 with itself
  2. Prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

 

Working with Apriori Algorithm

Step 1: Apriori algorithm starts working by scanning the transaction database to get the support S of each 1-item set. Compare S with min_support and get a set of frequent 1-itemsets, L1.

Step 2: Use Lk-1  join Lk-1 to generate a set of candidate k-itemsets. Use Apriori property to prune the unfrequented k-itemsets from this set.

Step 3: Scan the transaction database to get the support S of each candidate k-itemset in the final set. Compare S with min_support, and get a set of frequent k-itemsets, Lk

Step 4: If the candidate set is not null then step 2 will be repeated, if it is null then step 5 will be performed.

Step 5: For each frequent itemset 1, generate all nonempty subsets of 1.

Step 6: For every nonempty subset s of l, output the rule “s=> (l-s)” if confidence C of the rule

“s=> (l-s)” (=support S of l/support S of s)’ min_conf

Limitations of Apriori Algorithm:

  1. Needs several iterations of the data. Uses a uniform minimum support threshold.
  2. Difficulties to find rarely occurring events.
  3. Alternative methods can address this by using a non-uniform minimum support threshold, some competing alternative approaches focus on partition and sampling.
  4. It consumes huge memory for storage of itemsets.
  5. Candidate generation could be slow and also generates duplicates depending on the implementation.
  6. Constant items make the algorithm a lot heavier.

Apriori is Candidate generation generates large numbers of subsets ,the algorithm attempts to load up the candidate set with as many as possible before each scan. Bottom-up subset exploration essentially a breadth-first traversal of the subset lattice finds any maximal subset S only after all of its proper subset.

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have spawned other algorithms. Candidate generation generates large numbers of subsets Bottom-up subset exploration finds any maximal subset S only after all of its proper subsets.

Later algorithms such as Max-Miner try to identify the maximal frequent itemsets without enumerating their subsets and perform “jumps” in the search space rather than a purely bottom-up approach

FP Growth Algorithm

FP Growth allows frequent itemset discovery without candidate itemset generation. It is an improvement of apriori algorithm which is used for eliminating the heavy bottlenecks of apriori algorithm. FP Growth algorithm overcomes all the problems of apriori algorithm. It uses divide and conquers rule. First, it compresses the database which represents the frequent items into FP tree and divides the compressed database into condition database. It is two steps approach.

Step1: Build a compact data structure called the FP-Tree. FP-Tree is built using two passes over the dataset.

Step 2: Extracts frequent itemsets directly from the FP-Tree.

Algorithm: FP growth. Mine frequent itemsets using an FP-tree by pattern fragment growth. Input: Consider D, a transaction database;            min sup, the minimum support count threshold.

Output: The complete set of frequent patterns.

Method:

  1. The FP-tree is constructed with the help of following steps:
  • Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support count descending order as L, the list of frequent items.
  • Create the root of an FP-tree, and label it as “null.” For each transaction, Trans in D does the following.

Select and sort the frequent items in Trans according to the order of L. Let the sorted frequent item list in Trans be [p|P], where p is the first element and P is the remaining list. Call insert tree([p|P] , T),which is performed as follows. IfT has a child N such that N.item-name = p.itemname, then increment N’s count by 1; else create a new node N, and let it count be 1, its parent link be linked to T, and its node-link to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert tree(P, N) recursively.

  1. The FP-tree is mined by calling FP growth (FP tree, null), which is implemented as follows.

procedure FP growth(Tree, α) if Tree contains a single path P then for each combination (denoted as β) of the nodes in the path P generate pattern β  α with support count = minimum support count o f nodes in β; else for each ai in the header of Tree { generate pattern β = ai α with support count = ai.support count; construct β’s conditional pattern base and then β’s conditional FP tree Treeβ; if Treeβ 6= 0/ then call FP growth(Treeβ, β); }

 

Advantages of using FP Growth Algorithm

  • FP Growth requires less memory.
  • It has a compact structure and no candidate generation.
  • It has smaller execution time.
  • It is more efficient to find long patterns
  • FP Growth algorithm has better performance as compared to Apriori algorithm.
  • It has more accuracy and small utility loss.
  • FP Growth algorithm requires only two passes over the dataset.
  • It is much faster than the Apriori algorithm

 

Share

Leave a Reply

Your email address will not be published. Required fields are marked *