4. Mining Frequent Items and Associations
Mining Frequent Items and Association Rules is an important area of Data Mining that focuses on discovering hidden relationships, patterns, and associations among items in large databases.
Organizations generate massive transactional data daily from:
- Retail stores
- Online shopping platforms
- Banking systems
- Healthcare systems
- Websites
Analyzing this data manually is difficult. Association analysis helps discover:
- Frequently occurring items
- Customer purchasing patterns
- Product relationships
- Hidden trends
This information helps organizations:
- Improve sales
- Increase profit
- Enhance recommendation systems
- Make better business decisions
4.1 Frequent Item Set
1. Definition of Frequent Item Set
A Frequent Item Set is a set of one or more items that occur together in a transaction database with a frequency (support) greater than or equal to a specified minimum support threshold.
In simple words, if certain items appear together many times in transactions, they are called frequent item sets.
2. Important Terminologies
Before understanding frequent item sets, some basic terms are important.
1. Item
An item is a single product or object in a transaction.
Examples
- Bread
- Milk
- Butter
- Rice
2. Item Set
An item set is a collection of one or more items.
Examples
- {Bread}
- {Milk, Butter}
- {Bread, Milk, Eggs}
3. Transaction
A transaction is a collection of items purchased together by a customer.
Example Transaction Table
| Transaction ID | Items Purchased |
|---|---|
| T1 | Bread, Milk |
| T2 | Bread, Butter |
| T3 | Milk, Butter |
| T4 | Bread, Milk, Butter |
4. Transaction Database
A transaction database is a collection of transactions.
3. Support Measure
Support is one of the most important measures in frequent item set mining.
It measures how frequently an item set appears in the transaction database.
Formula for Support
$$\text{Support}(A)=\frac{\text{Number of transactions containing } A}{\text{Total number of transactions}}$$
Example of Support Calculation
Consider the following transactions:
| Transaction ID | Items |
|---|---|
| T1 | Bread, Milk |
| T2 | Bread, Butter |
| T3 | Bread, Milk |
| T4 | Milk, Butter |
| T5 | Bread, Milk |
Find support of:
- {Bread, Milk}
Step 1: Count Transactions Containing {Bread, Milk}
Present in:
- T1
- T3
- T5
Total = 3 transactions
Step 2: Total Transactions
Total transactions = 5
Step 3: Calculate Support
$$\text{Support}(\{\text{Bread},\text{Milk}\})=\frac{3}{5}=0.6$$
Support = 0.6 = 60%
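The calculation above can be reproduced with a few lines of Python. This is a minimal sketch: the function name `support` and the representation of transactions as Python sets are illustrative assumptions, not part of any particular library.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    # A transaction "contains" the item set when the item set is a subset of it.
    matches = sum(1 for t in transactions if itemset <= set(t))
    return matches / len(transactions)


# The five transactions T1..T5 from the table above.
transactions = [
    {"Bread", "Milk"},    # T1
    {"Bread", "Butter"},  # T2
    {"Bread", "Milk"},    # T3
    {"Milk", "Butter"},   # T4
    {"Bread", "Milk"},    # T5
]

bread_milk_support = support({"Bread", "Milk"}, transactions)
print(bread_milk_support)  # 0.6
```

The subset test (`<=` on sets) is what makes the count match Step 1: only T1, T3, and T5 contain both Bread and Milk.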
4. Minimum Support Threshold
Minimum support is a user-defined value used to determine whether an item set is frequent.
Example
If minimum support = 50%
then:
- Item sets with support ≥ 50% are considered frequent.
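To see the threshold in action, the sketch below applies a minimum support of 50% to the five transactions from the support example. It checks every possible combination of items by brute force, which is only reasonable for tiny examples; the function name `frequent_itemsets` is an assumption made for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every item set meeting min_support.

    Brute force: every combination of items is checked, so this is
    only practical for very small item universes.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[frozenset(combo)] = count / n
    return result


transactions = [
    {"Bread", "Milk"},    # T1
    {"Bread", "Butter"},  # T2
    {"Bread", "Milk"},    # T3
    {"Milk", "Butter"},   # T4
    {"Bread", "Milk"},    # T5
]

frequent = frequent_itemsets(transactions, min_support=0.5)
for itemset, value in frequent.items():
    print(sorted(itemset), value)
```

With this data, {Bread} and {Milk} (support 0.8 each) and {Bread, Milk} (support 0.6) pass the threshold, while {Butter} (support 0.4) and every item set containing it do not.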
5. Importance of Frequent Item Sets
Frequent item sets are important because they help discover:
- Customer buying habits
- Product relationships
- Frequent patterns
- Hidden trends
They are the foundation for:
- Association rule mining
- Recommendation systems
- Market basket analysis
6. Applications of Frequent Item Sets
1. Retail Industry
Identifying products commonly purchased together.
Example
Customers frequently buy bread and butter together.
2. E-Commerce Websites
Recommendation systems suggest related products.
Example
“Customers who bought this also bought…”
3. Medical Diagnosis
Finding diseases and symptoms occurring together.
4. Web Usage Mining
Identifying web pages that are frequently visited together.
4.2 Closed Item Set
1. Definition of Closed Item Set
A Closed Item Set is a frequent item set for which none of its supersets has the same support count.
In simple words, an item set is closed if adding any other item to it lowers its support.
2. Explanation of Closed Item Set
Closed item sets help reduce redundancy in mining results.
Many frequent item sets may contain duplicate information. Closed item sets provide a compact representation without losing support information.
3. Example of Closed Item Set
Consider transactions:
| Transaction ID | Items |
|---|---|
| T1 | Bread, Milk |
| T2 | Bread, Milk |
| T3 | Bread, Butter |
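Support table:
| Item Set | Support |
|---|---|
| {Bread} | 3 |
| {Milk} | 2 |
| {Bread, Milk} | 2 |
| {Bread, Butter} | 1 |
Since:
- Support of {Bread} = 3
- Support of {Bread, Milk} = 2
- Support of {Bread, Butter} = 1
every superset of {Bread} has a lower support than {Bread} itself.
Therefore:
- {Bread} is closed.
In contrast, {Milk} is not closed, because its superset {Bread, Milk} has the same support (2).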
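The closedness check can also be carried out mechanically. The following is a minimal sketch over the three transactions of this example; the helper names `support_counts` and `closed_itemsets` are hypothetical, and item sets that never occur are simply left out.

```python
from itertools import combinations


def support_counts(transactions):
    """Support count of every item set that occurs in at least one transaction."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count > 0:
                counts[frozenset(combo)] = count
    return counts


def closed_itemsets(counts):
    """Keep item sets that have no proper superset with the same support count."""
    closed = {}
    for itemset, count in counts.items():
        has_equal_superset = any(
            itemset < other and count == other_count
            for other, other_count in counts.items()
        )
        if not has_equal_superset:
            closed[itemset] = count
    return closed


transactions = [
    {"Bread", "Milk"},    # T1
    {"Bread", "Milk"},    # T2
    {"Bread", "Butter"},  # T3
]

closed = closed_itemsets(support_counts(transactions))
for itemset, count in closed.items():
    print(sorted(itemset), count)
```

On this data the closed item sets are {Bread} (3), {Bread, Milk} (2), and {Bread, Butter} (1). {Milk} and {Butter} are dropped because a superset has exactly the same count, so their supports can be recovered from the closed sets.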
4. Characteristics of Closed Item Sets
1. No Superset with Same Support
A closed item set has no larger item set with identical support.
2. Compact Representation
Closed item sets reduce the number of patterns stored.
3. Preserves Frequency Information
Support values remain meaningful and accurate.
4. Reduces Redundancy
Duplicate frequent patterns are removed.
5. Advantages of Closed Item Sets
1. Reduced Storage Requirement
Fewer patterns need to be stored.
2. Improved Mining Efficiency
Reduces processing complexity.
3. Simplified Analysis
Users can analyze patterns more easily.
4. Eliminates Redundant Information
Only meaningful patterns are retained.
4.3 Association Rule Mining
1. Definition of Association Rule Mining
Association Rule Mining is a Data Mining technique used to discover relationships, associations, and correlations among items in large databases.
It identifies:
- Frequently occurring item combinations
- Relationships between products
- Customer purchasing behavior
2. Association Rule
An association rule is represented in the form:
$$A \Rightarrow B$$
where:
- A = Antecedent (Left-hand side)
- B = Consequent (Right-hand side)
Meaning:
- If A occurs, B is likely to occur.
3. Example of Association Rule
Rule:
Bread → Butter
Meaning:
- Customers purchasing bread are likely to purchase butter.
4. Components of Association Rule
1. Antecedent
Items appearing before the arrow.
Example
Bread
2. Consequent
Items appearing after the arrow.
Example
Butter
5. Rule Generation Process
Association rule mining mainly involves two steps.
Step 1: Frequent Item Set Generation
Frequent item sets satisfying minimum support are identified.
Step 2: Rule Generation
Association rules are generated from frequent item sets.
6. Support and Confidence
Support and confidence are used to evaluate association rules.
1. Support
Support measures how often the items of A and B appear together in the database.
Formula
$$\text{Support}(A \Rightarrow B)=\frac{\text{Number of transactions containing both } A \text{ and } B}{\text{Total number of transactions}}$$
2. Confidence
Confidence measures the reliability of the association rule.
Formula
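$$\text{Confidence}(A \Rightarrow B)=\frac{\text{Support}(A \cup B)}{\text{Support}(A)}$$
Here Support(A ∪ B) is the support of the item set that contains every item of A and of B, which is the same quantity as Support(A ⇒ B).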
Example of Confidence Calculation
Suppose:
- Bread appears in 50 transactions
- Bread and Butter together appear in 30 transactions
Confidence:
$$\text{Confidence}(\text{Bread} \Rightarrow \text{Butter})=\frac{30}{50}=0.6$$
Confidence = 60%
Meaning:
- 60% of the transactions that contain bread also contain butter.
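Confidence and the rule generation step can be sketched together in Python. The example reuses the five transactions from the support calculation in Section 4.1; the function names `count`, `confidence`, and `rules_from_itemset` are illustrative assumptions. Dividing counts instead of supports gives the same confidence, because the total number of transactions cancels out.

```python
from itertools import combinations


def count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)


def confidence(antecedent, consequent, transactions):
    """Confidence(A => B) = count(A and B together) / count(A)."""
    both = count(set(antecedent) | set(consequent), transactions)
    # Assumes the antecedent occurs at least once (true for frequent item sets).
    return both / count(antecedent, transactions)


def rules_from_itemset(itemset, transactions, min_confidence):
    """Split one frequent item set into rules A => B that meet min_confidence."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            consequent = itemset - set(antecedent)
            conf = confidence(antecedent, consequent, transactions)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(consequent), conf))
    return rules


transactions = [
    {"Bread", "Milk"},    # T1
    {"Bread", "Butter"},  # T2
    {"Bread", "Milk"},    # T3
    {"Milk", "Butter"},   # T4
    {"Bread", "Milk"},    # T5
]

rules = rules_from_itemset({"Bread", "Milk"}, transactions, min_confidence=0.7)
for antecedent, consequent, conf in rules:
    print(antecedent, "=>", consequent, conf)
```

Here Bread appears in 4 transactions and {Bread, Milk} in 3, so both Bread ⇒ Milk and Milk ⇒ Bread have confidence 3/4 = 0.75 and pass the 0.7 threshold chosen for this illustration.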
7. Importance of Association Rule Mining
1. Discover Hidden Relationships
Finds meaningful product relationships.
2. Supports Business Decisions
Helps improve marketing and sales strategies.
3. Improves Recommendation Systems
Suggests related products to customers.
4. Helps Customer Analysis
Analyzes customer purchasing behavior.
8. Applications of Association Rule Mining
1. Market Basket Analysis
Finding products purchased together.
2. Fraud Detection
Identifying suspicious transaction patterns.
3. Medical Diagnosis
Finding relationships among symptoms and diseases.
4. Web Usage Analysis
Analyzing user navigation behavior.
4.4 Market Basket Analysis
1. Definition of Market Basket Analysis
Market Basket Analysis is a technique used to analyze customer purchasing patterns by identifying products frequently bought together.
It is one of the most common applications of association rule mining.
2. Objective of Market Basket Analysis
The main objectives are:
- Understand customer buying behavior
- Increase sales
- Improve product placement
- Support cross-selling
3. Customer Purchasing Patterns
Purchasing patterns show relationships among products purchased by customers.
Examples
Customers buying:
- Bread may also buy butter
- Mobile phones may also buy earphones
- Chips may also buy soft drinks
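A simple way to surface such patterns is to count how often each pair of products shares a basket. The sketch below does this with the Python standard library; the baskets are made-up data and `pair_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["Bread", "Butter", "Milk"],
    ["Bread", "Butter"],
    ["Chips", "Soft Drink"],
    ["Chips", "Soft Drink", "Bread"],
    ["Mobile Phone", "Earphones"],
]

counts = pair_counts(baskets)
top_pairs = counts.most_common(2)
print(top_pairs)
```

In this toy data, (Bread, Butter) and (Chips, Soft Drink) each occur in two baskets and come out on top. Raw pair counts are only a starting point; support and confidence put them on a comparable scale.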
4. Product Association Analysis
Product association analysis identifies related products.
1. Product Placement
Related products are placed nearby.
Example
Bread and butter are placed on nearby shelves.
2. Cross-Selling
Suggesting additional products.
Example
“Customers also bought…”
3. Combo Offers
Organizations create promotional offers using associations.
Example
Burger + Soft Drink combo.
5. Advantages of Market Basket Analysis
1. Improves Sales
Encourages customers to buy additional products.
2. Enhances Customer Satisfaction
Provides better recommendations.
3. Helps Inventory Management
Improves stock planning.
4. Supports Marketing Strategies
Helps targeted advertising and promotions.
4.5 Classification of Association Rules
Association rules are classified based on dimensions and data types involved.
1. Single-Dimensional Association Rules
Definition
Association rules involving only one dimension or predicate are called single-dimensional association rules.
Example
Buys(Customer, Bread) → Buys(Customer, Butter)
Only the “buys” dimension is used.
Characteristics
- Simple structure
- Easy implementation
- Common in retail analysis
2. Multidimensional Association Rules
Definition
Association rules involving multiple dimensions are called multidimensional association rules.
Example
Age(20-30) ∧ Occupation(Student) → Buys(Laptop)
Dimensions involved:
- Age
- Occupation
- Product
Characteristics
- More informative
- More complex
- Rich analytical insights
3. Boolean Association Rules
Definition
Boolean association rules consider only the presence or absence of items.
Example
Bread → Butter
Only whether items are present or absent is considered.
Characteristics
- Binary values only
- Simple calculations
- Widely used
4. Quantitative Association Rules
Definition
Quantitative association rules involve numerical attributes or quantities.
Example
Age between 20-30 → Purchase amount > 5000
Characteristics
- Uses numerical data
- More detailed analysis
- Requires complex computations
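One common way to handle numerical attributes is to bin them into intervals first, so that each interval behaves like a Boolean item. The sketch below evaluates the example rule this way. The field names, bin boundaries, and customer records are all invented for illustration.

```python
def to_items(record):
    """Turn one customer record into Boolean items by binning numeric fields."""
    items = set()
    # Assumed bins: age 20-30 vs. everything else, purchase amount above 5000 or not.
    items.add("Age=20-30" if 20 <= record["age"] <= 30 else "Age=other")
    items.add("Purchase>5000" if record["purchase_amount"] > 5000 else "Purchase<=5000")
    return items


# Hypothetical customer records.
records = [
    {"age": 25, "purchase_amount": 6000},
    {"age": 28, "purchase_amount": 7000},
    {"age": 22, "purchase_amount": 3000},
    {"age": 45, "purchase_amount": 8000},
    {"age": 50, "purchase_amount": 1000},
]

transactions = [to_items(r) for r in records]

# Confidence of: Age between 20-30 => Purchase amount > 5000
with_age = [t for t in transactions if "Age=20-30" in t]
with_both = [t for t in with_age if "Purchase>5000" in t]
rule_confidence = len(with_both) / len(with_age)
print(rule_confidence)
```

Three of the invented records fall in the 20-30 age bin and two of those exceed 5000, so the rule has confidence 2/3 on this data. The choice of bin boundaries strongly affects which rules appear, which is part of why quantitative rules need more computation.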
4.6 Apriori Algorithm
1. Introduction to Apriori Algorithm
The Apriori Algorithm is one of the most popular algorithms for mining frequent item sets and association rules.
It was proposed by:
- Rakesh Agrawal
- Ramakrishnan Srikant
The algorithm works using the Apriori principle.
2. Principle of Apriori
The Apriori principle states:
“If an item set is frequent, then all of its subsets must also be frequent.”
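This helps reduce unnecessary computations: if an item set is infrequent, none of its supersets can be frequent, so they never need to be counted.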
Example of Apriori Principle
If {Bread, Milk} is frequent, then {Bread} and {Milk} must also be frequent.
If {Bread} is not frequent, then {Bread, Milk} cannot be frequent.
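The principle follows from the fact that a superset can never have higher support than any of its subsets. The sketch below checks this on the five transactions from the support example in Section 4.1; the variable names and the 50% threshold are assumptions made for the illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


transactions = [
    {"Bread", "Milk"},    # T1
    {"Bread", "Butter"},  # T2
    {"Bread", "Milk"},    # T3
    {"Milk", "Butter"},   # T4
    {"Bread", "Milk"},    # T5
]
min_support = 0.5

items = sorted(set().union(*transactions))
all_itemsets = [
    frozenset(combo)
    for size in range(1, len(items) + 1)
    for combo in combinations(items, size)
]

# Look for any superset whose support exceeds that of one of its subsets.
violations = [
    (small, big)
    for small in all_itemsets
    for big in all_itemsets
    if small < big and support(big, transactions) > support(small, transactions)
]
print(violations)  # []

# {Butter} is infrequent at 50%, so every item set containing Butter is too.
butter_sets = [s for s in all_itemsets if "Butter" in s]
print(all(support(s, transactions) < min_support for s in butter_sets))  # True
```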
3. Working of Apriori Algorithm
The algorithm works iteratively level by level.
Steps of Apriori Algorithm
Step 1: Generate Frequent 1-Item Sets
Count support of individual items.
Step 2: Remove Infrequent Items
Items below minimum support are removed.
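Step 3: Candidate Generation
Generate candidate k-item sets by joining the frequent (k-1)-item sets found in the previous pass.
Step 4: Pruning
Remove candidates that have any infrequent (k-1)-item subset, since they cannot be frequent.
Step 5: Calculate Support
Scan the database to count the support of the remaining candidates and keep those that meet minimum support.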
Step 6: Repeat
Repeat until no new frequent item sets are found.
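The level-by-level loop can be sketched compactly in Python. This is an illustrative, unoptimized version rather than a reference implementation: the function name `apriori` and its signature are assumptions. It prunes candidates by their subsets before counting support, and it rescans all transactions for every candidate.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for all frequent item sets (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Frequent 1-item sets: count each item and drop the infrequent ones.
    current = {}
    for item in set().union(*transactions):
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidate generation: unions of two frequent (k-1)-item sets with k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-item subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Support counting: keep candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Transactions T1..T4 from the transaction table in Section 4.1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Butter"},
    {"Milk", "Butter"},
    {"Bread", "Milk", "Butter"},
]

frequent = apriori(transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a 50% threshold, all three single items (support 0.75) and all three pairs (support 0.5) are frequent. The candidate {Bread, Milk, Butter} survives pruning but has support 0.25, so the loop stops after the third level.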
4. Candidate Generation
Candidate generation creates possible frequent item sets.
Example
Frequent 1-item sets:
- Bread
- Milk
- Butter
Candidate 2-item sets:
- {Bread, Milk}
- {Bread, Butter}
- {Milk, Butter}
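Candidate generation is often implemented as a join of sorted item sets that share all but their last item, which avoids producing the same candidate twice. The sketch below assumes that approach; `join_step` is a hypothetical name.

```python
def join_step(frequent_prev):
    """Join frequent (k-1)-item sets that agree on their first k-2 items."""
    # Represent each item set as a sorted tuple so prefixes can be compared.
    prev = sorted(tuple(sorted(s)) for s in frequent_prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:-1] == b[:-1]:
                candidates.append(a + (b[-1],))
    return candidates


# From frequent 1-item sets to candidate 2-item sets (the example above).
candidates_2 = join_step([{"Bread"}, {"Milk"}, {"Butter"}])
print(candidates_2)

# If all three pairs turn out to be frequent, the next level has one candidate.
candidates_3 = join_step(candidates_2)
print(candidates_3)
```

The first call yields the three pairs listed above (in alphabetical order within each tuple), and the second call joins (Bread, Butter) with (Bread, Milk) into the single candidate (Bread, Butter, Milk).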
5. Pruning
Pruning removes unnecessary candidate item sets.
Apriori Pruning Rule
If any subset of a candidate item set is infrequent, then the candidate itself is removed.
Example of Pruning
If {Bread} is infrequent, then {Bread, Milk} cannot be frequent and is removed.
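The pruning rule translates directly into a subset check. The following is a minimal sketch under the assumption that Bread is infrequent while Milk and Butter are frequent; `prune` is a hypothetical helper name.

```python
from itertools import combinations


def prune(candidates, frequent_prev):
    """Drop candidates that have at least one infrequent (k-1)-item subset."""
    prev = {frozenset(s) for s in frequent_prev}
    kept = set()
    for candidate in candidates:
        candidate = frozenset(candidate)
        subsets = combinations(candidate, len(candidate) - 1)
        if all(frozenset(sub) in prev for sub in subsets):
            kept.add(candidate)
    return kept


# Assumed scenario: {Bread} did not reach minimum support.
frequent_1 = [{"Milk"}, {"Butter"}]
candidates_2 = [{"Bread", "Milk"}, {"Bread", "Butter"}, {"Milk", "Butter"}]

kept = prune(candidates_2, frequent_1)
print([sorted(c) for c in kept])  # [['Butter', 'Milk']]
```

Both candidates containing Bread are removed without scanning the database, which is where the savings of the Apriori pruning rule come from.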
6. Advantages of Apriori Algorithm
1. Simple and Easy to Understand
Widely used due to simplicity.
2. Reduces Search Space
Pruning minimizes unnecessary computations.
3. Efficient for Small and Medium Databases
Performs well for moderate datasets.
4. Generates Useful Association Rules
Supports recommendation systems and business analysis.
7. Limitations of Apriori Algorithm
1. Multiple Database Scans
Requires repeated scanning of the database.
2. Large Candidate Sets
Candidate generation becomes expensive for large datasets.
3. High Computational Cost
Performance decreases for huge databases.
4. High Memory Usage
Large candidate sets consume more memory.
8. Applications of Apriori Algorithm
1. Retail Analysis
Finding products purchased together.
2. Recommendation Systems
Suggesting related products.
3. Medical Analysis
Finding related symptoms and diseases.
4. Website Usage Analysis
Identifying web pages that are frequently visited together.