5. Classification and Prediction
Classification and Prediction are important Data Mining techniques used to analyze data and make future decisions based on patterns discovered from historical data.
Both techniques are forms of supervised learning because they train models on previously known (labeled) data.
They are widely used in:
- Banking
- Healthcare
- Business intelligence
- Fraud detection
- Weather forecasting
- Recommendation systems
Classification predicts categorical labels, while prediction estimates continuous numerical values.
5.1 Classification and Prediction
1. Definition of Classification
Classification is a Data Mining technique used to assign data objects into predefined classes or categories based on their attributes.
In classification:
- A model is first trained using labeled data.
- The trained model is then used to classify unknown data.
2. Example of Classification
Examples
- Email classified as:
  - Spam
  - Not Spam
- Student classified as:
  - Pass
  - Fail
- Loan application classified as:
  - Approved
  - Rejected
3. Working of Classification
Classification generally involves two steps.
Step 1: Training Phase
A classification model is built using training data.
The training dataset contains:
- Input attributes
- Known class labels
Example
| Age | Income | Loan Status |
|---|---|---|
| Young | High | Approved |
| Old | Low | Rejected |
Step 2: Testing Phase
The trained model predicts class labels for unknown data.
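The two phases can be sketched in a few lines of Python. This is a minimal illustration built on the two rows of the example table; the nearest-match rule and the names `train` and `predict` are assumptions made for this sketch, not a method the section prescribes.

```python
# Training data from the example table: (input attributes, known class label).
TRAINING_DATA = [
    ({"Age": "Young", "Income": "High"}, "Approved"),
    ({"Age": "Old", "Income": "Low"}, "Rejected"),
]


def train(labeled_rows):
    """Training phase: this toy model simply stores the labeled rows."""
    return list(labeled_rows)


def predict(model, record):
    """Testing phase: return the label of the stored row that shares
    the most attribute values with the unknown record."""
    def matches(row):
        attributes, _label = row
        return sum(1 for name, value in attributes.items()
                   if record.get(name) == value)

    _attributes, best_label = max(model, key=matches)
    return best_label


model = train(TRAINING_DATA)
print(predict(model, {"Age": "Young", "Income": "High"}))  # Approved
print(predict(model, {"Age": "Old", "Income": "Low"}))     # Rejected
```

Real classifiers replace the stored rows with a learned structure such as a decision tree, but the split between a training step and a separate prediction step stays the same.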
4. Definition of Prediction
Prediction is a Data Mining technique used to estimate continuous or numerical values based on historical data.
Unlike classification, prediction estimates numeric values instead of categories.
5. Examples of Prediction
Examples
- Predicting future sales
- Predicting temperature
- Predicting stock market prices
- Predicting employee salary
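As a rough illustration of numeric prediction, the sketch below fits a straight line to a few salary records and uses it to estimate a new value. The least-squares line is one common choice, not the only prediction method; the data values and the name `fit_line` are hypothetical and chosen only for this example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: years of experience -> salary (in thousands).
years = [1, 2, 3, 4]
salaries = [30, 35, 40, 45]

slope, intercept = fit_line(years, salaries)
predicted_salary = slope * 5 + intercept  # estimate for 5 years of experience
print(predicted_salary)  # 50.0
```

The output is a continuous number rather than a class label, which is the distinction this section draws between prediction and classification.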
6. Difference Between Classification and Prediction
| Basis | Classification | Prediction |
|---|---|---|
| Output Type | Categorical values | Continuous numerical values |
| Purpose | Assign class labels | Estimate future values |
| Example | Pass/Fail | Salary prediction |
| Learning Type | Supervised learning | Supervised learning |
7. Applications of Classification and Prediction
Classification and prediction are used in many real-world applications.
1. Banking Sector
Applications
- Loan approval prediction
- Fraud transaction detection
- Credit risk analysis
2. Healthcare
Applications
- Disease diagnosis
- Medical risk prediction
- Patient classification
3. Education Systems
Applications
- Student performance prediction
- Attendance analysis
- Grade classification
4. Business and Marketing
Applications
- Customer segmentation
- Sales prediction
- Product recommendation
5. Weather Forecasting
Applications
- Rainfall prediction
- Temperature forecasting
- Climate analysis
6. E-Commerce
Applications
- Product recommendation systems
- Customer behavior analysis
- Purchase prediction
5.2 Issues Regarding Classification and Prediction
Classification and prediction systems face several challenges that affect their accuracy and efficiency.
Important issues include:
- Overfitting
- Accuracy
- Missing values
- Scalability
1. Overfitting
Definition
Overfitting occurs when a classification model fits the training data too closely, learning its noise and irrelevant details along with the real patterns.
As a result:
- The model performs well on training data
- But it performs poorly on new, unseen data
Explanation
An overfitted model becomes too specific and loses generalization capability.
Example
Suppose a student memorizes exact answers instead of understanding concepts.
The student:
- Performs well on known questions
- Performs poorly on new questions
Similarly, overfitted models fail on new datasets.
Causes of Overfitting
1. Small Training Dataset
Insufficient data causes models to memorize patterns.
2. Too Many Attributes
Large numbers of features increase complexity.
3. Noise in Data
Incorrect or irrelevant data misleads the model.
Problems Caused by Overfitting
- Reduced prediction accuracy
- Poor generalization
- Unreliable results
Methods to Reduce Overfitting
1. Pruning
Removing unnecessary branches in decision trees.
2. Using More Training Data
Larger datasets improve learning.
3. Feature Selection
Removing irrelevant attributes.
4. Cross Validation
Evaluating the model on several different train/test splits of the available data.
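Cross validation can be illustrated by showing how record indices are divided into folds. The sketch below is a simplified k-fold split that assumes `k` is no larger than the number of records and does no shuffling; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_records, k):
    """Return k (train_indices, test_indices) pairs.

    Each record index appears in exactly one test fold, so every record
    is used for testing once and for training k - 1 times.
    """
    indices = list(range(n_records))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_records // k + (1 if i < n_records % k else 0)
                  for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test_indices = indices[start:start + size]
        train_indices = indices[:start] + indices[start + size:]
        splits.append((train_indices, test_indices))
        start += size
    return splits


for train_indices, test_indices in k_fold_splits(6, 3):
    print("train:", train_indices, "test:", test_indices)
```

A model would be trained and scored once per pair, and the scores averaged. A large gap between training accuracy and the averaged test accuracy is a typical sign of overfitting.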
2. Accuracy
Definition
Accuracy measures how correctly a model predicts outcomes.
It is one of the most important performance measures.
Formula for Accuracy
Accuracy=\frac{\text{Correct Predictions}}{\text{Total Predictions}}
Example
Suppose:
- Total predictions = 100
- Correct predictions = 90
Then:
Accuracy=\frac{90}{100}=0.9
Accuracy = 90%
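The same calculation can be written as a small function that compares predicted labels with the true labels. This is a minimal sketch; the function name `accuracy` and the Pass/Fail sample lists are assumptions made for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical labels: 10 predictions, 9 of them correct.
actual = ["Pass"] * 10
predicted = ["Pass"] * 9 + ["Fail"]
print(accuracy(predicted, actual))  # 0.9
```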
Factors Affecting Accuracy
1. Quality of Training Data
Poor-quality data reduces accuracy.
2. Missing Values
Incomplete data affects learning.
3. Noise
Incorrect data lowers prediction performance.
4. Model Selection
Different algorithms provide different accuracy levels.
Importance of Accuracy
- Measures model reliability
- Helps compare classifiers
- Improves decision making
3. Missing Values
Definition
Missing values occur when some attribute values are absent in the dataset.
Causes of Missing Values
1. Data Entry Errors
Information may be skipped accidentally.
2. Hardware or Software Failures
Data may not be recorded properly.
3. User Refusal
Users may not provide certain information.
Example
| Name | Age | Salary |
|---|---|---|
| Amit | 25 | 50000 |
| Ravi | — | 45000 |
The Age value is missing for Ravi.
Problems Caused by Missing Values
- Reduces mining accuracy
- Produces incorrect predictions
- Increases processing difficulty
Methods for Handling Missing Values
1. Ignore Records
Remove tuples with missing values.
2. Manual Filling
Users manually enter missing data.
3. Mean or Average Method
Replace missing numerical values with average values.
4. Most Frequent Value Method
Replace missing values using the most common value.
5. Predictive Methods
Use machine learning models to estimate missing values.
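Three of these methods (ignoring records, the mean method, and the most frequent value method) can be sketched with plain Python dictionaries. Missing values are represented as `None` here, which is an assumption of this sketch; the function names and the third record are hypothetical additions to the example table.

```python
from collections import Counter


def drop_incomplete(records):
    """Ignore records: keep only tuples with no missing (None) values."""
    return [r for r in records if None not in r.values()]


def fill_with_mean(records, field):
    """Mean method: replace missing numeric values with the average of known ones."""
    known = [r[field] for r in records if r[field] is not None]
    mean_value = sum(known) / len(known)
    return [{**r, field: mean_value if r[field] is None else r[field]}
            for r in records]


def fill_with_most_frequent(records, field):
    """Most frequent value method: replace missing values with the commonest one."""
    known = [r[field] for r in records if r[field] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [{**r, field: most_common if r[field] is None else r[field]}
            for r in records]


records = [
    {"Name": "Amit", "Age": 25, "Salary": 50000},
    {"Name": "Ravi", "Age": None, "Salary": 45000},
    {"Name": "Neha", "Age": 35, "Salary": 60000},  # hypothetical extra row
]

print(len(drop_incomplete(records)))            # 2
print(fill_with_mean(records, "Age")[1]["Age"])  # 30.0
```

Dropping records is the simplest option but discards data, while filling keeps every record at the cost of introducing estimated values.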
4. Scalability
Definition
Scalability refers to the ability of a classification or prediction algorithm to handle increasing amounts of data efficiently.
Importance of Scalability
Modern databases contain:
- Millions of records
- Large attribute sets
- Continuous data streams
Algorithms must process data efficiently.
Challenges in Scalability
1. Large Data Volume
Processing huge datasets requires high resources.
2. Memory Limitations
Large datasets consume significant memory.
3. Processing Time
Complex algorithms may become very slow.
Methods to Improve Scalability
1. Parallel Processing
Tasks are divided among multiple processors.
2. Distributed Computing
Processing is distributed across systems.
3. Data Reduction
Reducing dataset size improves efficiency.
4. Efficient Algorithms
Using optimized mining algorithms.
5.3 Comparing Classification Methods
Different classification methods are evaluated based on several criteria.
Important comparison parameters include:
- Accuracy
- Speed
- Robustness
- Interpretability
1. Accuracy
Definition
Accuracy measures the correctness of classification results.
Importance
A highly accurate classifier provides reliable predictions.
Example
A disease diagnosis system with 95% accuracy is, all else being equal, more reliable than one with 70% accuracy.
Factors Affecting Accuracy
- Data quality
- Noise
- Missing values
- Algorithm selection
2. Speed
Definition
Speed refers to both the training speed and the prediction speed of the classification algorithm.
Types of Speed
1. Training Speed
Time required to build the model.
2. Prediction Speed
Time required to classify new data.
Importance
Fast systems are important for:
- Real-time applications
- Online services
- Large databases
3. Robustness
Definition
Robustness refers to the ability of a classifier to handle the following without a major reduction in performance:
- Noise
- Missing values
- Incorrect data
Example
A robust model continues working effectively even if some data is incomplete.
Importance
Real-world data is often imperfect.
Robust systems provide stable performance.
4. Interpretability
Definition
Interpretability means how easily humans can understand the classification model.
Example
Decision trees are highly interpretable because rules are easy to understand.
Importance
Interpretable models help:
- Explain decisions
- Build user trust
- Support business understanding
Summary of Comparison Criteria
| Criterion | Meaning |
|---|---|
| Accuracy | Correctness of predictions |
| Speed | Time required for training and prediction |
| Robustness | Ability to handle noisy or incomplete data |
| Interpretability | Ease of understanding the model |
5.4 Classification by Decision Tree Induction
Decision Tree Induction is one of the most popular classification techniques.
It constructs a tree-like structure for decision making.
Decision trees are widely used because they are:
- Simple
- Easy to understand
- Highly interpretable
1. Decision Tree Concept
Definition
A Decision Tree is a tree-structured classifier where:
- Internal nodes represent tests on attributes
- Branches represent outcomes of tests
- Leaf nodes represent class labels
Structure of Decision Tree
1. Root Node
Represents the topmost decision attribute.
2. Internal Nodes
Represent attribute tests.
3. Branches
Represent test outcomes.
4. Leaf Nodes
Represent final class labels.
Example of Decision Tree
            Weather
           /       \
       Sunny       Rainy
         /           \
       Play       Don't Play
Working of Decision Tree
The tree classifies a record by starting at the root node, following the branch that matches the record's value for each tested attribute, and stopping at a leaf node, whose class label becomes the prediction.
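This traversal can be sketched with the Weather example tree stored as nested dictionaries. The dictionary layout and the function name `classify` are assumptions made for this illustration, not a standard representation.

```python
# Internal node: {attribute: {branch_value: subtree_or_leaf}}.
# Leaf node: a plain string holding the class label.
TREE = {"Weather": {"Sunny": "Play", "Rainy": "Don't Play"}}


def classify(tree, record):
    """Walk from the root to a leaf, following the branch that matches
    the record's value for the attribute tested at each node."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))     # attribute tested at this node
        value = record[attribute]        # the record's value for it
        node = node[attribute][value]    # follow the matching branch
    return node                          # reached a leaf: the class label


print(classify(TREE, {"Weather": "Sunny"}))  # Play
print(classify(TREE, {"Weather": "Rainy"}))  # Don't Play
```

A record with a value that has no branch (for example a hypothetical "Cloudy") raises a `KeyError` in this sketch. A practical implementation would need a fallback such as returning the majority class.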
2. Tree Construction
Tree construction is the process of building the decision tree from training data.
Steps in Tree Construction
Step 1: Select Best Attribute
Choose the attribute that best separates the classes, as judged by an attribute selection measure.
Step 2: Create Root Node
The selected attribute becomes the root node.
Step 3: Partition Data
Data is divided according to attribute values.
Step 4: Repeat Recursively
The process continues for each subset.
Step 5: Stop Condition
Construction stops when:
- All records in the subset belong to the same class
- No attributes remain for further splitting
- The subset for a branch is empty
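The five steps can be sketched as a short recursive function. This is an ID3-style illustration that uses information gain as the selection measure, which is one possible choice; the function names and the four-row weather dataset are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes, default=None):
    if not rows:                    # stop: empty subset, fall back to parent's majority
        return default
    if len(set(labels)) == 1:       # stop: all records belong to the same class
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:              # stop: no attributes remain
        return majority
    # Steps 1 and 2: pick the best attribute and make it this subtree's root.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Steps 3 and 4: partition on the attribute's values and recurse.
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = build_tree([r for r, _ in pairs],
                                       [l for _, l in pairs],
                                       remaining, majority)
    return tree


# Hypothetical training data.
rows = [
    {"Weather": "Sunny", "Wind": "Weak"},
    {"Weather": "Sunny", "Wind": "Strong"},
    {"Weather": "Rainy", "Wind": "Weak"},
    {"Weather": "Rainy", "Wind": "Strong"},
]
labels = ["Play", "Play", "Don't Play", "Don't Play"]

print(build_tree(rows, labels, ["Weather", "Wind"]))
```

Because this sketch only creates branches for values that occur in the current subset, the empty-subset case is reached only if the function is called with no rows. Algorithms that branch on every possible value of an attribute hit that case more often.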
3. Attribute Selection
Definition
Attribute selection determines the best attribute for splitting data during tree construction.
The selected attribute should:
- Maximize class separation
- Reduce uncertainty
Attribute Selection Measures
Common measures include:
- Information Gain
- Gain Ratio
- Gini Index
1. Information Gain
Information Gain measures reduction in uncertainty after splitting data.
Higher information gain indicates a better attribute.
Entropy Formula
Entropy(S)=-\sum_{i=1}^{n} p_i\log_2 p_i
Information Gain Formula
Gain(S,A)=Entropy(S)-\sum_{v \in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)
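Both formulas translate directly into code when each set is described by its class counts. In the sketch below, `class_counts` holds the number of records per class in S, and `partitions` holds the class counts of each subset S_v; these names and the sample counts are assumptions made for illustration.

```python
from math import log2


def entropy(class_counts):
    """Entropy(S) = -sum(p_i * log2(p_i)); empty classes contribute 0."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)


def information_gain(parent_counts, partitions):
    """Gain(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted


# Hypothetical set S: 2 records of one class and 2 of another.
parent = [2, 2]
print(entropy(parent))                             # 1.0
print(information_gain(parent, [[2, 0], [0, 2]]))  # 1.0 (perfect split)
print(information_gain(parent, [[1, 1], [1, 1]]))  # 0.0 (uninformative split)
```

An attribute that separates the classes completely removes all uncertainty and earns the full gain, while one that leaves every subset as mixed as the original earns none.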
2. Gain Ratio
Gain Ratio improves Information Gain by reducing bias toward attributes with many values.
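The section does not give a formula for Gain Ratio. As commonly stated (for example in descriptions of the C4.5 algorithm), it divides the information gain by the split information of the attribute. The symbols S_v and Values(A) carry the same meaning as in the Information Gain formula.

```latex
\[
SplitInfo(S,A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|},
\qquad
GainRatio(S,A) = \frac{Gain(S,A)}{SplitInfo(S,A)}
\]
```

An attribute with many distinct values tends to have a large SplitInfo, so dividing by it reduces the advantage such attributes would otherwise gain.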
3. Gini Index
Gini Index measures impurity in data.
Lower Gini Index indicates better splitting.
Gini Formula
Gini(S)=1-\sum_{i=1}^{n} p_i^2
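A small sketch of the Gini Index, computed as one minus the sum of squared class proportions (the standard Gini impurity). The function names and the sample class counts are hypothetical, and `gini_split` weights each subset by its share of the records, in the same way the Information Gain formula does.

```python
def gini(class_counts):
    """Gini(S) = 1 - sum(p_i ** 2) over the class proportions p_i."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)


def gini_split(partitions):
    """Weighted Gini Index of a split; lower values mean purer subsets."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)


print(gini([2, 2]))                  # 0.5 (evenly mixed, two classes)
print(gini([4, 0]))                  # 0.0 (pure)
print(gini_split([[2, 0], [0, 2]]))  # 0.0 (perfect split)
print(gini_split([[1, 1], [1, 1]]))  # 0.5 (uninformative split)
```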
4. Advantages of Decision Trees
1. Simple and Easy to Understand
Tree structures are human-readable.
2. Fast Classification
Prediction is quick after tree construction.
3. Handles Large Datasets
Works efficiently for many applications.
4. Supports Both Numerical and Categorical Data
Flexible for different data types.
5. Requires Less Data Preparation
Minimal preprocessing is needed.
5. Disadvantages of Decision Trees
1. Overfitting Problem
Trees may become overly complex.
2. Instability
Small data changes may produce different trees.
3. Bias Toward Dominant Classes
Unbalanced datasets may affect results.
4. Complex Trees Become Difficult to Interpret
Large trees reduce readability.
6. Applications of Decision Trees
1. Medical Diagnosis
Disease classification systems.
2. Banking
Loan approval prediction.
3. Fraud Detection
Detecting suspicious transactions.
4. Education Systems
Student performance prediction.
5. Business Intelligence
Customer behavior analysis and sales prediction.