5. Classification and Prediction

Classification and Prediction are important Data Mining techniques used to analyze data and make future decisions based on patterns discovered from historical data.

These techniques belong to Supervised Learning because they train models on historical data whose outcomes are already known.

They are widely used in:

  • Banking

  • Healthcare

  • Business intelligence

  • Fraud detection

  • Weather forecasting

  • Recommendation systems

Classification predicts categorical labels, while prediction estimates continuous numerical values.


5.1 Classification and Prediction

1. Definition of Classification

Classification is a Data Mining technique used to assign data objects into predefined classes or categories based on their attributes.

In classification:

  • A model is first trained using labeled data.

  • The trained model is then used to classify unknown data.


2. Example of Classification

Examples

  • Email classified as:

    • Spam

    • Not Spam

  • Student classified as:

    • Pass

    • Fail

  • Loan application classified as:

    • Approved

    • Rejected


3. Working of Classification

Classification generally involves two steps.


Step 1: Training Phase

A classification model is built using training data.

The training dataset contains:

  • Input attributes

  • Known class labels

Example

Age     Income   Loan Status
Young   High     Approved
Old     Low      Rejected

Step 2: Testing Phase

The trained model predicts class labels for unknown data.
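
The two phases can be sketched in Python as follows. This is only a minimal illustration, not a real learning algorithm: the hypothetical `train` function remembers the most common label for each combination of attribute values, and `classify` falls back to the overall majority class for combinations it has not seen. The third training row is invented for the example.

```python
from collections import Counter, defaultdict

def train(rows):
    """Training phase: build a lookup model from (attributes, label) pairs."""
    labels_by_attributes = defaultdict(Counter)
    all_labels = Counter()
    for attributes, label in rows:
        labels_by_attributes[attributes][label] += 1
        all_labels[label] += 1
    # Most common label for each attribute combination seen in training.
    table = {
        attributes: counts.most_common(1)[0][0]
        for attributes, counts in labels_by_attributes.items()
    }
    # Overall majority class, used for unseen combinations.
    default = all_labels.most_common(1)[0][0]
    return table, default

def classify(model, attributes):
    """Testing phase: predict the class label of a new record."""
    table, default = model
    return table.get(attributes, default)

training_data = [
    (("Young", "High"), "Approved"),
    (("Old", "Low"), "Rejected"),
    (("Young", "Low"), "Rejected"),  # invented row, for illustration only
]

model = train(training_data)
print(classify(model, ("Young", "High")))  # seen in training: Approved
print(classify(model, ("Old", "High")))    # unseen: majority class, Rejected
```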


4. Definition of Prediction

Prediction is a Data Mining technique used to estimate continuous or numerical values based on historical data.

Unlike classification:

  • Prediction estimates numeric values instead of categories.

5. Examples of Prediction

Examples

  • Predicting future sales

  • Predicting temperature

  • Predicting stock market prices

  • Predicting employee salary
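
As a rough sketch of numeric prediction, the Python code below fits a straight line to past sales with ordinary least squares and uses it to estimate the next value. Linear regression is only one possible prediction method, chosen here as an assumption, and the monthly sales figures are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up historical data: sales for months 1 to 4.
months = [1, 2, 3, 4]
sales = [100.0, 120.0, 140.0, 160.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # estimate for month 5
print(forecast)  # 180.0
```

The output is a number (180.0), not a category, which is what separates prediction from classification.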


6. Difference Between Classification and Prediction

Basis           Classification        Prediction
Output Type     Categorical values    Continuous numerical values
Purpose         Assign class labels   Estimate future values
Example         Pass/Fail             Salary prediction
Learning Type   Supervised learning   Supervised learning

7. Applications of Classification and Prediction

Classification and prediction are used in many real-world applications.


1. Banking Sector

Applications

  • Loan approval prediction

  • Fraud transaction detection

  • Credit risk analysis


2. Healthcare

Applications

  • Disease diagnosis

  • Medical risk prediction

  • Patient classification


3. Education Systems

Applications

  • Student performance prediction

  • Attendance analysis

  • Grade classification


4. Business and Marketing

Applications

  • Customer segmentation

  • Sales prediction

  • Product recommendation


5. Weather Forecasting

Applications

  • Rainfall prediction

  • Temperature forecasting

  • Climate analysis


6. E-Commerce

Applications

  • Product recommendation systems

  • Customer behavior analysis

  • Purchase prediction


5.2 Issues Regarding Classification and Prediction

Classification and prediction systems face several challenges that affect their accuracy and efficiency.

Important issues include:

  • Overfitting

  • Accuracy

  • Missing values

  • Scalability


1. Overfitting

Definition

Overfitting occurs when a model fits the training data too closely, learning noise and irrelevant details along with the real patterns.

As a result:

  • The model performs well on training data

  • But performs poorly on new unseen data


Explanation

An overfitted model becomes too specific and loses generalization capability.


Example

Suppose a student memorizes exact answers instead of understanding concepts.

The student:

  • Performs well on known questions

  • Performs poorly on new questions

Similarly, overfitted models fail on new datasets.
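
The student analogy can be mirrored in a small Python sketch. The "memorizer" stores every training example exactly, while the rule-based model uses a single threshold. The scores, labels, and the pass mark of 50 are all assumptions made for this illustration.

```python
def train_memorizer(examples):
    """Overfitted model: remember every (score, label) pair exactly."""
    return dict(examples)

def memorizer_predict(model, score, default="Fail"):
    # Known scores are answered from memory; unknown scores get a default.
    return model.get(score, default)

def rule_predict(score, pass_mark=50):
    """Simple general rule (assumed pass mark of 50)."""
    return "Pass" if score >= pass_mark else "Fail"

def accuracy(predictions, actual):
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

train = [(35, "Fail"), (45, "Fail"), (55, "Pass"), (80, "Pass")]
test = [(30, "Fail"), (60, "Pass"), (75, "Pass"), (90, "Pass")]

memorizer = train_memorizer(train)
train_labels = [label for _, label in train]
test_labels = [label for _, label in test]

train_acc = accuracy([memorizer_predict(memorizer, s) for s, _ in train], train_labels)
test_acc = accuracy([memorizer_predict(memorizer, s) for s, _ in test], test_labels)
rule_test_acc = accuracy([rule_predict(s) for s, _ in test], test_labels)

print(train_acc)      # 1.0  (perfect on known data)
print(test_acc)       # 0.25 (poor on new data)
print(rule_test_acc)  # 1.0  (the general rule carries over)
```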


Causes of Overfitting

1. Small Training Dataset

With too few examples, a model can memorize individual records instead of learning general patterns.


2. Too Many Attributes

A large number of features increases model complexity and gives the model more chances to fit coincidental patterns.


3. Noise in Data

Incorrect or irrelevant data misleads the model.


Problems Caused by Overfitting

  • Reduced prediction accuracy

  • Poor generalization

  • Unreliable results


Methods to Reduce Overfitting

1. Pruning

Removing unnecessary branches in decision trees.


2. Using More Training Data

Larger datasets improve learning.


3. Feature Selection

Removing irrelevant attributes.


4. Cross Validation

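The dataset is split into several parts (folds). The model is trained on some folds and tested on the remaining fold, and this is repeated so that every fold is used for testing once.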
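
Cross validation is commonly done as k-fold cross validation. The Python sketch below shows only the splitting step; the function name `k_fold_splits` and the round-robin way of assigning records to folds are assumptions made for illustration.

```python
def k_fold_splits(records, k):
    """Yield (train, test) pairs; each record is in exactly one test fold."""
    # Assign records to k folds in round-robin order.
    folds = [records[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

records = list(range(10))  # stand-in for 10 data records
for train, test in k_fold_splits(records, k=5):
    print("test fold:", test, "training size:", len(train))
```

In practice a model is trained and scored once per split, and the scores are averaged to estimate performance on unseen data.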


2. Accuracy

Definition

Accuracy measures how correctly a model predicts outcomes.

It is one of the most important performance measures.


Formula for Accuracy

Accuracy=\frac{\text{Correct Predictions}}{\text{Total Predictions}}


Example

Suppose:

  • Total predictions = 100

  • Correct predictions = 90

Then:

Accuracy=\frac{90}{100}=0.9

Accuracy = 90%
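
The same calculation can be written as a short Python function. The label lists below are made up so that 90 of the 100 predictions match.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

actual = ["Pass"] * 100
predicted = ["Pass"] * 90 + ["Fail"] * 10  # 10 wrong predictions

print(accuracy(predicted, actual))  # 0.9
```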


Factors Affecting Accuracy

1. Quality of Training Data

Poor-quality data reduces accuracy.


2. Missing Values

Incomplete data affects learning.


3. Noise

Incorrect data lowers prediction performance.


4. Model Selection

Different algorithms provide different accuracy levels.


Importance of Accuracy

  • Measures model reliability

  • Helps compare classifiers

  • Improves decision making


3. Missing Values

Definition

Missing values occur when some attribute values are absent in the dataset.


Causes of Missing Values

1. Data Entry Errors

Information may be skipped accidentally.


2. Hardware or Software Failures

Data may not be recorded properly.


3. User Refusal

Users may not provide certain information.


Example

Name   Age         Salary
Amit   25          50000
Ravi   (missing)   45000

Age value is missing for Ravi.


Problems Caused by Missing Values

  • Reduces mining accuracy

  • Produces incorrect predictions

  • Increases processing difficulty


Methods for Handling Missing Values

1. Ignore Records

Remove tuples with missing values.


2. Manual Filling

Users manually enter missing data.


3. Mean or Average Method

Replace missing numerical values with average values.


4. Most Frequent Value Method

Replace missing values using the most common value.


5. Predictive Methods

Use machine learning models to estimate missing values.
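
Methods 3 and 4 can be sketched in Python as below. Missing entries are represented by `None`, which is a convention chosen for this example, and the sample values are invented.

```python
from collections import Counter
from statistics import mean

def fill_with_mean(values):
    """Replace missing numeric values with the mean of the known values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]

def fill_with_mode(values):
    """Replace missing values with the most frequent known value."""
    known = [v for v in values if v is not None]
    most_common = Counter(known).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

ages = [25, None, 35]                     # invented sample data
cities = ["Pune", None, "Pune", "Delhi"]  # invented sample data

print(fill_with_mean(ages))    # the missing age becomes 30
print(fill_with_mode(cities))  # the missing city becomes "Pune"
```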


4. Scalability

Definition

Scalability refers to the ability of a classification or prediction algorithm to handle increasing amounts of data efficiently.


Importance of Scalability

Modern databases contain:

  • Millions of records

  • Large attribute sets

  • Continuous data streams

Algorithms must process data efficiently.


Challenges in Scalability

1. Large Data Volume

Processing huge datasets requires high resources.


2. Memory Limitations

Large datasets consume significant memory.


3. Processing Time

Complex algorithms may become very slow.


Methods to Improve Scalability

1. Parallel Processing

Tasks are divided among multiple processors.


2. Distributed Computing

Processing is distributed across systems.


3. Data Reduction

Reducing dataset size improves efficiency.


4. Efficient Algorithms

Using optimized mining algorithms.


5.3 Comparing Classification Methods

Different classification methods are evaluated based on several criteria.

Important comparison parameters include:

  • Accuracy

  • Speed

  • Robustness

  • Interpretability


1. Accuracy

Definition

Accuracy measures the correctness of classification results.


Importance

A highly accurate classifier provides reliable predictions.


Example

A disease diagnosis system with 95% accuracy is more reliable than one with 70% accuracy.

Factors Affecting Accuracy

  • Data quality

  • Noise

  • Missing values

  • Algorithm selection


2. Speed

Definition

Speed refers to:

  • Training speed

  • Prediction speed

of the classification algorithm.


Types of Speed

1. Training Speed

Time required to build the model.


2. Prediction Speed

Time required to classify new data.


Importance

Fast systems are important for:

  • Real-time applications

  • Online services

  • Large databases


3. Robustness

Definition

Robustness refers to the ability of a classifier to handle:

  • Noise

  • Missing values

  • Incorrect data

without major performance reduction.


Example

A robust model continues working effectively even if some data is incomplete.


Importance

Real-world data is often imperfect.

Robust systems provide stable performance.


4. Interpretability

Definition

Interpretability means how easily humans can understand the classification model.


Example

Decision trees are highly interpretable because rules are easy to understand.


Importance

Interpretable models help:

  • Explain decisions

  • Build user trust

  • Support business understanding


Comparison Table of Classification Methods

Criteria           Meaning
Accuracy           Correctness of predictions
Speed              Time required for training and prediction
Robustness         Ability to handle noisy or incomplete data
Interpretability   Ease of understanding the model

5.4 Classification by Decision Tree Induction

Decision Tree Induction is one of the most popular classification techniques.

It constructs a tree-like structure for decision making.

Decision trees are widely used because they are:

  • Simple

  • Easy to understand

  • Highly interpretable


1. Decision Tree Concept

Definition

A Decision Tree is a tree-structured classifier where:

  • Internal nodes represent tests on attributes

  • Branches represent outcomes of tests

  • Leaf nodes represent class labels


Structure of Decision Tree

1. Root Node

Represents the topmost decision attribute.


2. Internal Nodes

Represent attribute tests.


3. Branches

Represent test outcomes.


4. Leaf Nodes

Represent final class labels.


Example of Decision Tree

                  Weather
                 /      \
              Sunny    Rainy
               /           \
           Play          Don't Play

Working of Decision Tree

The tree classifies data by moving from:

  • Root node

  • Through branches

  • To leaf nodes

based on attribute values.
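
The traversal can be sketched in Python using the Weather tree from the example. Representing the tree as nested dictionaries with the keys `"attribute"` and `"branches"` is an assumption made for this illustration.

```python
# Internal nodes are dicts; leaf nodes are plain class-label strings.
tree = {
    "attribute": "Weather",
    "branches": {
        "Sunny": "Play",
        "Rainy": "Don't Play",
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the record's attribute values."""
    while isinstance(node, dict):
        value = record[node["attribute"]]  # test at this node
        node = node["branches"][value]     # follow the matching branch
    return node                            # leaf: the class label

print(classify(tree, {"Weather": "Sunny"}))  # Play
print(classify(tree, {"Weather": "Rainy"}))  # Don't Play
```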


2. Tree Construction

Tree construction is the process of building the decision tree from training data.


Steps in Tree Construction

Step 1: Select Best Attribute

Choose the attribute that best separates the classes, as scored by an attribute selection measure such as Information Gain.


Step 2: Create Root Node

Selected attribute becomes the root node.


Step 3: Partition Data

Data is divided according to attribute values.


Step 4: Repeat Recursively

The process continues for each subset.


Step 5: Stop Condition

Construction stops when:

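  • All records in the current subset belong to the same class

  • No attributes remain for further splitting (the leaf is labeled with the majority class)

  • The subset for a branch is empty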
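
The steps above can be sketched as a small recursive function in the style of ID3. This is a simplified illustration: it uses information gain as the selection measure and handles categorical attributes only. The "Windy" attribute and the four training rows are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    # Stop: all records belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes remain, so use the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 1 and 2: the best attribute becomes the root of this subtree.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Steps 3 and 4: partition on observed values and recurse.
    # Only observed values get a branch, so no subset here is ever empty.
    for value in sorted(set(row[best] for row in rows)):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree(
            [r for r, _ in pairs], [l for _, l in pairs], remaining
        )
    return {"attribute": best, "branches": branches}

rows = [
    {"Weather": "Sunny", "Windy": "No"},
    {"Weather": "Sunny", "Windy": "Yes"},
    {"Weather": "Rainy", "Windy": "No"},
    {"Weather": "Rainy", "Windy": "Yes"},
]
labels = ["Play", "Play", "Don't Play", "Don't Play"]

print(build_tree(rows, labels, ["Weather", "Windy"]))
```

On this data, Weather separates the classes perfectly and Windy not at all, so the sketch selects Weather as the root and produces two leaves.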


3. Attribute Selection

Definition

Attribute selection determines the best attribute for splitting data during tree construction.

The selected attribute should:

  • Maximize class separation

  • Reduce uncertainty


Attribute Selection Measures

Common measures include:

  • Information Gain

  • Gain Ratio

  • Gini Index


1. Information Gain

Information Gain measures reduction in uncertainty after splitting data.

Higher information gain indicates a better attribute.


Entropy Formula

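Entropy(S)=-\sum_{i=1}^{n} p_i\log_2 p_i

where p_i is the proportion of records in S that belong to class i, and n is the number of classes.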


Information Gain Formula

Gain(S,A)=Entropy(S)-\sum_{v \in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)
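
A worked example of both formulas in Python is given below. The function names and the class counts are assumptions chosen for illustration: a set of 4 "Play" and 4 "Don't Play" records is split into two subsets of 4 records each.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the classes present in S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Gain = Entropy(parent) - weighted average entropy of the partitions."""
    total = len(parent)
    remainder = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - remainder

parent = ["Play"] * 4 + ["Don't Play"] * 4
left = ["Play"] * 3 + ["Don't Play"] * 1   # S_v for the first attribute value
right = ["Play"] * 1 + ["Don't Play"] * 3  # S_v for the second attribute value

print(entropy(parent))                                    # 1.0
print(round(entropy(left), 4))                            # 0.8113
print(round(information_gain(parent, [left, right]), 4))  # 0.1887
```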


2. Gain Ratio

Gain Ratio improves Information Gain by reducing bias toward attributes with many values.
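
In C4.5, Gain Ratio is defined as Gain(S,A) divided by SplitInfo(S,A), where SplitInfo is the entropy of the partition sizes produced by attribute A. The Python sketch below shows the effect on made-up data: a record ID attribute with a unique value per record has the same information gain as Weather, but a lower gain ratio. The function names and data are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(items):
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())

def gain_ratio(labels, attribute_values):
    """Gain(S, A) / SplitInfo(S, A) for one attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # SplitInfo is the entropy of the attribute's own value distribution.
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0

labels = ["Play", "Play", "Don't Play", "Don't Play"]
weather = ["Sunny", "Sunny", "Rainy", "Rainy"]
record_id = ["r1", "r2", "r3", "r4"]  # unique per record, useless for new data

print(gain_ratio(labels, weather))    # 1.0
print(gain_ratio(labels, record_id))  # 0.5
```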


3. Gini Index

Gini Index measures impurity in data.

Lower Gini Index indicates better splitting.


Gini Formula

Gini(S)=1-\sum_{i=1}^{n} p_i^2

where p_i is the proportion of records in S that belong to class i.
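
A short Python sketch of the Gini Index, using the standard definition of one minus the sum of squared class proportions. The label lists are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum(p_i ** 2) over the classes present in S."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

pure = ["Play"] * 4
mixed = ["Play"] * 3 + ["Don't Play"] * 1
even = ["Play"] * 2 + ["Don't Play"] * 2

print(gini(pure))   # 0.0   (no impurity)
print(gini(mixed))  # 0.375
print(gini(even))   # 0.5   (maximum for two classes)
```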


4. Advantages of Decision Trees

1. Simple and Easy to Understand

Tree structures are human-readable.


2. Fast Classification

Prediction is quick after tree construction.


3. Handles Large Datasets

Works efficiently for many applications.


4. Supports Both Numerical and Categorical Data

Flexible for different data types.


5. Requires Less Data Preparation

Minimal preprocessing is needed.


5. Disadvantages of Decision Trees

1. Overfitting Problem

Trees may become overly complex.


2. Instability

Small data changes may produce different trees.


3. Bias Toward Dominant Classes

On unbalanced datasets, splits tend to favor the majority class, so minority classes may be classified poorly.


4. Complex Trees Become Difficult to Interpret

Large trees reduce readability.


6. Applications of Decision Trees

1. Medical Diagnosis

Disease classification systems.


2. Banking

Loan approval prediction.


3. Fraud Detection

Detecting suspicious transactions.


4. Education Systems

Student performance prediction.


5. Business Intelligence

Customer behavior analysis and sales prediction.