5. Classification and Prediction

Classification and Prediction are important Data Mining techniques that learn patterns from historical data and use them to make decisions about new data.

Both techniques belong to Supervised Learning, because they train models on data whose outcomes are already known.

They are widely used in:

Banking
Healthcare
Business intelligence
Fraud detection
Weather forecasting
Recommendation systems

Classification predicts categorical labels, while prediction estimates continuous numerical values.

5.1 Classification and Prediction

1. Definition of Classification

Classification is a Data Mining technique used to assign data objects to predefined classes or categories based on their attributes.

In classification:

A model is first trained using labeled data.
The trained model is then used to classify unknown data.

2. Example of Classification

Examples

Email classified as:
- Spam
- Not Spam
Student classified as:
- Pass
- Fail
Loan application classified as:
- Approved
- Rejected

3. Working of Classification

Classification generally involves two steps.

Step 1: Training Phase

A classification model is built using training data.

The training dataset contains:

Input attributes
Known class labels

Example

Age	Income	Loan Status
Young	High	Approved
Old	Low	Rejected

Step 2: Testing Phase

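The trained model predicts class labels for data it has not seen before. Its accuracy is first estimated on test records whose labels are known; if the accuracy is acceptable, the model is used to classify unknown data.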
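The two phases can be illustrated with a minimal Python sketch. It is not a standard algorithm implementation: the model simply remembers the most common class label for each value of one attribute. The first two records come from the table above; the other two records, the function names train and classify, and the default label are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Labeled training records. The first two rows come from the table above;
# the last two are hypothetical rows added so each value appears twice.
TRAINING_DATA = [
    ({"age": "Young", "income": "High"}, "Approved"),
    ({"age": "Old", "income": "Low"}, "Rejected"),
    ({"age": "Old", "income": "High"}, "Approved"),    # hypothetical
    ({"age": "Young", "income": "Low"}, "Rejected"),   # hypothetical
]


def train(records, attribute):
    """Training phase: count class labels for each value of one attribute."""
    counts = defaultdict(Counter)
    for attributes, label in records:
        counts[attributes[attribute]][label] += 1
    # The model maps each attribute value to its most common class label.
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def classify(model, attribute, record, default="Rejected"):
    """Testing phase: look up the class label for a new record.

    Values never seen during training fall back to `default` (an assumption).
    """
    return model.get(record[attribute], default)


model = train(TRAINING_DATA, "income")
prediction = classify(model, "income", {"age": "Young", "income": "High"})
print(prediction)  # Approved
```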

4. Definition of Prediction

Prediction is a Data Mining technique used to estimate continuous or numerical values based on historical data.

Unlike classification, prediction estimates numeric values instead of categories.

5. Examples of Prediction

Examples

Predicting future sales
Predicting temperature
Predicting stock market prices
Predicting employee salary
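As a small illustration of numeric prediction, the sketch below fits a straight line to past data with ordinary least squares and uses it to estimate a salary. Least squares is only one of many possible prediction methods, and the experience and salary figures are hypothetical values chosen for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: years of experience and salary.
years_experience = [1, 2, 3, 4, 5]
salary = [30000, 35000, 40000, 45000, 50000]

slope, intercept = fit_line(years_experience, salary)

# Estimate the salary of an employee with 6 years of experience.
predicted_salary = slope * 6 + intercept
print(predicted_salary)  # 55000.0
```

The output is a number on a continuous scale, not a class label, which is the difference from classification.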

6. Difference Between Classification and Prediction

Basis	Classification	Prediction
Output Type	Categorical values	Continuous numerical values
Purpose	Assign class labels	Estimate numeric values
Example	Pass/Fail	Salary prediction
Learning Type	Supervised learning	Supervised learning

7. Applications of Classification and Prediction

Classification and prediction are used in many real-world applications.

1. Banking Sector

Applications

Loan approval prediction
Fraud transaction detection
Credit risk analysis

2. Healthcare

Applications

Disease diagnosis
Medical risk prediction
Patient classification

3. Education Systems

Applications

Student performance prediction
Attendance analysis
Grade classification

4. Business and Marketing

Applications

Customer segmentation
Sales prediction
Product recommendation

5. Weather Forecasting

Applications

Rainfall prediction
Temperature forecasting
Climate analysis

6. E-Commerce

Applications

Product recommendation systems
Customer behavior analysis
Purchase prediction

5.2 Issues Regarding Classification and Prediction

Classification and prediction systems face several challenges that affect their accuracy and efficiency.

Important issues include:

Overfitting
Accuracy
Missing values
Scalability

1. Overfitting

Definition

Overfitting occurs when a classification model fits the training data too closely, learning noise and irrelevant details along with the real patterns.

As a result:

The model performs well on training data
But performs poorly on new unseen data

Explanation

An overfitted model becomes too specific and loses generalization capability.

Example

Suppose a student memorizes exact answers instead of understanding concepts.

The student:

Performs well on known questions
Performs poorly on new questions

Similarly, overfitted models fail on new datasets.

Causes of Overfitting

1. Small Training Dataset

With too few examples, a model can memorize individual records instead of learning general patterns.

2. Too Many Attributes

A large number of features increases model complexity and makes it easier to fit chance patterns.

3. Noise in Data

Incorrect or irrelevant data misleads the model.

Problems Caused by Overfitting

Reduced prediction accuracy
Poor generalization
Unreliable results

Methods to Reduce Overfitting

1. Pruning

Removing unnecessary branches in decision trees.

2. Using More Training Data

Larger, more representative datasets make it harder for the model to memorize individual records.

3. Feature Selection

Removing irrelevant attributes.

4. Cross Validation

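Splitting the dataset into several parts (folds) and repeatedly training on some folds while testing on the remaining one, so that performance is always measured on data not used for training.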
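Cross validation can be sketched in Python as follows. This is a minimal illustration of k-fold splitting only; the function name k_fold_splits is an assumption, and the sketch does not shuffle the records, which is usually done first in practice.

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_records))
    fold_size, remainder = divmod(n_records, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds receive one extra record.
        end = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test
        start = end


# Split 10 records into 3 folds; each record is tested exactly once.
splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

A model would be trained on each train part and evaluated on the matching test part; the average of the k results estimates how well the model generalizes.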

2. Accuracy

Definition

Accuracy is the fraction of predictions that a model gets right.

It is one of the most widely used performance measures for classifiers.

Formula for Accuracy

Accuracy=\frac{\text{Correct Predictions}}{\text{Total Predictions}}

Example

Suppose:

Total predictions = 100
Correct predictions = 90

Then:

Accuracy=\frac{90}{100}=0.9

Accuracy = 90%
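The same calculation can be written as a short Python function. This is a minimal sketch; the label lists are hypothetical and are arranged so that 90 of the 100 predictions match the actual labels, as in the example above.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one prediction is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical labels: 60 Pass and 40 Fail records.
actual = ["Pass"] * 60 + ["Fail"] * 40
# 55 Pass and 35 Fail records are predicted correctly (90 in total).
predicted = ["Pass"] * 55 + ["Fail"] * 5 + ["Fail"] * 35 + ["Pass"] * 5

print(accuracy(actual, predicted))  # 0.9
```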

Factors Affecting Accuracy

1. Quality of Training Data

Poor-quality data reduces accuracy.

2. Missing Values

Incomplete data affects learning.

3. Noise

Incorrect data lowers prediction performance.

4. Model Selection

Different algorithms provide different accuracy levels.

Importance of Accuracy

Measures model reliability
Helps compare classifiers
Improves decision making

3. Missing Values

Definition

Missing values occur when some attribute values are absent in the dataset.

Causes of Missing Values

1. Data Entry Errors

Information may be skipped accidentally.

2. Hardware or Software Failures

Data may not be recorded properly.

3. User Refusal

Users may not provide certain information.

Example

Name	Age	Salary
Amit	25	50000
Ravi	—	45000

The Age value is missing for Ravi.

Problems Caused by Missing Values

Reduces mining accuracy
Produces incorrect predictions
Increases processing difficulty

Methods for Handling Missing Values

1. Ignore Records

Remove tuples with missing values.

2. Manual Filling

Users manually enter missing data.

3. Mean or Average Method

Replace missing numerical values with the mean of the attribute's known values.

4. Most Frequent Value Method

Replace missing categorical values with the most frequent value (mode) of the attribute.

5. Predictive Methods

Use machine learning models to estimate missing values.
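The mean method and the most frequent value method can be sketched with Python's standard statistics module. Missing values are represented by None here, which is an assumption of this sketch. The first two ages follow the table above; the third age and the city column are hypothetical.

```python
from statistics import mean, mode


def fill_with_mean(values):
    """Replace None with the mean of the known numerical values."""
    known = [v for v in values if v is not None]
    average = mean(known)
    return [average if v is None else v for v in values]


def fill_with_mode(values):
    """Replace None with the most frequent known value."""
    known = [v for v in values if v is not None]
    most_frequent = mode(known)
    return [most_frequent if v is None else v for v in values]


# Ages for Amit and Ravi (missing), plus one hypothetical third record.
ages = [25, None, 35]
# Hypothetical categorical attribute with one missing value.
cities = ["Delhi", "Delhi", None, "Mumbai"]

filled_ages = fill_with_mean(ages)
filled_cities = fill_with_mode(cities)
print(filled_ages)
print(filled_cities)
```

Both methods are simple but can distort the data when many values are missing, which is one reason predictive methods are sometimes preferred.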

4. Scalability

Definition

Scalability refers to the ability of a classification or prediction algorithm to handle increasing amounts of data efficiently.

Importance of Scalability

Modern databases contain:

Millions of records
Large attribute sets
Continuous data streams

Algorithms must process data efficiently.

Challenges in Scalability

1. Large Data Volume

Processing huge datasets requires substantial computing resources.

2. Memory Limitations

Large datasets consume significant memory.

3. Processing Time

Complex algorithms may become very slow.

Methods to Improve Scalability

1. Parallel Processing

Tasks are divided among multiple processors.

2. Distributed Computing

Processing is distributed across systems.

3. Data Reduction

Reducing dataset size improves efficiency.

4. Efficient Algorithms

Using optimized mining algorithms.
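Data reduction is the easiest of these methods to illustrate. The sketch below draws a simple random sample so that a model can be built on fewer records; sampling is only one form of data reduction, and the function name sample_records and the dataset are assumptions made for illustration.

```python
import random


def sample_records(records, sample_size, seed=0):
    """Return a simple random sample of the records, without replacement."""
    if sample_size >= len(records):
        return list(records)
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(records, sample_size)


# Stand-in for a large dataset: 100,000 record identifiers.
records = list(range(100_000))

sample = sample_records(records, 1000)
print(len(sample))  # 1000
```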

5.3 Comparing Classification Methods

Different classification methods are evaluated based on several criteria.

Important comparison parameters include:

Accuracy
Speed
Robustness
Interpretability

1. Accuracy

Definition

Accuracy measures the correctness of classification results.

Importance

A highly accurate classifier provides reliable predictions.

Example

A disease diagnosis system with 95% accuracy is more reliable than one with 70% accuracy, provided both are measured on the same test data.

Factors Affecting Accuracy

Data quality
Noise
Missing values
Algorithm selection

2. Speed

Definition

Speed refers to the training speed and the prediction speed of the classification algorithm.

Types of Speed

1. Training Speed

Time required to build the model.

2. Prediction Speed

Time required to classify new data.

Importance

Fast systems are important for:

Real-time applications
Online services
Large databases

3. Robustness

Definition

Robustness refers to the ability of a classifier to handle noise, missing values, and incorrect data without a major reduction in performance.

Example

A robust model continues working effectively even if some data is incomplete.

Importance

Real-world data is often imperfect.

Robust systems provide stable performance.

4. Interpretability

Definition

Interpretability means how easily humans can understand the classification model.

Example

Decision trees are highly interpretable because rules are easy to understand.

Importance

Interpretable models help:

Explain decisions
Build user trust
Support business understanding

Summary of Comparison Criteria

Criterion	Meaning
Accuracy	Correctness of predictions
Speed	Time required for training and prediction
Robustness	Ability to handle noisy or incomplete data
Interpretability	Ease of understanding the model

5.4 Classification by Decision Tree Induction

Decision Tree Induction is one of the most popular classification techniques.

It constructs a tree-like structure for decision making.

Decision trees are widely used because they are:

Simple
Easy to understand
Highly interpretable

1. Decision Tree Concept

Definition

A Decision Tree is a tree-structured classifier where:

Internal nodes represent tests on attributes
Branches represent outcomes of tests
Leaf nodes represent class labels

Structure of Decision Tree

1. Root Node

Represents the first attribute test, at the top of the tree.

2. Internal Nodes

Represent attribute tests.

3. Branches

Represent test outcomes.

4. Leaf Nodes

Represent final class labels.

Example of Decision Tree

                  Weather
                 /      \
              Sunny    Rainy
               /           \
           Play          Don't Play

Working of Decision Tree

The tree classifies a record by starting at the root node, following the branch that matches the record's value for each tested attribute, and stopping at a leaf node, whose label is the predicted class.
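This traversal can be sketched in Python with the Weather tree from the example above. Storing the tree as nested dictionaries is an assumption of this sketch, not a standard representation.

```python
# Internal nodes are dictionaries; leaf nodes are plain class labels.
TREE = {
    "attribute": "Weather",
    "branches": {
        "Sunny": "Play",
        "Rainy": "Don't Play",
    },
}


def classify_record(node, record):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(node, dict):
        value = record[node["attribute"]]   # value of the tested attribute
        node = node["branches"][value]      # follow the matching branch
    return node


print(classify_record(TREE, {"Weather": "Sunny"}))  # Play
print(classify_record(TREE, {"Weather": "Rainy"}))  # Don't Play
```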

2. Tree Construction

Tree construction is the process of building the decision tree from training data.

Steps in Tree Construction

Step 1: Select Best Attribute

Choose the attribute that best separates the classes, as judged by an attribute selection measure.

Step 2: Create Root Node

The selected attribute becomes the root node.

Step 3: Partition Data

Data is divided according to attribute values.

Step 4: Repeat Recursively

The process continues for each subset.

Step 5: Stop Condition

Construction stops for a branch when:

All records in the subset belong to the same class
No attributes remain for further splitting
The subset for the branch is empty
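The steps above can be sketched as a short recursive Python function. This is a simplified ID3-style illustration for categorical attributes that uses information gain to pick the best attribute; the training records and the Wind attribute are hypothetical. Because branches are created only for values that occur in the data, the empty-subset stop condition never arises in this sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [l for row, l in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def build_tree(rows, labels, attributes):
    # Stop condition: all records belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop condition: no attributes remain, so use the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 1 and 2: select the best attribute and make it this node.
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Step 3 and 4: partition the data and repeat for each subset.
    for value in sorted(set(row[best] for row in rows)):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        branches[value] = build_tree(sub_rows, sub_labels, remaining)
    return {"attribute": best, "branches": branches}


# Hypothetical training data.
rows = [
    {"Weather": "Sunny", "Wind": "Weak"},
    {"Weather": "Sunny", "Wind": "Strong"},
    {"Weather": "Rainy", "Wind": "Weak"},
    {"Weather": "Rainy", "Wind": "Strong"},
]
labels = ["Play", "Play", "Don't Play", "Don't Play"]

tree = build_tree(rows, labels, ["Weather", "Wind"])
print(tree)
```

Here Weather separates the two classes completely while Wind does not, so Weather is selected as the root and each branch ends in a leaf.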

3. Attribute Selection

Definition

Attribute selection determines the best attribute for splitting data during tree construction.

The selected attribute should:

Maximize class separation
Reduce uncertainty

Attribute Selection Measures

Common measures include:

Information Gain
Gain Ratio
Gini Index

1. Information Gain

Information Gain measures the reduction in entropy (uncertainty) achieved by splitting the data on an attribute.

Higher information gain indicates a better attribute.

Entropy Formula

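Entropy(S)=-\sum_{i=1}^{n} p_i\log_2 p_i

where p_i is the proportion of records in S that belong to class i, and n is the number of classes.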

Information Gain Formula

Gain(S,A)=Entropy(S)-\sum_{v \in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)

where S_v is the subset of S in which attribute A has the value v.
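A worked example may help. Assume a hypothetical dataset S of 14 records, 9 of class Yes and 5 of class No, and an attribute Wind that splits S into Weak (6 Yes, 2 No) and Strong (3 Yes, 3 No). The figures are chosen only for illustration and are rounded to three decimal places.

```latex
Entropy(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940

Entropy(S_{Weak}) = -\frac{6}{8}\log_2\frac{6}{8} - \frac{2}{8}\log_2\frac{2}{8} \approx 0.811

Entropy(S_{Strong}) = -\frac{3}{6}\log_2\frac{3}{6} - \frac{3}{6}\log_2\frac{3}{6} = 1.000

Gain(S, Wind) = 0.940 - \frac{8}{14}(0.811) - \frac{6}{14}(1.000) \approx 0.048
```

A gain this close to zero means that knowing Wind removes very little uncertainty about the class, so Wind would be a poor choice for the split.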

2. Gain Ratio

Gain Ratio normalizes Information Gain by the split information of the attribute, which reduces the bias toward attributes with many distinct values.
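For reference, the usual definition of Gain Ratio, as used in the C4.5 algorithm, can be written with the same notation as the Information Gain formula. SplitInfo is the entropy of the partition sizes produced by attribute A.

```latex
SplitInfo(S,A) = -\sum_{v \in Values(A)} \frac{|S_v|}{|S|}\log_2\frac{|S_v|}{|S|}

GainRatio(S,A) = \frac{Gain(S,A)}{SplitInfo(S,A)}
```

An attribute with many small partitions has a large SplitInfo, so its Gain Ratio is reduced even when its Information Gain is high.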

3. Gini Index

Gini Index measures impurity in data.

A lower Gini Index after a split indicates purer partitions and therefore a better split.

Gini Formula

Gini(S)=1-\sum_{i=1}^{n} p_i^2
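The Gini Index can be computed with a few lines of Python. The sketch uses the standard definition, one minus the sum of the squared class proportions, and the function names and label lists are assumptions made for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def gini_split(partitions):
    """Weighted Gini impurity of the partitions produced by a split."""
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * gini(p) for p in partitions)


# Hypothetical labels: an evenly mixed set and a perfect split of it.
mixed = ["Play"] * 5 + ["Don't Play"] * 5
perfect_split = [["Play"] * 5, ["Don't Play"] * 5]

print(gini(mixed))                # 0.5
print(gini_split(perfect_split))  # 0.0
```

An evenly mixed two-class set has the highest impurity (0.5), while a split that produces pure partitions has impurity 0.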

4. Advantages of Decision Trees

1. Simple and Easy to Understand

Tree structures are human-readable.

2. Fast Classification

Prediction is quick after tree construction.

3. Handles Large Datasets

Tree induction scales reasonably well to large datasets in many applications.

4. Supports Both Numerical and Categorical Data

Flexible for different data types.

5. Requires Less Data Preparation

Little preprocessing, such as normalization or scaling, is needed.

5. Disadvantages of Decision Trees

1. Overfitting Problem

Trees may become overly complex and fit noise in the training data.

2. Instability

Small data changes may produce different trees.

3. Bias Toward Dominant Classes

On unbalanced datasets, the tree may favor the majority class.

4. Complex Trees Become Difficult to Interpret

Large trees reduce readability.

6. Applications of Decision Trees

1. Medical Diagnosis

Disease classification systems.

2. Banking

Loan approval prediction.

3. Fraud Detection

Detecting suspicious transactions.

4. Education Systems

Student performance prediction.

5. Business Intelligence

Customer behavior analysis and sales prediction.
