6. Clustering

Clustering is an important unsupervised learning technique in Data Mining used to group similar data objects into clusters.

Objects within the same cluster are highly similar.
Objects in different clusters are dissimilar.

Unlike classification:

Clustering does not require predefined class labels.
Groups are formed automatically based on similarities in data.

Applications of Clustering

Market analysis
Customer segmentation
Pattern recognition
Image processing
Medical diagnosis
Social network analysis

6.1 Introduction to Clustering

1. Definition of Clustering

Clustering is the process of grouping data objects into clusters such that:

Intra-cluster similarity is high
Inter-cluster similarity is low

Each cluster represents a collection of related data objects.

2. Example of Clustering

A shopping website may group customers based on:

Purchasing habits
Age
Income
Interests

Customers with similar behavior are placed in the same cluster.

3. Characteristics of Clustering

1. Unsupervised Learning

Does not require labeled training data
Automatically discovers patterns in data

2. Similarity-Based Grouping

Objects are grouped based on similarity measures such as:

Distance
Density
Statistical similarity
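
As a minimal sketch of distance used as a similarity measure, the two functions below compute the Euclidean and Manhattan distance between numeric vectors. The function names are illustrative and do not come from these notes; a smaller distance is read as higher similarity.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical customers described by (age, income in thousands).
customer_a = (0, 0)
customer_b = (3, 4)
print(euclidean(customer_a, customer_b))  # 5.0
print(manhattan(customer_a, customer_b))  # 7
```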

3. High Intra-Cluster Similarity

Objects within the same cluster are highly similar.

4. High Inter-Cluster Dissimilarity

Objects in different clusters are highly dissimilar.

5. Automatic Pattern Discovery

Helps identify hidden structures and relationships in data.

6. Data Reduction

Large datasets can be simplified by representing them as clusters.

4. Importance of Clustering

Clustering helps to:

Discover hidden patterns
Organize large datasets
Simplify data analysis
Improve decision-making

6.2 Cluster Analysis

1. Meaning of Cluster Analysis

Cluster Analysis is the process of analyzing data objects and organizing them into meaningful clusters.

The main objective is to:

Maximize similarity within clusters
Minimize similarity between clusters
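
A small sketch of this objective, assuming Euclidean distance as the dissimilarity measure: a good clustering shows a small average distance inside a cluster and a large average distance between clusters. The helper names are hypothetical.

```python
import math
from itertools import combinations, product

def mean_intra_distance(cluster):
    """Average pairwise distance inside one cluster (needs at least 2 points)."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

left = [(0, 0), (0, 1)]
right = [(10, 0), (10, 1)]
print(mean_intra_distance(left))                                  # 1.0
print(mean_inter_distance(left, right) > mean_intra_distance(left))  # True
```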

2. Objectives of Cluster Analysis

1. Pattern Discovery

Identifies hidden relationships and trends in data.

Example

Grouping customers with similar buying behavior.

2. Data Simplification

Reduces complexity by grouping similar records together.

3. Knowledge Discovery

Extracts useful information from large datasets.

4. Data Organization

Organizes data into understandable structures.

5. Decision Support

Helps organizations identify trends and make better decisions.

6. Outlier Detection

Identifies abnormal objects that do not belong to any cluster.

3. Characteristics of Good Clustering

A good clustering method should:

Produce high-quality clusters (high intra-cluster similarity, low inter-cluster similarity)
Handle noisy data
Be scalable for large datasets
Handle different data types
Work efficiently

6.3 Need for Clustering

Clustering is necessary because modern systems generate huge amounts of complex data that are difficult to analyze manually.

1. Pattern Recognition

Clustering helps recognize patterns and relationships among data objects.

Examples

Face recognition
Handwriting recognition
Speech recognition

Importance

Improves automation
Enhances machine learning systems
Detects hidden structures

2. Data Segmentation

Clustering divides large datasets into smaller meaningful groups.

Example

Customer segmentation based on:

Age
Income
Shopping behavior

Importance

Improves marketing strategies
Provides personalized services
Increases customer satisfaction

3. Knowledge Discovery

Clustering helps extract useful knowledge from large datasets.

Example

Grouping patients with similar symptoms.

Importance

Supports research
Improves business intelligence
Assists data analysis

6.4 Major Clustering Methods

Clustering methods are classified based on how clusters are formed.

Types of Clustering Methods

Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods

1. Partitioning Methods

Definition

Partitioning methods divide data into a predefined number of clusters.

Each object belongs to exactly one cluster.

Characteristics

Fast and simple
Suitable for medium-sized datasets
Requires the number of clusters in advance

Examples

K-Means
K-Medoids

2. Hierarchical Methods

Definition

Hierarchical clustering creates a hierarchy of clusters.

Clusters are formed by:

Merging smaller clusters
Dividing larger clusters

Types

1. Agglomerative Method

Bottom-up approach
Small clusters merge into larger clusters

2. Divisive Method

Top-down approach
Large clusters divide into smaller clusters

Characteristics

Produces a dendrogram (tree structure)
Does not require the number of clusters to be specified in advance

Limitations

Computationally expensive
Difficult for very large datasets
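
The bottom-up (agglomerative) approach can be sketched as follows. This is a simplified illustration that assumes single linkage (cluster distance = distance between the closest pair of points) and Euclidean distance; the notes do not prescribe a linkage, and the function names are hypothetical. The recorded merge history is the information a dendrogram would display.

```python
import math

def single_link(cluster_a, cluster_b):
    """Distance between the closest pair of points from the two clusters."""
    return min(math.dist(p, q) for p in cluster_a for q in cluster_b)

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    merges = []                        # (cluster_a, cluster_b, distance) per merge
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # merged cluster replaces i
        del clusters[j]                           # j > i, so index i stays valid
    return clusters, merges

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(points, 2)
print(clusters)  # [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
```

The nested search over all cluster pairs at every merge is what makes this approach expensive on large datasets.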

3. Density-Based Methods

Definition

Clusters are formed based on dense regions of data points separated by sparse regions.

Example

DBSCAN

Characteristics

Detects arbitrary-shaped clusters
Handles noise effectively

Advantages

Suitable for spatial data
Identifies outliers
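
A compact sketch of the DBSCAN idea, written in plain Python for illustration rather than efficiency. The parameter names eps (neighbourhood radius) and min_pts (minimum neighbours for a dense "core" point) follow the usual DBSCAN terminology; the helper names and the sample data are assumptions.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)      # None = not visited yet
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        labels[i] = cluster_id         # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep growing
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (50, 50) is labelled -1, which is how the method reports noise and outliers.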

4. Grid-Based Methods

Definition

Data space is divided into finite grid cells, and clustering is performed on grids instead of individual objects.

Characteristics

Fast processing
Efficient for large datasets

Advantages

Computationally efficient
Processing time depends mainly on the number of grid cells, not on the number of data objects
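
As a rough sketch of the grid idea, assuming 2-D points, square cells, and a hypothetical density threshold min_count: points are first counted per cell, and clustering then works only on the dense cells by joining neighbouring ones. The function names are illustrative.

```python
from collections import Counter

def dense_cells(points, cell_size, min_count):
    """Map each point to a grid cell and keep cells with enough points."""
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    return {cell for cell, n in counts.items() if n >= min_count}

def grid_clusters(points, cell_size, min_count):
    """Group adjacent dense cells (8-neighbourhood) into clusters of cells."""
    unvisited = dense_cells(points, cell_size, min_count)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component, stack = {start}, [start]
        while stack:
            cx, cy = stack.pop()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbor = (cx + dx, cy + dy)
                    if neighbor in unvisited:
                        unvisited.remove(neighbor)
                        component.add(neighbor)
                        stack.append(neighbor)
        clusters.append(component)
    return clusters

points = [(0.1, 0.1), (0.2, 0.3), (1.1, 0.2), (1.3, 0.4),
          (8.5, 8.5), (8.6, 8.7), (4.0, 4.0)]
print(sorted(sorted(c) for c in grid_clusters(points, cell_size=1, min_count=2)))
# [[(0, 0), (1, 0)], [(8, 8)]]
```

After the single counting pass, the work is done on cells only, which is why the cost is driven by the number of cells.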

5. Model-Based Methods

Definition

Assumes data is generated from statistical models and identifies clusters using probability distributions.

Characteristics

Uses mathematical models
Handles complex datasets

Advantages

Flexible
Can provide accurate clustering when the assumed model fits the data

Limitations

Computationally complex
Requires statistical assumptions
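
One common model-based approach (an example chosen here, not named in the notes) is a Gaussian mixture fitted with Expectation-Maximization. The sketch below assumes one-dimensional data, exactly two components, and caller-supplied starting means; all names and starting values are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, mean_a, mean_b, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with EM (illustrative only)."""
    var_a = var_b = 1.0          # assumed starting variances
    weight_a = 0.5               # assumed starting mixing weight of component A
    resp_a = [0.5] * len(data)
    for _ in range(iterations):
        # E-step: probability that each value was generated by component A.
        resp_a = []
        for x in data:
            pa = weight_a * gaussian_pdf(x, mean_a, var_a)
            pb = (1 - weight_a) * gaussian_pdf(x, mean_b, var_b)
            total = pa + pb
            resp_a.append(pa / total if total > 0 else 0.5)
        # M-step: re-estimate the parameters from these soft assignments.
        na = sum(resp_a)
        nb = len(data) - na
        if na == 0 or nb == 0:
            break                # one component is empty; stop instead of dividing by zero
        mean_a = sum(r * x for r, x in zip(resp_a, data)) / na
        mean_b = sum((1 - r) * x for r, x in zip(resp_a, data)) / nb
        var_a = max(sum(r * (x - mean_a) ** 2 for r, x in zip(resp_a, data)) / na, 1e-6)
        var_b = max(sum((1 - r) * (x - mean_b) ** 2 for r, x in zip(resp_a, data)) / nb, 1e-6)
        weight_a = na / len(data)
    return (mean_a, var_a), (mean_b, var_b), resp_a

data = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
comp_a, comp_b, resp = em_two_gaussians(data, mean_a=0.0, mean_b=6.0)
print(round(comp_a[0], 3), round(comp_b[0], 3))  # 1.0 5.0
```

Unlike a partitioning method, each object receives a probability of belonging to each cluster (resp) rather than a single hard label.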

6.5 Types of Data in Cluster Analysis

Different data types require different clustering techniques.

1. Interval-Scaled Variables

Definition

Continuous numerical variables measured on equal scales without a true zero point.

Examples

Temperature in Celsius
Calendar dates

Characteristics

Equal intervals between values
No true zero
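
Because interval-scaled variables can be measured in different units, they are usually standardized before distances are computed. A minimal sketch using the z-score, which is one common choice (other texts use the mean absolute deviation instead); the function name is hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]   # all values identical: nothing to scale
    return [(v - mean) / std for v in values]

temperatures_celsius = [10, 20, 30]
print([round(z, 3) for z in standardize(temperatures_celsius)])  # [-1.225, 0.0, 1.225]
```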

2. Binary Variables

Definition

Variables having only two possible values.

Examples

Yes/No
True/False
Pass/Fail
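
Two standard ways to compare objects described by binary variables are sketched below, assuming values are coded as 0 and 1. Simple matching treats both values as equally informative (symmetric variables); the Jaccard form ignores 0-0 matches, which suits asymmetric variables where only the 1 outcome is informative. The function names are illustrative.

```python
def simple_matching_dissimilarity(a, b):
    """Fraction of variables on which the two objects disagree."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def jaccard_dissimilarity(a, b):
    """Like simple matching, but 0-0 agreements are not counted."""
    both_one = sum(x == 1 and y == 1 for x, y in zip(a, b))
    mismatches = sum(x != y for x, y in zip(a, b))
    denominator = both_one + mismatches
    return mismatches / denominator if denominator else 0.0

# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 0, 0, 0, 0]
print(round(simple_matching_dissimilarity(patient_a, patient_b), 3))  # 0.167
print(jaccard_dissimilarity(patient_a, patient_b))                    # 0.5
```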

3. Nominal Variables

Definition

Categorical variables without any ordering.

Examples

Color
Nationality
Department names

Characteristics

No ranking
Only category labels

4. Ordinal Variables

Definition

Variables representing ordered categories.

Examples

Low, Medium, High
Customer satisfaction levels
Student grades

Characteristics

Order exists
Exact differences are unknown
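
A common way to use ordinal variables in distance-based clustering is to replace each category by its rank and scale the rank to the range 0 to 1. The sketch below assumes the ordered list of levels is known and has at least two entries; the function name is hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordered category to [0, 1] using its rank position."""
    rank = ordered_levels.index(value)          # 0 for the lowest level
    return rank / (len(ordered_levels) - 1)

levels = ["Low", "Medium", "High"]
print([ordinal_to_unit_interval(v, levels) for v in levels])  # [0.0, 0.5, 1.0]
```

This treats neighbouring levels as equally spaced, which is an approximation because the exact differences are not known.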

5. Ratio-Scaled Variables

Definition

Numerical variables having a true zero point.

Examples

Height
Weight
Age
Salary

Characteristics

Supports all arithmetic operations, including multiplication and division
Ratios are meaningful

6.6 Partitioning Methods

Partitioning methods divide data into a fixed number of clusters.

Types

K-Means
K-Medoids

1. K-Means Method

Definition

K-Means is a partitioning clustering algorithm that divides data into K clusters using mean values.

Each cluster is represented by a centroid.

Objective

Minimize the sum of squared distances between data points and the centroids of their assigned clusters.
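
This objective is usually written as the within-cluster sum of squared errors. The notation below is assumed for illustration: C_i is the i-th cluster, mu_i its centroid, and x a data point in that cluster.

```latex
E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```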

Steps of K-Means Algorithm

Step 1: Select Number of Clusters

Choose the value of K.

Step 2: Initialize Centroids

Randomly select K data objects as the initial centroids.

Step 3: Assign Data Points

Assign each object to the nearest centroid.

Step 4: Recalculate Centroids

Recompute each centroid as the mean of the objects assigned to its cluster.

Step 5: Repeat

Repeat Steps 3 and 4 until the cluster assignments no longer change.
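
The five steps above can be sketched in plain Python as follows. This is an illustrative version assuming Euclidean distance; the optional initial_centroids parameter is an addition made here so that runs can be reproduced, and all names are hypothetical.

```python
import math
import random

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))

def k_means(points, k, initial_centroids=None, max_iterations=100, seed=0):
    # Steps 1-2: K is given; pick K starting centroids (random unless supplied).
    if initial_centroids:
        centroids = list(initial_centroids)
    else:
        centroids = random.Random(seed).sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 3: assign every object to its nearest centroid.
        new_assignments = [nearest(p, centroids) for p in points]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:              # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return centroids, assignments

# Hypothetical customers described by (income, spending score).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, labels = k_means(points, 2, initial_centroids=[(1, 1), (8, 8)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Running it without initial_centroids and with different seed values is a simple way to observe the sensitivity to initialization.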

Example

Customers may be grouped into clusters based on:

Income
Spending behavior

Advantages

1. Simple and Easy to Implement

Widely used because of simplicity.

2. Fast Processing

Each iteration is linear in the number of objects, which makes it efficient for large datasets.

3. Scalable

Works well for moderate-to-large datasets.

4. Effective for Numerical Data

Performs well on continuous variables.

Limitations

1. Requires K in Advance

Number of clusters must be predefined.

2. Sensitive to Initial Centroids

Different initial centroids may produce different results.

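3. Sensitive to Noise and Outliers

Outliers pull the mean toward themselves, so a few extreme values can shift a centroid noticeably.

4. Poor for Non-Spherical Clusters

Works best for compact, roughly spherical clusters and struggles with elongated or irregularly shaped ones.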

2. K-Medoids Method

Definition

K-Medoids is a partitioning algorithm where clusters are represented by actual data objects called medoids.

Medoid

A medoid is the most centrally located object in a cluster.

Unlike centroids, medoids are actual existing data points.
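
A short one-dimensional sketch of the difference, assuming absolute difference as the distance and using made-up values: the centroid (mean) is dragged toward an outlier, while the medoid stays at a real, central data point. The function names are illustrative.

```python
def centroid_1d(values):
    """Mean of the values; need not be one of them."""
    return sum(values) / len(values)

def medoid_1d(values):
    """The value with the smallest total distance to all other values."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

values = [1, 2, 3, 4, 100]     # 100 is an outlier
print(centroid_1d(values))     # 22.0
print(medoid_1d(values))       # 3
```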

Steps of K-Medoids Algorithm

Step 1: Select Initial Medoids

Randomly choose K medoids.

Step 2: Assign Objects

Assign objects to the nearest medoid.

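Step 3: Replace Medoids

Try swapping a medoid with a non-medoid object.

Step 4: Compute Cost

Calculate the total clustering cost (the sum of distances from each object to its nearest medoid) and keep the swap only if the cost decreases.

Step 5: Repeat

Repeat Steps 2 to 4 until no swap reduces the cost.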
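
The steps above can be sketched as a simple greedy swap search in the spirit of PAM (Partitioning Around Medoids). This is an illustrative version assuming Euclidean distance, not an optimized implementation; the optional initial_medoids parameter and all names are additions made here.

```python
import math
import random
from itertools import product

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, initial_medoids=None, seed=0):
    # Step 1: choose K starting medoids (random unless supplied).
    if initial_medoids:
        medoids = list(initial_medoids)
    else:
        medoids = random.Random(seed).sample(points, k)
    best_cost = total_cost(points, medoids)
    improved = True
    while improved:                          # Step 5: repeat until no swap helps
        improved = False
        for i, candidate in product(range(k), points):
            if candidate in medoids:
                continue
            # Step 3: try replacing medoid i with a non-medoid object.
            trial = medoids[:i] + [candidate] + medoids[i + 1:]
            # Step 4: keep the swap only if the total cost goes down.
            cost = total_cost(points, trial)
            if cost < best_cost:
                medoids, best_cost, improved = trial, cost, True
    # Step 2 (final assignment): each object goes to its nearest medoid.
    labels = [min(range(k), key=lambda c: math.dist(p, medoids[c])) for p in points]
    return medoids, labels, best_cost

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels, cost = k_medoids(points, 2, initial_medoids=[(1, 2), (2, 1)])
print(sorted(medoids), cost)  # [(1, 1), (8, 8)] 4.0
```

Every candidate swap requires recomputing the cost over all objects, which is why the method is more expensive than K-Means.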

Advantages

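1. Robust Against Noise

Less affected by outliers than K-Means, because an extreme value shifts a mean far more than it shifts a medoid.

2. Uses Real Data Objects

Medoids are actual data points.

3. Better for Noisy Data

Produces more stable clusters than K-Means when the data is noisy.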

Limitations

1. Computationally Expensive

More costly than K-Means.

2. Slower Processing

Not suitable for extremely large datasets.

3. Requires K in Advance

Number of clusters must be specified.

6.7 Applications of Data Mining in Various Sectors

1. Banking

Credit risk analysis
Fraud detection
Loan approval prediction
Customer segmentation

2. Healthcare

Disease diagnosis
Patient record analysis
Medical image analysis
Drug discovery

3. Education

Student performance prediction
Attendance analysis
Personalized learning systems

4. Retail

Market basket analysis
Inventory management
Customer behavior analysis
Recommendation systems

5. Telecommunications

Network optimization
Customer churn prediction
Fraud detection

6. E-Commerce

Product recommendation
Customer targeting
Sales prediction

7. Fraud Detection

Credit card fraud detection
Insurance fraud analysis
Cybersecurity monitoring

8. Social Media

Sentiment analysis
Trend detection
User behavior analysis
Advertisement targeting
