6. Clustering

Clustering is an important unsupervised learning technique in Data Mining used to group similar data objects into clusters.

  • Objects within the same cluster are highly similar.

  • Objects in different clusters are dissimilar.

Unlike classification:

  • Clustering does not require predefined class labels.

  • Groups are formed automatically based on similarities in data.

Applications of Clustering

  • Market analysis

  • Customer segmentation

  • Pattern recognition

  • Image processing

  • Medical diagnosis

  • Social network analysis


6.1 Introduction to Clustering

1. Definition of Clustering

Clustering is the process of grouping data objects into clusters such that:

  • Intra-cluster similarity is high

  • Inter-cluster similarity is low

Each cluster represents a collection of related data objects.
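The two conditions above can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the (dis)similarity measure; the helper names `mean_intra` and `mean_inter` and the sample points are illustrative and not part of any standard library.

```python
from itertools import combinations, product
from math import dist  # Euclidean distance, Python 3.8+

def mean_intra(cluster):
    """Average distance between all pairs of objects inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

def mean_inter(c1, c2):
    """Average distance between objects taken from two different clusters."""
    pairs = list(product(c1, c2))
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

# Two hypothetical clusters of 2-D points.
a = [(0, 0), (0, 1), (1, 0)]
b = [(10, 10), (10, 11), (11, 10)]

# A small intra-cluster distance means high intra-cluster similarity;
# a large inter-cluster distance means low inter-cluster similarity.
print(mean_intra(a), mean_inter(a, b))
```

For a reasonable clustering, the first number is expected to be much smaller than the second.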


2. Example of Clustering

A shopping website may group customers based on:

  • Purchasing habits

  • Age

  • Income

  • Interests

Customers with similar behavior are placed in the same cluster.


3. Characteristics of Clustering

1. Unsupervised Learning

  • Does not require labeled training data

  • Automatically discovers patterns in data


2. Similarity-Based Grouping

Objects are grouped based on similarity measures such as:

  • Distance

  • Density

  • Statistical similarity
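As an illustration of distance as a similarity measure, here is a short sketch of two widely used distance functions. The function names and the sample points are chosen for this example only; a smaller distance means the objects are more similar.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```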


3. High Intra-Cluster Similarity

Objects within the same cluster are highly similar.


4. High Inter-Cluster Dissimilarity

Objects belonging to different clusters differ markedly from one another.


5. Automatic Pattern Discovery

Helps identify hidden structures and relationships in data.


6. Data Reduction

Large datasets can be simplified by representing them as clusters.


4. Importance of Clustering

Clustering helps to:

  • Discover hidden patterns

  • Organize large datasets

  • Simplify data analysis

  • Improve decision-making


6.2 Cluster Analysis

1. Meaning of Cluster Analysis

Cluster Analysis is the process of analyzing data objects and organizing them into meaningful clusters.

The main objective is to:

  • Maximize similarity within clusters

  • Minimize similarity between clusters


2. Objectives of Cluster Analysis

1. Pattern Discovery

Identifies hidden relationships and trends in data.

Example

Grouping customers with similar buying behavior.


2. Data Simplification

Reduces complexity by grouping similar records together.


3. Knowledge Discovery

Extracts useful information from large datasets.


4. Data Organization

Organizes data into understandable structures.


5. Decision Support

Helps organizations identify trends and make better decisions.


6. Outlier Detection

Identifies abnormal objects that do not belong to any cluster.


3. Characteristics of Good Clustering

A good clustering method should:

  • Produce high-quality clusters (high intra-cluster similarity, low inter-cluster similarity)

  • Handle noisy data

  • Be scalable for large datasets

  • Handle different data types

  • Work efficiently


6.3 Need for Clustering

Clustering is necessary because modern systems generate huge amounts of complex data that are difficult to analyze manually.

1. Pattern Recognition

Clustering helps recognize patterns and relationships among data objects.

Examples

  • Face recognition

  • Handwriting recognition

  • Speech recognition

Importance

  • Improves automation

  • Enhances machine learning systems

  • Detects hidden structures


2. Data Segmentation

Clustering divides large datasets into smaller meaningful groups.

Example

Customer segmentation based on:

  • Age

  • Income

  • Shopping behavior

Importance

  • Improves marketing strategies

  • Provides personalized services

  • Increases customer satisfaction


3. Knowledge Discovery

Clustering helps extract useful knowledge from large datasets.

Example

Grouping patients with similar symptoms.

Importance

  • Supports research

  • Improves business intelligence

  • Assists data analysis


6.4 Major Clustering Methods

Clustering methods are classified based on how clusters are formed.

Types of Clustering Methods

  1. Partitioning Methods

  2. Hierarchical Methods

  3. Density-Based Methods

  4. Grid-Based Methods

  5. Model-Based Methods


1. Partitioning Methods

Definition

Partitioning methods divide data into a predefined number of clusters.

Each object belongs to exactly one cluster.

Characteristics

  • Fast and simple

  • Suitable for medium-sized datasets

  • Requires the number of clusters in advance

Examples

  • K-Means

  • K-Medoids


2. Hierarchical Methods

Definition

Hierarchical clustering creates a hierarchy of clusters.

Clusters are formed by:

  • Merging smaller clusters

  • Dividing larger clusters

Types

1. Agglomerative Method

  • Bottom-up approach

  • Small clusters merge into larger clusters

2. Divisive Method

  • Top-down approach

  • Large clusters divide into smaller clusters

Characteristics

  • Produces a dendrogram (tree structure)

  • Does not require initial cluster count

Limitations

  • Computationally expensive

  • Difficult for very large datasets
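The agglomerative (bottom-up) approach can be sketched as follows. This is a simplified illustration, assuming Euclidean distance and single linkage (cluster distance = distance of the closest pair); other linkage rules exist. The parameter `target_clusters` is a hypothetical name for the level at which the hierarchy is cut, and merging down to 1 would produce the full hierarchy. The repeated pairwise search also shows why the method is computationally expensive.

```python
from math import dist

def single_link(c1, c2):
    """Distance between two clusters = distance of their closest pair."""
    return min(dist(a, b) for a in c1 for b in c2)

def agglomerative(points, target_clusters):
    """Bottom-up merging; returns the final clusters and the merge history."""
    clusters = [[p] for p in points]  # start: every object is its own cluster
    merges = []                       # merge history, i.e. the dendrogram levels
    while len(clusters) > target_clusters:
        # Find the indices of the two closest clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
clusters, merges = agglomerative(points, target_clusters=2)
print(clusters)
```

On this sample data the two remaining clusters are the group around (1, 1) and the group around (8, 8).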


3. Density-Based Methods

Definition

Clusters are formed based on dense regions of data points separated by sparse regions.

Example

  • DBSCAN

Characteristics

  • Detects arbitrary-shaped clusters

  • Handles noise effectively

Advantages

  • Suitable for spatial data

  • Identifies outliers
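The idea behind DBSCAN can be sketched in a few lines. This is a simplified, unoptimized illustration, not a reference implementation: it assumes Euclidean distance, counts a point as its own neighbour, and uses the conventional parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense "core" point). Points that are not reachable from any core point are labelled as noise.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force search; real implementations use a spatial index.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # provisional; may become a border point later
            continue
        cluster_id += 1                # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins the cluster, not expanded
                continue
            if labels[j] is not None:
                continue                # already assigned to a cluster
            labels[j] = cluster_id
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (5, 20) is reported as noise instead of being forced into a cluster, which is how density-based methods identify outliers.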


4. Grid-Based Methods

Definition

Data space is divided into finite grid cells, and clustering is performed on grids instead of individual objects.

Characteristics

  • Fast processing

  • Efficient for large datasets

Advantages

  • Computationally efficient

  • Clustering time depends mainly on the number of grid cells rather than the number of objects
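A toy sketch of the grid idea for 2-D data is given below. It is not any specific named algorithm; the function name and the parameters `cell_size` and `min_count` are assumptions made for this example. Objects are counted per cell once, and the clustering step then works only on the dense cells.

```python
from collections import Counter

def grid_clusters(points, cell_size, min_count):
    """Group dense grid cells that touch each other (including diagonally)."""
    # 1. Map every object to its grid cell and count objects per cell.
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    dense = {cell for cell, n in counts.items() if n >= min_count}

    # 2. Flood-fill over dense cells; individual objects are not touched here.
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        component, stack = set(), [start]
        while stack:
            cx, cy = stack.pop()
            component.add((cx, cy))
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(component)
    return clusters

points = [(0.2, 0.3), (0.7, 0.8), (1.1, 0.4), (1.6, 0.9),
          (7.2, 7.1), (7.8, 7.5), (4.0, 4.0)]
clusters = grid_clusters(points, cell_size=1.0, min_count=2)
print(clusters)
```

Here the cells (0, 0) and (1, 0) form one cluster, the cell (7, 7) forms another, and the sparse cell containing (4.0, 4.0) is ignored.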


5. Model-Based Methods

Definition

Assumes data is generated from statistical models and identifies clusters using probability distributions.

Characteristics

  • Uses mathematical models

  • Handles complex datasets

Advantages

  • Flexible

  • Can give accurate clusters when the assumed model fits the data

Limitations

  • Computationally complex

  • Requires statistical assumptions
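One common model-based approach fits a mixture of Gaussian distributions with the Expectation-Maximization (EM) procedure. The sketch below is a deliberately small illustration for one-dimensional data and exactly two components; the starting values (smallest and largest data value as initial means, unit variances) are an assumption made to keep the example deterministic, not a recommended initialization.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    means = [min(data), max(data)]   # crude deterministic start (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each component generated each object.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from these soft assignments.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6,  # floor to avoid a zero variance
            )
            weights[k] = nk / len(data)
    # Hard labels: the component with the higher membership probability.
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return means, labels

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, labels = em_two_gaussians(data)
print(means, labels)
```

Unlike partitioning methods, the E-step gives every object a probability of belonging to each cluster; the hard labels are derived only at the end.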


6.5 Types of Data in Cluster Analysis

Different data types require different clustering techniques.


1. Interval-Scaled Variables

Definition

Continuous numerical variables measured on a scale with equal intervals but no true zero point.

Examples

  • Temperature in Celsius

  • Calendar dates

Characteristics

  • Equal intervals between values

  • No true zero


2. Binary Variables

Definition

Variables having only two possible values.

Examples

  • Yes/No

  • True/False

  • Pass/Fail
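Binary records are usually compared by counting matching and mismatching attributes rather than by geometric distance. The sketch below shows two common choices, assuming the values are coded as 0 and 1; the function names and the sample records are illustrative.

```python
def simple_matching_distance(a, b):
    """Fraction of attributes on which two binary records disagree."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)

def jaccard_distance(a, b):
    """Like simple matching, but attributes that are 0 in both records are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return 1 - both / either if either else 0.0

# Two hypothetical records with six Yes/No attributes (1 = Yes, 0 = No).
p1 = [1, 0, 1, 0, 0, 0]
p2 = [1, 0, 0, 0, 0, 0]
print(simple_matching_distance(p1, p2))  # 1 mismatch out of 6 attributes
print(jaccard_distance(p1, p2))          # 0.5
```

The Jaccard form is often preferred when a value of 1 is rare and informative (for example, a positive test result), because shared 0 values then say little about similarity.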


3. Nominal Variables

Definition

Categorical variables without any ordering.

Examples

  • Color

  • Nationality

  • Department names

Characteristics

  • No ranking

  • Only category labels


4. Ordinal Variables

Definition

Variables representing ordered categories.

Examples

  • Low, Medium, High

  • Customer satisfaction levels

  • Student grades

Characteristics

  • Order exists

  • Exact differences are unknown
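Because only the order is known, one common convention is to replace each category by its rank and rescale the ranks to the range 0 to 1 before computing distances. The sketch below assumes this convention; the function name is hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_levels):
    """Map an ordered category to [0, 1] using its rank among the levels."""
    rank = ordered_levels.index(value)          # 0-based rank
    return rank / (len(ordered_levels) - 1)

levels = ["Low", "Medium", "High"]
scaled = [ordinal_to_unit_interval(v, levels) for v in ["Low", "High", "Medium"]]
print(scaled)  # [0.0, 1.0, 0.5]
```

This treats neighbouring levels as equally spaced, which is an assumption the data itself does not guarantee.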


5. Ratio-Scaled Variables

Definition

Numerical variables having a true zero point.

Examples

  • Height

  • Weight

  • Age

  • Salary

Characteristics

  • Supports all arithmetic operations, including multiplication and division

  • Ratios are meaningful
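The practical difference between interval-scaled and ratio-scaled variables appears when the unit of measurement changes. The sketch below compares a temperature ratio (interval-scaled) with a weight ratio (ratio-scaled); the conversion factor 2.20462 pounds per kilogram is an approximation used for illustration.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32        # shifts the zero point

def kg_to_lb(kg):
    return kg * 2.20462          # approximate factor; zero stays zero

# Interval-scaled: the ratio changes with the unit, so "twice as hot" is not meaningful.
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)
print(ratio_celsius, ratio_fahrenheit)   # 2.0 1.36

# Ratio-scaled: the ratio is the same in every unit, so "twice as heavy" is meaningful.
ratio_kg = 20 / 10
ratio_lb = kg_to_lb(20) / kg_to_lb(10)
print(ratio_kg, ratio_lb)
```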


6.6 Partitioning Methods

Partitioning methods divide data into a fixed number of clusters.

Types

  1. K-Means

  2. K-Medoids


1. K-Means Method

Definition

K-Means is a partitioning clustering algorithm that divides data into K clusters.

Each cluster is represented by a centroid, which is the mean of the objects assigned to that cluster.


Objective

Minimize the total squared distance between the data points and the centroid of the cluster each point is assigned to.
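This objective is commonly written as the within-cluster sum of squared errors. The notation is introduced here for illustration: C_k is the set of points in cluster k, and mu_k is its centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{\lvert C_k \rvert} \sum_{x \in C_k} x
```

A smaller value of J means the points lie closer to their centroids.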


Steps of K-Means Algorithm

Step 1: Select Number of Clusters

Choose the value of K.

Step 2: Initialize Centroids

Randomly select K data points as the initial centroids.

Step 3: Assign Data Points

Assign each object to the nearest centroid.

Step 4: Recalculate Centroids

Recompute each centroid as the mean of the objects assigned to it.

Step 5: Repeat

Repeat Steps 3 and 4 until the cluster assignments no longer change.
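The steps above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and points given as tuples; the parameters `max_iter` and `seed` are additions for this example (an iteration cap and a fixed random seed for repeatable runs), and an empty cluster simply keeps its previous centroid.

```python
import random
from math import dist

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment) where assignment[i] is the cluster of points[i]."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # Step 2: K random data points
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign each object to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:        # Step 5: stop when nothing changes
            break
        assignment = new_assignment
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                         # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]   # Step 1: choose K = 2
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

Which cluster receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself should be compared between runs.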


Example

Customers may be grouped into clusters based on:

  • Income

  • Spending behavior


Advantages

1. Simple and Easy to Implement

Widely used because of simplicity.

2. Fast Processing

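Each iteration takes time proportional to the number of objects × K × the number of attributes, so it remains efficient for large datasets.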

3. Scalable

Works well for moderate-to-large datasets.

4. Effective for Numerical Data

Performs well on continuous variables.


Limitations

1. Requires K in Advance

Number of clusters must be predefined.

2. Sensitive to Initial Centroids

Different initial centroids may produce different results.

3. Sensitive to Noise and Outliers

Because a centroid is a mean, a few extreme values can pull it away from the bulk of the cluster.

4. Poor for Non-Spherical Clusters

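Works best when clusters are roughly spherical and of similar size; elongated or irregularly shaped clusters are often split incorrectly.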


2. K-Medoids Method

Definition

K-Medoids is a partitioning algorithm where clusters are represented by actual data objects called medoids.


Medoid

A medoid is the most centrally located object in a cluster.

Unlike centroids, medoids are actual existing data points.
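The difference between a centroid and a medoid is easiest to see on a cluster containing one extreme value. The sketch below assumes Euclidean distance; the helper names and the sample cluster are illustrative.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean; usually not an actual data point."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def medoid(points):
    """The member with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (50, 50)]   # (50, 50) is an outlier
print(centroid(cluster))  # (11.2, 11.2)
print(medoid(cluster))    # (2, 2)
```

The outlier drags the centroid far from the four nearby points, while the medoid remains one of them.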


Steps of K-Medoids Algorithm

Step 1: Select Initial Medoids

Randomly choose K medoids.

Step 2: Assign Objects

Assign objects to the nearest medoid.

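Step 3: Try Swapping Medoids

Tentatively swap a medoid with a non-medoid object.

Step 4: Compute Cost

Calculate the total clustering cost (the sum of distances from each object to its nearest medoid) and keep the swap only if the cost decreases.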

Step 5: Repeat

Continue until no improvement occurs.
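The steps above can be sketched as a greedy swap search. This is a simplified variant for illustration, not the exact PAM procedure: it accepts the first swap that lowers the cost, assumes Euclidean distance, and starts from the first K points instead of a random choice so that the example is repeatable.

```python
from math import dist

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k):
    """Return (medoids, labels, cost) after greedy swap improvement."""
    medoids = list(points[:k])          # deterministic start (assumption)
    best = total_cost(points, medoids)
    improved = True
    while improved:                     # repeat until no swap helps
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Tentatively replace medoid i with a non-medoid object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:         # keep the swap only if the cost drops
                    medoids, best, improved = trial, cost, True
    # Final assignment of each object to its nearest medoid.
    labels = [min(range(k), key=lambda c: dist(p, medoids[c])) for p in points]
    return medoids, labels, best

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels, cost = k_medoids(points, k=2)
print(medoids, labels, cost)
```

Every trial swap recomputes the cost over all objects, which is why K-Medoids is more expensive than K-Means on large datasets.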


Advantages

1. Robust Against Noise

Less affected by outliers.

2. Uses Real Data Objects

Medoids are actual data points.

3. Better for Noisy Data

Produces stable clusters.


Limitations

1. Computationally Expensive

More costly than K-Means.

2. Slower Processing

Not suitable for extremely large datasets.

3. Requires K in Advance

Number of clusters must be specified.


6.7 Applications of Data Mining in Various Sectors

1. Banking

  • Credit risk analysis

  • Fraud detection

  • Loan approval prediction

  • Customer segmentation


2. Healthcare

  • Disease diagnosis

  • Patient record analysis

  • Medical image analysis

  • Drug discovery


3. Education

  • Student performance prediction

  • Attendance analysis

  • Personalized learning systems


4. Retail

  • Market basket analysis

  • Inventory management

  • Customer behavior analysis

  • Recommendation systems


5. Telecommunications

  • Network optimization

  • Customer churn prediction

  • Fraud detection


6. E-Commerce

  • Product recommendation

  • Customer targeting

  • Sales prediction


7. Fraud Detection

  • Credit card fraud detection

  • Insurance fraud analysis

  • Cybersecurity monitoring


8. Social Media Analytics

  • Sentiment analysis

  • Trend detection

  • User behavior analysis

  • Advertisement targeting