3. Introduction to Data Mining
Data Mining is an important field in computer science and information systems that focuses on extracting useful knowledge and hidden patterns from large volumes of data. With the rapid growth of databases, organizations generate enormous amounts of information daily. Simply storing data is not enough; organizations need techniques to analyze the data and discover meaningful information.
Data Mining combines concepts from:
- Database systems
- Statistics
- Machine learning
- Artificial intelligence
- Pattern recognition
- Data Warehousing
It helps organizations make better decisions, predict future trends, and improve business strategies.
3.1 Introduction to Data Mining
1. Definition of Data Mining
Data Mining is the process of discovering hidden patterns, useful information, relationships, trends, and knowledge from large databases using intelligent techniques.
It is also loosely referred to as:
- Knowledge Discovery
- Knowledge Extraction
- Data Analysis
Data Mining converts raw data into useful knowledge that can support:
- Business decision-making
- Prediction
- Trend analysis
- Strategic planning
2. Formal Definition of Data Mining
Data Mining can be defined as:
“The process of extracting valid, previously unknown, understandable, and useful patterns from large databases.”
3. Features of Data Mining
1. Automatic Knowledge Discovery
Data Mining automatically identifies hidden patterns and relationships.
2. Analysis of Large Databases
It handles huge amounts of data efficiently.
3. Prediction and Forecasting
Mining techniques help predict future outcomes.
Example
- Sales forecasting
- Weather prediction
- Stock market analysis
4. Pattern Identification
Discovers trends and relationships among data.
Example
Customers buying bread may also buy butter.
5. Decision Support
Helps organizations make accurate business decisions.
4. Evolution of Data Mining
Data Mining evolved gradually from database management systems and data analysis technologies.
1. Data Collection Era
In the early stage, organizations focused mainly on collecting and storing data.
Technologies Used
- File systems
- Basic databases
2. Database Management Era
Databases were introduced to manage structured data efficiently.
Features
- Data storage
- Query processing
- Transaction management
Examples
- Relational databases
- DBMS systems
3. Data Warehousing Era
Organizations started integrating data from multiple sources into centralized warehouses.
Features
- Historical data storage
- OLAP analysis
- Reporting systems
4. Data Mining Era
Advanced techniques were developed to discover hidden knowledge from warehouses and databases.
Technologies Used
- Machine learning
- Artificial intelligence
- Statistical analysis
- Pattern recognition
5. Big Data and AI Era
Modern Data Mining now works with:
- Massive datasets
- Real-time data
- Cloud computing
- AI-based analytics
3.2 Need for Data Mining
Organizations require Data Mining because modern databases contain huge volumes of data that are difficult to analyze manually.
1. Extraction of Useful Knowledge
Large databases contain hidden valuable information.
Data Mining extracts:
- Useful patterns
- Relationships
- Trends
- Business insights
Example
A supermarket analyzing customer purchase history to improve product placement.
2. Pattern Discovery
Data Mining identifies hidden patterns and correlations among data.
Examples
- Customer buying behavior
- Fraud detection patterns
- Website usage patterns
Importance
Pattern discovery helps organizations:
- Improve services
- Increase profit
- Reduce risks
3. Business Intelligence
Data Mining supports Business Intelligence systems.
Explanation
Business Intelligence helps organizations:
- Analyze performance
- Monitor growth
- Predict future trends
- Improve planning
Example
Banks use mining techniques to:
- Detect fraud
- Analyze customer behavior
- Approve loans
4. Better Decision Making
Mining systems provide accurate analytical information for management decisions.
Example
A company identifying low-performing products before losses increase.
5. Competitive Advantage
Organizations using mining techniques gain advantages over competitors through better understanding of data.
3.3 KDD Process (Knowledge Discovery in Databases)
KDD stands for Knowledge Discovery in Databases.
It is the complete process of extracting useful knowledge from large databases.
Data Mining is only one step within the KDD process.

Steps in KDD Process
1. Data Cleaning
Data cleaning removes or corrects:
- Noise
- Missing values
- Duplicate data
- Inconsistencies
Purpose
Improve data quality before analysis.
Example
Correcting invalid customer phone numbers.
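The cleaning step can be sketched in a few lines of Python. This is only a minimal illustration, not a standard procedure: the record fields, the mean-fill strategy for missing ages, and the 10-digit phone rule are all assumptions made for the example.

```python
import re

# Hypothetical raw customer records; field names are illustrative only.
raw_records = [
    {"id": 1, "name": "Asha", "phone": "98765 43210", "age": 34},
    {"id": 2, "name": "Ravi", "phone": "12345", "age": None},
    {"id": 1, "name": "Asha", "phone": "98765 43210", "age": 34},  # duplicate
    {"id": 3, "name": "Meena", "phone": "9123456780", "age": 41},
]

def clean(records):
    """Drop duplicate ids, fill missing ages with the mean, blank out bad phones."""
    seen, unique = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))  # copy so the raw data stays untouched

    known_ages = [r["age"] for r in unique if r["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages)

    for rec in unique:
        if rec["age"] is None:
            rec["age"] = mean_age
        digits = re.sub(r"\D", "", rec["phone"])
        # Assumed rule: a valid phone number has exactly 10 digits.
        rec["phone"] = digits if len(digits) == 10 else None
    return unique

cleaned = clean(raw_records)
```

Here the duplicate record is dropped, the missing age is filled with the mean of the known ages, and the phone number that fails the assumed rule is set to None so it can be followed up later.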
2. Data Integration
Data from multiple sources is combined into a unified dataset.
Sources May Include
- Databases
- Files
- Web sources
- Applications
Example
Combining sales data from multiple branch offices.
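A small Python sketch of integration, assuming two branches export the same kind of sales record under different field names. The branch data, the field names, and the helper `to_common_schema` are hypothetical.

```python
# Hypothetical branch exports with slightly different field names.
branch_a = [{"date": "2024-01-05", "product": "Bread", "amount": 120.0}]
branch_b = [{"sale_date": "2024-01-05", "item": "Butter", "total": 80.0}]

# Each mapping reads as: unified field name -> source field name.
MAPPING_A = {"date": "date", "product": "product", "amount": "amount"}
MAPPING_B = {"date": "sale_date", "product": "item", "amount": "total"}

def to_common_schema(record, branch, mapping):
    """Rename source fields to the unified schema and tag the branch."""
    unified = {target: record[source] for target, source in mapping.items()}
    unified["branch"] = branch
    return unified

unified_sales = (
    [to_common_schema(r, "A", MAPPING_A) for r in branch_a]
    + [to_common_schema(r, "B", MAPPING_B) for r in branch_b]
)
```

After this step every record has the same fields, so later steps can treat the branches as one dataset.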
3. Data Selection
Relevant data is selected from the database for mining.
Explanation
Not all data is necessary for analysis.
Example
Selecting customer age and income for loan prediction.
4. Data Transformation
Data is converted into suitable forms for mining.
Transformation Methods
- Normalization
- Aggregation
- Generalization
Example
Converting daily sales into monthly sales summaries.
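Two of the transformation methods can be illustrated with a short Python sketch. The daily sales figures are made up, and min-max scaling is only one of several possible normalization methods.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 150.0),
    ("2024-02-03", 200.0),
    ("2024-02-17", 50.0),
    ("2024-03-09", 400.0),
]

def aggregate_monthly(sales):
    """Aggregation: sum daily amounts into 'YYYY-MM' buckets."""
    totals = defaultdict(float)
    for date, amount in sales:
        totals[date[:7]] += amount
    return dict(totals)

def min_max_normalize(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values equal, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

monthly = aggregate_monthly(daily_sales)
normalized = min_max_normalize(list(monthly.values()))
```

The monthly totals here are 250, 250, and 400, which normalize to 0.0, 0.0, and 1.0.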
5. Data Mining
Mining algorithms are applied to discover patterns and knowledge.
Techniques Used
- Classification
- Clustering
- Association analysis
- Prediction
6. Pattern Evaluation
Interesting and useful patterns are identified from discovered results.
Purpose
Remove irrelevant or unimportant patterns.
Example
Identifying highly profitable customer groups.
7. Knowledge Presentation
The discovered knowledge is presented in understandable forms.
Methods
- Reports
- Charts
- Graphs
- Dashboards
- Visualization tools
Purpose
Help users interpret mining results effectively.
3.4 Data Mining Architecture
Data Mining Architecture defines the structure and components of a Data Mining system.
It describes how different modules interact to perform mining operations.

Components of Data Mining Architecture
1. Database/Data Warehouse Server
This component stores and manages data.
Functions
- Data retrieval
- Query processing
- Data storage
- Database management
Sources
- Relational databases
- Data warehouses
- Flat files
- Transactional systems
2. Knowledge Base
The knowledge base stores background information used during mining.
Contains
- User beliefs
- Domain knowledge
- Threshold values
- Concept hierarchies
Importance
Helps guide mining processes and improve mining accuracy.
3. Data Mining Engine
This is the core component of the mining system.
Functions
Performs mining tasks such as:
- Classification
- Clustering
- Association analysis
- Prediction
- Outlier detection
Importance
Responsible for discovering hidden patterns.
4. Pattern Evaluation Module
This module evaluates discovered patterns and identifies useful knowledge.
Functions
- Measure interestingness
- Filter unimportant patterns
- Improve mining efficiency
Example
Removing weak association rules with low support.
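The filtering done by this module can be sketched as a simple threshold check. The rules, their support and confidence values, and the threshold settings below are all made up for illustration.

```python
# Hypothetical rules as produced by a mining engine; the numbers are made up.
rules = [
    {"rule": "Bread -> Butter", "support": 0.40, "confidence": 0.80},
    {"rule": "Milk -> Jam", "support": 0.02, "confidence": 0.90},
    {"rule": "Tea -> Sugar", "support": 0.25, "confidence": 0.30},
]

def filter_rules(candidates, min_support, min_confidence):
    """Keep only rules that meet both interestingness thresholds."""
    return [
        r for r in candidates
        if r["support"] >= min_support and r["confidence"] >= min_confidence
    ]

# Threshold values are assumptions; in practice the user supplies them.
strong_rules = filter_rules(rules, min_support=0.10, min_confidence=0.50)
```

Only "Bread -> Butter" survives: "Milk -> Jam" has low support, and "Tea -> Sugar" has low confidence.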
5. User Interface
Allows communication between users and the mining system.
Functions
- Query input
- Result visualization
- Report generation
- Interactive analysis
Examples
- Dashboards
- Graphical reports
- Visualization tools
3.5 Data Mining Functionalities
Data Mining functionalities describe the types of knowledge that can be discovered from data.
1. Concept Description
Concept description summarizes general characteristics of data.
Types
1. Characterization
Describes features of a target class.
Example
Characteristics of high-income customers.
2. Discrimination
Compares one class with another.
Example
Comparing:
- High-performing students
- Low-performing students
2. Association Analysis
Association analysis discovers relationships among items.
Purpose
Find frequently occurring item combinations.
Example
Customers frequently buying bread and butter together.
Application Areas
- Market basket analysis
- Product recommendation systems
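A minimal sketch of market basket analysis in Python: count how often each pair of items appears in the same transaction and keep the frequent ones. The transactions and the `min_count` threshold are assumptions. This brute-force pair count is for illustration only and is not an optimized algorithm such as Apriori.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Jam"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Jam"},
]

def frequent_pairs(baskets, min_count):
    """Count every item pair and keep those seen at least min_count times."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes ("Bread", "Butter") and ("Butter", "Bread") the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
```

With these baskets, only the pair (Bread, Butter) appears three times, so it is the only frequent pair.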
3. Classification
Classification assigns data into predefined categories.
Explanation
A model is trained using labeled data.
Examples
- Email spam detection
- Disease diagnosis
- Student grade prediction
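Classification with labeled data can be sketched using a k-nearest-neighbour classifier, chosen here only because it is short. The student data and the two features are hypothetical, and the features are not rescaled, which a real application would usually do first.

```python
from collections import Counter

# Hypothetical labeled training data: (hours studied, attendance %) -> grade.
training = [
    ((2.0, 50.0), "Fail"),
    ((3.0, 60.0), "Fail"),
    ((1.0, 40.0), "Fail"),
    ((8.0, 90.0), "Pass"),
    ((7.0, 85.0), "Pass"),
    ((9.0, 95.0), "Pass"),
]

def classify(point, labeled, k=3):
    """Return the majority label among the k nearest training points."""
    by_distance = sorted(
        labeled,
        # Squared Euclidean distance on raw (unscaled) features.
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction = classify((7.5, 88.0), training)
```

The new student is closest to the three "Pass" examples, so the predicted category is "Pass".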
4. Prediction
Prediction estimates future values or trends.
Example
- Sales forecasting
- Weather prediction
- Stock market prediction
Difference Between Classification and Prediction
| Classification | Prediction |
|---|---|
| Predicts categories | Predicts continuous values |
| Example: Pass/Fail | Example: Salary amount |
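To contrast with classification, the sketch below predicts a continuous value (a salary amount) by fitting a straight line with least squares. The data points are invented, and a simple linear model is only one possible choice of prediction technique.

```python
# Hypothetical data: years of experience -> salary (in thousands).
experience = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [30.0, 35.0, 40.0, 45.0, 50.0]

def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

slope, intercept = fit_line(experience, salary)
# Predict the salary for 6 years of experience.
predicted_salary = slope * 6.0 + intercept
```

For this data the fitted line has slope 5 and intercept 25, so the predicted salary for 6 years is 55 (thousand), a number rather than a category.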
5. Clustering
Clustering groups similar data objects together.
Explanation
Unlike classification, clustering forms groups automatically without predefined classes.
Example
Grouping customers based on:
- Purchasing behavior
- Income level
- Interests
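A plain k-means sketch shows how groups can form without any labels. The customer data is hypothetical, and the initial centroids are fixed by hand to keep the example repeatable. Real implementations usually choose them randomly.

```python
# Hypothetical customers: (annual income in thousands, purchases per month).
customers = [
    (20.0, 2.0), (22.0, 3.0), (25.0, 2.0),
    (80.0, 10.0), (85.0, 12.0), (90.0, 11.0),
]

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then re-average, repeatedly."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster;
        # keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Initial centroids are an assumption made for a deterministic example.
centroids, clusters = k_means(customers, centroids=[customers[0], customers[3]])
```

The six customers split into a low-income, low-purchase group and a high-income, high-purchase group, even though no class labels were given.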
6. Outlier Analysis
Outlier analysis identifies abnormal or unusual data objects.
Examples
- Fraudulent banking transactions
- Network intrusions
- Abnormal medical reports
Importance
Helps detect rare but important events.
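One simple way to flag unusual values is the z-score: how many standard deviations a value lies from the mean. The sketch below assumes made-up transaction amounts and a threshold of 2.5, which is a common but arbitrary choice. Other outlier methods exist.

```python
import statistics

# Hypothetical transaction amounts; the last one is deliberately unusual.
amounts = [100, 102, 98, 101, 99, 100, 103, 97, 100, 5000]

def z_score_outliers(values, threshold=2.5):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

outliers = z_score_outliers(amounts)
```

Only the 5000 transaction is flagged. Note that a very large outlier also inflates the mean and standard deviation, so this method works best when outliers are rare.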
3.6 Data Mining Task Primitives
Task primitives define the basic requirements and specifications for Data Mining tasks.
They help users communicate mining objectives to the system.
1. Task-Relevant Data
Specifies the data to be mined.
Includes
- Database name
- Table name
- Attributes
- Conditions
Example
Mining customer purchase data from a sales database.
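In a relational setting, these four items map naturally onto a query. The sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a sales database. The table name, columns, and rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for a hypothetical sales database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer_id INTEGER, item TEXT, amount REAL, year INTEGER)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, "Bread", 2.5, 2023),
        (1, "Butter", 4.0, 2024),
        (2, "Bread", 2.5, 2024),
        (3, "Jam", 3.0, 2024),
    ],
)

# Table (purchases), attributes (customer_id, item), and condition (year = 2024)
# together define the task-relevant data.
task_relevant = conn.execute(
    "SELECT customer_id, item FROM purchases WHERE year = ? ORDER BY customer_id",
    (2024,),
).fetchall()
conn.close()
```

Only the selected attributes of the 2024 purchases are passed on to the mining step.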
2. Kind of Knowledge to be Mined
Specifies the type of mining functionality required.
Examples
- Classification
- Clustering
- Association rules
- Prediction
Importance
Helps the mining engine choose suitable algorithms.
3. Background Knowledge
Background knowledge provides additional domain information.
Examples
- Concept hierarchies
- Business rules
- Domain constraints
Example
Hierarchy:
- City → State → Country
Importance
Improves mining accuracy and understanding.
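A concept hierarchy such as City → State → Country can be sketched as lookup tables that let data be summarized at a higher level. The cities, the mappings, and the helper `generalize` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical concept hierarchy: City -> State -> Country.
CITY_TO_STATE = {"Pune": "Maharashtra", "Mumbai": "Maharashtra", "Chennai": "Tamil Nadu"}
STATE_TO_COUNTRY = {"Maharashtra": "India", "Tamil Nadu": "India"}

def generalize(city, level):
    """Climb the hierarchy from a city to the requested level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

# One entry per sale, recorded at city level.
sales_cities = ["Pune", "Mumbai", "Chennai", "Pune"]
sales_by_state = Counter(generalize(c, "state") for c in sales_cities)
```

Here four city-level sales roll up to three for Maharashtra and one for Tamil Nadu, so patterns can be mined at state level instead of city level.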
4. Interest Measures
Interest measures determine the usefulness of discovered patterns.
Common Measures
1. Support
Measures frequency of occurrence.
2. Confidence
Measures reliability of association rules.
Example
Rule:
Bread → Butter
High confidence means customers buying bread frequently buy butter.
Importance
Helps remove unimportant patterns.
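Both measures can be computed directly from a list of transactions. In the sketch below, support is the fraction of transactions containing all items of the rule, and confidence is that support divided by the support of the left-hand side. The baskets are made up, and the code assumes the left-hand side occurs at least once.

```python
# Hypothetical transactions; each set is one customer's basket.
baskets = [
    {"Bread", "Butter"},
    {"Bread", "Butter", "Milk"},
    {"Bread", "Jam"},
    {"Milk"},
    {"Bread", "Butter", "Jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

# Rule: Bread -> Butter
rule_support = support({"Bread", "Butter"}, baskets)
rule_confidence = confidence({"Bread"}, {"Butter"}, baskets)
```

Bread and butter appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 baskets containing bread also contain butter (confidence about 0.75).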
3.7 Integration of Data Mining System with Database or Data Warehouse System
Data Mining systems are often integrated with databases or Data Warehouses for efficient data access and analysis.
1. Coupling Schemes
Coupling refers to the degree of integration between Data Mining systems and databases.
1. No Coupling
The mining system works independently of the database and uses none of its functions.
Features
- Separate processing
- Low integration
Disadvantages
- Poor efficiency
- Slow processing
2. Loose Coupling
The mining system uses some database functions, such as retrieving the data to be mined.
Features
- Moderate integration
- Some database support
Advantages
- Better efficiency than no coupling
3. Semi-Tight Coupling
Some mining functions are implemented inside the database system.
Features
- Improved performance
- Better integration
4. Tight Coupling
Mining functions are fully integrated into the database or Data Warehouse.
Features
- High performance
- Efficient query processing
- Better scalability
Advantages
- Fast mining operations
- Reduced data transfer
- Better resource utilization
2. Benefits of Integration
Integrating mining systems with databases provides several advantages.
Advantages
1. Efficient Data Access
Mining algorithms directly access database data.
2. Improved Data Consistency
Integrated systems maintain centralized data control.
3. Better Security
Database security mechanisms protect mining data.
4. Reduced Data Redundancy
No need to copy data repeatedly.
5. Faster Processing
Database indexing and optimization improve mining speed.
3. Performance Improvement
Integration improves overall mining performance.
Reasons
- Reduced data transfer overhead
- Faster query execution
- Better memory utilization
- Parallel processing support
Example
Mining directly on a Data Warehouse is faster than exporting data to external tools.
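The difference in data transfer can be illustrated with Python's built-in `sqlite3` module, used here only as a small stand-in for a real Data Warehouse. The table and its rows are hypothetical. The sketch does not measure speed. It only shows that pushing the summary into the database returns fewer rows than exporting everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 200.0)],
)

# Export style: pull every row out, then aggregate in the external tool.
exported_rows = conn.execute("SELECT region, amount FROM sales").fetchall()
client_totals = {}
for region, amount in exported_rows:
    client_totals[region] = client_totals.get(region, 0.0) + amount

# Integrated style: the database aggregates and returns only the summary.
summary_rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
db_totals = dict(summary_rows)
conn.close()
```

Both approaches give the same totals, but the integrated query transfers one row per region instead of one row per sale. On large tables this gap is what makes integration worthwhile.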