1. Introduction to Data Warehousing

Data Warehousing is an important concept in modern database systems and business intelligence. Organizations generate huge amounts of data every day from transactions, customers, sales, websites, banking systems, and many other sources. Managing and analyzing this data efficiently is necessary for business growth and decision-making.

A Data Warehouse helps organizations collect, integrate, store, and analyze large volumes of data from multiple sources in a centralized system. It supports reporting, business analysis, forecasting, and strategic planning.


1.1 Data Warehousing

1. Definition of Data Warehousing

Data Warehousing is the process of collecting, integrating, organizing, and storing large amounts of data from multiple heterogeneous sources into a centralized repository called a Data Warehouse for analysis and decision-making purposes.

A Data Warehouse stores:

  • Current data

  • Historical data

  • Summarized data

in a structured form for analytical processing.

Unlike operational databases, Data Warehouses are designed mainly for:

  • Data analysis

  • Reporting

  • Forecasting

  • Business intelligence

  • Strategic decision-making

The concept of Data Warehousing was introduced by Bill Inmon.

Standard Definition by Bill Inmon

“A subject-oriented, integrated, time-variant, and non-volatile collection of data used to support management decision making.”

This definition highlights the four major characteristics of a Data Warehouse.


2. Characteristics of Data Warehouse

1. Subject-Oriented

A Data Warehouse is organized around major business subjects instead of daily business operations.

Explanation

Operational databases focus on processes such as:

  • Billing

  • Banking transactions

  • Order processing

A Data Warehouse focuses on business subjects such as:

  • Sales

  • Customers

  • Products

  • Employees

  • Finance

This organization helps management analyze important business areas effectively.

Example

A retail company may analyze:

  • Product sales trends

  • Customer purchasing behavior

  • Regional revenue growth

instead of handling individual customer transactions.

Advantages

  • Simplifies data analysis

  • Improves decision-making

  • Provides better business understanding


2. Integrated

A Data Warehouse integrates data collected from multiple heterogeneous sources into a consistent format.

Explanation

Different systems may use:

  • Different naming conventions

  • Different data formats

  • Different coding methods

  • Different measurement units

Before storing data in the warehouse, the data is:

  • Cleaned

  • Transformed

  • Standardized

  • Integrated

This ensures consistency throughout the system.

Example

Source SystemGender Representation
System AM / F
System BMale / Female
System C1 / 0

In the Data Warehouse, all values may be standardized into:

  • Male

  • Female

Advantages

  • Eliminates inconsistencies

  • Improves data quality

  • Increases reliability of reports


3. Time-Variant

A Data Warehouse stores historical data over long periods of time.

Explanation

Each record in the warehouse is associated with a time dimension such as:

  • Day

  • Month

  • Quarter

  • Year

Operational databases usually store current data, while Data Warehouses maintain historical records for long-term analysis.

Example

A company may store:

  • Sales data of the last 10 years

  • Employee performance history

  • Monthly revenue trends

Advantages

  • Supports trend analysis

  • Helps forecasting

  • Enables comparison of past and present performance


4. Non-Volatile

Once data enters the Data Warehouse, it is not frequently modified or deleted.

Explanation

The warehouse mainly performs:

  • Data loading

  • Data retrieval

  • Data analysis

Frequent operations such as:

  • Insert

  • Update

  • Delete

are minimal compared to operational systems.

Historical data remains stable for future analysis.

Example

Past sales records remain permanently stored for reporting and comparison.

Advantages

  • Provides stable information

  • Improves consistency

  • Prevents accidental data modification


3. Purpose of Data Warehouse

1. Decision Support

Data Warehouses help managers and executives make strategic business decisions.

Explanation

The warehouse provides:

  • Summarized reports

  • Trend analysis

  • Historical comparisons

  • Business insights

Example

Management can identify:

  • Best-selling products

  • Most profitable regions

  • Customer purchasing patterns


2. Data Analysis

Data Warehouses allow organizations to analyze large volumes of data efficiently.

Explanation

Complex analytical queries can be executed quickly because data is:

  • Organized

  • Integrated

  • Optimized for analysis

Example

  • Year-wise sales analysis

  • Product performance analysis

  • Regional market analysis


3. Historical Record Maintenance

A Data Warehouse stores long-term historical information.

Explanation

Historical data helps organizations:

  • Study business growth

  • Identify trends

  • Predict future outcomes

Example

  • Revenue growth over 15 years

  • Customer growth analysis

  • Seasonal sales trends


4. Business Intelligence Support

Data Warehouses form the foundation of Business Intelligence systems.

Explanation

They support:

  • Dashboards

  • Reporting tools

  • Data mining

  • Forecasting systems

  • Analytical processing

Example

A dashboard displaying:

  • Daily sales

  • Monthly revenue

  • Customer growth statistics


5. Improved Data Quality

Data is cleaned and standardized before entering the warehouse.

Explanation

Data cleaning removes:

  • Duplicate records

  • Missing values

  • Incorrect data

  • Inconsistent formats

Advantages

  • Improves accuracy

  • Increases reliability

  • Enhances reporting quality


1.2 Difference Between Operational Database System and Data Warehouse

Operational Database Systems and Data Warehouses serve different purposes in an organization.

Operational systems focus on daily business transactions, while Data Warehouses focus on analysis and decision support.


1. Operational Database System

Operational Database Systems are databases used to manage day-to-day business operations and transactions.

These systems are also called:

  • OLTP Systems

  • Online Transaction Processing Systems

Features of Operational Database Systems

1. Real-Time Transaction Processing

Transactions are processed immediately.

Examples

  • ATM withdrawal

  • Online shopping payment

  • Railway ticket booking


2. Frequent Insert, Update, Delete Operations

Operational systems continuously modify data.

Example

  • Updating account balances

  • Adding customer orders

  • Cancelling reservations


3. Stores Current Data

Mainly maintains active and current information.

Historical data storage is limited.


4. Optimized for Fast Processing

Designed for:

  • High transaction speed

  • Fast response time

  • Multiple concurrent users


Examples of Operational Database Systems

  • Banking systems

  • Hospital management systems

  • Railway reservation systems

  • E-commerce transaction systems


2. Data Warehouse

A Data Warehouse is used for analytical processing and decision support.

These systems are also called:

  • OLAP Systems

  • Online Analytical Processing Systems

Features of Data Warehouse

1. Stores Historical Data

Maintains long-term historical records.


2. Supports Complex Queries

Used for:

  • Reporting

  • Trend analysis

  • Aggregation

  • Forecasting


3. Read-Oriented System

Most operations involve:

  • Reading data

  • Retrieving reports

  • Performing analysis


4. Supports Strategic Decisions

Used by:

  • Managers

  • Executives

  • Analysts

  • Researchers

for planning and decision-making.


Examples of Data Warehouse Applications

  • Business intelligence dashboards

  • Sales analysis systems

  • Customer behavior analysis

  • Financial forecasting systems


3. Comparison between OLTP and OLAP Systems

OLTP: Online Transaction Processing:

  • OLTP refers to a class of systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing.
  • OLTP is used to refer to processing in which the system responds immediately to user requests.
  • They cover most of the day to day operations of an organization such as purchasing, inventory, manufacturing, banking, payroll, registration and accounting.
  • An Automatic Teller Machine (ATM) for a bank, Railway Reservation System are examples of a commercial transaction processing application.

OLAP: Online Analytical Processing:

  • Online Analytical Processing, or OLAP, is an approach to quickly provide answers to analytical queries that are multi-dimensional in nature.
  • OLAP organizes and presents data in various formats in order to accommodate the diverse needs of the different users.
  • It serves users or knowledge workers in the role of data analysis and decision making.
  • The typical applications of OLAP are in business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas.
BasisOLTPOLAP
Full FormOnline Transaction ProcessingOnline Analytical Processing
PurposeDaily business operationsData analysis and decision support
UsersClerks, operatorsManagers, analysts, executives
Data TypeCurrent dataHistorical data
Query TypeSimple transactional queriesComplex analytical queries
OperationsInsert, Update, DeleteRead and Analyze
Response TimeVery fastComparatively slower
Database DesignHighly normalizedDenormalized
FocusOperational efficiencyAnalytical efficiency
ExampleBanking systemBusiness reporting system

1.3 Data Warehouse Users

Different categories of users interact with a Data Warehouse according to their responsibilities and business requirements.

1. Management Users

Management users use summarized reports and business analysis for decision-making.

Responsibilities

  • Strategic planning

  • Monitoring business performance

  • Resource management

  • Reviewing organizational growth

Example

A sales manager analyzing:

  • Monthly sales reports

  • Product performance

  • Regional sales growth

Importance

Management users use warehouse data to improve organizational planning and efficiency.


2. Analysts

Analysts perform detailed examination of data to identify patterns and trends.

Responsibilities

  • Data analysis

  • Report generation

  • Market analysis

  • Trend identification

Example

A marketing analyst studying:

  • Customer purchasing behavior

  • Product demand patterns

  • Seasonal sales trends

Importance

Analysts convert raw data into meaningful business insights.


3. Executives

Executives use summarized dashboards and high-level reports.

Responsibilities

  • Business strategy formulation

  • Investment planning

  • Organizational monitoring

  • Decision making

Example

A CEO reviewing:

  • Annual profit growth

  • Company performance

  • Market expansion statistics

Importance

Executives use warehouse information for long-term organizational planning.


4. Data Scientists and Researchers

These users perform advanced analytical operations on warehouse data.

Responsibilities

  • Data mining

  • Machine learning

  • Predictive analysis

  • Statistical modeling

Example

Predicting customer churn using historical transaction data.

Importance

They help organizations discover hidden patterns and future predictions.


1.4 Benefits of Data Warehousing

1. Improved Decision Making

Data Warehouses provide accurate and integrated information for better decisions.

Explanation

Managers can make effective decisions because the warehouse provides:

  • Reliable data

  • Historical records

  • Trend analysis

  • Business summaries

Example

Identifying:

  • Best-selling products

  • High-profit regions

  • Weak business areas


2. Faster Query Response

Data Warehouses are optimized for analytical queries.

Explanation

Complex reports are generated quickly because:

  • Data is indexed

  • Aggregations are precomputed

  • Schema is optimized for analysis

Example

Generating:

  • Annual sales reports

  • Revenue comparison reports

  • Product performance reports

within seconds.


3. Historical Data Analysis

Organizations can analyze long-term business performance.

Explanation

Historical data helps in:

  • Forecasting

  • Trend analysis

  • Market prediction

Example

Comparing current sales with sales of previous five years.


4. Data Consistency and Integration

Data from multiple systems is standardized before storage.

Benefits

  • Reduced redundancy

  • Improved accuracy

  • Unified data view

  • Better reliability

Example

Customer data from different branches stored in a common format.


1.5 Metadata

1. Definition of Metadata

Metadata means “data about data”.

It provides information describing:

  • Structure of data

  • Source of data

  • Meaning of data

  • Relationships among data

Metadata helps users understand and manage warehouse data efficiently.

Example

For a sales column, metadata may contain:

  • Data type

  • Source table

  • Last update date

  • Column description


2. Role of Metadata in Data Warehouse

Metadata acts as a roadmap or directory of the Data Warehouse.

Functions of Metadata

  • Describes warehouse structure

  • Helps locate data

  • Supports ETL processes

  • Assists query optimization

  • Maintains consistency


3. Classification of Metadata

1. Technical Metadata

Contains technical information about data.

Includes

  • Table names

  • Column names

  • Data types

  • Indexes

  • Schema definitions

Example

Information about database tables and relationships.


2. Business Metadata

Contains business-related definitions understandable by business users.

Includes

  • Business rules

  • Definitions

  • KPIs

  • Calculations

Example

Definition of “Net Profit”.


3. Operational Metadata

Contains information about warehouse operations.

Includes

  • Data loading time

  • Backup schedules

  • Refresh schedules

  • Error logs

Example

Status of ETL execution.


4. Importance of Metadata

1. Data Understanding

Helps users understand the meaning and structure of data.


2. Data Management

Assists administrators in managing warehouse systems efficiently.


3. Query Optimization

Helps database systems execute queries efficiently.


4. Data Integration Support

Supports integration of data from multiple sources consistently.


1.6 Data Marts

1. Definition of Data Mart

A Data Mart is a smaller and specialized subset of a Data Warehouse designed for a specific department or business function.

It contains focused data relevant to a particular group of users.

Examples

  • Sales Data Mart

  • HR Data Mart

  • Finance Data Mart


2. Types of Data Marts

1. Dependent Data Mart

Created from an existing Data Warehouse.

Features

  • Uses centralized warehouse data

  • Maintains consistency

  • Easier management


2. Independent Data Mart

Created directly from operational systems without a central warehouse.

Features

  • Faster implementation

  • Department-specific control

  • Lower initial cost


3. Hybrid Data Mart

Combines data from:

  • Data Warehouse

  • Operational systems

  • External sources

Features

  • Flexible

  • Combines multiple data sources

  • Supports broader analysis


3. Reasons for Creating Data Marts

1. Department-Specific Analysis

Provides customized information for departments.

Example

The HR department accesses only employee-related information.


2. Faster Access to Data

Smaller data size improves query speed.


3. Reduced Implementation Cost

Cheaper than implementing a full enterprise warehouse.


4. Improved Performance

Less data volume improves efficiency and response time.


4. Building Data Marts

1. Top-Down Approach

In this approach, the Enterprise Data Warehouse is created first.

Process

  1. Create centralized Data Warehouse

  2. Extract department-specific data

  3. Build individual Data Marts

Advantages

  • Better integration

  • High consistency

  • Centralized control

Disadvantages

  • Expensive

  • Time-consuming


2. Bottom-Up Approach

In this approach, Data Marts are created first and later integrated into a warehouse.

Process

  1. Create department-level Data Marts

  2. Combine Data Marts later

  3. Build Enterprise Data Warehouse

Advantages

  • Faster implementation

  • Lower initial cost

  • Easier development

Disadvantages

  • Risk of inconsistency

  • Difficult integration


1.7 Data Warehouse Architecture

Data Warehouse Architecture defines the structure and organization of the warehouse system.


1. Two-Tier Architecture

A two-tier architecture directly connects clients with the Data Warehouse server.

Components

1. Client Layer

Used by end users for:

  • Querying

  • Reporting

  • Analysis

2. Data Warehouse Server Layer

Responsible for:

  • Storing warehouse data

  • Processing queries

  • Managing warehouse operations

Advantages

  • Simple architecture

  • Faster communication

Disadvantages

  • Limited scalability

  • Heavy server load


2. Three-Tier Architecture

Three-tier architecture is the most commonly used warehouse architecture.

other_sources

1. Bottom Tier

Contains:

  • Data sources

  • ETL tools

  • Data Warehouse database

Functions

  • Data extraction

  • Data cleaning

  • Data transformation

  • Data loading


2. Middle Tier

Contains OLAP servers.

Functions

  • Query processing

  • Data aggregation

  • Multidimensional analysis


3. Top Tier

Contains front-end tools for end users.

Examples

  • Dashboards

  • Reporting tools

  • Data mining applications


1.8 Data Warehouse Schema

A schema defines the logical structure of a Data Warehouse.


1. Star Schema

A Star Schema consists of:

  • One central fact table

  • Multiple dimension tables

The structure resembles a star.


1. Fact Table

Contains:

  • Numerical measures

  • Foreign keys

Examples

  • Sales amount

  • Quantity sold

  • Revenue


2. Dimension Tables

Contain descriptive information about business entities.

Examples

  • Product details

  • Customer information

  • Time information


3. Characteristics of Star Schema

  • Denormalized structure

  • Simple design

  • Fewer joins


4. Advantages of Star Schema

  • Faster query performance

  • Easy to understand

  • Efficient reporting


2. Snowflake Schema

A Snowflake Schema is an extension of Star Schema where dimension tables are normalized.


1. Normalized Dimension Tables

Dimension tables are divided into multiple related tables.

Example

Location dimension divided into:

  • City

  • State

  • Country


2. Features of Snowflake Schema

  • Reduced redundancy

  • Better storage efficiency

  • More complex joins


3. Advantages of Snowflake Schema

  • Reduced data duplication

  • Better data integrity


4. Disadvantages of Snowflake Schema

  • Slower query performance

  • Complex structure


3. Fact Constellation Schema

Fact Constellation Schema is also called Galaxy Schema.

It contains:

  • Multiple fact tables

  • Shared dimension tables

1. Multiple Fact Tables

Examples

  • Sales fact table

  • Shipping fact table


2. Shared Dimension Tables

The same dimension tables are shared among multiple fact tables.

Examples

  • Product dimension

  • Time dimension


1.9 OLAP (Online Analytical Processing)

OLAP is a technology used for fast multidimensional analysis of warehouse data.


1. Need for OLAP

1. Multidimensional Analysis

Data can be analyzed from multiple perspectives.

Example

Sales analysis based on:

  • Product

  • Region

  • Time


2. Decision Support

Helps management make strategic business decisions.

Example

Identifying profitable products and markets.


3. Fast Analytical Queries

OLAP systems are optimized for:

  • Aggregation

  • Summarization

  • Complex calculations


2. OLAP Operations

Multi-dimensional data cube

1. Roll-Up

Aggregates data from lower levels to higher levels.

Example

City sales → State sales → Country sales


2. Drill-Down

Moves from summarized data to detailed data.

Example

Yearly sales → Monthly sales → Daily sales


3. Slice

Selects a single dimension value.

Example

Sales data for the year 2025 only.


4. Dice

Selects multiple dimensions and values.

Example

Sales for:

  • Product A

  • West Region

  • Year 2025


5. Pivot

Rotates data axes for different data views.

Example

Converting rows into columns in reports.


3. OLAP Models

1. ROLAP (Relational OLAP)

Stores data in relational databases.

Features

  • Highly scalable

  • Handles large volumes of data

Disadvantages

  • Slower query performance

2. MOLAP (Multidimensional OLAP)

Stores data in multidimensional cubes.

Features

  • Very fast query response

  • Precomputed aggregations

Disadvantages

  • Higher storage requirement

3. HOLAP (Hybrid OLAP)

HOLAP combines features of ROLAP and MOLAP.

Features

  • Balance between speed and scalability

  • Summary data stored in cubes

  • Detailed data stored in relational databases

Advantages

  • Improved performance

  • Efficient storage usage