1. Statistical Models Overview

Definition:
A statistical model describes how data are generated and how variables relate to each other.
It represents a mathematical relationship between dependent (response) and independent (predictor) variables.

General form:

where

  • ( Y ): dependent variable

  • ( X ): independent variable(s)

  • ( f(X) ): systematic component (model function)

  • ( \epsilon ): random error term

Types of Models:

  1. Descriptive – summarize data (mean, median, variance)

  2. Inferential – estimate population parameters, test hypotheses

  3. Predictive – forecast or predict outcomes using regression or ML

In R:
Common modeling functions include:

lm()      # Linear models (regression)
glm()     # Generalized linear models
aov()     # ANOVA
cor()     # Correlation
summary() # Model summaries

2. Correlation Analysis

Definition:

Correlation measures the strength and direction of a linear relationship between two quantitative variables.

  • ( r ) ranges from -1 to +1

  • ( r = 1 ): perfect positive correlation

  • ( r = -1 ): perfect negative correlation

  • ( r = 0 ): no linear correlation


R Syntax:

cor(x, y, method = "pearson")
cor.test(x, y, method = "pearson")  # gives test and confidence interval

Methods:

  • "pearson" – linear correlation (default)

  • "spearman" – rank correlation (non-parametric)

  • "kendall" – concordance-based correlation


Example:

Dataset:
Car speed vs stopping distance (from built-in dataset cars)

data(cars)
head(cars)
cor(cars$speed, cars$dist)
cor.test(cars$speed, cars$dist)

Output Interpretation:

  • Suppose r = 0.8069 → strong positive correlation.

  • As speed increases, stopping distance tends to increase.


Real-world case:

In finance, correlation helps assess how two stocks move together.
For example, correlation between NIFTY50 and SENSEX daily returns shows market linkage.

cor(nifty_returns, sensex_returns)

3. Regression Analysis

Definition:

Regression models quantify the relationship between a dependent variable and one or more independent variables.

Types:

  • Simple Linear Regression: one predictor

  • Multiple Linear Regression: multiple predictors

  • Nonlinear Regression: relationship not linear in parameters

  • Logistic Regression: categorical response (binary outcomes)


A. Simple Linear Regression

Model:

R Syntax:

model <- lm(y ~ x, data = dataset)
summary(model)
plot(x, y)
abline(model, col = "red")

Example:

data(cars)
model <- lm(dist ~ speed, data = cars)
summary(model)

Interpretation:

  • Estimate(Intercept) → expected dist when speed = 0

  • Estimate(speed) → expected change in distance for each 1-unit increase in speed

  • R-squared → how well model explains data variance

Visualization:

plot(cars$speed, cars$dist, main="Speed vs Distance", xlab="Speed", ylab="Distance")
abline(model, col="blue")

B. Multiple Linear Regression

Model:

R Syntax:

model2 <- lm(Sales ~ TV + Radio + Newspaper, data = marketing)
summary(model2)

Example Dataset:

marketing <- data.frame(
  TV = c(230, 44, 17, 151, 180),
  Radio = c(37, 39, 45, 41, 10),
  Newspaper = c(69, 45, 23, 15, 12),
  Sales = c(22, 10, 9, 18, 15)
)

Interpretation:

  • Coefficients show how each media type contributes to Sales.

  • p-value < 0.05 → significant predictor.

  • Adjusted R² accounts for multiple variables.


C. Logistic Regression

Used when the dependent variable is binary (0/1).
Example: Predicting if a patient has diabetes based on BMI and age.

model3 <- glm(diabetes ~ age + bmi, data = health, family = binomial)
summary(model3)

Real-world case:

In healthcare, regression models predict disease risk (heart disease vs cholesterol, age).
In marketing, regression predicts sales from ad expenditure.


4. Analysis of Variance (ANOVA)

Definition:

ANOVA tests if mean differences among three or more groups are statistically significant.

Hypotheses:

  • ( H_0 ): All group means are equal

  • ( H_1 ): At least one group mean differs


R Syntax:

result <- aov(y ~ factor, data = dataset)
summary(result)
TukeyHSD(result)  # pairwise comparison

Example:

Testing effect of fertilizer type on plant yield.

FertilizerYield
A20
A22
B27
B25
C30
C28
fertilizer <- data.frame(
  type = rep(c("A", "B", "C"), each = 2),
  yield = c(20, 22, 27, 25, 30, 28)
)
 
result <- aov(yield ~ type, data = fertilizer)
summary(result)
TukeyHSD(result)

Interpretation:

  • If p-value < 0.05, reject H₀ → mean yields differ by fertilizer type.

Real-world case:

  • In agriculture, ANOVA compares crop yields using different fertilizers.

  • In education, compares test scores of different teaching methods.


5. Creating Data for Complex Analysis

Manually Creating Data Frames

data <- data.frame(
  Name = c("A", "B", "C"),
  Score = c(80, 90, 75),
  Gender = c("M", "F", "M")
)

Generating Random Data

set.seed(123)
df <- data.frame(
  age = sample(18:60, 100, replace = TRUE),
  income = rnorm(100, mean = 50000, sd = 10000),
  gender = sample(c("Male", "Female"), 100, replace = TRUE)
)
head(df)

Combining and Merging

merged <- merge(df1, df2, by = "ID")

Importing Data

read.csv("data.csv")
read.table("data.txt", header=TRUE)

6. Summarizing Data

Descriptive Statistics in R:

summary(df)
mean(df$age)
median(df$income)
sd(df$income)
var(df$income)
range(df$income)

Grouped Summaries:

aggregate(income ~ gender, data=df, FUN=mean)
tapply(df$income, df$gender, mean)

Visual Summaries:

boxplot(df$income ~ df$gender)
hist(df$age)

7. Case Study: Predicting House Prices

Objective: Predict house prices using size and number of bedrooms.

house <- data.frame(
  size = c(1400, 1600, 1700, 1875, 1100),
  bedrooms = c(3, 3, 3, 4, 2),
  price = c(245000, 312000, 279000, 308000, 199000)
)
 
model <- lm(price ~ size + bedrooms, data = house)
summary(model)

Interpretation:

  • size has a strong positive coefficient → larger houses cost more.

  • Model helps predict prices of new houses.


8. Summary Table

ConceptFunctionKey OutputReal-world Application
Correlationcor(), cor.test()r, p-valueRelation between variables
Linear Regressionlm()Coefficients, R²Predict continuous outcomes
Logistic Regressionglm(..., family=binomial)Odds ratiosPredict binary outcomes
ANOVAaov()F-statistic, p-valueCompare group means
Summariessummary(), aggregate()Mean, median, sdDescriptive analysis