Regression Problem
Regression is a statistical process for estimating the relationship between a dependent variable (outcome) and one or more independent variables (features).
- Input X : Attribute variables or features (typically numerical values).
- Output Y : Response variable that we aim to predict.
Goal : estimate a function $f(X, \beta)$ such that $Y \approx f(X, \beta)$.
It is called linear regression because this relation is assumed to be linear with an additive error term ϵ representing statistical noise.
For a single feature x, the regression model is defined as :
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
- $Y_i$ : observed response for the i-th training example
- $x_i$ : input feature for the i-th training example
- $\beta_0$ : intercept (bias)
- $\beta_1$ : slope (weight)
- $\epsilon_i$ : residual error
This represents the actual training dataset values.
The fitted value, or prediction, is :
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$$
Ordinary Least Squares (OLS)
It is the method that estimates the unknown parameters (β) by minimizing the sum of the squares of the differences between the observed dependent variable and the values predicted by the linear function. Squared error penalizes large errors more than smaller ones.
Derivation of Sum of Squared Errors (SSE)
The residual sum of squares (SSE) cost function $L$ is defined as :
$$L(\beta_0, \beta_1) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$
Goal : $\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
To find the optimal β0 and β1, we take the partial derivative with respect to each parameter and set it to 0.
1. Derivative w.r.t β0 :
To minimize, the derivative must be equal to 0 :
$$\frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0$$
Since $\beta_0$ and $\beta_1$ are constants :
$$\sum_{i=1}^{n} y_i - n\beta_0 - \beta_1 \sum_{i=1}^{n} x_i = 0$$
Dividing by $n$ :
$$\frac{1}{n}\sum_{i=1}^{n} y_i - \frac{1}{n} n \beta_0 - \frac{1}{n}\beta_1 \sum_{i=1}^{n} x_i = 0$$
$$\bar{y} - \beta_0 - \beta_1 \bar{x} = 0$$
$$\beta_0 = \bar{y} - \beta_1 \bar{x}$$
2. Derivative w.r.t β1 :
$$\frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0$$
Substitute $\beta_0$ with $\bar{y} - \beta_1 \bar{x}$ :
$$\sum_{i=1}^{n} x_i \left(y_i - (\bar{y} - \beta_1 \bar{x}) - \beta_1 x_i\right) = 0$$
$$\sum_{i=1}^{n} x_i \left(y_i - \bar{y} + \beta_1 \bar{x} - \beta_1 x_i\right) = 0$$
$$\sum_{i=1}^{n} x_i \left((y_i - \bar{y}) - \beta_1 (x_i - \bar{x})\right) = 0$$
$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) - \beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0$$
Rearranging :
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}$$
Identity : $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, and the same holds for $\sum_{i=1}^{n} (y_i - \bar{y})$.
Thus, the numerator becomes :
$$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} x_i (y_i - \bar{y}) - \bar{x}\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})}_{=\,0} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
And the denominator becomes :
$$\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2$$
And finally the slope $\hat{\beta}_1$ becomes :
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
And in terms of covariance and variance :
$$\hat{\beta}_1 = \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)}$$
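These closed-form estimates translate directly into code. A minimal NumPy sketch (the function name `fit_simple_ols` and the toy data are illustrative, not from the original derivation):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form OLS estimates for simple linear regression."""
    x_bar, y_bar = x.mean(), y.mean()
    # beta1_hat = Cov(X, Y) / Var(X) = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta0_hat = y_bar - beta1_hat * x_bar
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta0, beta1 = fit_simple_ols(x, y)
print(beta0, beta1)   # slope close to 2, intercept close to 0 for this toy data
```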
Sum of Squares Decomposition and R²
To evaluate the goodness of fit, we decompose the total variability of the response variable.
- SST (Total Sum of Squares) : Measures total variance in observed Y :
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
- SSR (Sum of Squares Regression) : Measures variance explained by the model :
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
- SSE (Sum of Squares Error) : Measures unexplained variance (residuals) :
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
These are related as :
$$SST = SSR + SSE$$
Coefficient of Determination (R²)
$R^2$ represents the proportion of the variance in the dependent variable that is explained by the independent variable(s).
$$R^2 = \frac{SSR}{SST}$$
Also,
$$1 = \frac{SST}{SST} = \frac{SSR + SSE}{SST} = R^2 + \frac{SSE}{SST}$$
- The best model will have $R^2 = 1$.
Correlation Coefficient (r)
For simple linear regression, $R^2$ is the square of the Pearson correlation coefficient $r$ :
$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \implies R^2 = r^2$$
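A small sketch of the decomposition and of the $R^2 = r^2$ identity (the toy data is illustrative; `np.polyfit` is used here just to obtain a least-squares line):

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = SSR / SST (equivalently 1 - SSE / SST for OLS with an intercept)."""
    y_bar = y.mean()
    sst = np.sum((y - y_bar) ** 2)       # total sum of squares
    ssr = np.sum((y_hat - y_bar) ** 2)   # explained (regression) sum of squares
    sse = np.sum((y - y_hat) ** 2)       # residual sum of squares
    assert np.isclose(sst, ssr + sse)    # SST = SSR + SSE for a fitted OLS model
    return ssr / sst

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta1, beta0 = np.polyfit(x, y, 1)       # least-squares slope and intercept
y_hat = beta0 + beta1 * x
print(r_squared(y, y_hat))
print(np.corrcoef(x, y)[0, 1] ** 2)      # equals R^2 for simple linear regression
```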
Types of Errors
Different metrics are used for different purposes (a short sketch computing each one follows this list) :
1. Mean Squared Error (MSE) :
$$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- Differentiable and useful for optimization.
- Heavily penalizes large outliers (squaring term).
2. Root Mean Squared Error (RMSE) :
$$RMSE = \sqrt{MSE}$$
- Same unit as the target variable Y, making it interpretable.
3. Mean Absolute Error (MAE) :
$$MAE = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
- More robust to outliers than MSE, but not differentiable at 0.
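A compact sketch of the three metrics (the data values are arbitrary, just for illustration):

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)        # mean squared error

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))           # same units as the target y

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))       # less sensitive to large outliers

y = np.array([2.0, 4.0, 6.0])
y_hat = np.array([2.5, 3.5, 7.0])
print(mse(y, y_hat), rmse(y, y_hat), mae(y, y_hat))   # 0.5, ~0.707, ~0.667
```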
Multiple Linear Regression
When multiple features $x_1, x_2, \ldots, x_n$ are present, the model becomes :
$$\hat{y}^{(i)} = w_0 + w_1 x_1^{(i)} + \dots + w_n x_n^{(i)}$$
Vector-Matrix representation
Add a bias feature $x_0 = 1$ to every example so that the intercept $w_0$ can be folded into the weight vector :
- Input Matrix $X$ : dimensions $N \times (n+1)$
$$X = \begin{bmatrix} 1 & x_1^{(1)} & \dots & x_n^{(1)} \\ 1 & x_1^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(N)} & \dots & x_n^{(N)} \end{bmatrix}$$
- Weight Vector $w = [w_0, w_1, \ldots, w_n]^T$ : dimensions $(n+1) \times 1$
- Target Vector $y$ : dimensions $N \times 1$
The prediction can be written using the inner product for a single example or matrix multiplication for the whole dataset:
$$\underset{N \times 1}{\hat{Y}} = \underset{N \times (n+1)}{X} \cdot \underset{(n+1) \times 1}{w}$$
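A small NumPy illustration of this construction (the feature values and weights are made up for the example):

```python
import numpy as np

# N = 3 examples with n = 2 raw features each
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5]])
w = np.array([0.5, 2.0, -1.0])                        # [w0, w1, w2]

# Prepend a column of ones (x0 = 1) so that w0 acts as the intercept
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape N x (n+1)

y_hat = X @ w                                         # N predictions at once
print(X.shape, y_hat)
```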
To find the coefficients $w$, minimize the sum of squared errors (SSE).
Define the cost function
The cost function quantifies the error between the model's predicted outputs and the actual target values :
$$w^* = \arg\min_{w} J(w) = \arg\min_{w} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
- $w^*$ : optimal value of the parameters.
SSE $= \lVert y - \hat{y} \rVert^2$, the squared magnitude of the vector $y - \hat{y}$. Since $\lVert z \rVert^2 = z_1^2 + \dots + z_p^2 = z^T z$ :
$$SSE = (y - \hat{y})^T (y - \hat{y})$$
And since $\hat{y} = Xw$ :
$$SSE = (y - Xw)^T (y - Xw) = J(w)$$
Thus,
$$J(w) = \left(y^T - (Xw)^T\right)(y - Xw)$$
$$J(w) = \left(y^T - w^T X^T\right)(y - Xw)$$
Expanding this gives :
$$J(w) = y^T y - y^T X w - w^T X^T y + w^T X^T X w$$
The term $y^T X w$ is a scalar (dimension $1 \times 1$), and a scalar equals its own transpose, so $y^T X w = w^T X^T y$. Thus,
$$J(w) = y^T y - 2 w^T X^T y + w^T X^T X w$$
Computing the gradient
To find the optimal w that minimizes error, we calculate the gradient of J(w) with respect to w and set it to zero.
$$\frac{\partial J(w)}{\partial w} = \frac{\partial}{\partial w}\left(y^T y - 2 w^T X^T y + w^T X^T X w\right)$$
Taking the terms one at a time :
$$\frac{\partial}{\partial w}\left(y^T y\right) = 0$$
- Constant with respect to $w$.
$$\frac{\partial}{\partial w}\left(-2 w^T X^T y\right) = -2 X^T y$$
- $w$ is our variable vector of dimension $(n+1) \times 1$.
- $a = -2 X^T y$ is a constant vector of the same dimension because it doesn't contain $w$.
- The term $-2 w^T X^T y$ can be rewritten as the dot product $w^T a$.
- Rule: the derivative of a dot product with respect to one of the vectors is just the other vector.
$$\frac{\partial}{\partial w}\left(w^T a\right) = a$$
$$\frac{\partial}{\partial w}\left(w^T X^T X w\right) = 2 X^T X w$$
- $w$ is a vector.
- $A = X^T X$ is a square, symmetric matrix.
- The expression $w^T A w$ is called a quadratic form.
- Rule: the derivative of a quadratic form $x^T A x$ is $(A + A^T)x$, which reduces to $2Ax$ when $A$ is symmetric.
$$\frac{\partial}{\partial x}\left(x^T A x\right) = 2 A x$$
Another rule used here :
$$\frac{\partial}{\partial w}\left(Xw\right) = X^T$$
Thus, the final gradient becomes :
$$\frac{\partial J(w)}{\partial w} = -2 X^T y + 2 X^T X w = -2 X^T (y - Xw)$$
Solving for w
Equating the gradient to 0 :
$$\frac{\partial J(w)}{\partial w} = -2 X^T (y - Xw) = 0$$
$$X^T (y - Xw) = 0$$
$$X^T X w = X^T y$$
Thus, the closed form or normal equation is :
$$w = (X^T X)^{-1} X^T y$$
Limitations
- It requires $X^T X$ to be invertible, i.e., the features must not be perfectly correlated.
- For large datasets, the computation required to invert $X^T X$ becomes too expensive.
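A sketch of the normal equation on synthetic data. It uses `np.linalg.solve` on $X^T X w = X^T y$ rather than forming the inverse explicitly, which is the usual numerically safer choice; all names and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 2))                 # 50 examples, 2 features
true_w = np.array([1.0, 2.0, -3.0])              # [w0, w1, w2] used to generate y
X = np.hstack([np.ones((50, 1)), X_raw])         # add the bias column
y = X @ true_w + 0.01 * rng.normal(size=50)      # targets with a little noise

# Normal equation: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                         # recovers values close to true_w
```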
Gradient Descent
When the normal equation becomes too computationally expensive, we use Gradient Descent : an iterative optimization algorithm.

$$J(w) = \frac{1}{2n}\sum_{i=1}^{n} \left(y^{(i)} - w^T x^{(i)}\right)^2$$
- The 1/2 factor makes the derivative cleaner.
Update Rule
Update the weights by moving in the direction of the negative gradient.
$$w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j}$$
The gradient for a specific weight $w_j$ is :
$$\frac{\partial J(w)}{\partial w_j} = \frac{1}{n}\sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right) x_j^{(i)}$$
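A batch-gradient-descent sketch for this cost and update rule (the learning rate, epoch count, and toy data are illustrative choices):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, epochs=10_000):
    """Minimize J(w) = (1/2n) * sum((y - Xw)^2) with full-batch updates."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        error = X @ w - y            # (w^T x^(i) - y^(i)) for every example
        grad = X.T @ error / n       # dJ/dw_j for every j, as one vector
        w -= alpha * grad            # step in the direction of the negative gradient
    return w

# Toy dataset y = 2x, with a bias column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([2.0, 4.0])
print(batch_gradient_descent(X, y))  # approaches [0, 2]
```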
Types of Gradient Descent
| Type | Description | Pros | Cons |
|---|---|---|---|
| Batch GD | Uses all N training examples for every update. | Stable convergence. | Slow for large datasets; memory intensive |
| Stochastic GD (SGD) | Uses 1 random training example per update. | Faster iterations; escapes local minima | High variance updates; noisy convergence |
| Mini-Batch GD | Uses a small batch (b) of examples per update. | Balances stability and speed. | Hyperparameter b to tune |
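For contrast with the full-batch sketch above, a mini-batch variant might look like this (the batch size, learning rate, and shuffling scheme are illustrative choices):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.05, epochs=500, batch_size=10):
    """Each update uses only a small random batch of examples."""
    n, d = X.shape
    w = np.zeros(d)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)  # gradient on the mini-batch only
            w -= alpha * grad
    return w

X = np.hstack([np.ones((100, 1)), np.linspace(0, 1, 100).reshape(-1, 1)])
y = X @ np.array([1.0, 3.0])                        # noiseless line y = 1 + 3x
print(minibatch_gradient_descent(X, y))             # approaches [1, 3]
```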
Learning Rate (α)

It is a critical hyperparameter that controls the step size taken towards the minimum of the loss function during optimization (a small sketch after the list below illustrates these regimes).
- α too small: Convergence is guaranteed but very slow; requires many updates.
- α too large: The steps may overshoot the minimum, causing the algorithm to oscillate or diverge (cost increases).
- Optimal α: Smoothly reaches the minima.
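A tiny demonstration of these regimes, assuming a one-dimensional quadratic loss $J(w) = w^2$ (so the gradient is $2w$); the specific α values are illustrative:

```python
# Gradient descent on J(w) = w^2, gradient dJ/dw = 2w, starting from w = 1.
def descend(alpha, steps=20, w=1.0):
    for _ in range(steps):
        w -= alpha * 2 * w        # each step multiplies w by (1 - 2 * alpha)
    return w

print(descend(alpha=0.01))   # too small: still noticeably far from the minimum at 0
print(descend(alpha=0.4))    # reasonable: effectively reaches 0
print(descend(alpha=1.1))    # too large: |1 - 2 * alpha| > 1, so w oscillates and diverges
```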


Worked out examples of Gradient Descent
Batch Gradient Descent
Consider the dataset $(x^{(1)}, y^{(1)}) = (1, 2)$ and $(x^{(2)}, y^{(2)}) = (2, 4)$, with initial weights $w_0 = w_1 = 0$ and learning rate $\alpha = 0.1$. Batch GD calculates the gradient using the sum over all points ($N = 2$).
$$\hat{y}^{(1)} = 0 + 0(1) = 0$$
$$\hat{y}^{(2)} = 0 + 0(2) = 0$$
The errors are $\hat{y}^{(1)} - y^{(1)} = -2$ and $\hat{y}^{(2)} - y^{(2)} = -4$.
Gradients :
$$\frac{\partial J(w)}{\partial w_0} = \frac{1}{n}\sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right)$$
- Because $x_j^{(i)} = 1$ for $j = 0$. Thus :
$$\frac{\partial J}{\partial w_0} = \frac{1}{2}\sum_{i=1}^{2} \text{Error}^{(i)} \cdot x_0^{(i)} = \frac{1}{2}\big(-2(1) - 4(1)\big) = -3$$
$$\frac{\partial J(w)}{\partial w_1} = \frac{1}{n}\sum_{i=1}^{n} \left(w^T x^{(i)} - y^{(i)}\right) x_1^{(i)}$$
$$\frac{\partial J}{\partial w_1} = \frac{1}{2}\sum \left(\text{Error} \times x\right) = \frac{1}{2}\big(-2(1) - 4(2)\big) = -5$$
Update :
$$w_0 := 0 - 0.1(-3) = 0.3$$
$$w_1 := 0 - 0.1(-5) = 0.5$$
Thus, the model after 1 epoch (one complete pass through the dataset) is :
$$y = 0.3 + 0.5x$$
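The same single batch update, checked in code (dataset, starting weights, and α as stated above):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0]])   # bias column plus x = 1, 2
y = np.array([2.0, 4.0])
w = np.zeros(2)                          # w0 = w1 = 0
alpha = 0.1

error = X @ w - y                        # [-2, -4]
grad = X.T @ error / len(y)              # [-3, -5]
w -= alpha * grad
print(w)                                 # [0.3, 0.5] after one epoch of batch GD
```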
Stochastic Gradient Descent (SGD)
It updates after each example.
$$w_j := w_j - \alpha \underbrace{\left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}}_{\text{gradient for a single example}}$$
Iteration 1 :
- Prediction : $\hat{y}^{(1)} = w_0(1) + w_1 x^{(1)} = 0(1) + 0(1) = 0$
- Error : $\hat{y}^{(1)} - y^{(1)} = 0 - 2 = -2$
- Gradient for $w_0$, where $x_0 = 1$ :
$$\frac{\partial J}{\partial w_0} = \text{Error} \times 1 = -2$$
- Gradient for $w_1$, where $x^{(1)} = 1$ :
$$\frac{\partial J}{\partial w_1} = \text{Error} \times x^{(1)} = -2 \times 1 = -2$$
Update :
$$w_0 := 0 - 0.1(-2) = 0.2$$
$$w_1 := 0 - 0.1(-2) = 0.2$$
Thus, the current model is : $y = 0.2 + 0.2x$
Iteration 2
Use the updated weights from Iteration 1.
- Prediction : $\hat{y}^{(2)} = 0.2(1) + 0.2(2) = 0.2 + 0.4 = 0.6$
- Error : $\hat{y}^{(2)} - y^{(2)} = 0.6 - 4 = -3.4$
- Compute gradients :
$$\frac{\partial J}{\partial w_0} = -3.4 \times 1 = -3.4$$
$$\frac{\partial J}{\partial w_1} = -3.4 \times 2 = -6.8$$
Update :
$$w_0 := 0.2 - 0.1(-3.4) = 0.2 + 0.34 = 0.54$$
$$w_1 := 0.2 - 0.1(-6.8) = 0.2 + 0.68 = 0.88$$
Thus, the final model after 1 epoch is :
$$y = 0.54 + 0.88x$$
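And the same epoch of SGD, verified in code (same dataset and α = 0.1; one update per example):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0]])   # bias column plus x = 1, 2
y = np.array([2.0, 4.0])
w = np.zeros(2)
alpha = 0.1

for x_i, y_i in zip(X, y):
    error = x_i @ w - y_i                # prediction minus target for this example
    w -= alpha * error * x_i             # SGD step using this example only
    print(w)                             # [0.2, 0.2] after iteration 1, [0.54, 0.88] after iteration 2
```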
That wraps up this post on linear regression, covering OLS, the normal equation, error metrics, and gradient descent.