Model Evaluation : Confusion Matrix & Metrics
To optimize a model, its performance must first be quantified. In classification, raw accuracy is often insufficient, especially with imbalanced classes. The foundation of evaluation is the Confusion Matrix.
For a binary classifier, predictions fall into 4 buckets based on the combination of the predicted class and the actual class.

- True Positive (TP): Predicted (+) and Actual (+).
- True Negative (TN): Predicted (-) and Actual (-).
- False Positive (FP): Predicted (+) and Actual (-) (Type I Error).
- False Negative (FN): Predicted (-) and Actual (+) (Type II Error).
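As a quick illustration, here is a minimal sketch that extracts the four counts with scikit-learn's `confusion_matrix` (the labels are made up for demonstration):

```python
from sklearn.metrics import confusion_matrix

# Toy ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels [0, 1], scikit-learn orders the matrix as
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=3, TN=3, FP=1, FN=1
```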
These 4 values are used to derive metrics that evaluate specific aspects of model performance :
1. Accuracy
The ratio of correct predictions to the total number of predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
- Accuracy fails to distinguish between types of errors, which is critical in imbalanced datasets.
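A tiny sketch of that failure mode (toy numbers, assuming scikit-learn is available): with 95 negatives and 5 positives, a model that always predicts negative still scores 95% accuracy.

```python
from sklearn.metrics import accuracy_score, recall_score

# 95 negatives, 5 positives: a degenerate "model" that always predicts negative
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- yet it misses every positive
```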
2. Precision
The percentage of predicted positives that are actually positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
- High precision means few False Positives.
- Also known as the Positive Predictive Value.
3. Recall
The percentage of actual positives that were correctly identified.
$$\text{Recall} = \frac{TP}{TP + FN}$$
- Also known as Sensitivity or the True Positive Rate.
- High recall means few False Negatives.
4. Selectivity (Specificity) :
While Recall measures how well we find positives, Selectivity measures how well we reject negatives.
$$\text{Selectivity} = \frac{TN}{TN + FP}$$
5. F1-Score
The harmonic mean of Precision and Recall.
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
- It penalizes extreme values more than the arithmetic mean, ensuring the model is strong on both metrics rather than excelling at just one.
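To tie metrics 2 through 5 together, here is a minimal sketch computing each of them directly from the four counts (the counts are made up for illustration):

```python
# Toy confusion-matrix counts
tp, tn, fp, fn = 40, 50, 10, 20

precision   = tp / (tp + fp)                        # 0.800
recall      = tp / (tp + fn)                        # ~0.667 (Sensitivity / TPR)
specificity = tn / (tn + fp)                        # ~0.833 (rejection of negatives)
f1 = 2 * precision * recall / (precision + recall)  # ~0.727, the harmonic mean

print(precision, recall, specificity, f1)
```

The harmonic mean's penalty is easy to see here: a model with precision 1.0 and recall 0.1 has an arithmetic mean of 0.55 but an F1 of only about 0.18.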
6. Area Under ROC Curve (AUC-ROC)
The Receiver Operating Characteristic (ROC) curve plots the Recall (TPR) against the False Positive Rate ($1 - \text{Specificity}$) at various threshold settings.
The Area Under this Curve (AUC) represents the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
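A short sketch, assuming scikit-learn and hypothetical scores: `roc_curve` sweeps the decision threshold to trace the curve, and `roc_auc_score` integrates it.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted positive-class probabilities
y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# roc_curve records an (FPR, TPR) pair at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# AUC: probability that a random positive outranks a random negative
print(roc_auc_score(y_true, y_scores))  # 0.875 for this toy data
```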

With more than 2 classes, compute Precision/Recall for each class independently, then average the scores. This treats all classes equally, which is useful for checking performance on rare classes.
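This per-class-then-average scheme is what scikit-learn calls macro-averaging; a brief sketch with three toy classes:

```python
from sklearn.metrics import precision_score, recall_score

# Toy 3-class labels
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 0]

# average="macro": compute the metric per class, then take the unweighted
# mean, so a rare class counts exactly as much as a common one
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
```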
Bias-Variance Trade-Off
One of the most fundamental problems is Generalization : creating a hypothesis $h(x)$ that performs well on unseen data.
The error of a model can be decomposed into 3 parts : Bias, Variance & Irreducible Error.
Bias
It is the error caused by overly simplistic assumptions in the model. A high-bias model will be too simple for the data and miss important patterns, leading to underfitting.
Bias measures how far the average prediction of a model (over many different training sets) is from the true function.
Variance
Variance is the error caused by excessive sensitivity to the training data. A high-variance model will be too complex and fit the noise in the training data, leading to overfitting.
Variance measures how much the model's predictions would change if it were trained on a different dataset.

Mathematical Derivation of MSE Decomposition
Assume a true relationship $y = f(x) + \epsilon$, where $\epsilon$ is zero-mean noise with variance $\sigma^2$. We estimate this with a model $\hat{f}(x)$.
The expected mean squared error (MSE) on an unseen sample x is:
$$\text{Error}(x) = E[(y - \hat{f}(x))^2] = E[(f(x) + \epsilon - \hat{f}(x))^2] = E[(f(x) - \hat{f}(x))^2] + \sigma^2$$
Focusing on the estimation error $E[(f(x) - \hat{f}(x))^2]$ :
Let $E[\hat{f}(x)]$ be the average prediction of our model over infinitely many training sets. We add and subtract this term :
$$E\left[\left(f(x) - E[\hat{f}(x)] + E[\hat{f}(x)] - \hat{f}(x)\right)^2\right]$$
Expanding this square using $(a+b)^2 = a^2 + b^2 + 2ab$ :
- Bias term : $(f(x) - E[\hat{f}(x)])^2$ is the squared Bias.
- It measures how far the average model is from the truth.
- Variance term : $E[(E[\hat{f}(x)] - \hat{f}(x))^2]$ is the Variance.
- It measures how much any single model fluctuates around the average model.
- Cross term : it vanishes because $E[\hat{f}(x) - E[\hat{f}(x)]] = 0$.
Thus the final relation :
$$\text{Total Error} = \underbrace{\left(E[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2} + \underbrace{E\left[\left(E[\hat{f}(x)] - \hat{f}(x)\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}$$
The error is thus made up of bias and variance, and it is important to find the right balance between them.
- Low Complexity : High Bias, Low Variance. The model is too rigid.
- High Complexity : Low Bias, High Variance. The model is too flexible and captures noise.
- The Sweet Spot : The goal is to find the complexity level where the sum of $\text{Bias}^2$ and Variance is minimized.

- Underfitting : Occurs when a model is too simple to capture underlying patterns, performing poorly on both training and test data.
- Overfitting : Occurs when a model is too complex and learns noise in the training data as if it were a real pattern. It leads to high accuracy on training data but poor performance on testing data.
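The decomposition can also be checked numerically. The sketch below is a toy Monte Carlo experiment (assuming only NumPy; the true function, noise level, and test point are arbitrary choices): it fits many polynomials of a given degree to fresh noisy samples and estimates the bias² and variance of the prediction at one point. Low degrees show the high-bias (underfitting) regime, high degrees the high-variance (overfitting) one.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3     # noise std; sigma**2 is the irreducible error
x_test = 0.25   # point at which the error is measured

def f(x):
    """The true (unknown) function being estimated."""
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_datasets=2000, n_points=20):
    """Fit many polynomials of `degree`, each on a fresh noisy sample."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, n_points)            # fresh training set
        y = f(x) + rng.normal(0, sigma, n_points)  # noisy targets
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_test)
    bias2 = (preds.mean() - f(x_test)) ** 2        # (E[f_hat] - f)^2
    variance = preds.var()                         # E[(f_hat - E[f_hat])^2]
    return bias2, variance

for degree in (0, 3, 9):
    b2, var = bias_variance(degree)
    print(f"degree={degree}  bias^2={b2:.4f}  variance={var:.4f}")
```

Adding the two printed columns to $\sigma^2 = 0.09$ approximates the total expected error at the test point; the degree that minimizes this sum is the sweet spot described above.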

Regularization
When the model is too complex, i.e. overfitting, the variance can be reduced by constraining the model weights. This is called regularization. It works by adding a penalty term to the Loss Function (the Residual Sum of Squares [RSS]).
$$\text{Total Cost} = \text{RSS} + \lambda \cdot (\text{Penalty Term})$$
- $\lambda$ is the tuning parameter.
- As $\lambda \to \infty$, coefficients shrink toward zero, reducing variance but increasing bias.
Ridge Regression ($L_2$ Regularization)
Ridge adds the squared magnitude of coefficients as the penalty.
$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\arg\min} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$
Expanding it :
$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\arg\min} \left( \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right)$$
- Shrinks coefficients toward zero but never exactly to zero. It includes all features in the final model (no variable selection).
The constraint region $\beta_1^2 + \beta_2^2 \le s$ is a circle. The RSS ellipses usually hit the circle at a non-axis point, keeping every $\beta_j$ non-zero.
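A minimal sketch of this shrinkage behaviour with scikit-learn's `Ridge` on random toy data (`alpha` plays the role of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Coefficients shrink toward zero as alpha grows,
# but none of them ever becomes exactly zero
for alpha in (0.01, 1.0, 100.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
```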
Lasso Regression ($L_1$ Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) adds the absolute value of coefficients.
$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\arg\min} \left( \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)$$
Expanding it :
$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\arg\min} \left( \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right)$$
- Can shrink coefficients exactly to zero, effectively performing feature selection. It creates sparse models.
The constraint region $|\beta_1| + |\beta_2| \le s$ is a diamond. The RSS ellipses often hit the "corners" of the diamond (on the axes), forcing some coefficients to zero.
- The $L_1$ penalty is not differentiable at 0, so standard gradient-based solvers are typically replaced by methods such as coordinate descent.
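The effect of those corners is easy to observe. A small sketch with scikit-learn's `Lasso` on toy data whose true coefficients include zeros:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Two features are genuinely irrelevant (their true coefficients are zero)
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# With a strong enough penalty some coefficients land exactly on zero:
# the model has performed feature selection
model = Lasso(alpha=0.5).fit(X, y)
print(np.round(model.coef_, 3))
```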

Elastic Net
Elastic Net combines both $L_1$ and $L_2$ penalties to get the best of both worlds: feature selection (Lasso) and handling of correlated features (Ridge).
$$\text{Objective} = \text{RSS} + \lambda_1 \sum |\beta_j| + \lambda_2 \sum \beta_j^2$$
- Use Case : Ideal when we have high-dimensional data ($p > n$) or highly correlated groups of features.
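As a sketch of how this looks in practice: scikit-learn's `ElasticNet` reparametrizes the two $\lambda$s into a single overall strength `alpha` and a mixing weight `l1_ratio` (the toy data below is constructed with two nearly identical features, the case where pure Lasso tends to arbitrarily keep one and drop the other):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Make feature 1 almost a copy of feature 0
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)
y = X @ np.array([2.0, 2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.5, size=100)

# l1_ratio mixes the two penalties: 1.0 is pure Lasso, 0.0 is pure Ridge
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 3))
```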
With this, the post covering the confusion matrix, the metrics used to evaluate model performance, the Bias-Variance Trade-off, and Regularization is complete.