Regularization Techniques

Methods to prevent overfitting and improve model generalization

What is Regularization?

Regularization is a fundamental technique in machine learning that prevents models from overfitting to training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on unseen data.

The core idea is to add constraints or penalties to the learning process that encourage simpler models. This trades a small increase in training error for a significant decrease in validation error. The art lies in finding the right balance between fitting the data and maintaining simplicity.

\text{Regularized Loss} = \text{Data Loss} + \lambda \cdot \text{Regularization Term}

where λ controls the strength of regularization. Too little and the model overfits; too much and it underfits. Different techniques add different types of constraints, from penalizing large weights to randomly dropping connections during training.
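To make the formula concrete, here is a minimal NumPy sketch of an L2-regularized loss for a linear model. The function and variable names are illustrative only, not taken from any library.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Data loss (mean squared error) plus lambda times a regularization term (L2 here)."""
    preds = X @ w
    data_loss = np.mean((preds - y) ** 2)   # how well we fit the training data
    penalty = np.sum(w ** 2)                # penalty on large weights
    return data_loss + lam * penalty

# Larger lam pushes the optimum toward smaller weights.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = rng.normal(size=5)
print(regularized_loss(w, X, y, lam=0.0), regularized_loss(w, X, y, lam=1.0))
```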

Overfitting vs Regularization

[Figure: Model Complexity Comparison. The overfitted model (red) captures noise, while the regularized model (green) stays closer to the true function (blue).]

[Figure: L1 vs L2 Regularization]

Regularization Techniques

L1 Regularization (Lasso)

weight-based
L_{total} = L_{data} + \lambda \sum_{i} |w_i|

Adds the sum of absolute values of parameters to the loss function, promoting sparsity.
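For example, scikit-learn's Lasso fits a linear model with an L1 penalty of this form; its alpha parameter plays the role of λ (up to a scaling of the data loss). A small sketch with synthetic data, meant only to show the induced sparsity:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 informative features

model = Lasso(alpha=0.1).fit(X, y)   # alpha ~ lambda
print(model.coef_)                   # most coefficients are driven exactly to zero
```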

L2 Regularization (Ridge)

weight-based
L_{total} = L_{data} + \lambda \sum_{i} w_i^2

Adds the sum of squared parameters to the loss function, encouraging small weights.
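The corresponding scikit-learn estimator is Ridge, whose alpha parameter again plays the role of λ; a short sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

model = Ridge(alpha=1.0).fit(X, y)   # larger alpha shrinks the weights harder
print(model.coef_)                   # small but generally nonzero coefficients
```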

Elastic Net

weight-based
L_{total} = L_{data} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2

Combines L1 and L2 regularization to get the benefits of both: sparsity from the L1 term and more stable handling of correlated features from the L2 term.
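In scikit-learn this is ElasticNet, where alpha and l1_ratio together determine how the penalty is split between the L1 and L2 terms (roughly playing the role of λ1 and λ2). A brief sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# l1_ratio splits the overall strength alpha between the L1 and L2 penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```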

Dropout

dropout
\hat{y} = f(W \cdot (x \odot m)), \quad m_i \sim \text{Bernoulli}(p)

Randomly deactivates neurons during training to prevent co-adaptation.
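A minimal NumPy sketch of the formula above, using the common "inverted dropout" variant in which p is the keep probability and activations are rescaled at training time so no change is needed at inference:

```python
import numpy as np

def dropout(x, keep_prob, training, rng):
    """Inverted dropout: mask units at train time, rescale so the expected output is unchanged."""
    if not training:
        return x                                          # dropout is disabled at inference
    mask = rng.binomial(1, keep_prob, size=x.shape)       # m_i ~ Bernoulli(p)
    return x * mask / keep_prob

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))          # activations of a hidden layer
print(dropout(h, keep_prob=0.8, training=True, rng=rng))
```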

Batch Normalization

normalization
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

Normalizes inputs of each layer to reduce internal covariate shift.
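A small NumPy sketch of this normalization over the batch dimension; it also includes the usual learnable scale γ and shift β, which the formula above omits:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature using statistics computed over the batch dimension."""
    mu = x.mean(axis=0)                     # mu_B: per-feature mean over the batch
    var = x.var(axis=0)                     # sigma_B^2: per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # roughly 0 and 1 per feature
```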

Layer Normalization

normalization
\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}

Normalizes across features instead of batch dimension.
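The only change from the batch norm sketch is the axis: statistics are computed per sample across its features, so the result does not depend on the batch at all:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample across its features, independent of the batch."""
    mu = x.mean(axis=-1, keepdims=True)      # mu_L: per-sample mean
    var = x.var(axis=-1, keepdims=True)      # sigma_L^2: per-sample variance
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 6))
print(layer_norm(x).mean(axis=-1).round(3))  # roughly 0 for every sample, even with batch size 1
```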

Early Stopping

early-stopping
\text{Stop when } L_{val}(t) > L_{val}(t-k) \text{ for all } k = 1, \dots, p \ \text{(patience } p\text{)}

Stops training when validation performance stops improving.
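A plain-Python sketch of one common variant: track the best validation loss seen so far and stop once it has not improved for p consecutive epochs. The train_step and eval_step callbacks are hypothetical placeholders for your own training and validation code.

```python
def train_with_early_stopping(train_step, eval_step, max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = eval_step()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0   # improvement: reset counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}, best val loss {best_loss:.4f}")
                break

# Toy usage with a scripted validation-loss curve.
losses = iter([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75])
train_with_early_stopping(lambda: None, lambda: next(losses), patience=5)
```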

Weight Decay

weight-based
w_{t+1} = w_t - \eta(\nabla L + \lambda w_t)

Directly decays weights during optimization, closely related to L2 regularization.
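A NumPy sketch of this update rule; in practice it is usually just a flag on the optimizer (for example, the weight_decay argument in common SGD or AdamW implementations):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD update: w <- w - eta * (grad_L + lambda * w)."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0, 0.5])
grad = np.zeros_like(w)                  # even with zero gradient, the weights shrink
for _ in range(10):
    w = sgd_step_with_weight_decay(w, grad)
print(w)                                 # each step multiplies the weights by (1 - lr * weight_decay)
```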

Data Augmentation

data-augmentation
\mathcal{D}_{aug} = \{(T(x_i), y_i) : T \in \mathcal{T}\}

Artificially increases training data through transformations.
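A minimal NumPy sketch for images, where the transformation set consists of a random horizontal flip plus a small amount of noise; the specific transforms and magnitudes are illustrative, and the label is unchanged:

```python
import numpy as np

def augment(image, rng):
    """Apply label-preserving transformations drawn from a small set."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)                          # random horizontal flip
    image = image + rng.normal(scale=0.01, size=image.shape)    # small Gaussian noise
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
image, label = rng.random(size=(32, 32, 3)), 1
augmented = [(augment(image, rng), label) for _ in range(4)]    # same label, new inputs
print(len(augmented), augmented[0][0].shape)
```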

Best Practices

Choosing Regularization

  • Start with L2/Weight Decay: It's the safest default choice
  • Use L1 for feature selection: When you need sparse models
  • Dropout for neural networks: Especially fully connected layers
  • Batch/Layer Norm: For deep networks and training stability
  • Data augmentation: When you have domain knowledge
  • Combine techniques: Often works better than single method

Common Pitfalls

  • Over-regularizing: Can lead to underfitting
  • Wrong technique: L1 for correlated features can be unstable
  • Ignoring validation: Always tune on validation set
  • Dropout at test time: Remember to scale or turn off
  • Batch norm with small batches: Can be unstable
  • Not adjusting learning rate: Regularization may require different LR

Quick Decision Guide

Linear models: L2 (Ridge) or L1 (Lasso) regularization

Neural networks: Dropout + Weight Decay + Batch Norm

Small dataset: Strong regularization + Data augmentation

Feature selection needed: L1 or Elastic Net

Transformers: Dropout + Weight Decay + Layer Norm
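As a rough illustration of the neural-network recipe above, here is a PyTorch-style sketch combining dropout, batch norm, and decoupled weight decay. The layer sizes and hyperparameters are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# Dropout and batch norm live in the model; weight decay lives in the optimizer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # note: p here is the drop probability
    nn.Linear(256, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

model.train()   # enables dropout and batch statistics during training
# ... training loop ...
model.eval()    # disables dropout and uses running statistics at inference
```

Switching between train() and eval() also addresses the "dropout at test time" and "batch norm with small batches" pitfalls listed above.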