Regularization Techniques

Methods to prevent overfitting and improve model generalization

What is Regularization?

Regularization is a fundamental technique in machine learning that prevents models from overfitting to training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on unseen data.

The core idea is to add constraints or penalties to the learning process that encourage simpler models. This trades a small increase in training error for a significant decrease in validation error. The art lies in finding the right balance between fitting the data and maintaining simplicity.

\text{Regularized Loss} = \text{Data Loss} + \lambda \cdot \text{Regularization Term}

where λ controls the strength of regularization. Too little and the model overfits; too much and it underfits. Different techniques add different types of constraints, from penalizing large weights to randomly dropping connections during training.
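To make the formula concrete, here is a minimal NumPy sketch of an L2-regularized loss for a linear model. The function and variable names are illustrative only, not taken from any library.

```python
import numpy as np

def regularized_loss(w, X, y, lam):
    """Data loss (mean squared error) plus lambda times a regularization term (L2 here)."""
    preds = X @ w
    data_loss = np.mean((preds - y) ** 2)   # how well we fit the training data
    penalty = np.sum(w ** 2)                # penalty on large weights
    return data_loss + lam * penalty

# Larger lam pushes the optimum toward smaller weights.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = rng.normal(size=5)
print(regularized_loss(w, X, y, lam=0.0), regularized_loss(w, X, y, lam=1.0))
```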

Overfitting vs Regularization

[Figure: Model Complexity Comparison. The overfitted model (red) captures noise, while the regularized model (green) stays closer to the true function (blue).]

[Figure: L1 vs L2 Regularization]

Regularization Techniques

L1 Regularization (Lasso)

weight-based
L_{total} = L_{data} + \lambda \sum_{i} |w_i|

Adds the sum of absolute values of parameters to the loss function, promoting sparsity.
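For example, scikit-learn's Lasso fits a linear model with an L1 penalty of this form; its alpha parameter plays the role of λ (up to a scaling of the data loss). A small sketch with synthetic data, meant only to show the induced sparsity:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 informative features

model = Lasso(alpha=0.1).fit(X, y)   # alpha ~ lambda
print(model.coef_)                   # most coefficients are driven exactly to zero
```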

L2 Regularization (Ridge)

weight-based
L_{total} = L_{data} + \lambda \sum_{i} w_i^2

Adds the sum of squared parameters to the loss function, encouraging small weights.
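The corresponding scikit-learn estimator is Ridge, whose alpha parameter again plays the role of λ; a short sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

model = Ridge(alpha=1.0).fit(X, y)   # larger alpha shrinks the weights harder
print(model.coef_)                   # small but generally nonzero coefficients
```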

Elastic Net

weight-based
L_{total} = L_{data} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2

Combines L1 and L2 regularization to get the benefits of both: sparsity from the L1 term and more stable handling of correlated features from the L2 term.
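In scikit-learn this is ElasticNet, where alpha and l1_ratio together determine how the penalty is split between the L1 and L2 terms (roughly playing the role of λ1 and λ2). A brief sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

# l1_ratio splits the overall strength alpha between the L1 and L2 penalties.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)
```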

Dropout

dropout
\hat{y} = f(W \cdot (x \odot m)), \quad m_i \sim \text{Bernoulli}(p)

Randomly deactivates neurons during training to prevent co-adaptation.
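A minimal NumPy sketch of the formula above, using the common "inverted dropout" variant in which p is the keep probability and activations are rescaled at training time so no change is needed at inference:

```python
import numpy as np

def dropout(x, keep_prob, training, rng):
    """Inverted dropout: mask units at train time, rescale so the expected output is unchanged."""
    if not training:
        return x                                          # dropout is disabled at inference
    mask = rng.binomial(1, keep_prob, size=x.shape)       # m_i ~ Bernoulli(p)
    return x * mask / keep_prob

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))          # activations of a hidden layer
print(dropout(h, keep_prob=0.8, training=True, rng=rng))
```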

Batch Normalization

normalization
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}

Normalizes inputs of each layer to reduce internal covariate shift.
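A small NumPy sketch of this normalization over the batch dimension; it also includes the usual learnable scale γ and shift β, which the formula above omits:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature using statistics computed over the batch dimension."""
    mu = x.mean(axis=0)                     # mu_B: per-feature mean over the batch
    var = x.var(axis=0)                     # sigma_B^2: per-feature variance over the batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))   # roughly 0 and 1 per feature
```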

Layer Normalization

normalization
\hat{x} = \frac{x - \mu_L}{\sqrt{\sigma_L^2 + \epsilon}}

Normalizes across features instead of batch dimension.
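The only change from the batch norm sketch is the axis: statistics are computed per sample across its features, so the result does not depend on the batch at all:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each sample across its features, independent of the batch."""
    mu = x.mean(axis=-1, keepdims=True)      # mu_L: per-sample mean
    var = x.var(axis=-1, keepdims=True)      # sigma_L^2: per-sample variance
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(2, 6))
print(layer_norm(x).mean(axis=-1).round(3))  # roughly 0 for every sample, even with batch size 1
```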

Early Stopping

early-stopping
\text{Stop when } L_{val}(t) > L_{val}(t-k) \text{ for all } k = 1, \dots, p \ \text{(patience } p\text{)}

Stops training when validation performance stops improving.
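A plain-Python sketch of one common variant: track the best validation loss seen so far and stop once it has not improved for p consecutive epochs. The train_step and eval_step callbacks are hypothetical placeholders for your own training and validation code.

```python
def train_with_early_stopping(train_step, eval_step, max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        val_loss = eval_step()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0   # improvement: reset counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Stopping at epoch {epoch}, best val loss {best_loss:.4f}")
                break

# Toy usage with a scripted validation-loss curve.
losses = iter([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75])
train_with_early_stopping(lambda: None, lambda: next(losses), patience=5)
```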

Weight Decay

weight-based
w_{t+1} = w_t - \eta(\nabla L + \lambda w_t)

Directly decays weights during optimization, closely related to L2 regularization.
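A NumPy sketch of this update rule; in practice it is usually just a flag on the optimizer (for example, the weight_decay argument in common SGD or AdamW implementations):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr=0.1, weight_decay=0.01):
    """One SGD update: w <- w - eta * (grad_L + lambda * w)."""
    return w - lr * (grad + weight_decay * w)

w = np.array([1.0, -2.0, 0.5])
grad = np.zeros_like(w)                  # even with zero gradient, the weights shrink
for _ in range(10):
    w = sgd_step_with_weight_decay(w, grad)
print(w)                                 # each step multiplies the weights by (1 - lr * weight_decay)
```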

Data Augmentation

data-augmentation
\mathcal{D}_{aug} = \{(T(x_i), y_i) : T \in \mathcal{T}\}

Artificially increases training data through transformations.
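A minimal NumPy sketch for images, where the transformation set consists of a random horizontal flip plus a small amount of noise; the specific transforms and magnitudes are illustrative, and the label is unchanged:

```python
import numpy as np

def augment(image, rng):
    """Apply label-preserving transformations drawn from a small set."""
    if rng.random() < 0.5:
        image = np.flip(image, axis=1)                          # random horizontal flip
    image = image + rng.normal(scale=0.01, size=image.shape)    # small Gaussian noise
    return np.clip(image, 0.0, 1.0)

rng = np.random.default_rng(0)
image, label = rng.random(size=(32, 32, 3)), 1
augmented = [(augment(image, rng), label) for _ in range(4)]    # same label, new inputs
print(len(augmented), augmented[0][0].shape)
```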

Best Practices

Choosing Regularization

  • Start with L2/Weight Decay: It's the safest default choice
  • Use L1 for feature selection: When you need sparse models
  • Dropout for neural networks: Especially fully connected layers
  • Batch/Layer Norm: For deep networks and training stability
  • Data augmentation: When you have domain knowledge
  • Combine techniques: Often works better than single method

Common Pitfalls

  • Over-regularizing: Can lead to underfitting
  • Wrong technique: L1 for correlated features can be unstable
  • Ignoring validation: Always tune on validation set
  • Dropout at test time: Remember to scale or turn off
  • Batch norm with small batches: Can be unstable
  • Not adjusting learning rate: Regularization may require different LR

Quick Decision Guide

Linear models: L2 (Ridge) or L1 (Lasso) regularization

Neural networks: Dropout + Weight Decay + Batch Norm

Small dataset: Strong regularization + Data augmentation

Feature selection needed: L1 or Elastic Net

Transformers: Dropout + Weight Decay + Layer Norm
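As a rough illustration of the neural-network recipe above, here is a PyTorch-style sketch combining dropout, batch norm, and decoupled weight decay. The layer sizes and hyperparameters are placeholders, not recommendations.

```python
import torch
import torch.nn as nn

# Dropout and batch norm live in the model; weight decay lives in the optimizer.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # note: p here is the drop probability
    nn.Linear(256, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

model.train()   # enables dropout and batch statistics during training
# ... training loop ...
model.eval()    # disables dropout and uses running statistics at inference
```

Switching between train() and eval() also addresses the "dropout at test time" and "batch norm with small batches" pitfalls listed above.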