Regularization Techniques
Methods to prevent overfitting and improve model generalization
What is Regularization?
Regularization is a fundamental technique in machine learning that prevents models from overfitting to training data. Overfitting occurs when a model learns the training data too well, including its noise and peculiarities, leading to poor generalization on unseen data.
The core idea is to add constraints or penalties to the learning process that encourage simpler models. This trades a small increase in training error for a significant decrease in validation error. The art lies in finding the right balance between fitting the data and maintaining simplicity.
In its most common form, regularization adds a penalty term to the training objective: total loss = data loss + λ · penalty(weights), where λ controls the strength of regularization. Too little and the model overfits; too much and it underfits. Different techniques add different types of constraints, from penalizing large weights to randomly dropping connections during training.
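As a concrete illustration of this objective, here is a minimal NumPy sketch; the function name and the choice of a squared-error data term are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def regularized_loss(w, X, y, lam, penalty="l2"):
    """Data-fitting term plus lam times a penalty term."""
    data_loss = np.mean((X @ w - y) ** 2)      # how well the model fits the data
    if penalty == "l1":
        reg = np.sum(np.abs(w))                # L1: promotes sparsity
    else:
        reg = np.sum(w ** 2)                   # L2: shrinks weights toward zero
    return data_loss + lam * reg

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w = rng.normal(size=5)
y = X @ w + 0.1 * rng.normal(size=100)
print(regularized_loss(w, X, y, lam=0.0))      # pure data loss
print(regularized_loss(w, X, y, lam=0.1))      # data loss plus penalty
```

Raising lam pushes the optimum toward smaller weights; setting it to zero recovers the unregularized fit.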
Overfitting vs Regularization
[Figure: Model Complexity Comparison. The overfitted model (red) captures noise, while the regularized model (green) stays closer to the true function (blue).]
[Figure: L1 vs L2 Regularization.]
Regularization Techniques
L1 Regularization (Lasso)
Weight-based: adds the sum of absolute values of the parameters to the loss function, promoting sparsity.
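A minimal sketch of the sparsity effect, assuming scikit-learn's Lasso and synthetic data chosen purely for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[0, 3]] = [3.0, -2.0]                  # only two features actually matter
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1)                      # alpha plays the role of lambda
lasso.fit(X, y)
print(lasso.coef_)                            # irrelevant coefficients come out exactly 0.0
```

With a suitable alpha, the coefficients of the irrelevant features are driven exactly to zero, which is what makes L1 useful for feature selection.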
L2 Regularization (Ridge)
Weight-based: adds the sum of squared parameters to the loss function, encouraging small weights.
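For contrast, a matching sketch assuming scikit-learn's Ridge, again on made-up data; L2 shrinks all coefficients but rarely zeroes them:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0)                      # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.coef_)                            # small weights, but none forced to zero
```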
Elastic Net
Weight-based: combines the L1 and L2 penalties to get the benefits of both.
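A brief sketch, assuming scikit-learn's ElasticNet; l1_ratio is the knob that mixes the two penalties:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)

# l1_ratio=0.0 is pure L2, l1_ratio=1.0 is pure L1; 0.5 mixes them equally.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```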
Dropout
Randomly deactivates neurons during training to prevent co-adaptation.
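A minimal PyTorch sketch (the layer sizes and dropout rate are arbitrary) showing that dropout is active only in training mode:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # each hidden unit is zeroed with probability 0.5
    nn.Linear(64, 1),
)
x = torch.randn(8, 20)

model.train()               # training mode: dropout is active
out_train = model(x)

model.eval()                # evaluation mode: dropout is a no-op
out_eval = model(x)
```

PyTorch uses inverted dropout and scales activations during training, so no extra scaling is needed at test time as long as the model is switched to eval mode.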
Batch Normalization
Normalizes the inputs of each layer to reduce internal covariate shift.
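A small PyTorch sketch of what the normalization does to per-feature statistics; the tensor shapes are arbitrary:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(64)             # one mean/variance estimate per feature
x = torch.randn(32, 64) * 5 + 3     # batch of 32 samples, shifted and scaled

bn.train()
y = bn(x)
print(y.mean(dim=0)[:3])            # roughly 0 for each feature
print(y.std(dim=0)[:3])             # roughly 1 for each feature
```

At evaluation time the layer switches to running estimates of the mean and variance collected during training, which is why very small batches can make those estimates noisy.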
Layer Normalization
Normalizes across the feature dimension instead of the batch dimension.
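The same kind of sketch for PyTorch's LayerNorm; the statistics are computed per sample, so the result does not depend on the batch at all:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(64)               # normalizes the 64 features of each sample
x = torch.randn(32, 64) * 5 + 3

y = ln(x)
print(y.mean(dim=-1)[:3])           # roughly 0 for each individual sample
print(y.std(dim=-1)[:3])            # roughly 1, even with a batch size of 1
```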
Early Stopping
Stops training when validation performance stops improving.
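A self-contained PyTorch sketch of a patience-based early-stopping loop; the model, data, and patience value are toy choices for illustration:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(500):
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val:                      # improvement: remember this model
        best_val, bad_epochs = val_loss, 0
        best_state = copy.deepcopy(model.state_dict())
    else:                                        # no improvement for too long: stop
        bad_epochs += 1
        if bad_epochs >= patience:
            break

model.load_state_dict(best_state)                # roll back to the best checkpoint
```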
Weight Decay
Weight-based: directly decays weights during optimization; closely related to L2 regularization.
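A short sketch, assuming PyTorch optimizers, of where the decay term lives:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# For plain SGD, weight_decay adds lambda * w to each gradient, which is
# equivalent to an L2 penalty of (lambda / 2) * ||w||^2 in the loss.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# For adaptive optimizers the two are not equivalent; AdamW instead applies
# the decay directly to the weights ("decoupled" weight decay).
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```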
Data Augmentation
Artificially increases the training data through transformations of existing examples.
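An illustrative image-augmentation pipeline, assuming torchvision and PIL are available; the particular transforms and parameters are arbitrary examples:

```python
import numpy as np
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(28, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# A dummy 28x28 RGB image stands in for a real training example.
img = Image.fromarray(np.random.randint(0, 255, (28, 28, 3), dtype=np.uint8))
x1, x2 = augment(img), augment(img)   # two random views of the same image
print((x1 - x2).abs().sum())          # typically nonzero: the model rarely sees
                                      # the exact same input twice
```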
Best Practices
Choosing Regularization
- Start with L2/Weight Decay: It's the safest default choice
- Use L1 for feature selection: When you need sparse models
- Dropout for neural networks: Especially fully connected layers
- Batch/Layer Norm: For deep networks and training stability
- Data augmentation: When you have domain knowledge
- Combine techniques: Often works better than single method
Common Pitfalls
- Over-regularizing: Can lead to underfitting
- Wrong technique: L1 with strongly correlated features picks among them somewhat arbitrarily and can be unstable; Elastic Net handles this better
- Ignoring validation: Always tune on validation set
- Dropout at test time: Remember to scale or turn off
- Batch norm with small batches: Can be unstable
- Not adjusting learning rate: Regularization may require different LR
Quick Decision Guide
• Linear models: L2 (Ridge) or L1 (Lasso) regularization
• Neural networks: Dropout + Weight Decay + Batch Norm
• Small dataset: Strong regularization + Data augmentation
• Feature selection needed: L1 or Elastic Net
• Transformers: Dropout + Weight Decay + Layer Norm