Hyperparameters
Master the art of configuring machine learning models. Learn what hyperparameters are, why they matter, and how to find the optimal settings for your model.
What Are Hyperparameters?
Hyperparameters are the configuration settings of a machine learning algorithm that are set before training begins. Unlike model parameters (weights and biases) that are learned during training, hyperparameters control the training process itself and the structure of the model.
Parameters vs Hyperparameters
Parameters (Learned)
- Neural network weights
- Biases in each layer
- Embeddings in NLP models
- Learned from data via backpropagation
Hyperparameters (Set by You)
- Learning rate
- Batch size
- Number of layers
- Set before training starts
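To make the split concrete, here is a minimal scikit-learn sketch (assuming scikit-learn is installed; the specific values are illustrative). The hyperparameters go into the constructor before training; the parameters only exist after fit() runs.

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Hyperparameters: chosen by you, before training starts
clf = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # network depth and layer width
    learning_rate_init=0.01,      # learning rate
    batch_size=32,                # batch size
    max_iter=300,                 # maximum passes over the data
)

# Parameters: learned from the data via backpropagation
clf.fit(X, y)
print([w.shape for w in clf.coefs_])  # the learned weight matrices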
Why They Matter
Hyperparameters can make the difference between a model that doesn't learn at all and one that achieves state-of-the-art performance. They control:
- Model Capacity: The complexity of the patterns the model can learn
- Training Speed: How quickly the model converges
- Generalization: How well the model performs on new data
- Stability: Whether training converges or diverges
Key Insight: There's no universal set of “best” hyperparameters. The optimal values depend on your specific dataset, model architecture, and problem domain. This is why hyperparameter tuning is crucial for achieving good performance.
Common Hyperparameters Explained
Learning Rate
Controls the step size during gradient descent. Too high and training diverges; too low and it takes forever.
Common range: 0.0001 to 0.1
Batch Size
Number of samples processed before updating weights. Affects memory usage and gradient quality.
Number of Epochs
How many times to iterate through the entire dataset. More isn't always better!
Use early stopping with validation loss
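Illustrating that tip, here is a minimal early-stopping sketch; model and max_epochs are assumed to be defined, and train_one_epoch, validate, and save_checkpoint are hypothetical helpers. The patience value is illustrative.

best_val_loss = float("inf")
patience, bad_epochs = 5, 0

for epoch in range(max_epochs):
    train_one_epoch(model)      # hypothetical: one pass over the training set
    val_loss = validate(model)  # hypothetical: loss on the validation set
    if val_loss < best_val_loss:
        best_val_loss, bad_epochs = val_loss, 0
        save_checkpoint(model)  # hypothetical: keep the best weights so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` epochs, stop early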
Dropout Rate
Probability of dropping neurons during training to prevent overfitting.
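For intuition, here is a sketch of inverted dropout in NumPy, one common formulation. It is applied only during training; at inference the function just passes activations through.

import numpy as np

def dropout(activations, rate=0.5, training=True):
    """Zero a random fraction `rate` of units and rescale the rest
    so the expected activation stays the same (inverted dropout)."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob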
Network Depth
- Number of hidden layers
- More layers = more capacity
- Risk: vanishing gradients
Layer Width
- Neurons per layer
- Wider = more parameters
- Balance with depth
Activation Functions
- ReLU, Sigmoid, Tanh
- Affects gradient flow
- Problem-specific choice
Hyperparameter Tuning Methods
Grid Search
Exhaustively searches through a manually specified subset of the hyperparameter space.
Advantages
- Guaranteed to find best in grid
- Simple to implement
- Parallelizable
Disadvantages
- Computationally expensive
- Curse of dimensionality
- May miss optimal values
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'dropout': [0.2, 0.3, 0.5]
}
# Total: 3 × 3 × 3 = 27 combinations
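One way to run this grid, sketched with the standard library; train_and_evaluate is a hypothetical helper that trains a model with the given settings and returns its validation score.

from itertools import product

# Enumerate every combination in param_grid (defined above), keep the best
best_score, best_params = -float("inf"), None
for values in product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = train_and_evaluate(**params)  # hypothetical helper
    if score > best_score:
        best_score, best_params = score, params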
Random Search
Samples random combinations from specified distributions. Often more efficient than grid search.
Advantages
- More efficient exploration
- Can find unexpected optima
- Easy to add more trials
Disadvantages
- No guarantee of optimum
- May sample poor regions
- Results vary between runs
Research Finding: Random search often finds better hyperparameters than grid search with the same computational budget, especially when only a few hyperparameters truly matter.
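Spending the same budget with random sampling might look like this sketch (same hypothetical train_and_evaluate helper; the distributions are illustrative, with the learning rate drawn on a log scale):

import random

n_trials = 27  # same budget as the 27-combination grid above
best_score, best_params = -float("inf"), None
for _ in range(n_trials):
    params = {
        'learning_rate': 10 ** random.uniform(-4, -1),  # log-uniform over [1e-4, 1e-1]
        'batch_size': random.choice([16, 32, 64]),
        'dropout': random.uniform(0.1, 0.6),
    }
    score = train_and_evaluate(**params)  # hypothetical helper
    if score > best_score:
        best_score, best_params = score, params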
Bayesian Optimization
Builds a probabilistic model of the objective function and uses it to select promising hyperparameters to evaluate.
How It Works
1. Build a surrogate model (e.g., a Gaussian process)
2. Use an acquisition function to find the next point to try
3. Evaluate the objective at the selected point
4. Update the surrogate model and repeat
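Libraries such as Optuna (see Tools and Libraries below) wrap this loop; here is a minimal sketch, assuming Optuna is installed and reusing the hypothetical train_and_evaluate helper. Optuna's default sampler is TPE rather than a Gaussian process, but the propose-evaluate-update cycle is the same.

import optuna

def objective(trial):
    # The sampler proposes promising values based on past trials
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.6)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    return train_and_evaluate(lr, dropout, batch_size)  # hypothetical helper

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)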
Genetic Algorithms
Evolve populations of hyperparameters through selection, crossover, and mutation.
Population Based Training
Train multiple models in parallel, periodically replacing poor performers.
Hyperband
Adaptive resource allocation that stops poor configurations early (sketched below).
Neural Architecture Search
Automatically design neural network architectures using ML.
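To give a flavor of how Hyperband allocates resources, here is a sketch of successive halving, the subroutine at its core; partial_train is a hypothetical helper that trains a configuration for the given budget and returns a score, and the budgets are illustrative.

def successive_halving(configs, budget=1, rungs=3):
    """Train every config on a small budget, keep the best half,
    double the budget, and repeat until one config remains."""
    for _ in range(rungs):
        scored = [(partial_train(cfg, budget), cfg) for cfg in configs]  # hypothetical
        scored.sort(key=lambda pair: pair[0], reverse=True)
        configs = [cfg for _, cfg in scored[: max(1, len(configs) // 2)]]
        budget *= 2
    return configs[0]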
Best Practices for Hyperparameter Tuning
Before You Start
1. Start with defaults: Many frameworks provide sensible defaults. Get a baseline first.
2. Use a validation set: Never tune on test data. Split into train/validation/test.
3. Log everything: Track all experiments: hyperparameters, metrics, training curves.
During Tuning
1. Use a log scale for the learning rate: Sample from [0.0001, 0.001, 0.01], not [0.001, 0.002, 0.003] (see the sketch after this list).
2. Consider interactions: Learning rate and batch size often interact, so tune them together.
3. Early stopping saves time: Don't waste resources on clearly bad configurations.
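A quick sketch of the log-scale point: uniform sampling over [0.0001, 0.1] almost never proposes small learning rates, while log-uniform sampling covers each order of magnitude equally.

import random

linear = random.uniform(0.0001, 0.1)        # ~99% of draws land above 0.001
log_uniform = 10 ** random.uniform(-4, -1)  # equal chance per order of magnitude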
Common Pitfalls
Data Leakage
Don't let test data influence hyperparameter choices
Overfitting to Validation
Too much tuning can overfit to the validation set
Random Seeds
Use multiple seeds to ensure results are robust
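For the seeds point, here is a sketch of evaluating one configuration across several seeds and reporting the mean and spread (same hypothetical train_and_evaluate helper; config stands for the hyperparameter setting under test):

import statistics

seeds = [0, 1, 2, 3, 4]
scores = [train_and_evaluate(config, seed=s) for s in seeds]  # hypothetical helper
print(f"{statistics.mean(scores):.3f} ± {statistics.stdev(scores):.3f} across seeds")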
Learning Rate Schedules
Instead of using a fixed learning rate, schedules adjust it during training for better convergence.
Step Decay
Drop by factor every N epochs
Exponential Decay
Smooth exponential decrease
Cosine Annealing
Cosine-shaped decay
Warm Restarts
Periodic resets for exploration
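These schedules are short formulas; here is a sketch of common formulations, where the decay factors and periods are illustrative rather than canonical.

import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    return lr0 * (drop ** (epoch // every))  # drop by a factor every N epochs

def exponential_decay(lr0, epoch, k=0.05):
    return lr0 * math.exp(-k * epoch)  # smooth exponential decrease

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    cos = math.cos(math.pi * epoch / total_epochs)  # cosine-shaped decay to lr_min
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + cos)

def warm_restarts(lr0, epoch, period=10):
    return cosine_annealing(lr0, epoch % period, period)  # periodic reset to lr0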
Practical Example: Tuning a CNN
Initial Hyperparameters
{
  "learning_rate": 0.01,
  "batch_size": 32,
  "dropout": 0.5,
  "weight_decay": 0.0001,
  "optimizer": "SGD"
}
// Validation Acc: 82%
After Bayesian Optimization
{
  "learning_rate": 0.0023,
  "batch_size": 48,
  "dropout": 0.35,
  "weight_decay": 0.00015,
  "optimizer": "AdamW"
}
// Validation Acc: 91% (a 9% improvement!)
Proper hyperparameter tuning found a better optimizer (AdamW), a lower learning rate with smoother convergence, a batch size better suited to the hardware, and a less aggressive dropout rate than the original 0.5.
Tools and Libraries
Optuna
Define-by-run API, pruning, visualization
Weights & Biases
Experiment tracking, hyperparameter sweeps
Ray Tune
Distributed tuning, many algorithms
Hyperopt
Bayesian optimization with TPE
Keras Tuner
Native Keras/TensorFlow integration
NNI
Microsoft's AutoML toolkit