Hyperparameters

Master the art of configuring machine learning models. Learn what hyperparameters are, why they matter, and how to find the optimal settings for your model.

What Are Hyperparameters?

Hyperparameters are the configuration settings of a machine learning algorithm that are set before training begins. Unlike model parameters (weights and biases) that are learned during training, hyperparameters control the training process itself and the structure of the model.

Parameters vs Hyperparameters

Parameters (Learned)

  • Neural network weights
  • Biases in each layer
  • Embeddings in NLP models
  • Learned from data via backpropagation

Hyperparameters (Set by You)

  • Learning rate
  • Batch size
  • Number of layers
  • Set before training starts

Why They Matter

Hyperparameters can make the difference between a model that doesn't learn at all and one that achieves state-of-the-art performance. They control:

  • Model Capacity: How complex a pattern the model can learn
  • Training Speed: How quickly the model converges
  • Generalization: How well the model performs on new data
  • Stability: Whether training converges or diverges

Key Insight: There's no universal set of “best” hyperparameters. The optimal values depend on your specific dataset, model architecture, and problem domain. This is why hyperparameter tuning is crucial for achieving good performance.

Common Hyperparameters Explained

Learning Rate (α)

Controls the step size during gradient descent. Too high and training diverges; too low and it takes forever.

w_{t+1} = w_t - \alpha \cdot \nabla L
Example settings: too high (1.0), good (0.001), too low (0.00001)

Common range: 0.0001 to 0.1
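
To make the update rule concrete, here is a minimal pure-Python sketch on a made-up quadratic loss; the loss, step count, and learning-rate values are illustrative, not from the article:

def run_gradient_descent(lr, steps=100, w=5.0):
    """Minimize the toy loss L(w) = 5 * w**2, whose gradient is 10 * w."""
    for _ in range(steps):
        grad = 10 * w          # gradient of the toy loss at the current weight
        w = w - lr * grad      # the update rule: w <- w - alpha * grad
    return w

for lr in (1.0, 0.001, 0.00001):   # the "too high / good / too low" settings above
    print(f"learning rate {lr}: final w = {run_gradient_descent(lr):.6g}")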

Batch Size

Number of samples processed before updating weights. Affects memory usage and gradient quality.

  • SGD (batch size 1): noisy gradients, fast per step
  • Mini-batch (32-512): balanced, a good trade-off
  • Full batch: stable gradients, memory heavy
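
To make the trade-off concrete, here is a small NumPy sketch of one epoch of mini-batch iteration; the dataset shape and batch size are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # toy dataset: 1000 samples, 20 features
y = rng.integers(0, 2, size=1000)

batch_size = 64                           # the hyperparameter under discussion
order = rng.permutation(len(X))           # shuffle once per epoch

for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]
    X_batch, y_batch = X[idx], y[idx]
    # compute gradients on (X_batch, y_batch) and update the weights here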

Number of Epochs

How many times to iterate through the entire dataset. More isn't always better!

  • Too few: underfitting
  • Just right: optimal
  • Too many: overfitting

Use early stopping based on the validation loss
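
A minimal early-stopping loop is sketched below; the validation-loss numbers are simulated purely to make it runnable, and the patience value is illustrative:

# Stop training once validation loss has not improved for `patience` epochs.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]  # simulated

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    # in practice: train one epoch, then measure val_loss on the validation set
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0        # new best: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch} (best val loss {best_val})")
            break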

Dropout Rate

Probability of dropping neurons during training to prevent overfitting.

Typical settings: 0% (no dropout), 20-50% (common), 80%+ (too much)

Visualization: Active vs dropped neurons
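
The mechanism can be sketched in a few lines of NumPy (this is inverted dropout, the variant most frameworks use; the 30% rate is from the common 20-50% range):

import numpy as np

def dropout(activations, rate=0.3, training=True, seed=0):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                     # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = np.random.default_rng(seed).random(activations.shape) < keep_prob
    return activations * mask / keep_prob      # rescale so expected activations are unchanged

print(dropout(np.ones((2, 5)), rate=0.3))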

Architecture Hyperparameters

Network Depth

  • Number of hidden layers
  • More layers = more capacity
  • Risk: vanishing gradients

Layer Width

  • Neurons per layer
  • Wider = more parameters
  • Balance with depth

Activation Functions

  • ReLU, Sigmoid, Tanh
  • Affects gradient flow
  • Problem-specific choice
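
These architecture choices are often exposed directly as constructor arguments. A PyTorch-style sketch (the helper and its argument names are illustrative, not a standard API):

import torch.nn as nn

def build_mlp(input_dim, output_dim, depth=3, width=128, activation=nn.ReLU):
    """Build an MLP whose depth, width, and activation are hyperparameters."""
    layers, in_features = [], input_dim
    for _ in range(depth):                 # network depth: number of hidden layers
        layers += [nn.Linear(in_features, width), activation()]
        in_features = width                # layer width: neurons per hidden layer
    layers.append(nn.Linear(in_features, output_dim))
    return nn.Sequential(*layers)

model = build_mlp(input_dim=20, output_dim=2, depth=4, width=64, activation=nn.Tanh)
print(model)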

Hyperparameter Tuning Methods

Grid Search

Exhaustively searches through a manually specified subset of hyperparameter space.

Advantages

  • Guaranteed to find best in grid
  • Simple to implement
  • Parallelizable

Disadvantages

  • Computationally expensive
  • Curse of dimensionality
  • May miss optimal values
grid_search.py (Python):

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'dropout': [0.2, 0.3, 0.5]
}
# Total: 3 × 3 × 3 = 27 combinations
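
One way to enumerate such a grid without extra libraries is itertools.product; the train_and_evaluate function below is a hypothetical stand-in for an actual training run:

from itertools import product

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'dropout': [0.2, 0.3, 0.5],
}

def train_and_evaluate(config):
    # stand-in for a real training run; replace with your own routine
    return -abs(config['learning_rate'] - 0.01) - abs(config['dropout'] - 0.3)

configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
best = max(configs, key=train_and_evaluate)    # evaluates all 3 × 3 × 3 = 27 combinations
print(best)
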
Random Search

Samples random combinations from specified distributions. Often more efficient than grid search.

Advantages

  • More efficient exploration
  • Can find unexpected optima
  • Easy to add more trials

Disadvantages

  • No guarantee of optimum
  • May sample poor regions
  • Results vary between runs

Research Finding: Random search often finds better hyperparameters than grid search with the same computational budget, especially when only a few hyperparameters truly matter.
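
A minimal random-search sketch in NumPy; the search ranges, trial budget, and scoring stand-in are illustrative (note the log-uniform sampling for the learning rate):

import numpy as np

rng = np.random.default_rng(42)

def sample_config():
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),      # log-uniform between 1e-4 and 1e-1
        'batch_size': int(rng.choice([16, 32, 64, 128])),
        'dropout': float(rng.uniform(0.1, 0.6)),
    }

def train_and_evaluate(config):
    # stand-in for a real training run; replace with your own routine
    return -abs(np.log10(config['learning_rate']) + 3) - abs(config['dropout'] - 0.3)

trials = [sample_config() for _ in range(30)]            # budget: 30 random trials
best = max(trials, key=train_and_evaluate)
print(best)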

Bayesian Optimization

Builds a probabilistic model of the objective function and uses it to select promising hyperparameters to evaluate.

How It Works

  1. Build a surrogate model (e.g., a Gaussian process)
  2. Use an acquisition function to choose the next point to try
  3. Evaluate the objective at the selected point
  4. Update the surrogate model and repeat
  • Best for: expensive models where only a few evaluations are affordable
  • Libraries: Optuna, Hyperopt (production-ready tools)
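
A short sketch using Optuna (listed above); its default TPE sampler is a form of Bayesian optimization. The search ranges and the synthetic objective are illustrative stand-ins for a real training run:

import optuna

def objective(trial):
    # hyperparameters to search; ranges are illustrative
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.6)

    # stand-in for training + validation; replace with your own routine
    return -abs(lr - 0.002) - abs(dropout - 0.35)

study = optuna.create_study(direction="maximize")   # Optuna builds the surrogate and picks points
study.optimize(objective, n_trials=50)
print(study.best_params)
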
Advanced Methods

Genetic Algorithms

Evolve populations of hyperparameters through selection, crossover, and mutation.

Good for: Complex spaces

Population Based Training

Train multiple models in parallel, periodically replacing poor performers.

Good for: Large scale

Hyperband

Adaptive resource allocation, stopping poor configurations early.

Good for: Limited budget

Neural Architecture Search

Automatically design neural network architectures using ML.

Good for: Novel architectures

Best Practices for Hyperparameter Tuning

Before You Start

  1. Start with defaults. Many frameworks provide sensible defaults; get a baseline first.
  2. Use a validation set. Never tune on test data; split into train/validation/test.
  3. Log everything. Track all experiments: hyperparameters, metrics, training curves.

During Tuning

  1. Use a log scale for the learning rate. Sample from [0.0001, 0.001, 0.01], not [0.001, 0.002, 0.003].
  2. Consider interactions. Learning rate and batch size often interact; tune them together.
  3. Use early stopping to save time. Don't waste resources on clearly bad configurations.

Common Pitfalls to Avoid

Data Leakage

Don't let test data influence hyperparameter choices

Overfitting to the Validation Set

Too much tuning can overfit to the validation set

Random Seeds

Use multiple seeds to ensure robustness

Learning Rate Schedules

Instead of using a fixed learning rate, schedules adjust it during training for better convergence.

Step Decay

Drop the learning rate by a factor of γ every s epochs

\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}

Exponential Decay

Smooth exponential decrease

\alpha_t = \alpha_0 \cdot e^{-kt}

Cosine Annealing

Cosine-shaped decay

\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\pi t/T))

Warm Restarts

Periodic resets for exploration

SGDR (stochastic gradient descent with warm restarts)
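
The three decay rules above translate directly into code; the constants (initial rate, decay factor, step size, period) are illustrative:

import math

def step_decay(t, alpha0=0.1, gamma=0.5, s=10):
    """Drop the learning rate by a factor gamma every s epochs."""
    return alpha0 * gamma ** (t // s)

def exponential_decay(t, alpha0=0.1, k=0.05):
    """Smooth exponential decrease."""
    return alpha0 * math.exp(-k * t)

def cosine_annealing(t, alpha_min=1e-5, alpha_max=0.1, T=100):
    """Cosine-shaped decay from alpha_max down to alpha_min over T epochs."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(t, step_decay(t), exponential_decay(t), round(cosine_annealing(t), 5))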

Practical Example: Tuning a CNN

Initial Hyperparameters

initial_config.json:

{
  "learning_rate": 0.01,
  "batch_size": 32,
  "dropout": 0.5,
  "weight_decay": 0.0001,
  "optimizer": "SGD"
}

Validation accuracy: 82%

After Bayesian Optimization

optimized_config.json:

{
  "learning_rate": 0.0023,
  "batch_size": 48,
  "dropout": 0.35,
  "weight_decay": 0.00015,
  "optimizer": "AdamW"
}

Validation accuracy: 91%

A 9-point improvement in validation accuracy!

Proper hyperparameter tuning found a better optimizer (AdamW), a lower learning rate with better convergence, a batch size better suited to the hardware, and a lower dropout rate (the original 0.5 was too aggressive).

Tools and Libraries

Optuna

Define-by-run API, pruning, visualization (Python; production-ready)

Weights & Biases

Experiment tracking, hyperparameter sweeps (cloud-based; collaborative)

Ray Tune

Distributed tuning, many algorithms (scalable; multi-framework)

Hyperopt

Bayesian optimization with TPE (mature; research-proven)

Keras Tuner

Native Keras/TensorFlow integration (simple API; TensorFlow)

NNI

Microsoft's AutoML toolkit (full-featured; enterprise)