Hyperparameters

Master the art of configuring machine learning models. Learn what hyperparameters are, why they matter, and how to find the optimal settings for your model.

What Are Hyperparameters?

Hyperparameters are the configuration settings of a machine learning algorithm that are set before training begins. Unlike model parameters (weights and biases) that are learned during training, hyperparameters control the training process itself and the structure of the model.

Parameters vs Hyperparameters

Parameters (Learned)

  • Neural network weights
  • Biases in each layer
  • Embeddings in NLP models
  • Learned from data via backpropagation

Hyperparameters (Set by You)

  • Learning rate
  • Batch size
  • Number of layers
  • Set before training starts

Why They Matter

Hyperparameters can make the difference between a model that doesn't learn at all and one that achieves state-of-the-art performance. They control:

  • Model Capacity: How complex a pattern the model can learn
  • Training Speed: How quickly the model converges
  • Generalization: How well the model performs on new data
  • Stability: Whether training converges or diverges

Key Insight: There's no universal set of “best” hyperparameters. The optimal values depend on your specific dataset, model architecture, and problem domain. This is why hyperparameter tuning is crucial for achieving good performance.

Common Hyperparameters Explained

Learning Rate (α)

Controls the step size during gradient descent. Too high and training diverges; too low and it takes forever.

w_{t+1} = w_t - \alpha \cdot \nabla L
Example settings: too high (1.0), good (0.001), too low (0.00001)

Common range: 0.0001 to 0.1
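
To make the update rule concrete, here is a minimal pure-Python sketch on a made-up quadratic loss; the loss, step count, and learning-rate values are illustrative, not from the article:

def run_gradient_descent(lr, steps=100, w=5.0):
    """Minimize the toy loss L(w) = 5 * w**2, whose gradient is 10 * w."""
    for _ in range(steps):
        grad = 10 * w          # gradient of the toy loss at the current weight
        w = w - lr * grad      # the update rule: w <- w - alpha * grad
    return w

for lr in (1.0, 0.001, 0.00001):   # the "too high / good / too low" settings above
    print(f"learning rate {lr}: final w = {run_gradient_descent(lr):.6g}")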

Batch Size

Number of samples processed before updating weights. Affects memory usage and gradient quality.

  • SGD (batch size 1): noisy gradients, fast per step
  • Mini-batch (32-512): balanced, a good trade-off
  • Full batch: stable gradients, memory heavy
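
To make the trade-off concrete, here is a small NumPy sketch of one epoch of mini-batch iteration; the dataset shape and batch size are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # toy dataset: 1000 samples, 20 features
y = rng.integers(0, 2, size=1000)

batch_size = 64                           # the hyperparameter under discussion
order = rng.permutation(len(X))           # shuffle once per epoch

for start in range(0, len(X), batch_size):
    idx = order[start:start + batch_size]
    X_batch, y_batch = X[idx], y[idx]
    # compute gradients on (X_batch, y_batch) and update the weights here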

Number of Epochs

How many times to iterate through the entire dataset. More isn't always better!

  • Too few: underfitting
  • Just right: optimal
  • Too many: overfitting

Use early stopping based on the validation loss
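
A minimal early-stopping loop is sketched below; the validation-loss numbers are simulated purely to make it runnable, and the patience value is illustrative:

# Stop training once validation loss has not improved for `patience` epochs.
val_losses = [0.9, 0.7, 0.6, 0.55, 0.56, 0.57, 0.58, 0.59, 0.60]  # simulated

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    # in practice: train one epoch, then measure val_loss on the validation set
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0        # new best: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch} (best val loss {best_val})")
            break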

Dropout Rate

Probability of dropping neurons during training to prevent overfitting.

Typical settings: 0% (no dropout), 20-50% (common), 80%+ (too much)

Visualization: Active vs dropped neurons
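
The mechanism can be sketched in a few lines of NumPy (this is inverted dropout, the variant most frameworks use; the 30% rate is from the common 20-50% range):

import numpy as np

def dropout(activations, rate=0.3, training=True, seed=0):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                     # dropout is disabled at inference time
    keep_prob = 1.0 - rate
    mask = np.random.default_rng(seed).random(activations.shape) < keep_prob
    return activations * mask / keep_prob      # rescale so expected activations are unchanged

print(dropout(np.ones((2, 5)), rate=0.3))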

Architecture Hyperparameters

Network Depth

  • Number of hidden layers
  • More layers = more capacity
  • Risk: vanishing gradients

Layer Width

  • Neurons per layer
  • Wider = more parameters
  • Balance with depth

Activation Functions

  • ReLU, Sigmoid, Tanh
  • Affects gradient flow
  • Problem-specific choice
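
These architecture choices are often exposed directly as constructor arguments. A PyTorch-style sketch (the helper and its argument names are illustrative, not a standard API):

import torch.nn as nn

def build_mlp(input_dim, output_dim, depth=3, width=128, activation=nn.ReLU):
    """Build an MLP whose depth, width, and activation are hyperparameters."""
    layers, in_features = [], input_dim
    for _ in range(depth):                 # network depth: number of hidden layers
        layers += [nn.Linear(in_features, width), activation()]
        in_features = width                # layer width: neurons per hidden layer
    layers.append(nn.Linear(in_features, output_dim))
    return nn.Sequential(*layers)

model = build_mlp(input_dim=20, output_dim=2, depth=4, width=64, activation=nn.Tanh)
print(model)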

Hyperparameter Tuning Methods

Grid Search

Exhaustively searches through a manually specified subset of hyperparameter space.

Advantages

  • Guaranteed to find best in grid
  • Simple to implement
  • Parallelizable

Disadvantages

  • Computationally expensive
  • Curse of dimensionality
  • May miss optimal values
grid_search.py (Python):

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'dropout': [0.2, 0.3, 0.5]
}
# Total: 3 × 3 × 3 = 27 combinations
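
One way to enumerate such a grid without extra libraries is itertools.product; the train_and_evaluate function below is a hypothetical stand-in for an actual training run:

from itertools import product

param_grid = {
    'learning_rate': [0.001, 0.01, 0.1],
    'batch_size': [16, 32, 64],
    'dropout': [0.2, 0.3, 0.5],
}

def train_and_evaluate(config):
    # stand-in for a real training run; replace with your own routine
    return -abs(config['learning_rate'] - 0.01) - abs(config['dropout'] - 0.3)

configs = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
best = max(configs, key=train_and_evaluate)    # evaluates all 3 × 3 × 3 = 27 combinations
print(best)
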
Random Search

Samples random combinations from specified distributions. Often more efficient than grid search.

Advantages

  • More efficient exploration
  • Can find unexpected optima
  • Easy to add more trials

Disadvantages

  • No guarantee of optimum
  • May sample poor regions
  • Results vary between runs

Research Finding: Random search often finds better hyperparameters than grid search with the same computational budget, especially when only a few hyperparameters truly matter.
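
A minimal random-search sketch in NumPy; the search ranges, trial budget, and scoring stand-in are illustrative (note the log-uniform sampling for the learning rate):

import numpy as np

rng = np.random.default_rng(42)

def sample_config():
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),      # log-uniform between 1e-4 and 1e-1
        'batch_size': int(rng.choice([16, 32, 64, 128])),
        'dropout': float(rng.uniform(0.1, 0.6)),
    }

def train_and_evaluate(config):
    # stand-in for a real training run; replace with your own routine
    return -abs(np.log10(config['learning_rate']) + 3) - abs(config['dropout'] - 0.3)

trials = [sample_config() for _ in range(30)]            # budget: 30 random trials
best = max(trials, key=train_and_evaluate)
print(best)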

Bayesian Optimization

Builds a probabilistic model of the objective function and uses it to select promising hyperparameters to evaluate.

How It Works

  1. Build a surrogate model (e.g., a Gaussian process)
  2. Use an acquisition function to choose the next point to try
  3. Evaluate the objective at the selected point
  4. Update the surrogate model and repeat
  • Best for: expensive models where only a few evaluations are affordable
  • Libraries: Optuna, Hyperopt (production-ready tools)
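
A short sketch using Optuna (listed above); its default TPE sampler is a form of Bayesian optimization. The search ranges and the synthetic objective are illustrative stand-ins for a real training run:

import optuna

def objective(trial):
    # hyperparameters to search; ranges are illustrative
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.6)

    # stand-in for training + validation; replace with your own routine
    return -abs(lr - 0.002) - abs(dropout - 0.35)

study = optuna.create_study(direction="maximize")   # Optuna builds the surrogate and picks points
study.optimize(objective, n_trials=50)
print(study.best_params)
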
Advanced Methods

Genetic Algorithms

Evolve populations of hyperparameters through selection, crossover, and mutation.

Good for: Complex spaces

Population Based Training

Train multiple models in parallel, periodically replacing poor performers.

Good for: Large scale

Hyperband

Adaptive resource allocation, stopping poor configurations early.

Good for: Limited budget

Neural Architecture Search

Automatically design neural network architectures using ML.

Good for: Novel architectures

Best Practices for Hyperparameter Tuning

Before You Start

  1. Start with defaults. Many frameworks provide sensible defaults; get a baseline first.
  2. Use a validation set. Never tune on test data; split into train/validation/test.
  3. Log everything. Track all experiments: hyperparameters, metrics, training curves.

During Tuning

  1. Use a log scale for the learning rate. Sample from [0.0001, 0.001, 0.01], not [0.001, 0.002, 0.003].
  2. Consider interactions. Learning rate and batch size often interact; tune them together.
  3. Use early stopping to save time. Don't waste resources on clearly bad configurations.

Common Pitfalls to Avoid

Data Leakage

Don't let test data influence hyperparameter choices

Overfitting to the Validation Set

Too much tuning can overfit to the validation set

Random Seeds

Use multiple seeds to ensure robustness

Learning Rate Schedules

Instead of using a fixed learning rate, schedules adjust it during training for better convergence.

Step Decay

Drop the learning rate by a factor of γ every s epochs

\alpha_t = \alpha_0 \cdot \gamma^{\lfloor t/s \rfloor}

Exponential Decay

Smooth exponential decrease

\alpha_t = \alpha_0 \cdot e^{-kt}

Cosine Annealing

Cosine-shaped decay

\alpha_t = \alpha_{min} + \frac{1}{2}(\alpha_{max} - \alpha_{min})(1 + \cos(\pi t/T))

Warm Restarts

Periodic resets for exploration

SGDR (stochastic gradient descent with warm restarts)
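
The three decay rules above translate directly into code; the constants (initial rate, decay factor, step size, period) are illustrative:

import math

def step_decay(t, alpha0=0.1, gamma=0.5, s=10):
    """Drop the learning rate by a factor gamma every s epochs."""
    return alpha0 * gamma ** (t // s)

def exponential_decay(t, alpha0=0.1, k=0.05):
    """Smooth exponential decrease."""
    return alpha0 * math.exp(-k * t)

def cosine_annealing(t, alpha_min=1e-5, alpha_max=0.1, T=100):
    """Cosine-shaped decay from alpha_max down to alpha_min over T epochs."""
    return alpha_min + 0.5 * (alpha_max - alpha_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 25, 50, 75, 100):
    print(t, step_decay(t), exponential_decay(t), round(cosine_annealing(t), 5))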

Practical Example: Tuning a CNN

Initial Hyperparameters

initial_config.json:

{
  "learning_rate": 0.01,
  "batch_size": 32,
  "dropout": 0.5,
  "weight_decay": 0.0001,
  "optimizer": "SGD"
}

Validation accuracy: 82%

After Bayesian Optimization

optimized_config.json:

{
  "learning_rate": 0.0023,
  "batch_size": 48,
  "dropout": 0.35,
  "weight_decay": 0.00015,
  "optimizer": "AdamW"
}

Validation accuracy: 91%

A 9-point improvement in validation accuracy!

Proper hyperparameter tuning found a better optimizer (AdamW), a lower learning rate with better convergence, a batch size better suited to the hardware, and a lower dropout rate (the original 0.5 was too aggressive).

Tools and Libraries

Optuna

Define-by-run API, pruning, visualization (Python; production-ready)

Weights & Biases

Experiment tracking, hyperparameter sweeps (cloud-based; collaborative)

Ray Tune

Distributed tuning, many algorithms (scalable; multi-framework)

Hyperopt

Bayesian optimization with TPE (mature; research-proven)

Keras Tuner

Native Keras/TensorFlow integration (simple API; TensorFlow)

NNI

Microsoft's AutoML toolkit (full-featured; enterprise)