Activation Functions
An overview of the activation functions commonly used in neural networks
Sigmoid
Squashes input to range (0, 1). Historically popular for binary classification.
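A minimal NumPy sketch of the formula, σ(x) = 1 / (1 + e⁻ˣ); the function name and sample inputs are just for illustration:

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + exp(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # ≈ [0.018, 0.5, 0.982]
```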
Tanh
Zero-centered version of sigmoid. Squashes input to range (-1, 1).
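A quick sketch showing the zero-centered range; NumPy ships tanh directly:

```python
import numpy as np

# tanh(x) = 2*sigmoid(2x) - 1: same S-shape as sigmoid, but centered at 0 with range (-1, 1)
x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))  # ≈ [-0.964, 0.0, 0.964]
```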
ReLU
Most popular activation for hidden layers. Simple and effective.
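A minimal sketch of ReLU, which is simply max(0, x):

```python
import numpy as np

def relu(x):
    # passes positive inputs through unchanged and zeroes out negatives
    return np.maximum(0.0, x)

print(relu(np.array([-1.5, 0.0, 2.0])))  # [0. 0. 2.]
```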
Leaky ReLU
Addresses the dying-ReLU problem by giving negative inputs a small non-zero slope (typically α = 0.01) rather than a zero gradient.
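A sketch of the variant with the slope α exposed as a parameter (0.01 is the conventional default):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identical to ReLU for x > 0; negative inputs keep a small slope alpha
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 3.0])))  # [-0.02  3.  ]
```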
ELU
Exponential Linear Unit - smooth negative values, mean activation closer to zero.
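A sketch of ELU with the usual default α = 1.0:

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for x > 0; alpha * (exp(x) - 1) for x <= 0, saturating smoothly at -alpha
    # (clipping before expm1 just avoids overflow warnings in the unused branch)
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

print(elu(np.array([-2.0, 0.0, 2.0])))  # ≈ [-0.865, 0.0, 2.0]
```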
Swish
Self-gated activation, x · sigmoid(x), found via neural architecture search; also known as SiLU.
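A sketch of the self-gating: the input is multiplied by its own sigmoid:

```python
import numpy as np

def swish(x):
    # x * sigmoid(x); also called SiLU
    return x / (1.0 + np.exp(-x))

print(swish(np.array([-1.0, 0.0, 1.0])))  # ≈ [-0.269, 0.0, 0.731]
```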
GELU
Gaussian Error Linear Unit - used in BERT, GPT, and modern transformers.
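A sketch using the widely used tanh approximation; the exact definition is x · Φ(x), with Φ the standard normal CDF:

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), as used in many transformer codebases
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))  # ≈ [-0.159, 0.0, 0.841]
```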
Softmax
Converts logits to probabilities for multi-class classification. Output sums to 1.
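A sketch with the standard max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(logits):
    # subtracting the max does not change the result but prevents exp overflow
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # ≈ [0.659, 0.242, 0.099], sums to 1
```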
Mish
Smooth, non-monotonic, self-regularized activation. Similar in shape to Swish and sometimes reported to outperform it.
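A sketch of Mish, x · tanh(softplus(x)), using logaddexp for a stable softplus:

```python
import numpy as np

def mish(x):
    # softplus(x) = log(1 + exp(x)); np.logaddexp(0, x) computes it stably
    return x * np.tanh(np.logaddexp(0.0, x))

print(mish(np.array([-1.0, 0.0, 1.0])))  # ≈ [-0.303, 0.0, 0.865]
```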
Quick Comparison
| Function | Output Range | Computational Cost | Gradient Flow | Common Use |
|---|---|---|---|---|
| Sigmoid | (0, 1) | High | Poor (vanishing) | Binary classification output |
| Tanh | (-1, 1) | High | Moderate | RNN hidden states |
| ReLU | [0, ∞) | Very Low | Good (positive inputs) | CNN/deep network hidden layers |
| Leaky ReLU | (-∞, ∞) | Very Low | Good | Alternative to ReLU |
| ELU | (-α, ∞) | Moderate | Good | Deep networks |
| Softmax | (0, 1), sums to 1 | High | Good | Multi-class output |
| GELU | ≈(-0.17, ∞) | Moderate | Excellent | Transformers (BERT, GPT) |
| Swish/SiLU | ≈(-0.28, ∞) | Moderate-High | Excellent | Computer vision |
| Mish | ≈(-0.31, ∞) | High | Excellent | YOLOv4, CV tasks |
Choosing an Activation Function
For most cases (see the sketch after this list):
- Hidden layers: Start with ReLU or its variants (Leaky ReLU, ELU)
- Output layer: Sigmoid (binary), Softmax (multi-class), Linear (regression)
- Transformers: GELU or Swish
- RNNs: Tanh for hidden/candidate states, sigmoid for gates
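As a concrete illustration of these defaults, here is a minimal PyTorch sketch (the layer sizes are arbitrary placeholders): ReLU in the hidden layer, with the output activation matched to the task.

```python
import torch
import torch.nn as nn

binary_clf = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),        # binary classification: probability in (0, 1)
)

multiclass_clf = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 5), nn.Softmax(dim=-1),  # multi-class: probabilities summing to 1
)                                          # (omit Softmax if training with nn.CrossEntropyLoss)

regressor = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 1),                      # regression: linear output, no activation
)

x = torch.randn(4, 16)
print(binary_clf(x).shape, multiclass_clf(x).sum(dim=-1), regressor(x).shape)
```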