Activation Functions

Explore different activation functions used in neural networks

Sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}

Squashes input to range (0, 1). Historically popular for binary classification.

Output: (0, 1) · Since 1990
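
A minimal NumPy sketch of the formula above (the function name and the choice of NumPy are illustrative, not from the original page):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    # For very negative x, np.exp(-x) may warn about overflow;
    # the result still correctly saturates toward 0.
    return 1.0 / (1.0 + np.exp(-x))
```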

Tanh

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Zero-centered version of sigmoid. Squashes input to range (-1, 1).

Output: (-1, 1) · Since 1991
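
A sketch assuming NumPy; `np.tanh` is the numerically stable built-in form of the definition above:

```python
import numpy as np

def tanh(x):
    # Same as (e^x - e^-x) / (e^x + e^-x), but without overflow for large |x|
    return np.tanh(x)

# Relation to the sigmoid above: tanh(x) == 2 * sigmoid(2 * x) - 1
```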

ReLU

\text{ReLU}(x) = \max(0, x)

Most popular activation for hidden layers. Simple and effective.

Output: [0, ∞) · Since 2000
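
A one-line NumPy sketch (naming is illustrative):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): passes positives through, zeroes out the rest
    return np.maximum(0.0, x)
```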

Leaky ReLU

\text{LReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}

Addresses the dying-ReLU problem by allowing a small, non-zero gradient for negative inputs (typically α = 0.01).

Output: (-∞, ∞) · Since 2013
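
A NumPy sketch with the α from the description as a keyword argument (naming is illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha on the negative side keeps gradients from dying
    return np.where(x > 0, x, alpha * x)
```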

ELU

\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}

Exponential Linear Unit - smooth negative values, mean activation closer to zero.

Output: (-α, ∞) · Since 2015
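
A NumPy sketch assuming the common default α = 1 (names are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; smooth saturation toward -alpha for x <= 0.
    # np.minimum keeps exp() from overflowing in the branch np.where discards.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))
```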

Swish

\text{Swish}(x) = x \cdot \sigma(x)

Self-gated activation discovered via neural architecture search; also known as SiLU.

Output: ≈(-0.28, ∞) · Since 2017
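
A NumPy sketch of x · σ(x) (the `swish` name is illustrative):

```python
import numpy as np

def swish(x):
    # x * sigmoid(x); this is the same function as SiLU (Swish with beta = 1)
    return x / (1.0 + np.exp(-x))
```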

GELU

\text{GELU}(x) = x \cdot \Phi(x)

Gaussian Error Linear Unit - used in BERT, GPT, and modern transformers.

Output: ≈(-0.17, ∞) · Since 2016
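
A sketch of the widely used tanh approximation to x · Φ(x); the exact form instead needs the Gaussian error function (e.g. scipy.special.erf):

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), as used in many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```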

Softmax

\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Converts logits to probabilities for multi-class classification. Output sums to 1.

Output: (0, 1) · Since 1959
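
A NumPy sketch; subtracting the row maximum is the usual numerical-stability trick and does not change the result, since softmax is invariant to shifting all logits by a constant:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)  # values along `axis` sum to 1
```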

Mish

\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))

Smooth, non-monotonic, self-regularized activation. Similar to Swish, and reported to outperform it on some vision benchmarks.

Output: ≈(-0.31, ∞) · Since 2019
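
A NumPy sketch; `np.logaddexp(0, x)` computes the softplus term ln(1 + eˣ) without overflow (naming is illustrative):

```python
import numpy as np

def mish(x):
    # x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.logaddexp(0.0, x))
```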

Quick Comparison

Function | Output Range | Computational Cost | Gradient Flow | Common Use
Sigmoid | (0, 1) | High | Poor (vanishing) | Binary classification output
Tanh | (-1, 1) | High | Moderate | RNN hidden states
ReLU | [0, ∞) | Very Low | Good (positive) | CNN/deep-network hidden layers
Leaky ReLU | (-∞, ∞) | Very Low | Good | Alternative to ReLU
ELU | (-α, ∞) | Moderate | Good | Deep networks
Softmax | (0, 1), sums to 1 | High | Good | Multi-class output
GELU | ≈(-0.17, ∞) | Moderate | Excellent | Transformers (BERT, GPT)
Swish/SiLU | ≈(-0.28, ∞) | Moderate-High | Excellent | Computer vision
Mish | ≈(-0.31, ∞) | High | Excellent | YOLOv4, CV tasks

Choosing an Activation Function

For most cases (a combined sketch follows this list):

  • Hidden layers: Start with ReLU or its variants (Leaky ReLU, ELU)
  • Output layer: Sigmoid (binary), Softmax (multi-class), Linear (regression)
  • Transformers: GELU or Swish
  • RNNs: Tanh or sigmoid for gates
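
A toy NumPy forward pass wiring the hidden-layer and output-layer defaults together (the shapes, weights, and 3-class setup are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))           # 4 samples, 8 features (toy data)
W1 = 0.1 * rng.normal(size=(8, 16))   # hidden-layer weights
W2 = 0.1 * rng.normal(size=(16, 3))   # output-layer weights, 3 classes

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

hidden = relu(X @ W1)           # ReLU in the hidden layer
probs = softmax(hidden @ W2)    # softmax output for multi-class classification
print(probs.sum(axis=-1))       # each row of probabilities sums to 1
```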