Activation Functions

Explore different activation functions used in neural networks

Sigmoid

\sigma(x) = \frac{1}{1 + e^{-x}}

Squashes input to range (0, 1). Historically popular for binary classification.

Output: (0, 1) · Since 1990
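
A minimal NumPy sketch of the formula above (the function name and the choice of NumPy are illustrative, not from the original page):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into (0, 1)."""
    # For very negative x, np.exp(-x) may warn about overflow;
    # the result still correctly saturates toward 0.
    return 1.0 / (1.0 + np.exp(-x))
```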

Tanh

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}

Zero-centered version of sigmoid. Squashes input to range (-1, 1).

Output: (-1, 1) · Since 1991
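
A sketch assuming NumPy; `np.tanh` is the numerically stable built-in form of the definition above:

```python
import numpy as np

def tanh(x):
    # Same as (e^x - e^-x) / (e^x + e^-x), but without overflow for large |x|
    return np.tanh(x)

# Relation to the sigmoid above: tanh(x) == 2 * sigmoid(2 * x) - 1
```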

ReLU

\text{ReLU}(x) = \max(0, x)

Most popular activation for hidden layers. Simple and effective.

Output: [0, ∞) · Since 2000
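
A one-line NumPy sketch (naming is illustrative):

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x): passes positives through, zeroes out the rest
    return np.maximum(0.0, x)
```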

Leaky ReLU

\text{LReLU}(x) = \begin{cases} x & x > 0 \\ \alpha x & x \leq 0 \end{cases}

Addresses the dying-ReLU problem by allowing a small, non-zero gradient for negative inputs (typically α = 0.01).

Output: (-∞, ∞) · Since 2013
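
A NumPy sketch with the α from the description as a keyword argument (naming is illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small slope alpha on the negative side keeps gradients from dying
    return np.where(x > 0, x, alpha * x)
```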

ELU

\text{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}

Exponential Linear Unit - smooth negative values, mean activation closer to zero.

Output: (-α, ∞) · Since 2015
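
A NumPy sketch assuming the common default α = 1 (names are illustrative):

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; smooth saturation toward -alpha for x <= 0.
    # np.minimum keeps exp() from overflowing in the branch np.where discards.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))
```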

Swish

\text{Swish}(x) = x \cdot \sigma(x)

Self-gated activation discovered via neural architecture search; also known as SiLU.

Output: ≈(-0.28, ∞) · Since 2017
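
A NumPy sketch of x · σ(x) (the `swish` name is illustrative):

```python
import numpy as np

def swish(x):
    # x * sigmoid(x); this is the same function as SiLU (Swish with beta = 1)
    return x / (1.0 + np.exp(-x))
```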

GELU

\text{GELU}(x) = x \cdot \Phi(x)

Gaussian Error Linear Unit - used in BERT, GPT, and modern transformers.

Output: ≈(-0.17, ∞) · Since 2016
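
A sketch of the widely used tanh approximation to x · Φ(x); the exact form instead needs the Gaussian error function (e.g. scipy.special.erf):

```python
import numpy as np

def gelu(x):
    # tanh approximation of x * Phi(x), as used in many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```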

Softmax

\text{Softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}

Converts logits to probabilities for multi-class classification. Output sums to 1.

Output: (0, 1) · Since 1959
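
A NumPy sketch; subtracting the row maximum is the usual numerical-stability trick and does not change the result, since softmax is invariant to shifting all logits by a constant:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)     # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)  # values along `axis` sum to 1
```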

Mish

\text{Mish}(x) = x \cdot \tanh(\ln(1 + e^x))

Smooth, non-monotonic, self-regularized activation. Similar to Swish, and reported to outperform it on some vision benchmarks.

Output: ≈(-0.31, ∞) · Since 2019
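
A NumPy sketch; `np.logaddexp(0, x)` computes the softplus term ln(1 + eˣ) without overflow (naming is illustrative):

```python
import numpy as np

def mish(x):
    # x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)
    return x * np.tanh(np.logaddexp(0.0, x))
```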

Quick Comparison

Function | Output Range | Computational Cost | Gradient Flow | Common Use
Sigmoid | (0, 1) | High | Poor (vanishing) | Binary classification output
Tanh | (-1, 1) | High | Moderate | RNN hidden states
ReLU | [0, ∞) | Very Low | Good (positive) | CNN/deep-network hidden layers
Leaky ReLU | (-∞, ∞) | Very Low | Good | Alternative to ReLU
ELU | (-α, ∞) | Moderate | Good | Deep networks
Softmax | (0, 1), sums to 1 | High | Good | Multi-class output
GELU | ≈(-0.17, ∞) | Moderate | Excellent | Transformers (BERT, GPT)
Swish/SiLU | ≈(-0.28, ∞) | Moderate-High | Excellent | Computer vision
Mish | ≈(-0.31, ∞) | High | Excellent | YOLOv4, CV tasks

Choosing an Activation Function

For most cases (a combined sketch follows this list):

  • Hidden layers: Start with ReLU or its variants (Leaky ReLU, ELU)
  • Output layer: Sigmoid (binary), Softmax (multi-class), Linear (regression)
  • Transformers: GELU or Swish
  • RNNs: Tanh or sigmoid for gates
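
A toy NumPy forward pass wiring the hidden-layer and output-layer defaults together (the shapes, weights, and 3-class setup are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))           # 4 samples, 8 features (toy data)
W1 = 0.1 * rng.normal(size=(8, 16))   # hidden-layer weights
W2 = 0.1 * rng.normal(size=(16, 3))   # output-layer weights, 3 classes

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

hidden = relu(X @ W1)           # ReLU in the hidden layer
probs = softmax(hidden @ W2)    # softmax output for multi-class classification
print(probs.sum(axis=-1))       # each row of probabilities sums to 1
```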