Attention & Feed-Forward Networks

Explore self-attention mechanisms and feed-forward transformations


Model Statistics

Attention Heads: 8 (parallel attention patterns)
FFN Hidden Size: 2048 (4x expansion in the FFN)
Parameters: 2.10M (attention + FFN weights)

Self-Attention Weights Matrix

[Attention heatmap over the example sentence "The cat sat on the mat"; darker blue indicates stronger attention between tokens]
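A minimal sketch of how an attention-weight matrix like this one is computed. The six tokens match the example sentence and d_model = 512 is assumed from the 4x FFN expansion; the random embeddings and projection matrices are purely illustrative stand-ins for a trained model's learned parameters.

```python
# Scaled dot-product attention over the six example tokens (illustrative only).
import torch
import torch.nn.functional as F

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 512

torch.manual_seed(0)
x = torch.randn(len(tokens), d_model)            # (seq_len, d_model) token embeddings
W_q = torch.randn(d_model, d_model) / d_model ** 0.5   # stand-in query projection
W_k = torch.randn(d_model, d_model) / d_model ** 0.5   # stand-in key projection

q, k = x @ W_q, x @ W_k                          # project tokens to queries / keys
scores = q @ k.T / d_model ** 0.5                # scaled dot products
weights = F.softmax(scores, dim=-1)              # each row sums to 1

print(weights.shape)   # torch.Size([6, 6]) -- one row of attention per token
```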

FFN Activation Functions
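The FFN in this block uses GELU (step 3 of the flow below). A quick, illustrative way to see how GELU differs from ReLU is to evaluate both on a few inputs:

```python
# GELU ≈ x * Φ(x): small negative inputs pass through with reduced weight,
# whereas ReLU zeroes them out entirely.
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print("relu:", F.relu(x))
print("gelu:", F.gelu(x))
```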

Layer Normalization
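Layer normalization standardizes each token's feature vector to roughly zero mean and unit variance, then applies learned scale and shift parameters. A minimal sketch, assuming d_model = 512 as above:

```python
# LayerNorm normalizes over the feature dimension of each token independently,
# then applies a learned elementwise scale (gamma) and shift (beta).
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(2, 6, d_model)       # (batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)
y = ln(x)

print(y.mean(dim=-1).abs().max())    # ~0: each token vector is zero-mean
print(y.std(dim=-1).mean())          # ~1: and roughly unit variance
```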

Multi-Head Attention Distribution

Each attention head learns different patterns and relationships between tokens
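One way to inspect per-head patterns like those in the distribution above is to ask PyTorch's nn.MultiheadAttention for unaveraged attention weights. This is a hypothetical example, not the demo's own code:

```python
# Sketch: extract one attention matrix per head so the heads can be compared.
# average_attn_weights=False returns per-head weights instead of their mean.
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 512, 8, 6
x = torch.randn(1, seq_len, d_model)                 # (batch, seq, d_model)

mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
_, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(attn.shape)   # torch.Size([1, 8, 6, 6]) -- 8 heads, one 6x6 pattern each
```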

Transformer Block Flow

1. Multi-Head Attention: Input → Q, K, V projections → Scaled dot-product attention → Concatenate heads
2. Add & Norm: Residual connection + Layer normalization
3. Feed-Forward Network: Linear → GELU activation → Linear → Dropout
4. Add & Norm: Another residual connection + Layer normalization
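A minimal sketch of this flow in PyTorch, using the post-norm ordering listed above. The dimensions (d_model = 512, 8 heads, FFN hidden size 2048) follow the statistics table; all names are illustrative rather than taken from the demo's implementation.

```python
# Post-norm transformer block: attention -> add & norm -> FFN -> add & norm.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                 # Linear -> GELU -> Linear -> Dropout
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Steps 1-2: multi-head self-attention, residual connection, LayerNorm
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Steps 3-4: feed-forward network, residual connection, LayerNorm
        x = self.norm2(x + self.ffn(x))
        return x

block = TransformerBlock()
out = block(torch.randn(2, 6, 512))   # (batch, seq_len, d_model)
print(out.shape)                      # torch.Size([2, 6, 512])
```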

Understanding Attention & FFN

Self-attention allows the model to weigh the importance of every other token when processing each position. The feed-forward network then applies the same position-wise non-linear transformation to each token, expanding to the hidden size (2048 here) and projecting back to create richer representations. Multiple attention heads let the model attend to different types of relationships simultaneously.
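With the configuration above, each of the 8 heads operates in its own 512 / 8 = 64-dimensional subspace. A shapes-only, illustrative sketch of the reshaping behind "multiple heads":

```python
# Splitting d_model = 512 into 8 heads of size 64: each head attends over the
# same token positions, but in its own 64-dimensional subspace.
import torch

batch, seq_len, d_model, n_heads = 2, 6, 512, 8
d_head = d_model // n_heads                        # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
print(heads.shape)    # torch.Size([2, 8, 6, 64]) -- (batch, heads, seq, d_head)
```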