Attention & Feed-Forward Networks

Explore self-attention mechanisms and feed-forward transformations


Model Statistics

Attention Heads: 8 (parallel attention patterns)
FFN Hidden Size: 2048 (4x expansion in the FFN)
Parameters: 2.10M (attention + FFN weights)

Self-Attention Weights Matrix

[Attention heatmap over the example sentence "The cat sat on the mat"; darker blue indicates stronger attention between tokens]
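A minimal sketch of how an attention-weight matrix like this one is computed. The six tokens match the example sentence and d_model = 512 is assumed from the 4x FFN expansion; the random embeddings and projection matrices are purely illustrative stand-ins for a trained model's learned parameters.

```python
# Scaled dot-product attention over the six example tokens (illustrative only).
import torch
import torch.nn.functional as F

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model = 512

torch.manual_seed(0)
x = torch.randn(len(tokens), d_model)            # (seq_len, d_model) token embeddings
W_q = torch.randn(d_model, d_model) / d_model ** 0.5   # stand-in query projection
W_k = torch.randn(d_model, d_model) / d_model ** 0.5   # stand-in key projection

q, k = x @ W_q, x @ W_k                          # project tokens to queries / keys
scores = q @ k.T / d_model ** 0.5                # scaled dot products
weights = F.softmax(scores, dim=-1)              # each row sums to 1

print(weights.shape)   # torch.Size([6, 6]) -- one row of attention per token
```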

FFN Activation Functions
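The FFN in this block uses GELU (step 3 of the flow below). A quick, illustrative way to see how GELU differs from ReLU is to evaluate both on a few inputs:

```python
# GELU ≈ x * Φ(x): small negative inputs pass through with reduced weight,
# whereas ReLU zeroes them out entirely.
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
print("relu:", F.relu(x))
print("gelu:", F.gelu(x))
```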

Layer Normalization
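Layer normalization standardizes each token's feature vector to roughly zero mean and unit variance, then applies learned scale and shift parameters. A minimal sketch, assuming d_model = 512 as above:

```python
# LayerNorm normalizes over the feature dimension of each token independently,
# then applies a learned elementwise scale (gamma) and shift (beta).
import torch
import torch.nn as nn

d_model = 512
x = torch.randn(2, 6, d_model)       # (batch, seq_len, d_model)

ln = nn.LayerNorm(d_model)
y = ln(x)

print(y.mean(dim=-1).abs().max())    # ~0: each token vector is zero-mean
print(y.std(dim=-1).mean())          # ~1: and roughly unit variance
```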

Multi-Head Attention Distribution

Each attention head learns different patterns and relationships between tokens
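One way to inspect per-head patterns like those in the distribution above is to ask PyTorch's nn.MultiheadAttention for unaveraged attention weights. This is a hypothetical example, not the demo's own code:

```python
# Sketch: extract one attention matrix per head so the heads can be compared.
# average_attn_weights=False returns per-head weights instead of their mean.
import torch
import torch.nn as nn

d_model, n_heads, seq_len = 512, 8, 6
x = torch.randn(1, seq_len, d_model)                 # (batch, seq, d_model)

mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
_, attn = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(attn.shape)   # torch.Size([1, 8, 6, 6]) -- 8 heads, one 6x6 pattern each
```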

Transformer Block Flow

1. Multi-Head Attention: Input → Q, K, V projections → Scaled dot-product attention → Concatenate heads
2. Add & Norm: Residual connection + Layer normalization
3. Feed-Forward Network: Linear → GELU activation → Linear → Dropout
4. Add & Norm: Another residual connection + Layer normalization
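A minimal sketch of this flow in PyTorch, using the post-norm ordering listed above. The dimensions (d_model = 512, 8 heads, FFN hidden size 2048) follow the statistics table; all names are illustrative rather than taken from the demo's implementation.

```python
# Post-norm transformer block: attention -> add & norm -> FFN -> add & norm.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                 # Linear -> GELU -> Linear -> Dropout
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Steps 1-2: multi-head self-attention, residual connection, LayerNorm
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)
        # Steps 3-4: feed-forward network, residual connection, LayerNorm
        x = self.norm2(x + self.ffn(x))
        return x

block = TransformerBlock()
out = block(torch.randn(2, 6, 512))   # (batch, seq_len, d_model)
print(out.shape)                      # torch.Size([2, 6, 512])
```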

Understanding Attention & FFN

Self-attention allows the model to weigh the importance of every other token when processing each position. The feed-forward network then applies the same position-wise non-linear transformation to each token, expanding to the hidden size (2048 here) and projecting back to create richer representations. Multiple attention heads let the model attend to different types of relationships simultaneously.
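With the configuration above, each of the 8 heads operates in its own 512 / 8 = 64-dimensional subspace. A shapes-only, illustrative sketch of the reshaping behind "multiple heads":

```python
# Splitting d_model = 512 into 8 heads of size 64: each head attends over the
# same token positions, but in its own 64-dimensional subspace.
import torch

batch, seq_len, d_model, n_heads = 2, 6, 512, 8
d_head = d_model // n_heads                        # 64 dimensions per head

x = torch.randn(batch, seq_len, d_model)
heads = x.view(batch, seq_len, n_heads, d_head).transpose(1, 2)
print(heads.shape)    # torch.Size([2, 8, 6, 64]) -- (batch, heads, seq, d_head)
```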