Attention & Feed-Forward Networks
Explore self-attention mechanisms and feed-forward transformations
Model Statistics
| Metric | Value | Description |
|---|---|---|
| Attention Heads | 8 | Parallel attention patterns |
| FFN Hidden Size | 2048 | 4x expansion in FFN |
| Parameters | 2.10M | Attention + FFN weights |
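As a rough sanity check on these numbers, here is a minimal parameter-count sketch. It assumes a model dimension of 512 (inferred from the 2048 FFN size and the 4x expansion) and ignores biases; exact totals depend on counting conventions, so the figures above may be tallied slightly differently.

```python
# Hypothetical configuration inferred from the statistics above, not taken
# verbatim from this page: d_model is assumed to be d_ff / 4.
d_model = 512    # model (embedding) dimension, assumed
n_heads = 8      # attention heads
d_ff = 2048      # FFN hidden size (4x expansion)

# Multi-head attention: four d_model x d_model projection matrices (Q, K, V, output).
attn_params = 4 * d_model * d_model

# Feed-forward network: two weight matrices, d_model -> d_ff and d_ff -> d_model.
ffn_params = 2 * d_model * d_ff

print(f"attention weights: {attn_params:,}")              # 1,048,576
print(f"FFN weights:       {ffn_params:,}")               # 2,097,152 (about 2.10M)
print(f"combined:          {attn_params + ffn_params:,}")
```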
Self-Attention Weights Matrix
Attention heatmap over the tokens "The", "cat", "sat", "on", "the", "mat" (a 6 × 6 grid of weights); darker blue indicates stronger attention between tokens.
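The heatmap is a visualization of scaled dot-product attention weights. Below is a minimal NumPy sketch of how such a 6 × 6 matrix is produced; it uses random toy embeddings and projection matrices rather than this page's trained model, so the actual values will differ.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat"]
d_model, d_k = 512, 64                       # assumed sizes (d_model / 8 heads = 64)

rng = np.random.default_rng(0)
x = rng.normal(size=(len(tokens), d_model))  # toy token embeddings
w_q = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)
w_k = rng.normal(size=(d_model, d_k)) / np.sqrt(d_model)

q, k = x @ w_q, x @ w_k                      # query and key projections
scores = q @ k.T / np.sqrt(d_k)              # scaled dot products, shape (6, 6)

# Softmax over each row: how strongly each token attends to every token.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))                  # rows sum to 1; a darker cell in the
                                             # heatmap corresponds to a larger value
```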
FFN Activation Functions
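A minimal sketch of GELU, the activation named in the block flow below, using the widely used tanh approximation:

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU: roughly x * Phi(x), where Phi is the
    # standard normal CDF. Unlike ReLU, small negative inputs are damped
    # rather than zeroed out.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(np.round(gelu(xs), 3))   # approximately [-0.004, -0.159, 0.0, 0.841, 2.996]
```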
Layer Normalization
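A minimal sketch of layer normalization as used in the Add & Norm steps below: each token vector is normalized to zero mean and unit variance across its features, then scaled and shifted by learned parameters (shown here as identity values).

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (one token's vector) over the feature dimension,
    # then apply the learned elementwise scale (gamma) and shift (beta).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(1).normal(size=(6, 512))    # 6 tokens, assumed d_model = 512
gamma, beta = np.ones(512), np.zeros(512)             # identity scale and shift
y = layer_norm(x, gamma, beta)

print(y.mean(axis=-1).round(3))   # ~0 for every token
print(y.std(axis=-1).round(3))    # ~1 for every token
```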
Multi-Head Attention Distribution
Each attention head learns different patterns and relationships between tokens
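A minimal sketch of the head split, assuming d_model = 512 so that each of the 8 heads operates in its own 64-dimensional subspace; because every head has its own slice of the queries and keys, each one produces a different 6 × 6 attention pattern.

```python
import numpy as np

n_tokens, d_model, n_heads = 6, 512, 8
d_head = d_model // n_heads                      # 64 dimensions per head

rng = np.random.default_rng(2)
q = rng.normal(size=(n_tokens, d_model))         # toy projected queries
k = rng.normal(size=(n_tokens, d_model))         # toy projected keys

# Reshape to (n_heads, n_tokens, d_head): each head sees a different slice.
q_h = q.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)
k_h = k.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

scores = q_h @ k_h.transpose(0, 2, 1) / np.sqrt(d_head)   # shape (8, 6, 6)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

# For each head and each query position, which token gets the most attention?
print(weights.argmax(axis=-1))                   # shape (8, 6); rows differ per head
```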
Transformer Block Flow
1. Multi-Head Attention: Input → Q, K, V projections → Scaled dot-product attention → Concatenate heads
2. Add & Norm: Residual connection + Layer normalization
3. Feed-Forward Network: Linear → GELU activation → Linear → Dropout
4. Add & Norm: Another residual connection + Layer normalization
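Putting the four steps above together, here is a minimal NumPy forward pass through one block with random, untrained weights. Biases and dropout are omitted, and the post-norm ordering shown is one common convention rather than necessarily the exact layout used by this demo.

```python
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_model, d_ff, n_heads = 6, 512, 2048, 8
d_head = d_model // n_heads

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = rng.normal(size=(n_tokens, d_model))          # block input

# 1. Multi-head attention: Q, K, V projections -> per-head scaled dot-product -> concat.
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4))

def split_heads(t):                               # (tokens, d_model) -> (heads, tokens, d_head)
    return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
per_head = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head)) @ v    # (heads, tokens, d_head)
concat = per_head.transpose(1, 0, 2).reshape(n_tokens, d_model)       # concatenate heads
attn_out = concat @ w_o                                               # output projection

# 2. Add & Norm: residual connection + layer normalization.
x = layer_norm(x + attn_out)

# 3. Feed-forward network: Linear -> GELU -> Linear (dropout skipped at inference).
w1 = rng.normal(size=(d_model, d_ff)) / np.sqrt(d_model)
w2 = rng.normal(size=(d_ff, d_model)) / np.sqrt(d_ff)
ffn_out = gelu(x @ w1) @ w2

# 4. Add & Norm: second residual connection + layer normalization.
x = layer_norm(x + ffn_out)

print(x.shape)   # (6, 512): same shape in and out, so blocks can be stacked
```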
Understanding Attention & FFN
Self-attention allows the model to weigh the importance of every other token when processing each position. The feed-forward network then applies the same two-layer non-linear transformation to each position independently, expanding to the hidden size and projecting back to the model dimension to build a richer representation. Multiple attention heads let the model attend to different types of relationships simultaneously, because each head operates on its own slice of the representation.
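A tiny worked example of that weighing: the output at a position is the attention-weighted average of the value vectors, so the token that receives the largest weight dominates the result.

```python
import numpy as np

# Toy value vectors for three tokens (2-dimensional for readability).
values = np.array([[1.0, 0.0],    # token A
                   [0.0, 1.0],    # token B
                   [1.0, 1.0]])   # token C

# Attention weights for a single query position (they sum to 1).
weights = np.array([0.1, 0.7, 0.2])

output = weights @ values
print(output)   # [0.3 0.9]: closest to token B, which received the largest weight
```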