Flash Attention
Revolutionizing Transformer Efficiency Through Memory-Aware Algorithms
Introduction
Flash Attention represents one of the most significant algorithmic breakthroughs in modern deep learning, fundamentally changing how we approach the computational bottleneck of attention mechanisms in transformers. Developed by Tri Dao, Daniel Fu, Stefano Ermon, Atri Rudra, and Christopher Ré (Stanford University and the University at Buffalo), this innovation addresses the quadratic memory complexity that has long plagued large-scale transformer models.
The transformer architecture's attention mechanism, while powerful, suffers from a fundamental scaling problem: memory usage grows quadratically with sequence length. For a sequence of length N, standard attention requires O(N²) memory to store the attention matrix. Flash Attention solves this through a mathematically elegant approach that maintains identical outputs while reducing memory complexity to linear scaling.
Flash Attention achieves exact attention computation with linear memory by leveraging GPU memory hierarchy and online softmax algorithms.
The Memory Wall Problem
Traditional attention computation follows a straightforward but memory-intensive process. Given query (Q), key (K), and value (V) matrices of dimensions [N, d], the standard algorithm computes:
1. Score Matrix: S = QK^T, resulting in an [N×N] matrix
2. Softmax: P = softmax(S), maintaining [N×N] dimensions
3. Output: O = PV, producing the [N×d] output
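For reference, a minimal NumPy version of these three steps (the sizes here are illustrative) makes the [N×N] intermediate explicit; the 1/√d scaling is the usual attention normalization:

import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])        # 1. score matrix, shape [N, N]
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)        # 2. softmax, still an [N, N] array in memory
    return P @ V                              # 3. output, shape [N, d]

N, d = 4096, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))
O = standard_attention(Q, K, V)               # materializes a full 4096 x 4096 score matrix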
A 64K sequence requires storing ~4 billion floating-point numbers for the attention matrix alone, consuming 16 GB of memory (FP32) for just one attention layer!
| Sequence Length | Attention Matrix Size | Memory (FP32) | Memory (FP16) |
|---|---|---|---|
| 2K | 4M elements | 16 MB | 8 MB |
| 8K | 64M elements | 256 MB | 128 MB |
| 32K | 1B elements | 4 GB | 2 GB |
| 64K | 4B elements | 16 GB | 8 GB |
| 128K | 16B elements | 64 GB | 32 GB |
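The table entries follow directly from N² elements times bytes per element; a quick check:

def attn_matrix_mib(seq_len, bytes_per_element=4):   # 4 bytes for FP32, 2 for FP16
    return seq_len ** 2 * bytes_per_element / 2 ** 20

for n in (2_048, 8_192, 32_768, 65_536, 131_072):
    print(f"{n:>7} tokens: {attn_matrix_mib(n):>9,.0f} MiB (FP32)")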
Mathematical Foundation
Online vs Offline Algorithms
Understanding the difference between online and offline algorithms is crucial to appreciating Flash Attention's innovation:
Offline Algorithm (Traditional)
- Requires all data to be available upfront
- Must store entire intermediate results
- Multiple passes over data
- High memory usage: O(N²) for attention
1. Compute all scores S = QK^T
2. Store entire N×N matrix
3. Apply softmax to all elements
4. Multiply with V
Online Algorithm (Flash Attention)
- Processes data incrementally as it arrives
- Updates results on-the-fly
- Single pass through data
- Low memory usage: O(N) for statistics
1. Process blocks incrementally
2. Update running statistics
3. Accumulate results online
4. Never store full matrix
Online algorithms can produce identical results to offline algorithms while using dramatically less memory by cleverly updating partial results.
Online Softmax Algorithm
The foundation of Flash Attention lies in reformulating softmax computation. For a vector $x = (x_1, \dots, x_N)$, traditional softmax requires:

Standard Softmax

$$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

Flash Attention implements an online version with numerical stability:

Safe Softmax

$$m = \max_i x_i, \qquad \mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_{j=1}^{N} e^{x_j - m}}$$

Online Updates

Processing $x$ block by block, only a running maximum $m$ and a running sum $\ell$ need to be kept. When a new block arrives with local maximum $\tilde{m}$ and local sum $\tilde{\ell}$ (computed relative to $\tilde{m}$), the statistics are updated as:

$$m^{\mathrm{new}} = \max(m, \tilde{m}), \qquad \ell^{\mathrm{new}} = \ell\, e^{m - m^{\mathrm{new}}} + \tilde{\ell}\, e^{\tilde{m} - m^{\mathrm{new}}}$$
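A minimal NumPy sketch of these online updates (the block size and names are illustrative): only m and l are kept while x is streamed through, yet the result matches the offline computation exactly:

import numpy as np

def online_softmax_stats(x, block=4):
    # Stream x in chunks, keeping only a running max m and running sum l
    m, l = -np.inf, 0.0
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        m_new = max(m, chunk.max())
        # Rescale the old sum to the new max, then add this chunk's contribution
        l = l * np.exp(m - m_new) + np.exp(chunk - m_new).sum()
        m = m_new
    return m, l

x = np.random.default_rng(0).standard_normal(10)
m, l = online_softmax_stats(x)
online = np.exp(x - m) / l
offline = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(online, offline)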
Block-Wise Attention Computation
Flash Attention partitions the N×N attention matrix into blocks of size B and processes them sequentially:
Block Processing Strategy
1. Partition matrices: split Q into query blocks and K, V into key/value blocks of B rows each.
2. Local computation: for each pair of blocks, compute the B×B score block S_ij = Q_i K_j^T and its softmax statistics in fast on-chip memory.
3. Online accumulation: fold each block's contribution into the running output using the online softmax updates, so the full N×N matrix is never materialized.
Core Flash Attention Algorithm
The following NumPy reference implementation mirrors the algorithm; a real kernel performs the same steps on blocks held in on-chip SRAM.

# Flash Attention core algorithm (reference implementation)
import numpy as np

def flash_attention(Q, K, V, block_size=128):
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block_size):                     # loop over query blocks
        Q_i = Q[i:i + block_size]                         # in a real kernel: loaded into shared memory
        m_old = np.full(len(Q_i), -np.inf)                # running row-wise max
        l_old = np.zeros(len(Q_i))                        # running row-wise sum
        O_i = np.zeros_like(Q_i)
        for j in range(0, N, block_size):                 # loop over key/value blocks
            K_j, V_j = K[j:j + block_size], V[j:j + block_size]

            # Compute local attention scores for this block pair
            S_ij = Q_i @ K_j.T / np.sqrt(d)

            # Online softmax: update running max for numerical stability
            m_new = np.maximum(m_old, S_ij.max(axis=1))
            P_ij = np.exp(S_ij - m_new[:, None])

            # Update running sum for normalization
            scale = np.exp(m_old - m_new)
            l_new = l_old * scale + P_ij.sum(axis=1)

            # Accumulate output with online rescaling; the full N x N matrix is never stored
            O_i = (O_i * (l_old * scale)[:, None] + P_ij @ V_j) / l_new[:, None]

            # Update statistics for the next iteration
            m_old, l_old = m_new, l_new
        O[i:i + block_size] = O_i
    return O
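A quick numerical check, with illustrative shapes, that the block-wise routine above matches a dense softmax-attention reference:

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, N, d))

S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=-1, keepdims=True))
reference = (P / P.sum(axis=-1, keepdims=True)) @ V

assert np.allclose(flash_attention(Q, K, V), reference)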
Memory & Performance Analysis
Standard Attention
Memory Requirements
- Q, K, V storage: 3Nd
- Attention matrix: N²
- Total: O(N² + Nd)
Flash Attention
Memory Requirements
- Q, K, V storage: 3Nd
- Block matrices: B²
- Statistics: O(N)
- Total: O(Nd + B²)
For N = 64K and B = 128: memory reduction = N²/B² = 512² ≈ 262,000× for attention matrix storage!
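A quick check of that figure, using the same N and B:

N, B = 64 * 1024, 128
print(f"{(N // B) ** 2:,}x")     # 262,144x reduction in attention-matrix storage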
[Charts: Memory Usage Comparison and Speed Comparison, standard attention vs. Flash Attention]
Optimal Block Size Selection
Block sizes are chosen so that the Q, K, and V blocks plus the intermediate B×B score block all fit in on-chip SRAM, which gives roughly

$$B \approx \frac{M}{4d}$$

where M (SRAM_size) represents available GPU shared memory (typically ~100KB on modern GPUs) and d is the head dimension.
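For intuition, a rough estimate under the relation above, assuming ~100 KB of shared memory, FP16 elements, and head dimension 64 (all illustrative values):

sram_bytes = 100 * 1024          # ~100 KB of shared memory
bytes_per_element = 2            # FP16
d = 64                           # head dimension
sram_elements = sram_bytes // bytes_per_element
block_size = sram_elements // (4 * d)
print(block_size)                # ~200, in practice rounded to a power of two such as 128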
GPU Memory Hierarchy Utilization
Flash Attention is carefully designed around the GPU memory hierarchy to maximize performance:
| Memory Type | Bandwidth | Capacity | Usage in Flash Attention |
|---|---|---|---|
| Registers | ~100 TB/s | Very limited | Accumulate results, temporary values |
| Shared Memory (SRAM) | ~15 TB/s | ~100 KB | Block computation, online statistics |
| Global Memory (HBM) | ~1.5 TB/s | High (40-80 GB) | Store Q, K, V matrices |
Flash Attention achieves optimal I/O complexity, requiring only $O\!\left(\tfrac{N^2 d^2}{M}\right)$ HBM accesses, where $M$ is the size of the fast on-chip memory (SRAM).
Evolution: Flash Attention 2 & 3
Flash Attention 2 (2023) - “Faster, Better, Simpler”
Flash Attention 2 significantly improved upon the original by addressing algorithmic inefficiencies and better utilizing modern GPU architectures.
Key Improvements
- Reduced non-matmul FLOPs: Minimized redundant operations by 2.5×
- Better parallelization: Split Q across warps instead of K/V
- Optimized for H100: Leverages new GPU features
- Backward pass optimization: 2.5× faster gradient computation
Performance Gains
- 2× faster than Flash Attention 1
- Reaches 72% of theoretical max FLOPS on A100
- Supports multi-query and grouped-query attention
- Enables 16K context Llama-7B on single A100
# Flash Attention 2 Key Optimizations
# 1. Better work partitioning
for warp in range(num_warps):
Q_warp = Q[warp::num_warps] # Split Q across warps
for block in K_blocks:
compute_attention(Q_warp, block)
# 2. Reduced rescaling operations
# FA1: Rescale O after each block
# FA2: Keep track of logsumexp, rescale once at end
# 3. Support for variable sequence lengths
# Efficiently handle padding without wasted computation
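In practice, most users reach these kernels through a framework rather than hand-written CUDA. A minimal PyTorch sketch (PyTorch 2.0 or newer is assumed); whether a Flash Attention kernel is actually selected depends on the GPU, dtype, and tensor shapes, so treat this as illustrative rather than a guarantee:

import torch
import torch.nn.functional as F

# Shapes are (batch, num_heads, seq_len, head_dim); FP16/BF16 on a CUDA GPU
# is typically required for the fused Flash Attention backend to be eligible.
q = torch.randn(2, 8, 4096, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch dispatches to a fused, memory-efficient attention kernel when one
# is available and silently falls back to the math backend otherwise.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)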
Flash Attention 3 (2024) - “Hardware-Aware for Hopper GPUs”
Flash Attention 3 is specifically designed for NVIDIA Hopper architecture (H100/H800), exploiting new hardware features for unprecedented performance.
Hopper-Specific Optimizations
- WGMMA instructions: New tensor core operations for async execution
- TMA (Tensor Memory Accelerator): Hardware-accelerated async data movement
- Pingpong scheduling: Overlap compute and memory operations
- FP8 support: Lower precision for even faster computation
Performance Breakthrough
- 1.5-2× faster than Flash Attention 2 on H100
- 740 TFLOPS achieved (75% utilization)
- FP8: 1.2 PFLOPS possible
- Supports GQA with minimal overhead
// Flash Attention 3: Hopper-specific features
// Asynchronous producer-consumer pattern
// Producer thread block: Load data using TMA
tma::async_load(Q_smem, Q_gmem, pipeline);
tma::async_load(K_smem, K_gmem, pipeline);
// Consumer warpgroup: Compute using WGMMA
pipeline.consumer_wait();
wgmma::mma(acc, Q_smem, K_smem); // Async tensor core op
// Pingpong double buffering: overlap the next tile's load with the current tile's compute
while (tiles_remaining > 0) {
    // Buffer A: asynchronously load the next tile
    tma::async_load(buffer_A, next_tile);
    // Buffer B: compute on the current tile
    wgmma::mma(acc, buffer_B);
    // Swap roles so the freshly loaded tile is consumed next iteration
    swap(buffer_A, buffer_B);
    --tiles_remaining;
}
Performance Evolution
| Version | Year | Key Innovation | Speed vs Standard | Best GPU |
|---|---|---|---|---|
| Flash Attention | 2022 | Online softmax + blocking | 2-4× | A100 |
| Flash Attention 2 | 2023 | Better parallelization | 5-9× | A100/H100 |
| Flash Attention 3 | 2024 | Hopper-specific (TMA, WGMMA) | 10-15× | H100 |
Sparse Flash Attention
Flash Attention's block-wise structure composes naturally with block-sparse attention: blocks that a sparsity pattern masks out are skipped entirely, and only the remaining blocks are computed. Common sparsity patterns include (see the sketch after the list below):
- Local: Each token attends to nearby tokens
- Global: Special tokens attend to all positions
- Random: Sparse random connections
- Block-Sparse: Predefined sparsity patterns
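A small NumPy sketch of how such a block-level pattern might be expressed (the function and its parameters are hypothetical); a block-sparse kernel would simply skip every block marked False:

import numpy as np

def block_sparsity_pattern(num_blocks, local_window=2, global_blocks=(0,)):
    keep = np.zeros((num_blocks, num_blocks), dtype=bool)
    for i in range(num_blocks):                      # local: nearby blocks only
        lo, hi = max(0, i - local_window), min(num_blocks, i + local_window + 1)
        keep[i, lo:hi] = True
    keep[:, list(global_blocks)] = True              # global: every query sees these blocks
    keep[list(global_blocks), :] = True              # global tokens attend everywhere
    return keep

print(block_sparsity_pattern(6).astype(int))         # 1 = compute this block, 0 = skip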
Applications & Impact
Document Processing
- Legal document analysis (100K+ tokens)
- Full scientific paper comprehension
- Entire code repository analysis
Multimodal AI
- High-resolution image processing
- Long-form video understanding
- Extended audio conversations
Training Efficiency
- 2-4× faster training
- Larger batch sizes
- Lower activation memory in the backward pass (the attention matrix is recomputed rather than stored)
Mathematical Correctness
For any partitioning of the attention computation into blocks, Flash Attention produces outputs identical to standard attention up to numerical precision.
Proof Sketch
The proof relies on the associativity of the block-wise combination operation when the running statistics are tracked correctly. Two partial results $(m_A, \ell_A, O_A)$ and $(m_B, \ell_B, O_B)$, computed over disjoint sets of key/value blocks, merge as:

$$m = \max(m_A, m_B), \qquad \ell = \ell_A e^{m_A - m} + \ell_B e^{m_B - m}, \qquad O = \frac{\ell_A e^{m_A - m}\, O_A + \ell_B e^{m_B - m}\, O_B}{\ell}$$

Where the combination operation maintains the invariant that, for each query, the running output over the set of processed keys $J$ always equals

$$O = \frac{1}{\ell} \sum_{j \in J} e^{s_j - m} v_j, \qquad \ell = \sum_{j \in J} e^{s_j - m},$$

so that once every block has been processed, $O$ is exactly the softmax-weighted sum over all keys.
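A compact NumPy sketch of this combination rule (the helper names here are illustrative, not from the Flash Attention codebase), verifying that merging two partial results reproduces dense softmax attention:

import numpy as np

def partial(scores, V):
    # Partial softmax attention over one block of keys:
    # returns (row-wise max m, normalizer l, normalized partial output O)
    m = scores.max(axis=-1)
    p = np.exp(scores - m[:, None])
    l = p.sum(axis=-1)
    return m, l, (p @ V) / l[:, None]

def combine(a, b):
    # Associative merge matching the combination rule above
    (m_a, l_a, O_a), (m_b, l_b, O_b) = a, b
    m = np.maximum(m_a, m_b)
    w_a, w_b = l_a * np.exp(m_a - m), l_b * np.exp(m_b - m)
    l = w_a + w_b
    return m, l, (O_a * w_a[:, None] + O_b * w_b[:, None]) / l[:, None]

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 8))                 # scores for 4 queries over 8 keys
V = rng.standard_normal((8, 3))
_, _, O = combine(partial(S[:, :5], V[:5]), partial(S[:, 5:], V[5:]))
P = np.exp(S - S.max(axis=-1, keepdims=True))
expected = (P / P.sum(axis=-1, keepdims=True)) @ V
assert np.allclose(O, expected)                  # identical up to numerical precision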
Future Research Directions
Algorithmic Improvements
1. Adaptive Block Sizing: Dynamic block sizes based on attention patterns
2. Hierarchical Attention: Multi-level attention computation
3. Approximate Methods: Trading accuracy for further speedups
Hardware Co-design
1. Attention Accelerators: Custom silicon for attention computation
2. Memory Optimization: Specialized memory layouts for transformers
3. Quantum Attention: Exploring quantum algorithms for attention
Conclusion
Flash Attention represents a paradigm shift in efficient attention computation, demonstrating that algorithmic innovation can overcome fundamental scaling limitations. By leveraging mathematical properties of softmax and careful consideration of hardware constraints, it achieves linear memory scaling while maintaining exact computational equivalence to standard attention.
The broader impact extends beyond immediate performance gains. Flash Attention has democratized access to long-context models, enabled new applications previously computationally infeasible, and established principles for memory-efficient algorithm design that influence numerous other deep learning components.
As we continue scaling language models and extending context lengths, Flash Attention's principles (block-wise computation, online algorithms, and hardware-aware design) provide a foundation for future innovations in efficient transformer architectures. The journey from O(N²) to O(N) memory complexity exemplifies how deep understanding of mathematical structures, combined with hardware awareness, can break through seemingly fundamental limitations.
Flash Attention shows that the most impactful optimizations often come not from incremental improvements, but from fundamentally rethinking how algorithms interact with hardware.