Tokenizer & BPE Algorithm

Explore how Byte Pair Encoding builds a vocabulary and tokenizes text

What is Byte-Pair Encoding?

BPE (Byte-Pair Encoding) is a data compression technique adapted for NLP tokenization that learns a compact subword vocabulary from training data. It is the foundation of tokenization in modern language models such as GPT and LLaMA (BERT uses the closely related WordPiece algorithm).

🎯 Core Concept

BPE iteratively merges the most frequent adjacent character pairs in a corpus, building a vocabulary of subword units that represent common patterns in language. This creates a sweet spot between character-level (too granular) and word-level (too sparse) tokenization.

⚖️ The Trade-off Solution

Large vocabularies are comprehensive but computationally expensive. Small vocabularies are efficient but can't represent all words. BPE achieves excellent coverage with manageable vocabulary sizes by learning meaningful subword units.

🔄 Deterministic & Reversible

Any text can be encoded into tokens and decoded back exactly. The tokenizer applies the same learned merge rules every time, so results are reproducible (see the round-trip sketch below).

🌍 Language Agnostic

Requires no linguistic knowledge yet discovers meaningful structures purely from statistical patterns. Works for any language or even programming code.

📊 Handles Unknown Words

Naturally handles rare words and out-of-vocabulary terms by breaking them into known subword components. Even completely new words can be tokenized.
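
A quick way to see the round-trip and unknown-word properties in practice is with an off-the-shelf BPE tokenizer such as OpenAI's tiktoken package (an assumption here: that it is installed and that the GPT-2 encoding is a representative example; the invented word and the exact IDs are purely illustrative):

    # Round-trip with a pre-trained byte-level BPE (requires: pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("gpt2")              # GPT-2's byte-level BPE vocabulary
    text = "BPE handles zorblax, an invented word."  # contains an out-of-vocabulary word
    ids = enc.encode(text)                           # list of integer token IDs
    print(ids)
    assert enc.decode(ids) == text                   # decoding recovers the text exactly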

How BPE Works

The Algorithm in Simple Terms

Step 1: Start with Characters

Begin with a vocabulary of all individual characters in your training corpus. Every word is represented as a sequence of characters. For example, “hello” becomes ['h', 'e', 'l', 'l', 'o'].
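
A minimal sketch of this starting point, using a toy corpus and a "</w>" end-of-word marker (both chosen here purely for illustration; the marker plays the same role as the ⏎ symbol shown later on this page):

    from collections import Counter

    corpus = "the cat sat on the mat the cat sat"   # toy corpus for illustration
    word_freqs = Counter(corpus.split())

    # Each word becomes a tuple of characters plus an end-of-word marker,
    # so later merges never cross word boundaries.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    # e.g. {('t','h','e','</w>'): 3, ('c','a','t','</w>'): 2, ...}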

Step 2: Count Character Pairs

Find all adjacent character pairs in the corpus and count their frequencies. For instance, “th” might appear 1,000 times and “he” 800 times across all the text.
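
Counting pairs over that word/frequency table takes only a few lines; this sketch continues the toy example above, and the counts it produces are only illustrative:

    from collections import Counter

    def count_pairs(vocab):
        """Count every adjacent pair of symbols, weighted by how often the word occurs."""
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        return pairs

    # count_pairs(vocab).most_common(1) -> the single most frequent adjacent pair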

Step 3: Merge Most Frequent Pair

Take the most frequent pair and merge it into a single token. If “th” is most common, create a new token “th” and replace all occurrences. Now “the” becomes ['th', 'e'] instead of ['t', 'h', 'e'].
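
Applying a merge just rewrites every word that contains the chosen pair; a sketch:

    def merge_pair(pair, vocab):
        """Replace each occurrence of the adjacent `pair` with one merged symbol."""
        merged = pair[0] + pair[1]                    # e.g. ('t', 'h') -> 'th'
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        return new_vocab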

Step 4: Repeat Until Target Size

Continue finding and merging the next most frequent pairs. Over iterations, you build up from characters to subwords to common words. Stop when you reach your desired vocabulary size (e.g., 50,000 tokens).
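
Putting the steps together, a toy training loop (reusing the count_pairs and merge_pair sketches above) could look like this:

    def train_bpe(vocab, num_merges):
        """Greedily learn merge rules; uses count_pairs() and merge_pair() from the sketches above."""
        merges = []                                   # merge rules, in the order they were learned
        for _ in range(num_merges):
            pairs = count_pairs(vocab)
            if not pairs:
                break                                 # nothing left to merge
            best_pair = pairs.most_common(1)[0][0]
            vocab = merge_pair(best_pair, vocab)
            merges.append(best_pair)
        return merges, vocab

    # e.g. merges, vocab = train_bpe(vocab, num_merges=1740)
    # (1740 matches the merge-rule count reported further down this page)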

Step 5: Apply to New Text

To tokenize new text, apply the learned merge rules in the exact order they were learned. Unknown words naturally break down into known subword pieces that were learned during training.

Example: the word “unhappy” might tokenize as ['un', 'happ', 'y'], capturing the prefix “un”, the root “happ”, and the suffix “y”. Even if “unhappy” never appeared in the training data, BPE can represent it using learned subwords.
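
Encoding a new word is then just a matter of replaying the learned merges in order; the sketch below uses a hypothetical merge list chosen only so that it reproduces the “unhappy” example:

    def encode_word(word, merges):
        """Split a word into characters, then apply each learned merge in training order."""
        symbols = list(word) + ["</w>"]
        for pair in merges:                           # merges are stored in the order learned
            merged, out, i = pair[0] + pair[1], [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
        return symbols

    # Hypothetical merge rules that would produce the example above:
    toy_merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p"), ("y", "</w>")]
    print(encode_word("unhappy", toy_merges))   # ['un', 'happ', 'y</w>']  (</w> marks the word end)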

Why Modern LLMs Use BPE

BPE strikes a practical balance among vocabulary size, computational efficiency, and linguistic coverage. The learned subword vocabulary captures meaningful units like prefixes (“un-”, “re-”), suffixes (“-ing”, “-tion”), and common word parts, enabling models to understand word morphology and generalize to new words.

Phase 1: Training

Common Patterns First

The earliest merges, such as “th”, “e⏎”, and “ing⏎” (where ⏎ marks the end of a word), capture the most frequent patterns in English and allow common words to be compressed efficiently.

Subword Benefits

BPE can handle unknown words by breaking them into known subwords. Even new words can be tokenized using learned character patterns.

Compression Efficiency

By merging frequent pairs, BPE achieves ~2-3x compression while maintaining meaningful linguistic units that help models understand language.
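
The compression figure is simply characters divided by tokens; a small worked example (the sentence and token count here are hypothetical):

    # Compression ratio = characters in the text / tokens produced.
    text = "The quick brown fox jumps over the lazy dog."   # 44 characters
    num_tokens = 15                                          # hypothetical output of a small BPE vocabulary
    print(f"{len(text) / num_tokens:.1f} chars per token")   # prints: 2.9 chars per token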

Vocabulary Size: 2000 unique tokens in the model
Merge Rules: 1740 learned character combinations
Avg Token Length: 6.6 characters per token in the vocabulary
Special Tokens: 4 (PAD, UNK, BOS, EOS)

Complete Vocabulary: an interactive table lists every token with its ID, raw form, display form, length, and type.
Phase 2: Tokenization

How Tokenization Works

When you input text, the tokenizer applies the learned merge rules in the exact order they were learned during training. Watch below as your text transforms from individual characters to the final tokens through the merge sequence.
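
In practice, many implementations express “apply the merges in learned order” through merge ranks: each learned pair gets a priority equal to its position in the merge list, and the tokenizer repeatedly merges whichever pair currently present has the lowest rank. This is a sketch of that equivalent formulation (not this page's own implementation), assuming a merges list in learned order like the toy one trained earlier:

    def encode_by_rank(word, merges):
        """Repeatedly merge whichever adjacent pair was learned earliest (lowest rank)."""
        rank = {pair: i for i, pair in enumerate(merges)}    # earlier merge = lower rank
        symbols = list(word) + ["</w>"]
        while len(symbols) > 1:
            pairs = list(zip(symbols, symbols[1:]))
            best = min(pairs, key=lambda p: rank.get(p, float("inf")))
            if best not in rank:
                break                                        # no learned merge applies anymore
            i = pairs.index(best)
            symbols = symbols[:i] + [best[0] + best[1]] + symbols[i + 2:]
        return symbols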

The interactive demo lets you enter text (or pick one of the quick examples) and tokenizes it live, showing the resulting tokens alongside the token count, the compression ratio in characters per token, the average token length, and the token length and token type distributions.
Understanding BPE Tokenization

BPE starts with character-level tokens and iteratively merges the most frequent pairs. This creates a vocabulary of subword units that can efficiently represent any text. The algorithm balances vocabulary size against sequence length, making it well suited to neural language models.