Tokenizer & BPE Algorithm
Explore how Byte-Pair Encoding builds a vocabulary and tokenizes text
What is Byte-Pair Encoding?
BPE (Byte-Pair Encoding) is a data compression technique adapted for NLP tokenization that learns a subword vocabulary from training data. It underpins the tokenizers of modern language models such as GPT and LLaMA; BERT uses the closely related WordPiece algorithm.
🎯 Core Concept
BPE iteratively merges the most frequent adjacent character pairs in a corpus, building a vocabulary of subword units that represent common patterns in language. This creates a sweet spot between character-level (too granular) and word-level (too sparse) tokenization.
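For a concrete sense of this sweet spot, here is a minimal Python sketch comparing the three granularities; the subword split shown is illustrative, not the output of any particular tokenizer:

```python
# Compare how many tokens each granularity needs for the same word.
text = "tokenization"

char_level = list(text)               # 12 tokens, but the vocabulary is tiny
word_level = [text]                   # 1 token, but "tokenization" must be in the vocabulary
subword_level = ["token", "ization"]  # 2 tokens built from reusable pieces (illustrative split)

print(len(char_level), len(word_level), len(subword_level))  # 12 1 2
```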
⚖️ The Trade-off Solution
Word-level vocabularies are comprehensive but huge and still miss rare words; character-level vocabularies are tiny but produce very long sequences. BPE covers arbitrary text with a manageable vocabulary size by learning meaningful subword units.
🔄 Deterministic & Reversible
Any text can be encoded into tokens and perfectly decoded back. The algorithm applies the same learned merge rules consistently, ensuring reproducible results.
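A minimal illustration of reversibility, assuming a byte-level-style BPE where every token is a literal piece of the original text (schemes that use end-of-word markers simply strip them when decoding):

```python
# Decoding is just concatenation when tokens are literal substrings of the input.
tokens = ["un", "happ", "y"]          # illustrative tokens
decoded = "".join(tokens)
assert decoded == "unhappy"
print(decoded)
```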
🌍 Language Agnostic
Requires no linguistic knowledge yet discovers meaningful structures purely from statistical patterns. Works for any language or even programming code.
📊 Handles Unknown Words
Naturally handles rare words and out-of-vocabulary terms by breaking them into known subword components. Even completely new words can be tokenized.
The Algorithm in Simple Terms
Start with Characters
Begin with a vocabulary of all individual characters in your training corpus. Every word is represented as a sequence of characters. For example, “hello” becomes ['h', 'e', 'l', 'l', 'o'].
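A minimal Python sketch of this starting point on a toy corpus; the “</w>” end-of-word marker follows the convention of the original subword BPE formulation, and other implementations mark word boundaries differently:

```python
from collections import Counter

# Toy corpus; real training uses millions of words.
corpus = ["hello", "hello", "hell", "help"]

# Count word frequencies, then split every distinct word into characters,
# appending an end-of-word marker so merges can learn word boundaries.
word_freqs = Counter(corpus)
splits = {word: list(word) + ["</w>"] for word in word_freqs}

print(splits["hello"])  # ['h', 'e', 'l', 'l', 'o', '</w>']
```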
Count Character Pairs
Find all adjacent character pairs in the corpus and count their frequencies. For instance, “th” might appear 1,000 times and “he” 800 times across all the text.
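Continuing the toy setup above, pair counting (weighted by word frequency) can be sketched like this; the helper name count_pairs is just for illustration:

```python
from collections import Counter

def count_pairs(splits, word_freqs):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

# Reusing the toy corpus from the previous sketch:
word_freqs = Counter(["hello", "hello", "hell", "help"])
splits = {word: list(word) + ["</w>"] for word in word_freqs}

print(count_pairs(splits, word_freqs).most_common(3))
# e.g. [(('h', 'e'), 4), (('e', 'l'), 4), (('l', 'l'), 3)]
```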
Merge Most Frequent Pair
Take the most frequent pair and merge it into a single token. If “th” is most common, create a new token “th” and replace all occurrences. Now “the” becomes ['th', 'e'] instead of ['t', 'h', 'e'].
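The merge step can be sketched as replacing every adjacent occurrence of the chosen pair with one concatenated symbol; merge_pair is an illustrative helper, not a library function:

```python
def merge_pair(pair, splits):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = "".join(pair)
    new_splits = {}
    for word, symbols in splits.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)   # merge the pair into a single token
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_splits[word] = out
    return new_splits

splits = {"the": ["t", "h", "e", "</w>"]}
print(merge_pair(("t", "h"), splits)["the"])  # ['th', 'e', '</w>']
```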
Repeat Until Target Size
Continue finding and merging the next most frequent pairs. Over iterations, you build up from characters to subwords to common words. Stop when you reach your desired vocabulary size (e.g., 50,000 tokens).
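Putting the pieces together, a toy training loop (reusing the count_pairs and merge_pair sketches above) might stop after a fixed number of merges; production tokenizers instead stop at a target vocabulary size:

```python
from collections import Counter

def train_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge rules; real vocabularies target tens of thousands."""
    splits = {word: list(word) + ["</w>"] for word in word_freqs}
    merges = []  # learned rules, kept in the order they were learned
    for _ in range(num_merges):
        pairs = count_pairs(splits, word_freqs)
        if not pairs:
            break                                # nothing left to merge
        best = max(pairs, key=pairs.get)         # most frequent pair (ties broken arbitrarily)
        splits = merge_pair(best, splits)
        merges.append(best)
    return merges

word_freqs = Counter(["hello", "hello", "hell", "help"])
print(train_bpe(word_freqs, num_merges=5))
```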
Apply to New Text
To tokenize new text, apply the learned merge rules in the exact order they were learned. Unknown words naturally break down into known subword pieces that were learned during training.
Example: The word “unhappy” might tokenize as [“un”, “happ”, “y”], capturing the prefix “un”, the root “happ”, and the suffix “y”. Even if “unhappy” never appeared in the training data, BPE can still tokenize it from learned subwords.
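A sketch of encoding new text by replaying learned merges in training order; the merge list below is hypothetical, chosen only so that “unhappy” splits as in the example above:

```python
def encode_word(word, merges):
    """Tokenize one word by replaying the learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:  # same order as learned during training
        merged = "".join(pair)
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merge rules; a real list comes from training.
merges = [("u", "n"), ("h", "a"), ("ha", "p"), ("hap", "p")]
print(encode_word("unhappy", merges))  # ['un', 'happ', 'y', '</w>']
```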
Why Modern LLMs Use BPE
BPE strikes a practical balance between vocabulary size, computational efficiency, and linguistic coverage. The learned subword vocabulary captures meaningful units like prefixes (“un-”, “re-”), suffixes (“-ing”, “-tion”), and common word parts, helping models capture word morphology and generalize to new words.
Common Patterns First
The first merges, such as “th”, “e</w>”, and “ing</w>” (where “</w>” marks the end of a word), capture the most frequent patterns in English, allowing efficient compression of common words.
Subword Benefits
BPE can handle unknown words by breaking them into known subwords. Even new words can be tokenized using learned character patterns.
Compression Efficiency
By merging frequent pairs, BPE achieves ~2-3x compression while maintaining meaningful linguistic units that help models understand language.
How Tokenization Works
When you input text, the tokenizer applies the learned merge rules in the exact order they were learned during training. Watch below as your text transforms from individual characters to the final tokens through the merge sequence.
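The kind of step-by-step transformation the demo visualizes can also be traced by hand; the merges here are hypothetical and shown only to illustrate the shape of the sequence:

```python
# Hypothetical merges, applied in learned order to the word "the".
trace = [
    ["t", "h", "e"],  # start: individual characters
    ["th", "e"],      # after applying merge ("t", "h")
    ["the"],          # after applying merge ("th", "e")
]
for step in trace:
    print(" ".join(step))
```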