Word Embeddings & Vector Representations

Transform discrete tokens into continuous vector spaces where meaning lives

What are Word Embeddings?

Word embeddings are the bridge between discrete symbols (tokens) and continuous mathematics (neural networks). They transform words into dense vectors of real numbers, enabling mathematical operations on language and capturing semantic relationships in geometric space.

The Core Insight

Words with similar meanings should have similar representations. In embedding space, “cat” and “dog” are closer together than “cat” and “democracy”. This geometric structure emerges naturally from training on text where similar words appear in similar contexts.

Why Not One-Hot?

One-hot encoding creates sparse vectors where each word is orthogonal to every other word—no relationships captured. A 50,000 word vocabulary needs 50,000-dimensional vectors! Embeddings compress this to typically 128-1024 dimensions while encoding rich semantic information.
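
A minimal NumPy sketch of that size gap; the vocabulary size, embedding dimension, and token ID below are illustrative values rather than numbers from any specific model.

```python
import numpy as np

vocab_size, embed_dim = 50_000, 768
token_id = 1234  # hypothetical ID for "cat"

# One-hot: 50,000 numbers, all zero except one; every pair of words is orthogonal.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Dense embedding: 768 learned numbers; similar words end up with similar rows.
embedding_matrix = np.random.randn(vocab_size, embed_dim).astype(np.float32)
dense = embedding_matrix[token_id]

print(one_hot.shape, dense.shape)  # (50000,) (768,)
```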

Distributed Representation

Each dimension doesn't represent a single feature but rather contributes to multiple semantic aspects. Meaning is distributed across all dimensions, making embeddings robust and information-dense.

Compositionality

Vector arithmetic works! The famous example: vector(“king”) - vector(“man”) + vector(“woman”) ≈ vector(“queen”). Relationships are encoded as vector offsets.
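
A toy sketch of the arithmetic using nothing but NumPy; the 4-dimensional vectors are invented purely to show the mechanics, whereas real analogies only emerge from embeddings trained on large corpora.

```python
import numpy as np

# Made-up vectors chosen so the analogy works; real embeddings are learned.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.2]),
    "woman": np.array([0.7, 0.1, 0.9, 0.2]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(target, vectors[w]))
print(best)  # "queen" for these toy vectors
```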

Transfer Learning

Pre-trained embeddings capture general language understanding that transfers to new tasks. This is the foundation of modern NLP—start with good embeddings, fine-tune for specific tasks.

How Embeddings Work

The Embedding Process

1. Token ID Lookup

Each token from the tokenizer has a unique ID (e.g., “cat” → 1234). This ID is used to look up the corresponding embedding vector from the embedding matrix.

2. Embedding Matrix

A large matrix of size [vocab_size × embedding_dim]. Each row is a learned vector for one token. For a 50k vocabulary with 768-dim embeddings, this is a 50,000 × 768 matrix (~146 MiB in float32).

3. Vector Retrieval

The embedding for token ID 1234 is simply row 1234 of the matrix. This is a fast O(1) lookup operation, essentially just array indexing.
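
A minimal PyTorch sketch of steps 1-3, assuming the 50k vocabulary and 768-dimensional embeddings used in the example above; `nn.Embedding` stores the matrix and performs the row lookup.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)   # the [vocab_size x embedding_dim] matrix

token_ids = torch.tensor([1234, 42, 7])   # IDs produced by the tokenizer
vectors = embedding(token_ids)            # row lookup, shape [3, 768]

print(vectors.shape)                               # torch.Size([3, 768])
print(vocab_size * embed_dim * 4 / 2**20, "MiB")   # ~146 MiB of float32 weights
```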

4. Learning Process

During training, these vectors are adjusted via backpropagation to minimize the loss function. Words appearing in similar contexts get pulled closer together in the vector space.
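
A minimal sketch of one training step, assuming a toy objective (predicting a target token ID through a linear head); any loss that depends on the looked-up rows will adjust them through backpropagation.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1_000, 64
embedding = nn.Embedding(vocab_size, embed_dim)
head = nn.Linear(embed_dim, vocab_size)
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(head.parameters()), lr=0.1)

context_ids = torch.tensor([12, 7, 99])   # tokens seen in context
target_ids = torch.tensor([13, 8, 100])   # tokens the model should predict

logits = head(embedding(context_ids))
loss = nn.functional.cross_entropy(logits, target_ids)
loss.backward()    # gradients flow into the looked-up embedding rows
optimizer.step()   # only those rows (and the head) change on this step
```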

5. Positional Encoding (Transformers)

Since transformers process all tokens in parallel, positional information is added to embeddings. This tells the model where each word appears in the sequence.
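
A brief sketch of the fixed sinusoidal variant from the original Transformer paper ("Attention Is All You Need"); learned and relative schemes, listed later in this section, swap in trainable or distance-based encodings instead.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]                  # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]                 # [1, d_model/2]
    angles = positions / np.power(10000.0, dims / d_model)   # [seq_len, d_model/2]
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=768)
# embeddings = token_embeddings + pe   # added element-wise, same shape
```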

Key Insight: The embedding matrix is often the largest single parameter block in a model. GPT-3's embedding matrix alone has roughly 617 million parameters (50,257 tokens × 12,288 dims)!

Why Embeddings Enable Intelligence

Embeddings transform the discrete, symbolic nature of language into a continuous space where: (1) Similar concepts are geometrically close, (2) Relationships are vector operations, (3) Gradients can flow for learning, and (4) Neural networks can process meaning mathematically. This transformation is fundamental—without it, deep learning on text wouldn't be possible.

Phase 1: Training Embeddings

Classical Methods

Word2Vec introduced two architectures that revolutionized NLP:

  • CBOW: Predict word from context
  • Skip-gram: Predict context from word

Trained on billions of words, Word2Vec captures analogies and semantic relationships.
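
A hedged example using gensim's `Word2Vec` class (gensim 4.x API, `pip install gensim`); the toy corpus exists only to show the calls, since real training needs billions of words.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["cat"]                 # 100-dim vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```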

GloVe (Global Vectors for Word Representation):

  • Combines global matrix factorization
  • With local context windows
  • Explicitly encodes co-occurrence statistics

FastText extends Word2Vec with subword information (see the character n-gram sketch after this list):

  • Handles out-of-vocabulary words
  • Uses character n-grams
  • Better for morphologically rich languages
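
A small illustrative function for the character n-grams that FastText sums into a word vector; the `<`/`>` boundary markers and the 3-6 n-gram range follow the FastText paper's defaults, while the function itself is just a sketch.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Return all character n-grams of the word, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
# An out-of-vocabulary word still shares n-grams (and thus vector pieces)
# with words seen during training.
```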

Modern Contextual Embeddings

Contextual embeddings are learned end-to-end with the model:

  • Context-dependent representations
  • Same word → different vectors in different contexts
  • Captures polysemy naturally

“Bank” near “river” vs “bank” near “money” get different embeddings!
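
A hedged sketch of that effect using the Hugging Face `transformers` library with `bert-base-uncased`; the exact similarity value depends on the model and layer, but the two "bank" vectors come out clearly different.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token "bank" in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # [seq_len, 768]
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v_river = bank_vector("she sat on the bank of the river")
v_money = bank_vector("she deposited cash at the bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))   # typically well below 1.0
```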

Positional encodings add word-order information:

  • Sinusoidal: Fixed mathematical functions
  • Learned: Trainable position embeddings
  • Relative: Encode distances between tokens

Subword embeddings, for BPE/WordPiece tokenization (see the tokenizer sketch after this list):

  • Each subword token gets an embedding
  • Rare words built from common pieces
  • Enables open-vocabulary coverage of arbitrary text
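
A hedged one-liner with a WordPiece tokenizer (`bert-base-uncased` via `transformers`); the exact split shown in the comment is illustrative, but each piece maps to its own embedding row.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("unbelievability")
print(pieces)   # a rare word split into common subword pieces, e.g. ['un', '##bel', ...]
```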

Key Properties of Good Embeddings

  • Semantic Similarity (cosine distance): similar words → high cosine similarity
  • Linear Relationships (vector arithmetic): analogies via addition/subtraction
  • Dimensionality (128-1024): typical embedding dimensions
  • Coverage (99.9%+): share of real-world text representable with the vocabulary

Phase 2: Interactive Exploration

Embedding Statistics (example configuration)

  • Vector Dimension: 128 (parameters per token)
  • Total Tokens: 4 (unique embeddings)
  • Memory Usage: 2.00 KB (4 tokens × 128 dims × 4 bytes, float32)

PCA Projection (2D)

PCA reduces high-dimensional embeddings to 2D while preserving as much variance as possible. Nearby points have similar semantic meaning.
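
A brief sketch of the projection with scikit-learn's `PCA`; the random `embeddings` matrix is a stand-in for whatever [n_tokens × embed_dim] block is being visualised.

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(4, 128)           # placeholder: 4 tokens, 128 dims each
coords_2d = PCA(n_components=2).fit_transform(embeddings)
print(coords_2d.shape)                         # (4, 2): one (x, y) point per token
```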

First 5 Dimensions

Each dimension captures different semantic features. Similar words have similar patterns across dimensions.

Common Embedding Patterns

Semantic Clusters

Related words form tight clusters: animals group together, colors group together, emotions group together. This emerges naturally from training.

Frequency Effects

Common words often have smaller magnitudes and occupy central positions. Rare words can have more extreme values.

Polysemy Challenge

Static embeddings give one vector per word, struggling with multiple meanings. Contextual embeddings solve this.

Next: Attention Mechanism

Now that tokens are embedded as vectors, the attention mechanism determines how they interact. Attention allows the model to focus on relevant parts of the input, creating context-aware representations that capture long-range dependencies and complex relationships.
