Word Embeddings & Vector Representations

Transform discrete tokens into continuous vector spaces where meaning lives

What are Word Embeddings?

Word embeddings are the bridge between discrete symbols (tokens) and continuous mathematics (neural networks). They transform words into dense vectors of real numbers, enabling mathematical operations on language and capturing semantic relationships in geometric space.

The Core Insight

Words with similar meanings should have similar representations. In embedding space, “cat” and “dog” are closer together than “cat” and “democracy”. This geometric structure emerges naturally from training on text where similar words appear in similar contexts.

Why Not One-Hot?

One-hot encoding creates sparse vectors where each word is orthogonal to every other word—no relationships captured. A 50,000 word vocabulary needs 50,000-dimensional vectors! Embeddings compress this to typically 128-1024 dimensions while encoding rich semantic information.
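
A minimal NumPy sketch of that size gap; the vocabulary size, embedding dimension, and token ID below are illustrative values rather than numbers from any specific model.

```python
import numpy as np

vocab_size, embed_dim = 50_000, 768
token_id = 1234  # hypothetical ID for "cat"

# One-hot: 50,000 numbers, all zero except one; every pair of words is orthogonal.
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

# Dense embedding: 768 learned numbers; similar words end up with similar rows.
embedding_matrix = np.random.randn(vocab_size, embed_dim).astype(np.float32)
dense = embedding_matrix[token_id]

print(one_hot.shape, dense.shape)  # (50000,) (768,)
```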

Distributed Representation

Each dimension doesn't represent a single feature but rather contributes to multiple semantic aspects. Meaning is distributed across all dimensions, making embeddings robust and information-dense.

Compositionality

Vector arithmetic works! The famous example: vector(“king”) - vector(“man”) + vector(“woman”) ≈ vector(“queen”). Relationships are encoded as vector offsets.
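
A toy sketch of the arithmetic using nothing but NumPy; the 4-dimensional vectors are invented purely to show the mechanics, whereas real analogies only emerge from embeddings trained on large corpora.

```python
import numpy as np

# Made-up vectors chosen so the analogy works; real embeddings are learned.
vectors = {
    "king":  np.array([0.8, 0.9, 0.1, 0.2]),
    "man":   np.array([0.7, 0.1, 0.1, 0.2]),
    "woman": np.array([0.7, 0.1, 0.9, 0.2]),
    "queen": np.array([0.8, 0.9, 0.9, 0.2]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(vectors, key=lambda w: cosine(target, vectors[w]))
print(best)  # "queen" for these toy vectors
```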

Transfer Learning

Pre-trained embeddings capture general language understanding that transfers to new tasks. This is the foundation of modern NLP—start with good embeddings, fine-tune for specific tasks.

How Embeddings Work

The Embedding Process

1. Token ID Lookup

Each token from the tokenizer has a unique ID (e.g., “cat” → 1234). This ID is used to look up the corresponding embedding vector from the embedding matrix.

2. Embedding Matrix

A large matrix of size [vocab_size × embedding_dim]. Each row is a learned vector for one token. For a 50k vocabulary with 768-dim embeddings, this is a 50,000 × 768 matrix (~146 MiB in float32).

3. Vector Retrieval

The embedding for token ID 1234 is simply row 1234 of the matrix. This is a fast O(1) lookup operation, essentially just array indexing.
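
A minimal PyTorch sketch of steps 1-3, assuming the 50k vocabulary and 768-dimensional embeddings used in the example above; `nn.Embedding` stores the matrix and performs the row lookup.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50_000, 768
embedding = nn.Embedding(vocab_size, embed_dim)   # the [vocab_size x embedding_dim] matrix

token_ids = torch.tensor([1234, 42, 7])   # IDs produced by the tokenizer
vectors = embedding(token_ids)            # row lookup, shape [3, 768]

print(vectors.shape)                               # torch.Size([3, 768])
print(vocab_size * embed_dim * 4 / 2**20, "MiB")   # ~146 MiB of float32 weights
```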

4. Learning Process

During training, these vectors are adjusted via backpropagation to minimize the loss function. Words appearing in similar contexts get pulled closer together in the vector space.
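
A minimal sketch of one training step, assuming a toy objective (predicting a target token ID through a linear head); any loss that depends on the looked-up rows will adjust them through backpropagation.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1_000, 64
embedding = nn.Embedding(vocab_size, embed_dim)
head = nn.Linear(embed_dim, vocab_size)
optimizer = torch.optim.SGD(list(embedding.parameters()) + list(head.parameters()), lr=0.1)

context_ids = torch.tensor([12, 7, 99])   # tokens seen in context
target_ids = torch.tensor([13, 8, 100])   # tokens the model should predict

logits = head(embedding(context_ids))
loss = nn.functional.cross_entropy(logits, target_ids)
loss.backward()    # gradients flow into the looked-up embedding rows
optimizer.step()   # only those rows (and the head) change on this step
```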

5. Positional Encoding (Transformers)

Since transformers process all tokens in parallel, positional information is added to embeddings. This tells the model where each word appears in the sequence.
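
A brief sketch of the fixed sinusoidal variant from the original Transformer paper ("Attention Is All You Need"); learned and relative schemes, listed later in this section, swap in trainable or distance-based encodings instead.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    positions = np.arange(seq_len)[:, None]                  # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]                 # [1, d_model/2]
    angles = positions / np.power(10000.0, dims / d_model)   # [seq_len, d_model/2]
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=768)
# embeddings = token_embeddings + pe   # added element-wise, same shape
```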

Key Insight: The embedding matrix is often the largest single parameter block in a model. GPT-3's embedding matrix alone has roughly 617 million parameters (50,257 tokens × 12,288 dims)!

Why Embeddings Enable Intelligence

Embeddings transform the discrete, symbolic nature of language into a continuous space where: (1) Similar concepts are geometrically close, (2) Relationships are vector operations, (3) Gradients can flow for learning, and (4) Neural networks can process meaning mathematically. This transformation is fundamental—without it, deep learning on text wouldn't be possible.

Phase 1: Training Embeddings

Classical Methods

Word2Vec introduced two architectures that revolutionized NLP:

  • CBOW: Predict word from context
  • Skip-gram: Predict context from word

Trained on billions of words, Word2Vec captures analogies and semantic relationships.
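
A hedged example using gensim's `Word2Vec` class (gensim 4.x API, `pip install gensim`); the toy corpus exists only to show the calls, since real training needs billions of words.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding dimension
    window=5,          # context window size
    min_count=1,       # keep every word in this toy corpus
    sg=1,              # 1 = skip-gram, 0 = CBOW
)

vec = model.wv["cat"]                 # 100-dim vector for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbours by cosine similarity
```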

GloVe (Global Vectors for Word Representation):

  • Combines global matrix factorization
  • With local context windows
  • Explicitly encodes co-occurrence statistics

FastText extends Word2Vec with subword information (see the character n-gram sketch after this list):

  • Handles out-of-vocabulary words
  • Uses character n-grams
  • Better for morphologically rich languages
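
A small illustrative function for the character n-grams that FastText sums into a word vector; the `<`/`>` boundary markers and the 3-6 n-gram range follow the FastText paper's defaults, while the function itself is just a sketch.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Return all character n-grams of the word, wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams += [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
# An out-of-vocabulary word still shares n-grams (and thus vector pieces)
# with words seen during training.
```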

Modern Contextual Embeddings

Contextual embeddings are learned end-to-end with the model:

  • Context-dependent representations
  • Same word → different vectors in different contexts
  • Captures polysemy naturally

“Bank” near “river” vs “bank” near “money” get different embeddings!
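
A hedged sketch of that effect using the Hugging Face `transformers` library with `bert-base-uncased`; the exact similarity value depends on the model and layer, but the two "bank" vectors come out clearly different.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector of the token "bank" in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # [seq_len, 768]
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return hidden[position]

v_river = bank_vector("she sat on the bank of the river")
v_money = bank_vector("she deposited cash at the bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))   # typically well below 1.0
```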

Positional encodings add word-order information:

  • Sinusoidal: Fixed mathematical functions
  • Learned: Trainable position embeddings
  • Relative: Encode distances between tokens

Subword embeddings, for BPE/WordPiece tokenization (see the tokenizer sketch after this list):

  • Each subword token gets an embedding
  • Rare words built from common pieces
  • Enables open-vocabulary coverage of arbitrary text
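
A hedged one-liner with a WordPiece tokenizer (`bert-base-uncased` via `transformers`); the exact split shown in the comment is illustrative, but each piece maps to its own embedding row.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pieces = tokenizer.tokenize("unbelievability")
print(pieces)   # a rare word split into common subword pieces, e.g. ['un', '##bel', ...]
```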

Key Properties of Good Embeddings

  • Semantic Similarity (cosine distance): similar words → high cosine similarity
  • Linear Relationships (vector arithmetic): analogies via addition/subtraction
  • Dimensionality (128-1024): typical embedding dimensions
  • Coverage (99.9%+): share of real-world text representable with the vocabulary

Phase 2: Interactive Exploration

Embedding Statistics (example configuration)

  • Vector Dimension: 128 (parameters per token)
  • Total Tokens: 4 (unique embeddings)
  • Memory Usage: 2.00 KB (4 tokens × 128 dims × 4 bytes, float32)

PCA Projection (2D)

PCA reduces high-dimensional embeddings to 2D while preserving as much variance as possible. Nearby points have similar semantic meaning.
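
A brief sketch of the projection with scikit-learn's `PCA`; the random `embeddings` matrix is a stand-in for whatever [n_tokens × embed_dim] block is being visualised.

```python
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.randn(4, 128)           # placeholder: 4 tokens, 128 dims each
coords_2d = PCA(n_components=2).fit_transform(embeddings)
print(coords_2d.shape)                         # (4, 2): one (x, y) point per token
```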

First 5 Dimensions

Each dimension captures different semantic features. Similar words have similar patterns across dimensions.

Common Embedding Patterns

Semantic Clusters

Related words form tight clusters: animals group together, colors group together, emotions group together. This emerges naturally from training.

Frequency Effects

Common words often have smaller magnitudes and occupy central positions. Rare words can have more extreme values.

Polysemy Challenge

Static embeddings give one vector per word, struggling with multiple meanings. Contextual embeddings solve this.

Next: Attention Mechanism

Now that tokens are embedded as vectors, the attention mechanism determines how they interact. Attention allows the model to focus on relevant parts of the input, creating context-aware representations that capture long-range dependencies and complex relationships.
