Language Model Head

See how the model generates predictions and output probabilities

Generation Settings

Temperature: lower = more focused, higher = more creative
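To make that concrete, here is a minimal sketch (using NumPy, which is not part of this page, and made-up logits) of how dividing logits by the temperature before softmax changes the shape of the distribution:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative raw logits for five candidate tokens.
logits = np.array([4.0, 3.0, 2.0, 1.0, 0.0])

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(f"T={temperature}: {np.round(probs, 3)}")
# Lower T sharpens the distribution (more focused);
# higher T flattens it (more creative / random).
```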

Output Statistics

Vocabulary Size: 50,000 (total possible tokens)
Top Prediction: N/A (0.0% probability)
Perplexity: 100.00 (lower is better)
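Perplexity is the exponential of the average negative log-probability the model assigns to the actual next tokens: a perfectly confident model scores 1, and a uniform guess over the whole vocabulary scores the vocabulary size. A minimal NumPy sketch with illustrative values (not taken from this page):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    token_probs = np.asarray(token_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(token_probs))))

# Probabilities the model assigned to each actual next token (illustrative).
print(perplexity([0.25, 0.10, 0.50, 0.05]))   # ~6.3: fairly confident model
print(perplexity([1 / 50_000] * 4))           # 50,000: uniform over the vocabulary
```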

Top 10 Predictions

Logits vs Probabilities

Probability Distribution (First 100 tokens)

The long tail of the distribution shows how the remaining probability mass is spread thinly across the rest of the vocabulary
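One rough way to see that long tail is to sort the probabilities and check how much mass the top tokens cover. The sketch below uses random stand-in logits over a 50,000-token vocabulary, not real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000

# Stand-in logits; in a real model these come from the LM head projection.
logits = rng.normal(loc=0.0, scale=3.0, size=vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

sorted_probs = np.sort(probs)[::-1]
cumulative = np.cumsum(sorted_probs)
for k in (10, 100, 1_000):
    print(f"top {k:>5} tokens cover {cumulative[k - 1]:.1%} of the probability mass")
```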

Sampling Methods Comparison

Token Distribution by Method

Softmax Transformation Process

1. Raw Logits: Hidden states are projected to the vocabulary dimension (50,000)
2. Temperature Scaling: Logits are divided by the temperature (1.0)
3. Softmax: exp(logits) / sum(exp(logits)) → probabilities sum to 1.0
4. Sampling: Apply Top-K (50) or Top-P (0.90) filtering
5. Selection: Sample from the filtered distribution or pick the maximum-probability token (see the sketch below)
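A minimal NumPy sketch of these five steps, using the default settings shown above (temperature 1.0, Top-K 50, Top-P 0.90); the function name and the random logits are assumptions for illustration, not this page's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.90, greedy=False):
    """Walk through steps 2-5 above for a single position."""
    # 2. Temperature scaling.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # 3. Numerically stable softmax over the full vocabulary.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # 4a. Top-K: keep only the K most likely tokens.
    keep = np.argsort(probs)[::-1][:top_k]
    # 4b. Top-P: trim further to the smallest prefix whose mass reaches top_p.
    cutoff = int(np.searchsorted(np.cumsum(probs[keep]), top_p)) + 1
    keep = keep[:cutoff]
    filtered = probs[keep] / probs[keep].sum()
    # 5. Selection: greedy argmax or sampling from the filtered distribution.
    if greedy:
        return int(keep[np.argmax(filtered)])
    return int(rng.choice(keep, p=filtered))

# 1. Raw logits: a random stand-in for hidden states projected to vocabulary size.
logits = rng.normal(size=50_000)
print(sample_next_token(logits))               # sampled from the filtered pool
print(sample_next_token(logits, greedy=True))  # always the most likely token
```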

Understanding the LM Head

The Language Model head projects the final hidden states to vocabulary-sized logits, then applies softmax to turn them into probabilities. Different sampling strategies (greedy, temperature, top-k, top-p) control the randomness and diversity of the generated text: temperature scales the logits before softmax, while top-k and top-p restrict the sampling pool to the most likely tokens.
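A minimal PyTorch-style sketch of that projection, assuming illustrative sizes (50,000-token vocabulary, 768-dimensional hidden states); the layer and variable names are hypothetical, not this page's actual code:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 768   # illustrative sizes, not from this page

# The LM head is a linear projection from the hidden dimension to the vocabulary
# (in many real models its weights are tied to the input embedding matrix).
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

hidden_state = torch.randn(1, hidden_size)        # final hidden state for one position
logits = lm_head(hidden_state)                    # shape: (1, vocab_size)
probs = torch.softmax(logits / 1.0, dim=-1)       # temperature 1.0 leaves logits unchanged
top = torch.topk(probs, k=10)                     # the "Top 10 Predictions" panel above
print(top.values, top.indices)
```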