Language Model Head
See how the model generates predictions and output probabilities
Generation Settings
Temperature: lower = more focused, higher = more creative
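A minimal sketch of what the temperature setting does, using NumPy and a made-up 5-token logit vector (the values are illustrative, not taken from a real model). Dividing the logits by a low temperature sharpens the softmax distribution; a high temperature flattens it.

import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up logits for a tiny 5-token vocabulary
logits = np.array([4.0, 2.5, 1.0, 0.5, -1.0])

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(temperature, np.round(probs, 3))

# Low temperature concentrates probability mass on the top token (more focused);
# high temperature spreads it out (more creative).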
Output Statistics
Vocabulary Size: 50,000 (total possible tokens)
Top Prediction: N/A (0.0% probability)
Perplexity: 100.00 (lower is better)
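For reference, the perplexity shown here follows the usual definition: the exponential of the average negative log-probability the model assigned to the observed tokens. A short sketch with made-up probabilities (not the values displayed above):

import numpy as np

# Hypothetical probabilities the model assigned to the actual next tokens
token_probs = np.array([0.25, 0.10, 0.05, 0.40, 0.02])

# Perplexity = exp of the average negative log-probability; lower is better
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(round(perplexity, 2))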
Top 10 Predictions
Logits vs Probabilities
Probability Distribution (First 100 tokens)
The long tail shows how the remaining probability mass is spread across the rest of the vocabulary, beyond the handful of high-probability tokens
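A rough way to see this head-versus-tail split numerically, using randomly generated stand-in logits over a 50,000-token vocabulary (a trained model's distribution is usually far more peaked than this random example):

import numpy as np

rng = np.random.default_rng(0)

# Random stand-in logits over a 50,000-token vocabulary (illustration only)
logits = rng.normal(size=50_000)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

sorted_probs = np.sort(probs)[::-1]      # highest probability first
head_mass = sorted_probs[:100].sum()     # mass in the first 100 tokens
tail_mass = sorted_probs[100:].sum()     # mass spread over the long tail
print(f"first 100 tokens: {head_mass:.3f}, long tail: {tail_mass:.3f}")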
Sampling Methods Comparison
Token Distribution by Method
Softmax Transformation Process
1. Raw Logits: Hidden states projected to vocabulary dimension (50000)
2. Temperature Scaling: Logits divided by temperature (1)
3. Softmax: exp(logits) / sum(exp(logits)) → probabilities sum to 1.0
4. Sampling: Apply Top-K (50) or Top-P (0.90) filtering
5. Selection: Sample from the filtered distribution, or pick the maximum-probability token (greedy decoding)
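A minimal end-to-end sketch of these five steps in NumPy. It assumes random stand-in logits in place of real model output and one common ordering of the filters (top-k first, then top-p within the survivors):

import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.90):
    # 2. Temperature scaling: divide the raw logits by the temperature
    scaled = logits / temperature

    # 3. Softmax: exponentiate and normalize so the probabilities sum to 1.0
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # 4a. Top-K: keep only the K most probable tokens
    order = np.argsort(probs)[::-1]
    keep = order[:top_k]

    # 4b. Top-P: within those, keep the smallest prefix whose mass reaches top_p
    cumulative = np.cumsum(probs[keep])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = keep[:cutoff]

    # Renormalize the surviving probabilities
    filtered = probs[keep] / probs[keep].sum()

    # 5. Selection: sample a token id from the filtered distribution
    return rng.choice(keep, p=filtered)

# 1. Raw logits: stand-ins for hidden states projected to a 50,000-token vocabulary
logits = rng.normal(size=50_000)
print(sample_next_token(logits))

Taking np.argmax(probs) instead of sampling at the last step reproduces greedy decoding.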
Understanding the LM Head
The Language Model head transforms the final hidden states into vocabulary-sized logits, then applies softmax to obtain probabilities. Different sampling strategies (greedy, temperature, top-k, top-p) control the randomness and diversity of the generated text: temperature rescales the logits before the softmax, while top-k and top-p restrict the pool of tokens that can be sampled.
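As a sketch of the projection step itself, assuming the simplest case of a single linear head, with illustrative sizes (hidden_dim=768, vocab_size=50,000) and random weights standing in for trained ones:

import numpy as np

rng = np.random.default_rng(0)

hidden_dim, vocab_size = 768, 50_000     # illustrative sizes, not tied to a specific model

# The LM head maps the final hidden state of the last position
# to one logit per vocabulary token via a [hidden_dim x vocab_size] matrix.
W = rng.normal(scale=0.02, size=(hidden_dim, vocab_size))
hidden_state = rng.normal(size=hidden_dim)   # stand-in for the transformer's output

logits = hidden_state @ W                    # shape: (vocab_size,)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

top_token = int(np.argmax(probs))            # greedy decoding picks this token id
print(top_token, probs[top_token])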