Language Model Head

See how the model generates predictions and output probabilities

Generation Settings

Temperature: lower = more focused, higher = more creative
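To make that concrete, here is a minimal sketch (using NumPy, which is not part of this page, and made-up logits) of how dividing logits by the temperature before softmax changes the shape of the distribution:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Illustrative raw logits for five candidate tokens.
logits = np.array([4.0, 3.0, 2.0, 1.0, 0.0])

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits / temperature)
    print(f"T={temperature}: {np.round(probs, 3)}")
# Lower T sharpens the distribution (more focused);
# higher T flattens it (more creative / random).
```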

Output Statistics

Vocabulary Size: 50,000 (total possible tokens)
Top Prediction: N/A (0.0% probability)
Perplexity: 100.00 (lower is better)
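Perplexity is the exponential of the average negative log-probability the model assigns to the actual next tokens: a perfectly confident model scores 1, and a uniform guess over the whole vocabulary scores the vocabulary size. A minimal NumPy sketch with illustrative values (not taken from this page):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    token_probs = np.asarray(token_probs, dtype=np.float64)
    return float(np.exp(-np.mean(np.log(token_probs))))

# Probabilities the model assigned to each actual next token (illustrative).
print(perplexity([0.25, 0.10, 0.50, 0.05]))   # ~6.3: fairly confident model
print(perplexity([1 / 50_000] * 4))           # 50,000: uniform over the vocabulary
```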

Top 10 Predictions

Logits vs Probabilities

Probability Distribution (First 100 tokens)

The long tail of the distribution shows how the remaining probability mass is spread thinly across the rest of the vocabulary
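One rough way to see that long tail is to sort the probabilities and check how much mass the top tokens cover. The sketch below uses random stand-in logits over a 50,000-token vocabulary, not real model outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 50_000

# Stand-in logits; in a real model these come from the LM head projection.
logits = rng.normal(loc=0.0, scale=3.0, size=vocab_size)
probs = np.exp(logits - logits.max())
probs /= probs.sum()

sorted_probs = np.sort(probs)[::-1]
cumulative = np.cumsum(sorted_probs)
for k in (10, 100, 1_000):
    print(f"top {k:>5} tokens cover {cumulative[k - 1]:.1%} of the probability mass")
```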

Sampling Methods Comparison

Token Distribution by Method

Softmax Transformation Process

1. Raw Logits: Hidden states are projected to the vocabulary dimension (50,000)
2. Temperature Scaling: Logits are divided by the temperature (1.0)
3. Softmax: exp(logits) / sum(exp(logits)) → probabilities sum to 1.0
4. Sampling: Apply Top-K (50) or Top-P (0.90) filtering
5. Selection: Sample from the filtered distribution or pick the maximum-probability token (see the sketch below)
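A minimal NumPy sketch of these five steps, using the default settings shown above (temperature 1.0, Top-K 50, Top-P 0.90); the function name and the random logits are assumptions for illustration, not this page's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits, temperature=1.0, top_k=50, top_p=0.90, greedy=False):
    """Walk through steps 2-5 above for a single position."""
    # 2. Temperature scaling.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # 3. Numerically stable softmax over the full vocabulary.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # 4a. Top-K: keep only the K most likely tokens.
    keep = np.argsort(probs)[::-1][:top_k]
    # 4b. Top-P: trim further to the smallest prefix whose mass reaches top_p.
    cutoff = int(np.searchsorted(np.cumsum(probs[keep]), top_p)) + 1
    keep = keep[:cutoff]
    filtered = probs[keep] / probs[keep].sum()
    # 5. Selection: greedy argmax or sampling from the filtered distribution.
    if greedy:
        return int(keep[np.argmax(filtered)])
    return int(rng.choice(keep, p=filtered))

# 1. Raw logits: a random stand-in for hidden states projected to vocabulary size.
logits = rng.normal(size=50_000)
print(sample_next_token(logits))               # sampled from the filtered pool
print(sample_next_token(logits, greedy=True))  # always the most likely token
```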

Understanding the LM Head

The Language Model head projects the final hidden states to vocabulary-sized logits, then applies softmax to turn them into probabilities. Different sampling strategies (greedy, temperature, top-k, top-p) control the randomness and diversity of the generated text: temperature scales the logits before softmax, while top-k and top-p restrict the sampling pool to the most likely tokens.
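A minimal PyTorch-style sketch of that projection, assuming illustrative sizes (50,000-token vocabulary, 768-dimensional hidden states); the layer and variable names are hypothetical, not this page's actual code:

```python
import torch
import torch.nn as nn

vocab_size, hidden_size = 50_000, 768   # illustrative sizes, not from this page

# The LM head is a linear projection from the hidden dimension to the vocabulary
# (in many real models its weights are tied to the input embedding matrix).
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

hidden_state = torch.randn(1, hidden_size)        # final hidden state for one position
logits = lm_head(hidden_state)                    # shape: (1, vocab_size)
probs = torch.softmax(logits / 1.0, dim=-1)       # temperature 1.0 leaves logits unchanged
top = torch.topk(probs, k=10)                     # the "Top 10 Predictions" panel above
print(top.values, top.indices)
```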