202602241347-llm-generation-parameters
🎯 Core Idea
LLM temperature is part of the decoding configuration that controls how the model turns next-token probabilities into an actual sequence. It does not change what the model knows. It changes how deterministic the model is when multiple plausible next tokens exist.
At a high level, decoding has two layers:
- Model distribution: the model produces a probability distribution over the next token.
- Sampling strategy: you choose how to pick the next token from that distribution.
Temperature rescales logits before sampling. Lower temperature makes the distribution sharper (more deterministic, more likely to choose the top token). Higher temperature flattens the distribution (more variety, more risk of drifting).
Common generation parameters (and what they do)
- temperature
  - Primary knob for randomness.
  - Practical guidance: pick one main randomness knob and tune it first.
- top_p (nucleus sampling)
  - Restricts sampling to the smallest set of tokens whose cumulative probability is at least top_p.
  - Often used as an alternative to temperature. Many APIs recommend not tuning both temperature and top_p at the same time.
- top_k
  - Restricts sampling to the top K tokens by probability.
  - Useful for trimming long-tail tokens, but can be brittle if K is too small.
- max_tokens / max_output_tokens / max_new_tokens
  - Hard stop for output length.
  - This is a safety and cost control. Too low can truncate before the model finishes.
- stop sequences
  - Another hard stop, driven by text patterns.
  - Critical for structured outputs and tool-like interactions.
- repetition penalty / presence penalty / frequency penalty
  - Controls whether the model repeats itself.
  - Useful when the model loops, overuses phrases, or keeps returning to the same items.
- decoding mode flags (API-dependent)
  - do_sample: sampling on/off.
  - num_beams: beam search width.
  - These can change the output more than temperature does.
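In practice these knobs travel together in a single request. A hedged sketch of a request payload using OpenAI-style field names (the model name and values are placeholders; other providers expose similar but differently named fields):

```python
# Hypothetical request payload combining the knobs above.
request = {
    "model": "example-model",   # placeholder model name
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "temperature": 0.2,         # low: deterministic, extraction-friendly
    "top_p": 1.0,               # fixed at 1 so temperature is the only randomness knob
    "max_tokens": 256,          # hard output budget
    "stop": ["\n\n"],           # stop at the first blank line
    "presence_penalty": 0.0,
    "frequency_penalty": 0.3,   # mild nudge against repeated wording
}
```

Keeping top_p pinned at 1 here follows the one-knob rule: any change in behavior can then be attributed to temperature alone.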
How to tune for better performance
- Start with the task type
  - Factual extraction, classification, deterministic transformation
    - temperature: low (often 0 to 0.2)
    - prefer tighter stops and shorter max tokens
  - General writing and brainstorming
    - temperature: medium (often 0.5 to 0.9)
    - consider top_p tuning if you want more diversity without extreme randomness
  - Code generation
    - temperature: low to medium (often 0.1 to 0.4)
    - strong stop sequences and max tokens help avoid run-on outputs
- Avoid coupling knobs early
- If you change temperature and top_p together, it becomes harder to understand what caused the change.
- A common practice is to fix top_p at 1 and tune temperature, or fix temperature and tune top_p.
- Measure output quality with a small evaluation set
- Make a small list of representative prompts and desired outputs.
- Compare configurations on correctness, consistency, and failure modes.
- Treat prompt quality as a first-class parameter
- A cleaner prompt and clearer constraints often beat parameter tuning.
- Parameter tuning is usually best for controlling variance after the prompt is already stable.
🌲 Branching Questions
➡ How does temperature mathematically affect sampling, and what does more deterministic actually mean?
Most modern APIs implement sampling by taking the model’s logits (unnormalized scores) for the next token, optionally transforming them, then applying a softmax to get probabilities.
Temperature is a scalar applied before softmax. A common form is:
- scaled_logits = logits / temperature
Then:
- probs = softmax(scaled_logits)
If temperature is less than 1, dividing by a small number increases the magnitude of differences between logits. That makes the softmax distribution sharper: probability mass concentrates on the top tokens.
If temperature approaches 0, the distribution approaches an argmax decision rule. In practice many implementations treat temperature 0 as greedy decoding.
More deterministic means:
- given the same prompt and system state, the output varies less across runs
- the model is more likely to pick the most probable token at each step
This does not guarantee factual correctness. It reduces variance. If the most probable continuation is wrong, low temperature will repeat that wrong continuation consistently.
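The rescaling can be seen numerically with a toy distribution. A minimal sketch using NumPy (the logits are made up; real vocabularies have tens of thousands of entries):

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # toy next-token scores

# Lower temperature sharpens the distribution; higher flattens it.
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax(logits / t), 3))
```

At t = 0.5 the top token's probability rises well above its t = 1 value, and at t = 2 the three tokens move closer to uniform, which is exactly the sharper-vs-flatter behavior described above.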
➡ When should I use top_p instead of temperature, and why do many APIs recommend not tuning both at once?
top_p is nucleus sampling. Instead of allowing every token to be sampled, you restrict the candidate set to the smallest set of tokens whose cumulative probability is at least p. Then you sample only within that set.
When to use top_p:
- you want to avoid sampling from the very low-probability tail (which tends to produce weird or off-topic tokens)
- you want a diversity control that adapts per step, because the nucleus size changes depending on how confident the model is
Why not tune both at once:
- temperature changes the shape of the probability distribution
- top_p changes which part of the distribution is allowed
Changing both makes it harder to reason about the effective sampling behavior. You can end up in confusing states such as:
- high temperature (flatter distribution) plus low top_p (tiny nucleus) which can create brittle behavior
- low temperature plus high top_p, which may behave very similarly to greedy decoding
A practical rule is to pick one main randomness knob:
- either tune temperature and keep top_p at 1
- or tune top_p and keep temperature at 1
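The nucleus step can be sketched in a few lines. This is a minimal illustration, not a library API; the helper name `nucleus_filter` and the toy probabilities are mine:

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]                 # indices sorted by prob, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1 # smallest prefix reaching top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                # renormalize within the nucleus

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(nucleus_filter(probs, 0.8))   # keeps only the first two tokens
```

Note how the nucleus size adapts: a confident model (one dominant token) gets a tiny candidate set, while an uncertain model keeps many candidates, which is the per-step adaptivity mentioned above.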
➡ What is top_k good for, and what are common failure modes when it’s too small?
top_k restricts sampling to the K highest probability tokens at each step. It is a simpler hard cutoff than top_p.
What it’s good for:
- trimming the long tail of tokens, which reduces weird low-probability sampling
- making sampling behavior more predictable than unconstrained temperature sampling
Common failure modes when K is too small:
- repetitive or generic outputs, because the model is forced to reuse a small token set
- loss of rare but correct tokens, especially for domains with specialized vocabulary
- brittle phrasing, where the model cannot choose a token that would have led to a better sentence later
If you use top_k, keep it high enough that the model still has room to express the right term, and use it mainly as a tail-trimming tool.
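Compared to the nucleus approach, top_k is a fixed-size cutoff. A minimal sketch with made-up probabilities (`top_k_filter` is an illustrative name, not a library function):

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, renormalize."""
    keep = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
print(top_k_filter(probs, 2))   # only the top two tokens remain
```

With k = 2 the specialized-vocabulary failure mode is visible: the third token (here 0.15) is discarded outright even though it might have been the correct rare term.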
➡ How do repetition penalties (presence or frequency or repetition_penalty) differ, and when should I use each?
Different providers expose different names, but the intent is to prevent degenerate loops and reduce repetitive phrasing.
- presence penalty
  - penalizes tokens that have appeared at all
  - encourages introducing new tokens and therefore new topics
  - useful when the model keeps returning to the same idea or refuses to move on
- frequency penalty
  - penalizes tokens proportionally to how often they have appeared
  - reduces repeated words and repeated phrases
  - useful when outputs contain obvious repeated wording
- repetition_penalty (common in open-source generation configs)
  - typically rescales logits for tokens that have already occurred
  - values above 1 penalize repetition, values below 1 can encourage repetition
  - useful when you see looping, repeated lines, or repeated sentence starters
Practical usage:
- if the model repeats phrases, start with a mild frequency-like penalty
- if the model gets stuck on the same topic or structure, presence-like penalties can help
- if the model loops badly, add a hard stop sequence and then use repetition_penalty as a second line of defense
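The logit-rescaling form of repetition_penalty can be sketched as follows. This mirrors the CTRL-style rule used in common open-source generation configs (divide positive logits, multiply negative ones); the helper name and toy values are illustrative:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Rescale logits of tokens that already appeared. penalty > 1
    discourages repeats; penalty < 1 would encourage them."""
    out = logits.copy()
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty    # shrink positive scores toward zero
        else:
            out[tok] *= penalty    # push negative scores further down
    return out

logits = np.array([3.0, 1.0, -0.5])
print(apply_repetition_penalty(logits, [0, 2], penalty=1.2))  # -> [2.5, 1.0, -0.6]
```

The asymmetric treatment of positive and negative logits is why the same penalty value always moves already-seen tokens toward lower probability, regardless of sign.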
➡ How should I tune parameters differently for extraction, summarization, writing, and code generation?
Extraction and classification:
- goal: consistency and minimal hallucination
- suggested:
  - temperature low
  - top_p high or fixed
  - short max tokens
  - strict stop sequences
Summarization:
- goal: faithful compression
- suggested:
  - temperature low to medium
  - stop sequences to prevent drifting into new content
  - max tokens sized to the expected summary length
Writing and brainstorming:
- goal: variety and useful options
- suggested:
  - temperature medium
  - optionally constrain with top_p to avoid very low-probability weirdness
  - allow longer outputs
Code generation:
- goal: correctness and adherence to constraints
- suggested:
  - temperature low to medium
  - strong stop sequences, especially if you want only code
  - constrain max tokens to avoid long irrelevant commentary
A good default approach is to start conservative and only increase randomness when you need more diversity.
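These per-task suggestions are easy to keep as named presets. A hypothetical sketch (field names follow common API conventions; every value here is a starting point to tune, not a recommendation from any provider):

```python
# Illustrative per-task decoding presets mirroring the guidance above.
PRESETS = {
    "extraction":    {"temperature": 0.0, "top_p": 1.0, "max_tokens": 128, "stop": ["\n\n"]},
    "summarization": {"temperature": 0.3, "top_p": 1.0, "max_tokens": 300, "stop": []},
    "writing":       {"temperature": 0.8, "top_p": 0.95, "max_tokens": 800, "stop": []},
    "code":          {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512, "stop": ["```"]},
}

# Usage: merge a preset into a request instead of hand-picking values each time.
request = {"model": "example-model", **PRESETS["extraction"]}
```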
➡ What is the relationship between decoding mode (sampling vs beam search) and temperature or top_p?
There are two broad families:
- Sampling-based decoding
  - chooses the next token probabilistically
  - temperature, top_p, and top_k directly matter
- Search-based decoding (beam search)
  - keeps multiple partial hypotheses and expands them to find high-likelihood sequences
  - can reduce randomness even at higher temperature settings, depending on implementation
In practice:
- for chat assistants, sampling is the common default
- beam search can produce more stable and high-likelihood outputs, but it can also produce bland or repetitive responses
If you are optimizing for consistency, you can often get most of the benefit by using greedy decoding or low-temperature sampling without introducing beam search.
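At the single-step level, the greedy-vs-sampling distinction reduces to one line of code. A toy sketch (real decoders repeat this choice over the whole sequence; the probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.6, 0.3, 0.1])   # toy next-token distribution

greedy = int(np.argmax(probs))                       # deterministic: always index 0
samples = rng.choice(len(probs), size=10, p=probs)   # stochastic: varies with the seed

print("greedy:", greedy)
print("sampled:", samples)
```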
➡ How do max_tokens and stop sequences affect quality and cost, and how do I choose them safely?
- max_tokens controls the output budget.
  - too low: truncation, unfinished answers, or cut-off code
  - too high: wasted cost, more chance to drift or ramble
- stop sequences are deterministic guards.
  - they tell the decoder to stop when a substring appears
  - they are essential for structured formats and tool-like usage
Safe selection strategy:
- estimate an upper bound for the expected output length
- set max_tokens slightly above that bound
- add stop sequences that match your output format boundaries
If you see the model rambling, stop sequences are usually a better fix than lowering temperature.
Why this is often true:
- Temperature mostly changes how much randomness you allow when choosing the next token. It reduces variance, but it does not define where the answer should end. A low-temperature model can still produce a long continuation if the prompt allows it.
- Stop sequences define an explicit boundary. They tell the decoder to stop when a specific string appears, which cuts off run-on sections even when the model would otherwise keep going.
Practical examples of stop sequences:
- If you want a single JSON object, use a stop sequence like `\n\n` after the JSON block, or a sentinel like `\nEND\n` that you instruct the model to emit at the end.
- If you want exactly N bullet points, you can stop when the model starts the next header or delimiter (for example `\n###`).
- For tool-like outputs, stop at a marker such as a closing code fence, or at the beginning of a References section if you want to prevent extra commentary.
A good pattern is: define the output contract in the prompt, then enforce it with stop sequences and max_tokens. Use temperature only to control style and variance after the output shape is constrained.
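The sentinel pattern can be sketched as a simple truncation rule. Real APIs apply this during decoding (so you are not billed for the cut-off text); this post-hoc version just shows the logic, and the helper name is mine:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)   # earliest match wins
    return text[:cut]

out = "Answer: 42\nEND\nSome extra rambling..."
print(truncate_at_stop(out, ["\nEND\n"]))   # -> "Answer: 42"
```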
➡ What is a practical tuning workflow (small eval set, metrics, iteration) that doesn’t waste time?
A practical workflow:
- Define the output contract
- what counts as correct
- what formatting is required
- what failure looks like
- Create a small evaluation set
- 20 to 50 representative prompts
- include edge cases and adversarial cases
- Start from a baseline
- keep temperature, top_p, and top_k at defaults
- focus on prompt clarity first
- Tune one knob at a time
- change temperature or top_p, not both
- observe changes in variance, verbosity, and error types
- Track metrics that matter
- pass rate on required constraints
- hallucination rate (for factual tasks)
- repetition rate
- average token usage
- Lock the config and monitor drift
- models change, so re-run the eval set after model upgrades
The goal is to treat decoding params like config, not like creative experimentation.
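The workflow above can be sketched as a tiny one-knob sweep. Here `generate` is a hypothetical stand-in for a real API call (it just uppercases the prompt so the sketch runs offline), and the eval cases are invented:

```python
# Minimal sketch of a one-knob-at-a-time sweep over a small eval set.
def generate(prompt, temperature):
    # Placeholder for your provider's API call; deterministic fake model.
    return prompt.upper()

EVAL_SET = [
    {"prompt": "extract the date: meeting on 2024-05-01", "must_contain": "2024-05-01"},
    {"prompt": "extract the id: ticket ab-123", "must_contain": "AB-123"},
]

def pass_rate(temperature):
    """Fraction of eval cases whose output satisfies the required constraint."""
    hits = sum(
        case["must_contain"] in generate(case["prompt"], temperature)
        for case in EVAL_SET
    )
    return hits / len(EVAL_SET)

# Sweep only temperature; top_p and top_k stay at their defaults.
for t in (0.0, 0.3, 0.7):
    print(t, pass_rate(t))
```

With a real model the pass rates would differ across temperatures; the point is that the harness, not manual inspection, decides which configuration to lock in.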
📚 References
- https://huggingface.co/docs/transformers/en/main_classes/text_generation
- https://help.openai.com/en/articles/5072263-how-do-i-use-stop-sequences