202602241347-llm-generation-parameters
🎯 Core Idea
LLM temperature is part of the decoding configuration that controls how the model turns next-token probabilities into an actual sequence. It does not change what the model knows. It changes how deterministic the model is when multiple plausible next tokens exist.
At a high level, decoding has two layers:
- Model distribution: the model produces a probability distribution over the next token.
- Sampling strategy: you choose how to pick the next token from that distribution.
Temperature rescales logits before sampling. Lower temperature makes the distribution sharper (more deterministic, more likely to choose the top token). Higher temperature flattens the distribution (more variety, more risk of drifting).
Common generation parameters (and what they do)
- temperature
  - Primary knob for randomness.
  - Practical guidance: pick one main randomness knob and tune it first.
- top_p (nucleus sampling)
  - Restricts sampling to the smallest set of tokens whose cumulative probability is at least top_p.
  - Often used as an alternative to temperature. Many APIs recommend not tuning both temperature and top_p at the same time.
- top_k
  - Restricts sampling to the top K tokens by probability.
  - Useful for trimming long-tail tokens, but can be brittle if K is too small.
- max_tokens / max_output_tokens / max_new_tokens
  - Hard stop for output length.
  - This is a safety and cost control. Too low can truncate before the model finishes.
- stop sequences
  - Another hard stop, driven by text patterns.
  - Critical for structured outputs and tool-like interactions.
- repetition penalty / presence penalty / frequency penalty
  - Controls whether the model repeats itself.
  - Useful when the model loops, overuses phrases, or keeps returning to the same items.
- decoding mode flags (API-dependent)
  - do_sample: sampling on/off.
  - num_beams: beam search width.
  - These can change the output more than temperature does.
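In practice these knobs travel together in a single request. A hedged sketch of a request payload using OpenAI-style field names (the model name and values are placeholders; other providers expose similar but differently named fields):

```python
# Hypothetical request payload combining the knobs above.
request = {
    "model": "example-model",   # placeholder model name
    "messages": [{"role": "user", "content": "Summarize this ticket."}],
    "temperature": 0.2,         # low: deterministic, extraction-friendly
    "top_p": 1.0,               # fixed at 1 so temperature is the only randomness knob
    "max_tokens": 256,          # hard output budget
    "stop": ["\n\n"],           # stop at the first blank line
    "presence_penalty": 0.0,
    "frequency_penalty": 0.3,   # mild nudge against repeated wording
}
```

Keeping top_p pinned at 1 here follows the one-knob rule: any change in behavior can then be attributed to temperature alone.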
How to tune for better performance
- Start with the task type
  - Factual extraction, classification, deterministic transformation
    - temperature: low (often 0 to 0.2)
    - prefer tighter stops and shorter max tokens
  - General writing and brainstorming
    - temperature: medium (often 0.5 to 0.9)
    - consider top_p tuning if you want more diversity without extreme randomness
  - Code generation
    - temperature: low to medium (often 0.1 to 0.4)
    - strong stop sequences and max tokens help avoid run-on outputs
- Avoid coupling knobs early
- If you change temperature and top_p together, it becomes harder to understand what caused the change.
- A common practice is to fix top_p at 1 and tune temperature, or fix temperature and tune top_p.
- Measure output quality with a small evaluation set
- Make a small list of representative prompts and desired outputs.
- Compare configurations on correctness, consistency, and failure modes.
- Treat prompt quality as a first-class parameter
- A cleaner prompt and clearer constraints often beat parameter tuning.
- Parameter tuning is usually best for controlling variance after the prompt is already stable.
🌲 Branching Questions
➡ How does temperature mathematically affect sampling, and what does more deterministic actually mean?
Most modern APIs implement sampling by taking the model’s logits (unnormalized scores) for the next token, optionally transforming them, then applying a softmax to get probabilities.
Temperature is a scalar applied before softmax. A common form is:
- scaled_logits = logits / temperature
Then:
- probs = softmax(scaled_logits)
If temperature is less than 1, dividing by a small number increases the magnitude of differences between logits. That makes the softmax distribution sharper: probability mass concentrates on the top tokens.
If temperature approaches 0, the distribution approaches an argmax decision rule. In practice many implementations treat temperature 0 as greedy decoding.
More deterministic means:
- given the same prompt and system state, the output varies less across runs
- the model is more likely to pick the most probable token at each step
This does not guarantee factual correctness. It reduces variance. If the most probable continuation is wrong, low temperature will repeat that wrong continuation consistently.
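The rescaling can be seen numerically with a toy distribution. A minimal sketch using NumPy (the logits are made up; real vocabularies have tens of thousands of entries):

```python
import numpy as np

def softmax(x):
    z = x - np.max(x)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])   # toy next-token scores

# Lower temperature sharpens the distribution; higher flattens it.
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax(logits / t), 3))
```

At t = 0.5 the top token's probability rises well above its t = 1 value, and at t = 2 the three tokens move closer to uniform, which is exactly the sharper-vs-flatter behavior described above.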
➡ When should I use top_p instead of temperature, and why do many APIs recommend not tuning both at once?
top_p is nucleus sampling. Instead of allowing every token to be sampled, you restrict the candidate set to the smallest set of tokens whose cumulative probability is at least p. Then you sample only within that set.
When to use top_p:
- you want to avoid sampling from the very low-probability tail (which tends to produce weird or off-topic tokens)
- you want a diversity control that adapts per step, because the nucleus size changes depending on how confident the model is
Why not tune both at once:
- temperature changes the shape of the probability distribution
- top_p changes which part of the distribution is allowed
Changing both makes it harder to reason about the effective sampling behavior. You can end up in confusing states such as:
- high temperature (flatter distribution) plus low top_p (tiny nucleus) which can create brittle behavior
- low temperature plus high top_p, which may behave very similarly to greedy decoding
A practical rule is to pick one main randomness knob:
- either tune temperature and keep top_p at 1
- or tune top_p and keep temperature at 1
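The nucleus step can be sketched in a few lines. This is a minimal illustration, not a library API; the helper name `nucleus_filter` and the toy probabilities are mine:

```python
import numpy as np

def nucleus_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]                 # indices sorted by prob, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1 # smallest prefix reaching top_p
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                # renormalize within the nucleus

probs = np.array([0.5, 0.3, 0.1, 0.05, 0.05])
print(nucleus_filter(probs, 0.8))   # keeps only the first two tokens
```

Note how the nucleus size adapts: a confident model (one dominant token) gets a tiny candidate set, while an uncertain model keeps many candidates, which is the per-step adaptivity mentioned above.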
➡ What is top_k good for, and what are common failure modes when it’s too small?
top_k restricts sampling to the K highest probability tokens at each step. It is a simpler hard cutoff than top_p.
What it’s good for:
- trimming the long tail of tokens, which reduces weird low-probability sampling
- making sampling behavior more predictable than unconstrained temperature sampling
Common failure modes when K is too small:
- repetitive or generic outputs, because the model is forced to reuse a small token set
- loss of rare but correct tokens, especially for domains with specialized vocabulary
- brittle phrasing, where the model cannot choose a token that would have led to a better sentence later
If you use top_k, keep it high enough that the model still has room to express the right term, and use it mainly as a tail-trimming tool.
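Compared to the nucleus approach, top_k is a fixed-size cutoff. A minimal sketch with made-up probabilities (`top_k_filter` is an illustrative name, not a library function):

```python
import numpy as np

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, renormalize."""
    keep = np.argsort(probs)[::-1][:k]   # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

probs = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
print(top_k_filter(probs, 2))   # only the top two tokens remain
```

With k = 2 the specialized-vocabulary failure mode is visible: the third token (here 0.15) is discarded outright even though it might have been the correct rare term.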
➡ How do repetition penalties (presence or frequency or repetition_penalty) differ, and when should I use each?
Different providers expose different names, but the intent is to prevent degenerate loops and reduce repetitive phrasing.
- presence penalty
  - penalizes tokens that have appeared at all
  - encourages introducing new tokens and therefore new topics
  - useful when the model keeps returning to the same idea or refuses to move on
- frequency penalty
  - penalizes tokens proportionally to how often they have appeared
  - reduces repeated words and repeated phrases
  - useful when outputs contain obvious repeated wording
- repetition_penalty (common in open-source generation configs)
  - typically rescales logits for tokens that have already occurred
  - values above 1 penalize repetition, values below 1 can encourage repetition
  - useful when you see looping, repeated lines, or repeated sentence starters
Practical usage:
- if the model repeats phrases, start with a mild frequency-like penalty
- if the model gets stuck on the same topic or structure, presence-like penalties can help
- if the model loops badly, add a hard stop sequence and then use repetition_penalty as a second line of defense
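The logit-rescaling form of repetition_penalty can be sketched as follows. This mirrors the CTRL-style rule used in common open-source generation configs (divide positive logits, multiply negative ones); the helper name and toy values are illustrative:

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Rescale logits of tokens that already appeared. penalty > 1
    discourages repeats; penalty < 1 would encourage them."""
    out = logits.copy()
    for tok in set(generated_ids):
        if out[tok] > 0:
            out[tok] /= penalty    # shrink positive scores toward zero
        else:
            out[tok] *= penalty    # push negative scores further down
    return out

logits = np.array([3.0, 1.0, -0.5])
print(apply_repetition_penalty(logits, [0, 2], penalty=1.2))  # -> [2.5, 1.0, -0.6]
```

The asymmetric treatment of positive and negative logits is why the same penalty value always moves already-seen tokens toward lower probability, regardless of sign.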
➡ How should I tune parameters differently for extraction, summarization, writing, and code generation?
Extraction and classification:
- goal: consistency and minimal hallucination
- suggested:
  - temperature low
  - top_p high or fixed
  - short max tokens
  - strict stop sequences
Summarization:
- goal: faithful compression
- suggested:
  - temperature low to medium
  - stop sequences to prevent drifting into new content
  - max tokens sized to the expected summary length
Writing and brainstorming:
- goal: variety and useful options
- suggested:
  - temperature medium
  - optionally constrain with top_p to avoid very low-probability weirdness
  - allow longer outputs
Code generation:
- goal: correctness and adherence to constraints
- suggested:
  - temperature low to medium
  - strong stop sequences, especially if you want only code
  - constrain max tokens to avoid long irrelevant commentary
A good default approach is to start conservative and only increase randomness when you need more diversity.
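These per-task suggestions are easy to keep as named presets. A hypothetical sketch (field names follow common API conventions; every value here is a starting point to tune, not a recommendation from any provider):

```python
# Illustrative per-task decoding presets mirroring the guidance above.
PRESETS = {
    "extraction":    {"temperature": 0.0, "top_p": 1.0, "max_tokens": 128, "stop": ["\n\n"]},
    "summarization": {"temperature": 0.3, "top_p": 1.0, "max_tokens": 300, "stop": []},
    "writing":       {"temperature": 0.8, "top_p": 0.95, "max_tokens": 800, "stop": []},
    "code":          {"temperature": 0.2, "top_p": 1.0, "max_tokens": 512, "stop": ["```"]},
}

# Usage: merge a preset into a request instead of hand-picking values each time.
request = {"model": "example-model", **PRESETS["extraction"]}
```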
➡ What is the relationship between decoding mode (sampling vs beam search) and temperature or top_p?
There are two broad families:
- Sampling-based decoding
  - chooses the next token probabilistically
  - temperature, top_p, and top_k directly matter
- Search-based decoding (beam search)
  - keeps multiple partial hypotheses and expands them to find high-likelihood sequences
  - can reduce randomness even at higher temperature settings, depending on implementation
In practice:
- for chat assistants, sampling is the common default
- beam search can produce more stable and high-likelihood outputs, but it can also produce bland or repetitive responses
If you are optimizing for consistency, you can often get most of the benefit by using greedy decoding or low-temperature sampling without introducing beam search.
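At the single-step level, the greedy-vs-sampling distinction reduces to one line of code. A toy sketch (real decoders repeat this choice over the whole sequence; the probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.6, 0.3, 0.1])   # toy next-token distribution

greedy = int(np.argmax(probs))                       # deterministic: always index 0
samples = rng.choice(len(probs), size=10, p=probs)   # stochastic: varies with the seed

print("greedy:", greedy)
print("sampled:", samples)
```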
➡ How do max_tokens and stop sequences affect quality and cost, and how do I choose them safely?
- max_tokens controls the output budget.
  - too low: truncation, unfinished answers, or cut-off code
  - too high: wasted cost, more chance to drift or ramble
- stop sequences are deterministic guards.
  - they tell the decoder to stop when a substring appears
  - they are essential for structured formats and tool-like usage
Safe selection strategy:
- estimate an upper bound for the expected output length
- set max_tokens slightly above that bound
- add stop sequences that match your output format boundaries
If you see the model rambling, stop sequences are usually a better fix than lowering temperature.
Why this is often true:
- Temperature mostly changes how much randomness you allow when choosing the next token. It reduces variance, but it does not define where the answer should end. A low-temperature model can still produce a long continuation if the prompt allows it.
- Stop sequences define an explicit boundary. They tell the decoder to stop when a specific string appears, which cuts off run-on sections even when the model would otherwise keep going.
Practical examples of stop sequences:
- If you want a single JSON object, use a stop sequence like `\n\n` after the JSON block, or a sentinel like `\nEND\n` that you instruct the model to emit at the end.
- If you want exactly N bullet points, you can stop when the model starts the next header or delimiter (for example `\n###`).
- For tool-like outputs, stop at a marker such as a closing code fence, or at the beginning of a References section if you want to prevent extra commentary.
A good pattern is: define the output contract in the prompt, then enforce it with stop sequences and max_tokens. Use temperature only to control style and variance after the output shape is constrained.
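The sentinel pattern can be sketched as a simple truncation rule. Real APIs apply this during decoding (so you are not billed for the cut-off text); this post-hoc version just shows the logic, and the helper name is mine:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)   # earliest match wins
    return text[:cut]

out = "Answer: 42\nEND\nSome extra rambling..."
print(truncate_at_stop(out, ["\nEND\n"]))   # -> "Answer: 42"
```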
➡ What is a practical tuning workflow (small eval set, metrics, iteration) that doesn’t waste time?
A practical workflow:
- Define the output contract
- what counts as correct
- what formatting is required
- what failure looks like
- Create a small evaluation set
- 20 to 50 representative prompts
- include edge cases and adversarial cases
- Start from a baseline
- keep temperature, top_p, and top_k at defaults
- focus on prompt clarity first
- Tune one knob at a time
- change temperature or top_p, not both
- observe changes in variance, verbosity, and error types
- Track metrics that matter
- pass rate on required constraints
- hallucination rate (for factual tasks)
- repetition rate
- average token usage
- Lock the config and monitor drift
- models change, so re-run the eval set after model upgrades
The goal is to treat decoding params like config, not like creative experimentation.
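The workflow above can be sketched as a tiny one-knob sweep. Here `generate` is a hypothetical stand-in for a real API call (it just uppercases the prompt so the sketch runs offline), and the eval cases are invented:

```python
# Minimal sketch of a one-knob-at-a-time sweep over a small eval set.
def generate(prompt, temperature):
    # Placeholder for your provider's API call; deterministic fake model.
    return prompt.upper()

EVAL_SET = [
    {"prompt": "extract the date: meeting on 2024-05-01", "must_contain": "2024-05-01"},
    {"prompt": "extract the id: ticket ab-123", "must_contain": "AB-123"},
]

def pass_rate(temperature):
    """Fraction of eval cases whose output satisfies the required constraint."""
    hits = sum(
        case["must_contain"] in generate(case["prompt"], temperature)
        for case in EVAL_SET
    )
    return hits / len(EVAL_SET)

# Sweep only temperature; top_p and top_k stay at their defaults.
for t in (0.0, 0.3, 0.7):
    print(t, pass_rate(t))
```

With a real model the pass rates would differ across temperatures; the point is that the harness, not manual inspection, decides which configuration to lock in.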
📚 References
- https://huggingface.co/docs/transformers/en/main_classes/text_generation
- https://help.openai.com/en/articles/5072263-how-do-i-use-stop-sequences