202602241347-llm-generation-parameters

🎯 Core Idea

LLM temperature is part of the decoding configuration that controls how the model turns next-token probabilities into an actual sequence. It does not change what the model knows. It changes how deterministic the model is when multiple plausible next tokens exist.

At a high level, decoding has two layers:

- Shaping the distribution: temperature rescales logits before sampling. Lower temperature makes the distribution sharper (more deterministic, more likely to choose the top token). Higher temperature flattens the distribution (more variety, more risk of drifting).
- Restricting the candidate set: truncation strategies such as top_p and top_k drop low-probability tokens before the final token is drawn.
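A minimal sketch of the rescaling step (plain Python; names are illustrative, not any particular API):

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by temperature, then softmax into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = apply_temperature(logits, 0.2)  # sharper: mass piles onto the top token
hot = apply_temperature(logits, 2.0)   # flatter: mass spreads across tokens
```

With these logits, the top token's probability rises above 0.99 at temperature 0.2 and falls to roughly 0.5 at temperature 2.0.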

Common generation parameters (and what they do)

- temperature: rescales logits before sampling; the main randomness knob.
- top_p (nucleus sampling): samples only from the smallest token set whose cumulative probability is at least p.
- top_k: samples only from the K most probable tokens.
- max_tokens: hard cap on output length (and cost).
- stop sequences: strings that immediately end generation when emitted.
- presence / frequency / repetition penalties: discourage tokens that have already appeared.

How to tune for better performance

  1. Start with the task type
  2. Avoid coupling knobs early
  3. Measure output quality with a small evaluation set
  4. Treat prompt quality as a first-class parameter

🌲 Branching Questions

➡ How does temperature mathematically affect sampling, and what does more deterministic actually mean?

Most modern APIs implement sampling by taking the model’s logits (unnormalized scores) for the next token, optionally transforming them, then applying a softmax to get probabilities.

Temperature is a scalar T applied before softmax. A common form is:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

Then:

If temperature is less than 1, dividing by a small number increases the magnitude of differences between logits. That makes the softmax distribution sharper: probability mass concentrates on the top tokens.

If temperature approaches 0, the distribution approaches an argmax decision rule. In practice many implementations treat temperature 0 as greedy decoding.

More deterministic means:

- The same prompt produces the same (or nearly the same) output across runs.
- The top-ranked token is chosen more often, so variance across samples shrinks.

This does not guarantee factual correctness. It reduces variance. If the most probable continuation is wrong, low temperature will repeat that wrong continuation consistently.
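A toy demonstration of that variance reduction, using illustrative distributions standing in for a cold and a hot softmax (seeded for reproducibility):

```python
import random

def sample(probs, rng):
    """Draw one token index from a probability distribution."""
    r = rng.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

rng = random.Random(0)
probs_cold = [0.98, 0.015, 0.005]  # shape of a low-temperature softmax
probs_hot = [0.40, 0.35, 0.25]     # shape of a high-temperature softmax

cold_draws = [sample(probs_cold, rng) for _ in range(100)]
hot_draws = [sample(probs_hot, rng) for _ in range(100)]
# The cold distribution picks the top token far more consistently --
# including when that top token happens to be wrong.
```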

➡ When should I use top_p instead of temperature, and why do many APIs recommend not tuning both at once?

top_p is nucleus sampling. Instead of allowing every token to be sampled, you restrict the candidate set to the smallest set of tokens whose cumulative probability is at least p. Then you sample only within that set.
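A sketch of the nucleus filter over an already-softmaxed distribution (plain Python, illustrative names):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize within that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.75)  # keeps tokens 0 and 1, drops the tail
```

Note that the candidate set adapts: a peaked distribution may keep one token, a flat one may keep many.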

When to use top_p:

- When you want variety but need to cut off the long tail of low-probability (often nonsensical) tokens.
- When the number of plausible next tokens varies a lot by context: top_p adapts the candidate-set size to the distribution, while a fixed temperature or top_k does not.

Why not tune both at once:

Changing both makes it harder to reason about the effective sampling behavior. You can end up in confusing states such as:

- High temperature plus low top_p, where the truncation silently removes most of the variety the temperature added.
- Low temperature plus aggressive top_p, where both knobs restrict the same head of the distribution and it is unclear which one is actually binding.

A practical rule is to pick one main randomness knob:

- Tune temperature and leave top_p at its default (often 1.0), or
- Tune top_p and leave temperature at 1.0.

➡ What is top_k good for, and what are common failure modes when it’s too small?

top_k restricts sampling to the K highest probability tokens at each step. It is a simpler hard cutoff than top_p.
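For contrast with top_p, the fixed-size cutoff can be sketched as:

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in order)
    return {i: probs[i] / total for i in order}

probs = [0.4, 0.3, 0.2, 0.1]
kept = top_k_filter(probs, 2)  # always exactly two candidates,
                               # however flat or peaked the distribution is
```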

What it’s good for:

- A cheap, predictable hard cutoff on the low-probability tail.
- Settings where you want a fixed-size candidate set regardless of the shape of the distribution.

Common failure modes when K is too small:

- The correct but rare token (a proper noun, a code identifier, an unusual technical term) is excluded from the candidate set entirely.
- Output becomes bland or repetitive because only generic high-frequency tokens survive the cutoff.

If you use top_k, keep it high enough that the model still has room to express the right term, and use it mainly as a tail-trimming tool.

➡ How do repetition penalties (presence or frequency or repetition_penalty) differ, and when should I use each?

Different providers expose different names, but the intent is to prevent degenerate loops and reduce repetitive phrasing.
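As a sketch of the additive style exposed by OpenAI-like APIs (exact semantics vary by provider; this is an illustration, not any vendor's exact formula):

```python
def apply_penalties(logits, counts, presence=0.0, frequency=0.0):
    """presence: flat penalty once a token has appeared at all.
    frequency: penalty that grows with how often it has appeared."""
    out = []
    for i, z in enumerate(logits):
        c = counts.get(i, 0)
        out.append(z - presence * (1 if c > 0 else 0) - frequency * c)
    return out

# Token 0 already appeared 3 times; token 1 has not appeared.
adjusted = apply_penalties([2.0, 2.0], {0: 3}, presence=0.5, frequency=0.2)
```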

Practical usage:

- presence_penalty: a flat penalty applied once a token has appeared at all; nudges the model toward new topics.
- frequency_penalty: scales with how many times a token has appeared; targets verbatim repetition and loops.
- repetition_penalty: a multiplicative rescaling of logits for already-seen tokens, common in open-source stacks; values slightly above 1.0 are typical starting points.

Start small; large penalties can distort grammar by pushing the model away from necessary function words.

➡ How should I tune parameters differently for extraction, summarization, writing, and code generation?

Extraction and classification:

- Temperature 0 (or very low). The goal is consistency, not creativity.

Summarization:

- Low temperature (roughly 0 to 0.3) keeps summaries faithful; a little randomness can improve fluency.

Writing and brainstorming:

- Higher temperature (roughly 0.7 to 1.0), or top_p around 0.9 to 0.95, to get variety across drafts.

Code generation:

A good default approach is to start conservative and only increase randomness when you need more diversity.
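These defaults can live in versioned config. The values below are illustrative starting points consistent with the guidance above, not official recommendations:

```python
# Hypothetical presets -- tune against your own eval set.
DECODING_PRESETS = {
    "extraction":    {"temperature": 0.0, "top_p": 1.0},
    "summarization": {"temperature": 0.3, "top_p": 1.0},
    "writing":       {"temperature": 0.9, "top_p": 0.95},
    "code":          {"temperature": 0.2, "top_p": 1.0},
}
```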

➡ What is the relationship between decoding mode (sampling vs beam search) and temperature or top_p?

There are two broad families:

- Sampling-based decoding (greedy decoding as the limit, plus temperature, top_p, top_k), which draws one token at a time from the shaped distribution.
- Search-based decoding (beam search), which keeps several partial sequences and expands the highest-scoring ones.

In practice:

- Most chat and completion APIs expose sampling, not beam search; temperature and top_p only make sense within the sampling family.
- Beam search is more common in translation and other tasks with a single well-defined target.

If you are optimizing for consistency, you can often get most of the benefit by using greedy decoding or low-temperature sampling without introducing beam search.

➡ How do max_tokens and stop sequences affect quality and cost, and how do I choose them safely?

Safe selection strategy:

- Set max_tokens comfortably above the longest valid output you expect, so good answers are not truncated mid-sentence.
- Use stop sequences to end generation at the output boundary, so you pay only for tokens you keep.
- Watch the finish reason reported by the API (e.g. length vs stop) to detect truncation.

If you see the model rambling, stop sequences are usually a better fix than lowering temperature.

Why this is often true:

- Lowering temperature changes which tokens are chosen, not when generation ends; a confident model can ramble deterministically.
- A stop sequence is a hard boundary: the moment the model emits it, generation ends.

Practical examples of stop sequences:

- "\n\n" to stop after a single paragraph.
- "```" to stop at the end of a fenced code block.
- A next-turn marker such as "User:" to keep the model from writing both sides of a dialogue.

A good pattern is: define the output contract in the prompt, then enforce it with stop sequences and max_tokens. Use temperature only to control style and variance after the output shape is constrained.
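Most APIs apply stop sequences server-side, but the enforcement logic is simple to sketch client-side:

```python
def truncate_at_stop(text, stop_sequences):
    """Cut generated text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for s in stop_sequences:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

reply = truncate_at_stop("42\nUser: what next?", ["\nUser:", "END"])
```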

➡ What is a practical tuning workflow (small eval set, metrics, iteration) that doesn’t waste time?

A practical workflow:

  1. Define the output contract
  2. Create a small evaluation set
  3. Start from a baseline
  4. Tune one knob at a time
  5. Track metrics that matter
  6. Lock the config and monitor drift

The goal is to treat decoding params like config, not like creative experimentation.
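The measurement step can be as small as this (the model call is a hypothetical stand-in for your client with a fixed decoding config):

```python
def evaluate(generate, eval_set):
    """Exact-match accuracy of `generate` (prompt -> output) on (prompt, expected) pairs."""
    hits = sum(1 for prompt, expected in eval_set if generate(prompt) == expected)
    return hits / len(eval_set)

# Stub model for illustration: echoes the prompt's last word.
stub = lambda prompt: prompt.split()[-1]
eval_set = [("say yes", "yes"), ("say no", "no"), ("say maybe", "perhaps")]
score = evaluate(stub, eval_set)  # 2 of 3 exact matches
```

Rerun the same eval whenever a knob changes; compare scores, not vibes.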

📚 References