Temperature, Top-p, Top-k: the hands-on guide to mastering LLM outputs

Why your LLM is unpredictable (and how to fix it)

Ever wondered why your chatbot sometimes gives you creative gems and other times pure nonsense? The secret lies in three levers: temperature, top-p, and top-k. Mastering them means controlling creativity, reducing hallucinations, and making your LLM outputs reliable—whether for chatbots, assistants, or content generation.

This guide gives you the latest best practices, with hands-on advice and frameworks. For a deep dive (prompting, decoding, advanced scripts), see The Mechanics of LLMs.

The science behind the knobs: probability, not magic

LLMs don’t “choose words”—they output a probability distribution over the vocabulary at each step. The decoding strategy (how you sample from this distribution) is what makes your model creative, factual, or repetitive. Small changes early on can lead to wildly different results.
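Here is a minimal NumPy sketch of that idea, with made-up logits for three candidate next tokens (the numbers are illustrative only):

```python
import numpy as np

# Made-up logits (raw scores) for three candidate next tokens
logits = np.array([2.0, 1.75, 1.2])

# Softmax turns logits into a probability distribution that sums to 1
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.round(2))  # roughly [0.45, 0.35, 0.2]
```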

Decoding strategies: the three families

  1. Greedy decoding: always pick the most likely token. Stable, but can get stuck in loops or be boring.
  2. Beam search: keep the top K continuations, good for translation but too rigid for chat.
  3. Sampling (with top-p/top-k/temperature): the gold standard for assistants and creative tasks. Adds controlled randomness.
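As a rough sketch of how these three families map onto Hugging Face Transformers' generate() arguments (gpt2 and the prompt are just placeholders; swap in your own model):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM works the same way
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The weather today is", return_tensors="pt")

# 1. Greedy decoding: always the most likely token
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# 2. Beam search: keep the top K continuations (here 5 beams)
beam = model.generate(**inputs, do_sample=False, num_beams=5, max_new_tokens=20)

# 3. Sampling with temperature / top-p / top-k
sampled = model.generate(**inputs, do_sample=True, temperature=0.7,
                         top_p=0.9, top_k=50, max_new_tokens=20)

print(tok.decode(sampled[0], skip_special_tokens=True))
```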

Quick example: why sampling matters

Suppose the model predicts:

| token | probability |
|-------|-------------|
| hello | 0.45        |
| hi    | 0.35        |
| hey   | 0.20        |

With greedy, you always get “hello”. With sampling, “hi” or “hey” can appear—this is where diversity comes from.
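A tiny sketch of that difference, sampling from the toy distribution above:

```python
import random

tokens = ["hello", "hi", "hey"]
probs = [0.45, 0.35, 0.20]

# Greedy: always pick the argmax -> "hello" every single time
greedy = tokens[probs.index(max(probs))]

# Sampling: draw according to the probabilities -> output varies run to run
samples = random.choices(tokens, weights=probs, k=10)
print(greedy)
print(samples)
```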

The three core levers: temperature, top-p, top-k

1) Temperature

Controls how “sharp” or “flat” the probability distribution is.

  • Low T (0.1–0.3): deterministic, stable, less creative
  • High T (0.8–1.2): more creative, but risk of nonsense

Formula: $$\text{softmax}_T(i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$ where $z_i$ is the logit (raw score) for token $i$. Raise T for more variety, lower T for more reliability.
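A minimal sketch of that formula, showing how the same made-up logits sharpen or flatten as T changes:

```python
import numpy as np

def softmax_T(logits, T):
    # Divide the logits by T before the softmax: low T sharpens, high T flattens
    z = np.array(logits, dtype=float) / T
    z -= z.max()                      # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.75, 1.2]
print(softmax_T(logits, 0.2).round(2))  # low T: probability concentrates on the top token
print(softmax_T(logits, 1.0).round(2))  # T=1: the unmodified distribution
print(softmax_T(logits, 2.0).round(2))  # high T: flatter, more diverse sampling
```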

2) Top-k

Keep only the k most likely tokens, then sample from them. Prevents rare/weird outputs, but too small a k makes text repetitive.
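A small sketch of the filtering step, reusing the toy distribution from earlier:

```python
import numpy as np

def top_k_filter(probs, k):
    # Keep only the k most likely tokens, then renormalise before sampling
    probs = np.array(probs, dtype=float)
    keep = np.argsort(probs)[-k:]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

print(top_k_filter([0.45, 0.35, 0.20], k=2))  # "hey" is dropped: [0.5625, 0.4375, 0.0]
```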

3) Top-p (nucleus sampling)

Keep the smallest set of tokens whose cumulative probability reaches p (e.g., 0.9). More adaptive than top-k, and the default in most frameworks (OpenAI, HuggingFace, etc.).
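A sketch of nucleus filtering on the same toy distribution (real implementations differ slightly in how they handle the token that crosses the threshold):

```python
import numpy as np

def top_p_filter(probs, p):
    # Keep the smallest set of tokens whose cumulative probability reaches p
    probs = np.array(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # most likely tokens first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # number of tokens to keep
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

print(top_p_filter([0.45, 0.35, 0.20], p=0.5))  # keeps "hello" + "hi" (cumulative 0.80 >= 0.5)
print(top_p_filter([0.45, 0.35, 0.20], p=0.9))  # this flat toy distribution needs all three tokens
```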

Why even temperature=0 isn’t always 100% reproducible

Even with greedy decoding, tiny floating-point differences (on GPU/TPU) can cause rare output changes. Don’t panic—this is normal in production.

Practical presets

  • Factual Q&A, compliance: T=0.1–0.3, top-p=0.8–0.9
  • General assistants: T=0.4–0.7, top-p=0.9–0.95
  • Creative writing, ideation: T=0.8–1.1, top-p=0.95–0.98

Tip: change one parameter at a time and measure quality, diversity, and error rate.
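Here is a hedged sketch of wiring such presets into Hugging Face's generate() call. The model, prompt, and exact values are placeholders; pick values from the ranges above for your own use case:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative presets picked from the ranges above (not official defaults)
PRESETS = {
    "factual":   dict(do_sample=True, temperature=0.2, top_p=0.85),
    "assistant": dict(do_sample=True, temperature=0.6, top_p=0.92),
    "creative":  dict(do_sample=True, temperature=1.0, top_p=0.97),
}

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Write a product description for a thermos.", return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=60, **PRESETS["creative"])
print(tok.decode(out[0], skip_special_tokens=True))
```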

Step-by-step: tuning for your use case

  1. Define your goal: factual (stable) or creative (diverse)?
  2. Start safe: T=0.2, top-p=0.9, top-k=off/large
  3. Tune in order: temperature → top-p → top-k
  4. Test on real prompts: track errors, repetitions, style drift
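A rough sketch of step 4, sweeping temperature alone over a couple of prompts and tracking a crude repetition metric (the model, prompts, and metric are all placeholders for your own):

```python
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")   # placeholder model
prompts = ["Explain what an API is.", "Write a tagline for a coffee shop."]

def repetition_rate(text):
    # Crude proxy: share of repeated words; replace with your own quality checks
    words = text.split()
    return 1 - len(set(words)) / max(len(words), 1)

# Change one parameter at a time: here only temperature varies
for T in (0.2, 0.5, 0.8, 1.1):
    outs = [gen(p, do_sample=True, temperature=T, top_p=0.9,
                max_new_tokens=40)[0]["generated_text"] for p in prompts]
    avg = sum(repetition_rate(o) for o in outs) / len(outs)
    print(f"T={T}: average repetition {avg:.2f}")
```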

Why this works (and when it doesn’t)

You’re balancing exploration (diversity) and exploitation (stability). More exploration = more ideas, but also more risk of drift or hallucination.

Common mistakes to avoid

  • Setting temperature or top-p too high: leads to incoherence
  • Using beam search for creative tasks: results are generic
  • Trying to fix hallucinations only with sampling: for grounded answers, use RAG or better prompting

Frameworks and hands-on tools

  • HuggingFace Transformers: exposes all sampling parameters
  • Guidance, LMQL: advanced control over generation
  • LangChain, LlamaIndex: orchestration, prompt templates, output parsers
  • Book script: 03_temperature_softmax.py (https://github.com/alouani-org/mecanics-of-llms)

FAQ

Top-p or Top-k? Start with top-p: it adapts better across prompts.

Why does my model repeat itself? Usually: temperature too low or decoding too deterministic. Raise temperature a bit.

Why do I still get hallucinations with low temperature? Sampling can’t fix everything—use clear prompts, output structure, and RAG for grounded answers.


For the full deep dive (prompting, decoding, advanced scripts, speculative decoding), see The Mechanics of LLMs.
