Every token an LLM produces starts as a number for every word in its vocabulary — a logit per token, tens or hundreds of thousands of them. The model's job ends there. Turning that vector into one chosen token is a separate step called sampling, and it is governed by a handful of knobs — temperature, top-k, top-p, min-p, repetition penalty — that get treated in most tutorials as a menu of interchangeable "creativity" dials.
They are not interchangeable, and most of them are not even doing different things. Three of the four headline knobs (top-k, top-p, and min-p) do the same job. The useful way to understand sampling is to see what that job is, and where the real differences hide.
Temperature reshapes; the rest truncate
There is one genuine division here. Temperature changes the shape of the whole distribution; everything else cuts the tail off it.
Temperature divides the logits by a constant T before the softmax: softmax(logits / T). With T < 1 the gaps between logits stretch, so probability piles onto the likeliest tokens; with T > 1 the gaps compress and probability spreads toward the long tail; as T → 0 the top token's probability approaches 1 and you get greedy decoding (plain argmax). Crucially, temperature never removes a token — even a wildly unlikely word keeps some sliver of probability. It is a volume knob on randomness, not a filter.
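To make the formula concrete, here is a minimal sketch of softmax(logits / T) in plain Python. The function name and the four example logits are made up for illustration; nothing here is taken from a particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        # T -> 0 is the greedy limit; handle it with argmax, not division.
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.0, -2.0]

sharp = softmax_with_temperature(logits, 0.5)  # T < 1: mass piles onto the top token
plain = softmax_with_temperature(logits, 1.0)  # T = 1: ordinary softmax
flat = softmax_with_temperature(logits, 2.0)   # T > 1: mass spreads toward the tail
```

In all three results every token keeps a nonzero probability; only the spread changes, which is the sense in which temperature reshapes rather than filters.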
Top-k, top-p, and min-p are filters. They all make the identical move: discard the low-probability tail, renormalize what's left, then sample. They differ only in how they choose where the tail begins.
- Top-k (Fan, Lewis & Dauphin, 2018) keeps a fixed count: the k highest-probability tokens, nothing else. Simple, and its flaw is exactly that fixedness. When the model is confident (one token at 0.9), a k of 50 hauls in 49 tokens it had all but ruled out. When the model is flat (fifty plausible continuations), that same k may amputate good ones. The cutoff ignores the shape of the thing it's cutting.
- Top-p, or nucleus sampling (Holtzman et al., ICLR 2020), fixes that by keeping the smallest set of tokens whose probabilities sum to at least p. That set, the "nucleus", is large when probability is spread out and small when one token dominates. Holtzman's paper named the problem it solves: maximization decoding (greedy, beam search) produces bland, repetitive degeneration, and the cure is "truncating the unreliable tail" with a dynamic nucleus. Top-p adapts to the distribution's shape.
- Min-p (Nguyen et al., 2024) adapts to the model's confidence instead. Its threshold is p_base × p_max, a fraction of the top token's probability. If the best token sits at 0.5 and p_base = 0.1, the cutoff is 0.05; if the best token is only 0.2, the cutoff drops to 0.02. The filter tightens when the model is sure and relaxes when it isn't. The pitch is that this stays coherent at high temperatures, where top-p starts admitting garbage.
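The three rules in the list above can be sketched as three ways of choosing a keep-set, all feeding one shared renormalize step. This is an illustrative sketch in plain Python, not any library's implementation; the function names and the example distributions are assumptions made up for the demonstration.

```python
def by_probability(probs):
    """Token indices sorted from most to least likely."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k_keep(probs, k):
    """Keep a fixed count: the k most likely tokens."""
    return set(by_probability(probs)[:k])

def top_p_keep(probs, p):
    """Keep the smallest prefix whose cumulative mass reaches p."""
    keep, mass = set(), 0.0
    for i in by_probability(probs):
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return keep

def min_p_keep(probs, p_base):
    """Keep tokens at or above a fraction of the top token's probability."""
    cutoff = p_base * max(probs)
    return {i for i, q in enumerate(probs) if q >= cutoff}

def renormalize(probs, keep):
    """Zero out everything outside `keep`, rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [q / total if i in keep else 0.0 for i, q in enumerate(probs)]

# Hypothetical distributions: one confident, one spread out.
confident = [0.9, 0.05, 0.03, 0.02]
spread = [0.3, 0.25, 0.25, 0.2]

confident_sets = (top_k_keep(confident, 3), top_p_keep(confident, 0.9), min_p_keep(confident, 0.1))
spread_sets = (top_k_keep(spread, 3), top_p_keep(spread, 0.9), min_p_keep(spread, 0.1))
```

On the confident distribution, top-k with k = 3 still keeps three tokens while top-p and min-p both shrink to the single dominant one. On the spread-out distribution, top-p and min-p keep all four and top-k cuts one regardless.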
So the real design axis across all three is one question: how much of the tail do you trust? Top-k answers with a count, top-p with a cumulative mass, min-p with a ratio to the peak.
The order matters, and it's fixed
These knobs aren't applied in parallel — they're a pipeline, and the order is the same in both vLLM and Hugging Face's generate: temperature, then top-k, then top-p, with the truncation happening on the already-temperature-scaled logits, before the final softmax. That ordering is why "high temperature but top-p 0.9" is coherent: temperature flattens the distribution, then top-p clips the newly-fattened tail back off. Stack them in your head in that sequence or the interactions look like magic.
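A minimal sketch of that sequence, assuming the order described above (temperature, then top-k, then top-p, then one final softmax). It is not the vLLM or Hugging Face code; every function name here is hypothetical, and truncation is modeled by setting dropped logits to negative infinity so the final softmax assigns them zero probability.

```python
import math
import random

NEG_INF = float("-inf")

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, temperature):
    return [x / temperature for x in logits]

def apply_top_k(logits, k):
    """Mask everything below the k-th largest logit (ties at the cutoff survive)."""
    if k >= len(logits):
        return list(logits)
    kth = sorted(logits, reverse=True)[k - 1]
    return [x if x >= kth else NEG_INF for x in logits]

def apply_top_p(logits, p):
    """Mask everything outside the smallest set whose mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]

def sample_token(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    logits = apply_temperature(logits, temperature)  # 1. reshape
    logits = apply_top_k(logits, top_k)              # 2. truncate by count
    logits = apply_top_p(logits, top_p)              # 3. truncate by mass
    probs = softmax(logits)                          # 4. final softmax renormalizes
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits; a high temperature flattens, then the filters clip the tail.
example_logits = [3.0, 2.0, 1.0, 0.0, -1.0]
filtered = apply_top_p(apply_top_k(apply_temperature(example_logits, 2.0), 3), 0.9)
```

Because top-p measures mass on the temperature-scaled logits, raising T enlarges the nucleus that top-p has to keep, which is the interaction the paragraph describes.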
The penalties are a separate family. Repetition penalty (from Keskar et al.'s CTRL, recommended around 1.2) divides the logits of already-seen tokens to discourage loops; OpenAI's frequency and presence penalties do a similar thing additively — frequency scales with how often a token already appeared, presence is a flat one-off nudge once it's appeared at all. These shape what gets penalized, not how much of the distribution survives.
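The two penalty styles can be sketched side by side. This is an illustration under stated assumptions, not a provider's implementation: the function names are hypothetical, and the sign handling in the repetition penalty (multiplying negative logits instead of dividing them, so the penalty always lowers the score) is an assumption borrowed from common open-source implementations rather than something stated above.

```python
from collections import Counter

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Multiplicative penalty on every token that has already appeared."""
    out = list(logits)
    for tok in set(generated_ids):
        # Assumption: dividing a negative logit would raise it, so multiply instead.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def apply_additive_penalties(logits, generated_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Additive penalties: frequency scales with the count, presence is flat."""
    out = list(logits)
    for tok, count in Counter(generated_ids).items():
        out[tok] -= count * frequency_penalty + presence_penalty
    return out

# Hypothetical three-token vocabulary; token 0 was generated twice, token 1 once.
logits = [2.0, -1.0, 0.5]
history = [0, 0, 1]

repetition = apply_repetition_penalty(logits, history, penalty=2.0)
additive = apply_additive_penalties(logits, history,
                                    frequency_penalty=0.5, presence_penalty=0.25)
```

Token 2 never appeared, so both functions leave its logit untouched; the penalized logits would then flow into temperature and truncation like any others.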
The part the tutorials skip: this is not a quality ladder
It is tempting to read top-k → top-p → min-p as successive upgrades. The field's own literature says otherwise. Min-p arrived with strong benchmark claims and an ICLR 2025 oral, and then a 2025 critical re-analysis — bluntly titled Min-p, Max Exaggeration — argued the headline gains were fragile: sensitive to how many hyperparameters were tuned, with some baseline results dropped and LLM-as-judge scores reported inconsistently. The honest summary is that for typical use, a well-tuned top-p and a well-tuned min-p are hard to tell apart, and the "newer is better" framing oversells a real but modest idea. Treat the cutoff rule as a preference, not a leaderboard.
And here is the turn that matters for anyone building agents rather than chatbots: most of this conversation is about creative generation, and you are almost never doing that. For tool calling, routing, extraction, classification, and code, there is one correct-ish continuation and you want it every time — which means temperature 0, greedy, no truncation knobs at all. (Even then, "temperature 0" makes the sampler deterministic, not the system — batched GPU floating-point reductions and provider load-balancing still leak variance; see reasoning models vs standard LLMs for where that bites.) When your actual problem is "the model must emit valid JSON or a valid function call," the fix is not a sampling temperature — it's constrained decoding, which masks invalid tokens to zero probability before any of these knobs run. The whole sampling debate lives downstream of a question agents usually answer with "give me the single most likely token, and make it parse."
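A toy sketch of that last point: mask the tokens that would break the format, then take the argmax. Real constrained decoders derive the allowed set from a grammar or JSON schema at each step; here the allowed set is simply passed in by hand, and the vocabulary, logits, and function name are all hypothetical.

```python
NEG_INF = float("-inf")

def constrained_greedy_step(logits, allowed_ids):
    """Mask every token outside `allowed_ids`, then pick the most likely survivor."""
    if not allowed_ids:
        raise ValueError("no valid continuation: allowed_ids is empty")
    masked = [x if i in allowed_ids else NEG_INF for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Hypothetical vocabulary and logits. The model prefers "hello",
# but a JSON object has to start with "{".
vocab = ["{", "}", '"name"', "hello", ":"]
logits = [0.1, 0.2, 0.3, 5.0, 0.0]

unconstrained = max(range(len(logits)), key=lambda i: logits[i])
constrained = constrained_greedy_step(logits, allowed_ids={0})
```

No temperature or truncation knob appears anywhere in this step: the mask decides what is legal, and greedy selection decides among what is left.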



