Temperature Zero Was Never the Problem

A few weeks ago I started building a talk around a claim I was fairly smug about: if you want a language model to give you the same answer twice, set the temperature to zero. I’d written before about why deterministic inference would be a big deal for banks — reproducibility is the thing that lets you validate, monitor and audit a model, and it’s the thing LLMs conspicuously lack. My plan was to stand up, say “temperature zero, mostly sorted,” and move on to the governance.

Then I actually ran the experiment. Sixteen thousand inference calls later, I have a different talk — and the short version is that temperature zero was never the interesting part of the problem.

The part everyone gets right

Set temperature to zero, fix your seed, send one request at a time, and a modern inference stack will happily give you the same answer over and over. I ran Qwen2.5-7B on an H100 through vLLM, one prompt at a time, dozens of prompts, sixteen runs each. Identical outputs, every time. There’s a flag in vLLM, the VLLM_BATCH_INVARIANT environment variable, that’s supposed to be the determinism switch, and at this point it made no difference at all: on or off, I got perfect consistency.

Which should have been my first clue. If the magic determinism flag changes nothing, the thing it fixes isn’t the thing you’re testing.

Single-request determinism is basically free. Nobody runs an LLM that way in production.

The part everyone skips

In production you have one model server behind one endpoint, and a stream of requests arriving from all over the bank at once. The server doesn’t process them one at a time — it batches them together and runs them through the GPU as a group, because that’s the only way the economics work. The batch is whatever happened to arrive in the same few milliseconds. It’s different every time.

Here’s the uncomfortable bit. The maths the GPU does to your prompt depends on the shape of the batch it’s travelling in.

Floating-point addition isn’t associative: (a + b) + c doesn’t always equal a + (b + c) once you’re down in the rounding errors. When the GPU sums up the numbers for a batch, the order and strategy of those sums change depending on how big the batch is and what else is in it. So the logits for your prompt come out very slightly different depending on who you were queued up with. Usually that’s invisible. But every so often the model is genuinely torn between two next tokens, the tie tips the other way, and because every later token is conditioned on that one, the whole answer diverges from that point.
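To make that concrete, here is a small pure-Python sketch, nothing to do with real GPU kernels. It shows the same numbers summed under two grouping strategies and a greedy pick between two made-up tokens flipping as a result. The helper names (add_left_to_right, chunked_sum, pick_token) and the values are illustrative assumptions, and the gap is wildly exaggerated compared with the last-bit differences you would see in real logits.

```python
def add_left_to_right(values):
    """Plain sequential float addition, one element at a time."""
    total = 0.0
    for v in values:
        total = total + v
    return total


def chunked_sum(values, chunk):
    """Add up chunks of `chunk` items first, then add the partial sums."""
    partials = [add_left_to_right(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return add_left_to_right(partials)


# Grouping changes the rounding, so these two are not equal.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Same four numbers, two reduction strategies, two different totals.
values = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(values, 1), chunked_sum(values, 2))  # 1.0 0.0


def pick_token(chunk):
    """Greedy choice between two hypothetical tokens; A's logit is a reduction."""
    logit_a = chunked_sum(values, chunk)
    logit_b = 0.5
    return "A" if logit_a > logit_b else "B"


print(pick_token(1), pick_token(2))  # A B
```

The explicit loop is deliberate: the point is the order of plain additions, so the sketch avoids any library routine that might reorder or compensate the sum.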

So I built the test that actually matters: take one prompt, hold it constant, and run it inside batches of different sizes — 1, then 8, then 32 — and check whether the answer changes. Not “run it twice.” Vary the crowd it’s standing in.

It changes. A lot.
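A minimal sketch of that kind of probe, not the harness behind the numbers in this post. The function names are hypothetical. The vLLM adapter uses the offline LLM / SamplingParams API and is imported lazily so the rest runs without vLLM installed. It also assumes that handing vLLM a list of N prompts is a reasonable proxy for a batch of N, which the scheduler does not strictly guarantee.

```python
from typing import Callable, Dict, List, Sequence

# A backend takes a batch of prompts and returns one completion per prompt, in order.
Generate = Callable[[List[str]], List[str]]


def probe_batch_sizes(generate: Generate, prompt: str, fillers: Sequence[str],
                      sizes: Sequence[int] = (1, 8, 32)) -> Dict[int, str]:
    """Run `prompt` inside batches of each size; return {batch_size: completion}."""
    results = {}
    for size in sizes:
        if size - 1 > len(fillers):
            raise ValueError(f"need at least {size - 1} filler prompts")
        batch = [prompt] + list(fillers[: size - 1])
        results[size] = generate(batch)[0]  # only the target prompt's answer
    return results


def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if the strings are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


def make_vllm_generate(model: str, max_tokens: int = 256) -> Generate:
    """Adapter around vLLM's offline API (assumes vLLM is installed)."""
    from vllm import LLM, SamplingParams

    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, seed=0, max_tokens=max_tokens)

    def generate(prompts: List[str]) -> List[str]:
        return [out.outputs[0].text for out in llm.generate(prompts, params)]

    return generate


def toy_generate(prompts: List[str]) -> List[str]:
    """Stand-in backend whose answer depends on batch size, for demonstration only."""
    variant = "answer A" if len(prompts) <= 8 else "answer B"
    return [f"{p}: {variant}" for p in prompts]


if __name__ == "__main__":
    fillers = [f"filler prompt {i}" for i in range(31)]
    results = probe_batch_sizes(toy_generate, "target prompt", fillers)
    for size, text in results.items():
        print(size, repr(text), first_divergence(results[1], text))
```

Against a real server you would swap toy_generate for make_vllm_generate with your own model identifier, and draw the fillers from realistic traffic rather than placeholders.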

A number that should worry a regulator

The cleanest example I got is also the one I’d least want to explain to an examiner.

I asked Qwen2.5-7B — temperature zero, the VLLM_BATCH_INVARIANT flag on, supposedly deterministic — how Bank of America’s total assets changed between 2022 and 2024. With a light load on the endpoint (batches of one to eight) it told me 2022 assets were “approximately $4.97 trillion.” With a busier endpoint (batches of sixteen to thirty-two), the same question, same config, same everything: “approximately $511.2 billion.”

That’s not a rounding wobble. Those are two answers from the same model that differ by roughly a factor of ten, depending on how many other people happened to be using the service at that instant.

(For the record, both numbers are wrong. Bank of America’s 10-K puts total assets at roughly $3.05 trillion at the end of 2022 and $3.26 trillion at the end of 2024. The model bracketed reality without ever hitting it. But the hallucination is a separate problem, and at least a wrong-but-stable answer is one you can catch in testing. A wrong answer that also changes with server load is the one that gets through every test you’d normally run and then surprises you in production.)

The flag is real. It’s also a trap.

So you turn the flag on. VLLM_BATCH_INVARIANT=1 swaps in kernels that do the reductions the same way regardless of batch shape, which is exactly what you want. Problem solved?

Sometimes. I ran the same twelve-prompt test across two open models, with the flag off and then on, and the results don’t tell a tidy story:

Gemma2-9B: 42% of prompts batch-invariant with the flag off, 100% with it on. The flag works perfectly.
Qwen2.5-7B: 8% with the flag off, 17% with it on. The flag is almost useless — ten of twelve prompts still drift with “determinism” enabled.

Same flag, same engine, same test, opposite outcomes. The reason is mundane and important: batch-invariant coverage in vLLM is implemented operation by operation, and it isn’t finished (the tracking issue is still open). Some of the operations Qwen leans on simply don’t have batch-invariant kernels yet. Gemma’s happen to be covered.
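For clarity on what those percentages count, here is a sketch of the metric as described: a prompt is batch-invariant only if its completion is identical at every batch size tried. The function name and the placeholder data are assumptions. The two-of-twelve split mirrors the “ten of twelve still drift” figure above, and the other percentages are consistent with five of twelve (42%) and one of twelve (8%), though those counts are inferred rather than stated.

```python
from typing import Dict


def invariance_rate(outputs_by_prompt: Dict[str, Dict[int, str]]) -> float:
    """Fraction of prompts whose completion is the same at every batch size."""
    stable = 0
    for by_size in outputs_by_prompt.values():
        if len(set(by_size.values())) == 1:
            stable += 1
    return stable / len(outputs_by_prompt)


# Placeholder data: 12 prompts, the first 2 stable, the other 10 drifting at batch 32.
example = {}
for i in range(12):
    drifts = i >= 2
    example[f"prompt {i}"] = {
        1: "same text",
        8: "same text",
        32: "different text" if drifts else "same text",
    }

print(f"{invariance_rate(example):.0%}")  # 17%
```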

Three things follow from that, and they’re the whole point of this post.

First, the flag is off by default. Out of the box, on a current, fully-patched inference stack, you are non-deterministic under load and nothing warns you.

Second, whether it helps depends on your model. You cannot read the docs, set the flag, and assume.

Third — and this is the one that actually bites — it’s invisible to the obvious test. If you check for determinism the way any sensible engineer would, by sending the same prompt a few times and confirming you get the same answer, you’ll get the same answer. Every time. Because single-request determinism was never the problem. You’ll sign it off, and you’ll be wrong.

It gets worse at the frontier

The obvious response is “fine, use the newest, best model and the newest engine, and let the platform people sort it out.” So I tried the current flagships.

Qwen3.5-27B uses a newer attention mechanism (a gated linear-attention variant). With the flag off, it runs fine — about a third of prompts batch-invariant, same broad picture as before. Turn the flag on and the engine refuses to start at all:

RuntimeError: VLLM batch_invariant mode is not supported for GDN_ATTN.

You don’t get a worse determinism guarantee. You get no determinism path whatsoever — the moment you ask for it, the server won’t boot.

Gemma4-12B was simpler still: it doesn’t run on the current stable vLLM at all. The architecture is new enough that there’s no native implementation yet, the generic fallback crashes on a shape mismatch, and that’s that.

So the newest, most capable open models — the ones a bank would actually want for a hard extraction or classification task — are the ones where determinism support is missing or broken. This makes sense when you say it out loud: batch-invariant kernels are unglamorous infrastructure work that lands after a model is popular enough to justify it. Determinism support lags the model release cycle by months.

Which means your reproducibility guarantee isn’t really a property of your model at all. It’s a property of how long your inference engine has had to catch up to your model. That is a genuinely strange thing to have to put in a model risk assessment, and yet here we are.

What this means if you have to govern the thing

The FINOS AI Governance Framework does the right thing and names non-deterministic behaviour as an explicit operational risk (AIR-OP-6). What it can’t do — what no framework can do — is tell you whether the box in front of you is actually deterministic. That’s an empirical question about your model on your engine on your hardware, and the answer changes when any of those three change.

So the control that matters here is not “enable VLLM_BATCH_INVARIANT.” A control phrased as a config flag is a control that passes audit while the system quietly misbehaves. The control is an acceptance test that varies the batch: it holds a representative set of prompts fixed, runs each one through a range of batch sizes, and fails the build if the answers move. That’s ordinary System Acceptance Testing (AIGF calls it MI-5), with one twist that turns out to be the whole game: you have to vary the batch, not just re-run the prompt.
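As a rough illustration of what such a gate could look like in CI, here is a sketch that returns a non-zero exit code when any prompt’s answer moves with batch size. Everything in it is an assumption for illustration: the names find_drift and run_gate, the 1/8/32 sizes, and the two toy backends standing in for a real inference endpoint.

```python
from typing import Callable, Dict, List, Sequence

Generate = Callable[[List[str]], List[str]]  # batch of prompts in, completions out


def find_drift(generate: Generate, prompts: Sequence[str], fillers: Sequence[str],
               sizes: Sequence[int] = (1, 8, 32)) -> Dict[str, Dict[int, str]]:
    """Return {prompt: {batch_size: completion}} for every prompt whose answer moves."""
    drifting = {}
    for prompt in prompts:
        by_size = {}
        for size in sizes:
            batch = [prompt] + list(fillers[: size - 1])
            by_size[size] = generate(batch)[0]
        if len(set(by_size.values())) > 1:
            drifting[prompt] = by_size
    return drifting


def run_gate(generate: Generate, prompts: Sequence[str], fillers: Sequence[str],
             sizes: Sequence[int] = (1, 8, 32)) -> int:
    """Exit-code style result: 0 if every prompt is batch-invariant, 1 otherwise."""
    if len(fillers) < max(sizes) - 1:
        raise ValueError("not enough filler prompts for the largest batch")
    drifting = find_drift(generate, prompts, fillers, sizes)
    for prompt, by_size in drifting.items():
        print(f"DRIFT: {prompt!r}")
        for size, text in by_size.items():
            print(f"  batch={size}: {text!r}")
    return 1 if drifting else 0


def stable_backend(prompts: List[str]) -> List[str]:
    """Toy backend that ignores batch shape."""
    return [f"answer to {p}" for p in prompts]


def load_sensitive_backend(prompts: List[str]) -> List[str]:
    """Toy backend whose answers change once the batch is larger than 8."""
    tag = "light" if len(prompts) <= 8 else "busy"
    return [f"answer to {p} ({tag})" for p in prompts]


if __name__ == "__main__":
    import sys

    prompts = ["representative prompt 1", "representative prompt 2"]
    fillers = [f"filler {i}" for i in range(31)]
    sys.exit(run_gate(load_sensitive_backend, prompts, fillers))
```

Note that a plain “send it five times” check would pass against load_sensitive_backend, because every repeat at batch size one returns the same string.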

And because the answer is model- and version-specific, it’s not a one-off. It belongs next to your version pinning (AIR-PREV-10): every time you bump the model, the inference engine, or the GPU, you re-run it, because any of those can silently giveth or taketh away your determinism.
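One way to wire the re-run to version pinning is to fingerprint the stack and compare it with the fingerprint recorded at the last passing run. This is a sketch under assumptions: the field names, the placeholder version strings and the idea of storing a hash are illustrative, not something AIGF prescribes.

```python
import hashlib
import json
from typing import Dict, Optional


def stack_fingerprint(model: str, engine_version: str, gpu: str,
                      flags: Dict[str, str]) -> str:
    """Stable hash over everything that can change the determinism result."""
    payload = json.dumps(
        {"model": model, "engine": engine_version, "gpu": gpu, "flags": flags},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_retest(current: str, last_passed: Optional[str]) -> bool:
    """True if the batch-variation test has never passed on this exact stack."""
    return last_passed is None or current != last_passed


# Placeholder values; the engine version strings are made up for the example.
flags = {"VLLM_BATCH_INVARIANT": "1"}
before = stack_fingerprint("Qwen2.5-7B", "engine-x.y.0", "H100", flags)
after = stack_fingerprint("Qwen2.5-7B", "engine-x.y.1", "H100", flags)

print(needs_retest(before, before))  # False
print(needs_retest(after, before))   # True
```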

None of this is hypothetical regulatory throat-clearing. Yes, SR 11-7 wrote itself out of the GenAI conversation in April — but reproducibility didn’t stop being demanded just because one US guidance letter stepped back. Reg B, the FCRA adverse-action regime, FFIEC third-party expectations, NYDFS 500, the EU AI Act, the PRA’s SS1/23 and FINRA’s recordkeeping rules all still, in their own dialects, require that you can say what your system did and show that it would do it again. An LLM that returns $4.97 trillion or $511 billion depending on server load cannot make that promise — and the fix is real, but it’s off by default, it doesn’t work on every model, and it doesn’t exist yet on the best ones.

Set temperature to zero by all means. Just don’t mistake it for an answer. The answer is a test you have to run yourself — and keep running.