Last week, a couple of researchers from IBM put out a paper showing that, under limited conditions, they managed to get a couple of small LLMs (7-9bn parameters) to produce deterministic inference across RAG, SQL generation and more general use cases. Today’s LLM inference involves multiple non-deterministic factors—GPU kernels, parallelism, sampling methods—so identical prompts can produce subtly or significantly different outputs. Theirs is not the only recent work in this area, so I thought I’d spend a bit of time explaining why a truly deterministic LLM would be a big deal for the finance industry.
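
For context, here’s a minimal sketch of what “repeatable decoding” usually means in practice today: greedy decoding with a fixed seed via Hugging Face transformers (the model name below is a placeholder, not the model from the paper). Even with every knob pinned like this, kernel scheduling, batching and hardware differences can still shift outputs, which is exactly the gap the deterministic-inference work is trying to close.

```python
# Minimal sketch: the standard knobs for "repeatable" decoding with Hugging Face
# transformers. Even with all of these pinned, kernel scheduling, batching and
# hardware differences can still change outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "my-org/open-weight-7b"  # placeholder for any local open-weight model

torch.manual_seed(0)  # pin any remaining sampling randomness
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Does this application meet our KYC documentation standard? Answer yes or no."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        **inputs,
        do_sample=False,      # greedy decoding: no temperature / top-p sampling
        max_new_tokens=16,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```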

SR 11-7: The Regulatory Context

When it comes to technology risk, banks are naturally conservative (as a customer of several, I’m quite grateful for that). As we’ve talked about before, the big bad when it comes to AI banking regulation in the US is SR 11-7. It’s a Federal Reserve guidance letter written in the wake of the 2008 financial crisis, when it turned out that half of the banks were calculating capital adequacy in an Excel spreadsheet that got emailed around once a week and that everyone was too scared to change.

It applies to “any method, system, or approach that processes input data to produce estimates, scores, classifications, or decisions”. That would include systems that produce numeric outputs, categorical outputs, or structured outputs that are used to make or support decisions, even if they’re not numerical. The word “quantitative” appears early on in the letter, and sometimes people get confused and assume that means “systems that do math”, but it’s not as simple as that (sadly for us).

Now that we know that, we can see pretty clearly which use cases fall inside and outside of the regulation: purely generative use cases like code generation, document summarisation and freeform chat assistants are clearly outside it (banks tend to call these “Tier 3” use cases), but there are a bunch of use cases (worth approximately $920bn in annual operational efficiency gains, according to Morgan Stanley estimates) that do bring us in scope. If you’re speaking to a regulator who’s had their coffee that morning, they might still ask you “does the output feed downstream business logic?” or “could employees rely on the generated content for decisions?”, so you still have to be careful.

Regulated Use Cases

Let’s spend a little more time digging into what kinds of use cases would therefore be regulated, and then we’ll finally get to why deterministic LLMs would tick one of the boxes we need to address them.

The first is systems that produce binary or categorical decision outputs. Think yes/no credit decisioning, systems that determine whether a client is high risk, systems that determine whether documentation satisfies KYC regulations, and so on. The output isn’t numeric, but it is a categorical classification, which means it falls in scope.

Another example would be a system where an LLM extracts or computes features used in downstream scoring. For instance, if you use an LLM-based system to extract information from documents (income amounts, employment verification results, risk-relevant features etc.) and those features feed a downstream risk model, the whole system falls into scope. This one is a bit counterintuitive, as it would seem the LLM is just doing natural language processing, but you have to remember that the “model boundary”—the complete scope of what regulators consider a single model for validation purposes—is NOT just the language or ML model, it’s the entire system including pre-processing, post-processing, and decision logic.
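
To make the model-boundary point concrete, here’s a deliberately simplified, hypothetical sketch; every name in it (extract_income, score_application, the approval threshold) is illustrative rather than taken from any real system. The thing a validator has to sign off on is the whole pipeline, not just the LLM call in the middle.

```python
# Hypothetical sketch of a model boundary: the LLM extraction step, the
# downstream scoring logic and the decision threshold are all one "model"
# for SR 11-7 purposes, not just the LLM in the middle.

def extract_income(document_text: str, llm) -> float:
    """LLM-based feature extraction (pre-processing inside the boundary)."""
    answer = llm(f"Extract the applicant's stated annual income from:\n{document_text}")
    return float(answer.strip().replace(",", ""))

def score_application(income: float, requested_amount: float) -> float:
    """Downstream scoring logic (also inside the boundary)."""
    return min(income / max(requested_amount, 1.0), 10.0)

def decide(document_text: str, requested_amount: float, llm) -> str:
    """Decision logic (still inside the boundary)."""
    score = score_application(extract_income(document_text, llm), requested_amount)
    return "refer_to_underwriter" if score < 3.0 else "approve"
```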

I could spend another 1,000 words writing about additional use cases that would be regulated: LLM-based underwriting or risk assessments, automated compliance or AML classifications, LLM outputs that trigger workflow decisions, LLMs that produce narrative justifications but not the decision itself, etc. etc.

Hopefully by now you’re convinced that there are some genuinely interesting use cases that are squarely regulated—so what’s such a big deal that banks currently won’t go near them?

Why SR 11-7 with LLMs is Uniquely Hard

Before I started to dig into this, I was in the camp of “I’ve got my LLM, I wrap it in guardrails, post-processing, decision logic etc., surely I can make something that’s ok?” Then I spent time with the regulations and realised that there are at least ten problems with that hypothesis. I’m going to try to keep this brief, but as far as I can see, the ten requirements that are still hard are:

  1. Clear Model Definition and Theory Of Operation - SR 11-7 expects a transparent model boundary (ok, fine), a documented theory of operation explaining how the model works (ah…), and identification of assumptions, limitations and intended use. So sadly a magic black box that gives different answers on subsequent runs doesn’t cut it. To be clear, SR 11-7 does not require source-level interpretability, but it does require a conceptual model that is stable, reproducible, and whose failure modes are understood.

  2. Comprehensive model validation - SR 11-7 requires validation that covers conceptual soundness, ongoing monitoring, outcomes analysis, benchmarking and independent validation. Sadly we lack a well-established mathematical framework for LLM decision logic, and non-deterministic outputs make it very hard to perform an outcomes analysis when we get different answers every time. SR 11-7 assumes that statistical testing is meaningful, which breaks down if outputs vary stochastically and output distributions depend on subtle changes in the environment (like batch size). There’s a toy illustration of this after the list.

  3. Data Quality and Representativeness - banks need to understand the training data, its representativeness, potential biases and data limitations. This puts a big red line through any of the leading foundation models, as the training data is opaque. Some vendors claim to provide high-level summaries, but regulators generally want curated provenance, not “trust me bro”.

  4. Performance Testing Across the Full Expected Domain - that means testing with the full range of inputs, all material business conditions, known stress scenarios etc. If the input domain is unbounded (i.e. any text, any complexity) and the LLM can hallucinate outputs then you’re going to struggle with this.

  5. Stability, Change Management & Version Control - foundation model vendors love silently updating models under the covers. This once again disqualifies proprietary models like ChatGPT, Claude, and similar services.

  6. Governance, Use Limits & Controls - SR 11-7 expects clearly defined use cases, limits on extrapolation, and guardrails that prevent misuse. The issue with LLMs is that, due to their general-purpose nature, they can handle inputs far outside their intended use. You can probably solve this one with clever enough guardrails, though there is no reliable way today to enforce semantic constraints on LLM behaviour.

  7. Independent Review of All Model Components - this is asking for independent review of model structure, independent testing and full transparency. You can possibly manage this with open-weight models, but it’s not going to happen with proprietary foundation models.

  8. Documentation - detailed model design, clear explanations of the algorithms, mathematical justifications, and assumptions and limitations. Without being able to document the training-data lineage and the intermediate reasoning steps from input to final classification, you’re potentially stuck on this one.

  9. Robust monitoring - Banks need to track drift, bias, error rates, stability, threshold breaches, all that fun stuff. LLMs’ inherent output variance makes all of this hard to measure reliably.

  10. Explainability - this isn’t explicitly named, but it is implicit in the other requirements for conceptual soundness, model transparency etc. LLMs can’t provide reliable causal, mechanistic or mathematical explanations for why a given decision was made.
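
To see why the outcomes-analysis point from item 2 bites, here’s a toy simulation with no real LLM in it: a stand-in classifier that flips its answer 10% of the time when sampling is on. The measured accuracy becomes a random variable, so two independent validation runs disagree, while the deterministic path returns the same number every time.

```python
# Toy illustration: why outcomes analysis is awkward for stochastic classifiers.
# No real LLM here; the "model" flips the correct label 10% of the time when
# sampling is on, so the measured accuracy itself is a random variable.
import random

labels = ["approve" if i % 3 else "decline" for i in range(500)]  # fake ground truth

def classify(true_label: str, sample: bool, rng: random.Random) -> str:
    if sample and rng.random() < 0.10:                 # stochastic "decoding"
        return "decline" if true_label == "approve" else "approve"
    return true_label                                  # deterministic path

def accuracy(sample: bool, seed: int) -> float:
    rng = random.Random(seed)
    preds = [classify(y, sample, rng) for y in labels]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

print("stochastic run 1:   ", accuracy(sample=True, seed=1))   # e.g. ~0.90
print("stochastic run 2:   ", accuracy(sample=True, seed=2))   # e.g. ~0.91, different
print("deterministic run 1:", accuracy(sample=False, seed=1))  # 1.0
print("deterministic run 2:", accuracy(sample=False, seed=2))  # 1.0, identical
```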

What Gets Better With Deterministic LLMs—And What Doesn’t

So now that we’re thoroughly depressed and thinking about just putting our LLMs in the bin, what would a truly deterministic LLM (where a given input always produces the same output under all conditions) get us? Throughout, we need to keep in mind that determinism eliminates variance, not errors, and that semantic drift from context changes, input formatting variations or prompt ordering will still remain.

  1. Validation becomes more feasible - you can get stable, repeatable benchmarks, and independent validators can reproduce the same results. You can measure error rates, precision/recall, false positives etc., and regression testing becomes possible and meaningful (see the sketch after this list). We’re still stuck with the challenges of defining representative test sets for an unbounded input space, creating domain coverage frameworks for natural-language inputs and managing failures like hallucinations or reasoning errors. Essentially we are removing random variation, not logical unpredictability.

  2. Monitoring model drift becomes feasible - we could set up monitoring that picks up real behaviour changes, not just stochastic sampling noise. Our stability metrics would become meaningful, and we’d only detect drift due to actual model changes, not nondeterminism. We still need to deal with vendor update challenges and the immaturity of drift-detection frameworks for LLMs, but this is a big improvement.

  3. Change management becomes workable - we can do reliable before/after comparisons, which means we can apply our standard change management process to model updates.

  4. Intended use gets easier, but isn’t solved - a given jailbreak prompt either always works or always fails, which at least makes it easier to test. Out-of-domain behaviour becomes predictable enough to reject inputs reliably. On the downside, LLMs might still generalise outside their intended use case in semantic ways, and prompt injection remains possible, though it’s easier to test systematically.

  5. Documentation improves, explainability does not - Unfortunately for us, determinism doesn’t do anything to reveal the internal causal mechanisms of the LLM. With sufficiently detailed logging of inputs, outputs, and decision context, we can at least create an audit trail that shows consistent behaviour patterns, even if we can’t explain the underlying reasoning.

Everything else we mentioned above (data lineage, conceptual soundness, risk of logical errors etc.) remains an issue.
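
As a concrete illustration of points 1 to 3 above, here’s a minimal sketch that assumes deterministic generation; the generate callable, pinned_prompts, approved_digest and approved_outputs are placeholders. A pinned test set plus a digest of the approved outputs gives you both a regression check and a simple drift metric.

```python
# Minimal sketch, assuming deterministic generation: a pinned test set plus a
# digest of the approved outputs gives you regression testing (any change in
# the digest is a real behaviour change, never sampling noise) and a simple
# drift metric (what fraction of pinned answers moved since approval).
import hashlib
import json

def run_benchmark(generate, prompts: list[str]) -> list[str]:
    """Run every pinned prompt through the model; deterministic, so repeatable."""
    return [generate(p) for p in prompts]

def output_digest(outputs: list[str]) -> str:
    """Single fingerprint of the model's behaviour on the pinned test set."""
    return hashlib.sha256(json.dumps(outputs).encode("utf-8")).hexdigest()

def drift_rate(outputs: list[str], approved_outputs: list[str]) -> float:
    """Fraction of pinned answers that changed since the last approved version."""
    changed = sum(a != b for a, b in zip(outputs, approved_outputs))
    return changed / len(approved_outputs)

# Hypothetical usage at model-change time:
# current = run_benchmark(my_model, pinned_prompts)
# if output_digest(current) != approved_digest:  # regression check
#     print(f"Behaviour changed on {drift_rate(current, approved_outputs):.1%} of cases")
```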

Deterministic LLMs + Transparency = $$$ (saved)

You may have noticed a theme in the issues that remain even after we add determinism: most of them trace back to opacity. Opacity about the training data, opacity about the fine-tuning process, opacity about why the model generalises the way it does. The moment you introduce determinism and transparency—either because the bank trains the model itself or because you’re using an open-weight model with fully disclosed training data—the picture changes dramatically.

Suddenly you can treat an LLM like any other high-stakes model. Deterministic inference means the outputs are reproducible; transparent weights and documented datasets mean the model is no longer a black box. That doesn’t magically eliminate logical errors or hallucinations, but it does make the system tractable for SR 11-7 governance.

Concretely, this unlocks several things:

  1. Valid SR 11-7 Model Definitions - You can now define a clear model boundary, describe the conceptual mechanics, document the training corpus, and justify why the model should behave sensibly in the intended domain. Regulators don’t need an internal causal map of every neuron—they need a defensible theory of operation and evidence that it holds.

  2. Fully Reproducible Validation - Determinism removes sampling noise, and transparency removes the “trust us” factor. Validators can independently reproduce benchmarks, stress tests, failure analyses and fairness checks. This is the point where LLMs start looking like normal classification models again rather than generative heuristics.

  3. Defensible Change Management - With deterministic outputs and a known training lineage, before/after comparisons become meaningful. You can quantify deltas, run full regression suites, and explain behavioural changes to auditors. Silent model updates from vendors stop being an existential threat to your framework.

  4. SR 11-7-Safe High-Stakes Use Cases - This is the big commercial unlock. Once LLMs become predictable and reviewable, banks can apply them to workflows they currently avoid entirely. Examples include:

    • Yes/No credit decisioning (actual automated adjudication, not just assisting analysts)
    • Structured credit underwriting with deterministic extraction, scoring, or narrative justification
    • SME and commercial credit assessments based on document analysis
    • KYC/AML/Fraud classification with stable risk signals and repeatable decisions
    • Document intelligence as part of regulated workflows, not just internal convenience tools
    • Consistent interpretation of regulatory rules for compliance automation
    • Deterministic data extraction in supervisory reporting pipelines
    • Wealth, treasury and trading applications where predictability is non-negotiable

    These shift from “absolutely not” to “governable with the usual MRM machinery.”

  5. A More Bank-Friendly LLM Ecosystem - Once models are testable, reviewable, and reproducible, SR 11-7 stops being the immovable barrier it is today. Instead of wrapping proprietary, stochastic models in layers of guardrails and praying, banks can adopt (or train) models that behave like software rather than oracles. The result: lower validation cost, faster approvals, and the ability to automate entire classes of processes currently stuck in manual review.

In short: determinism gives you stability; transparency gives you defensibility. Together they finally allow LLMs to sit comfortably inside traditional model-risk frameworks—and that’s where the real economic upside lives.

What’s Still Left

Lest we get too carried away, it’s worth summarising what’s left unsolved by determinism, as a kind of “cheat sheet” for governance teams and vendors looking to fill the gap:

  • hallucinations are still a thing!
  • adversarial prompting / prompt-injection
  • conceptual opacity
  • difficulty defining the operational domain
  • brittleness to small linguistic changes
  • lack of causal grounding

The Future Is Open

This has been a long post (and could’ve been much longer), but hopefully it’s given you a good overview of how SR 11-7 and LLMs interact, and why there’s so much interest in deterministic LLMs. To me it seems clear that truly open models are the future of LLM usage in regulated workloads—the big AI labs won’t get a look-in without radical changes to the way they disclose to clients and regulators. Determinism + open weights does not solve every SR 11-7 challenge, but it finally places LLMs inside the same regulatory frame as traditional models—testable, reviewable, and governable.