I was talking to a CRO at a mid-sized bank last month about their AI governance program. They’d built a loan underwriting assistant using GPT-4: a RAG system pulling from their policy documentation, some clever prompt engineering, and a few business rules on the output. Pretty standard stuff. When I asked about their Model Risk Management (MRM) approach, they said, “We can’t validate GPT-4, it’s OpenAI’s model, so we’re treating it as a vendor tool.”
This is the wrong answer, but I understand why they gave it. When you read SR 11-7 (the Federal Reserve and OCC’s 2011 supervisory guidance on Model Risk Management), it talks about validating “quantitative models” - your credit scoring algorithms, your VaR calculations, your pricing models. It doesn’t say much about what to do when the “model” is actually a complex pipeline involving someone else’s API, your data, your prompts, your business logic, and your output validators all glued together.
But regulators won’t care that it’s complicated. They’ll ask: “Show me your model inventory. Show me the validation reports. Show me how you’re monitoring this.” And “we can’t validate the LLM” isn’t going to cut it.
What Counts as “The Model”?
The naive approach - and the one I see most banks taking - is to say the “model” is the foundation model itself. The model is GPT-4 (or Claude, or Gemini, or whatever). This makes a certain kind of intuitive sense. That’s where the AI lives, right? That’s the thing doing the predictions.
But it’s also useless for MRM purposes. You can’t validate a black-box API from OpenAI. You don’t have access to the training data, the model architecture, the fine-tuning process, or the weights. OpenAI (reasonably) isn’t going to give you a validation report that satisfies your regulators. So if the “model” is just GPT-4, you’re stuck.
The correct approach - and the one that’s starting to emerge as regulatory consensus - is that the “model” is the entire decision-making system. Not the LLM in isolation, but the whole pipeline:
- Your prompt engineering layer (which encodes business logic and decision criteria)
- Your RAG system and knowledge base (which determines what context the LLM sees)
- The LLM API call itself (yes, GPT-4 or whatever)
- Any agent orchestration or chain-of-thought reasoning
- Tool calls and function execution
- Your output validators and parsers
- Business rules and post-processing logic
- Guardrails and safety filters
- Any human-in-the-loop checkpoints
All of it. The whole stack. That’s your model.
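To make that concrete, here is a minimal sketch of what “the model” looks like when you draw the boundary at the system level. The class names, interfaces, and fields are my illustrative assumptions, not a reference implementation - the point is simply that the governable unit is the composed system, with the vendor LLM as one component inside it.

```python
# Illustrative sketch only: names and interfaces are assumptions, not a
# prescribed implementation. The governable "model" is the composed system;
# the vendor LLM API is just one component inside it.
from dataclasses import dataclass


@dataclass
class UnderwritingDecision:
    recommendation: str          # e.g. "approve", "refer", "decline"
    approved_amount: float       # post-business-rules figure
    rationale: str               # LLM-generated explanation, post-guardrails
    blocked_by_guardrail: bool = False


class LoanUnderwritingAssistant:
    """The unit that belongs in the model inventory and gets validated."""

    def __init__(self, retriever, prompt_template, llm_client,
                 output_validator, business_rules, guardrails):
        self.retriever = retriever                # RAG over lending policy docs
        self.prompt_template = prompt_template    # encodes decision criteria
        self.llm_client = llm_client              # vendor API (GPT-4 or similar)
        self.output_validator = output_validator  # parsing / schema checks
        self.business_rules = business_rules      # e.g. LTV caps, eligibility rules
        self.guardrails = guardrails              # e.g. PII filters, refusal policies

    def assess(self, application: dict) -> UnderwritingDecision:
        context = self.retriever.retrieve(application)       # determines what the LLM sees
        prompt = self.prompt_template.render(application, context)
        raw_output = self.llm_client.complete(prompt)         # the only vendor-owned step
        decision = self.output_validator.parse(raw_output)    # reject malformed output
        decision = self.business_rules.apply(application, decision)
        return self.guardrails.apply(decision)                # last line of defence
```

Every line after the API call is yours, and every one of them can change the decision.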
Why This Matters (And Why UK Banks Have It Easier)
If this seems overly broad, consider what actually drives the outputs of your loan underwriting assistant. Is it just GPT-4? Or is it GPT-4 plus the specific loan policies you’ve embedded in your RAG system plus the prompt that tells it to weight creditworthiness factors a certain way plus the business rule that caps all loan recommendations at 80% LTV plus the guardrail that blocks outputs containing PII?
Any of those components could change your model’s behavior. If your RAG system has outdated lending policies, you’ll make wrong decisions. If your prompt engineering shifts how risk factors are weighed, you’ll make different decisions. If your output validator has a bug, you’ll pass through bad recommendations.
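Two of those non-LLM components, sketched in a few lines (the threshold and the regex are illustrative assumptions): a bug in either would change lending decisions without GPT-4 changing at all.

```python
import re

MAX_LTV = 0.80  # the 80% LTV cap from the example above (illustrative threshold)


def apply_ltv_cap(recommended_amount: float, property_value: float) -> float:
    """Business rule applied after the LLM: cap any recommendation at 80% LTV."""
    return min(recommended_amount, MAX_LTV * property_value)


# Deliberately crude PII check for illustration; a production guardrail would be
# far more thorough than a single SSN-like regex.
PII_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_guardrail(llm_output: str) -> str:
    """Block outputs containing apparent PII before they reach the underwriter."""
    if PII_PATTERN.search(llm_output):
        return "[OUTPUT WITHHELD: possible PII detected; routed to manual review]"
    return llm_output
```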
In Model Risk Management terms, all of these things contribute to your model risk. You can’t just say “the AI part isn’t our responsibility” any more than you could say “Excel isn’t our responsibility” when your model was Janet’s spreadsheet that calculated loan reserves.
The UK’s Prudential Regulation Authority (PRA) actually figured this out when they updated their model risk guidance in 2023. SS1/23 - the UK equivalent of SR 11-7 - explicitly addresses “Post-Model Adjustments” (PMAs). Any manual overrides, business rules, or processing steps that happen after your “model” runs? Those are part of the model for governance purposes. The UK guidance is unambiguous: if it affects the decision, it’s in scope.
The US guidance (SR 11-7, written in 2011 when “model” usually meant a logistic regression in SAS) is less explicit about this. But regulators are increasingly interpreting it the same way. The OCC isn’t going to accept “the model is just GPT-4” when they examine your AI systems. They’re going to ask how you validate the entire decision-making process.
The Validation Problem (And Why This Is Actually Good News)
Once you accept that the “model” is the entire system, validation becomes possible. You can’t back-test GPT-4’s training process, but you can:
- Validate your prompt engineering (does it correctly encode your business requirements?)
- Test your RAG retrieval quality (are you pulling the right documentation?)
- Evaluate hallucination rates on real customer queries
- Measure bias across demographic groups in your actual data
- Red-team your guardrails (can they be bypassed?)
- Monitor drift in decision patterns over time
- Validate that your business rules work correctly
- Test failure modes (what happens when the API is down?)
All of these are standard Model Risk Management practices, just adapted to a GenAI system instead of a traditional statistical model. You’re not validating whether GPT-4’s neural architecture is sound (good luck with that). You’re validating whether your system - the one you built and deployed - behaves appropriately for its intended use.
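As a hedged sketch of what a few of the checks listed above might look like in practice - the dataset fields, the assistant interface, and the metrics are assumptions for illustration, not a standard:

```python
# Illustrative system-level validation checks; field names, the assistant
# interface, and any acceptance thresholds are assumptions.

def retrieval_hit_rate(retriever, labelled_queries) -> float:
    """Share of test queries where the expected policy document is retrieved."""
    hits = sum(
        case["expected_doc_id"] in {doc.doc_id for doc in retriever.retrieve(case["query"])}
        for case in labelled_queries
    )
    return hits / len(labelled_queries)


def grounded_output_rate(reviewed_cases) -> float:
    """1 minus the hallucination rate: share of sampled outputs a human reviewer
    marked as grounded in the retrieved policy text."""
    return sum(case["reviewer_marked_grounded"] for case in reviewed_cases) / len(reviewed_cases)


def approval_rate_by_group(assistant, applications_by_group) -> dict:
    """Approval rates per demographic group on your own data, for fair-lending review."""
    return {
        group: sum(assistant.assess(app).recommendation == "approve" for app in apps) / len(apps)
        for group, apps in applications_by_group.items()
    }
```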
This is why defining the model boundary correctly is actually good news for banks that want to deploy generative AI in production. It transforms an impossible problem (validate GPT-4) into a tractable one (validate your system that uses GPT-4).
What This Means for Implementation
If you’re building AI systems that will be subject to Model Risk Management - and in financial services (FSI), that’s basically any AI system that materially affects business decisions - you need to think about the model boundary from day one.
That means:
Documentation needs to cover the whole pipeline. Your model development documentation can’t just say “we use GPT-4.” It needs to explain your prompt design, your RAG architecture, your business logic, your guardrails - everything that contributes to decisions.
Your model inventory needs system-level entries. Don’t inventory “GPT-4” as your model. Inventory “Commercial Loan Underwriting Assistant” as the model, and document that it uses GPT-4 as one component.
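For illustration, a system-level inventory entry might look something like this - the field names, identifiers, versions, and dates are hypothetical, not a prescribed schema:

```python
# Hypothetical inventory entry: the inventoried model is the system, and the
# foundation model appears only as one documented component with a TPRM reference.
INVENTORY_ENTRY = {
    "model_id": "MDL-2025-014",  # hypothetical identifier
    "name": "Commercial Loan Underwriting Assistant",
    "materiality_tier": "High",
    "owner": "Commercial Lending - Credit Risk",
    "components": [
        {"type": "foundation_model", "name": "GPT-4", "vendor": "OpenAI",
         "tprm_reference": "TPRM-0231"},  # hypothetical reference
        {"type": "rag_knowledge_base", "name": "Lending policy corpus", "version": "2025-05"},
        {"type": "prompt_templates", "version": "v1.7"},
        {"type": "business_rules", "name": "LTV cap, eligibility checks", "version": "v3.2"},
        {"type": "guardrails", "name": "PII filter, refusal policy", "version": "v2.0"},
    ],
    "last_validation": "2025-04-30",
}
```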
Validation needs to be system-level. Testing GPT-4’s general capabilities isn’t validation. Testing your underwriting assistant’s behavior on real loan applications - that’s validation.
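In code terms, that is the difference between benchmarking GPT-4 and writing behavioural tests against your own deployed system - something like the pytest-style sketch below, where the fixtures and fields are illustrative assumptions.

```python
# Illustrative pytest-style behavioural tests run against the deployed system
# (prompts + RAG + LLM + rules + guardrails), not against GPT-4 in isolation.
# `assistant` and `historical_applications` are assumed fixtures.

def test_ltv_cap_is_never_exceeded(assistant, historical_applications):
    for application in historical_applications:
        decision = assistant.assess(application)
        if decision.recommendation == "approve":
            assert decision.approved_amount <= 0.80 * application["property_value"]


def test_rationale_never_contains_raw_ssn(assistant, historical_applications):
    for application in historical_applications:
        rationale = assistant.assess(application).rationale
        assert not PII_PATTERN.search(rationale)  # PII_PATTERN from the guardrail sketch above
```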
Governance applies to all components. When you update your RAG data, that’s a model change. When you tweak your prompts, that’s a model change. When you adjust your business rules, that’s a model change. All of these may require validation depending on their materiality.
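One lightweight way to enforce this, sketched under the assumption that prompts, the RAG snapshot, and business rules are versioned as configuration: fingerprint the non-LLM components and route any change through MRM.

```python
# Hedged sketch: treat any change to prompts, the RAG snapshot, or business
# rules as a model change that at minimum triggers a materiality assessment.
# The hashing scheme and configuration fields are illustrative assumptions.
import hashlib
import json


def config_fingerprint(prompt_templates: dict, rag_snapshot_id: str, business_rules: dict) -> str:
    """Stable hash over the non-LLM components that shape the system's behaviour."""
    payload = json.dumps(
        {"prompts": prompt_templates, "rag_snapshot": rag_snapshot_id, "rules": business_rules},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()


def change_requires_mrm_review(previous_fingerprint: str, current_fingerprint: str) -> bool:
    """Any drift in the fingerprint is a model change to be assessed for materiality
    (and, where material, revalidated) before it reaches production."""
    return previous_fingerprint != current_fingerprint
```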
Third-party risk management applies to the LLM API. Yes, GPT-4 is still a vendor component, and yes, you need TPRM (Third-Party Risk Management) processes for OpenAI. But that’s in addition to, not instead of, system-level model validation.
The FINOS Connection
This is where frameworks like FINOS AI Governance become valuable. The FINOS framework is built around system-level thinking - it’s designed to help you govern the entire AI pipeline, not just the model weights. The risk catalogue covers things like data quality (your RAG system), prompt injection (your input handling), model drift (your monitoring), and output validation (your guardrails).
When you map FINOS to SR 11-7 or SS1/23, the model boundary question essentially solves itself. You’re already thinking about governance at the system level. You’re already documenting the full pipeline. You’re already considering risks across all components. That’s what MRM requires.
The banks that are getting this right aren’t the ones trying to validate GPT-4. They’re the ones who’ve figured out that MRM for AI systems isn’t fundamentally different from MRM for traditional models - you just need to expand your thinking about what constitutes “the model.”
Where Most Banks Are Today
In my conversations with FSI risk teams, I’d estimate maybe 20% have really internalized this system-level view. The other 80% are still wrestling with the cognitive dissonance of “SR 11-7 says I need to validate my models, but I can’t validate GPT-4, so… now what?”
The answer is: stop trying to validate GPT-4. Validate your system. Define your model boundary at the system level. Document, test, and monitor the whole pipeline. Treat the LLM API as one component (a very important one, but still just one component) of a larger model that you can actually govern.
The UK banks that are following SS1/23 have a clearer path forward because the guidance is more explicit. US banks working with SR 11-7 are having to interpret 2011 guidance for 2025 technology. But the regulatory intent is the same: if you’re making material business decisions with a quantitative system, you need to govern it properly.
And that means getting the model boundary right.
If you’re wrestling with Model Risk Management for AI systems, I’d be interested to hear how your organization is defining model boundaries. You can find me on LinkedIn or email paul@paulmerrison.io.