“We’ve implemented all the governance stuff - the AI Ethics Board, the policies, the model inventory, the three lines of defense. But our validators keep saying they need ‘comprehensive evaluation infrastructure’ before they’ll approve our models. That’s an engineering problem, not a governance one, right?”
Sadly, this is wrong. And I get why this is confusing, because the word “evaluation” sounds technical. It conjures up images of ML engineers running test suites and measuring F1 scores. That feels like something that happens in the development process, not something the Model Risk Management function should care about.
But without evaluation, your governance program is theater. You’re going through the motions - documenting models, maintaining an inventory, holding governance committee meetings - but you have no actual evidence that your controls are working or that your AI systems are behaving appropriately.
Governance Without Measurement Is Just Hope
Model Risk Management rests on a simple principle: you need ongoing evidence that your models are performing as expected and not creating unacceptable risks. This isn’t controversial for traditional models. Nobody thinks you can validate a credit scoring model without testing it. Nobody thinks you can govern a pricing model without monitoring its predictions.
But somehow, when it comes to AI systems, I keep hearing variations of “we can’t really evaluate generative AI, it’s too non-deterministic” or “evaluation is the vendor’s responsibility” or “we’ll rely on user feedback.” These are all ways of saying “we’re not going to measure, we’re just going to hope.”
SR 11-7 - the US banking regulators’ guidance on Model Risk Management - is built on three core elements: robust model development and implementation, sound independent validation (which includes ongoing monitoring and outcomes analysis), and governance oversight. Every single one of these requires evaluation capabilities. You cannot demonstrate sound development without pre-deployment testing. You cannot validate a model without testing it systematically. You cannot monitor ongoing performance without measuring it. You cannot provide meaningful governance oversight without quantitative evidence.
The UK’s SS1/23 is even more explicit. It requires “ongoing model performance monitoring,” “outcomes analysis” to verify predictions match reality, and validation of all components that affect model outputs. Not “nice to have” or “where feasible” - required.
What Happens When Evaluation Is Missing
I’ve seen what governance without evaluation looks like. It looks like:
The Model Risk Committee gets a quarterly report that says “AI systems are operating within acceptable parameters” with no actual data supporting that claim. The committee nods, files the report, and moves on. Nobody can answer “how do you know?” because nobody is actually measuring anything.
A vendor updates their foundation model. Your procurement team confirmed the SLA didn’t change. Your legal team confirmed the contract terms are fine. But nobody actually tested whether the new model version still meets your accuracy, bias, and safety requirements for your specific use case. You find out it doesn’t when users start complaining.
Model drift happens silently. Your RAG-based customer service assistant was great when you deployed it six months ago. Then your knowledge base got updated with new product information, customer query patterns shifted, and the LLM API provider made backend changes. Performance has degraded by 15%, but nobody notices because nobody is measuring it systematically. Users just think the AI is kind of useless now.
Bias goes undetected. Your resume screening system works fine in aggregate, but it has differential error rates across demographic groups that you’d never accept if you knew about them. You don’t know because you’re not measuring fairness metrics. You find out when someone files a discrimination complaint.
All of these are governance failures. And all of them happen because evaluation wasn’t treated as a core governance capability.
Evaluation Is What Makes Governance Real
When you think about the actual requirements of Model Risk Management, evaluation is everywhere:
Development and Implementation (the first of SR 11-7’s core elements): You need to test that your model works correctly before deployment. This requires pre-deployment evaluation - acceptance testing, bias testing, adversarial testing, validation against edge cases. Not one-off manual testing, but systematic evaluation with documented evidence.
Independent Validation (the second core element): Validators need to verify that the model is conceptually sound and performs adequately for its intended use. How do they do this? They evaluate it. They test it on representative data. They measure its error rates. They assess whether it meets the risk thresholds you’ve defined. Independent validation is literally “independent evaluation.”
Ongoing Monitoring (a core component of validation under SR 11-7): You need to detect when model performance degrades, when input data distributions shift, when error patterns change, when risks materialize. All of this requires continuous evaluation of production behavior.
If you don’t have robust evaluation capabilities, you can’t actually do Model Risk Management. You can create the organizational structure and write the policies, but you can’t execute the fundamental requirements.
The Third-Party Problem
This matters even more for banks using third-party AI services, which in practice is nearly all of them. Very few banks are training foundation models from scratch. You’re using OpenAI, Anthropic, Google, AWS, or similar vendors.
Your Third-Party Risk Management (TPRM) framework requires ongoing monitoring of vendor performance. When a vendor releases a new model version, you need to assess whether it still meets your requirements. When you’re choosing between GPT-4, Claude, and Gemini for a specific use case, you need some basis for the decision beyond “the sales engineer gave a good demo.”
All of this requires evaluation infrastructure. You need a representative test dataset. You need defined metrics. You need the ability to run systematic comparisons. You need version tracking so you can detect when changes happen.
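To make that concrete, here is a minimal sketch of the kind of record such infrastructure needs to keep. Every field name and value below is illustrative, not a prescribed schema - it simply shows what “defined metrics plus version tracking” looks like when it’s written down:

```python
# Illustrative only: the fields and example values are assumptions about what
# a bank might track, not a standard.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class EvaluationRecord:
    """One systematic evaluation of one model version against one benchmark."""
    use_case: str                  # e.g. "customer-service-assistant"
    vendor: str                    # which third-party provider
    model_version: str             # the exact version identifier the vendor exposes
    benchmark_id: str              # which golden dataset was used
    run_date: date
    metrics: dict[str, float] = field(default_factory=dict)  # accuracy, bias gap, ...

record = EvaluationRecord(
    use_case="customer-service-assistant",
    vendor="example-vendor",
    model_version="2025-06-01",
    benchmark_id="cs-golden-v3",
    run_date=date.today(),
    metrics={"accuracy": 0.89, "max_demographic_error_gap": 0.018},
)
```

With records like this kept per model version and per benchmark run, “the vendor changed something” becomes a diff you can see rather than a mystery you investigate.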
I’ve seen banks struggle with this. They build an application on GPT-4 in March, and it works great. OpenAI quietly updates the model in June. The bank’s application starts behaving differently. Users complain. The bank investigates. Eventually someone figures out that the vendor changed something. Nobody knows whether the new version is better or worse, because nobody has a systematic way to measure it.
That’s a vendor risk management failure caused by inadequate evaluation capability. You can’t manage what you can’t measure.
What This Actually Looks Like
I’m not arguing that banks need to become ML research labs. Evaluation for governance purposes is practical and achievable:
For pre-deployment validation: Create a golden evaluation dataset with expert-labeled examples covering normal cases, edge cases, and adversarial inputs. Test your system against it. Document the results. This is what User Acceptance Testing looks like for AI systems - you’re just being more systematic about it.
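As a rough sketch of what that could look like in code - the file format, the call_model() stub, the exact-match scoring, and the 85% threshold are all placeholders for whatever your own system and risk appetite define:

```python
import json

ACCEPTANCE_THRESHOLD = 0.85  # placeholder: agree this with Model Risk up front

def call_model(prompt: str) -> str:
    """Stand-in for your actual AI system (API call, RAG pipeline, etc.)."""
    raise NotImplementedError

def run_acceptance_test(golden_path: str) -> dict:
    """Run the system against a golden dataset and return documented evidence."""
    with open(golden_path) as f:
        cases = json.load(f)  # e.g. [{"id", "category", "input", "expected"}, ...]

    results = []
    for case in cases:
        output = call_model(case["input"])
        # Exact match is the simplest possible scorer; generative outputs usually
        # need rubric-based, semantic, or human-reviewed scoring instead.
        results.append({
            "id": case["id"],
            "category": case["category"],  # normal / edge / adversarial
            "passed": output.strip() == case["expected"].strip(),
        })

    accuracy = sum(r["passed"] for r in results) / len(results)
    return {
        "accuracy": accuracy,
        "meets_threshold": accuracy >= ACCEPTANCE_THRESHOLD,
        "failed_case_ids": [r["id"] for r in results if not r["passed"]],
        "detail": results,  # keep the full detail as validation evidence
    }
```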
For ongoing monitoring: Sample production traffic. Run periodic evaluations. Track metrics over time. Set up alerts for when performance drops below thresholds. This is what operational monitoring looks like for AI - you’re measuring behavior, not just uptime.
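A minimal sketch of a periodic monitoring check, assuming you already have a scoring function for sampled traffic - the metric names, thresholds, and sample size are illustrative, and in practice the alerts would route into your existing monitoring and incident processes:

```python
import random

THRESHOLDS = {"accuracy": 0.85, "groundedness": 0.90}  # illustrative risk limits

def run_monitoring_check(recent_interactions: list[dict], score_sample, k: int = 100) -> list[str]:
    """Score a random sample of production traffic and flag threshold breaches."""
    sample = random.sample(recent_interactions, min(k, len(recent_interactions)))
    scores = score_sample(sample)  # e.g. {"accuracy": 0.83, "groundedness": 0.94}

    alerts = []
    for metric, floor in THRESHOLDS.items():
        value = scores.get(metric)
        if value is not None and value < floor:
            alerts.append(f"{metric} = {value:.2f} is below threshold {floor:.2f}")
    return alerts  # empty list means the check passed; either way, log the scores
```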
For third-party models: Maintain a benchmark dataset for each use case. When evaluating vendors or versions, run them all against the same benchmark. Compare results. This is what vendor evaluation looks like when you have quantitative evidence.
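Sketched below under the assumption that each candidate model is wrapped in a callable and you already have an evaluate() function for the use case - the names are hypothetical, but the point stands: every candidate is scored on the same cases with the same metrics:

```python
def compare_candidates(candidates: dict, benchmark: list[dict], evaluate) -> list[tuple]:
    """
    candidates: {"vendor-a/model-x": callable, "vendor-b/model-y": callable, ...}
    benchmark:  the shared golden dataset for this use case
    evaluate:   scores a list of outputs against the benchmark -> dict of metrics
    """
    results = []
    for name, model_fn in candidates.items():
        outputs = [model_fn(case["input"]) for case in benchmark]
        results.append((name, evaluate(outputs, benchmark)))

    # Rank by whichever primary metric this use case defines (accuracy here).
    return sorted(results, key=lambda r: r[1].get("accuracy", 0.0), reverse=True)
```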
For governance reporting: Show the Model Risk Committee actual data. “Customer service AI maintained 89% accuracy this quarter, within our 85-90% target range. Bias metrics across demographic groups were all within 2% of baseline, meeting our fairness criteria. Three adversarial test cases failed - we’ve implemented additional guardrails and scheduled a retest for next week.”
That’s what governance with evaluation looks like. The committee can actually oversee risk because they have evidence.
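If it helps, here is a trivial sketch of turning those numbers into a committee-ready statement instead of an unsupported “operating within acceptable parameters” - the figures simply mirror the hypothetical ones above:

```python
def governance_summary(accuracy: float, target: tuple[float, float],
                       max_bias_gap: float, bias_limit: float,
                       failed_adversarial: int) -> str:
    """Render quarterly evaluation results as a one-line committee statement."""
    in_range = target[0] <= accuracy <= target[1]
    fair = max_bias_gap <= bias_limit
    return (
        f"Accuracy {accuracy:.0%} ({'within' if in_range else 'OUTSIDE'} the "
        f"{target[0]:.0%}-{target[1]:.0%} target range). "
        f"Max demographic error gap {max_bias_gap:.1%} "
        f"({'meets' if fair else 'BREACHES'} the {bias_limit:.0%} fairness limit). "
        f"{failed_adversarial} adversarial test case(s) failed."
    )

print(governance_summary(0.89, (0.85, 0.90), 0.018, 0.02, 3))
```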
The Maturity Progression
Not every bank needs the same sophistication. A bank with three AI pilots doesn’t need the same evaluation infrastructure as a bank with fifty production AI systems. This should scale with maturity:
Early stage: Manual evaluation, basic test datasets, documented acceptance criteria. You’re establishing the practice of systematic testing.
Intermediate stage: Automated evaluation pipelines, ongoing monitoring dashboards, formal validation with independent review. You’re scaling evaluation to support multiple AI systems.
Advanced stage: Continuous evaluation in production, A/B testing frameworks, LLM-as-a-judge for automated quality assessment at scale. You’re treating evaluation as a production capability that enables rapid, safe AI deployment.
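For the LLM-as-a-judge piece, a minimal sketch is below. The judge_client interface, the rubric, and the 1-5 scale are all assumptions, and any judge model needs periodic calibration against human review before you rely on its scores:

```python
JUDGE_PROMPT = """You are evaluating an AI assistant's answer for a bank.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}

Score the assistant's answer from 1 (unacceptable) to 5 (correct, complete, and
compliant), then explain briefly. Reply in the form: SCORE: <n> REASON: <text>"""

def judge_answer(judge_client, question: str, reference: str, answer: str) -> int:
    """Ask a separate 'judge' model to grade one answer; returns the 1-5 score."""
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    response = judge_client.complete(prompt)  # hypothetical client interface
    score_token = response.split("SCORE:")[1].split()[0]  # simple parsing; fine for a sketch
    return int(score_token)
```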
The sophistication should grow with your AI maturity. But even at the earliest stage, evaluation can’t be treated as optional. It’s the foundation that makes everything else work.
Why This Matters Now
AI governance is maturing from aspirational principles to concrete regulatory requirements. The EU AI Act has specific testing and monitoring obligations. The NIST AI Risk Management Framework has “Measure” as one of its four core functions. Banking regulators are starting to examine AI systems and ask hard questions about validation and monitoring.
When an examiner asks “how do you know this AI system is performing appropriately?”, the answer cannot be “we checked it manually a few times when we deployed it.” It needs to be “we have systematic evaluation processes with documented results that we review quarterly.”
The banks that figure this out now - that treat evaluation as a core governance capability, not an engineering afterthought - will have a massive advantage. They’ll be able to deploy AI systems faster because they can validate them systematically. They’ll satisfy regulators more easily because they have evidence. They’ll catch problems earlier because they’re measuring continuously.
The banks that keep treating evaluation as optional, or as someone else’s problem, will struggle. Their governance programs will be mostly documentation with little substance. Their Model Risk Management teams will keep blocking AI deployments because they can’t validate them. Their examiners will find gaps.
You can’t govern what you can’t measure. And you can’t measure without systematic evaluation. It’s not a nice-to-have. It’s the foundation that makes AI governance actually work.
If you’re building evaluation capabilities for AI governance at an FSI organization, I’d be interested to hear about your approach. You can find me on LinkedIn or email paul@paulmerrison.io.