Your traditional logging strategy doesn’t work for AI systems. I learned this the hard way.

We had an AI system that started behaving weirdly. Users complained that responses changed in quality. Some queries worked great, others gave nonsensical answers. Performance metrics looked fine - latency was good, error rates were low, the system was “up.”

We had no idea what was actually going wrong because we weren’t logging the right things.

Traditional system logging tracks errors, performance metrics, user actions. That works for deterministic systems. AI systems are different - they’re non-deterministic, model versions matter, prompts are the “code” that drives behavior. If you’re not logging AI-specific information, you can’t debug, you can’t audit, and you can’t improve.

Why AI Logging is Different

Traditional web application: User submits form, system validates input, runs business logic, saves to database, returns response. You log user ID, timestamp, inputs, errors, database queries.

When something goes wrong, you can reproduce it. Run the same inputs through the same code, you get the same output. Debug, fix, deploy.

AI system: User submits prompt, system calls LLM API, gets response, returns to user. What do you log?

If you only log the traditional stuff (user ID, timestamp, errors), you’re missing critical information:

You can’t reproduce issues: The same prompt to GPT-4 can give different responses. Without logging the actual prompt and response, you can’t investigate “why did the AI say X to this user?”

Model versions matter: GPT-4 in January behaves differently than GPT-4 in July (OpenAI updates models continuously). If you don’t log which model version handled each request, you can’t correlate behavior changes to model updates.

Prompts are the logic: In traditional systems, code is the logic. In AI systems, the prompt is the logic. If your prompt changes (even slightly), behavior changes. Without logging prompts, you can’t debug “why did this stop working?”

Privacy complicates everything: Prompts often contain PII (customer names, account numbers, personal information). You need to log prompts to debug, but you can’t log raw PII to comply with privacy regulations. Tension.

What to Log: The Essential Set

The FINOS framework has guidance on this (MI-4: Observability). I’m going to walk through what actually matters, with real examples of why each piece is useful.

Log Prompts (With PII Masking)

You need the actual prompt sent to the LLM. Not just “user asked a question” but the full text - system prompt, user input, context, few-shot examples, everything.

Why? Because when users report “the AI gave me a weird answer,” you need to see what was actually sent to the model. The problem might be:

  • User input that triggers unexpected behavior
  • System prompt that’s ambiguous
  • Context that contradicts the user’s intent
  • Few-shot examples that confuse the model

Without the prompt, you’re guessing.

But: prompts contain PII. If a customer asks “what’s the balance for account 123456789 for John Smith?”, you can’t log that raw.

Solution: PII masking before logging. Replace sensitive entities with placeholders: “what’s the balance for account [ACCOUNT_ID] for [CUSTOMER_NAME]?”

You lose some debugging fidelity, but you maintain enough context to understand what happened. You can still correlate patterns (“oh, all the weird responses involve queries about account balances”).

Tools for this: Presidio (Microsoft), Amazon Comprehend, or custom regex patterns for common PII types.
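If you want to see roughly what this looks like in code, here’s a minimal sketch using Presidio’s analyzer and anonymizer. The mask_pii helper is my own illustration, and which entities actually get caught depends on the recognizers you enable:

    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def mask_pii(text: str) -> str:
        # Detect PII entities (names, phone numbers, account-like numbers, ...).
        results = analyzer.analyze(text=text, language="en")
        # Replace each detected entity with a placeholder like <PERSON> before logging.
        return anonymizer.anonymize(text=text, analyzer_results=results).text

    raw_prompt = "what's the balance for account 123456789 for John Smith?"
    safe_prompt = mask_pii(raw_prompt)  # log this, never the raw prompt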

Log Responses (Also with PII Masking)

Log what the LLM actually returned. Not just “success” or “error” but the full response text.

Why? Because the prompt might be fine, but the response might be problematic:

  • Hallucination (model made up information)
  • Format violation (you asked for JSON, got prose)
  • Inappropriate content (model included something it shouldn’t)
  • Incomplete response (model hit token limit mid-sentence)

Comparing prompts to responses shows you what the model is actually doing. Patterns emerge: “whenever we ask about X, the model hallucinates Y.”

Same PII concern, same masking solution.

Log Model Versions

Every API call should log which model version handled it. Not just “gpt-4” but “gpt-4-0613” or “gpt-4-turbo-2024-04-09” (or whatever versioning scheme your vendor uses).

Why? Because when you suddenly see quality degradation, you need to know if it correlates with a model update.

I’ve seen this exact scenario: Bank uses Azure OpenAI with auto-update enabled (default behavior). Microsoft updates GPT-4 to a newer version. Suddenly the bank’s customer service chatbot starts giving different responses. Users complain. Nobody knows why until someone checks and realizes the model version changed.

Without version logging, you don’t catch this. With version logging, you can correlate: “response quality dropped on July 15, that’s when we started seeing version X instead of version Y.”

Also matters for reproducibility. If a regulator asks “how did you reach this decision on June 3?”, you need to know which model version made that decision.
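A minimal sketch of what this looks like, assuming the OpenAI Python SDK (where the response object reports the exact version that served the request); the “expected version” pin and logger name are my own illustration:

    import logging
    from openai import OpenAI

    client = OpenAI()
    logger = logging.getLogger("ai_audit")

    EXPECTED_MODEL = "gpt-4-0613"  # the version you believe is serving traffic

    def chat(messages):
        response = client.chat.completions.create(model="gpt-4", messages=messages)
        served = response.model  # the exact version that actually handled this call
        logger.info("llm_call model_version=%s", served)
        if served != EXPECTED_MODEL:
            # Auto-update (or a vendor-side change) swapped the version underneath you.
            logger.warning("model version drift: expected %s, got %s", EXPECTED_MODEL, served)
        return response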

Log Performance Metrics

Standard observability stuff, but AI-specific metrics matter:

Latency: How long did the API call take? LLM APIs can be slow and variable. You need to track p50, p95, p99 latencies to understand user experience.

Token counts: How many tokens in the prompt? How many in the response? This matters for cost (you’re billed per token) and for understanding model behavior (responses hitting token limits are getting truncated).

Cost: If you’re using vendor APIs, each call costs money. Log the cost per request so you can track spend and optimize expensive queries.

I’ve seen banks shocked by their LLM API bills. Turns out one poorly optimized prompt was generating 10K token responses (expensive) when 500 tokens would suffice. Without token count logging, they wouldn’t have caught it.
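Here’s a rough sketch of a wrapper that captures all three on every call (again assuming the OpenAI Python SDK; the per-token prices are placeholders - plug in your vendor’s actual rates):

    import json
    import logging
    import time
    from openai import OpenAI

    client = OpenAI()
    logger = logging.getLogger("ai_metrics")

    # Placeholder per-1K-token prices - substitute your vendor's real pricing.
    PROMPT_PRICE_PER_1K = 0.03
    COMPLETION_PRICE_PER_1K = 0.06

    def logged_chat(messages, model="gpt-4"):
        start = time.monotonic()
        response = client.chat.completions.create(model=model, messages=messages)
        latency_ms = round((time.monotonic() - start) * 1000)

        usage = response.usage
        cost = (usage.prompt_tokens / 1000) * PROMPT_PRICE_PER_1K + \
               (usage.completion_tokens / 1000) * COMPLETION_PRICE_PER_1K

        logger.info(json.dumps({
            "event": "llm_call",
            "model_version": response.model,
            "latency_ms": latency_ms,
            "prompt_tokens": usage.prompt_tokens,
            "completion_tokens": usage.completion_tokens,
            "estimated_cost_usd": round(cost, 4),
        }))
        return response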

Log Error Conditions

Traditional error logging, but AI has specific error types:

Rate limits: You hit the API rate limit and the request failed. This needs different handling than other errors (backoff and retry).

API failures: The vendor’s API is down or returning errors. Distinguish this from “AI gave a bad response” - this is infrastructure failure.

Timeouts: The request took too long, so you killed it. This might indicate the model is struggling with a complex prompt, or vendor-side performance issues.

Content policy violations: The prompt or response violated the vendor’s content policy. Some vendors refuse to process certain inputs or return certain outputs. Log when this happens so you can tune prompts.

Aggregate these errors to understand patterns. Spike in rate limit errors? You need to optimize query frequency or upgrade your API tier. Spike in timeouts? Your prompts might be too complex.
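As a sketch of how the handling differs, here’s a retry loop using the OpenAI Python SDK’s exception classes (the retry count, backoff, and logger are illustrative, not a prescription):

    import logging
    import time
    import openai
    from openai import OpenAI

    client = OpenAI()
    logger = logging.getLogger("ai_errors")

    def chat_with_retry(messages, model="gpt-4", max_retries=3):
        for attempt in range(max_retries):
            try:
                return client.chat.completions.create(model=model, messages=messages)
            except openai.RateLimitError:
                # Rate limit: back off exponentially, retry, and log it so spikes show up.
                wait = 2 ** attempt
                logger.warning("rate limit hit (attempt %d), backing off %ds", attempt, wait)
                time.sleep(wait)
            except openai.APITimeoutError:
                # Timeout: maybe a too-complex prompt, maybe vendor-side slowness.
                logger.warning("timeout on attempt %d for model %s", attempt, model)
            except openai.APIStatusError as exc:
                # Vendor API failure or a policy rejection - infrastructure/policy,
                # not "the AI gave a bad answer". Log the status and stop retrying.
                logger.error("api failure, status %s", exc.status_code)
                raise
        raise RuntimeError("LLM call failed after retries")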

Log User Context

Who made the request? Not just user ID but relevant context:

  • User role/permissions (for access control audits)
  • Session ID (to group related queries)
  • Source (web app, mobile app, API client)
  • Geographic location (for data residency compliance)

Why? Because when things go wrong, you need to understand: Was this isolated to one user? One role? One client? Or system-wide?

Also critical for security audits. “Show me all AI queries from users in the compliance team during Q2” - you can’t answer without user context logging.

Log Feedback Signals

Users provide feedback on AI responses - explicitly (thumbs up/down, ratings) or implicitly (did they use the response, did they rephrase and try again, did they escalate to a human).

Log all of it. This is your ground truth for quality monitoring.

I’ve seen systems where the technical metrics looked great (low latency, low error rate) but users hated it. The feedback signals showed: 60% thumbs down on responses, high escalation rate to human agents.

Without feedback logging, you don’t know if the AI is actually helping or just technically functional.
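Even a simple feedback event record is enough to start with - something like this (the field names are my own illustration, not a standard schema):

    import json
    import logging

    logger = logging.getLogger("ai_feedback")

    feedback_event = {
        "event": "response_feedback",
        "request_id": "req-8f3a",        # ties the feedback back to the logged prompt/response
        "user_id": "12345",
        "rating": "thumbs_down",         # explicit signal
        "escalated_to_human": True,      # implicit signal
        "rephrased_and_retried": False,  # implicit signal
        "timestamp": "2024-08-12T10:31:45Z",
    }
    logger.info(json.dumps(feedback_event))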

What NOT to Log

Privacy is the constraint that makes this hard. You need information to debug, but you can’t violate privacy regulations to get it.

Don’t log raw PII: Full customer names, social security numbers, account numbers, addresses - anything that identifies individuals. GDPR, CCPA, financial regulations all prohibit unnecessary collection and retention of PII.

Use PII masking (as discussed above). Replace sensitive entities with placeholders before logging.

Don’t log sensitive business data: Trade secrets, M&A plans, executive communications - anything that’s confidential even internally. Your logs might be accessible to more people than the original documents (developers, operations, security teams).

Implement data classification: define what sensitivity levels can be logged, what requires additional protection, and what should never be logged.

Don’t retain logs forever: Storage costs money, and privacy regulations often require data minimization. Define retention policies:

  • Hot logs (recent, fast access): 30-90 days
  • Warm logs (archived, slower access): 1-2 years
  • Cold logs (compliance archives): 7 years (if required by regulation)

Delete logs that are no longer needed. Don’t hoard data “just in case.”
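If it helps to make the tiers concrete, here’s the kind of policy I mean, expressed as config (the names and durations are examples to adapt, not a standard):

    # Hypothetical retention tiers mirroring the list above.
    LOG_RETENTION_POLICY = {
        "hot":  {"store": "search cluster",     "retention_days": 90},
        "warm": {"store": "object storage",     "retention_days": 365 * 2},
        "cold": {"store": "compliance archive", "retention_days": 365 * 7},  # only if regulation requires it
    }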

The FINOS framework addresses this (MI-4.14: privacy-preserving logging). Balance: enough data to be useful, not so much you violate privacy.

Building Your Observability Strategy

Start by defining objectives. What questions do you need to answer?

Debugging: “Why did the AI give this specific response?” Requires: prompts, responses, model versions, user context

Performance monitoring: “Is the system fast enough?” Requires: latency metrics, error rates, API availability

Security auditing: “Was there unauthorized access or data leakage?” Requires: user context, access patterns, anomaly detection

Cost management: “Are we spending too much on API calls?” Requires: token counts, cost per request, usage by user/team

Quality monitoring: “Is the AI getting better or worse?” Requires: feedback signals, response quality metrics, drift detection

Different objectives need different data. Design your logging schema around the questions you actually need to answer, not just “log everything and figure it out later.”

Structure your logs: Use structured logging (JSON) not text logs. Structured logs are queryable, aggregatable, analyzable.

Bad:

    "User 12345 queried AI at 2024-08-12T10:30:00"

Good:

    {"user_id": "12345", "timestamp": "2024-08-12T10:30:00", "query_type": "customer_service", "latency_ms": 1250, "model_version": "gpt-4-0613", "tokens_used": 450}

Structured logs let you ask questions like “show me all queries with latency > 2 seconds in the past week” or “total tokens used by the customer service team this month.”
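If you’re in Python, a minimal way to emit that “good” record as a single JSON line with the standard library looks like this (a structured-logging library does the same job with less ceremony; the logger name and formatter are illustrative):

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        # Emit dict messages as one JSON line so the aggregator can index every field.
        def format(self, record):
            if isinstance(record.msg, dict):
                return json.dumps(record.msg)
            return super().format(record)

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("ai_audit")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info({
        "user_id": "12345",
        "timestamp": "2024-08-12T10:30:00",
        "query_type": "customer_service",
        "latency_ms": 1250,
        "model_version": "gpt-4-0613",
        "tokens_used": 450,
    })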

Centralized log aggregation: Don’t scatter logs across multiple systems. Use centralized log aggregation (Splunk, ELK stack, CloudWatch, Datadog - pick your poison).

FINOS MI-4.3 specifically calls this out. You need a single place to search across all AI system logs. Distributed logs are useless for debugging complex issues.

Dashboards for common questions: Build dashboards that answer your routine questions without manual log diving.

  • Error rate over time
  • Latency percentiles
  • Token usage and cost trends
  • User feedback ratings
  • Model version distribution

FINOS MI-4.4: real-time dashboards. When something goes wrong, you want to see it immediately, not discover it next week.

Retention policies: Define how long logs are kept at each access tier.

FINOS MI-4.13: retention policies must balance operational needs, compliance requirements, and privacy obligations. Document your retention decisions - regulators will ask.

Real-World Scenarios

Let me show you how good logging makes problems solvable.

Scenario 1: Debugging hallucination

User reports: “The AI told me our policy allows X, but our actual policy says Y.”

With good logging:

  • Pull up the exact prompt (shows what user asked and what context was provided)
  • See the response (confirms the AI did say X)
  • Check model version (maybe a new version hallucinates more)
  • Review similar queries (is this a one-off or a pattern?)

Without logging: “User says AI was wrong, we don’t know what actually happened, can’t reproduce, can’t fix.”

Scenario 2: Investigating cost spike

CFO asks: “Why did our AI API bill triple this month?”

With good logging:

  • Query token usage by user/team (find which team drove the increase)
  • Identify specific queries with high token counts (find the expensive prompts)
  • Review those prompts (discover someone is uploading entire documents as context)

Result: Optimize those specific use cases, costs drop back to normal.

Without logging: “I dunno, people are using it more?” (Not a helpful answer.)
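With structured logs, the investigation above is a couple of queries. A sketch, assuming the logs are exported as JSON lines with the fields shown earlier plus a hypothetical “team” field from user context:

    import pandas as pd

    # Logs exported as JSON lines, one record per LLM call.
    logs = pd.read_json("llm_logs.jsonl", lines=True)

    # Which team drove the token (and therefore cost) increase?
    by_team = logs.groupby("team")["tokens_used"].sum().sort_values(ascending=False)
    print(by_team.head())

    # Which individual requests were unusually expensive?
    print(logs.nlargest(10, "tokens_used")[["user_id", "query_type", "tokens_used"]])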

Scenario 3: Audit trail for regulator

Regulator asks: “How do you ensure employees only use AI to access information they’re authorized to see?”

With good logging:

  • Show user context (who made requests, what their role/permissions were)
  • Show access controls (RAG filtered by user permissions)
  • Demonstrate audit trail (every query logged with user identity and timestamp)

Result: Regulator satisfied you have appropriate controls and visibility.

Without logging: “We trust our employees” or “The AI handles that” (Both terrible answers in a regulated environment.)

Observability is Not Optional

If you’re running AI in production, especially in financial services, observability isn’t a nice-to-have. It’s foundational.

You will have issues. Users will report problems. Systems will behave unexpectedly. Costs will spike. Regulators will ask questions.

Without proper logging, you can’t debug issues, you can’t audit for compliance, you can’t optimize performance, and you can’t improve quality.

Design observability from the start. Retrofitting logging into a production AI system is painful - you’re flying blind until you implement it, and you’ve lost historical data that would have been useful.

Balance is the challenge: log enough to be useful, not so much that you violate privacy or drown in data. PII masking, structured logging, retention policies, and clear objectives make this manageable.

The banks that get this right treat AI observability like any other production system observability - it’s a first-class concern, not an afterthought.

If you’re building AI systems without comprehensive logging, you’re not ready for production. Fix it before you launch, not after the first major incident.