Production RAG Evals Before Ship | Triaxo Engineering Notes

By Triaxo AI Engineering
March 12, 2026
14 min read

Production RAG: run evals before you ship to customers

Retrieval quality, citation coverage, and regression suites matter more than model choice. Here is the eval ladder we use before any copilot touches production traffic.

Most RAG pilots fail in production for predictable reasons: stale indexes, missing citations, permissive tool lists, or no regression suite when prompts change. Model upgrades are the easy part; proving the system behaves on your documents and permissions is what separates a demo from something support can trust.

Start with golden questions, not vibes

We build a golden set from real tickets, policy PDFs, and the top 50 questions operators actually ask. Each item specifies expected sources, forbidden behaviors (e.g. inventing policy numbers), and whether an action is allowed. Runs are automated on every index rebuild and prompt change.

Citation coverage: every factual claim links to a chunk ID users can open.
Refusal quality: out-of-scope questions decline cleanly instead of hallucinating.
Latency budget: p95 end-to-end latency (retrieval plus generation) stays under your channel SLA.
Access control: users only see chunks their role may read.
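
Three of the checks above reduce to small, testable functions. The following is a minimal sketch, not our production harness: the `Claim` type, the role-set representation of chunk permissions, and the nearest-rank percentile are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One factual statement in an answer (hypothetical structure)."""
    text: str
    chunk_ids: list[str] = field(default_factory=list)  # citations backing the claim


def citation_coverage(claims: list[Claim]) -> float:
    """Fraction of factual claims that cite at least one chunk ID."""
    if not claims:
        return 1.0  # nothing asserted, so nothing is uncited
    cited = sum(1 for c in claims if c.chunk_ids)
    return cited / len(claims)


def acl_violations(retrieved: dict[str, set[str]], user_role: str) -> list[str]:
    """Chunk IDs shown to the user whose allowed roles exclude user_role.

    `retrieved` maps chunk ID -> set of roles permitted to read that chunk.
    """
    return sorted(cid for cid, roles in retrieved.items() if user_role not in roles)


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of end-to-end latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

A run would fail the gate if `citation_coverage` drops below the agreed threshold, `acl_violations` returns anything at all, or `p95` exceeds the channel SLA. Refusal quality usually needs a labeled out-of-scope set and is left out of this sketch.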

Layer evals like you layer releases

Offline evals gate merges. Shadow mode compares new stacks to production without customer impact. Canary channels get human review queues for sensitive intents until scores stabilize. Only then do we widen traffic.
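
The ladder can be written down as an explicit promotion rule so nobody widens traffic on a hunch. This is a sketch under stated assumptions: the stage names, the "zero shadow regressions" rule, and the definition of "stable" (the last three canary scores within 0.02 of each other) are illustrative choices, not fixed values from our playbook.

```python
from enum import Enum


class Stage(Enum):
    OFFLINE = 1
    SHADOW = 2
    CANARY = 3
    BROAD = 4


def scores_stable(scores: list[float], window: int = 3, tolerance: float = 0.02) -> bool:
    """True when the last `window` scores differ by at most `tolerance`."""
    if len(scores) < window:
        return False
    recent = scores[-window:]
    return max(recent) - min(recent) <= tolerance


def next_stage(
    stage: Stage,
    *,
    offline_passed: bool,
    shadow_regressions: int,
    canary_scores: list[float],
) -> Stage:
    """Advance one rung at most; stay put when the current gate is not met."""
    if stage is Stage.OFFLINE:
        return Stage.SHADOW if offline_passed else Stage.OFFLINE
    if stage is Stage.SHADOW:
        return Stage.CANARY if shadow_regressions == 0 else Stage.SHADOW
    if stage is Stage.CANARY:
        return Stage.BROAD if scores_stable(canary_scores) else Stage.CANARY
    return Stage.BROAD
```

Keeping the rule in code means the promotion decision is reviewed and versioned alongside the prompts it governs.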

If you cannot replay last month's incidents against today's prompt, you are not ready for broad rollout.

Triaxo delivery playbook

Instrumentation is non-negotiable: log retrieval sets, tool calls, approval outcomes, and thumbs feedback tied to conversation IDs. That dataset becomes next quarter's golden set—closing the loop most teams skip.
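
One way to make that log usable later is a single structured record per turn, keyed by conversation ID. The field names below are assumptions for illustration, and picking thumbs-down turns as golden-set candidates is one possible selection rule, not the only one.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TurnLog:
    """One logged assistant turn (hypothetical schema)."""
    conversation_id: str
    question: str
    retrieved_chunk_ids: list[str]
    tool_calls: list[str] = field(default_factory=list)
    approval_outcome: Optional[str] = None  # e.g. "approved" or "rejected"; None if no action
    thumbs: Optional[int] = None  # +1, -1, or None when the user gave no feedback
    ts: float = field(default_factory=time.time)


def to_jsonl(record: TurnLog) -> str:
    """Serialize one record as a single JSON line for append-only logs."""
    return json.dumps(asdict(record), sort_keys=True)


def golden_candidates(records: list[TurnLog]) -> list[TurnLog]:
    """Thumbs-down turns: candidates to review for the next golden set."""
    return [r for r in records if r.thumbs == -1]
```

Because every record carries the retrieval set, a reviewer can tell whether a bad answer came from missing chunks or from generation.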

Teams often benchmark RAG with a handful of cherry-picked questions, then wonder why the copilot invents refund policies in production. The gap is not model IQ—it is evaluation design tied to your documents, permissions, and update cadence.

What belongs in a golden set

Each golden item should specify: user role, input question (verbatim from tickets if possible), required source documents, acceptable paraphrases, and hard negatives (answers that must never appear). Include multilingual or OCR-noisy variants if your channel sees them.

Regression on index changes: re-run when embeddings, chunking, or ACL rules change.
Tool-use cases: separate sets for read-only Q&A vs actions that call APIs.
Adversarial prompts: jailbreaks and cross-tenant probing attempts.
Stale content: questions whose answers changed after a policy update.
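
A golden item and its pass/fail checks might look like the sketch below. The `GoldenItem` and `Answer` types are hypothetical, and hard negatives are matched by case-insensitive substring, which is a deliberate simplification: judging acceptable paraphrases needs a semantic comparison that this example does not attempt.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenItem:
    """One golden-set entry (hypothetical schema)."""
    user_role: str
    question: str
    required_sources: set[str]
    hard_negatives: list[str] = field(default_factory=list)  # phrases that must never appear


@dataclass
class Answer:
    text: str
    cited_sources: set[str]


def evaluate(item: GoldenItem, answer: Answer) -> dict[str, bool]:
    """Check that required sources are cited and no hard negative appears."""
    text = answer.text.lower()
    return {
        "sources_ok": item.required_sources <= answer.cited_sources,
        "no_hard_negative": not any(n.lower() in text for n in item.hard_negatives),
    }
```

The regression, tool-use, adversarial, and stale-content cases can all share this shape; they differ in which items they contain and when they are re-run.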

Scoring that leadership understands

Translate technical metrics into operational ones: deflection rate with confidence intervals, average handle time delta for agents, percentage of answers with clickable citations, and escalation rate to humans. A 92% "accuracy" on a synthetic set means little if agents still rewrite every answer.
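
For a rate such as deflection, the interval matters as much as the point estimate. A minimal sketch using the Wilson score interval, a standard choice for proportions on modest sample sizes; the 95% level (z = 1.96) is an assumption you can change.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default ~95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# 60 deflected conversations out of 100 sampled
low, high = wilson_interval(60, 100)
```

With 60 of 100 conversations deflected, the interval runs from roughly 0.50 to 0.69, which is a very different message to leadership than "60% deflection".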

Operating RAG after launch

Assign owners for the knowledge base the way you assign owners for APIs. Drift happens when HR updates PDFs but nobody re-embeds. Automate ingestion from source systems with version tags on chunks so you can explain which policy version produced an answer in an audit.
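
Version tags also make drift detectable by machine. In the sketch below each chunk stores a short hash of the source document it was embedded from; the `Chunk` fields and the use of a truncated SHA-256 as the version tag are assumptions, and a real source system may already expose its own version identifier.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """An indexed chunk with provenance (hypothetical schema)."""
    chunk_id: str
    doc_id: str
    doc_version: str  # version tag of the source document at embed time
    text: str


def content_version(doc_bytes: bytes) -> str:
    """Short content hash used as the version tag."""
    return hashlib.sha256(doc_bytes).hexdigest()[:12]


def stale_docs(chunks: list[Chunk], current_docs: dict[str, bytes]) -> set[str]:
    """Doc IDs whose indexed version no longer matches the source, or that were removed."""
    stale = set()
    for chunk in chunks:
        source = current_docs.get(chunk.doc_id)
        if source is None or content_version(source) != chunk.doc_version:
            stale.add(chunk.doc_id)
    return stale
```

Running `stale_docs` on a schedule turns "nobody re-embedded" into an alert, and the stored `doc_version` answers the audit question of which policy version produced an answer.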

When to narrow scope

If evals stay red after two iteration cycles, shrink the domain (one product line, one language, one intent cluster) until the evals turn green, then expand. Shipping a narrow, trustworthy copilot beats a broad, flaky one that erodes trust in month two.
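
Deciding where to narrow is easier with pass rates broken out per slice. A small sketch, assuming each eval result is tagged with a slice name such as product line plus language, and that "green" means a pass rate at or above a threshold you choose (0.9 here is a placeholder).

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_slice(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Pass rate per slice from (slice_name, passed) pairs."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for name, passed in results:
        counts[name][1] += 1
        if passed:
            counts[name][0] += 1
    return {name: p / n for name, (p, n) in counts.items()}


def green_slices(results: Iterable[tuple[str, bool]], threshold: float = 0.9) -> list[str]:
    """Slices whose pass rate meets the threshold: candidates for the narrowed scope."""
    rates = pass_rate_by_slice(results)
    return sorted(name for name, rate in rates.items() if rate >= threshold)
```

Slices that stay below the threshold are the ones to pull from scope until their content or retrieval is fixed.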

Treat prompts and indexes like production config: reviewed, versioned, and ready to roll back.

Triaxo AI Engineering

If you are planning a customer-facing assistant, start golden-set work in discovery—not the week before launch. We routinely run eval design workshops alongside architecture reviews so pilots do not stall on subjective "looks fine" sign-offs.
