RAG has quickly become one of the most widely adopted patterns in applied AI. On paper, it feels like the missing piece: connect large language models to your internal data, ground responses in facts, and reduce hallucinations. For many teams, RAG looks like the bridge between impressive demos and trustworthy production systems.
In reality, that bridge is narrower, and shakier, than expected.
After working with RAG systems in real production environments, one pattern keeps repeating itself: RAG doesn’t fail because vector search is hard. It fails because production reality exposes every hidden assumption teams make during experimentation.
When “More Context” Breaks the System
One early production deployment still stands out.
The team had done everything “right.” They built a solid embedding pipeline, indexed thousands of documents, and tuned retrieval to maximise recall. When users asked a question, the system pulled in as much relevant material as possible. The thinking was simple: more context means more grounding, and more grounding means fewer hallucinations.
What happened next was unexpected.
Answers became less precise. The model started blending policies from different time periods, combining edge cases into confident but incorrect conclusions. When users challenged responses, every cited document was technically relevant, but no single document actually supported the answer.
Nothing was broken. The system was working exactly as designed.
The issue was conceptual. The team had treated context as an unqualified good, when in reality context is one of the most dangerous inputs you can give a model. In production, every extra paragraph competes for attention. Every loosely related document increases ambiguity. Instead of grounding the model, the system was overwhelming it.
That moment tends to be the turning point for most RAG teams: the realisation that retrieval alone does not equal reliability.
RAG Is Not About Search. It’s About Context Control.
In production systems, RAG is less about finding information and more about controlling what the model is allowed to see. The difference is subtle but crucial.
Successful RAG systems don’t aim to retrieve everything that might help. They aim to retrieve only what is necessary, in a form the model can actually reason over. This means aggressively scoping retrieval, filtering out near-duplicates, and structuring context so that the model understands what matters and what does not.
In practice, this turns RAG into a context engineering problem. Teams that succeed treat the context window like a scarce resource, not an infinite buffer. They design explicit rules around which sources can appear together, how conflicting information is handled, and when the system should refuse to answer rather than guess.
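As a concrete illustration, here is a minimal Python sketch of that kind of context assembly. The Chunk shape, the score floor, the deduplication threshold, and the word-count token estimate are illustrative assumptions, not recommended values:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float            # retrieval similarity, higher is better
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_context(chunks: list[Chunk],
                  min_score: float = 0.75,
                  dedup_threshold: float = 0.95,
                  token_budget: int = 2000) -> list[Chunk]:
    """Keep only what is necessary: a score floor, a near-duplicate filter, a hard budget."""
    selected: list[Chunk] = []
    used_tokens = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break   # everything below the floor is noise, not grounding
        if any(cosine(chunk.embedding, kept.embedding) > dedup_threshold for kept in selected):
            continue  # near-duplicate of something already in context
        cost = len(chunk.text.split())  # crude token estimate; swap in a real tokenizer
        if used_tokens + cost > token_budget:
            break   # the context window is a scarce resource, not an infinite buffer
        selected.append(chunk)
        used_tokens += cost
    return selected  # may be empty: an empty context is a signal to refuse, not to guess
```

The specific numbers matter far less than the fact that the budget and the filters are explicit, and therefore testable.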
The surprising outcome is that high-quality RAG systems often retrieve less than early prototypes, and deliver far better answers as a result.
The Quiet Problem of Instability
Another production surprise comes from determinism.
Many teams expect that once answers are grounded in documents, variability disappears. But users quickly notice something else: the same question, asked twice, can still produce meaningfully different answers. Sometimes the wording changes. Sometimes the emphasis shifts. Occasionally, the conclusion itself changes.
This instability rarely comes from the model alone. It emerges from the system around it. Slight changes in retrieved documents, different ordering of context, or minor shifts in chunk selection can all cascade into different outputs.
In production, reliability is not something you prompt into existence. It’s something you engineer. Stable retrieval strategies, consistent document versioning, fixed context ordering, and constrained output formats all matter more than clever prompt wording. When those pieces are in place, RAG systems become testable and predictable. Without them, teams are left guessing why yesterday’s answer no longer appears today.
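A small sketch of what "fixed context ordering" can look like in practice. The field names (doc_id, doc_version, chunk_index) and the prompt template are assumptions for illustration, not a prescribed schema:

```python
import hashlib

def stable_order(chunks: list[dict]) -> list[dict]:
    """Order context deterministically: same chunks in, same prompt out.

    Sorting by (doc_id, doc_version, chunk_index) instead of raw similarity
    scores removes the floating-point tie-breaks that quietly reshuffle
    context between otherwise identical queries."""
    return sorted(chunks, key=lambda c: (c["doc_id"], c["doc_version"], c["chunk_index"]))

def build_prompt(question: str, chunks: list[dict]) -> tuple[str, str]:
    """Return the prompt plus a fingerprint so answers can be traced and replayed."""
    context = "\n\n".join(
        f"[{c['doc_id']} v{c['doc_version']} #{c['chunk_index']}]\n{c['text']}"
        for c in stable_order(chunks)
    )
    prompt = (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    fingerprint = hashlib.sha256(prompt.encode()).hexdigest()[:12]
    return prompt, fingerprint
```

With versioned sources and a fingerprinted prompt, "why did yesterday's answer change?" becomes a question you can actually investigate.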
Cost Becomes the Constraint You Can’t Ignore
RAG also changes the economics of AI systems in ways that only show up at scale.
In many deployments, the vast majority of cost doesn’t come from generating answers. It comes from feeding the model context. Large documents, multiple retrieved chunks, and verbose system prompts quickly dominate token usage. At small scale, this is invisible. At production scale, it becomes the main bottleneck.
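A back-of-envelope calculation makes the imbalance concrete. Every number below is illustrative, not a real price list or real traffic:

```python
# Back-of-envelope only: hypothetical prices and volumes.
input_price_per_1k = 0.0025   # $ per 1K prompt tokens (illustrative)
output_price_per_1k = 0.0100  # $ per 1K completion tokens (illustrative)

context_tokens = 6_000   # retrieved chunks + system prompt per query
answer_tokens = 300      # typical generated answer
queries_per_day = 50_000

daily_context_cost = queries_per_day * context_tokens / 1_000 * input_price_per_1k
daily_answer_cost = queries_per_day * answer_tokens / 1_000 * output_price_per_1k

print(f"context: ${daily_context_cost:,.0f}/day, answers: ${daily_answer_cost:,.0f}/day")
# context: $750/day, answers: $150/day -- the prompt, not the answer, drives the bill
```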
Teams that treat cost as an afterthought often discover too late that their “working” RAG system cannot be economically scaled. The teams that succeed design for efficiency from the start. They compress context, preprocess text, cache retrieval results, and route queries to the cheapest model that can do the job well enough.
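Sketched in Python, assuming a two-tier model setup; vector_search and call_llm are stand-ins for whatever retriever and model client a team actually runs, and the routing thresholds are arbitrary:

```python
from functools import lru_cache

# Hypothetical model tiers; substitute whatever models you actually run.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-accurate-model"

def vector_search(query: str, top_k: int = 5) -> list[str]:
    """Stand-in for the real retriever (vector store, BM25, hybrid search)."""
    return [f"(retrieved chunk {i} for: {query})" for i in range(top_k)]

def call_llm(model: str, question: str, context: tuple[str, ...]) -> str:
    """Stand-in for the real model client."""
    return f"[{model}] answer to {question!r} grounded in {len(context)} chunks"

@lru_cache(maxsize=10_000)
def cached_retrieve(normalized_query: str) -> tuple[str, ...]:
    # Repeated or near-identical questions should not hit the retriever twice.
    return tuple(vector_search(normalized_query))

def route_model(question: str, context: tuple[str, ...]) -> str:
    # Cheapest model that can plausibly do the job; escalate only when needed.
    short_question = len(question.split()) < 30
    small_context = sum(len(c.split()) for c in context) < 1_500
    return CHEAP_MODEL if short_question and small_context else STRONG_MODEL

def answer(question: str) -> str:
    normalized = " ".join(question.lower().split())  # normalize so the cache actually hits
    context = cached_retrieve(normalized)
    return call_llm(route_model(question, context), question, context)

print(answer("What is our refund policy for annual plans?"))
```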
The key insight is simple but uncomfortable: a system that is accurate but unaffordable is not production-ready.
RAG Doesn’t Remove Judgment. It Encodes It.
Perhaps the most important lesson from real-world RAG deployments is that retrieval does not replace decision-making.
Models don’t know which document should take precedence when sources conflict. They don’t understand organisational risk tolerance. They don’t know when uncertainty is preferable to confidence. All of that judgment has to be encoded explicitly into the system.
The strongest RAG systems are not the ones that always produce an answer. They are the ones that know when not to. They escalate, defer, or surface uncertainty in ways users can understand. Instead of hiding ambiguity behind fluent language, they make limitations visible and manageable.
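One way to make that judgment explicit is a small resolution policy that sits in front of the model. The tiers, the confidence threshold, and the set of actions below are illustrative choices, not a standard; the point is that they live in code someone can review:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    ANSWER_WITH_CAVEAT = "answer_with_caveat"
    ESCALATE = "escalate_to_human"

@dataclass
class Source:
    doc_id: str
    tier: int             # 1 = current policy, 2 = guidance, 3 = archived material
    effective_date: str   # ISO date; newer wins within a tier
    score: float          # retrieval confidence

def resolve(sources: list[Source], min_confidence: float = 0.7) -> tuple[Action, list[Source]]:
    """Encode the judgment the model does not have: precedence, recency, and when to stop."""
    if not sources:
        return Action.ESCALATE, []                       # no grounding: defer, don't guess
    best_tier = min(s.tier for s in sources)
    authoritative = [s for s in sources if s.tier == best_tier]
    authoritative.sort(key=lambda s: s.effective_date, reverse=True)  # newest first
    if max(s.score for s in authoritative) < min_confidence:
        return Action.ESCALATE, authoritative            # weak evidence: surface uncertainty
    if len({s.effective_date for s in authoritative}) > 1:
        return Action.ANSWER_WITH_CAVEAT, authoritative  # conflicting vintages: say so
    return Action.ANSWER, authoritative
```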
This is where RAG stops being a feature and starts being a system.
From Demos to Durable Systems
It’s easy to build a RAG demo that impresses. It’s much harder to build one that survives real users, real data drift, and real operational constraints.
Production RAG forces teams to confront uncomfortable truths: context can hurt as much as it helps, grounding does not guarantee correctness, and reliability is an architectural outcome, not a prompt-level trick.
The teams that succeed are the ones who accept this early. They stop asking how to make models smarter and start asking how to make systems more disciplined.
That’s where RAG delivers real value – not as a magic fix for hallucinations, but as a carefully engineered bridge between probabilistic models and the messy realities of business systems.