RAG in Production: What Nobody Tells You Until You’re Debugging It

Every RAG tutorial follows the same arc. You load some documents, chunk them, embed them, drop them into a vector store, wire up an LLM, and ask a question. It works. The answer is accurate, the latency is acceptable, and you ship it.

Three months later, you’re in an incident call at 11pm because your system confidently told a customer the wrong account balance. Or hallucinated a contract clause that doesn’t exist. Or started returning answers from documents that were updated six weeks ago.

The tutorial didn’t cover that part.

This post is about the gap between a working RAG demo and a RAG system you can actually trust in production — the failure modes that don’t appear in notebooks, the decisions that are hard to reverse, and the instrumentation you should have built on day one.

 

The 90/10 Rule Nobody Mentions

Here’s the most important reframe for debugging RAG systems: 90% of production failures are retrieval problems, not LLM problems.

When something goes wrong, engineers instinctively reach for the prompt. They rewrite instructions, add constraints, tweak temperature. The LLM is the visible part of the system — it’s where the output comes from, so it must be where the failure is.

It usually isn’t.

If the retrieved context is wrong, incomplete, or irrelevant, no amount of prompt engineering rescues the answer. The LLM is doing exactly what it should do: generating a coherent response from bad input. “Garbage in, garbage out” is older than machine learning, but it applies with particular force to RAG.

A 2024 IEEE/ACM conference paper — [“Seven Failure Points When Engineering a RAG System”](https://arxiv.org/abs/2401.05856) — studied three real production deployments across research, education, and biomedical domains. Their conclusion is worth quoting directly: validation of a RAG system is only feasible during operation, and robustness evolves rather than being designed in at the start. Every failure point they documented traces back to retrieval quality, not generation.

Before you touch the prompt, check what context is actually reaching the LLM. Everything that follows is about making sure that context is worth trusting.

 

Failure #1: Chunking Is a Permanent Decision You Make in Hour One

The chunking strategy you choose when you first set up your pipeline shapes system quality for months. Most teams default to fixed-size chunking — split every 512 or 1024 tokens, add some overlap, move on. It’s fast to implement, easy to reason about, and it works well enough in demos.

In production, it quietly degrades your retrieval quality in ways that are hard to diagnose.

Fixed-size chunking doesn’t respect document structure. A chunk boundary might land in the middle of a table, split a numbered clause across two chunks, or separate a question from its answer in an FAQ. When those chunks get retrieved, the LLM receives partial context — enough to generate a confident-sounding answer, not enough to generate a correct one.

The NVIDIA Developer Blog published benchmarks comparing chunking strategies across real document types. Late chunking — which delays the split decision until after contextual embeddings are computed — showed a 10–12% improvement specifically on anaphoric references, the kind where “it” or “this regulation” refers to something defined earlier in the document. For Fintech and MedTech documents full of cross-references and defined terms, that’s not a marginal gain.

The practical hierarchy:

  • Fixed-size (e.g., 500–800 tokens with 20% overlap): Fast to implement. Acceptable for homogeneous, well-structured text. Avoid for PDFs, scanned documents, or anything with tables.
  • Semantic/paragraph-based: Split on natural boundaries — paragraphs, sections, sentences. Better context coherence. Slightly more complex to implement.
  • Hierarchical: Store chunks at multiple granularities (document → section → paragraph). Retrieve at the granularity that matches query scope. Most complex, highest quality for heterogeneous document collections.
  • Propositional: Decompose documents into atomic factual claims before chunking. Highest retrieval precision, significant preprocessing cost.
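To make the semantic/paragraph-based option concrete, here is a minimal sketch that splits on blank lines and packs paragraphs into chunks up to a token budget. It uses whitespace word count as a stand-in for a real tokenizer, which is an assumption for brevity:

```python
# Minimal paragraph-based chunker: split on blank lines, then pack
# whole paragraphs into chunks up to a token budget. Word count is a
# stand-in for a real tokenizer (an assumption for brevity).

def chunk_by_paragraph(text: str, max_tokens: int = 600) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        para_len = len(para.split())
        # Start a new chunk if adding this paragraph would exceed the budget
        if current and current_len + para_len > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because boundaries fall only between paragraphs, a clause or an FAQ answer is never split mid-thought, at the cost of slightly uneven chunk sizes.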

The decision matters because re-chunking means re-embedding, which means re-indexing your entire corpus. At scale, that’s an expensive operation. Choose deliberately on day one rather than optimizing under pressure six months later.

One tool worth knowing: [infiniflow/ragflow](https://github.com/infiniflow/ragflow) (42k+ GitHub stars) handles deep document parsing — PDFs, Word files, scanned images — with visual chunking inspection. If your corpus includes anything other than clean markdown or plain text, the naive chunking path will hurt you.

 

Failure #2: Pure Vector Search Breaks on Real Data

Vector search is remarkable at semantic similarity. Ask “what is our refund policy” and it will surface the relevant section even if the document uses “returns” instead of “refunds.” That semantic flexibility is exactly why RAG works.

It is also why pure vector search fails quietly on the queries that matter most in regulated industries.

Ask “show me transactions for account ID FIN-2847-X” and vector search retrieves documents that are semantically *similar* to account queries — not the document that contains that specific account ID. Ask “what does clause 14.3(b) say” and you get contextually adjacent content, not the exact clause.

Product codes, contract clause numbers, regulatory references, patient IDs, policy numbers — these are exact-match queries. Fintech and MedTech systems are full of them. Pure vector search handles them poorly because exact strings don’t necessarily cluster in embedding space the way semantics do.

Hybrid retrieval — combining dense vector search with sparse BM25 — is not an optional enhancement. It is the production baseline.

BM25 is classical keyword matching. It’s fast, it’s precise on exact terms, and it complements dense retrieval where vector search is weakest. Combined with an appropriate weighting between the two (the alpha parameter), it handles both semantic queries and exact-match queries in a single pipeline.
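A minimal sketch of that alpha weighting, assuming you already have per-document score dictionaries from your dense retriever and from BM25 (both hypothetical inputs here):

```python
def hybrid_scores(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.7) -> dict[str, float]:
    """Combine min-max-normalized dense and BM25 scores per document ID.
    alpha weights the dense (semantic) side; 1 - alpha weights BM25."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid divide-by-zero on uniform scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    return {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
            for doc in set(d) | set(s)}
```

The normalization step matters: raw BM25 scores and cosine similarities live on different scales, so combining them without normalizing lets one retriever silently dominate regardless of alpha.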

The challenge is that a static alpha weight doesn’t work across all query types. A 2025 paper — [DAT: Dynamic Alpha Tuning](https://arxiv.org/html/2503.23013v1) — demonstrates that dynamically balancing dense and sparse retrieval per query, using LLM-based effectiveness score normalization, consistently outperforms static hybrid retrieval on heterogeneous query distributions. In practice, most teams don’t implement DAT immediately, but knowing that static weights will degrade on diverse query types is useful when you’re diagnosing retrieval failures.

For teams on PostgreSQL, pgvector supports both vector similarity search and standard text search. You can implement hybrid retrieval without adding a dedicated vector database — more on that below.

One counterintuitive production finding from a [March 2026 industry deployment study](https://arxiv.org/abs/2603.02153): RAG fusion (multi-query generation + reciprocal rank fusion) increases raw recall, but in real deployments with fixed retrieval depth and reranking budgets, the gains largely vanish after reranking and truncation. Fusion looked good in benchmarks. It didn’t beat single-query baselines in production. Test before you commit to the complexity.

 

Failure #3: RAG Adds 41% Latency and Nobody Budgets for It

This is the number that surprises most engineering teams: retrieval accounts for 41% of end-to-end RAG latency and 45–47% of time-to-first-token (TTFT). This comes from a systematic latency study published in late 2024 ([arXiv:2412.11854](https://arxiv.org/html/2412.11854v1)) analyzing production-scale RAG inference.

The LLM is not the bottleneck. The retrieval stack is.

For real-time applications — customer-facing chatbots, live document search, support tools with SLA requirements — this overhead needs to be budgeted and engineered around from the start.

Strategies that actually help:

  • Semantic caching. Cache embeddings and retrieved results for frequent queries. A customer support RAG system will see the same questions repeatedly — caching retrieval results for high-frequency queries eliminates the latency for 30–40% of requests in most deployments.
  • Pre-computation. For predictable query patterns (e.g., “summarize document X”), pre-compute and cache results rather than computing at request time.
  • Asynchronous retrieval. When multiple retrieval calls are needed (e.g., searching across multiple knowledge bases), run them in parallel rather than sequentially. This is obvious in retrospect but often implemented serially in first-pass RAG pipelines.
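The parallel-retrieval point can be sketched with `asyncio.gather`; the two retriever coroutines below are hypothetical stand-ins for real knowledge-base calls:

```python
import asyncio

async def search_policies(query: str) -> list[str]:
    # Stand-in for a real knowledge-base call (e.g. an HTTP request)
    await asyncio.sleep(0.05)
    return [f"policy hit for: {query}"]

async def search_contracts(query: str) -> list[str]:
    await asyncio.sleep(0.05)
    return [f"contract hit for: {query}"]

async def retrieve_all(query: str) -> list[str]:
    # Fan out to both knowledge bases concurrently instead of serially,
    # so total latency is the slowest call, not the sum of all calls
    results = await asyncio.gather(search_policies(query),
                                   search_contracts(query))
    return [chunk for hits in results for chunk in hits]

chunks = asyncio.run(retrieve_all("refund policy"))
```

With N knowledge bases at roughly equal latency, the serial version costs N times the single-call latency; the gathered version costs one.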

Reranking trade-off. Adding a cross-encoder reranker after initial retrieval typically improves answer quality by 15–25%. It also adds 50–300ms of latency. Whether that trade-off is worth it depends on your use case — a customer-facing chat interface has a different latency budget than an internal document search tool. Make this decision explicitly, not accidentally.

The enterprise-grade Higress-RAG framework ([arXiv:2602.23374](https://arxiv.org/html/2602.23374)) addresses this directly with semantic caching and adaptive routing, demonstrating that architecture-level decisions have more impact on latency than individual component tuning.
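Independent of any particular framework, the semantic-caching idea reduces to an embedding-similarity lookup. A minimal sketch, assuming query embeddings are computed elsewhere in your pipeline:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / ((na * nb) or 1.0)

class SemanticCache:
    """Return a cached retrieval result when a new query's embedding is
    close enough to a previously seen query's embedding."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], object]] = []

    def get(self, query_emb: list[float]):
        for emb, result in self.entries:
            if cosine(query_emb, emb) >= self.threshold:
                return result
        return None  # cache miss: run retrieval, then call put()

    def put(self, query_emb: list[float], result) -> None:
        self.entries.append((query_emb, result))
```

A linear scan is fine for a few thousand cached queries; beyond that you would index cached embeddings the same way you index your corpus. The threshold is the knob to watch: too low and paraphrases with different intents collide, too high and the cache never hits.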

 

Failure #4: You’re Flying Blind Without Evaluation Infrastructure

Most teams discover RAG hallucinations through user complaints. A customer flags a wrong answer, someone screenshots a fabricated reference, a support ticket comes in with “your system told me X and X is wrong.” The failure was happening in production for weeks before anyone knew.

This is an instrumentation problem, not a model problem.

Setting up evaluation infrastructure should happen on day one, not after the first incident. The question isn’t whether your RAG system will hallucinate — it will. The question is whether you find out from your metrics or from your customers.

RAGAS ([explodinggradients/ragas](https://github.com/explodinggradients/ragas), ~9k stars) is the de facto standard for production RAG evaluation in Python. It provides reference-free metrics — you don’t need ground-truth annotations to use it — covering:

  • Faithfulness: Does the answer only contain information from the retrieved context?
  • Answer relevance: Is the answer relevant to the question asked?
  • Context precision: Is the retrieved context actually useful for the question?
  • Context recall: Did retrieval surface the relevant information?

 

A minimal RAGAS setup:

 

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Log these from your RAG pipeline
data = {
    "question": [...],
    "answer": [...],
    "contexts": [...],     # list of retrieved chunks per question
    "ground_truth": [...]  # optional, needed for context_recall
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```

 

The enterprise target, according to deepset (creators of Haystack), is faithfulness above 90% for production systems in regulated industries. Below that threshold, the system is producing enough ungrounded content to constitute a reliability risk.

For debugging when something goes wrong, RagChecker ([arXiv:2408.08067](https://arxiv.org/html/2408.08067v1)) is more surgical. It separates retriever failures (claim recall, context precision) from generator failures (context utilization, noise sensitivity, hallucination rate). When you’re diagnosing a quality regression, knowing which component failed saves hours of guesswork.

A 2025 interview study with 13 industry RAG practitioners ([arXiv:2508.14066](https://arxiv.org/abs/2508.14066)) found that evaluation at most companies is still predominantly manual. Teams review samples by hand. This does not scale, and it introduces sampling bias — engineers tend to check the cases they expect to work, not the edge cases that don’t.

 

Failure #5: Your Index Goes Stale and Nobody Notices

RAG systems degrade silently. This is their most insidious production characteristic.

Source documents change. Policies get updated, products get discontinued, regulations get amended. The LLM’s parametric knowledge also ages. But your embeddings don’t automatically update when any of this happens. The retrieval index reflects the state of your knowledge base at the moment of last indexing — and in most systems, there’s no automatic alert when that snapshot falls out of sync with reality.

A 2025 paper introducing RAGOps ([arXiv:2506.03401](https://arxiv.org/abs/2506.03401)) frames this as an operational discipline analogous to MLOps. RAGOps covers the data lifecycle of a production RAG system: monitoring for data drift, tracking query and response patterns over time, collecting user feedback signals, and managing the pipeline for continuously evolving source data.

The minimum viable RAGOps setup for a production system:

  • Change detection: Monitor source document stores for updates. Trigger re-embedding and re-indexing on document changes rather than on a weekly cron schedule.
  • Query logging: Log every query, retrieved chunks, and generated answer. This is your audit trail and your debugging surface.
  • Retrieval quality monitoring: Track retrieval metrics (context precision, context recall) over time. A sudden drop often indicates index staleness or a distribution shift in incoming queries.
  • User feedback signals: A thumbs up/down on responses is cheap to implement and expensive to ignore. Negative feedback clusters reveal retrieval failures faster than any automated metric.
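The change-detection bullet can start as simple content hashing. A minimal sketch, assuming you can enumerate current documents and keep the hash recorded at last index time:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current: dict[str, str],
                    indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or changed since last indexing.
    `current` maps doc ID -> latest content; `indexed_hashes` maps
    doc ID -> content hash recorded when that doc was last embedded."""
    stale = []
    for doc_id, text in current.items():
        if indexed_hashes.get(doc_id) != content_hash(text):
            stale.append(doc_id)  # new doc or content changed
    return stale
```

Run it on a schedule or on storage-change events, and re-embed only the returned IDs rather than the whole corpus; deleted documents need the inverse check (IDs in `indexed_hashes` missing from `current`).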

This is not glamorous infrastructure. It also prevents the scenario where your RAG system confidently returns information from a document that was superseded six months ago — a scenario that in Fintech or MedTech contexts has compliance implications, not just quality ones.

 

The Stack Decision: pgvector vs. a Dedicated Vector Database

If you’re already running on AWS RDS, Amazon Aurora, or Azure Database for PostgreSQL, pgvector with HNSW indexing is the pragmatic production choice for most teams.

The alternative — adding a dedicated vector database (Pinecone, Weaviate, Qdrant, Chroma) — introduces an additional managed service, a new cost center, another failure domain, and additional operational complexity. For teams that already have PostgreSQL expertise and infrastructure, this overhead is rarely justified until retrieval volume crosses thresholds that most production systems don’t reach.

AWS provides an official reference implementation: [rag-with-amazon-bedrock-and-pgvector](https://github.com/aws-samples/rag-with-amazon-bedrock-and-pgvector). It demonstrates HNSW index configuration, embedding storage, and hybrid retrieval setup on a stack that most Fintech and MedTech engineering teams are already running.

When does it make sense to move to a dedicated vector database?

– When you need filtering by metadata at query time across very large corpora (hundreds of millions of vectors) and PostgreSQL query planning becomes a bottleneck

– When you need multi-tenancy with strict data isolation that’s more efficiently handled at the vector store level than through PostgreSQL row-level security

– When your retrieval SLA requirements push below 10ms at p99 and pgvector tuning alone can’t get you there

For most teams reading this, those thresholds are in the future. Start with what you have.

 

One More Thing: The Security Surface You Haven’t Mapped

Adding a RAG layer to your application opens attack vectors that most security teams haven’t encountered before and that standard security reviews don’t cover.

Prompt injection via retrieved documents. If an attacker can influence the content of your knowledge base — through a public-facing upload feature, a web-scraped data source, or a compromised document — they can embed instructions in documents that get retrieved and included in LLM context. The LLM may follow those instructions rather than your system prompt. This is not theoretical: it’s been demonstrated in production settings.

Data poisoning. Subtly incorrect information inserted into the knowledge base propagates into RAG answers without the obvious tells of direct model manipulation. In Fintech, this means wrong rates, wrong terms, wrong calculations delivered with high confidence.

Membership inference. In multi-tenant RAG systems, it’s sometimes possible to infer whether a specific document is in the index based on the content of answers. For systems handling sensitive client data, this is a compliance concern.

A 2025 paper on Privacy-Aware RAG ([arXiv:2503.15548](https://arxiv.org/abs/2503.15548)) proposes encrypting both document content and embeddings prior to storage, which addresses data leakage at rest. For teams operating under GDPR, HIPAA, or SOC 2 requirements, this area needs explicit security review — it will not appear in a standard application security assessment.

 

Where to Start

If you’re building RAG for the first time, or diagnosing a system that’s underperforming, here is the prioritized sequence:

  1. Set up evaluation before features. RAGAS takes an afternoon to integrate. Do it before you go to production, not after.
  2. Audit your chunking strategy. If you’re using fixed-size chunking on heterogeneous documents, test semantic chunking on a representative subset and measure retrieval quality before and after.
  3. Add BM25 to your retrieval. If you’re doing pure vector search, add keyword search as a parallel retrieval path and combine results. The quality improvement on exact-match queries will be immediate.
  4. Log everything. Queries, retrieved chunks, answers, latency per stage. You cannot debug what you cannot observe.
  5. Map your document update cycle. How often do source documents change? Build a re-indexing trigger proportional to that frequency.
  6. Security review. Brief your security team on prompt injection via retrieved context. It’s novel enough that it won’t appear on standard checklists.
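Step 4 can start as small as a per-stage timer that emits one JSON line per request; the field names below are illustrative, not a standard schema:

```python
import json
import time
from contextlib import contextmanager

@contextmanager
def timed(record: dict, stage: str):
    # Record wall-clock milliseconds for one pipeline stage
    start = time.perf_counter()
    try:
        yield
    finally:
        record[f"{stage}_ms"] = round((time.perf_counter() - start) * 1000, 1)

record = {"query": "what is our refund policy"}
with timed(record, "retrieval"):
    chunks = ["chunk-1", "chunk-2"]       # stand-in for the retrieval call
with timed(record, "generation"):
    answer = "Refunds are processed..."   # stand-in for the LLM call
record.update(chunks=chunks, answer=answer)
print(json.dumps(record))                 # one log line per request
```

Those per-stage timings are what let you confirm (or refute) the retrieval-dominated latency split from Failure #3 on your own traffic.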

RAG is not a technology that works out of the box and stays working. It is a system that needs to be operated. The teams that treat it that way — with monitoring, evaluation pipelines, and deliberate architecture decisions — are the ones whose systems are still accurate six months after launch.

Zartis helps engineering teams design and ship production AI systems. If you’re working through any of the challenges in this post, [get in touch](#).

 

References

– Barnett et al. (2024). [Seven Failure Points When Engineering a RAG System](https://arxiv.org/abs/2401.05856). IEEE/ACM CAIN 2024.

– [RAGOps: Operating and Managing RAG Pipelines](https://arxiv.org/abs/2506.03401). arXiv:2506.03401, 2025.

– [Scaling RAG Fusion: Lessons from an Industry Deployment](https://arxiv.org/abs/2603.02153). arXiv:2603.02153, 2026.

– [RAG in Industry: Interview Study](https://arxiv.org/abs/2508.14066). arXiv:2508.14066, 2025.

– Es et al. (2023). [RAGAS: Automated Evaluation of RAG](https://arxiv.org/abs/2309.15217). arXiv:2309.15217.

– [RagChecker: Fine-grained Diagnostics for RAG](https://arxiv.org/html/2408.08067v1). arXiv:2408.08067, 2024.

– [RAG Inference Latency Trade-offs](https://arxiv.org/html/2412.11854v1). arXiv:2412.11854, 2024.

– [DAT: Dynamic Alpha Tuning for Hybrid Retrieval](https://arxiv.org/html/2503.23013v1). arXiv:2503.23013, 2025.

– [Privacy-Aware RAG](https://arxiv.org/abs/2503.15548). arXiv:2503.15548, 2025.

– [Higress-RAG Enterprise Framework](https://arxiv.org/html/2602.23374). arXiv:2602.23374, 2026.

– [RAG Best Practices: 100+ Technical Teams](https://www.kapa.ai/blog/rag-best-practices). kapa.ai, 2024.

– [RAG Failure Modes and How to Fix Them](https://snorkel.ai/blog/retrieval-augmented-generation-rag-failure-modes-and-how-to-fix-them/). Snorkel AI, 2024.

– [Finding the Best Chunking Strategy](https://developer.nvidia.com/blog/finding-the-best-chunking-strategy-for-accurate-ai-responses/). NVIDIA Developer Blog, 2024.

– [Measuring LLM Groundedness in RAG](https://www.deepset.ai/blog/rag-llm-evaluation-groundedness). deepset, 2024.

 
