Building RAG Systems That Don't Fall Apart in Production
The gap between the demo and the outage
Every RAG demo looks identical. Embed ten documents, pull the top three chunks, stitch them into a prompt, collect applause. It works. It always works at that scale.
Then you ship it.
Suddenly you have half a million documents, users asking questions in seven languages, tables that got chopped in half by your chunker, and a retriever that keeps pulling up chunks which are topically near the query but semantically wrong in ways that make the model lie with full confidence. The demo that ran on a laptop now hallucinates in production with a straight face.
I've gone over this cliff edge more than once. What follows is the architecture I keep arriving at: the decisions that separate a RAG system that survives real traffic from one that quietly rots.
Chunking is where most systems die first
The default move — slice every document into 512-token windows with 50-token overlap — is fast to write and catastrophic in practice. It breaks sentences mid-thought. It slices tables in half. It treats a heading and its body as strangers. Fixed-size chunking is the vector-search equivalent of saying "we don't need transactions."
Two approaches work.
Semantic chunking uses an embedding model to find the points where the topic actually shifts. You embed sentences, compute the cosine similarity between neighbours, and cut when that similarity drops below a threshold. Variable-length output, but each chunk is about one thing.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[str]:
    """Group consecutive sentences, cutting where neighbour similarity drops."""
    if not sentences:
        return []
    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Cosine similarity between this sentence and the previous one.
        sim = np.dot(embeddings[i - 1], embeddings[i]) / (
            np.linalg.norm(embeddings[i - 1]) * np.linalg.norm(embeddings[i])
        )
        if sim < threshold:
            # Topic shift: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    if current:
        chunks.append(" ".join(current))
    return chunks
```
Hierarchical chunking is the trick that fixes the "chunk is right, context is missing" problem. Index two granularities: paragraph-level for matching, section-level for context. Retrieve at the paragraph level, then expand each hit to its surrounding section before the model sees it. You get precise retrieval and enough surrounding narrative that the model isn't answering from a single orphaned sentence.
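The retrieve-small, expand-large idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Paragraph` record, the `"<section_id>#<n>"` ID scheme, and both helper names are assumptions made for the example, and the paragraph-level search itself is left to whatever retriever you already have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    para_id: str     # hypothetical ID scheme: "<section_id>#<n>"
    section_id: str  # parent section this paragraph belongs to
    text: str


def build_index(sections: dict[str, list[str]]) -> tuple[list[Paragraph], dict[str, str]]:
    """Create paragraph records for matching and keep full section text for expansion."""
    paragraphs: list[Paragraph] = []
    section_text: dict[str, str] = {}
    for section_id, paras in sections.items():
        section_text[section_id] = "\n\n".join(paras)
        for n, text in enumerate(paras):
            paragraphs.append(Paragraph(f"{section_id}#{n}", section_id, text))
    return paragraphs, section_text


def expand_to_sections(hits: list[Paragraph], section_text: dict[str, str]) -> list[str]:
    """Map paragraph hits to their parent sections, deduplicated, in hit order."""
    seen: set[str] = set()
    contexts: list[str] = []
    for hit in hits:
        if hit.section_id not in seen:
            seen.add(hit.section_id)
            contexts.append(section_text[hit.section_id])
    return contexts
```

Deduplicating by section matters in practice: two strong paragraph hits from the same section should cost you one block of context, not two copies of it.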
Pure vector search is not enough
Here's a failure mode I keep hitting. A user types "error code E-4041". The dense retriever, trained to love semantic similarity, confidently returns three beautifully-written overviews of "common error patterns in distributed systems." None of them contain E-4041.
Keyword queries need keyword retrieval. BM25 still wins on literal matches and rare tokens — exactly the queries where dense retrievers get poetic. So run both, and fuse the rankings.
```
Query ─┬─► Dense retriever (cosine over embeddings) ─┐
       │                                             ├─► RRF fusion ─► Top-K ─► LLM
       └─► Sparse retriever (BM25 / Elastic) ────────┘
```
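To make the sparse branch concrete, here is a dependency-free sketch of BM25 ranking. It is illustrative only: in production you would normally lean on a search engine rather than score documents in a Python loop. The `tokenize` and `bm25_rank` names, the hyphen-preserving tokenizer, and the `k1`/`b` defaults are assumptions for this example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase and keep hyphenated tokens such as "e-4041" in one piece.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Return IDs of documents with a positive BM25 score, best first."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))

    scores: dict[str, float] = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

The tokenizer is the part worth arguing about. If it splits `E-4041` into `e` and `4041`, the literal-match advantage you added BM25 for is mostly gone.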
Reciprocal Rank Fusion is the boring, unglamorous, works-everywhere way to merge two ranked lists without needing score calibration:
```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # enumerate() is zero-based, so rank + 1 is the 1-based position.
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```
On realistic corpora I've seen recall@10 jump 10–20% versus dense-only — and the gains come almost entirely from queries dense retrievers were fumbling silently.
Rerank, or pay for it later
After hybrid retrieval you have 20 to 50 candidates. Shovelling all of them into the prompt is wasteful and actively dangerous: the more topically-adjacent-but-wrong chunks you include, the more confidently the model hallucinates around them.
A cross-encoder reranker fixes this. Bi-encoders (what your vector store uses) embed the query and each candidate independently, so document vectors can be computed ahead of time and search scales. Cross-encoders read the query and the candidate together, so they're much more accurate, and much slower, which is why you only run them on the 50 candidates, not the 500,000 documents.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    if not candidates:
        return []
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)
    # Sort on the score only, so ties never fall back to comparing chunk text.
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [c for _, c in ranked[:top_k]]
```
Budget ~100ms. In return, the "confidently wrong" failure mode drops off a cliff. That's a trade you take every time.
The hardest failure: plausible answers with no grounding
Hallucination in a RAG system rarely looks like obvious gibberish. It looks like a well-written paragraph that happens to be unsupported by anything you retrieved. That's the one that ships to production and bites.
Three lines of defence I keep reaching for, in order of cost:
- Citation grounding. Force the model to cite chunk IDs for every claim. Post-process the output and run an entailment classifier over each `(claim, cited_chunk)` pair. Claims that fail entailment get flagged or stripped. Cheap and surprisingly effective.
- Self-consistency sampling. For high-stakes answers, generate 3–5 times at non-zero temperature and compare. If the factual claims diverge across samples, the model is guessing. Route to human review.
- LLM-as-judge faithfulness score. A second model reads the retrieved context plus the generated answer and rates faithfulness on a 1–5 scale. Drop anything below 3. More expensive, best saved for the tail.
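The citation-grounding check from the first bullet might look roughly like this. Everything specific here is an assumption for illustration: the `[chunk_id]` citation format, the naive sentence splitter, and the `entails` callable, which stands in for whatever entailment (NLI) model you plug in.

```python
import re
from typing import Callable

# Assumed citation format: "[chunk_id]", optionally preceded by whitespace.
CITATION = re.compile(r"\s*\[([A-Za-z0-9_-]+)\]")


def check_citations(
    answer: str,
    chunks: dict[str, str],
    entails: Callable[[str, str], bool],  # (premise, hypothesis) -> verdict
) -> list[tuple[str, str]]:
    """Return (claim, reason) for every sentence that fails grounding."""
    failures: list[tuple[str, str]] = []
    # Naive split on sentence-ending punctuation; good enough for a sketch.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = CITATION.findall(sentence)
        claim = CITATION.sub("", sentence).strip()
        if not cited:
            failures.append((claim, "no citation"))
            continue
        unknown = [c for c in cited if c not in chunks]
        if unknown:
            failures.append((claim, f"unknown chunk id: {unknown[0]}"))
            continue
        # A claim passes if at least one cited chunk supports it.
        if not any(entails(chunks[c], claim) for c in cited):
            failures.append((claim, "not entailed by cited chunk"))
    return failures
```

Whether a failed claim is stripped, flagged, or sent to review is a product decision; the function only reports which sentences failed and why.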
None of these make hallucination impossible. They make it detectable, which is the honest goal.
The metrics that actually matter
I have watched teams instrument latency to six decimal places while having no idea whether their retriever is returning the right documents. Don't be that team. Log the stuff that tells you why the system is failing, not just how fast it's failing.
| Metric | How to measure |
|---|---|
| Retrieval recall@K | Labelled QA pairs; check whether the ground-truth chunk lands in top K |
| Answer faithfulness | LLM-as-judge or entailment model over (context, answer) pairs |
| Stage latency (p50/p95/p99) | Instrument retrieval, rerank, and generation separately |
| Chunk utilisation | Did the retrieved chunk actually show up in the answer? |
| Retrieval diversity | Do top-K embeddings span the query's intent, or cluster on one facet? |
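Of these, recall@K is the cheapest to automate once you have labelled pairs. A minimal sketch, assuming one ground-truth chunk per query and dictionaries keyed by a query ID (both are simplifications for the example):

```python
def recall_at_k(
    retrieved: dict[str, list[str]],  # query ID -> ranked chunk IDs
    ground_truth: dict[str, str],     # query ID -> the one correct chunk ID
    k: int = 10,
) -> float:
    """Fraction of labelled queries whose ground-truth chunk lands in the top k."""
    if not ground_truth:
        raise ValueError("need at least one labelled query")
    hits = sum(
        1
        for query_id, gold_chunk in ground_truth.items()
        # A query the retriever returned nothing for counts as a miss.
        if gold_chunk in retrieved.get(query_id, [])[:k]
    )
    return hits / len(ground_truth)
```

Run it separately for dense-only, sparse-only, and fused rankings over the same labelled set, so you can see which retriever is responsible for each miss.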
Log every query, every retrieved chunk, every generated answer. Storage is cheap. The 3 a.m. debugging session where you're trying to reconstruct "what did the retriever see?" is not.
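One way to capture that, along with per-stage latency, is a small trace object that follows a query through the pipeline and serialises to one JSON line. The `QueryTrace` class and its field names are hypothetical, and the retriever and LLM calls are replaced with stand-in values:

```python
import json
import time
from contextlib import contextmanager
from typing import Iterator


class QueryTrace:
    """Collects per-stage timings and payloads for a single query."""

    def __init__(self, query: str) -> None:
        self.record: dict = {"query": query, "latency_ms": {}}

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the timing even if the stage raised.
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.record["latency_ms"][name] = round(elapsed_ms, 2)

    def to_json(self) -> str:
        return json.dumps(self.record, ensure_ascii=False)


trace = QueryTrace("error code E-4041")
with trace.stage("retrieval"):
    trace.record["retrieved_ids"] = ["runbook", "overview"]  # stand-in for a retriever call
with trace.stage("generation"):
    trace.record["answer"] = "..."  # stand-in for an LLM call
print(trace.to_json())
```

With one record per query, the p50/p95/p99 figures per stage can be computed offline from the logs rather than inside the request path.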
The whole stack, end to end
```
User query
    │
    ▼
Query rewriting (optional: HyDE, step-back prompting)
    │
    ├─► Dense retrieval (vector DB) ────┐
    │                                   ├─► RRF fusion ─► Cross-encoder rerank ─► Top-5
    └─► Sparse retrieval (BM25/Elastic)─┘
    │
    ▼
Hierarchical context assembly
    │
    ▼
LLM generation
    │
    ▼
Faithfulness / citation check
    │
    ▼
Answer + citations
```
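Wired together, the stages in the diagram amount to a short orchestration function. The sketch below injects every stage as a callable so each can be swapped or tested on its own. The `RagPipeline` class, its field names, and the choice to pass chunk IDs between stages are assumptions for illustration, not a prescribed interface.

```python
from dataclasses import dataclass
from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked chunk IDs


@dataclass
class RagPipeline:
    dense: Retriever
    sparse: Retriever
    fuse: Callable[[list[list[str]]], list[str]]  # e.g. reciprocal rank fusion
    rerank: Callable[[str, list[str]], list[str]]  # (query, IDs) -> reordered IDs
    assemble: Callable[[list[str]], str]           # chunk IDs -> expanded context
    generate: Callable[[str, str], str]            # (query, context) -> answer
    is_faithful: Callable[[str, str], bool]        # (context, answer) -> verdict
    rewrite: Callable[[str], str] = lambda q: q    # optional query rewriting

    def answer(self, query: str, top_k: int = 5) -> dict:
        q = self.rewrite(query)
        candidates = self.fuse([self.dense(q), self.sparse(q)])
        top = self.rerank(q, candidates)[:top_k]
        context = self.assemble(top)
        text = self.generate(q, context)
        return {
            "answer": text,
            "citations": top,
            "faithful": self.is_faithful(context, text),
        }
```

Keeping the stages injectable is what lets you measure each one in isolation, for example by replacing `rerank` with a pass-through to see how much it really contributes.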
Demo-grade RAG is an afternoon. Production-grade RAG is a year of sharp corners. The gap is invisible right up until the moment a real user asks a real question about a real document — and then it's the only thing that matters.