A RAG demo on a curated PDF set proves you can build a retrieval pipeline. A RAG system in production proves you can keep one running while the corpus, the users, and the threat model change underneath you. The two skills are unrelated.
Most retrieval-augmented systems we have inherited follow the same arc. Quarter one: someone clones a notebook from a vendor blog, loads a hundred documents, and the demo wins the room. Quarter two: the corpus grows to ten thousand files and the bug reports start. Quarter three: legal asks why a deleted contract still answers questions in Teams. Quarter four: the project is quietly shelved and the budget shifts to "AI strategy".
The pattern is not a failure of the technology. It is a failure to treat RAG as the production system it is — with versioning, an evaluation harness, an operating budget, and a clear ownership boundary between the model layer and the retrieval layer. This article is the playbook we use when we deploy DECKLOG Implementation for a client. None of it is theoretical. Every section comes from a system that took a real outage to teach us.
The demo-to-production gap
The gap between a working demo and a working deployment has six concrete failure modes. We score them on every new engagement, then build the system to fail safely in each.
- Permissions leakage. The vector store indexes a document, then the source-system ACL changes, then the agent answers a question using content the user is no longer entitled to see. Without ACL-aware retrieval, the agent ships a compliance incident every time HR moves a folder.
- Stale ground truth. A policy is updated in SharePoint. The vector store has the previous version. The agent confidently answers with the deprecated text. The user trusts the agent over their own search instinct. The error compounds.
- Hallucinated citations. The model fabricates a citation that looks like a real document ID. The user clicks. There is nothing there. Trust drops permanently after two of these.
- Cost runaway. Vector-store re-indexing fires on every save, embedding calls run at 10x the forecast, the bill arrives, and finance shuts down the API key.
- Latency cliff. Median response is 1.4 seconds at launch. Six months later the corpus is 20x larger, the cross-encoder reranker is unchanged, and median is 8 seconds. Users stop using the tool.
- Eval blindness. Quality has regressed for weeks. Nobody noticed because there is no golden set and no automated check. The discovery happens when an executive asks a question whose answer they already know.
Chunking is the most under-engineered decision
Every team optimises the embedding model. Almost no team optimises the chunking strategy. This is backwards. Retrieval quality is dominated by what you ask the embedding model to encode — and chunking is what defines that.
Fixed-window chunking
The vendor-blog default. Split documents every 800 tokens with 200 tokens of overlap. Works for technical documentation where sections are uniformly sized. Fails badly for policy documents where a single clause can be 50 tokens or 3,000 tokens. The 50-token clause gets indexed alongside enough surrounding noise to dilute the signal; the 3,000-token clause gets cut mid-sentence.
If the corpus is uniformly structured (engineering docs, API references, SOPs with consistent section sizes), fixed-window is fine. Anywhere else, it is the slow-burning cause of mediocre retrieval.
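A minimal sketch of the fixed-window strategy, assuming the document has already been tokenised into a list. The function name `fixed_window_chunks` and its parameters are illustrative, not taken from any particular library; the defaults mirror the 800/200 figures above.

```python
from typing import List


def fixed_window_chunks(tokens: List[str], size: int = 800, overlap: int = 200) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window start advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Note that the window boundaries depend only on token counts. Nothing in the loop knows where a clause or sentence ends, which is exactly why a long clause gets cut mid-sentence.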
Semantic chunking
Use a small embedding model to score sentence-to-sentence similarity, then break at local minima. The chunk boundaries align with topic shifts. We run this with BAAI/bge-small-en-v1.5 as the semantic scorer because it is cheap enough to run on every chunk operation, and an additional reranker pass at retrieval time covers any boundary error.
Cost: roughly 2-3x ingestion compute vs fixed-window. Quality lift: measurable on every corpus we have tried, between 8% and 18% on Recall@5 on our golden sets.
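The boundary logic can be sketched as follows. This is an illustration under stated assumptions, not the production implementation: `toy_embed` is a bag-of-words stand-in for a real sentence-embedding model such as bge-small, and the `threshold` parameter is an added guard (not described above) so that shallow dips between closely related sentences do not trigger a break.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def toy_embed(sentence: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Counter] = toy_embed,
    threshold: float = 0.5,  # hypothetical cut-off: only break at dips below this similarity
) -> List[List[str]]:
    """Group sentences into chunks, breaking where adjacent-sentence similarity hits a local minimum."""
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]
    # sims[i] is the similarity between sentence i and sentence i + 1
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    chunks = [[sentences[0]]]
    for i, sim in enumerate(sims):
        left = sims[i - 1] if i > 0 else float("inf")
        right = sims[i + 1] if i + 1 < len(sims) else float("inf")
        if sim <= left and sim <= right and sim < threshold:
            chunks.append([])  # topic shift: start a new chunk
        chunks[-1].append(sentences[i + 1])
    return chunks
```

Swapping `toy_embed` for a real model only requires replacing `embed` and `cosine` with their dense-vector equivalents; the local-minimum scan stays the same.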
Agentic / structure-aware chunking
The newest pattern and the one we now default to for high-value corpora. Use an LLM (Haiku-tier or smaller) to look at each section and decide where to break. Pass the document structure (h1, h2, table of contents) to the model as context. The model produces structured chunks that respect document semantics — a contract clause stays together, a code block is not split mid-function, a definition stays attached to its example.
Cost: 5-10x ingestion vs fixed-window. Quality lift: 15-30% on Recall@10 on complex corpora (contracts, regulatory filings, internal policy). For a 10k-document corpus indexed once a week, the additional cost is tens of euros per refresh. For a high-stakes use case it is the easiest ROI move in the entire pipeline.
Embedding model selection is a one-time bet you should make consciously
The default is text-embedding-3-large from OpenAI. It is a defensible default. It is not the only answer.
| Model | Dimensions | EU-resident | Strengths | Notes |
|---|---|---|---|---|
| text-embedding-3-large | 3072 (truncatable) | Azure OpenAI EU | Strong general performance, fast | Matryoshka: truncate to 1024 or 1536 for cost |
| BAAI/bge-m3 | 1024 | Self-host anywhere | Multilingual, sparse + dense + multi-vector | Best EU-multilingual choice we have tested |
| Cohere embed-multilingual-v3 | 1024 | AWS Bedrock EU | Strong multilingual, integrated with Rerank | Single-vendor stack appeal |
| nomic-embed-text-v1.5 | 768 (truncatable) | Self-host anywhere | Open weights, Matryoshka-trained | Punches above its size for cost-constrained jobs |
| voyage-3 | 1024 | US-only as of writing | Top of MTEB on most benchmarks | Disqualified for EU-sovereign work for now |
The trap is benchmark chasing. MTEB scores measure performance on standard tasks, not your corpus. We have lost weeks chasing a 2-point MTEB improvement that produced no measurable benefit on a client's internal contracts corpus. Build the eval set first. Then benchmark on your data, not on the public scoreboard.
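The Matryoshka truncation noted in the table amounts to keeping the leading dimensions of a vector and re-normalising it so cosine and dot-product scores stay comparable. A minimal sketch, assuming the model was trained for truncation (cutting an ordinary embedding this way generally degrades it); `truncate_embedding` is a hypothetical helper name.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```

Whatever dimension you settle on should be validated on your own eval set, for the same reason as the model choice itself.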
Hybrid retrieval beats dense-only retrieval. Every time.
Dense vectors (embeddings) excel at semantic similarity: finding documents that mean the same thing in different words. They are surprisingly bad at exact-match lookups. A user asks "what is the rate for SKU-42-XF?" and dense retrieval returns documents about SKU-37-XB, because two near-identical identifiers land close together in embedding space.
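BM25 (the venerable sparse-retrieval algorithm behind Lucene, Elasticsearch, and OpenSearch) is the opposite. It finds exact-token matches and weights by inverse document frequency. It is terrible at semantic recall and excellent at "this exact identifier appears in this exact document". PostgreSQL's built-in full-text search (tsvector with ts_rank or ts_rank_cd) fills the same exact-match role, but it is not BM25: its ranking does not use inverse document frequency, so true BM25 in Postgres needs an extension.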
Hybrid retrieval is the simple combination: run both, take the top-N from each, deduplicate, then rerank. The cost is one extra round-trip. The recall improvement on heterogeneous corpora (mix of policy text and identifier-heavy reference material) is consistent and large.
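Independent of the store, the "take the top-N from each, deduplicate" step can be sketched as below. This assumes each retriever returns a ranked list of chunk IDs; `merge_candidates` is an illustrative name, and interleaving by rank is one reasonable ordering choice among several, since the reranker decides the final order anyway.

```python
from itertools import zip_longest
from typing import List


def merge_candidates(dense_ids: List[str], sparse_ids: List[str], top_n: int = 20) -> List[str]:
    """Interleave the top-N of both ranked lists, keeping the first occurrence of each ID."""
    seen = set()
    merged: List[str] = []
    for pair in zip_longest(dense_ids[:top_n], sparse_ids[:top_n]):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged
```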
Implementation patterns we use
For a PostgreSQL-backed stack (Directus, Supabase, plain Postgres), the recipe is:
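```sql
-- Single query that returns both full-text and vector hits
WITH dense AS (
    -- <=> is pgvector's cosine distance; 1 - distance gives cosine similarity
    SELECT id, 1 - (embedding <=> $1::vector) AS score
    FROM chunks
    WHERE tenant_id = $2
    ORDER BY embedding <=> $1::vector
    LIMIT 20
),
sparse AS (
    -- Normalization flag 32 maps the rank to rank / (rank + 1), i.e. into [0, 1)
    SELECT id, ts_rank_cd(search_vector, query, 32) AS score
    FROM chunks, websearch_to_tsquery('english', $3) query
    WHERE tenant_id = $2 AND search_vector @@ query
    ORDER BY score DESC  -- without this, LIMIT keeps an arbitrary 20 matches
    LIMIT 20
)
-- A chunk found by both strategies keeps the higher of its two weighted scores
SELECT id, MAX(score) AS combined_score
FROM (SELECT id, score * 0.6 AS score FROM dense
      UNION ALL
      SELECT id, score * 0.4 AS score FROM sparse) u
GROUP BY id
ORDER BY combined_score DESC
LIMIT 30;
```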
That single SQL query gives you up to 30 candidates from two retrieval strategies in a single round-trip to a single store. You then pass those candidates to a reranker before showing the top 5 to the LLM.
Qdrant and Weaviate both support hybrid search natively. In Weaviate, set the alpha parameter to weight vector scores against keyword scores; in Qdrant, combine dense and sparse prefetches with a fusion step in the Query API. The mechanics differ; the principle is identical.
Reranking is the cheapest quality lift in the stack
The retrieval layer's job is to produce a recall-heavy candidate list. The reranker's job is to produce precision. Skipping the reranker is the single most common reason a RAG pipeline ships with frustrating retrieval quality.
Three reranker tiers we use:
- Cohere Rerank 3 (managed API). Drop-in REST endpoint. ~30ms latency for 30 candidates. €€ per million calls. Best-in-class quality on most corpora. The default we recommend when sovereignty allows.
- Open cross-encoder (BAAI/bge-reranker-v2-m3). Self-hosted, ~50ms latency on a single A10 GPU, no per-call cost. Quality is 90-95% of Cohere on our internal benchmarks. The default when sovereignty requires on-prem inference.
- LLM-as-reranker (Haiku-tier). For the highest-stakes queries where you can afford the cost, prompting an LLM to score each candidate yields the best results. We use this only on a small subset of queries flagged as high-value (legal interpretation, executive analysis).
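Whichever tier you pick, the rerank step has the same shape: score every (query, candidate) pair, sort, keep the top few. A minimal sketch with a pluggable scorer; `toy_overlap_score` is a deliberately crude stand-in for a real cross-encoder or API call, and both function names are illustrative.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],   # (chunk_id, chunk_text) pairs from retrieval
    score: Callable[[str, str], float],  # any scorer: API client, cross-encoder, LLM judge
    keep: int = 5,
) -> List[str]:
    """Return the IDs of the `keep` highest-scoring candidates, best first."""
    scored = [(score(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep retrieval order
    return [chunk_id for _, chunk_id in scored[:keep]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0
```

Keeping the scorer behind a plain callable makes it cheap to swap tiers per query, for example routing only flagged high-value queries to the expensive scorer.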
The evaluation harness is the most under-built piece of every system
If you take one thing from this article, take this: the eval harness is not optional, it is the product. Without it, every retrieval change is a gamble and every model upgrade is a coin-flip. With it, every change is measurable and every regression catches itself.
What a working harness looks like
A production-grade RAG eval harness has four layers:
- Golden questions (100-500). Real questions from real users. Annotated with the documents that should appear in the top-N and the answer that should come out. Built collaboratively with subject-matter experts over the first 4 weeks of the engagement.
- Retrieval evals (automated). For each golden question, measure Recall@K and nDCG@K against the annotated set. Runs on every commit to the retrieval code.
- Answer evals (automated, LLM-judged). For each golden question, run the full pipeline end-to-end. Use an LLM (we use Claude Haiku for this — it is the cost-quality sweet spot) to score the answer on faithfulness, completeness, and groundedness. Calibrate the judge once against human annotations.
- User-feedback evals (continuous). Thumbs-up/down with a free-text field. Sample 1% of negative ratings into a queue an engineer reviews weekly. The new failure patterns become golden questions.
The harness lives in your CI. A pull-request that drops Recall@5 by more than 5% gets flagged automatically. A model upgrade that fails the answer eval cannot be merged. The discipline that the rest of your engineering team applies to unit tests gets applied to retrieval quality.
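The retrieval metrics and the CI gate are small enough to sketch in full. This assumes binary relevance (a document is either in the annotated set or not), unique IDs in the retrieved list, and that "more than 5%" means a relative drop against the stored baseline; all three are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the annotated documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def ndcg_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: rewards placing annotated documents near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(retrieved[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def recall_gate(baseline: float, candidate: float, max_relative_drop: float = 0.05) -> bool:
    """Return True when the candidate's recall is within the allowed drop from baseline."""
    if baseline <= 0:
        return True  # nothing to regress against
    return (baseline - candidate) / baseline <= max_relative_drop
```

In CI, the job would average `recall_at_k` over the golden set for the branch, compare it to the stored baseline with `recall_gate`, and fail the build when the gate returns False.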
Tooling we have settled on
- Promptfoo for the harness orchestration. YAML-defined evals, CI-friendly, supports LLM-judged scoring out of the box.
- Langfuse (self-hosted) for trace storage and dataset management. We run it in every client's tenant — the golden eval set lives with the customer, never us.
- OpenTelemetry for production trace export. Every RAG call carries a trace ID that links back to retrieved documents, model used, latency per stage, and cost.
Cost modelling, before you regret it
Three lines of cost to model before the first deploy:
- Ingestion. Embedding calls on document upload + reprocessing on schedule. Use Matryoshka truncation. Cache embeddings by content hash so an unchanged document never re-embeds.
- Retrieval. Embedding call on every query + vector-store lookup + reranker call. The embedding call is the dominant cost; cache by query hash for FAQ-like patterns and you can cut retrieval cost 30-50% on user-heavy systems.
- Generation. The LLM call producing the answer. Almost always the biggest single line. Use prompt caching (Anthropic and Bedrock both support it) for the system prompt and the retrieved context — the cache savings on a hot system are 70-90%.
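Caching embeddings by content hash is a few lines of code. The sketch below uses an in-memory dict for illustration; a real deployment would presumably back it with a database table or key-value store, and would likely include the embedding model name in the key so a model change invalidates old entries. `EmbeddingCache` and `embed_fn` are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Wraps an embedding function so identical text is only embedded once."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of real embedding calls made

    @staticmethod
    def key(text: str) -> str:
        # Assumption: key on content only; add the model name in practice.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        k = self.key(text)
        if k not in self._store:
            self.misses += 1
            self._store[k] = self._embed_fn(text)
        return self._store[k]
```

The same wrapper works for query-side caching on FAQ-like traffic; normalising whitespace and case before hashing raises the hit rate there.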
The update pipeline nobody talks about
Your corpus changes. Documents are updated, deleted, renamed, re-permissioned. Without an update pipeline, your vector store drifts away from ground truth one document at a time.
The architecture pattern that works:
- Source systems publish change events (SharePoint webhooks, Confluence webhooks, custom polling for systems without webhooks).
- A queue (Redis Streams, Kafka, or a simple Postgres advisory-lock pattern for small deployments) receives the events.
- An async worker fetches the document, re-chunks if changed, re-embeds changed chunks only, updates the vector store.
- An ACL synchroniser walks the same change event and updates the row-level permissions on every chunk derived from that document.
- A nightly reconciliation job compares the vector store contents against the source system to catch missed events.
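The worker's "re-embed changed chunks only" step reduces to a set difference over chunk hashes. A minimal sketch, assuming the vector store keeps a content hash alongside each chunk of a document; `plan_chunk_updates` is an illustrative name.

```python
import hashlib
from typing import List, Set, Tuple


def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_chunk_updates(indexed_hashes: Set[str], new_chunks: List[str]) -> Tuple[List[str], Set[str]]:
    """Compare a document's freshly chunked text against what the index holds.

    Returns (chunks to embed and insert, hashes of indexed chunks to delete).
    """
    new_by_hash = {chunk_hash(c): c for c in new_chunks}
    to_embed = [c for h, c in new_by_hash.items() if h not in indexed_hashes]
    to_delete = indexed_hashes - set(new_by_hash)
    return to_embed, to_delete
```

One caveat: with fixed-window chunking, an insertion near the top of a document shifts every later window, so almost every hash changes. Structure-aware chunk boundaries make this diff far more effective.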
The reconciliation job is the only honest defense against silent drift. We have caught corpus drift bugs three different ways in production: webhook delivery failures, ACL change events that were not processed, and (most embarrassing) a SharePoint API quirk where moved documents looked like new documents to the polling logic. Without nightly reconciliation, all three would have shipped silently.
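At its core the reconciliation job is a three-way diff between what the source system says exists and what the index holds. A minimal sketch, assuming both sides can be listed as a mapping from document ID to a fingerprint (for example a hash over content plus ACL, so permission drift shows up as stale); the names `Drift` and `reconcile` are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class Drift(NamedTuple):
    missing: Set[str]   # in the source, absent from the index: ingest
    stale: Set[str]     # in both, fingerprints differ: re-process
    orphaned: Set[str]  # in the index, gone from the source: delete


def reconcile(source: Dict[str, str], index: Dict[str, str]) -> Drift:
    """Diff {doc_id: fingerprint} maps from the source system and the vector store."""
    missing = set(source.keys() - index.keys())
    orphaned = set(index.keys() - source.keys())
    stale = {doc_id for doc_id in source.keys() & index.keys() if source[doc_id] != index[doc_id]}
    return Drift(missing, stale, orphaned)
```

The `orphaned` set is the deleted contract that still answers questions; the `stale` set is the deprecated policy text. Logging the size of each set every night also gives an early signal that event delivery is degrading.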
Where we go next
The pipeline above is what shipping RAG looks like today. The space moves fast. Two patterns we are watching closely:
- Late-interaction retrieval (ColBERT, ColPali). Per-token embeddings that delay the similarity computation until query time. Measurably better recall on heterogeneous corpora. Cost is higher; tooling is maturing. We have shipped one production deployment and the pattern is real.
- Structured retrieval over knowledge graphs. For corpora with strong relational structure (regulatory citations, contract amendments, scientific literature), a graph layer on top of the vector store can answer multi-hop questions that pure RAG cannot. Microsoft GraphRAG and LlamaIndex's variants are still rough but the direction is right.
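For the late-interaction pattern, the query-time computation is usually the MaxSim operator introduced with ColBERT: for each query token, take its best-matching document token, then sum those maxima. A minimal pure-Python sketch with tiny hand-written vectors; real systems use normalised model outputs and batched tensor operations, and `maxsim_score` is an illustrative name.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Sequence[float]], doc_vecs: List[Sequence[float]]) -> float:
    """Sum, over query tokens, of the highest similarity to any document token."""
    if not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

The cost profile follows directly from the shape of this function: the index stores one vector per token rather than one per chunk, and scoring is a nested loop over query and document tokens.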
Neither is a default for the first deployment. Both are on the roadmap for the second. The point of the playbook above is to ship the boring, well-evaluated baseline first — and earn the right to experiment by having a system that does not regress when you touch it.
Recap, brutally short
- Build the eval harness before the second feature. Without it, every change is a gamble.
- Chunking dominates retrieval quality. Use semantic or agentic chunking for any corpus with structural variation.
- Hybrid retrieval (BM25 + dense) is the default. Dense-only is the demo configuration.
- Reranking is the cheapest quality lift available. Do not skip it.
- Cache embeddings, cache prompts, model the cost lines before shipping.
- Update pipeline + nightly reconciliation. Silent drift is the failure mode that does the most damage over the longest time.
If you want the full deployment pattern done for you — corpus inventory, ACL design, eval harness, deployment in your tenant, quarterly evaluation refresh — that is what DECKLOG Implementation is. If you want to build it yourself, the playbook above is everything we would have given you in the first week of an engagement.