A housing association deployed an LLM to answer tenant queries. Within the first week, it confidently told a resident that their property was covered under a repair scheme that had been discontinued three years earlier. The tenant turned up at the office expecting free repairs. That incident cost more in goodwill than the entire AI project budget.
This is why RAG exists. Retrieval-Augmented Generation grounds LLM responses in your actual data—policies, documents, knowledge bases—rather than relying on what the model "remembers" from training. Done right, it's the difference between a helpful assistant and a liability.
Retrieval-Augmented Generation (RAG) is a technique that enhances LLMs by retrieving relevant documents from your data and including them in the prompt context. Instead of relying solely on training data, RAG gives the model access to current, specific information at query time.
The core insight: LLMs are excellent at reasoning and synthesis, but terrible at being databases. RAG plays to their strengths—you handle retrieval, they handle generation.
LLMs have three fundamental limitations that RAG addresses: their training data has a cutoff and goes stale, they can't point to sources for what they claim, and they know nothing about your private documents.

RAG solves all three: it provides fresh data at query time, grounds answers in retrievable sources (enabling citations), and unlocks private knowledge bases.
RAG operates through a pipeline with four key stages:

1. Ingest and chunk: split your documents into retrievable units.
2. Embed and index: convert each chunk into a vector and store it in a vector database.
3. Retrieve: at query time, find the chunks most relevant to the user's question (and optionally re-rank them).
4. Generate: pass the question plus the retrieved chunks to the LLM, which answers from that context and cites its sources.
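To make the stages concrete, here is a minimal sketch in Python. It assumes the OpenAI Python SDK for embeddings and generation (the model names are placeholders for whatever you actually use) and an in-memory NumPy index standing in for a real vector database.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stage 1 -- ingest and chunk: split documents into retrievable units.
def chunk(text: str, max_chars: int = 1200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        if current and len(current) + len(p) > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{p}".strip()
    if current:
        chunks.append(current)
    return chunks

# Stage 2 -- embed and index: one vector per chunk, stored as a matrix.
def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Stage 3 -- retrieve: cosine similarity between the query and every chunk.
def retrieve(query: str, chunks: list[str], index: np.ndarray, k: int = 4) -> list[str]:
    q = embed([query])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

# Stage 4 -- generate: the LLM answers strictly from the retrieved context.
def answer(query: str, context: list[str]) -> str:
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Everything that follows is about making each of these stages better than the naive version above.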
We've audited dozens of RAG implementations. The failures are rarely about the LLM—they're almost always in the retrieval pipeline:
Fixed-size chunking (e.g., 500 tokens) splits sentences mid-thought and loses context. A chunk that starts "...continued from the previous section" is useless in isolation. Use semantic chunking that respects document structure—paragraphs, sections, logical units.
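As a rough illustration of structure-aware chunking, the sketch below splits on markdown headings and paragraph boundaries and keeps each section's heading with its chunks; the heading regex and size limit are illustrative choices, not a prescription.

```python
import re

def chunk_by_structure(markdown: str, max_chars: int = 1500) -> list[dict]:
    """Split a markdown document on headings, keeping each chunk's heading as
    context so that no chunk starts mid-thought. Oversized sections are split
    on paragraph boundaries rather than at an arbitrary token count."""
    sections = re.split(r"\n(?=#{1,6}\s)", markdown)
    chunks = []
    for section in sections:
        lines = section.strip().splitlines()
        if not lines:
            continue
        heading = lines[0] if lines[0].startswith("#") else ""
        body = "\n".join(lines[1:] if heading else lines)
        current = ""
        for paragraph in (p for p in body.split("\n\n") if p.strip()):
            if current and len(current) + len(paragraph) > max_chars:
                chunks.append({"heading": heading, "text": current.strip()})
                current = ""
            current += paragraph + "\n\n"
        if current.strip():
            chunks.append({"heading": heading, "text": current.strip()})
    return chunks
```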
Vector similarity search returns semantically related chunks, but "related" isn't always "relevant to this specific question." A re-ranker (like Cohere Rerank or a cross-encoder) dramatically improves precision by scoring query-document pairs directly.
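A re-ranking step can be small. The sketch below assumes the sentence-transformers library and one of its publicly available cross-encoder checkpoints; Cohere Rerank or any other pairwise scorer slots into the same place.

```python
from sentence_transformers import CrossEncoder

# Cross-encoders score (query, document) pairs jointly: slower than vector
# search, so run them only on the top candidates returned by the vector store.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 4) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Typical pattern: over-fetch from the vector store (say, the top 25 chunks),
# then re-rank down to the handful that actually go into the prompt.
```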
If users can't verify where an answer came from, trust erodes fast. Every RAG system should cite sources—not just for transparency, but for debugging. When something goes wrong, you need to know which retrieved chunk caused it.
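One lightweight way to do this is to number the retrieved chunks in the prompt and instruct the model to cite those numbers. A sketch, assuming each chunk is a dict with `text` and `source` fields (those names are illustrative):

```python
def build_cited_prompt(question: str, chunks: list[dict]) -> str:
    """Number each retrieved chunk so the model can cite [1], [2], ... and so
    a bad answer can be traced back to the exact chunk that caused it."""
    numbered = []
    for i, chunk in enumerate(chunks, start=1):
        numbered.append(f"[{i}] (source: {chunk['source']})\n{chunk['text']}")
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source number for every claim, e.g. [2]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Logging the question, the retrieved chunks, and the model's answer side by side is what later lets you tell a retrieval failure from a generation failure.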
The most common enterprise RAG failure: the system retrieves documents the user shouldn't have access to. If your vector database doesn't support metadata filtering, you need to implement access control in the retrieval layer.
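Where the vector database supports metadata filters, push the access check into the query itself so restricted chunks never leave the store. As a defence-in-depth fallback, filter again before anything reaches the prompt; a sketch, assuming each chunk carries an illustrative `allowed_groups` field:

```python
def filter_by_access(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop any retrieved chunk the user is not entitled to see. Ideally the
    filter runs inside the vector query (metadata filters) so restricted
    chunks never leave the database; this post-filter is a fallback applied
    before anything reaches the prompt."""
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", []))
    ]

retrieved = [
    {"text": "Repairs policy: ...", "allowed_groups": ["tenant_services", "staff"]},
    {"text": "Board minutes: ...", "allowed_groups": ["executive"]},
]
# A tenant-services agent never sees the board minutes, even if they match the query.
print(filter_by_access(retrieved, user_groups={"tenant_services"}))
```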
Documents change. Policies update. If your embeddings are from six months ago, your RAG is answering with outdated information—exactly the problem you built RAG to solve.
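A simple refresh pipeline stores a content hash next to each document and re-embeds only what has changed. A sketch, where `embed_and_upsert` is a stand-in for whatever writes chunks and vectors to your store:

```python
import hashlib
from typing import Callable

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def refresh_embeddings(
    documents: dict[str, str],         # doc_id -> current text
    stored_hashes: dict[str, str],     # doc_id -> hash recorded at the last run
    embed_and_upsert: Callable[[str, str], None],  # writes chunks + vectors for one doc
) -> dict[str, str]:
    """Re-embed only the documents whose content changed since the last run.
    Returns the updated hash map to store for next time. Deleting vectors for
    documents removed from the source is left to the caller."""
    new_hashes = dict(stored_hashes)
    for doc_id, text in documents.items():
        h = content_hash(text)
        if stored_hashes.get(doc_id) != h:
            embed_and_upsert(doc_id, text)  # re-chunk, re-embed, overwrite old vectors
            new_hashes[doc_id] = h
    return new_hashes
```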
| Option | When to Use | Trade-offs |
|---|---|---|
| RAG | Need access to fresh/private data, require citations, high volume of domain knowledge | Retrieval latency (100-500ms), chunking complexity, requires vector infrastructure |
| Fine-tuning | Need specific style, format, or behaviour; data is stable; privacy requirements prevent context sharing | Training cost, less flexible to updates, risk of catastrophic forgetting |
| Long Context | One-off analysis of specific documents, all relevant context fits in window | Cost scales linearly with context, no persistent knowledge, slower for repeated queries |
RAG relies on embeddings—dense vector representations that capture semantic meaning. These are stored in specialised vector databases optimised for similarity search.
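To make "semantic meaning" concrete: embeddings of sentences that say the same thing in different words land close together, even with little keyword overlap, and that is exactly what similarity search exploits. A small demonstration, again assuming the OpenAI embeddings API (any embedding model behaves similarly):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

resp = client.embeddings.create(
    model="text-embedding-3-small",  # placeholder: any embedding model works
    input=[
        "How do I report a broken boiler?",             # the query
        "Tenants can request heating repairs online.",  # paraphrase, few shared words
        "Our annual report is published in March.",     # unrelated, but shares "report"
    ],
)
query, paraphrase, unrelated = (np.array(d.embedding) for d in resp.data)

# The paraphrase scores far higher than the unrelated sentence despite sharing
# almost no keywords with the query -- that is semantic similarity at work.
print(cosine(query, paraphrase), cosine(query, unrelated))
```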
RAG works best when you have domain-specific knowledge that changes over time: policies, procedures, internal documents, and knowledge bases that users need current, accurate answers from.
Here's the practical implementation path we recommend:
Start simple. Use a hosted embedding API (OpenAI, Cohere), a managed vector DB (Pinecone, Supabase pgvector), and basic semantic search. Get something working in days, not months.
Once basic retrieval works, improve it: use chunking that respects document structure, add a re-ranker, implement hybrid search (vectors + BM25), and tune retrieval parameters (top-k, similarity threshold).
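For hybrid search, reciprocal rank fusion (RRF) is a common way to merge the vector ranking and the BM25 ranking without having to calibrate their raw scores against each other; a minimal sketch (60 is the conventional RRF constant):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids (e.g. one from vector search,
    one from BM25) into a single ranking. Documents near the top of either
    list float to the top of the fused list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse the two retrievers' top results.
vector_hits = ["doc_7", "doc_2", "doc_9"]
bm25_hits = ["doc_2", "doc_4", "doc_7"]
print(reciprocal_rank_fusion([vector_hits, bm25_hits]))  # doc_2 and doc_7 rank first
```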
Add access control, set up evaluation pipelines, implement monitoring (retrieval recall, answer quality scores), build data refresh pipelines for embedding updates.
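Evaluation does not need to be elaborate to be useful: a small gold set mapping test questions to the chunks that should answer them lets you track retrieval recall over time. A sketch, where `retrieve_ids` stands in for your own retrieval function:

```python
def recall_at_k(gold: dict[str, set[str]], retrieve_ids, k: int = 5) -> float:
    """gold maps each test question to the set of chunk ids that answer it;
    retrieve_ids(question, k) returns the ids your pipeline retrieved.
    Recall@k is the fraction of relevant chunks that made it into the top k,
    averaged over questions."""
    totals = []
    for question, relevant in gold.items():
        retrieved = set(retrieve_ids(question, k))
        totals.append(len(retrieved & relevant) / len(relevant))
    return sum(totals) / len(totals)

# Run this on every chunking or re-ranking change and after every data
# refresh, and alert when the score drops.
```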
From there, more advanced techniques include multi-hop retrieval (query → retrieve → re-query), agentic RAG (the LLM decides what to retrieve), query rewriting, and answer verification chains.
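Of these, query rewriting is often the cheapest win: conversational follow-ups retrieve poorly, so have the model turn them into standalone search queries before retrieval. A sketch, again assuming the OpenAI SDK with a placeholder model name:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def rewrite_query(question: str, chat_history: list[str]) -> str:
    """Turn a conversational follow-up ("what about leaseholders?") into a
    standalone search query, using the chat history to resolve references."""
    history = "\n".join(chat_history)
    prompt = (
        "Rewrite the user's latest question as a single standalone search query. "
        "Resolve pronouns and references using the conversation history. "
        "Return only the rewritten query.\n\n"
        f"Conversation history:\n{history}\n\n"
        f"Latest question: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```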
To recap the practices that matter most:

- Use semantic chunking that respects document structure: fixed-size chunks split context and reduce retrieval quality.
- Implement a re-ranker for precision: vector similarity alone returns 'related', not 'relevant'.
- Add citations to every response: users lose trust when they can't verify answers.
- Implement access control in the retrieval layer: the most common enterprise failure is users seeing documents they shouldn't.
- Set up embedding refresh pipelines: stale embeddings mean stale answers.
- Log queries, retrieved chunks, and responses: without logs, you can't tell retrieval failures from generation failures.
RAG is the most practical way to give LLMs access to your organisation's knowledge. Done right, it transforms AI from a general-purpose chatbot into a domain expert that cites its sources.
The key is recognising that retrieval quality determines answer quality. Most RAG failures aren't model problems—they're chunking problems, ranking problems, or data freshness problems. Focus your effort there.
Building a knowledge base or RAG system? We help organisations design retrieval pipelines that actually work in production.
Learn about our method →