RAG grounds LLMs in your data by retrieving relevant chunks at query time and injecting them into the prompt. In our eval (50 questions over proprietary docs): fixed-size chunking achieved 62% accuracy; semantic chunking (paragraph/section boundaries) raised it to ~78%. Optimize for retrieval recall over precision; K=5–7 works for most document-QA workloads. Instruct the model to refuse when context lacks the answer.
Retrieval-augmented generation (RAG) fetches context at query time and conditions the model on it. The gap between "RAG works" and "RAG is production-ready" comes down to chunking strategy, retrieval recall, and handling retrieval failure.
Evidence: our evaluation
On 50 questions over proprietary docs: fixed-size chunking (256–1024 tokens, top-3 to top-10) achieved 62% accuracy; semantic chunking (paragraph/section boundaries) reached ~78%. Remaining failures were mostly retrieval misses.
We built internal tools for document Q&A. Naive setup: fixed-size chunks (e.g., 500 tokens), embed, store, retrieve top-K. Failures: queries spanning multiple chunks got partial context; semantic similarity didn't guarantee relevance ("pricing" chunk for "how much does it cost?" when discussing a different product); fixed-size chunking split tables and paragraphs mid-structure.
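The naive setup can be sketched in a few lines. This is a toy illustration, not our production code: `embed` is a stand-in bag-of-words counter in place of a real embedding model, and token counts are approximated by whitespace splitting.

```python
import math
from collections import Counter

def fixed_size_chunks(text, size=500):
    """Split text into fixed-size chunks by whitespace tokens (a crude proxy for real tokenization)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """Score every chunk against the query and return the top-K by cosine similarity."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

Every failure mode above lives in this sketch: `fixed_size_chunks` splits mid-structure, and `retrieve` returns the top-K by similarity whether or not any chunk is actually relevant.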
What retrieval actually does
Retrieval recall (relevant chunk in top-K) correlates more strongly with accuracy than precision. Missing the right chunk is fatal; a few irrelevant chunks are tolerable. Diminishing returns past K=5–7 for document-QA.
RAG retrieval narrows the context window to what's likely relevant. The model reasons; retrieval reduces the search space. Retrieval is a filter, not a guarantee—if the right information isn't retrieved, the model can't recover. Increasing K improves recall but dilutes the signal (more tokens, higher latency). The right K depends on chunk size and how often answers span multiple chunks.
Chunking: the hidden variable
Semantic chunking (paragraph/section boundaries) raised accuracy from 62% to ~78% in our 50-question eval. Fixed-size chunking breaks tables and lists; semantic chunking preserves structure.
Fixed-size vs semantic chunking
Fixed-size chunking is easy but breaks semantic units—tables and bullet lists get split mid-item. Semantic chunking splits on paragraphs, sections, or headings. Chunk sizes become variable and retrieval scoring gets more complex, but the accuracy gain outweighs the added complexity.
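A minimal semantic chunker splits on paragraph boundaries and packs paragraphs into chunks without ever cutting one in half. This is a sketch under simplifying assumptions (blank lines mark paragraphs, token count ≈ whitespace words); a real implementation would also split on headings and use the model's tokenizer.

```python
import re

def semantic_chunks(text, max_tokens=800):
    """Split on blank lines (paragraph boundaries), packing whole paragraphs
    into chunks so no paragraph, table, or list is cut mid-structure."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, current_len = [], [], 0
    for p in paragraphs:
        n = len(p.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note the trade-off the text describes: chunks come out variable-sized, so downstream scoring and context budgeting must handle that.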
Overlap
A 50–100 token overlap between adjacent chunks helps when answers span chunk boundaries. It increases storage and index size; we use it for long-form documents and skip it for FAQs.
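Overlap is a one-line change to the chunking loop: advance by `size - overlap` tokens instead of `size`, so each chunk repeats the tail of its predecessor. A sketch over a pre-tokenized list:

```python
def overlapping_chunks(tokens, size=500, overlap=75):
    """Fixed-size chunks where each chunk repeats the last `overlap` tokens
    of its predecessor, so answers spanning a boundary stay intact in one chunk."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]
```

With `size=500, overlap=75`, each chunk shares 75 tokens with the next—the storage cost the text mentions is roughly `overlap / size` (here ~15%) extra tokens indexed.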
The Meterra approach
We've landed on a retrieval pipeline that prioritizes recall over precision, then filters. First, we use semantic chunking with paragraph/section boundaries. Second, we retrieve more than we need (e.g., top-10) and rerank with a lightweight cross-encoder or keyword overlap before passing to the LLM. Third, we include explicit instructions in the prompt telling the model to say "I don't know" when the retrieved context doesn't contain the answer—rather than hallucinating.
The key is treating retrieval as a probabilistic step. We don't expect every query to hit the right chunk. We design for graceful degradation: when retrieval fails, the model should refuse, not confabulate.
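The retrieve-wide-then-filter step can be sketched as follows. This uses the keyword-overlap variant of the reranker (the cross-encoder path would swap in a scoring model), and the prompt wording is illustrative, not our exact production prompt.

```python
def keyword_overlap(query, chunk):
    """Cheap reranking signal: fraction of query terms that appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def build_prompt(query, candidates, keep=5):
    """Over-retrieve (e.g., top-10 candidates), rerank by keyword overlap,
    keep the top few, and instruct the model to refuse rather than confabulate."""
    reranked = sorted(candidates, key=lambda c: keyword_overlap(query, c), reverse=True)
    context = "\n\n---\n\n".join(reranked[:keep])
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, reply exactly \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

The refusal instruction is the graceful-degradation path: when retrieval misses, the best available behavior is an explicit "I don't know," which also gives you a monitorable signal.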
What we recommend
Given how RAG systems actually behave, we recommend:
1. Start with semantic chunking.
Respect document structure. Split on paragraphs, sections, or headings. Avoid fixed-size chunking unless your content is highly uniform.
2. Optimize for recall, then filter.
Retrieve more chunks than you'll use. Rerank or filter before passing to the model. Missing the right chunk is worse than including a few irrelevant ones.
3. Add a "no answer" path.
Instruct the model to refuse when the context doesn't support an answer. Monitor how often that happens—it's a signal that retrieval needs improvement.
4. Evaluate retrieval separately.
Measure retrieval recall (is the right chunk in top-K?) before optimizing generation. If retrieval is broken, better prompts won't fix it.
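Measuring retrieval recall separately needs only a labeled set of (question, gold chunk) pairs and the retriever itself. A minimal sketch, assuming `retrieve` returns a ranked list of chunk ids and each question has one gold chunk:

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk id appears in the top-K results.

    eval_set: list of (question, gold_chunk_id) pairs.
    retrieve: function mapping a question to a ranked list of chunk ids.
    """
    hits = sum(1 for question, gold in eval_set if gold in retrieve(question)[:k])
    return hits / len(eval_set)
```

Sweeping `k` over, say, 3–10 on a fixed eval set shows where recall plateaus—per the numbers above, gains typically flatten around K=5–7 for document-QA, which is how to pick K empirically rather than by guesswork.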
At the margins, the difference between "RAG works" and "RAG is unreliable" often comes down to chunking and retrieval recall. Get those right first; the rest is prompt engineering.