RAG, chunking, embeddings: the practical guide for reliable chatbots (SEO, hands-on)
- Mustapha Alouani
- AI, LLM, Architecture
- December 27, 2025

Why your RAG still hallucinates (and how to fix it)
Many believe that plugging a document base into an LLM guarantees factual answers. In reality, a poorly designed RAG pipeline (naive chunking, generic embeddings, noisy retrieval, missing citations) leads to hallucinations and production failures.
This guide gives you the keys to building a robust, indexable RAG pipeline. For more (scripts, evaluation, advanced cases), see The Mechanics of LLMs.
RAG: the winning recipe
A modern RAG pipeline (see the code sketch below):
- Chunk your documents intelligently (structure, overlap)
- Encode each chunk with domain-adapted embeddings (BGE, E5, OpenAI, Cohere…)
- Retrieve relevant passages (vector search, re-ranking)
- Augment the LLM prompt with these passages, enforcing citations and structure if needed
RAG is the “open-book exam” for LLMs: the model no longer answers from memory, but grounds its output in verifiable sources.
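To make these four steps concrete, here is a minimal, framework-free sketch (assuming sentence-transformers is installed and using a BGE model as an example encoder; the documents and the final LLM call are placeholders for your own data and provider):

```python
# Minimal RAG sketch: chunk -> embed -> retrieve -> augment the prompt.
# Requires `pip install sentence-transformers numpy`; the documents are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["chunk 1 ...", "chunk 2 ..."]  # output of your chunking step

encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")  # example model; swap for your domain
doc_vectors = encoder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Return the k chunks closest to the query (cosine similarity on normalized vectors)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q
    return [documents[i] for i in np.argsort(-scores)[:k]]

def build_prompt(query: str) -> str:
    """Augment the prompt with retrieved passages, numbered so the model can cite them."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieve(query)))
    return (
        "Answer using ONLY the numbered sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
```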
Embeddings: the choice that changes everything
Not all embeddings are equal! For effective RAG, use models adapted to your domain: BGE, E5, OpenAI, Cohere, or specialized models (legal, medical…). Frameworks like LlamaIndex, Haystack, LangChain make it easy to test and integrate different encoders.
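A quick way to see the difference is to A/B two encoders on your own queries. A sketch (the models, corpus, and query below are only illustrations; the prefixes are the convention E5 uses for queries and passages):

```python
# A/B two embedding models on a tiny domain query; a higher score on the right
# chunk means a better fit. Requires `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer

corpus = [
    "Article 12: the lessee must notify the lessor within 30 days.",
    "Invoices are payable net 45 from the date of issue.",
]
query = "notification deadline for the tenant"

candidates = [
    ("BAAI/bge-small-en-v1.5", "", ""),                # BGE: no prefix needed
    ("intfloat/e5-small-v2", "query: ", "passage: "),  # E5 expects these prefixes
]
for model_name, q_prefix, d_prefix in candidates:
    model = SentenceTransformer(model_name)
    docs = model.encode([d_prefix + d for d in corpus], normalize_embeddings=True)
    q = model.encode([q_prefix + query], normalize_embeddings=True)[0]
    print(model_name, (docs @ q).round(3))  # per-chunk similarity scores
```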
Chunking: the art of splitting without losing meaning
Bad chunking kills relevance: too large, retrieval is fuzzy; too small, you lose context. Use structure-aware chunking (titles, sections) with 30–50% overlap to maximize recall. Tools like LlamaIndex and Haystack offer advanced chunking modules.
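As a rough idea of what structure-aware chunking with overlap looks like without any framework, here is a sketch for markdown sources (the sizes are illustrative and should be tuned on your corpus; the framework modules mentioned above do this more robustly):

```python
# Structure-aware chunking sketch: split on markdown headings, then window each
# section with overlap. chunk_size/overlap below are illustrative values.
import re

def chunk_markdown(text: str, max_chars: int = 1000, overlap: int = 300) -> list[str]:
    sections = re.split(r"(?m)^(?=#{1,3} )", text)  # keep each heading with its body
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        title = section.splitlines()[0].strip()
        step = max_chars - overlap
        for start in range(0, len(section), step):
            piece = section[start:start + max_chars]
            # Prepend the section title so every chunk stays self-describing.
            chunks.append(piece if piece.startswith(title) else f"{title}\n{piece}")
            if start + max_chars >= len(section):
                break
    return chunks
```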
Retrieval and re-ranking: double insurance
A good RAG pipeline first retrieves broadly (recall), then applies re-ranking (BM25, Cross-Encoder, Cohere Rerank…) to keep only the most relevant passages. Without re-ranking, LLMs can make bad context sound plausible—quality matters!
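In code, the second stage can be as small as a cross-encoder pass over the first-stage candidates (a sketch; the model name is one common public re-ranker, not a requirement):

```python
# Two-stage retrieval sketch: retrieve broadly with the bi-encoder, then keep only
# the passages a cross-encoder scores highest. Requires sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example re-ranker

def retrieve_then_rerank(query: str, candidates: list[str], keep: int = 3) -> list[str]:
    """candidates: a broad first-pass result (e.g. top 20 from vector search)."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:keep]]
```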
Chunk contextualization: the semantic boost
Add a title or mini-summary to each chunk before indexing: this improves relevance, especially for abstract queries. Modern frameworks integrate this step (LlamaIndex, Haystack, LangChain output parsers).
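A bare-bones version of that step, independent of any framework (the summarize helper is a placeholder you could back with an LLM call or a simple heuristic):

```python
# Contextualization sketch: index each chunk with a short header describing where
# it comes from. `summarize` is a stand-in, not a library function.
def summarize(chunk: str) -> str:
    return chunk.splitlines()[0][:80]  # naive stand-in: reuse the first line

def contextualize(doc_title: str, section: str, chunk: str) -> str:
    header = f"Document: {doc_title} | Section: {section} | Summary: {summarize(chunk)}"
    return f"{header}\n{chunk}"  # embed and index this string, not the raw chunk
```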
The 3 pitfalls to avoid
- Random chunking: too large = fuzzy, too small = loss of context
- Generic embeddings: ineffective on jargon, tables, acronyms
- No citations: impossible to distinguish grounded from hallucinated answers
Best practices for a reliable RAG
- Structure-aware chunking with overlap
- Specialized embeddings tested on your data
- Systematic re-ranking
- Enforced citations and output format (JSON, markdown, etc.; see the sketch after this list)
- Evaluate retrieval independently from the LLM (recall@k, citation quality)
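For the citation and output-format point, here is a sketch of what an enforced, machine-checkable contract can look like (the JSON schema and wording are illustrative, not a standard):

```python
# Output-constraint sketch: number each injected source and require a JSON answer
# that cites them; reject anything that cannot be parsed or cites nothing.
import json

def grounded_prompt(question: str, passages: list[str]) -> str:
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using ONLY the numbered sources.\n"
        'Reply as JSON: {"answer": "...", "citations": [1, 2]}.\n'
        'If the sources do not contain the answer, reply {"answer": "unknown", "citations": []}.\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )

def parse_answer(raw: str) -> dict:
    """Treat any reply that is not valid JSON or cites nothing as ungrounded."""
    data = json.loads(raw)
    if not data.get("citations"):
        raise ValueError("Ungrounded answer: no citations returned")
    return data
```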
Beware of noise: too much context kills relevance
Injecting too many or the wrong chunks drowns the model and anchors generation on bad sources. Always measure recall@k and adjust the number of injected chunks.
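A simple way to measure this (a sketch assuming you maintain a small hand-curated evaluation set, and that retrieve is your own retrieval function):

```python
# recall@k sketch: for each evaluation question, check whether any retrieved chunk
# contains the known answer span. eval_set and retrieve are your own pieces.
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """eval_set items look like {"question": ..., "answer_span": ...} (curated ground truth)."""
    hits = 0
    for item in eval_set:
        chunks = retrieve(item["question"], k=k)
        hits += any(item["answer_span"].lower() in c.lower() for c in chunks)
    return hits / len(eval_set)

# Usage idea: sweep k to find where extra chunks stop helping and start adding noise.
# for k in (3, 5, 8, 12):
#     print(k, recall_at_k(eval_set, retrieve, k=k))
```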
Recommended frameworks and tools
- LlamaIndex: full RAG pipeline, advanced chunking, built-in evaluation
- Haystack: retrieval, re-ranking, multiple connectors
- LangChain: orchestration, output parsers, structured citations
- Weaviate, Qdrant, Milvus: high-performance vector databases
- Cohere Rerank, Cross-Encoder: state-of-the-art re-ranking
For hands-on practice: scripts and notebooks at https://github.com/alouani-org/mecanics-of-llms
Engineer’s method: diagnose and improve your RAG
- Separate retrieval and generation: is the answer in the retrieved chunks? (see the sketch after this list)
- Measure: recall@k, citation quality, injected noise
- Improve in order: structured chunking, domain embeddings, re-ranking, output/citation constraints
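The first two points can be automated with a tiny triage helper (a sketch; retrieve and generate stand for your own pipeline functions, and expected_span is a ground-truth snippet you curate):

```python
# Triage sketch: decide whether a bad answer is a retrieval problem or a
# generation problem, by checking where the expected span actually appears.
def diagnose(question: str, expected_span: str, retrieve, generate, k: int = 5) -> str:
    chunks = retrieve(question, k=k)
    grounded = any(expected_span.lower() in c.lower() for c in chunks)
    answer = generate(question, chunks)
    if expected_span.lower() in answer.lower():
        return "ok"
    if grounded:
        return "generation problem (answer was in the retrieved context)"
    return "retrieval problem (answer missing from the retrieved context)"
```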
FAQ
Why does my RAG hallucinate even with documents injected? Because injecting documents is not enough: only relevant, readable, and actually used passages ground the answer. Noisy or excessive context is ignored or misused.
How many chunks should I inject? Start small (3–5) with re-ranking, then adjust based on retrieval quality.
For more: vector databases, embedding selection, advanced indexing, evaluation, and pitfalls are detailed in The Mechanics of LLMs (Augmented Systems & RAG chapter).