RAG · LLM · Vector DB · Production

RAG Systems in Production: What Actually Works

Practical lessons on building production RAG systems — chunking strategies, embedding selection, reranking, hybrid search, and keeping hallucinations in check.

Published 2026-03-15 · 12 min read

Why Retrieval Makes or Breaks Your RAG System

RAG is one of the most practical ways to ground LLMs in real, domain-specific knowledge. Instead of relying on whatever the model memorized during training, you pull in relevant documents at query time and feed them into the prompt. The payoff? Fewer hallucinations, fresher answers, and the ability to actually cite your sources — all things you absolutely need in production.

Here's the thing, though: going from a RAG prototype in a Jupyter notebook to a system that's reliable at scale is a completely different game. Retrieval quality drives everything downstream. Even small mistakes in how you chunk, embed, or rank documents can quietly wreck your entire pipeline. In this guide, I'll walk through the decisions and patterns that separate demos from production-grade systems.

Your RAG system is only as good as its retrieval. If the right context never reaches the LLM, no amount of prompt engineering will save you.

Chunking: Fixed, Semantic, and Recursive

The first big decision in any RAG pipeline is how you split your documents into chunks. Get this wrong and everything else suffers — retrieval precision drops, and the LLM gets noisy or incomplete context. There are three main approaches, each with real trade-offs.

Fixed-Size Chunking

This is the simplest approach: split text into segments of a set token or character count, usually with some overlap to avoid losing context at boundaries. It's fast, predictable, and easy to implement. The downside? It has zero awareness of meaning. A chunk boundary can land right in the middle of a sentence or split a key concept across two chunks, which hurts retrieval quality.
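For illustration, here's a minimal character-based sketch of fixed-size chunking with overlap (the function name and default sizes are arbitrary choices, not from any particular library):

```python
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Each chunk starts (chunk_size - overlap) characters after the previous
    one, so the last `overlap` characters of a chunk repeat at the start of
    the next. No awareness of sentence or paragraph boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]
```

The overlap is what keeps a sentence straddling a boundary from being lost entirely, but note that it also duplicates content across chunks, slightly inflating storage and retrieval noise.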

Semantic Chunking

Semantic chunking is smarter — it uses embedding similarity between consecutive sentences to find natural breakpoints. When the cosine distance between neighboring sentence embeddings crosses a threshold, you start a new chunk. This keeps topics together within chunks, but it comes with extra compute cost and a tuning parameter (the similarity threshold) that you'll need to calibrate for your specific corpus.

Recursive Chunking

Recursive chunking, popularized by LangChain's RecursiveCharacterTextSplitter, tries a hierarchy of separators — double newlines, single newlines, sentences, then words — and only falls back to the next level when a chunk gets too big. It strikes a nice balance between structure-awareness and predictability. In practice, recursive chunking at 512 tokens with 10-15% overlap is a solid default that works well for most production use cases. I'd start there unless you have a specific reason not to.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken

def create_chunks(documents: list[str], chunk_size: int = 512, overlap: int = 64):
    """Split documents using recursive chunking with token-based sizing."""
    tokenizer = tiktoken.encoding_for_model("gpt-4")

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=lambda text: len(tokenizer.encode(text)),
        separators=["\n\n", "\n", ". ", " ", ""],
    )

    chunks = []
    for doc in documents:
        splits = splitter.split_text(doc)
        chunks.extend(splits)

    return chunks


# Semantic chunking alternative using embedding distances
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_chunk(text: str, threshold: float = 0.3):
    """Split text at points where semantic similarity drops below threshold."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = text.split(". ")
    embeddings = model.encode(sentences)

    chunks, current_chunk = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i], embeddings[i - 1]) / (
            np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[i - 1])
        )
        if similarity < threshold:
            chunks.append(". ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    chunks.append(". ".join(current_chunk))
    return chunks
```

Recursive and semantic chunking — two approaches you'll likely compare

Picking the Right Embedding Model

Your embedding model defines how similarity works in your system. Pick the wrong one and your retrieval will quietly underperform — queries won't match the documents they should. Here's what to look at when choosing.

  • Dimensionality trade-offs: Higher-dimensional embeddings (like 3072 for OpenAI's text-embedding-3-large) capture more nuance but cost more to store and search. Something like all-MiniLM-L6-v2 at 384 dimensions hits a sweet spot for many workloads.
  • Domain fit: General-purpose models often struggle with specialized content (legal, medical, financial). Fine-tuning on even a few thousand domain-specific pairs can dramatically boost recall.
  • Multilingual needs: If you're serving multiple languages, use a multilingual model like multilingual-e5-large so semantically equivalent queries in different languages retrieve the same documents.
  • Matryoshka representations: Some newer models let you truncate embedding dimensions at inference time without retraining — handy for dynamically trading quality for speed.
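To make the Matryoshka idea concrete, here's a sketch of dimension truncation, assuming a model trained so that the leading dimensions carry the most information (the function name is mine, not a library API):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, target_dim: int) -> np.ndarray:
    """Keep the first target_dim dimensions and re-normalize to unit length.

    Only meaningful for models trained with Matryoshka representation
    learning; for ordinary embedding models, truncation discards
    information arbitrarily and will hurt retrieval quality.
    """
    truncated = embedding[:target_dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated
```

The re-normalization step matters: cosine similarity assumes unit-length vectors, and truncation changes the norm.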

MTEB is great for comparing models on standard benchmarks, but don't rely on it blindly. What I've seen work best is evaluating your top candidates against a representative sample of real queries with relevance judgments. A few hours of evaluation upfront can save you months of debugging down the road.

Hybrid Search and Reranking

Embedding-based search alone rarely cuts it for production. Two techniques make a huge difference: hybrid search and reranking.

Hybrid search combines dense vector retrieval with sparse lexical retrieval (usually BM25). Dense retrieval is great at semantic understanding — it knows "automobile" and "car" are related. Sparse retrieval handles exact keyword matches, acronyms, and rare domain terms that embeddings might miss. Combine both with Reciprocal Rank Fusion (RRF) and you consistently beat either method on its own. Most modern vector databases — Weaviate, Qdrant, Pinecone — support this out of the box.
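RRF itself is only a few lines. A minimal sketch, assuming each retriever returns a ranked list of document IDs (the constant k=60 comes from the original RRF paper and is rarely worth tuning):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document's score is the sum of 1 / (k + rank) over every list it
    appears in, so documents ranked highly by both dense and sparse
    retrieval float to the top.
    """
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works purely on ranks, it sidesteps the problem of BM25 scores and cosine similarities living on incomparable scales.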

Cross-Encoder Reranking

After your initial retrieval pulls back a candidate set (typically 20-100 documents), a cross-encoder reranker scores each one against the original query. Unlike bi-encoders that encode queries and documents separately, cross-encoders process the pair together, which enables much richer matching. This two-stage pattern — fast approximate retrieval followed by precise reranking — is the industry standard for high-quality RAG, and for good reason.

```python
from sentence_transformers import CrossEncoder
import numpy as np

def retrieve_and_rerank(
    query: str,
    vector_store,
    reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
    top_k_retrieval: int = 50,
    top_k_final: int = 5,
):
    """Two-stage retrieval: vector search followed by cross-encoder reranking."""
    # Stage 1: Fast approximate retrieval via vector similarity
    candidates = vector_store.similarity_search(query, k=top_k_retrieval)

    # Stage 2: Precise reranking with a cross-encoder
    reranker = CrossEncoder(reranker_model)
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = reranker.predict(pairs)

    # Sort by reranker score and return top results
    ranked_indices = np.argsort(scores)[::-1][:top_k_final]
    return [candidates[i] for i in ranked_indices]
```

Two-stage retrieval: fast vector search, then precise reranking

Tip

Reranking adds latency — roughly 50-200ms for 50 candidates. Keep your candidate set at 50 docs max and use a lightweight reranker. If latency is critical, distilled models like TinyBERT variants give you a good speed-accuracy balance.

Keeping Hallucinations Under Control

Even with great retrieval, LLMs can still generate claims that aren't backed by the context you gave them. In production — especially in healthcare, finance, or legal — hallucinations aren't just annoying, they're dangerous. Here are the strategies that actually help.

  1. Force citations: Tell the model to ground every claim in a specific retrieved chunk and cite it inline (e.g., [Source 1]). This makes it easy to verify outputs against context and pushes the model toward more disciplined generation.
  2. Set confidence thresholds: If no retrieved document scores above a minimum relevance threshold, have the system say "I don't have enough information" instead of guessing. A graceful "I don't know" always beats a confident hallucination.
  3. Use self-consistency checks: Generate multiple answers and check for agreement. If three independent generations disagree, the answer is probably unreliable — flag it for human review.
  4. Run faithfulness evaluation: NLI (Natural Language Inference) models can classify each generated sentence as supported, contradicted, or neutral relative to the context. Contradictions are strong hallucination signals.
  5. Manage your context window: Don't stuff the context with marginally relevant chunks. A curated set of 3-5 highly relevant chunks almost always outperforms dumping 20 loosely related passages in there.
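Strategy 2 above fits in a few lines. In this sketch, `retrieve` and `generate` are hypothetical stand-ins for your own retrieval and LLM-call functions, and the 0.5 threshold is an arbitrary placeholder you'd calibrate against your reranker's score distribution:

```python
FALLBACK = "I don't have enough information to answer that."

def answer_with_threshold(query, retrieve, generate, min_score: float = 0.5):
    """Refuse to answer when no retrieved chunk clears the relevance bar.

    retrieve(query) is assumed to return (chunk_text, relevance_score)
    pairs, higher scores meaning more relevant; generate(query, context)
    is assumed to call the LLM with the curated context.
    """
    results = retrieve(query)
    relevant = [(text, score) for text, score in results if score >= min_score]
    if not relevant:
        return FALLBACK
    # Curate: cap the context at the 5 best chunks rather than stuffing it
    context = "\n\n".join(text for text, _ in relevant[:5])
    return generate(query, context)
```

Note this also implements strategy 5 as a side effect: by filtering and capping the chunk list, the LLM only ever sees a small, high-relevance context.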

The most effective approach isn't any single technique — it's layering them together. Strong retrieval, precise reranking, constrained prompting, and post-generation verification working as a system.

Monitoring and Evaluation

You can't improve what you can't measure. A production RAG system needs continuous monitoring across both retrieval and generation. Without it, quality can silently degrade for weeks before anyone notices.

Retrieval Metrics

Track standard IR metrics — Recall@k, MRR, and NDCG — against a curated eval set of query-relevance pairs. If Recall@10 drops from 0.85 to 0.72 after a corpus update, that's a clear signal that something in your chunking or embedding pipeline broke. You want to catch these regressions fast.
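These metrics are simple enough to compute yourself. A minimal sketch of Recall@k and MRR over a labeled eval set, where each query pairs a ranked list of retrieved IDs with its set of known-relevant IDs:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Run these against the same golden dataset on every corpus or pipeline change, and chart them over time; a single snapshot tells you much less than the trend.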

Generation Quality

On the generation side, you care about faithfulness (is the answer grounded in context?), relevance (does it actually answer the question?), and completeness (does it cover all aspects?). Tools like RAGAS and DeepEval automate these evaluations using LLM-as-judge techniques, so you can run continuous regression tests without manual annotation.

  • Log every query, retrieved context, and generated answer for offline analysis.
  • Set up alerts on retrieval score distributions — sudden shifts usually mean corpus or embedding issues.
  • Track user feedback (thumbs up/down, query reformulations) as implicit quality signals.
  • Run weekly automated evals against a golden dataset to catch gradual drift.

Key Takeaways

Building a production RAG system is an engineering discipline, not a one-time deployment. The decisions that matter most are chunking strategy, embedding model selection, and implementing hybrid search with reranking. Get these right and everything downstream benefits.

  1. Start with recursive chunking at 512 tokens with overlap. It's a strong general-purpose default that works across most document types.
  2. Evaluate embedding models on your actual domain queries, not just public benchmarks. A few hours of testing upfront saves months of debugging.
  3. Use hybrid search (dense + sparse) with cross-encoder reranking. The two-stage pattern is worth the complexity — you'll see it in every serious RAG deployment.
  4. Layer your hallucination defenses: constrained prompting, confidence thresholds, and post-generation faithfulness checks all working together.
  5. Build evaluation infrastructure early. Automated retrieval and generation metrics plus a curated golden dataset are essential for maintaining quality over time.
  6. Monitor everything. Query logs, retrieval scores, generation quality, user feedback — they form a continuous improvement loop that keeps your system healthy as data and usage evolve.

RAG isn't a solved problem. The field is moving fast with agentic retrieval, multi-hop reasoning, and knowledge graph integration all pushing boundaries. But the fundamentals covered here give you a solid foundation for building systems that are reliable, observable, and ready for real traffic.