Key Takeaways
- RAG (retrieval-augmented generation) = document ingestion + retrieval + local LLM answering. No model training required.
- Five core steps: (1) load documents, (2) chunk into 500-1000 token pieces, (3) generate embeddings, (4) store in a vector DB, (5) retrieve on query. The retrieved chunks are then passed to the LLM for generation.
- Best embedding model: nomic-embed-text (137M parameters, runs locally, 768-dim vectors).
- Best vector DB: Chroma (simple, embedded) for <1M documents; Qdrant (distributed) for production.
- As of April 2026, local RAG avoids per-token API fees and network round trips, which can make it cheaper and faster than cloud APIs. Quality depends on retrieval accuracy and prompt engineering.
How Does RAG Work Step-by-Step?
1. Document ingestion: Load PDFs, text files, or web pages.
2. Chunking: Split documents into 500-1000 token chunks (overlap 20% to prevent context breaks).
3. Embedding: Convert each chunk to a vector (768-1536 dimensions) using a local embedding model.
4. Storage: Store vectors in a vector database (Chroma, Qdrant, Milvus) with metadata (document name, page, timestamp).
5. Query time: Convert the user question to an embedding, then search the vector DB for the top-k most similar chunks (k=5-10).
6. Context assembly: Combine retrieved chunks into a prompt with instructions for the local LLM.
7. Generation: The local LLM generates an answer based on the retrieved context.
8. Attribution: Return which documents the answer came from.
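The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model such as nomic-embed-text, a plain list stands in for the vector database, and the sample documents are invented. Generation is left as a comment because it depends on which local LLM runtime you use.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(documents):
    """Ingest, chunk by paragraph, embed, and store each chunk with metadata."""
    index = []
    for name, text in documents.items():
        for i, para in enumerate(p.strip() for p in text.split("\n\n")):
            if para:
                index.append({"source": name, "chunk": i,
                              "text": para, "vector": toy_embed(para)})
    return index


def retrieve(index, question, k=2):
    """Embed the question and return the k most similar chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda e: cosine(q, e["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Assemble retrieved chunks and instructions into a single prompt."""
    context = "\n".join(f"[{h['source']}#{h['chunk']}] {h['text']}" for h in hits)
    return (
        "Answer using only the context below. Cite sources in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Invented sample documents, for illustration only.
docs = {
    "handbook.txt": "Employees accrue vacation days monthly.\n\n"
                    "Expense reports are due by the fifth of each month.",
    "faq.txt": "The office is closed on public holidays.",
}
question = "When are expense reports due?"
index = build_index(docs)
hits = retrieve(index, question, k=2)
prompt = build_prompt(question, hits)
# Generation: send `prompt` to your local LLM here.
sources = [h["source"] for h in hits]  # attribution: which documents were used
print(prompt)
```

Swapping in a real embedding model and vector database changes `toy_embed` and the list-based index, but the overall flow stays the same.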
What Is the Optimal Chunking Strategy?
Chunking strategy largely determines retrieval quality. With bad chunking, relevant information gets split across chunks and retrieval fails to surface it.
Semantic chunking (recommended): Split by sentences or paragraphs, preserving meaning. Example: each paragraph = 1 chunk.
Fixed-size chunking: 500 tokens per chunk, 20% overlap. Simple but may split sentences.
Recursive chunking: Split by paragraphs first, then by sentences if too large. Preserves hierarchy.
As of April 2026, semantic chunking with 500-1000 token chunks and 20% overlap is optimal for most use cases.
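To make the fixed-size strategy and the 20% overlap concrete, here is a small sketch in plain Python. The function name `chunk_fixed` is hypothetical, and whitespace-separated words stand in for tokens; a real pipeline would count tokens with the tokenizer of its embedding model.

```python
def chunk_fixed(tokens, chunk_size=500, overlap_ratio=0.2):
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be > 0 and overlap_ratio in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Words stand in for tokens in this illustration.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, chunk_size=500, overlap_ratio=0.2)
print([len(c) for c in chunks])
```

With 1,200 tokens, a 500-token window, and 20% overlap, each window advances by 400 tokens, so the last 100 tokens of one chunk reappear at the start of the next.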
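# Python: recursive chunking example (paragraphs first, then lines and sentences)
# Note: chunk_size and chunk_overlap are counted in characters by default, not tokens.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # 20% overlap
    separators=["\n\n", "\n", ".", " "],  # paragraph, then line, then sentence, then word
)
# `documents` is assumed to be a list of LangChain Document objects loaded earlier
chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

Which Vector Database Should You Use?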
| Database | Type | Capacity | Best For |
|---|---|---|---|
| Chroma | Embedded | <1M docs | Prototyping, small RAG |
| Qdrant | Distributed | Unlimited | Production, scalable |
| Milvus | Distributed | Unlimited | Enterprise, massive scale |
| Weaviate | Graph + Vector | Unlimited | Complex queries, relationships |
| Pinecone (cloud) | Managed | Unlimited | Serverless, hands-off |
What Embedding Model Should You Choose?
| Model | Vector Dimensions | Speed | Quality | Recommendation |
|---|---|---|---|---|
| nomic-embed-text (local) | 768 | Fast | Excellent | Best for local RAG |
| bge-m3 (local) | 1024 | Fast | Excellent | Multilingual support |
| OpenAI text-embedding-3 (cloud) | 1536 (small) / 3072 (large) | Very fast | Best in class | Hybrid approach |
| Cohere (cloud) | 1024 (embed-english-v3.0) | Fast | Excellent | Production cloud RAG |
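Whichever model you pick, retrieval typically compares the query vector with each chunk vector using cosine similarity, which is why the index and the query must use the same model and the same number of dimensions. A minimal sketch in plain Python follows; the three-dimensional vectors are invented for readability, and real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
chunk_a = [2.0, 0.0, 2.0]  # same direction as the query
chunk_b = [0.0, 3.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, chunk_a))
print(cosine_similarity(query, chunk_b))
```

The length check mirrors a common failure mode: querying a 768-dimension index with a vector from a 1024-dimension model fails or returns meaningless scores.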
How Do You Optimize Retrieval Quality?
Retrieval quality determines RAG success. Good retrieval = good answers. Bad retrieval = hallucinations.
- Top K selection: Retrieve k=5-10 chunks. Higher k = more context (slower), lower k = fewer distractions.
- Similarity threshold: Filter results by minimum similarity score (e.g., >0.75). Avoids low-relevance chunks.
- Reranking: Use a reranker (cross-encoder) to re-rank retrieved chunks by relevance. Small accuracy boost.
- Hybrid search: Combine semantic search (embeddings) with BM25 keyword search. Catches documents with exact keywords.
- Query expansion: Expand user query with synonyms or related terms. Improves recall.
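Two of the techniques above are easy to sketch in plain Python: top-k selection with a similarity threshold, and hybrid search. For the hybrid part this example uses reciprocal rank fusion (RRF) to merge a semantic ranking with a keyword ranking; RRF is one common way to combine rankings, chosen here for illustration rather than prescribed by this guide. The function names, the sample scores, and the constant `c=60` (a commonly used default) are assumptions.

```python
def top_k_above_threshold(scored, k=5, min_score=0.75):
    """Keep the k highest-scoring (chunk_id, similarity) pairs above min_score."""
    kept = [(cid, s) for cid, s in scored if s > min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


def reciprocal_rank_fusion(rankings, c=60):
    """Fuse ranked id lists: each list adds 1 / (c + rank) to an id's score."""
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (c + rank)
    return sorted(fused, key=lambda cid: fused[cid], reverse=True)


# Invented similarity scores from a semantic (embedding) search.
semantic = [("c1", 0.91), ("c2", 0.80), ("c3", 0.62), ("c4", 0.77)]
best = top_k_above_threshold(semantic, k=2, min_score=0.75)

# Invented rankings: one from embeddings, one from BM25 keyword search.
semantic_rank = ["c1", "c2", "c4"]
keyword_rank = ["c4", "c1", "c7"]
hybrid = reciprocal_rank_fusion([semantic_rank, keyword_rank])
print(best, hybrid)
```

Chunks that appear in both rankings (here `c1` and `c4`) rise to the top, while `c7`, found only by keyword search, still makes the list. If the threshold filter returns nothing, that is a natural place to trigger a "no information" fallback.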
How Do You Evaluate RAG Quality?
RAG quality has two dimensions: (1) retrieval quality (did we get relevant chunks?), and (2) generation quality (did the LLM answer well?).
Retrieval evaluation: Create test queries with known correct documents. Measure precision (what fraction of the retrieved chunks are relevant?) and recall (what fraction of the relevant documents were retrieved?).
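Precision and recall at k can be computed with a few lines of Python. This is a sketch with invented document ids; `precision_recall_at_k` is a hypothetical helper name, and it assumes the retrieved list contains no duplicates.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision@k, recall@k) for one test query."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: ids returned by the retriever, best match first.
retrieved = ["d3", "d1", "d9", "d4", "d7"]
# Ids a human labeled as correct for this query.
relevant = {"d1", "d4", "d5"}

p5, r5 = precision_recall_at_k(retrieved, relevant, k=5)
p2, r2 = precision_recall_at_k(retrieved, relevant, k=2)
print(p5, r5, p2, r2)
```

Averaging these values over a set of test queries gives a baseline you can compare against after changing chunk size, embedding model, or k.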
Generation evaluation: Run LLM on retrieved chunks, manually score answers (0-5 scale) for accuracy and completeness.
As of April 2026, evaluation tools such as Ragas can compute retrieval and generation metrics automatically.
Production RAG Patterns
For production services, use these patterns:
- Caching: Cache embeddings for frequently repeated queries and unchanged documents so they are not recomputed.
- Incremental indexing: Add new documents without reindexing everything. Qdrant and Milvus support this.
- Monitoring: Track retrieval latency, cache hit rate, and user feedback on answer quality.
- Fallback: If retrieval fails (no relevant chunks), respond with "I don't have information about that" instead of hallucinating.
- Versioning: Keep document versions for audit trails. Store which version was used for each answer.
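The caching and monitoring patterns can be combined in one small wrapper. The sketch below is an assumption-laden illustration: `EmbeddingCache` and `fake_embed` are hypothetical names, the cache is an in-process dictionary keyed by a SHA-256 hash of the text, and `fake_embed` is a placeholder for a call to a real embedding model.

```python
import hashlib


class EmbeddingCache:
    """Memoize embeddings by content hash and track the cache hit rate."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)  # compute only on a miss
        return self._store[key]

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_embed(text):
    """Placeholder for a real embedding call; returns a tiny deterministic vector."""
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
for q in ["reset password", "refund policy", "reset password"]:
    cache.get(q)
print(cache.hits, cache.misses, cache.hit_rate)
```

In a long-running service you would likely bound the cache size or back it with an external store; the hit rate exposed here is the metric named in the monitoring bullet.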
Common Mistakes in Local RAG Implementation
- Chunking documents poorly. Chunks that are too small fragment related information across many chunks. Chunks that are too large dilute the embedding and pull irrelevant text into the prompt. Test chunk sizes empirically.
- Not evaluating retrieval. Building RAG without testing if retrieval works is like building a car without testing the engine. Always measure precision/recall.
- Using generic embeddings for domain-specific documents. Legal, medical, or technical documents may need fine-tuned embeddings. Consider domain-specific models.
- Forgetting about update frequency. If documents change weekly, your vector DB gets stale. Build a pipeline to re-embed and update.
- Expecting RAG to replace fine-tuning. RAG is context injection. Fine-tuning is model adaptation. For best results, combine both.
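For the update-frequency problem, one simple approach is to store a content hash per document and re-embed only what changed. The sketch below assumes that approach; `plan_reindex` and the file names are hypothetical, and a real pipeline would apply the resulting plan to its vector database.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes, current_docs):
    """Compare stored hashes with current content.

    Returns (to_embed, to_delete): documents that are new or changed,
    and documents that no longer exist and should leave the index.
    """
    to_embed = [name for name, text in current_docs.items()
                if indexed_hashes.get(name) != content_hash(text)]
    to_delete = [name for name in indexed_hashes if name not in current_docs]
    return sorted(to_embed), sorted(to_delete)


# Hashes recorded when the index was last built (invented example).
indexed = {
    "a.md": content_hash("version 1"),
    "b.md": content_hash("unchanged"),
    "old.md": content_hash("removed later"),
}
# Documents as they exist now.
current = {"a.md": "version 2", "b.md": "unchanged", "c.md": "brand new"}

to_embed, to_delete = plan_reindex(indexed, current)
print(to_embed, to_delete)
```

Running this on a schedule that matches how often your documents change keeps the index fresh without re-embedding everything.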
Common Questions About Local RAG
How many documents can local RAG handle?
Chroma handles 100K-1M documents on consumer hardware. Qdrant scales to billions of vectors with a distributed setup. Beyond 1M documents, use Qdrant or Milvus.
What latency should I expect?
Embedding the query (nomic-embed-text on CPU): 50-200ms. Retrieval (Chroma on disk): 10-50ms. LLM generation: 2-10 seconds (depends on model size). Total: roughly 2-10 seconds per query, dominated by generation.
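The arithmetic behind that total is easy to check. The snippet below simply sums the ranges quoted above; the stage names are labels chosen for this example.

```python
# Latency ranges from the answer above, in milliseconds: (best case, worst case).
stages_ms = {
    "embed_query": (50, 200),
    "retrieve": (10, 50),
    "generate": (2000, 10000),
}

best_ms = sum(low for low, _ in stages_ms.values())
worst_ms = sum(high for _, high in stages_ms.values())
generation_share = stages_ms["generate"][1] / worst_ms  # worst-case share

print(f"best: {best_ms} ms, worst: {worst_ms} ms")
print(f"generation share of worst case: {generation_share:.0%}")
```

Because generation accounts for nearly all of the budget, a smaller or faster LLM usually reduces latency far more than tuning the embedding or retrieval steps.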
Can RAG handle real-time document updates?
Yes. Add new documents to the vector DB dynamically. Indexing latency is 100-500ms per document, so real-time updates are practical.
Is local RAG cheaper than cloud APIs?
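Yes, in per-query terms. There is no per-token cost and no calls to external APIs. After the one-time embedding of your documents, queries incur no usage fees, although you still pay for hardware and electricity.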
Can I use cloud embeddings with local LLMs?
Yes. Use OpenAI, Cohere, or other cloud embeddings for indexing, then use local LLMs for generation. This is a hybrid approach. Note that queries must be embedded with the same cloud model that was used for indexing.
Sources
- LlamaIndex Documentation -- docs.llamaindex.ai
- LangChain RAG Guide -- python.langchain.com/docs/use_cases/question_answering
- Chroma Documentation -- docs.trychroma.com
- Qdrant Vector Search Engine -- qdrant.tech
- RAG Paper (Lewis et al.) -- arxiv.org/abs/2005.11401