Can You Run RAG on 2 GB RAM?

Quick Answer

Yes, but only for small personal document sets. Llama 3.2 1B Q4_K_M (~750 MB) with MiniLM-L6-v2 embeddings (~80 MB) and an in-memory vector store comes to ~1.3–1.5 GB total, which fits on a 2 GB device. Larger models (7B+) and larger document sets (200+ pages) need at least 8 GB.

  • Llama 3.2 1B Q4_K_M (~750 MB) + MiniLM-L6-v2 embeddings (~80 MB) fits 2 GB
  • Document set must be under ~200 pages to stay within RAM
  • 7B+ models or larger corpora need at least 8 GB RAM

Updated: 2026-05

Yes — But Only Tiny Setups Work

At 2 GB RAM, the only viable RAG pipeline pairs a 1B-class LLM (Llama 3.2 1B) with a lightweight embedding model (MiniLM-L6-v2 at ~80 MB) and a flat-file or in-memory vector store. Phi-3 Mini, at 3.8B parameters, needs 3–4 GB and does not fit. As of May 2026, this setup works, but only for small personal document sets (under ~200 pages).
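
To make "in-memory vector store" concrete, here is a minimal sketch of one: a flat list searched by brute-force cosine similarity. It is an illustration, not how Chroma works internally. The class name `InMemoryVectorStore` and the toy 3-dimensional vectors are assumptions for the example; in a real pipeline the vectors would come from MiniLM-L6-v2 (384 dimensions).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Hypothetical flat store: a list of (vector, text) pairs held in RAM."""

    def __init__(self):
        self._items = []

    def add(self, vector, text):
        self._items.append((list(vector), text))

    def search(self, query, k=3):
        """Return the k best (score, text) pairs, highest score first."""
        scored = [(cosine(query, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy vectors stand in for real embeddings.
store = InMemoryVectorStore()
store.add([1.0, 0.0, 0.0], "notes on llamas")
store.add([0.0, 1.0, 0.0], "tax receipts")
top = store.search([0.9, 0.1, 0.0], k=1)
print(top[0][1])
```

Brute-force search is linear in the number of chunks, which is acceptable at the corpus sizes a 2 GB device can hold anyway.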

The table below shows the RAM footprint of each RAG component at minimum viable settings.

Component | Memory Use | Notes
LLM (Llama 3.2 1B Q4_K_M) | ~750 MB | Smallest usable instruction-tuned model
Embedding model (MiniLM-L6-v2) | ~80 MB | Runs on CPU; no GPU required
Vector store (Chroma in-memory) | ~150 MB | Scales with corpus size
Python runtime + framework overhead | ~300 MB | LangChain or bare llama-index
Total minimum | ~1.3–1.5 GB | Leaves ~500 MB for OS on a 2 GB device
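
The total can be checked with simple arithmetic. The sketch below sums the per-component figures from the table; the only assumption is that a "2 GB" device exposes 2048 MB.

```python
# Component footprints in MB, taken from the table above.
COMPONENTS_MB = {
    "llm_llama_3_2_1b_q4_k_m": 750,
    "embedding_minilm_l6_v2": 80,
    "vector_store_chroma_in_memory": 150,
    "python_runtime_and_framework": 300,
}

DEVICE_RAM_MB = 2048  # assumption: a "2 GB" device exposes 2048 MB


def ram_budget(components_mb, device_ram_mb):
    """Return (total used by the RAG stack, headroom left for the OS) in MB."""
    used = sum(components_mb.values())
    return used, device_ram_mb - used


used_mb, headroom_mb = ram_budget(COMPONENTS_MB, DEVICE_RAM_MB)
print(f"RAG stack: {used_mb} MB, left for OS: {headroom_mb} MB")
```

The sum lands at the low end of the ~1.3–1.5 GB range. The ~500 MB OS headroom quoted in the table corresponds to the high end, where the vector store and runtime have grown.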

What Breaks at 2 GB

The most common failure is the LLM exceeding available RAM as the context window fills. At 2 GB, a 1B model's context is effectively limited to roughly 2k tokens; beyond that the OS starts swapping. Loading a 7B or larger model fails immediately, since Llama 3 8B Q4_K_M alone requires ~5 GB.
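
One way to enforce the ~2k-token cap is to set it when the model is loaded. The sketch below assumes the llama-cpp-python bindings; the model path, the thread count, and the helper names `llama_kwargs` and `load_llm` are hypothetical and only illustrate the idea.

```python
from pathlib import Path

# Hypothetical path; point it at wherever your GGUF file actually lives.
MODEL_PATH = Path("models/llama-3.2-1b-instruct-q4_k_m.gguf")


def llama_kwargs(model_path, n_ctx=2048, n_threads=4):
    """Build constructor arguments for llama_cpp.Llama with a capped context."""
    return {
        "model_path": str(model_path),
        "n_ctx": n_ctx,          # cap the context window at ~2k tokens
        "n_threads": n_threads,  # assumption: a 4-core CPU
        "n_gpu_layers": 0,       # CPU only
        "use_mmap": True,        # map weights from disk instead of copying them
        "verbose": False,
    }


def load_llm(model_path=MODEL_PATH, **overrides):
    """Load the model. The import is deferred so the config can be
    inspected on a machine without llama-cpp-python installed."""
    from llama_cpp import Llama  # pip install llama-cpp-python

    return Llama(**{**llama_kwargs(model_path), **overrides})
```

With the model file in place, `load_llm()` returns a `Llama` object. Retrieved passages plus the question and the answer all have to fit inside `n_ctx`, so the number of chunks passed to the model needs to be small.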

The second failure mode is vector store growth. A Chroma database for 500 PDF pages uses approximately 400–600 MB depending on chunk size. Combined with the LLM, the embedding model, runtime overhead, and the OS, total memory use exceeds 2 GB. The fix: limit ingestion to under 150 pages (comfortably below the ~200-page ceiling), use 256-token chunks, and prune the store after each session.
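
The ingestion limits above can be sketched as a small pre-processing step. This is a minimal illustration with hypothetical helper names (`split_into_chunks`, `ingest`), and it counts whitespace-separated words as a rough stand-in for tokens; a real tokenizer would give somewhat different counts.

```python
MAX_PAGES = 150     # ingestion cap suggested above
CHUNK_SIZE = 256    # chunk size suggested above, in approximate tokens


def split_into_chunks(text, size=CHUNK_SIZE):
    """Split text into chunks of at most `size` whitespace-separated words.

    Word counts only approximate real tokenizer counts.
    """
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def ingest(pages, max_pages=MAX_PAGES):
    """Chunk at most `max_pages` pages; any pages beyond the cap are skipped."""
    chunks = []
    for page in pages[:max_pages]:
        chunks.extend(split_into_chunks(page))
    return chunks


# 200 synthetic pages of 600 words each; only the first 150 are ingested.
pages = ["word " * 600] * 200
chunks = ingest(pages)
print(len(chunks))
```

Each 600-word page yields three chunks (256, 256, and 88 words), so the chunk count grows linearly with the page cap.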

Quick Answers About RAG on 2 GB RAM

What's the smallest LLM that works for RAG?
Llama 3.2 1B Q4_K_M (~750 MB) is the smallest instruction-tuned model that produces coherent answers for retrieval-augmented tasks. Phi-3 Mini (3.8B) is a better choice if you have 3–4 GB available — its 4k context handles longer retrieved passages. Below 1B parameters, output quality degrades sharply for RAG-style question answering.
Can I use Ollama on 2 GB RAM?
Ollama's documentation recommends at least 8 GB of RAM for 7B models. On 2 GB, Ollama itself loads, but model serving fails or swaps heavily. For 2 GB devices, use llama.cpp directly via the CLI or the llama-cpp-python bindings, which have a smaller resident memory footprint than the Ollama server process.
Will Raspberry Pi 5 (8 GB) run proper RAG?
Yes. A Raspberry Pi 5 with 8 GB RAM runs Llama 3 8B Q4_K_M (~5 GB) alongside a full embedding + vector store stack with room to spare. Speed is ~1–2 tok/s on the Pi 5 CPU — slow but functional for offline personal search use cases. See the best Ollama models for CPU-only inference for speed benchmarks.
Is local RAG worth doing on 2 GB RAM?
For small personal document sets (notes, a few PDFs), yes — the 1B + MiniLM pipeline is genuinely useful. For anything requiring precise retrieval over large corpora or complex multi-hop reasoning, 2 GB RAM is a hard constraint. Upgrade to at least 8 GB before expecting production-grade RAG quality.