Quick Answer
A full RAG pipeline needs at least 8 GB of RAM to work well. With only 2 GB, you can run a tiny LLM such as TinyLlama (~1.5 GB at Q4 quantization), but Phi-2 (~2.0 GB) already consumes the whole budget on its own. An embedding model adds another 0.5–1 GB, leaving almost no room for the vector store or context. Expect limited results.
Updated: 2026-05
Key Takeaways
A complete RAG pipeline has three memory consumers: the LLM (1.5–5 GB depending on model size), an embedding model (~0.5 GB for all-MiniLM), and a vector store like ChromaDB (0.1–0.5 GB depending on index size). At 2 GB total RAM, you can load only one of these components at a useful quality level.
TinyLlama (1.1B parameters) uses approximately 1.5 GB at Q4 quantization, and Phi-2 (2.7B parameters) uses approximately 2.0 GB. On a 2 GB device, neither leaves room for an embedding model, and without embeddings you cannot perform semantic similarity search, which is the core of any RAG system.
Attempting RAG on 2 GB of RAM results in either out-of-memory crashes or extreme performance degradation. The operating system itself consumes 0.3–0.6 GB before any ML workload starts, so only about 1.4–1.7 GB is actually available to the pipeline.
| RAM Available | What Fits | RAG Quality |
|---|---|---|
| 2 GB | TinyLlama only, no embeddings | Poor |
| 8 GB | Llama 3 8B + embeddings + ChromaDB | Good |
| 16 GB | 13B LLM + full RAG stack | Excellent |
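
The 2 GB row can be checked with simple arithmetic. The sketch below is illustrative only: it plugs in the approximate figures quoted in this article (they are estimates, not measurements), and the names `COMPONENTS_GB`, `remaining_gb`, and `fits` are hypothetical helpers invented for this example.

```python
# Memory budget sketch using the approximate figures quoted in this article.
# All values are in GB and are rough estimates, not measurements.

OS_OVERHEAD_GB = (0.3, 0.6)  # low and high estimate for the operating system

COMPONENTS_GB = {
    "tinyllama_q4": 1.5,
    "phi2_q4": 2.0,
    "embedding_model": 0.5,
    "vector_store": 0.1,  # low end of the 0.1-0.5 GB range
}


def remaining_gb(total_ram_gb, component_names, os_overhead_gb):
    """Return the RAM left after the OS and the named components are loaded."""
    used = os_overhead_gb + sum(COMPONENTS_GB[name] for name in component_names)
    return round(total_ram_gb - used, 2)


def fits(total_ram_gb, component_names, os_overhead_gb=OS_OVERHEAD_GB[1]):
    """True if the components fit next to the OS (pessimistic OS estimate by default)."""
    return remaining_gb(total_ram_gb, component_names, os_overhead_gb) >= 0


if __name__ == "__main__":
    full_stack = ["tinyllama_q4", "embedding_model", "vector_store"]
    for os_gb in OS_OVERHEAD_GB:
        llm_only = remaining_gb(2.0, ["tinyllama_q4"], os_gb)
        everything = remaining_gb(2.0, full_stack, os_gb)
        print(f"OS {os_gb} GB: LLM only leaves {llm_only} GB, full stack leaves {everything} GB")
```

A negative result means the configuration does not fit. With these figures, TinyLlama alone fits on 2 GB only under the optimistic OS estimate, and the full local stack does not fit under either estimate.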
If you must use a low-memory device, the most effective workaround is to offload the embedding step to a remote API. Services like OpenAI's text-embedding-ada-002 model generate embeddings via an API call: you send text, receive a vector, and store it locally in a lightweight vector store. This eliminates the ~0.5 GB cost of a local embedding model.
With remote embeddings, a 2 GB device can run TinyLlama locally for generation while using cloud embeddings for retrieval. The quality remains limited by TinyLlama's reasoning capabilities, but the pipeline becomes technically functional. Note that remote embeddings incur API costs and require an internet connection.
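
The retrieval side of this workaround can be sketched as follows, using only the Python standard library so that nothing heavy is loaded locally. This is a minimal sketch, not a production design: `TinyVectorStore`, `EmbedFn`, and `fake_remote_embed` are hypothetical names, and `fake_remote_embed` is an offline stand-in that you would replace with a function calling your embedding provider's API.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
EmbedFn = Callable[[str], Vector]  # hypothetical: any function mapping text to a vector


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """In-memory store: embeds text via the supplied function, searches by brute force."""

    def __init__(self, embed: EmbedFn) -> None:
        self._embed = embed
        self._items: List[Tuple[str, Vector]] = []

    def add(self, text: str) -> None:
        # One embedding call per document; with a remote API this is a network request.
        self._items.append((text, self._embed(text)))

    def search(self, query: str, k: int = 3) -> List[Tuple[float, str]]:
        """Return up to k (score, text) pairs, highest similarity first."""
        query_vec = self._embed(query)
        scored = [(cosine_similarity(query_vec, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def fake_remote_embed(text: str) -> List[float]:
    """Offline stand-in for a remote embedding call: counts a few keywords."""
    words = text.lower().split()
    return [float(words.count(w)) for w in ("ram", "embedding", "llama")]


if __name__ == "__main__":
    store = TinyVectorStore(fake_remote_embed)
    store.add("a small device has little ram")
    store.add("the embedding step runs remotely")
    print(store.search("how much ram is needed", k=1))
```

The retrieved passages would then be placed into the prompt for the locally running TinyLlama. Brute-force search is a deliberate choice here: it keeps memory use proportional to the stored vectors and is adequate for small document sets, though a dedicated vector store scales better.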
For a full guide on setting up a local RAG system that actually works well, see the local RAG setup guide covering minimum hardware and model selection.