Question 1

How much VRAM do you need for a local LLM?

Accepted Answer

4 GB VRAM handles Phi-4 Mini and Gemma 2B comfortably with safe headroom for context expansion. 6 GB runs Llama 3 8B at Q4. 12 GB fits Qwen 14B Q4 efficiently. 16+ GB is needed for 70B models at Q4.

Question 2

What is Q4_K_M quantization?

Accepted Answer

Q4_K_M means 4-bit quantization using k-quant (K) compression at medium (M) quality. It is the best default for most models: better quality than Q4_0, smaller than Q8_0.

Question 3

Q4_K_M vs Q8_0: which should you pick?

Accepted Answer

Use Q4_K_M if you have 8 GB VRAM or less. Use Q8_0 if you have 12+ GB. Q4_K_M delivers 95% of Q8_0 quality at roughly half the file size.

Question 4

Best Ollama models for RTX 3060 12 GB?

Accepted Answer

The best Ollama models for an RTX 3060 12 GB are **Qwen3 7B** (general tasks, 7 GB VRAM), **Phi-4** in Q4_K_M (reasoning, ~9 GB VRAM), and **Mistral Nemo 12B** (8 GB VRAM). All run at 30–50 tokens/second on this GPU.

Question 5

Best Ollama models for 4 GB VRAM?

Accepted Answer

4 GB VRAM is tight but usable with small models like Phi-4 Mini Q4 at ~3.2 GB, Gemma 2 2B at ~1.5 GB, and SmolLM 1.7B at ~1.0 GB for flexible allocation. Llama 3 8B will not fit.

Question 6

How much RAM does a 7B model need?

Accepted Answer

A 7B model at Q4 quantization needs 5–6 GB of VRAM or RAM for efficient inference performance. Rule of thumb: model parameters in billions × 0.7 = approximate GB needed at Q4. GPU delivers ~25 tok/s; CPU delivers ~5 tok/s on same memory.

Question 7

How much VRAM for a 70B model?

Accepted Answer

A 70B model at Q4_K_M needs approximately 40 GB of VRAM. Consumer options: dual RTX 3090 (48 GB total), M5 Max with 128 GB unified memory, or cloud GPU rental.

Question 8

Best local LLM for 6 GB VRAM?

Accepted Answer

With 6 GB VRAM, Llama 3 8B Q4_K_M is the top pick at ~5.5 GB with excellent chat and coding capabilities at ~20 tok/s. Phi-4 Q4_K_M and Mistral Small Q4_K_S are solid alternatives.

Question 9

What is the latest Ollama version?

Accepted Answer

Check ollama.com or the Ollama GitHub releases page for the current version. On Linux, run the install script to update. On Mac/Windows, download the latest installer.

Question 10

Best Ollama models right now?

Accepted Answer

As of May 2026, the top general Ollama model is Llama 3 8B Q4_K_M, fitting in 6 GB VRAM at ~20 tok/s with excellent instruction following. For coding, Qwen 3 Coder 14B leads. For compact use, Phi-4 Mini is excellent. This page updates monthly.

Question 11

Best Ollama models for CPU only?

Accepted Answer

Without a GPU, Phi-4 Mini at Q4 is the best balance of quality and speed on CPU, delivering reasoning quality close to Llama 3 8B while needing only 4 GB RAM. Llama 3 8B Q4 works with 8+ GB RAM. Gemma 2B is the fastest CPU option.

Question 12

Can you run Qwen 3 on Ollama?

Accepted Answer

Yes — Ollama supports all Qwen 3 model sizes from 0.6B to 72B with native tool calling via the standard API, needing only a single command like ollama run qwen3:8b. The 8B model needs ~6 GB VRAM at Q4.

Question 13

Which Ollama models support vision?

Accepted Answer

Ollama supports several vision models: LLaVA, Gemma 3 multimodal, and Qwen-VL. Run ollama run llava for the easiest start. All accept images via the Ollama API.

Question 14

Which Ollama models support 128K context?

Accepted Answer

Llama 3.3 8B supports 128K context on Ollama. Qwen 3 14B reaches 1M tokens. Note: running full context dramatically increases VRAM — a 128K window needs 3–4× more VRAM than the default 4K window.

Question 15

Qwen Coder vs DeepSeek Coder: which is better?

Accepted Answer

Qwen 3 Coder wins for Python and TypeScript. DeepSeek Coder V2 has broader language support. Both require ~10 GB VRAM at 14B Q4. For most developers, Qwen 3 Coder is the better default.

Question 16

Ollama vs LM Studio: which should you pick?

Accepted Answer

If you use a terminal and build with APIs, choose Ollama. If you prefer a GUI and just want to chat with models, use LM Studio. Both are free and run models locally.

Question 17

Jan vs LM Studio: which is better?

Accepted Answer

Jan is fully open source with an extension system. LM Studio has a more polished UI and a larger built-in model library. For power users who want customization, choose Jan. For ease of use, choose LM Studio.

Question 18

What is the best local LLM app for Android in 2026?

Accepted Answer

For most people, MLC Chat is the best local LLM app for Android in 2026 — it installs from Google Play in under a minute, uses preoptimized models, and runs fully offline without any technical setup. Pocketpal is the upgrade for users who want to load custom GGUF models; Termux + Ollama is for developers who want the full Ollama CLI on their phone.

Question 19

Best frontend for Ollama?

Accepted Answer

Open WebUI is the best Ollama frontend for most users, offering free access to a feature-rich interface with Docker deployment and RAG support. It is free, feature-rich, and runs as a Docker container. SillyTavern is better for roleplay. Jan adds a local model manager.

Question 20

Qwen 14B vs Llama 3 8B: which runs better locally?

Accepted Answer

Llama 3 8B fits in 6 GB VRAM and runs faster. Qwen 3 14B needs 10+ GB but scores higher on benchmarks. If you have 12 GB VRAM, Qwen 14B wins on quality.

Question 21

Best 14B model for coding?

Accepted Answer

Qwen 3 Coder 14B is the top 14B coding model for local use, scoring 78.4% on HumanEval and running in 10 GB VRAM at Q4_K_M quantization. It fits in 10 GB VRAM at Q4_K_M and scores highest on HumanEval among 14B models. DeepSeek Coder 14B is a strong alternative with similar VRAM requirements.

Question 22

Best mini PC for local LLM?

Accepted Answer

Three mini PCs stand out for local LLM inference: Mac Mini M4 delivers ~18 tok/s with unified memory and zero VRAM bottleneck, Minisforum UM790 Pro scales to 64 GB DDR5 for larger models, and Beelink SER8 offers value at ~8 tok/s with Ryzen 9 8845HS. All three run 7–13B Q4 models without a discrete GPU.

Question 23

Best MoE models for local coding?

Accepted Answer

Mixtral 8x22B and DeepSeek V2 are the top MoE coding models for local use, activating only a fraction of total parameters per token to deliver better quality per VRAM than dense models. Both require at least 16 GB VRAM at Q4, with Mixtral at ~26 GB and DeepSeek V2 at ~16 GB.

Question 24

Best local LLM for coding with 12 GB VRAM?

Accepted Answer

Qwen 3 Coder 14B Q4_K_M is the best coding model for 12 GB VRAM GPUs, achieving the highest HumanEval scores among 14B models while using ~10 GB VRAM on RTX 3060 and RTX 3080 Ti. It uses ~10 GB VRAM and scores highest on HumanEval among models that fit this constraint. DeepSeek Coder 14B is a strong alternative.

Question 25

Best LLM for AMD 5700X + RTX 3070 Ti?

Accepted Answer

With an RTX 3070 Ti (8 GB VRAM), Llama 3 8B Q4_K_M and Mistral Small Q5_K_M are the best local LLMs, both using ~6 GB VRAM and running at ~22-25 tok/s for fast inference. The AMD Ryzen 7 5700X handles fast tokenization as a CPU fallback.

Question 26

Can you run local LLMs on a Radeon RX 6800M?

Accepted Answer

Yes. The Radeon RX 6800M has 12 GB GDDR6 VRAM and can run local LLMs. On Linux, use ROCm for GPU acceleration. On Windows, use llama.cpp with Vulkan or CPU fallback. Llama 3 8B Q4_K_M runs at ~12 tok/s on Linux with ROCm.

Question 27

Can you run RAG on 2 GB RAM?

Accepted Answer

Yes — but only for small personal document sets using Llama 3.2 1B (~750 MB) with MiniLM-L6-v2 embeddings (~80 MB) and an in-memory vector store fitting ~1.3–1.5 GB total on a 2 GB device. Larger models (7B+) and larger document sets (200+ pages) need 8 GB minimum.

Question 28

Best local LLM for a 16 GB RAM laptop?

Accepted Answer

For a 16 GB RAM laptop without a dedicated GPU, Qwen3 8B (Q4_K_M) is the best all-rounder — it uses ~6 GB and runs ~8–15 tok/s on a modern CPU. Gemma 3 12B is the strongest model that still fits (tighter and slower); Phi-4-mini (~3.5 GB) is best for weaker machines; Llama 3.1 8B is a balanced alternative, and Qwen3-Coder is the pick for coding. Apple Silicon laptops (M-series) run these 3–4× faster via unified memory. With 32 GB RAM you can step up to 14B models.

Question 29

What is the CO-STAR prompt framework?

Accepted Answer

CO-STAR is a six-part prompt structure for consistent LLM output: Context (background), Objective (task), Style (writing style), Tone (emotional register), Audience (who reads it), Response (output format). It helps produce targeted outputs by making every constraint explicit and reduces ambiguity in instructions.

Question 30

Best LLM right now?

Accepted Answer

For cloud coding tasks, Claude Opus 4.8 achieves 87.6% on SWE-Bench, while GPT-5.5 Instant leads general chat with 52.5% fewer hallucinations than prior versions. For cloud use: Claude Opus 4.8 leads on coding and long documents, GPT-5.5 Instant on general chat, Gemini 2.5 Pro on multimodal tasks. For local use: Llama 4 Scout if you have 24 GB VRAM; Qwen 3 14B for 12 GB VRAM.

Question 31

Is Qwen GDPR compliant?

Accepted Answer

Qwen run locally on your own hardware is GDPR-compliant because no prompt data leaves your infrastructure and no Article 44 third-country transfer occurs. The Qwen API via Alibaba Cloud is a different story — it requires Standard Contractual Clauses and a Transfer Impact Assessment like any non-EU cloud provider.

Question 32

Is DeepSeek GDPR safe to use?

Accepted Answer

DeepSeek API poses the highest GDPR risk of any major LLM because servers are subject to Chinese data-access law (PIPL), there is no EU adequacy decision for China, and the Terms of Service explicitly reserve the right to share data with Chinese authorities. DeepSeek local open-weight models carry a different, lower risk profile.

Question 33

Can a local LLM help with GDPR compliance?

Accepted Answer

Yes — running an open-weight model locally eliminates the Article 44 third-country data transfer that makes cloud AI legally complex under GDPR, meaning your prompts and responses never leave your server. Local models like Qwen 3 14B or Llama 4 Scout can handle HR, legal, and medical text entirely on-premises.

Question 34

What is the best GPU under $300 for running local LLMs?

Accepted Answer

Used RTX 3060 12GB at ~$200-250 is the best GPU under $300 for local LLMs — 12 GB VRAM runs all 7B and most 14B models.

Question 35

What is the best GPU under $600 for local LLMs?

Accepted Answer

RTX 4060 Ti 16GB at ~$424 is the sweet spot — 16 GB VRAM handles 14B models at Q5 quantization with room to spare.

Question 36

What SSD gives the fastest local LLM model loading?

Accepted Answer

Samsung 990 Pro 2TB at 7,450 MB/s loads a 7B Q4 model in under 2 seconds. For those with a PCIe 5.0 motherboard slot, the Samsung 9100 Pro (~$350) now matches the 990 Pro on price while doubling the read speed.

Question 37

Is the Mac Mini M4 good for running local LLMs?

Accepted Answer

Yes — Mac Mini M4 Pro with 24 GB unified memory runs Llama 3 8B at ~36 tok/s via MLX. Best value Apple option at $1,599.

Question 38

RunPod vs Vast.ai — which is cheaper for cloud GPU rental?

Accepted Answer

Vast.ai is cheaper for spot instances (RTX 4090 at ~$0.30-0.55/hr vs RunPod ~$0.69/hr). RunPod is more reliable with guaranteed uptime.

Question 39

How much does a cloud GPU cost per hour in 2026?

Accepted Answer

RTX 4090: $0.15-0.44/hr. A100 80GB: $1.10-2.00/hr. H100: $2.89-4.00/hr. Cheapest for inference: Vast.ai spot.

Question 40

Which VPN should I use for downloading large AI models?

Accepted Answer

ProtonVPN (Swiss, free tier) for audited privacy. Mullvad (€5/mo flat) for maximum anonymity. NordVPN for 9,300+ RAM-only servers across 110+ countries. Surfshark (~$2/month) for the lowest price. ExpressVPN for fastest download speeds on large model files.

Question 41

MLX vs Ollama vs llama.cpp: which inference engine should you use?

Accepted Answer

On Apple Silicon, use MLX — it runs ~65 tok/s versus ~35 tok/s for Ollama on an M5 Pro with an 8B model. On NVIDIA GPUs, use Ollama for simplicity or llama.cpp for maximum control. Ollama uses llama.cpp under the hood and adds an API layer on top.

Question 42

How do you convert an Ollama model to MLX format?

Accepted Answer

You cannot directly convert Ollama models to MLX. Instead, download the original GGUF or SafeTensors weights from Hugging Face, then convert with mlx-lm convert. For most popular models (Llama 3, Qwen, Mistral), pre-converted MLX versions already exist on Hugging Face under the mlx-community organization.

Question 43

Does Ollama support MLX on Apple Silicon?

Accepted Answer

No. Ollama uses llama.cpp with Metal GPU acceleration on Apple Silicon — not MLX. Metal acceleration is fast but not as optimized as native MLX. For MLX-speed inference, use mlx-lm directly or LM Studio, which supports both MLX and llama.cpp backends.

Question 44

What quantization level is best for 6 GB VRAM?

Accepted Answer

Q4_K_M is the sweet spot — 7B/8B models at Q4_K_M use 4.7–4.9 GB, leaving 1.1 GB for the KV-cache. Q5_K_M fits but requires limiting context to 2k tokens. Avoid Q6_K and above on 6 GB cards.

Question 45

Mistral Small 24B vs Qwen 3 14B vs Llama 3.3 8B: which should I run locally?

Accepted Answer

Pick by VRAM: Llama 3.3 8B (4.9 GB), Qwen 3 14B (9.3 GB), Mistral Small 3.1 24B (14.4 GB). Qwen 14B wins at 12 GB VRAM. Mistral Small 24B wins above 16 GB on reasoning tasks.

Question 46

Does Strix Halo (Ryzen AI Max) work with Ollama via Vulkan?

Accepted Answer

Yes — Ryzen AI Max (Strix Halo, RDNA 3.5) runs Ollama via Vulkan on Linux. With 96 GB unified memory on the MAX 395, it fits Qwen 32B and even Llama 70B Q4_K_M — models no single desktop GPU can hold.

Question 47

Best Qwen model for coding?

Accepted Answer

Qwen3-Coder 32B is the best Qwen coding model if you have 24 GB VRAM (91.5% HumanEval). At 8 GB VRAM, the 7B version scores 79.7% and runs at 8–15 tok/s. The 14B is the sweet spot for most developers at 12 GB VRAM.

Question 48

Can you run DeepSeek V3 locally?

Accepted Answer

DeepSeek V3 is a 671B MoE model. Running it locally at Q4_K_M requires approximately 400 GB of RAM — well beyond any consumer hardware. The practical alternative is DeepSeek-R1-Distill-Qwen-32B (20.5 GB VRAM, consumer-viable) which delivers strong reasoning at a fraction of the size.

Question 49

Is it better to prompt local LLMs in Chinese or English?

Accepted Answer

It depends on the model and task. For Qwen3 and DeepSeek-R1-Distill models, Chinese prompts use 30–50% fewer tokens (CJK tokenisation is denser) and produce more natural Chinese output. English prompts produce stronger step-by-step reasoning chains on most models. The best practice: write instructions in English, let the model respond in Chinese.

Question 50

Best model for Chinese roleplay in SillyTavern?

Accepted Answer

Qwen3-72B Q4_K_M is the best local model for Chinese roleplay — native Chinese training, rich vocabulary, and 128K context. Yi-34B excels at emotional character depth. For users with 8 GB VRAM, Qwen3-7B runs well at 8–12 tok/s.

Question 51

Which VPN works best for AI development tools from China in 2026?

Accepted Answer

NordVPN (obfuscated servers, works reliably for HuggingFace and GitHub) and ExpressVPN (Lightway protocol, fastest for model file downloads) are the two most reliable options. Surfshark works as a budget alternative. Mullvad often fails GFW bypass. Free VPNs are blocked.

Question 52

What are the best local LLM apps for Android in Japan?

Accepted Answer

MLC Chat, PocketPal AI, and Ollama via Termux are the best options for Android users in Japan. Japanese models like Rinna 3.6B and ELYZA-7B run fully locally and support the Japanese Play Store.

Question 53

Which local LLM models support Japanese best?

Accepted Answer

The best Japanese local LLM depends on your task. For conversation: Rinna 3.6B (runs on 4 GB RAM). For instruction following: ELYZA-7B. For coding with Japanese: Qwen3-Coder. All run via Ollama.

Question 54

Can you run a local LLM on an Xperia phone?

Accepted Answer

Yes — the Xperia 1 VI (12 GB RAM, Snapdragon 8 Gen 3) runs Rinna 3.6B and Phi-4 Q4 via MLC Chat. The Xperia 5 V (8 GB) handles lightweight models. The Xperia 10 VI (6 GB) is limited to TinyLlama and Gemma 2B.

Question 55

What is the best mini PC for local LLMs available in Japan?

Accepted Answer

The best mini PC for local LLMs in Japan is the Beelink SER7 (Ryzen 7 7840HS, 32 GB DDR5 RAM) at ~¥70,000 on Amazon.co.jp. Ollama runs out of the box; the AMD Radeon 780M iGPU supports Vulkan acceleration.

Question 56

What is the best value GPU for local LLMs in Japan?

Accepted Answer

The RTX 3060 12GB at ~¥40,000 new (¥25,000 used) is the best value GPU for local LLMs in Japan. 12 GB VRAM runs every 7B model at 20–25 tok/s with zero CUDA setup friction.

Question 57

What are the current AI model knowledge cutoff dates?

Accepted Answer

Verified cutoffs: GPT-5.5 August 2025 (ChatGPT searches Bing by default; GPT-4o legacy Oct 2023); Claude Opus 4.8 January 2026 (reliable cutoff); Grok 4.3 November 2024 (searches X); Gemini 3.1 Pro January 2025 (native Google Search); DeepSeek-V3 July 2024; Gemma 3 27B August 2024; Phi-4 June 2024; Qwen2.5 December 2023. Several major models — including Mistral Large, Llama 4, and Qwen3 — have not publicly disclosed exact cutoff dates. Local LLMs have no web search and their cutoff is absolute.

Question 58

How much VRAM does each DeepSeek-R1 distill need?

Accepted Answer

At Q4_K_M (Ollama default): 1.5B ≈ 4 GB, 7B ≈ 5.5 GB, 8B ≈ 6 GB, 14B ≈ 9.5 GB, 32B ≈ 20.5 GB, 70B ≈ 42 GB. Q8_0 is about 2× the Q4_K_M size and FP16 about 4×, so the 32B at FP16 needs a 64 GB-class setup.

Question 59

Which DeepSeek-R1 distill should I run on my GPU?

Accepted Answer

Find your card: RTX 3060 12GB → 7B, RTX 4060 Ti 16GB → 14B, RTX 4070/4080 → 14B or 32B, RTX 4090 → 32B, dual-GPU/48 GB → 70B. For the best small model on 8 GB, run DeepSeek-R1-0528-Qwen3-8B. Each runs with one Ollama command at Q4_K_M.

Quick Answers to Local LLM Questions

AQuantization & VRAM

BOllama

CTool Comparisons

DModel Comparisons

EHardware-Specific

FQuick Answers

GPrompt Engineering

HPrivacy & Compliance

VRAM	Best Model (May 2026)	Quantization	Use Case
4 GB	Phi-4 Mini	Q4	Basic chat, small tasks
6 GB	Llama 3 8B	Q4_K_M	Daily chat and coding
8 GB	Mistral 7B	Q5_K_M	Quality + speed balance
12 GB	Qwen 14B	Q4_K_M	Coding and reasoning
16 GB	Qwen 32B	Q4_K_M	Complex multi-step tasks
24 GB	Llama 70B	Q4_K_M (partial)	Near-production quality
48+ GB	Llama 70B	Q5_K_M or higher	Full precision models