Ollama June 2026 Update: v0.30.8 + Top 10 Open-Source Models

By Hans Kuepper, Founder of PromptQuorum (multi-model AI dispatch tool) · 9 min read

The current Ollama version is v0.30.8 (June 12, 2026), which adds broader GGUF hardware support and an upgraded Apple Silicon MLX engine. The newest models added this month are MiniMax M3 (open-weight, 1M-token context + native vision), NVIDIA Nemotron 3 Ultra, and DeepSeek V4 Pro. Best overall on consumer hardware is Qwen 3.6 27B (77.2% SWE-bench, fits 24 GB at Q4). Other top picks: Kimi K2.6 (frontier coding), gpt-oss:20b (best small / 16 GB), qwen3:30b (balanced all-round), DeepSeek-R1 (reasoning), Gemma 4 (vision/tool calling), and Llama 4 Scout (long-context 10M / multimodal). Most downloaded overall remains the Llama family.

Key Takeaways

  • Best overall on consumer hardware: Qwen 3.6 27B (77.2% SWE-bench, fits 24 GB at Q4). Balanced all-round: qwen3:30b.
  • Most downloaded: Llama 3.2 3B (tutorials) and the Llama family -- widest tool support.
  • Best reasoning: DeepSeek-R1 (chain-of-thought) and gpt-oss:20b (adjustable reasoning, ~o3-mini level).
  • Best coding: Kimi K2.6 (frontier MoE), Qwen 3.6 27B (best dense), Devstral Small 24B (best agentic), qwen3-coder:30b (completion) -- highest benchmarks at their sizes.
  • Best small / 16 GB: gpt-oss:20b. Best vision/multimodal: Gemma 4 (E4B+). Best long-context (10M) / large multimodal: Llama 4 Scout (~55 GB).
  • As of June 2026, the Ollama library contains 4,500+ models. All are available via `ollama pull <name>`.

📍 In One Sentence

The best Ollama model in June 2026 is Qwen 3.6 27B (77.2% SWE-bench, fits 24 GB at Q4); best for coding is Kimi K2.6.

💬 In Plain Terms

Ollama is a free tool that lets you run AI models on your own computer. No API key is needed, and no internet connection is required once a model has been downloaded. These are the top 10 models you can download with a single command.

What Is New in Ollama in June 2026?

Current Ollama version: v0.30.8 (released June 12, 2026). This is the latest stable release, available via ollama.com/download. Update on Linux with `curl -fsSL https://ollama.com/install.sh | sh` (macOS: `brew upgrade ollama`), then confirm with `ollama --version`.

What changed in the v0.30 series (May–June 2026): Ollama v0.30 broadened GGUF model compatibility through llama.cpp, extending hardware support beyond Apple Silicon, and the MLX engine was upgraded on June 11, 2026 for its fastest Apple Silicon inference yet — higher-quality output using less memory. Point releases through v0.30.8 added Gemma 4 QAT weights (June 5), Hermes Desktop (June 7), improved prompt/KV-cache reuse, and Windows config-path fixes. Full notes: github.com/ollama/ollama/releases.

Newest models added this month (June 2026):

  • MiniMax M3 (MiniMax, June 1, 2026) — Newest open-weight flagship: the first model to combine frontier coding (SWE-Bench Pro 59.0), a 1M-token context window, and native image/video input. Rolling out to the Ollama library — confirm availability with `ollama pull minimax-m3`.
  • NVIDIA Nemotron 3 Ultra (NVIDIA, June 4, 2026) — Built for high-throughput reasoning and long-running agent workflows. NVIDIA Open Model License. Pull: `ollama pull nemotron3-ultra`
  • DeepSeek V4 Pro (DeepSeek, April 23, 2026) — Algorithmic-coding specialist, 93.5% LiveCodeBench, MIT license. Budget sibling DeepSeek V4 Flash for lighter hardware. Pull: `ollama pull deepseek-v4-pro`
  • Kimi K2.6 (Moonshot AI, April 20, 2026) — Frontier coding model, SWE-Bench Pro 58.6, SWE-bench Verified 80.2%. MoE architecture (32B active / 1T total). Modified MIT license. Pull: `ollama pull kimi-k2.6`
  • Qwen 3.6 27B (Alibaba, April 16, 2026) — Best overall on consumer hardware, 77.2% SWE-bench, Apache 2.0, fits 24 GB at Q4. Also Qwen3.6-35B-A3B (MoE, 73.4 SWE-bench). Pull: `ollama pull qwen3.6:27b`
  • GLM-5.1 (Z.ai, April 7, 2026) — 744B / 40B active MoE, MIT license, SWE-Bench Pro 58.4. Structured code generation leader. Pull: `ollama pull glm-5.1`
  • gpt-oss (OpenAI, 2026) — Open-weight MoE: gpt-oss:20b (21B total / 3.6B active, runs in 16 GB, ~o3-mini level, adjustable reasoning) and gpt-oss:120b (80 GB). Pull: `ollama pull gpt-oss:20b`
  • Gemma 4 (Google, April 2, 2026) — Multimodal sizes E2B / E4B / E12B (26B MoE) / E27B (31B dense), all with vision and tool calling. QAT weights added June 5, 2026. E4B runs in ~6 GB VRAM. Pull: `ollama pull gemma4:e4b`
```bash
# Update Ollama to the latest version (v0.30.8) on Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or on macOS: brew upgrade ollama

# Check your current version
ollama --version  # outputs: ollama version is 0.30.8

# Pull the newest June 2026 models
ollama pull minimax-m3
ollama pull deepseek-v4-pro
ollama pull kimi-k2.6
```
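To confirm that an update actually took effect, the version string can be compared against a minimum in a script. The following is a minimal sketch: the helper names `parse_version` and `version_ge` are made up for illustration, and it assumes `ollama --version` prints a line containing `version is X.Y.Z`.

```bash
# Extract the first dotted version number from `ollama --version` output.
# Assumes the output contains "version is X.Y.Z" (hypothetical helper).
parse_version() {
  sed -n 's/.*version is \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Return 0 if version $1 >= version $2 (three-part numeric versions only).
version_ge() {
  # Sort both versions field by field; the smaller one comes out first.
  lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
  [ "$lowest" = "$2" ]
}

# Only runs when the ollama binary is on PATH.
if command -v ollama >/dev/null 2>&1; then
  installed=$(ollama --version 2>&1 | parse_version)
  if version_ge "$installed" 0.30.8; then
    echo "Ollama $installed meets the 0.30.8 minimum"
  else
    echo "Ollama $installed is older than 0.30.8, consider updating"
  fi
fi
```

The comparison is numeric per field, so 0.30.8 correctly ranks above 0.9.0, which a plain string comparison would get wrong.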

Which Ollama Models Work Best for Your Use Case?

The quality of a model's output depends heavily on how you prompt it. For structured techniques that work across all local models (chain-of-thought, few-shot examples, and output formatting), see the prompt engineering guide. For reasoning tasks, chain-of-thought prompting significantly improves DeepSeek-R1 and Qwen3 output quality. For quantization tradeoffs, see the quantization guide; for how much VRAM each model needs, see the VRAM requirements guide; for hardware requirements, see the hardware guide. For agent workflows with Gemma 4, see Tree-of-Thought and ReAct. Once a tool-calling model from this list is wired into a multi-step loop with file and database access, see Local AI Agents With MCP for the open-source orchestration pattern.

📍 In One Sentence

For general chat use Qwen 3.6 27B, for coding use Kimi K2.6 or Qwen3-Coder, for reasoning use DeepSeek-R1, for vision use Gemma 4 E4B.

💬 In Plain Terms

Different AI models excel at different tasks — like how a calculator beats a word processor at math. This section matches the right model to each job.

  • General chat (beginner): `ollama run llama3.2:3b` -- most documentation, best-supported first model.
  • General chat (best overall): `ollama run qwen3.6:27b` -- 77.2% SWE-bench, best overall on consumer hardware, fits 24 GB at Q4. Balanced all-round: `ollama run qwen3:30b`. For 8 GB machines, keep `ollama run llama3.2:3b`.
  • Long-context / multimodal: `ollama run llama4:scout` -- 10M-token context + multimodal, MoE (17B active/109B total). Needs ~55 GB VRAM at Q4; it fits in 24 GB only with 1.78-bit quantization, at about 20 tok/s.
  • Best small / 16 GB: `ollama run gpt-oss:20b` -- 21B total / 3.6B active MoE, ~o3-mini level, adjustable reasoning. Larger: `ollama run gpt-oss:120b` (80 GB).
  • Coding on 8 GB: `ollama run qwen3:8b` -- Best local coding model for 8 GB VRAM machines. 76% HumanEval, 5 GB used, multilingual.
  • General inference on 8 GB (if not coding): `ollama run mistral:7b` -- Fastest general-purpose model at 8 GB, 40-60 tok/sec.
  • Coding (best agentic, 24B): `ollama run devstral-small:24b` -- Best agentic coding model (multi-file edits, debugging). 16 GB RAM. By Mistral AI.
  • Coding (best dense, 27B): `ollama run qwen3.6:27b` -- 77.2% SWE-bench. Best dense coding model. 22 GB VRAM.
  • Coding (frontier MoE): `ollama run kimi-k2.6` -- SWE-Bench Pro 58.6 (ties GPT-5.5), top tier. MoE (32B active/1T total). Modified MIT license. Needs quantization for consumer hardware.
  • Agent tasks and tool calling: `ollama run gemma4:e4b` -- Released April 2, 2026. Built-in tool calling + vision support. Recommended for local agents, function calling, and structured output. 6 GB RAM.
  • Reasoning and math: `ollama run deepseek-r1:7b` -- chain-of-thought model, best local math performance at 7B.
  • Multilingual: `ollama run qwen3:8b` -- 29+ native languages, strongest non-English support, 76% HumanEval.
  • Image understanding: `ollama run gemma4:e4b` -- vision + tool calling (June 2026). Or `ollama run llama3.2-vision:11b` for dedicated vision.
  • Fast and lightweight: `ollama run gemma2:2b` -- fastest CPU inference, 1.7 GB RAM.
  • High quality (16 GB RAM): `ollama run mistral-small3.1` -- near-70B quality at 14 GB RAM.
  • Embedding generation: `ollama run nomic-embed-text` -- 137M parameter embedding model for RAG pipelines.
  • Document Q&A (RAG): `ollama run llama3.2` with Open WebUI's RAG feature -- best-supported combination.
  • Home automation / wake word AI: `ollama run phi4-mini` -- Phi-4 Mini (3.8B, ~3 GB VRAM) handles Home Assistant voice queries at 20-25 tok/sec on a mini PC without a discrete GPU. See the Home Assistant + Ollama integration guide.
Ollama model selection by use case: pick qwen3.6:27b (best overall, 77.2% SWE-bench) for chat and coding, kimi-k2.6 for frontier coding, gpt-oss:20b on 16 GB, deepseek-r1:7b for math.
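The memory tiers in the list above can be turned into a small lookup for setup scripts. This is a sketch only: `pick_model` is a hypothetical helper, the thresholds (24 GB, 16 GB, below 16 GB) are taken from the general-chat recommendations in this article, and it expects memory as a whole number of gigabytes.

```bash
# Suggest a general-purpose model tag for a given amount of memory in GB.
# Thresholds mirror this article's picks; adjust them for your own workloads.
pick_model() {
  mem_gb=$1
  if [ "$mem_gb" -ge 24 ]; then
    echo "qwen3.6:27b"      # best overall, fits 24 GB at Q4
  elif [ "$mem_gb" -ge 16 ]; then
    echo "gpt-oss:20b"      # best small / 16 GB
  else
    echo "llama3.2:3b"      # entry point for 8 GB machines
  fi
}

pick_model 16
```

A call such as `ollama run "$(pick_model 24)"` would then start the suggested model, provided that tag is available in the library.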

Which Models Were Added to Ollama in June 2026?

These are the newest models in the Ollama library as of June 2026, newest first. Confirm availability with `ollama pull <model>` before building workflows — new models appear at ollama.com/library within days of release.

| Model | Released | Best For | Ollama Command |
|---|---|---|---|
| minimax-m3 | June 1, 2026 | Newest flagship: frontier coding (SWE-Bench Pro 59.0), 1M context, native vision | `ollama run minimax-m3` |
| nemotron3-ultra | June 4, 2026 | NVIDIA: high-throughput reasoning + long-running agents | `ollama run nemotron3-ultra` |
| deepseek-v4-pro | April 23, 2026 | Algorithmic coding, 93.5% LiveCodeBench, MIT | `ollama run deepseek-v4-pro` |
| kimi-k2.6 | April 20, 2026 | Frontier coding (SWE-Bench Pro 58.6), MoE (32B/1T), Modified MIT | `ollama run kimi-k2.6` |
| qwen3.6:27b | April 16, 2026 | Best overall on consumer hardware, 77.2% SWE-bench, fits 24 GB Q4 | `ollama run qwen3.6:27b` |
| qwen3:30b | 2026 | Balanced all-round; qwen3-coder:30b for code completion | `ollama run qwen3:30b` |
| gpt-oss:20b | 2026 | Best small / 16 GB, ~o3-mini, adjustable reasoning (also gpt-oss:120b) | `ollama run gpt-oss:20b` |
| glm-5.1 | April 7, 2026 | Z.ai, 744B/40B active MoE, MIT, SWE-Bench Pro 58.4 | `ollama run glm-5.1` |
| gemma4:e4b | April 2, 2026 | Vision + tool calling (E2B/E4B/E12B/E27B) | `ollama run gemma4:e4b` |
| deepseek-v4-flash | April/May 2026 | Budget coding (78/100 real-world) | `ollama run deepseek-v4-flash` |
| qwen3:8b | 2026 | HumanEval 76% at 8B, multilingual | `ollama run qwen3:8b` |

What Is DeepSeek-R1 and Why Is It Different?

DeepSeek-R1 is a reasoning model -- unlike standard chat models that generate answers directly, DeepSeek-R1 generates explicit chain-of-thought reasoning before its final answer. This significantly improves performance on math, logic puzzles, and step-by-step problem solving.

DeepSeek-R1 7B scores 52% on MATH (competition math) vs 28% for Mistral 7B at the same size. It is slower than standard models because it emits more tokens per response, but it is significantly more accurate on tasks where reasoning matters.

```bash
# Pull and run DeepSeek-R1
ollama run deepseek-r1:7b

# Larger variants for better quality
ollama run deepseek-r1:14b   # 10 GB RAM
ollama run deepseek-r1:32b   # 20 GB RAM
```
DeepSeek-R1 7B vs Mistral 7B: 52% vs 28% on MATH. Chain-of-thought reasoning model -- slower, significantly better accuracy.
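When only the final answer is needed in a script, the reasoning can be filtered out of the output. The sketch below assumes the model writes its reasoning between `<think>` and `</think>` tags that each sit on their own line, which is how DeepSeek-R1 output commonly appears; `strip_think` is a hypothetical helper name, and some Ollama builds may already separate the reasoning for you.

```bash
# Drop everything from a line containing <think> through the line containing
# </think>. Assumes both tags appear on their own lines.
strip_think() {
  sed '/<think>/,/<\/think>/d'
}

# Example with captured output; with Ollama installed this could instead be:
#   ollama run deepseek-r1:7b "What is 17 * 23?" | strip_think
printf '<think>\n17 * 23 = 17 * 20 + 17 * 3\n</think>\n391\n' | strip_think
```

If a model places the tags on the same line as the answer, this filter would remove the answer too, so check a sample of raw output first.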

Which Ollama Models Support Image Input?

As of June 2026, these models on Ollama support image input (multimodal). Gemma 4 supports both vision and tool calling, which is unique among vision models on Ollama.

| Model | RAM | Image Support | Ollama Command |
|---|---|---|---|
| llama3.2-vision:11b | ~8 GB | Yes | `ollama run llama3.2-vision:11b` |
| llama3.2-vision:90b | ~55 GB | Yes | `ollama run llama3.2-vision:90b` |
| gemma3:9b (vision) | ~6 GB | Yes | `ollama run gemma3:9b` |
| minicpm-v:8b | ~5.5 GB | Yes | `ollama run minicpm-v` |
| gemma4:e4b | ~6 GB | Yes + Tool Calling ✓ | `ollama run gemma4:e4b` |
5 Ollama vision models for image input. Gemma 4 E4B (6 GB) now includes tool calling. Llama 3.2 Vision 11B (8 GB) for dedicated vision. All run locally.
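Outside the interactive CLI, images are sent to a vision model as base64 strings in the `images` array of a request to Ollama's `/api/generate` endpoint. The sketch below builds such a request body; `image_payload` is a hypothetical helper, and it assumes the model name and prompt contain no double quotes or backslashes, since it does no JSON escaping.

```bash
# Build a JSON body for POST /api/generate with one base64-encoded image.
# Arguments: model tag, prompt text, path to the image file.
image_payload() {
  model=$1
  prompt=$2
  image_file=$3
  # Encode the file and strip line wraps so the JSON string stays on one line.
  img_b64=$(base64 < "$image_file" | tr -d '\n')
  printf '{"model":"%s","prompt":"%s","images":["%s"],"stream":false}\n' \
    "$model" "$prompt" "$img_b64"
}

# With Ollama running locally, the body could be sent like this
# (photo.jpg is a placeholder file name):
#   curl -s http://localhost:11434/api/generate \
#     -d "$(image_payload gemma4:e4b 'Describe this image' photo.jpg)"
```

Setting `"stream": false` returns a single JSON object rather than a stream of chunks, which is easier to handle in shell scripts.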

What Are the Top 10 Open Source Models on Ollama?

Download counts still favor Llama 3.x due to tutorial prevalence. For new projects in June 2026, prefer Qwen 3.6 27B (best overall on consumer hardware), Kimi K2.6, gpt-oss:20b, and qwen3:30b.

| # | Model | Best For | RAM | Benchmark |
|---|---|---|---|---|
| 1 | Qwen 3.6 27B | Best overall on consumer hardware | 24 GB (Q4) | 77.2% SWE-bench |
| 2 | Kimi K2.6 | Frontier coding, MoE (32B/1T), Modified MIT | Quantized | 58.6 SWE-Bench Pro |
| 3 | gpt-oss:20b | Best small / 16 GB, adjustable reasoning | 16 GB | ~o3-mini |
| 4 | qwen3:30b | Balanced all-round; qwen3-coder:30b for code | ~18 GB | strong |
| 5 | Devstral Small 24B | Agentic coding (multi-file) | 16 GB | 80% HumanEval |
| 6 | deepseek-r1:7b | Reasoning, math | 5 GB | |
| 7 | gemma4:e4b | Vision + tool calling (multimodal) | ~6 GB | |
| 8 | Llama 4 Scout | Long-context 10M + multimodal, MoE | ~55 GB (Q4) | 85% HumanEval |
| 9 | mistral-small3.1 | Quality on 16 GB | 14 GB | 74% HumanEval |
| 10 | Llama 3.2 3B | First model, general chat | 2.5 GB | 60% HumanEval |
Top Ollama models June 2026: Qwen 3.6 27B (best overall, 24 GB Q4), Kimi K2.6, gpt-oss:20b. Llama 4 Scout for 10M-token context (~55 GB).

How Do You Browse the Ollama Model Library?

There are two ways to work with Ollama models:

  • Switch installed models: In the Ollama Mac app, click the model dropdown button at the bottom of the chat input (it shows the current model name, e.g. "gemma3:1b") to switch between any locally installed models.
  • Find and download new models: Visit ollama.com/library to browse 4,500+ models by category, then use the CLI commands below to pull and manage them.

```bash
# List all locally downloaded models
ollama list

# Pull a model by name (find names and tags at ollama.com/library)
ollama pull qwen2.5-coder:32b

# Show details of a local model (parameters, quantization, context length)
ollama show qwen2.5

# Remove a model to free disk space
ollama rm llama3.2:3b
```

How Do Regional Privacy Rules Affect Your Ollama Model Choice?

EU / GDPR + Licence Compliance. For EU organizations deploying Ollama models in production, licence choice matters as much as performance.

  • Apache 2.0 (fully open, commercial use permitted): Mistral Small, Mistral Small 3.1, Qwen3 8B, Qwen 3.6 27B, Devstral Small 24B.
  • Gemma Terms of Use (commercial use permitted, subject to Google's use restrictions): Gemma 2 2B.
  • Meta Llama Community Licence (commercial use restricted above 700M monthly active users): Llama 3.1 8B, Llama 3.2 3B, Llama 3.2 Vision 11B.
  • MIT (commercial use permitted): DeepSeek-R1 7B, DeepSeek-R1 14B.
  • Modified MIT (commercial use permitted with attribution clause): Kimi K2.6.

For EU enterprises in regulated sectors, Mistral models (France, Apache 2.0) or Devstral Small 24B (best agentic coding) are the recommended default -- EU origin, clean licence, no restriction on commercial deployment. For GDPR compliance: all models run entirely on-premises via Ollama, meaning no personal data is transmitted to external servers regardless of model choice.

Japan (METI). For Japanese enterprise Ollama deployments, Qwen3 / Qwen 3.6 is the recommended model family -- native Japanese tokenization processes Japanese text 30-40% more token-efficiently than Llama or Mistral, directly reducing inference time and KV cache requirements. For Japanese coding workflows: Qwen 3.6 27B (77.2% SWE-bench) handles Japanese code comments natively and is the top dense coding model in 2026. METI AI governance documentation requires noting the exact model version. Use `ollama show <model>` to get the full model specification including parameter count, quantization level, and context length for compliance records.

China. Under China's CAC Generative AI Measures (2023), organizations providing AI services to end users must register the models used. Qwen3 / Qwen 3.6 (Alibaba, Apache 2.0) is the recommended choice for Chinese enterprise Ollama deployments -- Chinese model origin, Apache 2.0 licence, best performance on Chinese-language tasks, and top benchmarks. Kimi K2.6 (Moonshot AI, Modified MIT license, 32B active/1T total MoE) is also available as a top-tier coding option with Chinese origin. Run commands: `ollama run qwen3.6:27b` for best quality, `ollama run qwen3:8b` for speed. DeepSeek-R1 (DeepSeek, MIT licence) is appropriate for reasoning tasks. For data processed locally via Ollama, China's PIPL cross-border data transfer requirements do not apply -- inference stays on-premises.

What Are the Common Mistakes When Choosing Ollama Models?

Pulling the largest model tag by default without checking RAM

Running `ollama pull llama3.3` without a tag downloads the default (`latest`) variant, which for llama3.3 is the 70B model at roughly 40 GB. On a machine with 8 GB RAM it will fail to load or cause severe swap usage. Always specify a variant that fits your hardware: `ollama pull llama3.2:3b` for 8 GB machines.
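A rough size check before pulling is simple arithmetic: parameters times bits per weight, divided by 8, gives the weight size in bytes. The sketch below applies that rule of thumb; `estimate_gb` is a hypothetical helper, the figure of 4.5 bits per weight for a 4-bit quantization is an assumed average, and the result covers weights only, so the KV cache and runtime overhead come on top.

```bash
# Estimate weight size in GB from parameter count (in billions) and
# average bits per weight: size = params * bits / 8.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 70 4.5   # 70B model at ~4.5 bits per weight
estimate_gb 3 4.5    # 3B model at ~4.5 bits per weight
```

Under these assumptions a 70B model comes out near 39 GB, in line with the ~40 GB figure above, while a 3B model stays under 2 GB.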

Using a general model when a task-specific model exists

For coding tasks, `qwen2.5-coder:7b` and the general `qwen2.5:7b` both score 72% HumanEval, but `qwen2.5-coder` adds fill-in-the-middle (FIM) support for code completion. For reasoning/math, `deepseek-r1:7b` scores 52% MATH vs 28% for `mistral:7b`. Task-specific models exist in the Ollama library for a reason.

Not verifying a model is available before building a workflow

The Ollama library changes over time -- models are added and occasionally removed. Before building a production pipeline around a specific model, confirm it is still published at ollama.com/library and, on each machine, that it is installed (`ollama list` shows local models only). Pin specific model versions in production workflows: `ollama pull llama3.1:8b-instruct-q4_K_M`.
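A deployment script can make the local check explicit by testing for the exact pinned tag before it starts. This is a sketch: `is_installed` is a hypothetical helper that reads `ollama list` output on stdin and assumes the first column is the model tag and the first line is a header.

```bash
# Exit 0 if the given tag appears in the first column of `ollama list`
# output read from stdin, otherwise exit 1. The header line is skipped.
is_installed() {
  awk -v want="$1" 'NR > 1 && $1 == want { found = 1 } END { exit found ? 0 : 1 }'
}

# With Ollama installed, a pipeline could guard itself like this:
#   ollama list | is_installed llama3.1:8b-instruct-q4_K_M \
#     || ollama pull llama3.1:8b-instruct-q4_K_M
```

The match is on the full tag, so a model pulled without a tag (listed with a `:latest` suffix) will not satisfy a check for a pinned quantization.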

Not specifying a quantization tag for large models

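Running `ollama pull qwen2.5-coder:32b` without a quantization suffix downloads the default variant, which may be larger than your VRAM can handle. Pin the quantization explicitly, for example `ollama pull qwen2.5-coder:32b-instruct-q4_K_M`, and note that a 32B model at Q4_K_M needs roughly 20 GB, so a 16 GB card calls for a smaller size such as `qwen2.5-coder:14b`. After pulling, run `ollama show <model>` to confirm the parameter count and quantization, and `ollama ps` while the model is loaded to see how much memory it actually uses.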

Expecting DeepSeek-R1 to be as fast as standard chat models

DeepSeek-R1 generates explicit chain-of-thought reasoning tokens before its final answer -- this is why it outperforms standard models on math and logic, but it produces 3-5x more tokens per response. For quick chat or one-line answers, use `llama3.1:8b`. Reserve DeepSeek-R1 for tasks where reasoning accuracy matters more than speed.

Frequently Asked Questions

How many models are in the Ollama library?

As of June 2026, the Ollama library contains more than 4,500 models, counting both curated and community contributions. Hugging Face hosts thousands of additional GGUF models that can be loaded via Ollama using custom Modelfiles.

Can I use models from Hugging Face directly in Ollama?

Yes. Download a GGUF file from Hugging Face and create a Modelfile: `FROM ./model.gguf`. Then run `ollama create mymodel -f Modelfile`. This works for any GGUF file including fine-tunes and models not in the official Ollama library.
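A slightly fuller Modelfile might look like the sketch below. The file name `model.gguf`, the model name `mymodel`, the parameter values, and the system prompt are placeholders chosen for illustration; `FROM`, `PARAMETER`, and `SYSTEM` are standard Modelfile instructions.

```bash
# Write an example Modelfile into a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/Modelfile" <<'EOF'
# model.gguf is a placeholder; point FROM at your downloaded GGUF file.
FROM ./model.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are a concise assistant.
EOF

cat "$workdir/Modelfile"

# With the GGUF file placed next to the Modelfile, the model could then be
# registered and started:
#   (cd "$workdir" && ollama create mymodel -f Modelfile && ollama run mymodel)
```

Keeping sampling parameters and the system prompt in the Modelfile means every client that uses `mymodel` gets the same defaults.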

Which Ollama model is best for building a local chatbot?

For a general-purpose local chatbot: `qwen3.6:27b` (best overall on consumer hardware, fits 24 GB at Q4), or `llama3.2:3b` on 8 GB RAM (easiest entry point). For 16 GB machines: `gpt-oss:20b` (~o3-mini level) or `mistral-small3.1`. For a coding assistant chatbot: `qwen3.6:27b` (77.2% SWE-bench), `kimi-k2.6` (frontier MoE), or `devstral-small:24b` (agentic coding). Pair with Open WebUI for a web-based interface that connects to Ollama's API at localhost:11434.

Are all Ollama models truly open source?

Not all. The Ollama library includes models with varying licences. Llama 3.x/4.x use the Meta Llama Community Licence (not OSI-approved open source -- restricts commercial use above 700M monthly active users). Mistral Small, Qwen3, Qwen 3.6, Devstral, and Gemma models are Apache 2.0 (fully open source). Kimi K2.6 is Modified MIT licensed (commercial-friendly with an attribution clause). Always check the licence before commercial deployment.

Which embedding model should I use with Ollama for RAG?

`nomic-embed-text` is the standard choice -- a 137M parameter model that generates 768-dimensional embeddings, runs at milliseconds per document, and is specifically designed for retrieval tasks. Pull it with `ollama pull nomic-embed-text`. Use with Open WebUI's built-in RAG, LangChain's OllamaEmbeddings, or LlamaIndex.
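For a quick test without any framework, embeddings can be requested directly from Ollama's `/api/embed` endpoint. The sketch below only builds the request body; `embed_payload` is a hypothetical helper, and it assumes the input text contains no double quotes or backslashes because it does no JSON escaping.

```bash
# Build a JSON body for POST /api/embed.
# Arguments: embedding model tag, text to embed.
embed_payload() {
  printf '{"model":"%s","input":"%s"}\n' "$1" "$2"
}

embed_payload nomic-embed-text 'Ollama runs models locally'

# With Ollama running locally, the request could be sent like this:
#   curl -s http://localhost:11434/api/embed \
#     -d "$(embed_payload nomic-embed-text 'Ollama runs models locally')"
```

The response is a JSON object whose `embeddings` field holds one vector per input, which a RAG pipeline would store in its vector index.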

How often does the Ollama library get updated with new models?

The Ollama team adds new models within days to weeks of major releases. MiniMax M3 (June 1, 2026), NVIDIA Nemotron 3 Ultra (June 4), Kimi K2.6 and Qwen 3.6 all appeared within days of their releases. The current Ollama version is v0.30.8 (June 12, 2026). Follow the Ollama GitHub repository (github.com/ollama/ollama) or the Ollama Twitter/X account for new model announcements.

What is the difference between `ollama pull` and `ollama run`?

`ollama pull` downloads the model to local storage without starting it. `ollama run` starts an interactive session, downloading the model first if it is not already present. You can pull once and run multiple times without re-downloading.

Can I run multiple models simultaneously on the same machine?

Yes, if your hardware has sufficient VRAM. Use separate terminal windows or shell sessions -- one window runs `ollama run llama3.2` while another runs `ollama run qwen2.5:7b`. Ollama automatically manages VRAM sharing. Monitor `nvidia-smi` or system activity to avoid overload.

How do I update a model to the latest version?

`ollama pull [model-name]` checks for updates and downloads the latest version of that tag if one is available. To stay on a specific build, use an explicit tag: `ollama pull llama3.1:8b` or `ollama pull llama3.1:8b-instruct-q4_K_M`. Available tags are listed on the model's page at ollama.com/library; `ollama show [model-name]` reports the details of the copy you have installed.
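To refresh every installed model in one go, the tags from `ollama list` can be fed back into `ollama pull`. This is a sketch under the assumption that the first column of `ollama list` is the model tag and the first line is a header; `installed_tags` and `update_all` are hypothetical helper names.

```bash
# Print the tag column of `ollama list` output read from stdin,
# skipping the header line and any blank lines.
installed_tags() {
  awk 'NR > 1 && NF > 0 { print $1 }'
}

# Re-pull every installed tag. Defined here but not called automatically,
# since it can download many gigabytes.
update_all() {
  ollama list | installed_tags | while read -r tag; do
    echo "Updating $tag"
    ollama pull "$tag"
  done
}
```

Calling `update_all` re-pulls each tag in turn; tags that are already current should finish quickly because their layers are already present locally.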

Are open source models on Ollama truly free to use commercially?

Most are, but not all. Llama 3.x (Meta Llama Community Licence) restricts commercial use above 700M monthly active users. Mistral Small, Qwen3, and Gemma models use Apache 2.0 (fully commercial-friendly). Always verify the licence before enterprise deployment -- check the model's Hugging Face page or Ollama library entry.

What are the best new Ollama models in June 2026?

The latest additions are Kimi K2.6 (Moonshot AI, Modified MIT -- frontier MoE coding, SWE-Bench Pro 58.6 tying GPT-5.5, 32B active/1T total), Qwen 3.6 27B (Alibaba -- best overall on consumer hardware, 77.2% SWE-bench, fits 24 GB at Q4), GLM-5.1 (Z.ai -- 744B/40B active MoE, MIT, SWE-Bench Pro 58.4), and gpt-oss:20b (OpenAI -- best small / 16 GB, ~o3-mini, adjustable reasoning). Gemma 4 (Google, April 2, 2026; sizes E2B/E4B/E12B/E27B) added vision and tool calling. Run commands (each downloads the model on first use): `ollama run qwen3.6:27b`, `ollama run kimi-k2.6`, `ollama run gpt-oss:20b`, `ollama run glm-5.1`, `ollama run gemma4:e4b`.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider’s official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
