Key Takeaways
- Nine layers, 87 projects, one map. Runtimes, desktop apps, web UIs, coding assistants, RAG systems, agent frameworks, voice/multimodal, mobile clients, and specialized productivity plugins – almost every popular project in 2026 fits in exactly one layer.
- Pick a runtime first. Ollama is the right default for ~95% of readers; llama.cpp is the foundational engine underneath most other tools; vLLM is the production-serving pick for multi-user setups.
- Most layers above the runtime are optional. A desktop app OR a web UI is enough for chat. Add a coding harness only when you want IDE integration; add a RAG system only when you want to chat with your own documents; add an agent framework only when one-shot calls stop being enough.
- Licence matters for commercial use. MIT and Apache 2.0 dominate the ecosystem. AGPL appears on a handful of UIs (text-generation-webui, KoboldCpp, Jan, SillyTavern) – fine for personal use, worth a closer look before commercial deployment. The "License" column below names every one explicitly.
- Multi-tool stacks are normal. Ollama + Open WebUI + AnythingLLM + Continue.dev is a single-machine setup that covers chat, RAG, and coding without compromise. The "Common Real-World Stacks" table below names the recipes that actually work in 2026.
1. Local LLM Runtimes & Inference Engines
A runtime is the engine that loads model weights into memory and turns prompts into tokens. It is the first decision in a local-LLM stack and the one that constrains everything above it – every desktop app, web UI, and coding harness ultimately calls a runtime. Ollama dominates user-facing share in 2026 because it ships an OpenAI-compatible API and a one-command install; llama.cpp is the C++ engine underneath most of the others; vLLM is the right pick when you need to serve concurrent users on a real GPU.
| Tool | Link | Description | License |
|---|---|---|---|
| Ollama | ollama.com | Easiest overall – one-command install, OpenAI-compatible API, huge model library | MIT |
| llama.cpp | github.com/ggml-org/llama.cpp | Foundational C++ engine behind most other tools, runs anywhere including Apple Silicon | MIT |
| vLLM | github.com/vllm-project/vllm | High-throughput serving for multi-user GPU deployments | Apache 2.0 |
| LocalAI | localai.io | Drop-in OpenAI API replacement supporting multiple backends | MIT |
| TensorRT-LLM | github.com/NVIDIA/TensorRT-LLM | NVIDIA-optimized inference for enterprise GPU rigs | Apache 2.0 |
| MLC LLM | mlc.ai/mlc-llm | Mobile and edge device deployment runtime | Apache 2.0 |
| SGLang | github.com/sgl-project/sglang | Structured inference serving for agent pipelines | Apache 2.0 |
| ExLlamaV2 | github.com/turboderp-org/exllamav2 | Fast quantized inference optimized for RTX GPUs | MIT |
| KoboldCpp | github.com/LostRuins/koboldcpp | Lightweight llama.cpp wrapper with built-in UI | AGPL 3.0 |
| Llamafile | github.com/Mozilla-Ocho/llamafile | Single-file portable LLM execution by Mozilla | Apache 2.0 |
| MLX-LM | github.com/ml-explore/mlx-examples | Apple Silicon-native runtime by Apple research | MIT |
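Nearly everything above the runtime layer talks to it over the same protocol, which is why the runtime choice is swappable later. A minimal sketch, assuming Ollama is serving on its default port with a model already pulled (the model name here is just an example):

```python
# Minimal chat call against a local Ollama instance through its
# OpenAI-compatible endpoint; swapping base_url is enough to target
# vLLM, LocalAI, or LM Studio's server mode instead.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible API
    api_key="ollama",                      # client requires a key; Ollama ignores it
)

resp = client.chat.completions.create(
    model="llama3.1",  # example model name -- use whatever you have pulled
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
)
print(resp.choices[0].message.content)
```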
Deeper guide: llama.cpp vs Ollama vs vLLM
2. Desktop GUI Apps
Desktop apps wrap a runtime in a chat interface and a model browser. They are where most non-technical users start because there is no terminal step β download, click, chat. LM Studio, Jan, and GPT4All hold the bulk of the user base in 2026; AnythingLLM doubles as a desktop app and a RAG layer; Open Interpreter is the outlier that lets a local model drive your computer.
| Tool | Link | Description | License |
|---|---|---|---|
| LM Studio | lmstudio.ai | Most polished GUI, built-in HuggingFace model browser, server mode | Free (closed) |
| Jan | jan.ai | Privacy-focused offline ChatGPT clone, fully open-source | AGPL 3.0 |
| GPT4All | nomic.ai/gpt4all | Beginner-friendly with strong CPU-only support | MIT |
| AnythingLLM | anythingllm.com | RAG and document chat with built-in vector store | MIT |
| Msty | msty.app | Clean consumer UX, multi-provider support | Free (closed) |
| Cherry Studio | cherry-ai.com | Multi-provider desktop AI with extensive customization | Apache 2.0 |
| Faraday | faraday.dev | Character chat and roleplay desktop client | Free (closed) |
| Enchanted | enchantedlabs.ai | Native macOS/iOS minimal Ollama client | MIT |
| h2oGPT | github.com/h2oai/h2ogpt | Enterprise-feature-heavy desktop and server | Apache 2.0 |
| Open Interpreter | github.com/OpenInterpreter/open-interpreter | Lets local LLM control your computer and execute code | AGPL 3.0 |
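Open Interpreter is worth one concrete illustration, because "lets a local model drive your computer" is easy to misread. A hedged sketch based on its documented Python API – the model id and endpoint are assumptions for a stock Ollama install:

```python
# Sketch of pointing Open Interpreter at a local model; attribute names
# follow its documented Python API, and the model/endpoint values are
# assumptions for a default Ollama setup.
from interpreter import interpreter

interpreter.offline = True                           # never fall back to a hosted model
interpreter.llm.model = "ollama/llama3.1"            # LiteLLM-style provider/model id
interpreter.llm.api_base = "http://localhost:11434"  # local Ollama endpoint

# The model proposes shell/Python code; by default you confirm each step
# before it executes on your machine.
interpreter.chat("List the five largest files in my home directory.")
```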
Deeper guide: LM Studio vs Jan vs GPT4All
3. Web UIs & Browser Frontends
Web UIs are self-hosted ChatGPT clones – same conversational surface, but you point them at a runtime running on your own machine or LAN. They are the natural choice when you want multi-device access (laptop, phone, tablet hitting one server) or team usage. Open WebUI dominates the self-hosted segment in 2026, with LibreChat as the team-features alternative and SillyTavern as the dedicated roleplay UI.
| Tool | Link | Description | License |
|---|---|---|---|
| Open WebUI | openwebui.com | Most popular self-hosted ChatGPT-like UI with built-in RAG | BSD 3-Clause |
| LibreChat | librechat.ai | Multi-model ChatGPT alternative with team features | MIT |
| text-generation-webui | github.com/oobabooga/text-generation-webui | Power-user UI with extensive plugin ecosystem | AGPL 3.0 |
| SillyTavern | github.com/SillyTavern/SillyTavern | Roleplay and character chat with lorebooks | AGPL 3.0 |
| LobeChat | lobehub.com | Modern polished UI with plugin marketplace | MIT |
| Big-AGI | github.com/enricoros/big-AGI | Advanced multi-provider frontend with personas | MIT |
| NextChat | github.com/ChatGPTNextWeb/NextChat | Lightweight web chat, simple deployment | MIT |
| Page Assist | github.com/n4ze3m/page-assist | Browser sidebar AI for Chrome and Firefox | MIT |
| Chatbox | chatboxai.app | Cross-platform desktop and web client | GPL 3.0 |
Deeper guide: SillyTavern vs Agnai vs RisuAI
4. Coding Assistants & IDE Integrations
Coding assistants connect a local LLM to your editor or terminal via OpenAI-compatible APIs. The choice is mostly about which workflow primitive you want: autocomplete-in-editor (Continue.dev), autonomous agent edits (Cline, OpenHands), or git-native diff edits at the terminal (Aider). All three patterns work against any runtime that speaks the OpenAI Chat Completions protocol – Ollama is the most common backend in 2026.
| Tool | Link | Description | License |
|---|---|---|---|
| Continue.dev | continue.dev | VS Code and JetBrains autocomplete and chat with local models | Apache 2.0 |
| Aider | aider.chat | Terminal pair programmer with multi-file edit support | Apache 2.0 |
| Cline | cline.bot | Autonomous coding agent for VS Code | Apache 2.0 |
| Tabby | tabby.tabbyml.com | Self-hosted GitHub Copilot alternative | Apache 2.0 |
| CodeGPT | codegpt.co | IDE integrations across multiple editors | MIT |
| OpenHands | github.com/All-Hands-AI/OpenHands | AI software engineer agent (formerly OpenDevin) | MIT |
| Cursor (local mode) | cursor.com | AI-first code editor with local model support | Free (closed) |
| Twinny | github.com/twinnydotdev/twinny | Free Copilot alternative for VS Code | MIT |
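To make the terminal pattern concrete: aider also exposes a small scripting interface, so the same git-native edit loop can be driven from Python. A sketch following its documented scripting API – the model id and target file are placeholders:

```python
# Scripted aider session against a local model, per aider's documented
# scripting interface; the model id and file name are placeholder choices.
# For Ollama backends, set OLLAMA_API_BASE=http://127.0.0.1:11434 first.
from aider.coders import Coder
from aider.models import Model

model = Model("ollama/qwen2.5-coder")  # any LiteLLM-style local model id
coder = Coder.create(main_model=model, fnames=["app.py"])

# aider edits app.py in place and records the change as a git commit
coder.run("add a --verbose flag to the argument parser")
```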
Deeper guide: Continue.dev vs Cline vs Aider
5. RAG & Document Chat Systems
RAG (Retrieval-Augmented Generation) systems combine a local LLM with an embedding model and a vector store so the model can answer from your own documents. The split is between turnkey apps (AnythingLLM, PrivateGPT, Quivr, Khoj) that "just work" and framework libraries (LlamaIndex, Haystack, txtai) that you build on. RAGFlow has gained share in 2026 specifically for documents that need citation-grade retrieval.
| Tool | Link | Description | License |
|---|---|---|---|
| AnythingLLM | anythingllm.com | Easiest all-in-one personal RAG with workspaces | MIT |
| PrivateGPT | github.com/zylon-ai/private-gpt | Fully offline enterprise-leaning RAG | Apache 2.0 |
| Quivr | github.com/QuivrHQ/quivr | Self-hosted personal knowledge assistant | Apache 2.0 |
| Khoj | khoj.dev | Personal AI second brain, syncs with Obsidian and Notion | AGPL 3.0 |
| Dify | dify.ai | AI workflow builder with RAG and agent support | Modified Apache 2.0 |
| Flowise | flowiseai.com | Visual LangChain workflow builder | Apache 2.0 |
| Langflow | langflow.org | Visual AI orchestration with RAG components | MIT |
| LlamaIndex | llamaindex.ai | RAG framework / Python library – foundation for custom builds | MIT |
| Haystack | haystack.deepset.ai | Search and RAG framework by deepset | Apache 2.0 |
| RAGFlow | ragflow.io | Deep document understanding for RAG with citation extraction | Apache 2.0 |
| txtai | github.com/neuml/txtai | Embedded vector + LLM database in one library | Apache 2.0 |
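Underneath every tool in this table the loop is the same: embed the documents once, embed the question, retrieve by similarity, stuff the winners into the prompt. A deliberately bare sketch against Ollama's REST API – model names are assumptions, and a real system would swap the in-memory list for a proper vector store:

```python
# Bare-bones RAG over a handful of snippets, using only Ollama's REST API:
# /api/embeddings for retrieval, /v1/chat/completions for the answer.
import math
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

docs = [
    "Invoices are due within 30 days of receipt.",
    "Support tickets are answered within one business day.",
    "Refunds require a receipt and are processed in 5-7 days.",
]
index = [(d, embed(d)) for d in docs]  # tiny in-memory "vector store"

question = "How fast are refunds processed?"
q = embed(question)
best = max(index, key=lambda pair: cosine(q, pair[1]))[0]  # top-1 retrieval

r = requests.post(f"{OLLAMA}/v1/chat/completions", json={
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": f"Answer using only this context: {best}"},
        {"role": "user", "content": question},
    ],
})
print(r.json()["choices"][0]["message"]["content"])
```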
Deeper guide: AnythingLLM vs PrivateGPT vs Open WebUI
6. Agent Frameworks & Orchestration
Agent frameworks turn one-shot LLM calls into multi-step workflows – plan, act, observe, repeat. LangChain remains the general-purpose default; CrewAI and AutoGen specialise in role-based multi-agent setups; LangGraph is the right pick when state management matters across long-running flows. All eight frameworks below run cleanly against a local Ollama backend.
| Tool | Link | Description | License |
|---|---|---|---|
| LangChain | langchain.com | General-purpose LLM application framework | MIT |
| LlamaIndex | llamaindex.ai | RAG-focused agent and data framework | MIT |
| CrewAI | crewai.com | Multi-agent role-based workflows | MIT |
| AutoGen | github.com/microsoft/autogen | Microsoft multi-agent orchestration framework | MIT (code) / CC-BY-4.0 (docs) |
| Semantic Kernel | learn.microsoft.com/semantic-kernel | Microsoft enterprise orchestration SDK in C#/Python/Java | MIT |
| LangGraph | langchain-ai.github.io/langgraph | Stateful graph-based agent workflows | MIT |
| Letta (formerly MemGPT) | letta.com | Long-term memory agents | Apache 2.0 |
| Pydantic AI | ai.pydantic.dev | Type-safe agent framework built on Pydantic | MIT |
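All eight frameworks ultimately wrap the plan-act-observe loop the intro describes. A minimal hand-rolled version using OpenAI-style tool calling against Ollama – the single tool, model choice, and iteration cap are assumptions, and this is exactly the boilerplate the frameworks exist to hide:

```python
# Hand-rolled plan-act-observe loop via OpenAI-style tool calling against
# a local Ollama backend; requires a tool-capable model (llama3.1 is an
# example). Agent frameworks wrap this same loop in higher abstractions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def word_count(text: str) -> str:
    return str(len(text.split()))

tools = [{
    "type": "function",
    "function": {
        "name": "word_count",
        "description": "Count the words in a piece of text",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}]

messages = [{"role": "user",
             "content": "How many words are in 'the quick brown fox'?"}]

for _ in range(5):  # iteration cap so a confused model cannot loop forever
    msg = client.chat.completions.create(
        model="llama3.1", messages=messages, tools=tools,
    ).choices[0].message
    if not msg.tool_calls:       # no tool requested: the plan is complete
        print(msg.content)
        break
    messages.append(msg)         # keep the model's tool request in history
    for call in msg.tool_calls:  # act: execute each requested tool
        args = json.loads(call.function.arguments)
        messages.append({        # observe: return the result to the model
            "role": "tool",
            "tool_call_id": call.id,
            "content": word_count(**args),
        })
```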
Deeper guide: Local AI Agents With MCP
7. Voice, Speech & Multimodal
Voice and multimodal stacks extend a local LLM beyond text – speech in (STT), speech out (TTS), and vision. Whisper.cpp and faster-whisper own the local STT layer; Piper and Coqui share the TTS layer, with XTTS v2 dominating voice cloning; LLaVA and Ollama vision models cover the vision side. A fully-offline voice assistant is buildable from this layer plus a small chat model.
| Tool | Link | Description | License |
|---|---|---|---|
| Whisper.cpp | github.com/ggerganov/whisper.cpp | Local speech recognition, runs on CPU or GPU | MIT |
| faster-whisper | github.com/SYSTRAN/faster-whisper | Fast Whisper transcription via CTranslate2 | MIT |
| Piper TTS | github.com/rhasspy/piper | Lightweight local text-to-speech | MIT |
| Coqui TTS | coqui.ai | Open-source voice synthesis with multiple model options | MPL 2.0 |
| XTTS v2 | docs.coqui.ai/en/latest/models/xtts.html | Voice cloning with multilingual support | CPML (non-commercial) |
| Bark | github.com/suno-ai/bark | Generative voice with non-speech sounds | MIT |
| StyleTTS 2 | github.com/yl4579/StyleTTS2 | High-quality natural-sounding TTS | MIT |
| LLaVA | llava-vl.github.io | Local vision + language model | Apache 2.0 |
| Ollama vision models | ollama.com | Local vision via Ollama (Llama 3.2 Vision, LLaVA, etc.) | Various |
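As a proof of the fully-offline claim, the whole round trip is three calls: faster-whisper for speech-to-text, any small chat model for the reply, Piper for text-to-speech. A sketch with placeholder file and model names – the Piper voice is a separately downloaded .onnx asset:

```python
# Offline voice round trip under stated assumptions: faster-whisper for
# STT, Ollama for the reply, Piper's CLI for TTS. Audio file, model
# sizes, and the Piper voice file are all placeholder choices.
import subprocess
import requests
from faster_whisper import WhisperModel

stt = WhisperModel("base.en", device="cpu", compute_type="int8")
segments, _ = stt.transcribe("question.wav")  # speech -> text
heard = " ".join(s.text for s in segments).strip()

r = requests.post("http://localhost:11434/v1/chat/completions", json={
    "model": "llama3.2:3b",  # small chat model, assumption
    "messages": [{"role": "user", "content": heard}],
})
reply = r.json()["choices"][0]["message"]["content"]

# text -> speech via the Piper CLI, which reads the text from stdin
subprocess.run(["piper", "--model", "en_US-lessac-medium.onnx",
                "--output_file", "reply.wav"], input=reply.encode())
```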
Deeper guide: Build a Local Voice Assistant on Your Phone
8. Mobile & Edge Clients
Mobile clients run a quantised model directly on the phone using Apple Neural Engine, Qualcomm NPU, or pure CPU inference. The MLC LLM project is the foundational layer; consumer apps (PocketPal AI, Private LLM, LLM Farm, Layla) wrap it with a chat UI. Flagship phones in 2026 run 2-4B models at usable speeds (8-15 tokens/sec); 7B is on the edge of feasibility for top-tier hardware.
| Tool | Link | Description | License |
|---|---|---|---|
| MLC Chat | mlc.ai/mlc-llm | Cross-platform mobile LLM runtime | Apache 2.0 |
| PocketPal AI | github.com/a-ghorbani/pocketpal-ai | Free iOS and Android local LLM client | MIT |
| Private LLM | privatellm.app | Polished iOS and macOS local LLM app | Paid (closed) |
| LLM Farm | github.com/guinmoon/LLMFarm | iOS local LLM with model browser | MIT |
| Layla | layla-network.ai | Android-first local LLM app | Free (closed) |
| Maid | github.com/Mobile-Artificial-Intelligence/maid | Open-source Flutter mobile LLM app | MIT |
| Enchanted | enchantedlabs.ai | Native iOS/macOS Ollama client | MIT |
| Chapper | prevolut.uk | Native Ollama and LM Studio mobile client | Free |
| RikkaHub | github.com/rikkahub/rikkahub | Open-source Android local AI | MIT |
| AnythingLLM Mobile | anythingllm.com | Remote access to your local AnythingLLM workspace | MIT |
Deeper guide: Best Local LLM Apps for iPhone in 2026
9. Specialized & Productivity Tools
Specialized tools embed local LLMs into apps you already use – note-taking platforms (Obsidian, Logseq, Joplin), autonomous task agents (AutoGPT, BabyAGI, MetaGPT), and roleplay frontends (Agnai, RisuAI). These are not generic chat surfaces; they are workflow-specific integrations that assume you already have a host app and a runtime.
| Tool | Link | Description | License |
|---|---|---|---|
| Smart Connections | github.com/brianpetro/obsidian-smart-connections | Obsidian semantic search and chat plugin | GPL 3.0 |
| Copilot for Obsidian | github.com/logancyang/obsidian-copilot | Obsidian local LLM chat plugin | AGPL 3.0 |
| Text Generator | github.com/nhaouari/obsidian-textgenerator-plugin | Obsidian content generation plugin | MIT |
| logseq-copilot | github.com/logancyang/logseq-copilot | Logseq plugin for local and cloud LLM chat, same author as Obsidian Copilot | AGPL 3.0 |
| BMO Chatbot | github.com/longy2k/obsidian-bmo-chatbot | Obsidian chatbot with local LLM | MIT |
| Joplin AI | joplinapp.org | Joplin notes with local AI integrations | MIT |
| AutoGPT (local) | github.com/Significant-Gravitas/AutoGPT | Autonomous task agent with Ollama support | MIT |
| BabyAGI | github.com/yoheinakajima/babyagi | Lightweight autonomous agent | MIT |
| MetaGPT | github.com/geekan/MetaGPT | Multi-agent software company simulation | MIT |
| Agnai | agnai.chat | Roleplay frontend with character cards | MIT |
| RisuAI | github.com/kwaroran/RisuAI | Mobile-friendly roleplay frontend | GPL 3.0 |
Deeper guide: Local LLM With Obsidian in 2026
Common Real-World Stacks
If you do not want to work through nine categories, pick the closest stack below and copy it. Each row pairs a real goal with a tested combination and the hardware floor it actually runs on.
| Goal | Stack | Hardware floor |
|---|---|---|
| Just chat casually | LM Studio standalone | 16 GB RAM, no GPU |
| Best balance for power users | Ollama + Open WebUI | 16 GB RAM, optional GPU |
| Document chat | Ollama + AnythingLLM | 16 GB RAM, optional GPU |
| Coding | Ollama + Continue.dev | 16 GB RAM + GPU recommended |
| Roleplay / creative | KoboldCpp + SillyTavern | 16 GB RAM, GPU recommended |
| Privacy-first business | Ollama + Open WebUI + PrivateGPT | 32 GB RAM + 12 GB VRAM |
| Mobile / on-the-go | MLC Chat or PocketPal AI | iPhone 13+ / Pixel 7+ |
| Apple Silicon | Ollama (Metal backend) or LM Studio | M2/M3/M4/M5 with 16+ GB unified |
| Multi-user team | vLLM + Open WebUI | 32+ GB RAM + multi-GPU |
How This Directory Stays Current
This directory is reviewed every six months (next refresh: November 2026). Inclusion criteria: project is actively maintained (commits in the last 90 days), has a verifiable open-source licence or a clear commercial-use statement, and either holds meaningful user share in 2026 or fills a layer that would otherwise be empty. Projects that go inactive for more than two release cycles are removed; new entrants that pass the criteria are added at the next review. To suggest a project for inclusion, open an issue or PR against the PromptQuorum repository – include the project URL, licence, and a one-sentence description in the format above.
Sources
- ggml-org/llama.cpp GitHub – primary source for runtime architecture and supported models.
- Ollama Library – official model catalogue and runtime documentation.
- LM Studio Documentation – feature reference for the dominant desktop GUI.
- Open WebUI Documentation – feature reference for the dominant self-hosted web UI.
- Hugging Face Hub – primary location for downloading model weights consumed by every runtime listed above.
- awesome-local-llm GitHub list – community-maintained inventory used as a sanity check for project inclusion.
FAQ
What is the difference between a local LLM runtime and a desktop app?
A runtime (Ollama, llama.cpp, vLLM) is the engine that loads model weights and serves an API – typically OpenAI-compatible. A desktop app (LM Studio, Jan, GPT4All) is a chat UI that calls a runtime under the hood. Some apps bundle their own runtime (LM Studio embeds llama.cpp); others require you to install a runtime separately (Open WebUI calls Ollama). The runtime decides what is possible; the app decides what is convenient.
Can I use multiple tools from this list at the same time?
Yes – most stacks combine 2-4 tools. A common setup: Ollama as the runtime, Open WebUI for chat, AnythingLLM for document chat, and Continue.dev for coding – all four run against the same Ollama instance on a single machine. The "Common Real-World Stacks" table above lists the recipes that work without conflict.
Which tools work fully offline with no telemetry?
Ollama, llama.cpp, vLLM, Jan, GPT4All, Open WebUI, AnythingLLM, PrivateGPT, Continue.dev, Aider, KoboldCpp, Llamafile, MLX-LM, and most of the AGPL/MIT-licensed apps in this directory work fully offline once the model is downloaded. LM Studio and several closed-source tools have optional analytics that can be disabled in settings β verify by running a packet capture once after install. Browser-based UIs (Open WebUI, LibreChat) are local-only when configured to use a local backend.
Are any of these commercial-licensed (not free for commercial use)?
A handful: LM Studio, Msty, Faraday, Layla, and Cursor are closed-source – generally free to use but not redistributable, and commercial terms vary. Private LLM is paid, and XTTS v2's CPML licence excludes commercial use. AGPL-licensed tools (Jan, KoboldCpp, text-generation-webui, SillyTavern, Khoj, Open Interpreter, Copilot for Obsidian) are free for any use including commercial, but AGPL requires you to disclose source if you modify them and offer them to users over a network. Apache 2.0 and MIT projects (the majority) are usable in any context, commercial included, with no obligations beyond the licence text.
Which tools support Apple Silicon (M-series chips) natively?
Ollama, llama.cpp, MLX-LM, LM Studio, Jan, Enchanted, GPT4All, MLC Chat, AnythingLLM, and most Electron/Tauri apps run natively on Apple Silicon and use the Metal backend. MLX-LM is Apple-specific and the fastest for large models on M-series. vLLM, TensorRT-LLM, and ExLlamaV2 are NVIDIA-focused and either do not run or run poorly on Apple Silicon – for Apple users, Ollama with the Metal backend is the default.
Do all these tools support GGUF model format?
GGUF is the native format for llama.cpp and any tool that wraps it (Ollama, LM Studio, Jan, GPT4All, KoboldCpp, Llamafile). vLLM and TensorRT-LLM use their own optimised formats (typically AWQ or FP16) for higher throughput. ExLlamaV2 uses EXL2 quantisation. MLX-LM uses MLX-converted weights. Most listed tools accept GGUF; a few (vLLM, TensorRT-LLM, ExLlamaV2, MLX-LM) require a one-time conversion step from the original Hugging Face weights.
Which tools are best for users with no coding experience?
GPT4All has the simplest install (one click, runs on 8 GB RAM). LM Studio is the most feature-rich without requiring a terminal. Jan is the most privacy-conscious of the no-code options. For document chat without command-line work, AnythingLLM is the easiest. All four are listed in the Desktop GUI Apps category above.
Can I run these tools on a server and access them remotely?
Most server-capable tools (Ollama, vLLM, LocalAI, Open WebUI, LibreChat, PrivateGPT, AnythingLLM) expose an HTTP API and bind to a network interface configurable in settings. Standard pattern: run Ollama on a home server or VPS, run a UI on your laptop or phone pointing at the server's IP. Treat the API like any web service β bind to localhost behind a reverse proxy, or to a private network with proper authentication. Open WebUI ships with multi-user support out of the box.
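A minimal sketch of the client side of that pattern – the IP is a placeholder for your server, which must be told to listen beyond localhost (for Ollama, the documented setting is the OLLAMA_HOST=0.0.0.0 environment variable):

```python
# Any OpenAI-compatible client can target a runtime on another machine;
# 192.168.1.50 is a placeholder LAN address for your server.
from openai import OpenAI

client = OpenAI(base_url="http://192.168.1.50:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="llama3.1",
    messages=[{"role": "user", "content": "Reply with one word: online."}],
)
print(resp.choices[0].message.content)
```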
Which tools support multi-user / team setups?
Open WebUI, LibreChat, h2oGPT, AnythingLLM (with admin features enabled), and Dify are designed for multi-user use, with role-based access and per-user conversation history. vLLM is the right serving layer underneath when concurrent inference matters β it batches requests across users for throughput unattainable on Ollama at concurrency above ~3.
How often does this directory get updated?
Every six months – the next scheduled refresh is November 2026. Mid-cycle changes (a project goes inactive, a new tool gains meaningful share, a licence changes) get patched into the existing entry. Entirely new categories or layers wait for a refresh to keep the structure stable. The "Sources" section above lists the community indexes used to spot-check what the ecosystem is actually doing between refreshes.