Skip to main content
PromptQuorumPromptQuorum
Home/Power Local LLM/Run Local AI Behind a Firewall: Offline Deployment Guide 2026
Coding Assistants

Run Local AI Behind a Firewall: Offline Deployment Guide 2026

Β·12 min readΒ·By Hans Kuepper Β· Founder of PromptQuorum, multi-model AI dispatch tool Β· PromptQuorum

Download Ollama, your chosen model at Q4_K_M, and any tokeniser files on an internet-connected machine first. Transfer everything to the offline environment via USB, SSD, or internal network share. No internet is needed after the initial download.

Running a local AI behind a corporate firewall or in an air-gapped environment requires downloading every dependency before you lose internet access. One missed file β€” a tokeniser config, a prompt template, a quantised model shard β€” breaks the setup silently. This guide gives you a complete pre-flight checklist, a step-by-step offline workflow for Ollama and llama.cpp, and hardware recommendations for organisations in China or any environment governed by data-residency law.

Slide Deck: Run Local AI Behind a Firewall: Offline Deployment Guide 2026

Interactive slide deck for this article.

Browse the slides below or download as PDF for offline reference. Download Reference Card (PDF)

Key Takeaways

  • Download everything on a connected machine: Ollama binary, GGUF model, tokeniser configs, and any RAG dependencies
  • Transfer via USB SSD, internal network share, or air-gapped laptop β€” never rely on cloud sync
  • Set OLLAMA_MODELS env variable to point to your offline model directory
  • Qwen2.5 14B at Q4_K_M (9.5 GB) is the recommended offline default β€” wide enough capability, fits on 16 GB unified memory
  • NAS sizing: plan 20 GB per 7B model, 50 GB per 14B model, and 100 GB per 32B model at Q4_K_M
  • China Data Security Law: local inference satisfies data residency requirements regardless of model provenance

Pre-flight Checklist β€” Download Before You Go Offline

Check off every item on a connected machine before moving to the air-gapped environment.

  1. 1
    Ollama binary β€” download from ollama.com for your OS (Linux x86_64, macOS arm64, Windows). Version β‰₯0.3.0 recommended.
  2. 2
    Model GGUF file β€” pull via ollama pull qwen2.5:14b-instruct-q4_K_M on the connected machine. Models cache to ~/.ollama/models/.
  3. 3
    Tokeniser + chat template β€” Ollama bundles these with the model manifest; no separate download needed if you use Ollama.
  4. 4
    llama.cpp binary (if using llama.cpp) β€” download a pre-built release from github.com/ggerganov/llama.cpp/releases.
  5. 5
    Embedding model (for offline RAG) β€” ollama pull nomic-embed-text or mxbai-embed-large.
  6. 6
    Vector DB binary (for offline RAG) β€” Chroma standalone, Qdrant binary, or SQLite+sqlite-vss (no Python install required).
  7. 7
    Python wheels (if using Python tooling) β€” download .whl files via pip download with --no-deps and transfer them.
  8. 8
    Verification hash β€” run sha256sum on each GGUF file before transfer to detect corruption.

Download Commands for the Connected Machine

Run all of these on the internet-connected machine before transfer. Replace model tags as needed.

  • ollama pull qwen2.5:14b-instruct-q4_K_M β€” 9.5 GB, recommended default
  • ollama pull qwen2.5:7b-instruct-q4_K_M β€” 5.5 GB, for lower-VRAM machines
  • ollama pull nomic-embed-text β€” 274 MB, for offline RAG embeddings
  • ollama pull deepseek-r1:7b β€” 5.5 GB, if math/reasoning is the primary use case
  • Model files location: ~/.ollama/models/ on Linux/macOS, %USERPROFILE%\.ollama\models on Windows
  • For llama.cpp: download GGUF directly from HuggingFace and verify SHA256 before transfer

Ollama Air-Gap Workflow

After transferring files to the offline machine:

  1. 1
    Copy the entire ~/.ollama/ directory from the connected machine to the same path on the offline host.
  2. 2
    Install the Ollama binary: chmod +x ollama && sudo mv ollama /usr/local/bin/
  3. 3
    Set the model directory: export OLLAMA_MODELS=/path/to/offline/ollama/models
  4. 4
    Start the server: ollama serve β€” verify it starts without network calls in the logs.
  5. 5
    Test offline: ollama run qwen2.5:14b β€” should respond immediately without hitting any external URL.
  6. 6
    Bind to all interfaces for LAN access: OLLAMA_HOST=0.0.0.0:11434 ollama serve

llama.cpp Air-Gap Workflow

llama.cpp is fully self-contained after the binary + GGUF are present β€” no runtime dependencies needed.

  • Transfer the pre-built binary and your GGUF file to the offline machine.
  • Run: ./llama-server -m ./qwen2.5-14b-instruct-q4_K_M.gguf --port 8080
  • The --no-mmap flag disables memory-mapped I/O if running from a network share.
  • Use --n-gpu-layers 35 to offload layers to GPU on NVIDIA; --n-gpu-layers -1 offloads all on Apple Silicon.
  • OpenAI-compatible API available at http://localhost:8080/v1 β€” drop-in for any OpenAI SDK.

NAS Storage Sizing for Offline Model Libraries

A model library for a small team typically holds 3–6 models at different sizes. Plan storage before purchase.

  • Recommended NAS for model storage: Synology DS923+ with 4Γ— 4 TB drives in RAID 5 (~12 TB usable)
  • Minimum for a 2–3 model library: 2 TB SSD (portable drive works for single-machine deployments)
  • NFS mount the NAS to the inference server; set OLLAMA_MODELS to the NFS path

China Data Security Law and CAC Compliance

China's Data Security Law (DSL, 2021) and Cybersecurity Law (CSL) require that important data processed in China be stored domestically. The Cyberspace Administration of China (CAC) additionally requires that AI systems providing public-facing services complete a security assessment before launch.

  • Data residency: Local inference means data never leaves your hardware. This satisfies DSL Article 31 (important data stored in China) regardless of model origin.
  • Model provenance: Qwen2.5 (Alibaba) simplifies internal compliance documentation for enterprises β€” the model vendor is a PRC company. DeepSeek (DeepSeek AI, Hangzhou) is also PRC-origin.
  • Public-facing AI services: If your deployment is user-facing (not purely internal), CAC's Algorithm Security Assessment rules require filing. Internal/offline deployments used by employees only are generally out of scope.
  • Network isolation verification: Use iptables or a firewall rule to confirm no outbound connections from the inference server β€” document this for compliance records.
  • Audit logs: Log prompt-response pairs locally (not to cloud) if required by internal data-governance policy. Ollama does not log by default; add middleware if needed.

Offline RAG Setup

Retrieval-Augmented Generation (RAG) fully offline requires: a local LLM + a local embedding model + a local vector store.

  1. 1
    Embedding model: Pull ollama pull nomic-embed-text on the connected machine. Transfer with the rest of the Ollama models directory.
  2. 2
    Vector store: Chroma can run as a standalone binary (no Python needed); alternatively use Qdrant binary release or SQLite with the sqlite-vss extension.
  3. 3
    Document ingestion: Use LangChain or LlamaIndex offline (install wheels before going offline). Point the document loader to local files β€” no web crawling.
  4. 4
    Query flow: Document β†’ embed via local nomic-embed-text β†’ retrieve top-k chunks from local vector DB β†’ pass to local Qwen2.5 β†’ response. Zero external calls.
  5. 5
    Testing: Confirm with tcpdump -i any -n port 443 that zero HTTPS traffic is generated during a full RAG query cycle.

FAQ

Does Ollama make any network calls when running offline?

By default, Ollama does not make network calls when serving a locally cached model. It contacts ollama.com only to pull or update models. Running OLLAMA_MODELS pointed at a local cache with ollama serve makes no outbound calls.

Can I run Qwen2.5 72B on a NAS-mounted path?

Yes, but expect slower load times (10–30 seconds) due to NFS latency during model loading. Once loaded, inference performance depends only on GPU/CPU VRAM β€” not storage speed.

What is the smallest model that handles Chinese text well offline?

Qwen2.5 7B at Q4_K_M (5.5 GB VRAM). It handles Chinese with native tokenisation and produces coherent responses at 50–80 tok/s on an RTX 3060.

Do I need a CAC security assessment for an internal offline deployment?

Generally no. CAC's Algorithm Security Assessment rules target public-facing AI services. Internal deployments accessible only to employees are out of scope. Consult a compliance professional for your specific situation.

Can llama.cpp run without any system dependencies?

On Linux, the pre-built binary requires GLIBC 2.28+ (standard on Ubuntu 20.04+). On macOS arm64, the binary is self-contained. On Windows, the CUDA build requires CUDA runtime DLLs.

How do I update models in an air-gapped environment?

Download the updated GGUF on a connected machine, verify the SHA256 hash, transfer via USB/SSD, and replace the old GGUF in your model directory. Restart the Ollama server to pick up the new file.

← Back to Power Local LLM