Yes: Ryzen AI Max (Strix Halo, RDNA 3.5) runs Ollama via Vulkan on Linux. With 96 GB of unified memory on the MAX 395, it fits Qwen 32B and even Llama 70B at Q4_K_M; the 70B model is too large for any single 24 GB consumer GPU.
▸ Linux: Ollama detects Strix Halo Vulkan automatically; set OLLAMA_FLASH_ATTENTION=1 for long-context sessions
▸ Ryzen AI Max 395 (96 GB): fits Llama 70B Q4_K_M (~41 GB) and Qwen 32B Q4_K_M (~19 GB) simultaneously in memory
▸ Windows Vulkan path for Strix Halo is experimental; Linux is the stable platform for GPU-accelerated Ollama
Updated: 2026-05
Hardware-Specific · Intermediate
Key Takeaways
✓ Ryzen AI Max 395 (Strix Halo, 40 RDNA 3.5 CUs, 96 GB LPDDR5X) uses the Vulkan backend in Ollama on Linux, which is the GPU path to use when ROCm iGPU support is unavailable
✓ The 96 GB unified memory pool is the key advantage: it fits Llama 70B Q4_K_M (~41 GB), a model that otherwise requires multiple desktop GPUs
✓ Speed on Ryzen AI Max 395: Llama 3.1 8B ~22 tok/s, Qwen 2.5 14B ~13 tok/s, Qwen 2.5 32B ~7 tok/s via Vulkan
✓ Windows support for Strix Halo in Ollama is maturing; Linux via Vulkan is the stable path as of mid-2026
How to Run Ollama with Vulkan on Strix Halo
On Linux, installing the standard Ollama binary is sufficient. It uses llama.cpp with the Vulkan backend, which supports the Strix Halo iGPU (RDNA 3.5, gfx1151) out of the box. No additional ROCm installation is required for the Vulkan path. Run `curl -fsSL https://ollama.com/install.sh | sh` as usual.
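After installation, enable flash attention for better memory efficiency on long sessions. The variable is read by the Ollama server, not by the `ollama run` client, so set it where the server starts: `OLLAMA_FLASH_ATTENTION=1 ollama serve`, or add `Environment="OLLAMA_FLASH_ATTENTION=1"` through `systemctl edit ollama.service` when Ollama runs as a systemd service. Then run `ollama run qwen2.5:14b` as usual. This reduces KV-cache memory pressure at long context lengths and matters most when a large model with a long context starts to fill the 96 GB pool.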
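To get a feel for how much memory the KV cache can take at long context lengths, the following is a rough sketch using the usual estimate for grouped-query attention: keys plus values, for every layer and every token. The layer and head counts are the published Llama 3 shapes; the 16-bit cache element size is an assumption, and the figures Ollama actually allocates will differ with cache type, batch settings, and backend.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """Approximate KV-cache size in bytes for one sequence.

    bytes_per_element=2 assumes a 16-bit cache; this is an assumption,
    not a value read from Ollama.
    """
    # The leading factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_element


# Llama 3 family shapes: 8B has 32 layers, 70B has 80 layers;
# both use 8 KV heads with a head dimension of 128.
llama_8b = kv_cache_bytes(32, 8, 128, context_tokens=8192)
llama_70b = kv_cache_bytes(80, 8, 128, context_tokens=32768)

print(f"Llama 8B, 8k context:   {llama_8b / GIB:.1f} GiB")
print(f"Llama 70B, 32k context: {llama_70b / GIB:.1f} GiB")
```

Under these assumptions the cache is about 1 GiB for the 8B model at 8k tokens and about 10 GiB for the 70B model at 32k tokens, on top of the model weights.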
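To verify that Ollama is using the GPU (not CPU), run `ollama ps` while a model is loaded. The PROCESSOR column shows "100% GPU" when the whole model is offloaded, a CPU/GPU split when only part of it fits, and "100% CPU" when the Vulkan backend did not initialize. In the CPU case, check that the Vulkan loader and the Mesa Vulkan driver are installed. Package names vary by distribution: `vulkan-icd-loader` and `vulkan-radeon` on Arch, `libvulkan1` and `mesa-vulkan-drivers` on Debian and Ubuntu.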
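The same check can be scripted against the Ollama HTTP API. The sketch below assumes a local server on the default port and relies on the `size` and `size_vram` fields that the `/api/ps` endpoint reports per loaded model; the helper names (`gpu_fraction`, `loaded_models`, `report`) are made up for this illustration.

```python
import json
import urllib.request


def gpu_fraction(model):
    """Share of a loaded model's memory that is resident on the GPU (0.0 to 1.0)."""
    size = model.get("size", 0)
    return model.get("size_vram", 0) / size if size else 0.0


def loaded_models(host="http://localhost:11434"):
    """Fetch the list of running models from a local Ollama server."""
    with urllib.request.urlopen(f"{host}/api/ps") as resp:
        return json.load(resp).get("models", [])


def report(models):
    """Return (model name, percent of memory on GPU) for each loaded model."""
    return [(m.get("name", "?"), round(gpu_fraction(m) * 100)) for m in models]
```

With a model loaded, `report(loaded_models())` should return something like a list of `(name, percent)` pairs; a value of 0 for a model points to CPU-only inference.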
| Model | VRAM at Q4_K_M | Speed (MAX 395 Vulkan) | Fits 96 GB? |
|---|---|---|---|
| Llama 3.1 8B | 4.9 GB | ~22 tok/s | Yes |
| Qwen 2.5 14B | 9.3 GB | ~13 tok/s | Yes |
| Qwen 2.5 32B | 19.4 GB | ~7 tok/s | Yes |
| Llama 3.3 70B | ~41 GB | ~3 tok/s | Yes |
| Qwen 2.5 72B | ~43 GB | ~3 tok/s | Yes |
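As a quick way to reason about which combinations of these models can stay loaded at once, here is a small sketch that sums the Q4_K_M footprints listed above against the 96 GB pool. The 8 GB headroom for the operating system and KV caches is an arbitrary assumption for illustration, not a measured figure, and the helper name `fits` is hypothetical.

```python
POOL_GB = 96.0  # unified memory on the MAX 395, as stated in this article

# Approximate Q4_K_M footprints in GB, taken from the figures above.
MODEL_GB = {
    "Llama 3.1 8B": 4.9,
    "Qwen 2.5 14B": 9.3,
    "Qwen 2.5 32B": 19.4,
    "Llama 3.3 70B": 41.0,
    "Qwen 2.5 72B": 43.0,
}


def fits(models, pool_gb=POOL_GB, headroom_gb=8.0):
    """Return (fits, total_gb) for a set of models loaded together.

    headroom_gb is an assumed reserve for the OS and KV caches.
    """
    total = sum(MODEL_GB[name] for name in models)
    return total + headroom_gb <= pool_gb, total


for combo in (
    ["Llama 3.3 70B", "Qwen 2.5 32B"],
    ["Llama 3.3 70B", "Qwen 2.5 72B", "Qwen 2.5 32B"],
):
    ok, total = fits(combo)
    print(f"{' + '.join(combo)}: {total:.1f} GB, fits={ok}")
```

The first combination (about 60 GB) fits with room to spare; adding the 72B model pushes the total past the pool.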
Strix Halo vs RTX 4090: Memory Wins, Speed Loses
The Ryzen AI Max 395 trades GPU speed for memory capacity. An RTX 4090 runs Llama 3.1 8B at ~45 tok/s versus ~22 tok/s on Strix Halo Vulkan. For 7B and 14B models, the RTX 4090 is faster. But the RTX 4090 is capped at 24 GB of VRAM, while the Strix Halo MAX 395 holds 96 GB, enabling model sizes that simply do not fit on a single desktop GPU.
The practical use case for Strix Halo is running 32B to 70B models locally without cloud APIs. Qwen 2.5 32B at Q4_K_M (~19 GB) runs at ~7 tok/s, which is slow for interactive chat but fine for batch summarization, document processing, or other unattended overnight jobs. Llama 3.3 70B at Q4_K_M (~41 GB) is achievable at ~3 tok/s, suitable for high-quality single queries.
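To translate these token rates into wall-clock time, a back-of-the-envelope sketch follows. It counts generated tokens only and ignores prompt processing and model load time, so real jobs will take longer; the function names and the example workload (200 documents, 300 output tokens each) are illustrative assumptions.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time to generate a response, ignoring prompt processing and load time."""
    return output_tokens / tokens_per_second


def batch_hours(n_documents, tokens_per_output, tokens_per_second):
    """Total generation time in hours for a batch processed one item at a time."""
    return n_documents * generation_seconds(tokens_per_output, tokens_per_second) / 3600


# Speeds quoted in this article for the MAX 395 via Vulkan.
print(f"500-token answer, 32B at 7 tok/s: {generation_seconds(500, 7):.0f} s")
print(f"500-token answer, 70B at 3 tok/s: {generation_seconds(500, 3):.0f} s")
print(f"200 summaries x 300 tokens, 32B:  {batch_hours(200, 300, 7):.1f} h")
```

At these rates a single 500-token answer takes roughly one to three minutes, while a batch of a few hundred short summaries on the 32B model finishes within a few hours.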
On Windows, Ollama for Strix Halo falls back to CPU inference by default as of mid-2026, since ROCm iGPU support for gfx1151 is not yet complete in the official Ollama Windows build. The Vulkan path requires building llama.cpp from source with `-DGGML_VULKAN=ON`. Linux is recommended for GPU-accelerated Strix Halo inference until the Windows ROCm path matures.
For comparison with Apple Silicon hardware, see the Mac Mini M4 for local LLMs bite, which covers the alternative unified-memory approach on macOS.
Quick Answers About Strix Halo and Ollama Vulkan
Does AMD Strix Halo support ROCm in Ollama?
Not fully as of mid-2026. ROCm support for gfx1151 (Strix Halo, RDNA 3.5) is in progress but not yet stable in the official Ollama builds. The Vulkan backend is currently the reliable GPU acceleration path on Linux. Check the Ollama GitHub releases page for updates on ROCm iGPU support.
Can I use Ollama with Strix Halo Vulkan on Windows?
Experimentally, yes. The official Ollama Windows build does not expose the Vulkan backend by default for Strix Halo and falls back to CPU. You can enable Vulkan by building llama.cpp from source on Windows with `-DGGML_VULKAN=ON`, which is a manual process. Linux is the recommended platform for Strix Halo Vulkan inference.
What is the largest model that fits on Ryzen AI Max 395?
With 96 GB of unified memory, the Ryzen AI Max 395 fits Llama 3.3 70B at Q4_K_M (~41 GB) or Qwen 2.5 72B at Q4_K_M (~43 GB), each with memory to spare. Qwen 2.5 72B at Q5_K_M (~55 GB) also fits, though speed drops to approximately 2 tok/s. Models whose weights alone exceed the pool (e.g., 70B at FP16, roughly 140 GB) do not fit.
How does Strix Halo compare to Mac Studio M4 Ultra for Ollama?
Mac Studio M4 Ultra has 192 GB unified memory and uses Metal acceleration via llama.cpp, which makes it significantly faster than Strix Halo Vulkan on a per-token basis (~12 tok/s on 70B Q4_K_M vs ~3 tok/s on Strix Halo). For large-model inference speed, M4 Ultra wins. Strix Halo is competitive only in the 8B to 32B range, where it offers a standard Linux workflow.