Key Takeaways
- Phi-4 Mini (3.8B) is the smartest small model in 2026. Best for flagship phones with 8 GB+ RAM; runs at ~13–18 tokens/sec on iPhone 17 Pro and ~10–15 on iPhone 16 Pro. Strongest reasoning per parameter of any sub-4B model.
- SmolLM 2 1.7B is the fastest balanced pick on every tested phone (only the much smaller Gemma 3 1B posts higher raw tokens/sec): ~26–32 tok/sec on iPhone 17 Pro, ~20–28 on Galaxy S25 Ultra. Best when responsiveness matters more than answer depth (snappy chat, autocomplete-style tasks).
- Qwen 2.5 1.5B is the strongest multilingual mobile model. Trained on 35+ languages including Chinese, Japanese, Arabic, and German with native-quality output. Best choice for translation, non-English drafting, and travel use.
- Gemma 3 4B is the balanced default. Slightly slower than Phi-4 Mini on the same hardware but matches it on chat and summarisation. Best when Phi-4 Mini is unavailable in your app or you want Google's training-data mix.
- Gemma 3 1B is the lightweight pick for older phones. Fits in 4 GB RAM (iPhone SE 3rd gen, older Android). Limited multi-step reasoning but produces coherent 1–2 paragraph responses faster than any other model on weak hardware.
- Llama 3.2 3B is the most-tested 3B workhorse. Best tool-calling support among the six, broadest app compatibility, strongest community fine-tunes. Slightly behind Phi-4 Mini on raw quality but more reliable in edge cases.
- Q4_K_M is the standard mobile quantisation in 2026. Preserves ~95% of original quality at one-quarter the file size. Use Q5_K_M or Q6_K only on 12 GB+ phones (iPhone 17 Pro Max) and only if the app supports it.
Quick Facts
- Models tested: Phi-4 Mini 3.8B, Gemma 3 4B, Gemma 3 1B, SmolLM 2 1.7B, Qwen 2.5 1.5B, Llama 3.2 3B (all Q4_K_M GGUF).
- Test devices: iPhone 17 Pro (A19 Pro), iPhone 16 Pro (A18 Pro, 8 GB), Galaxy S25 Ultra (Snapdragon 8 Elite), Pixel 9 Pro (Tensor G5), OnePlus 13 (Snapdragon 8 Elite).
- Inference engines: llama.cpp via PocketPal AI / LLM Farm (default), MLC LLM via MLC Chat (Metal-accelerated on iPhone), Ollama via Termux (Android).
- Memory footprint (Q4_K_M): Phi-4 Mini ~2.7 GB, Gemma 3 4B ~2.9 GB, Llama 3.2 3B ~2.2 GB, Qwen 2.5 1.5B ~1.0 GB, SmolLM 2 1.7B ~1.1 GB, Gemma 3 1B ~720 MB.
- Minimum RAM (active): 6 GB phone for 1.5B–1.7B models; 8 GB phone for 3B–4B models; 4 GB phone for Gemma 3 1B only.
- Fastest tokens/sec on iPhone 17 Pro: Gemma 3 1B ~35–45, SmolLM 2 ~26–32, Qwen 2.5 ~24–32, Llama 3.2 3B ~16–22, Phi-4 Mini ~13–18, Gemma 3 4B ~10–13.
- Source quantisation: all six available as Q4_K_M GGUF on Hugging Face and via PocketPal AI / MLC Chat / LM Studio.
Which Mobile Model Should You Pick?
For most flagship phones (iPhone 16 Pro / 17 Pro, Galaxy S25 Ultra, OnePlus 13), pick Phi-4 Mini (3.8B Q4_K_M). It is the smartest sub-4B model and runs at usable conversational speed. Pick a different model only when you have a specific need it does not cover: speed (SmolLM 2), multilingual (Qwen 2.5), or older-phone compatibility (Gemma 3 1B).
📌 In One Sentence
Pick Phi-4 Mini for flagship 8 GB+ phones (smartest), SmolLM 2 1.7B for speed, Qwen 2.5 1.5B for multilingual, Gemma 3 1B for 4 GB phones, Llama 3.2 3B for tool calling, and Gemma 3 4B as the balanced default when Phi-4 Mini is unavailable.
💬 In Plain Terms
There is no single best mobile model; the right pick depends on your phone and what you do with it. If your phone is from the last two years and has 8 GB or more RAM, install Phi-4 Mini. If you mostly chat in a non-English language, install Qwen 2.5. If you want the fastest replies even at the cost of some quality, install SmolLM 2. If your phone is older or has only 4 GB RAM, install Gemma 3 1B. The differences are real but small enough that any of these will produce coherent answers, though none are cloud-quality.
Decision: Which Mobile Model?
Use a local LLM if:
- Flagship phone with 8 GB+ RAM (iPhone 16 Pro/17 Pro, Galaxy S25 Ultra, OnePlus 13) → Phi-4 Mini 3.8B
- Need the fastest replies with usable quality → SmolLM 2 1.7B
- Non-English use (translation, multilingual chat) → Qwen 2.5 1.5B
- Need broad app compatibility, tool calling, or RAG → Llama 3.2 3B
- Older phone with 4 GB RAM → Gemma 3 1B
- Phi-4 Mini unavailable in your app, need 4B-class quality → Gemma 3 4B
Use a cloud model if:
- Multi-step reasoning, complex code generation, or long-document analysis → use cloud or remote-connect to a home machine running 70B+
- Vision-language tasks (image input, OCR) → cloud apps (mobile vision models in 2026 are limited and slow)
- Long-form creative writing where coherence over 3,000+ tokens matters → cloud or 8B+ on a desktop
Quick decision:
- ✓ Default for most users: Phi-4 Mini 3.8B
- ✓ Fastest with usable quality: SmolLM 2 1.7B
- ✓ Best multilingual: Qwen 2.5 1.5B
💡 Tip: If unsure, start with Phi-4 Mini on a flagship phone or SmolLM 2 1.7B on a mid-range phone; both download in under 5 minutes on a fast connection and are reversible. Try one prompt you actually care about (a real email to summarise, a real question to answer). If the quality feels acceptable, you have your default. If not, swap to a sibling model in 30 seconds via PocketPal AI or LM Studio.
Mobile Model Comparison Table
The four-column table below is the fast extraction layer: pick a row by phone tier or use case. Tokens/sec figures assume Q4_K_M quantisation on iPhone 17 Pro using PocketPal AI (llama.cpp). Numbers are 15–25% lower on iPhone 16 Pro and roughly 10–20% lower on Galaxy S25 Ultra running the same Q4_K_M GGUF via MLC Chat or Termux+Ollama.
📌 In One Sentence
Phi-4 Mini is the smartest, SmolLM 2 1.7B is the fastest balanced pick, Qwen 2.5 1.5B is the best multilingual, Gemma 3 1B is the smallest viable, Llama 3.2 3B is the strongest 3B workhorse, and Gemma 3 4B is the balanced default.
💬 In Plain Terms
Read this table top-to-bottom in size order, or jump to the row that matches your phone tier. The "Best for" column is what to optimise for: pick the row whose strength matters most to you and ignore the others.
| Model | Size | Tokens/sec (17 Pro) | Best for |
|---|---|---|---|
| Phi-4 Mini | 3.8B | ~13–18 | Smartest small model; flagship default |
| Gemma 3 4B | 4B | ~10–13 | Balanced default when Phi-4 Mini unavailable |
| Gemma 3 1B | 1B | ~35–45 | Older phones (4 GB RAM) |
| SmolLM 2 | 1.7B | ~26–32 | Fastest 1.5B+ model; snappy chat |
| Qwen 2.5 | 1.5B | ~24–32 | Best multilingual (35+ languages) |
| Llama 3.2 | 3B | ~16–22 | Strongest 3B option, tool calling, RAG |
Note on speed-quality trade-off: Tokens/sec scales inversely with parameter count when running on the same chip; a 1B model is roughly 3–4× faster than a 3.8B model on identical hardware. Quality scales with parameters but not linearly: Phi-4 Mini (3.8B) reasoning quality is closer to a 7B model than a 1.7B model thanks to Microsoft's training-data mix. Use the table to balance: a faster model gives a quicker reply, a smarter model gives a better answer for hard questions.
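Why speed scales this way: on-device token generation is mostly memory-bandwidth bound, because every new token streams the entire model through memory once, so the ceiling on tokens/sec is roughly bandwidth divided by model size. A back-of-envelope sketch in Python; the bandwidth figure is an illustrative assumption, not a measured value:

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound model:
# each generated token re-reads all model weights from memory once.
def tokens_per_sec_ceiling(model_gb: float, bandwidth_gbs: float) -> float:
    return bandwidth_gbs / model_gb

EFFECTIVE_BANDWIDTH_GBS = 40.0  # hypothetical flagship-phone figure

for name, size_gb in [("Gemma 3 1B", 0.72), ("SmolLM 2 1.7B", 1.1),
                      ("Llama 3.2 3B", 2.2), ("Phi-4 Mini 3.8B", 2.7)]:
    ceiling = tokens_per_sec_ceiling(size_gb, EFFECTIVE_BANDWIDTH_GBS)
    print(f"{name}: ~{ceiling:.0f} tok/sec ceiling")

# The ratios, not the absolute numbers, are the point: 0.72 GB vs 2.7 GB
# predicts the roughly 3-4x gap between Gemma 3 1B and Phi-4 Mini above.
```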
💡 Tip: iPhone 16 Pro tokens/sec is roughly 15–25% lower than iPhone 17 Pro for every model in this table (the A18 Pro vs A19 Pro Neural Engine difference). Galaxy S25 Ultra (Snapdragon 8 Elite) is roughly 10–20% lower than iPhone 17 Pro on the same Q4_K_M GGUF, mostly because Termux+Ollama on Android does not yet leverage the Snapdragon Hexagon NPU the way MLC Chat leverages Apple Metal.
Phi-4 Mini: Smartest Small Model
Phi-4 Mini (3.8B parameters, Microsoft, December 2024) is the smartest sub-4B model in 2026 thanks to a training-data mix optimised for reasoning over breadth. It outperforms Gemma 3 4B and Llama 3.2 3B on chain-of-thought tasks despite being a similar size. Use it as the default on any phone with 8 GB+ RAM.
- Parameters and training: 3.8B parameters; trained on a Microsoft-curated mix of high-quality web text, synthetic reasoning chains, and academic content. Architecture is a Transformer with grouped-query attention.
- Memory footprint: ~2.7 GB at Q4_K_M, ~3.5 GB at Q5_K_M. Fits comfortably on iPhone 16 Pro / 17 Pro (8 GB) and Galaxy S25 Ultra (12 GB) with room for the OS.
- Speed (tokens/sec): iPhone 17 Pro ~13–18, iPhone 16 Pro ~10–15, Galaxy S25 Ultra ~10–15 (Termux+Ollama), iPhone 14 Pro ~6–10 (slow but functional).
- Quality strengths: chain-of-thought reasoning, summarisation, factual Q&A, basic code generation. Outperforms similarly-sized open models on standard benchmarks (MMLU, GSM8K).
- Quality weaknesses: narrower world knowledge than Llama 3.2 3B (less Common Crawl exposure); shorter natural creative writing than Gemma 3 4B; weaker multilingual than Qwen 2.5 1.5B outside English.
- Best for: users with a flagship phone who want the smartest single-model default for English-language chat, summarisation, and reasoning.
💡 Tip: Phi-4 Mini benefits from a system prompt that explicitly invokes step-by-step reasoning ("Think through this carefully before answering"). The training data was heavy on reasoning chains, so prompting in that style consistently produces better answers than terse instructions. For quick chat, no system prompt is needed; the default behaviour is already conversational.
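If you run models outside a chat app, here is a minimal sketch of that prompting style using llama-cpp-python (the same llama.cpp engine PocketPal AI wraps); the GGUF filename and token limit are placeholder assumptions:

```python
from llama_cpp import Llama

# Load the quantised model; the path is a placeholder for wherever the
# Q4_K_M GGUF was downloaded.
llm = Llama(model_path="phi-4-mini-instruct-Q4_K_M.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        # Reasoning-style system prompt that plays to Phi-4 Mini's training mix.
        {"role": "system", "content": "Think through this carefully, step by step, before answering."},
        {"role": "user", "content": "A train leaves at 14:10 and the trip takes 2h 35m. When does it arrive?"},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```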
Gemma 3 4B: Balanced Default
Gemma 3 4B (Google DeepMind, 2025) is the balanced default when Phi-4 Mini is unavailable in your app or you prefer Google's training-data mix. Slightly slower than Phi-4 Mini on identical hardware but matches it on chat and summarisation, with broader natural-language coverage.
- Parameters and training: 4B parameters; trained on Google's curated mix of web text, code, and multilingual data. Same architecture family as Gemma 2 with extended context.
- Memory footprint: ~2.9 GB at Q4_K_M, ~3.7 GB at Q5_K_M. Fits on 8 GB+ phones; tight on 6 GB phones (use Phi-4 Mini or smaller instead).
- Speed (tokens/sec): iPhone 17 Pro ~10–13, iPhone 16 Pro ~7–10, Galaxy S25 Ultra ~7–10 (slightly slower than Phi-4 Mini despite similar size due to architecture differences).
- Quality strengths: natural conversational tone, strong summarisation, broader world knowledge than Phi-4 Mini (Common Crawl exposure), decent multilingual.
- Quality weaknesses: weaker chain-of-thought reasoning than Phi-4 Mini; slower tokens/sec on the same hardware; not always the first model added to mobile apps (lags Phi-4 Mini in PocketPal AI release timing).
- Best for: flagship phone users who want a Google-trained model as a Phi-4 Mini alternative, particularly for everyday chat, summarisation, and short drafting.
💡 Tip: Gemma 3 4B uses a different chat template than Phi-4 Mini, so verify your app uses the correct Gemma template (with <start_of_turn> markers). The wrong template produces broken or repetitive output. PocketPal AI, MLC Chat, and LM Studio detect this automatically; LLM Farm requires manual selection of the Gemma template under Model Settings.
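For reference, this is the shape of a correctly templated Gemma turn. Apps that auto-detect the template build this string for you, so treat this sketch as a debugging aid rather than something you write by hand:

```python
# Shape of a single Gemma-style chat turn. If an app's raw-prompt view is
# missing these markers, the wrong template is selected.
def gemma_prompt(user_message: str) -> str:
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        "<start_of_turn>model\n"  # the model's reply is generated from here
    )

print(gemma_prompt("Summarise this email in two sentences: ..."))
```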
Gemma 3 1B: Lightweight Pick for Older Phones
Gemma 3 1B (Google DeepMind, 2025) is the smallest viable mobile model in 2026: ~720 MB at Q4_K_M, and it runs on 4 GB phones. Quality is limited to short coherent responses (1–2 paragraphs), but it is the only option below 1 GB that produces usable output on weak hardware.
- Parameters and training: 1B parameters; same Gemma 3 family architecture as the 4B model but with reduced training compute. Trained for efficient inference on edge devices.
- Memory footprint: ~720 MB at Q4_K_M, ~900 MB at Q5_K_M. Runs on iPhone SE 3rd gen, iPhone 12 / 13, older Android (4 GB RAM minimum).
- Speed (tokens/sec): iPhone 17 Pro ~35–45, iPhone 16 Pro ~28–38, iPhone 14 ~20–28, older Android (4 GB) ~10–15. Fastest model in this lineup on every device.
- Quality strengths: speed, low memory footprint, coherent short-form responses, low battery drain.
- Quality weaknesses: weak multi-step reasoning, frequent factual errors on niche topics, repetitive on long generations (>500 tokens), shallow conversational depth.
- Best for: users with phones below the 6 GB RAM threshold who still want on-device AI, or anyone optimising for battery life on long flights or in low-power scenarios.
💡 Tip: Use Gemma 3 1B for short, focused tasks: single-sentence summarisation, one-paragraph drafts, quick definitions, simple translation between major language pairs. Avoid asking it for multi-paragraph explanations, multi-step reasoning, or anything where accuracy on niche facts matters. Prompting it to "be concise" plays to its strengths.
SmolLM 2 1.7B: Fastest Tokens per Second
SmolLM 2 1.7B (Hugging Face, 2024) is the fastest balanced model in this lineup on every tested phone; only the far smaller Gemma 3 1B posts higher raw tokens/sec. Expect ~26–32 tok/sec on iPhone 17 Pro and ~20–28 on Galaxy S25 Ultra. Best when responsiveness matters more than answer depth.
- Parameters and training: 1.7B parameters; trained on a Hugging Face-curated mix optimised for small-model efficiency. Architecture tuned for low-latency inference on consumer hardware.
- Memory footprint: ~1.1 GB at Q4_K_M. Fits on any phone with 6 GB+ RAM with substantial OS headroom.
- Speed (tokens/sec): iPhone 17 Pro ~26–32, iPhone 16 Pro ~22–28, Galaxy S25 Ultra ~20–28, iPhone 14 Pro ~15–22. Roughly 2× faster than Phi-4 Mini on the same chip.
- Quality strengths: snappy conversational responses, simple Q&A, autocomplete-style continuation, English-language drafting.
- Quality weaknesses: weaker reasoning than Phi-4 Mini, narrower world knowledge than Llama 3.2 3B, weaker multilingual than Qwen 2.5 1.5B, occasional hallucination on factual queries.
- Best for: mid-range phones where latency matters (text-input autocomplete, voice assistant turn-taking, real-time chat), or older flagships where larger models feel sluggish.
💡 Tip: SmolLM 2 1.7B is the strongest pairing for an offline voice assistant stack on mobile; see Build a Local Voice Assistant on Your Phone for the Whisper + LLM + TTS pipeline. The high tokens/sec keeps voice turn-taking under the ~1.5-second perceptual threshold even on mid-range hardware.
Qwen 2.5 1.5B: Strongest Multilingual Mobile Model
Qwen 2.5 1.5B (Alibaba, 2024) is the strongest multilingual mobile model in 2026, trained on 35+ languages including Chinese, Japanese, Korean, Arabic, German, French, Spanish, and Russian. Best choice for translation, non-English chat, and travel use where the user switches languages mid-conversation.
- Parameters and training: 1.5B parameters; trained on Alibaba's multilingual corpus with strong representation of CJK languages, Arabic, and major European languages. Architecture optimised for multilingual reasoning.
- Memory footprint: ~1.0 GB at Q4_K_M. Fits on any phone with 6 GB+ RAM.
- Speed (tokens/sec): iPhone 17 Pro ~24–32, iPhone 16 Pro ~20–28, Galaxy S25 Ultra ~18–26, iPhone 14 Pro ~14–20. Comparable speed to SmolLM 2.
- Quality strengths: native-quality output in 35+ languages (most small models are English-first with weak multilingual fallback), strong translation between major language pairs, coherent CJK output where Phi-4 Mini and Llama 3.2 produce broken characters.
- Quality weaknesses: English-only reasoning slightly weaker than Phi-4 Mini, shorter natural creative writing than Gemma 3 4B, weaker tool-calling than Llama 3.2 3B.
- Best for: non-English users (especially Chinese, Japanese, German, Spanish, French speakers), travellers needing offline translation, or developers building multilingual mobile features.
💡 Tip: For one-shot translation between two specific languages, Qwen 2.5 1.5B usually beats a larger English-first model running translation as a secondary task. For a German user chatting in German, Qwen 2.5 produces noticeably more natural output than Phi-4 Mini despite being 60% smaller. The rule of thumb: pick the model trained for your primary language, not the model with the most parameters.
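As an illustration, a one-shot translation call via llama-cpp-python; the filename, system prompt wording, and temperature are assumptions rather than app defaults:

```python
from llama_cpp import Llama

llm = Llama(model_path="qwen2.5-1.5b-instruct-Q4_K_M.gguf", n_ctx=2048)

result = llm.create_chat_completion(
    messages=[
        # Pin the language pair and forbid chat so the model stays on task.
        {"role": "system", "content": "You are a translator. Translate German to English. Output only the translation."},
        {"role": "user", "content": "Der Zug nach Berlin fährt heute leider erst um 18 Uhr."},
    ],
    max_tokens=128,
    temperature=0.2,  # low temperature keeps translations literal
)
print(result["choices"][0]["message"]["content"])
```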
Llama 3.2 3B: Reliable 3B Workhorse
Llama 3.2 3B (Meta, 2024) is the most-tested 3B model in 2026: broadest app compatibility, strongest tool-calling support among the six, and the largest community fine-tune ecosystem. Slightly behind Phi-4 Mini on raw quality but more reliable in edge cases and better supported by mobile apps.
- Parameters and training: 3B parameters; trained on Meta's large pretraining corpus with instruction-tuning for chat and tool use. Same Llama 3 architecture as the 8B and 70B siblings.
- Memory footprint: ~2.2 GB at Q4_K_M, ~2.8 GB at Q5_K_M. Fits on 8 GB+ phones with comfortable OS headroom; works on tight 6 GB phones if other apps are closed.
- Speed (tokens/sec): iPhone 17 Pro ~16–22, iPhone 16 Pro ~12–18, Galaxy S25 Ultra ~12–18, iPhone 14 Pro ~7–11.
- Quality strengths: broad world knowledge, robust tool-calling and function-calling support (best-in-class among sub-4B models), reliable chat behaviour, mature ecosystem of fine-tunes for specific tasks (medical, legal, coding).
- Quality weaknesses: weaker chain-of-thought reasoning than Phi-4 Mini, slightly lower MMLU scores at similar size, less natural conversational tone than Gemma 3 4B.
- Best for: mobile apps that need tool calling or function calling (RAG over local documents, on-device agent workflows), or users who want the model with the largest community fine-tune library.
💡 Tip: Llama 3.2 3B is the only model in this lineup with tool-calling support reliable enough for on-device agent workflows; see Local AI Agents with MCP 2026 for the agent layer. Phi-4 Mini and SmolLM 2 can technically emit tool calls, but Llama 3.2 3B is the only one production-ready in 2026.
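To make that concrete, here is the shape of an OpenAI-style tool definition as accepted by llama-cpp-python's chat API. The search_notes tool is hypothetical, and whether a given mobile frontend exposes this plumbing varies by app:

```python
from llama_cpp import Llama

llm = Llama(model_path="llama-3.2-3b-instruct-Q4_K_M.gguf", n_ctx=4096)

# Hypothetical on-device RAG tool the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "search_notes",
        "description": "Search the user's local notes and return matching snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What did I write about the Lisbon trip?"}],
    tools=tools,
)
# A tool-capable model returns a tool_calls entry here instead of plain text;
# the app then runs the tool and feeds the result back as a new message.
print(response["choices"][0]["message"])
```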
Quantisation for Mobile: Q4_K_M Is the Default
Q4_K_M is the standard quantisation for mobile LLM inference in 2026, preserving ~95% of the original model's quality at one-quarter the file size. Use Q5_K_M or Q6_K only on 12 GB+ phones (iPhone 17 Pro Max, Galaxy S25 Ultra) where the extra memory headroom is genuinely free.
📌 In One Sentence
Q4_K_M is the mobile default: ~95% quality at one-quarter size. Q5_K_M / Q6_K are only worth it on 12 GB+ phones.
💬 In Plain Terms
Models on Hugging Face are published at full precision (each parameter stored as a 16-bit number). On phones, you download a quantised version where each parameter is squeezed into 4 bits, making the file four times smaller and inference roughly four times faster, with a small quality cost. Q4_K_M is the variant that everyone in 2026 settled on as the right balance for phones. Higher Q numbers (Q5, Q6, Q8) mean less squeezing and better quality but bigger files; Q4 is the sweet spot for phone constraints.
- Q4_K_M (recommended default): 4-bit K-quant, "M" (medium) variant that keeps a few sensitive tensors at higher precision. ~95% of original quality. Standard for mobile in 2026. All six models available in this format on Hugging Face.
- Q5_K_M (for 12 GB+ phones): 5-bit quantisation. ~98% of original quality. ~25% larger files. Worth it on iPhone 17 Pro Max (12 GB) or Galaxy S25 Ultra (12 GB) for Phi-4 Mini and Llama 3.2 3B; not worth the RAM cost on 8 GB phones.
- Q6_K (rarely needed): 6-bit quantisation. ~99% of original quality. ~50% larger files. Only worth it for memory-rich phones running models you genuinely care about quality on (e.g., long-form drafting where every percentage point of quality matters).
- Q8_0 (avoid on mobile): 8-bit quantisation. ~99.5% of original quality. Roughly 2× the size of Q4_K_M. Not worth the RAM cost on phones; reserve for desktop/laptop use.
- Q3_K_M / Q2_K (only for very constrained phones): 3-bit and 2-bit quantisation. Quality drops to ~85–90%. Use only if Gemma 3 1B at Q4_K_M still does not fit (rare in 2026).
⚠️ Warning: Do not download the same model in multiple quantisations expecting to "test which is best" on a phone. The quality differences between Q4_K_M and Q5_K_M are real but small, and you will burn 5+ GB of phone storage hosting redundant variants. Pick Q4_K_M, run it for a week of real use, and only upgrade to Q5_K_M if you have specific evidence the quality is insufficient.
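The file sizes above follow from simple arithmetic: parameters × bits-per-weight ÷ 8, plus overhead for embeddings and metadata. A quick sketch; the effective bits-per-weight values are approximate averages for each K-quant mix, so treat the outputs as floors rather than exact download sizes:

```python
# Approximate effective bits per weight for common GGUF quantisations.
EFFECTIVE_BITS = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_gguf_gb(params_billions: float, quant: str) -> float:
    # parameters (billions) x bits per weight / 8 bits per byte -> GB
    return params_billions * EFFECTIVE_BITS[quant] / 8

for quant in EFFECTIVE_BITS:
    print(f"Phi-4 Mini (3.8B) at {quant}: ~{approx_gguf_gb(3.8, quant):.1f} GB floor")

# Published files run somewhat larger than these floors (tokenizer,
# embedding tensors, metadata), which is why the Q4_K_M figure quoted
# above is ~2.7 GB rather than ~2.3 GB.
```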
Per-Tier Verdict: Flagship vs Mid-Range vs Budget
Phone tier dictates the model ceiling; chip generation and RAM matter more than brand. A flagship phone (8 GB+ RAM, A18 Pro / A19 Pro / Snapdragon 8 Elite) runs 3.8B–4B models comfortably; a mid-range phone (6–8 GB RAM, older flagship chip) runs 1.7B–3B; a budget or older phone (4–6 GB RAM) runs 1B–1.5B.
📌 In One Sentence
Flagship phones (8 GB+) → Phi-4 Mini 3.8B; mid-range (6–8 GB) → SmolLM 2 1.7B or Llama 3.2 3B; budget or older (4–6 GB) → Gemma 3 1B or Qwen 2.5 1.5B.
💬 In Plain Terms
Match the model to your phone, not your aspirations. A 3.8B model on a 6 GB phone produces frustrating 3-second pauses and crashes when other apps need memory. A 1B model on a flagship phone leaves capability on the table. Pick the largest model your phone can run comfortably with the OS and at least one other app open.
| Phone Tier | Examples | Recommended Model | Why |
|---|---|---|---|
| Flagship (8–12 GB RAM) | iPhone 17 Pro / Pro Max, iPhone 16 Pro, Galaxy S25 Ultra, OnePlus 13 | Phi-4 Mini (3.8B Q4_K_M) | Smartest model the chip sustains at usable speed |
| Older flagship (8 GB RAM) | iPhone 15 Pro, Galaxy S24 Ultra, Pixel 9 Pro | Llama 3.2 3B or Phi-4 Mini | Llama 3.2 3B for tool calling; Phi-4 Mini for raw quality |
| Mid-range (6–8 GB RAM) | iPhone 14 Pro, Pixel 9, Snapdragon 8 Gen 2 phones | SmolLM 2 1.7B or Qwen 2.5 1.5B | Snappy speed; fits with OS headroom |
| Budget / older (4–6 GB RAM) | iPhone 14, mid Snapdragon 7-series, older Android | Gemma 3 1B or Qwen 2.5 1.5B | Smallest viable models that still produce coherent output |
| Very old (4 GB RAM) | iPhone SE 3rd gen, older 4 GB Android | Gemma 3 1B | Only model that fits; limited reasoning, fast tokens/sec |
| Unsupported (<4 GB) | iPhone SE 2nd gen, ancient Android | Remote-connect to home machine instead | On-device LLM not practical; use a tablet/phone as a UI for a home Ollama server |
💡 Tip: For the app side of the equation, see the iPhone and Android sister guides; they cover which apps actually expose each of these models on each platform. App availability sometimes lags model availability: Gemma 3 4B was on Hugging Face six months before PocketPal AI added a one-tap downloader for it. If a model is missing from your app's curated list, it can usually be sideloaded as a GGUF from Hugging Face.
Common Mistakes
- Picking a model larger than the phone's RAM allows. Phi-4 Mini on a 6 GB phone runs at 3–5 tok/sec and crashes when iOS / Android reclaims memory for another app. Match the model to your tier (see the per-tier table above).
- Downloading multiple quantisation variants of the same model. Pick Q4_K_M and stop. Five GB of redundant Q5/Q6 variants on a 256 GB phone is wasted space, and the quality differences are not perceptible in everyday chat.
- Using SmolLM 2 1.7B for multi-step reasoning. It is the fastest model but not the smartest. For chain-of-thought tasks (math, planning, complex reasoning), use Phi-4 Mini even if the slower tokens/sec feels frustrating. Speed without quality is just a faster wrong answer.
- Asking Phi-4 Mini for non-English output without a multilingual prompt prefix. Phi-4 Mini handles common European languages adequately but produces uneven output in CJK or Arabic. For multilingual use, install Qwen 2.5 1.5B alongside Phi-4 Mini and switch per language.
- Expecting cloud-AI quality from any of these models. All six are 1B–4B, which means roughly 60–80% of the capability of GPT-4o on chat tasks and far less on complex reasoning. Use them for what they are good at (private chat, summarisation, drafting, translation) and use cloud or remote-connect for what requires a 70B+ model.
- Confusing Phi-4 Mini (3.8B) with the older Phi-3 Mini (3.8B). They share a parameter count but Phi-4 Mini's training data and chat template are different. Always confirm the model identifier in the GGUF filename: `phi-4-mini-instruct`, not `phi-3-mini-4k-instruct`.
Sources
- Phi-4 Mini technical report, Microsoft Research (December 2024).
- Gemma 3 technical report, Google DeepMind (2025).
- SmolLM 2 model card, Hugging Face (2024).
- Qwen 2.5 technical report, Alibaba Cloud (2024).
- Llama 3.2 model card, Meta AI (2024).
- Q4_K_M quantisation reference, llama.cpp documentation.
FAQ
Which mobile model is fastest on iPhone?
Gemma 3 1B is the absolute fastest at ~35–45 tokens/sec on iPhone 17 Pro, but it is the smallest model in this lineup. Among 1.5B–1.7B models (where speed and quality are balanced), SmolLM 2 1.7B is the fastest at ~26–32 tokens/sec. Among models that produce flagship-quality output, Phi-4 Mini at ~13–18 tokens/sec is the fastest "smart" option. Pick by use case: if responsiveness matters more than depth, SmolLM 2; if depth matters more, Phi-4 Mini.
Does Phi-4 Mini really beat 7B models on phone?
It beats older 7B models (Llama 2 7B, Mistral 7B v0.1) on standard benchmarks like MMLU and reasoning tasks despite being half the size. It does NOT beat current 7B models (Llama 3.1 7B, Mistral 7B v0.3) on raw capability β those still lead on broad knowledge and complex reasoning. The reason Phi-4 Mini punches above its weight is Microsoft's training-data mix (heavy on synthetic reasoning chains and high-quality text). On phone, 7B models are usually too slow to be practical anyway, so Phi-4 Mini wins by default.
Can SmolLM 2 run on a 4-year-old phone?
Yes, on most 4-year-old flagships. SmolLM 2 1.7B at Q4_K_M needs ~1.1 GB RAM for the model plus ~500 MB for inference overhead, which fits on iPhone 13 (6 GB), iPhone 12 Pro Max (6 GB), and equivalent Android (6 GB+). On 4 GB phones from 2021 (iPhone 12, base Android), it technically loads but is unstable under any other memory pressure; use Gemma 3 1B instead.
Which model handles translation best on mobile?
Qwen 2.5 1.5B for any pair involving Chinese, Japanese, Korean, Arabic, German, French, Spanish, or Russian. It was trained with strong multilingual representation and produces native-quality output where English-first models (Phi-4 Mini, Llama 3.2 3B) produce stilted or broken results. For European-language pairs only, Gemma 3 4B is a viable second choice. For one-off translations between English and one specific language, an installed translation app (Google Translate, DeepL) is often better than any local LLM; local models shine when you need translation chained with chat or summarisation in the same conversation.
Do I need a flagship phone to run these well?
No, only for the largest models (Phi-4 Mini 3.8B, Gemma 3 4B, Llama 3.2 3B). Mid-range phones with 6–8 GB RAM run SmolLM 2 1.7B and Qwen 2.5 1.5B at full speed (~20–28 tokens/sec). Budget phones with 4–6 GB RAM run Gemma 3 1B at ~15–25 tokens/sec. The honest answer: if you do not already own a flagship phone, do not buy one for local AI; the smaller models on your existing phone are good enough for most use cases.
Which model has the smallest battery drain?
Gemma 3 1B by a wide margin: the smallest model means the fewest computations per token, which means lower CPU/GPU load and lower power draw. SmolLM 2 1.7B and Qwen 2.5 1.5B are next. The 3B–4B models (Phi-4 Mini, Llama 3.2 3B, Gemma 3 4B) draw 2–3× more power per response. For long flights or extended off-grid use where battery matters most, Gemma 3 1B is the right pick despite the quality cost.
Can mobile models handle multi-turn conversations?
Yes for short conversations (5–10 turns), with quality degrading after that. All six models have 4k–8k token context windows; longer conversations exceed the window and the model loses track of earlier turns. For ongoing chat that needs memory beyond a session, the practical pattern is: summarise the conversation periodically, store the summary, and feed it back as context. Most mobile apps (PocketPal AI, Private LLM) do this automatically; LLM Farm requires manual configuration.
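A minimal sketch of that rolling-summary pattern, assuming a loaded llama-cpp-python model like the earlier examples; the word limit and number of retained turns are arbitrary choices:

```python
def compact_history(llm, history: list, keep_recent: int = 4) -> list:
    """Compress all but the most recent turns into a short summary message."""
    old, recent = history[:-keep_recent], history[-keep_recent:]
    if not old:
        return history  # nothing worth compressing yet
    summary = llm.create_chat_completion(
        messages=old + [{"role": "user",
                         "content": "Summarise our conversation so far in under 100 words."}],
        max_tokens=150,
    )["choices"][0]["message"]["content"]
    # Carry the summary forward as context plus the recent verbatim turns.
    return [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```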
Do these models work with voice input?
Yes, when paired with a Whisper speech-to-text layer. The standard offline mobile voice stack in 2026 is: Whisper (small or tiny model) for speech-to-text → Phi-4 Mini or SmolLM 2 for response generation → Apple TTS or Android TTS for speech synthesis. SmolLM 2 1.7B is the best LLM choice for voice because the high tokens/sec keeps voice turn-taking responsive; see Build a Local Voice Assistant on Your Phone for the full pipeline.
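The same three-stage pattern sketched as desktop Python to show the data flow; mobile apps implement these stages natively, and faster-whisper, the GGUF filename, and pyttsx3 here stand in for the on-device components:

```python
from faster_whisper import WhisperModel  # stage 1: speech-to-text
from llama_cpp import Llama              # stage 2: response generation
import pyttsx3                           # stage 3: text-to-speech

stt = WhisperModel("tiny")
llm = Llama(model_path="smollm2-1.7b-instruct-Q4_K_M.gguf", n_ctx=2048)
tts = pyttsx3.init()

# Transcribe a recorded question to text.
segments, _ = stt.transcribe("question.wav")
question = " ".join(segment.text for segment in segments)

# Generate a short spoken-style answer; brevity keeps turn-taking snappy.
answer = llm.create_chat_completion(
    messages=[{"role": "user", "content": question}],
    max_tokens=120,
)["choices"][0]["message"]["content"]

tts.say(answer)
tts.runAndWait()
```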
Which is best for offline travel use?
For travel where you switch languages and need translation: Qwen 2.5 1.5B. For travel where you mostly need an English-language reference (questions, summarising travel docs, drafting emails): Phi-4 Mini on a flagship phone, SmolLM 2 1.7B on a mid-range phone. Travel use is the strongest case for local AI overall β no roaming data needed, no cloud-API costs, and no risk of cloud dependencies failing in low-connectivity areas. Download the model before the trip; it works for the whole journey on a single charge if used moderately.
Are mobile models still useful in 2027?
Yes, but the specific model names will change. The mobile small-LLM frontier moves roughly every 6–9 months: by Q4 2026 there will likely be new ~3B models that outperform Phi-4 Mini, and by mid-2027 the 1B–2B class will likely match what 3B–4B models do today. The category does not become obsolete; specific picks rotate. Re-check this article (refresh due 2026-11-08) for the next-generation lineup.