On-Device AI & Memory: Why HBM Memory Drives Local AI Speed (2026)

By Hans Kuepper, Founder of PromptQuorum (multi-model AI dispatch tool) · 11 min read

The decode phase of LLM inference is bandwidth-bound, not compute-bound: tokens/sec ≈ memory_bandwidth / model_size_in_bytes. The Galaxy S26's LPDDR5X (85.6 GB/s) caps a 7B model quantized to Q4 (3.5 GB) at ~24 tokens/sec. A data-center GPU with HBM3E (1.229 TB/s per stack) reaches 100+ tokens/sec on the same model. The 14x bandwidth gap explains the speed difference. SK Hynix holds 62% HBM market share; Samsung focuses on LPDDR5X-PIM (processing-in-memory) to reduce data movement. HBM4 (>2 TB/s) arrives in 2026-2027. This memory bottleneck is fundamental to why local AI will always be slower than cloud: you cannot fit HBM into a phone.

Memory bandwidth, not compute TOPS, is the bottleneck for AI inference. The Galaxy S26 (Exynos 2600) has LPDDR5X at 85.6 GB/s; data centers use HBM3E at 1.229 TB/s—a 14x difference. This gap explains why 7B-parameter models run on phones at 8–15 tokens/sec but data-center GPUs handle 100+ tokens/sec. Samsung and SK Hynix are the key players: SK Hynix dominates HBM (62% market share), while Samsung is pushing LPDDR5X-PIM (Processing-In-Memory) to narrow the gap. This guide explains the memory bottleneck, the role of Samsung and SK Hynix, and what it means for on-device AI in 2026 and beyond.

Key Takeaways

  • Memory bandwidth, not TOPS, is the bottleneck for the decode phase of AI inference. Formula: tokens/sec ≈ memory_bandwidth / model_size_in_bytes. A 7B model in FP16 (14 GB) at LPDDR5X 85.6 GB/s = ~6 tokens/sec. The same model quantized to Q4 (3.5 GB) = ~24 tokens/sec. On a data-center GPU with HBM3E (1.229 TB/s per stack), the FP16 model reaches ~88 tokens/sec. The gap is 14x, not because of compute (both have ample FLOPS), but because of how fast data can be fed to the compute units.
  • SK Hynix holds 62% HBM market share (peak Q2 2025, projected >50% through 2026). SK supplies Nvidia H100, H200, B200 GPUs. SK Hynix is shipping HBM4 samples to Nvidia (>2 TB/s, coming 2026-2027). Samsung competes on LPDDR5X and is developing LPDDR5X-PIM (Processing-In-Memory) to compute inside memory and reduce data movement.
  • On-device AI will always be slower than cloud AI because you cannot fit HBM into a phone. LPDDR5X at 85.6 GB/s is roughly 10-14x slower than a single HBM3 or HBM3E stack (819 GB/s to 1.229 TB/s). This is a fundamental architectural gap, not one that engineering can close within the phone form factor.
  • Exynos 2600 (Galaxy S26) achieves ~8–15 tokens/sec for a Q4-quantized 7B model because of the LPDDR5X bandwidth constraint. Redesigning the compute cores cannot fix this: you need more memory bandwidth, which means a far wider memory interface (HBM uses a 1024-bit bus per stack on a silicon interposer) and a power budget that a phone cannot supply.
  • The memory-bandwidth bottleneck explains why fine-tuning alone doesn't help: the parameter count is unchanged, and every parameter must still be read from memory on every forward pass. Only fewer bytes per token help: smaller models (3B, 1B, including distilled ones) and quantization, which is the most practical lever on phones.
  • Samsung's PIM (Processing-In-Memory) strategy aims to compute operations inside the memory chip itself, eliminating data movement. This could eventually close the gap, but LPDDR5X-PIM is still in early stages and unlikely to ship in volume until 2027-2028.

Why Memory Bandwidth Determines AI Speed

During the decode phase of LLM inference, the GPU/NPU must read every model weight from memory for each generated token, perform one forward pass, and write the output. The bottleneck is how fast parameters can be fed to the compute units. That is memory bandwidth, not compute TOPS.

Simplified formula: tokens/sec ≈ memory_bandwidth / (parameter_count × bytes_per_parameter). For FP16 (2 bytes per parameter), a 7B model = 14 GB. At LPDDR5X 85.6 GB/s: 85.6 GB/s ÷ 14 GB = ~6 tokens/sec theoretical maximum. In practice, 3–5 tokens/sec due to compute and cache overhead.

Quantization changes the equation dramatically. Q4 (4-bit, 0.5 bytes per parameter) shrinks a 7B model to 3.5 GB. 85.6 GB/s ÷ 3.5 GB = ~24 tokens/sec theoretical. Real-world ~8–15 tokens/sec, a 3–4x improvement.
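
The arithmetic in the last two paragraphs can be sketched in a few lines. This is a minimal illustration, assuming decode reads every weight exactly once per token and ignoring KV-cache traffic and compute time; the function name `decode_ceiling_tps` is made up for this example and does not come from any library.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, params_billion: float,
                       bytes_per_param: float) -> float:
    """Upper bound on decode tokens/sec if every weight is read once per token."""
    # 1e9 parameters times bytes per parameter gives the weight size in GB.
    model_size_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_size_gb


LPDDR5X_GB_S = 85.6  # bandwidth figure quoted in this article

fp16 = decode_ceiling_tps(LPDDR5X_GB_S, 7, 2.0)  # 14 GB of weights
q4 = decode_ceiling_tps(LPDDR5X_GB_S, 7, 0.5)    # 3.5 GB of weights

print(f"7B FP16 ceiling: {fp16:.1f} tokens/sec")
print(f"7B Q4 ceiling:   {q4:.1f} tokens/sec")
```

The Q4 ceiling is exactly four times the FP16 ceiling because the weights are four times smaller; real devices land below both numbers.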

A data-center GPU with HBM3E (1.229 TB/s per stack) can sustain 100+ tokens/sec for the same model, because its memory is 14x faster. This is why frontier models (70B, 405B) run only in data centers: they need HBM bandwidth.

Inference is different from training. Training processes large batches, so each weight read from memory is reused across many tokens and the workload is largely compute-bound. Decode at batch size 1, once the prompt has been processed, performs a single forward pass per token and uses each weight only once, so it is almost purely memory-bandwidth-bound. This is why inference on phones is so much slower than on servers: you cannot engineer your way out of the bandwidth gap.
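
One rough way to check the memory-bound claim, as a sketch rather than a benchmark, is to compare the time needed to stream the weights with the time needed to do the arithmetic for one token. The rule of thumb of about 2 FLOPs per weight per token and the 10 TFLOPS accelerator figure are assumptions for illustration only, not specifications of any phone chip.

```python
def decode_step_times_ms(params_billion: float, bytes_per_param: float,
                         bandwidth_gb_s: float, compute_tflops: float):
    """Return (memory_ms, compute_ms) for one decode step at batch size 1."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    # Assumption: roughly one multiply and one add per weight per token.
    flops = 2 * params_billion * 1e9
    memory_ms = weight_bytes / (bandwidth_gb_s * 1e9) * 1e3
    compute_ms = flops / (compute_tflops * 1e12) * 1e3
    return memory_ms, compute_ms


ASSUMED_TFLOPS = 10.0  # hypothetical accelerator throughput, not a measured spec

memory_ms, compute_ms = decode_step_times_ms(7, 0.5, 85.6, ASSUMED_TFLOPS)
bound = "memory" if memory_ms > compute_ms else "compute"

print(f"weights: {memory_ms:.1f} ms/token, math: {compute_ms:.1f} ms/token")
print(f"decode is {bound}-bound under these assumptions")
```

Under these assumptions the weight transfer takes tens of milliseconds per token while the arithmetic takes about one, so adding compute would barely change the result.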

LLM decode is memory-bandwidth-bound: tokens/sec = bandwidth / model_size. On-device 85.6 GB/s vs data-center 1.229 TB/s = 14x gap.

Think of a factory assembly line: compute is the workers, memory is the supply chain. Workers are fast, but if supplies arrive slowly, they get bottlenecked. More workers (more FLOPS) doesn't help if supplies (data) arrive at the same rate. Phones lack "fast supply chains" (HBM).

Memory Bandwidth Comparison: LPDDR5X vs HBM

| Memory Type | Bandwidth | Used In | Tokens/sec (7B FP16) | Tokens/sec (7B Q4) |
|---|---|---|---|---|
| LPDDR5X 10.7 Gbps | 85.6 GB/s (x64 bus) | Galaxy S26, Snapdragon 8 Elite Gen 5, most phones | ~6 tokens/sec (theory); ~3–5 realistic | ~24 tokens/sec (theory); ~8–15 realistic |
| HBM2E | ~460 GB/s per stack | Nvidia A100 80GB | ~33 tokens/sec (theory) | ~131 tokens/sec (theory) |
| HBM3 6.4 Gbps | ~819 GB/s per stack | Nvidia H100 | ~59 tokens/sec (theory) | ~234 tokens/sec (theory) |
| HBM3E 9.6 Gbps | 1.18–1.229 TB/s per stack | Nvidia H200, B200 (multiple stacks per GPU) | ~88 tokens/sec (theory); ~60–80 realistic | ~351 tokens/sec (theory); ~200+ realistic |
| HBM4 (announced) | >2 TB/s per stack | Nvidia (2026-2027 sampling) | ~143 tokens/sec (theory) | ~571 tokens/sec (theory) |
| LPDDR5X-PIM (research) | 85.6 GB/s + compute inside memory | Samsung (lab samples, production 2027–2028?) | Unknown; eliminates some round-trips | Unknown; potentially +50% vs standard LPDDR5X |
| LPDDR6 (announced) | ~200+ GB/s (estimated) | Phones (2027–2028) | ~14 tokens/sec (theory) | ~57 tokens/sec (theory) |
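
The theory columns in the table follow from one division per cell. The sketch below recomputes them from the bandwidth column, assuming a single stack (or a single phone memory interface) and the same read-every-weight-once model used throughout this article. The dictionary names are illustrative, and HBM4 is entered at its 2 TB/s lower bound.

```python
MODEL_GB = {"7B FP16": 14.0, "7B Q4": 3.5}

# Bandwidth in GB/s, taken from the table above.
BANDWIDTH_GB_S = {
    "LPDDR5X": 85.6,
    "HBM2E": 460.0,
    "HBM3": 819.0,
    "HBM3E": 1229.0,
    "HBM4 (lower bound)": 2000.0,
    "LPDDR6 (estimated)": 200.0,
}


def theory_row(bandwidth_gb_s: float) -> dict:
    """Theoretical tokens/sec ceiling for each model size at this bandwidth."""
    return {name: bandwidth_gb_s / size_gb for name, size_gb in MODEL_GB.items()}


for memory, bandwidth in BANDWIDTH_GB_S.items():
    row = theory_row(bandwidth)
    print(f"{memory:<20} {row['7B FP16']:7.1f} {row['7B Q4']:7.1f}")
```

GPUs that carry several stacks multiply the available bandwidth accordingly, so the per-stack rows are a floor for those parts.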

Samsung and SK Hynix: Who Makes What

SK Hynix — HBM Leader: SK Hynix holds ~62% of the HBM market (peak Q2 2025, projected >50% through 2026). It supplies HBM3 and HBM3E to Nvidia for H100, H200, and B200 GPUs, and is sampling HBM4 (>2 TB/s) to Nvidia for next-gen GPUs launching 2026-2027. HBM revenue is critical for SK Hynix's data-center division; Samsung is chasing.

Samsung — LPDDR5X & PIM Push: Samsung manufactures LPDDR5X for Galaxy S26, Snapdragon phones, and Apple (for A18 Pro). Samsung is developing LPDDR5X-PIM (Processing-In-Memory), which embeds compute operations inside the memory die itself. This reduces data round-trips and could eventually narrow the bandwidth gap. LPDDR5X-PIM is in lab/early sampling phase; production volume unlikely before 2027-2028.

Competitive Dynamics: Samsung has pursued HBM (HBM3, HBM3E samples), but loses to SK due to yield and cost. Samsung pivoted to LPDDR5X-PIM as a differentiated strategy: make the phone memory smarter rather than trying to match HBM bandwidth. This is a "can't compete, so innovate differently" move.

Both companies supply Nvidia: SK Hynix provides HBM for Nvidia GPU VRAM. Samsung may supply standard DRAM for Nvidia's CPU/host memory. Neither supplies the compute (Nvidia designs the GPU cores). The ecosystem is specialized: design, memory, compute are separate.

Timeline: HBM4 enters production 2026-2027 (SK Hynix). LPDDR5X-PIM enters limited production 2027-2028 (Samsung). LPDDR6 enters phones 2027-2028, with ~2x bandwidth vs LPDDR5X (~200+ GB/s vs 85.6 GB/s)—still 6x slower than HBM3E, but a meaningful improvement.

On-Device AI Limits on the Galaxy S26

The Galaxy S26 Exynos 2600 with LPDDR5X 85.6 GB/s defines the practical ceiling for on-device LLM inference. A quantized 7B model at Q4 reaches ~8–15 tokens/sec realistic performance. This is suitable for latency-sensitive tasks (autocomplete, real-time transcription, simple tasks) but impractical for long conversations.

Model size limits: A 7B model at Q4 is practical (roughly 7–12 seconds per 100-token response at 8–15 tokens/sec). A 13B model at Q4 (~6.5 GB) hits 85.6 GB/s ÷ 6.5 GB = ~13 tokens/sec theoretical, about half the 7B ceiling. A 70B model at Q4 (~35 GB) hits 85.6 GB/s ÷ 35 GB = ~2 tokens/sec, which is unusable, and its weights would not fit in a phone's RAM anyway.

Quantization is essential: FP16 (2 bytes/param) is impractical. Q4 (0.5 bytes/param) is the sweet spot: models are 4x smaller with acceptable quality loss. Q3 (3-bit) saves more space but loses more quality; Q5 (5-bit) preserves more quality but is ~25% larger than Q4, and correspondingly slower.
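
To make the trade-off concrete, the sketch below sweeps nominal bit widths for a 7B model. It assumes exactly the stated number of bits per weight; real quantization formats also store per-block scale factors, so actual files are somewhat larger than these figures. The helper name `weights_gb` is illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB, ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


LPDDR5X_GB_S = 85.6  # bandwidth figure quoted in this article

for label, bits in [("FP16", 16), ("Q5", 5), ("Q4", 4), ("Q3", 3)]:
    size_gb = weights_gb(7, bits)
    ceiling = LPDDR5X_GB_S / size_gb
    print(f"{label:>4}: {size_gb:6.3f} GB -> {ceiling:5.1f} tokens/sec ceiling")
```

Each bit removed per weight shrinks the model and raises the ceiling proportionally, which is why the choice between Q3, Q4, and Q5 is a direct speed-versus-quality dial.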

Speed vs quality tradeoff: 7B Q4 is ~8–15 tokens/sec (acceptable for some use cases). 3B Q4 is ~24–36 tokens/sec (excellent for simple tasks). 1B Q4 is ~60+ tokens/sec (real-time on modern hardware).

Practical use cases: autocomplete, real-time code suggestion, on-device transcription, local summarization. Not practical: long conversations, complex reasoning, multi-turn dialogue without caching.

The bottleneck is bandwidth, not compute. Every weight that remains in the model must still cross the memory bus once per token, and the bus speed is fixed, so the only levers are fewer parameters or fewer bytes per parameter. This is why on-device AI is architecturally limited: you cannot engineer your way out of 85.6 GB/s in a phone form factor.

  • Use LPDDR5X 85.6 GB/s bandwidth to estimate max tokens/sec: divide by model size in GB
  • 7B Q4 (3.5 GB): ~24 tokens/sec theory; ~8–15 realistic (practical)
  • 13B Q4 (6.5 GB): ~13 tokens/sec theory; ~4–8 realistic (slow)
  • 1B Q4 (~500 MB): ~171 tokens/sec theory; ~50–100 realistic (fast)
  • Quantization is mandatory: Q4 is the baseline for usable on-device models
  • Trade off model size against latency; anything that falls below ~5 tokens/sec is outside the "good enough" window for interactive use
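
The checklist above can be turned into a quick estimator. This sketch assumes real-world throughput lands at 40–60% of the theoretical ceiling, which is the range this article quotes in its FAQ; actual efficiency varies by runtime and device, and the function names are made up for this example.

```python
def realistic_tps(bandwidth_gb_s: float, model_gb: float,
                  efficiency=(0.4, 0.6)):
    """Return (low, high) tokens/sec after applying an assumed efficiency range."""
    ceiling = bandwidth_gb_s / model_gb
    return ceiling * efficiency[0], ceiling * efficiency[1]


def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate a reply of the given length at a steady decode rate."""
    return tokens / tokens_per_sec


for name, size_gb in [("1B Q4", 0.5), ("7B Q4", 3.5), ("13B Q4", 6.5)]:
    low, high = realistic_tps(85.6, size_gb)
    fastest = response_seconds(100, high)
    slowest = response_seconds(100, low)
    print(f"{name}: {low:.0f}-{high:.0f} tokens/sec, "
          f"100-token reply in {fastest:.1f}-{slowest:.1f} s")
```

Thinking in seconds per reply rather than tokens per second is often the easier way to judge whether a model size is usable for a given task.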

Data Center vs. Phone: The 14x Bandwidth Gap

A data-center GPU with HBM3E (1.229 TB/s per stack) has 14x the memory bandwidth of a Galaxy S26 (LPDDR5X 85.6 GB/s), and decode throughput scales with it. This gap is not due to compute FLOPS (both are fast), but to memory bandwidth alone. The data-center GPU can do 100+ tokens/sec; the S26 does 8–15 tokens/sec for the same 7B Q4 model.

Why the gap exists: HBM is physically different. LPDDR5X is a compact package mounted next to or on top of the SoC with a narrow (x64) interface, which keeps power low for phones. HBM is a stack of DRAM dies connected with through-silicon vias (TSVs) and placed beside the GPU on a silicon interposer, exposing a 1024-bit interface per stack. That wide interface is where the bandwidth comes from, and its packaging, cost, and power do not fit a phone.

Why it can't be closed: Phones are thermally and power-constrained. Driving HBM at full bandwidth takes far more power than the few watts a phone can dissipate, while LPDDR5X is designed for that budget. Phones run on batteries; data centers have far larger power and cooling budgets. You cannot physically fit HBM bandwidth into a phone without destroying battery life.

Consequence: On-device AI will always be slower than cloud AI for large models. This is not a technology gap that will close—it's a physical constraint (power, thermal, form factor). Smaller models, aggressive quantization, and clever caching are the solutions, not hoping for better memory.

The flip side: on-device inference is private, works offline, and avoids network round-trips. The 14x speed penalty is the price of privacy; cloud AI gains speed at the cost of sending data off-device.

Future: LPDDR6 (2027-2028) should raise phone bandwidth to ~200 GB/s (still 6x slower than HBM3E), and LPDDR5X-PIM (2027-2028) aims to cut data movement further. This is meaningful (roughly double the tokens/sec) but won't make phones match data-center speed. The gap narrows from 14x to about 6x.
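
The 14x and 6x figures are plain bandwidth ratios. A short sketch, using the per-stack HBM3E number and the estimated LPDDR6 number quoted in this article (the variable names are illustrative):

```python
HBM3E_GB_S = 1229.0  # 1.229 TB/s per stack, as quoted in this article

PHONE_GB_S = {
    "LPDDR5X (today)": 85.6,
    "LPDDR6 (estimated)": 200.0,
}

# Ratio of data-center to phone bandwidth for each phone memory type.
gaps = {name: HBM3E_GB_S / bandwidth for name, bandwidth in PHONE_GB_S.items()}

for name, gap in gaps.items():
    print(f"{name}: HBM3E is about {gap:.1f}x faster")
```

Because the ratios compare one phone interface against one HBM stack, GPUs with multiple stacks widen the gap further.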

Memory Roadmap: HBM4 and LPDDR6

HBM4 (SK Hynix, 2026-2027): >2 TB/s per stack. Arriving first in Nvidia next-gen GPUs (post-B200). HBM4 is irrelevant for phones but will push data-center inference even faster. SK Hynix is the primary supplier.

LPDDR6 (2027-2028): ~200+ GB/s (estimated). Note that a x64 bus at 12.8 Gbps yields only ~102 GB/s, so the 200 GB/s estimate assumes a wider bus or a higher pin rate. At 200 GB/s, that's ~2.3x LPDDR5X bandwidth. For a 7B Q4 model: 200 GB/s ÷ 3.5 GB ≈ 57 tokens/sec theoretical (up from 24). In practice, ~20–35 tokens/sec realistic. A meaningful improvement, but still about 6x slower than data-center HBM3E. LPDDR6 is expected to ship in the Galaxy S27/S28 era (2027-2028).
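
Peak bandwidth follows from bus width and per-pin data rate: bits per transfer times transfers per second, divided by 8 to get bytes. The sketch below applies that relation to the x64 LPDDR5X configuration from the comparison table and to the x64, 12.8 Gbps LPDDR6 configuration mentioned above. The function names are illustrative, and the result shows that ~200 GB/s would need roughly twice the bus width at that pin rate.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth in GB/s for a memory interface."""
    return bus_width_bits * pin_rate_gbps / 8


def required_bus_bits(target_gb_s: float, pin_rate_gbps: float) -> float:
    """Bus width needed to reach a target bandwidth at a given pin rate."""
    return target_gb_s * 8 / pin_rate_gbps


lpddr5x = peak_bandwidth_gb_s(64, 10.7)      # x64 at 10.7 Gbps
lpddr6_x64 = peak_bandwidth_gb_s(64, 12.8)   # x64 at 12.8 Gbps
bits_for_200 = required_bus_bits(200.0, 12.8)

print(f"LPDDR5X x64 @ 10.7 Gbps: {lpddr5x:.1f} GB/s")
print(f"LPDDR6  x64 @ 12.8 Gbps: {lpddr6_x64:.1f} GB/s")
print(f"Bus width for 200 GB/s @ 12.8 Gbps: {bits_for_200:.0f} bits")
```

The same relation explains HBM: a 1024-bit interface at a modest per-pin rate delivers far more bandwidth than a 64-bit interface running its pins faster.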

LPDDR5X-PIM (Samsung, 2027-2028): Processing-In-Memory embeds compute inside the DRAM die. Instead of loading every model weight from memory, you compute operations (matrix multiplies) inside the memory itself, eliminating some round-trips. Samsung is actively developing this. Estimated 50%+ throughput improvement vs standard LPDDR5X, if successful. Still won't match HBM bandwidth, but a clever engineering solution.

Reality: Even with LPDDR6 + PIM, phones will still be 3–6x slower than data centers for inference. This is the fundamental gap that cannot be closed without changing the phone's physical design (larger, hotter, more power).

For on-device AI 2026-2027: Exynos 2600 + LPDDR5X is the current baseline. Exynos 2700 (S27) may improve compute, but bandwidth will be the bottleneck. Expect LPDDR6 and PIM as incremental improvements, not transformative.

FAQ

Why is memory bandwidth the bottleneck for AI inference?

Because the decode phase (generating each token) requires reading the entire model from memory for one forward pass. The compute units finish quickly, but memory can't feed them data fast enough. FLOPS are not the bottleneck; data delivery is.

What's the tokens/sec formula for on-device AI?

Simplified: tokens/sec ≈ memory_bandwidth / (parameter_count × bytes_per_parameter). For a 7B FP16 model (14 GB) at 85.6 GB/s: 85.6 ÷ 14 = ~6 tokens/sec. Quantized Q4 (3.5 GB): 85.6 ÷ 3.5 = ~24 tokens/sec. Real-world ~40–60% of theoretical.

Does SK Hynix dominate HBM?

Yes. SK holds ~62% HBM market share (Q2 2025 peak). SK supplies Nvidia H100, H200, B200 GPUs. Samsung makes LPDDR5X but hasn't achieved cost/yield parity with SK on HBM, so Samsung pivoted to PIM.

Can Samsung catch up to SK in HBM?

Not in the near term. In data-center HBM, Samsung samples HBM3 and HBM3E but loses to SK Hynix on cost and yield. For phones, Samsung is betting on LPDDR5X-PIM (computing inside memory) instead of trying to match HBM bandwidth.

When does LPDDR6 ship?

Estimated 2027-2028, in the Galaxy S27/S28 generation. ~200+ GB/s (2.3x LPDDR5X). This would roughly double on-device token throughput but remain about 6x slower than HBM3E data-center GPUs.

Why can't you put HBM in a phone?

Physical constraints: HBM needs a 1024-bit interface per stack on a silicon interposer, which adds package area and cost, and running it at full bandwidth draws far more power than a phone's thermal budget allows. LPDDR5X is built for that budget. Phones need to fit pockets and last 24 hours.

Will LPDDR5X-PIM close the gap with HBM?

Partially. By computing inside memory, it eliminates some data round-trips, potentially +50% throughput. But physics limits it: still 85.6 GB/s bandwidth. Helpful, but won't make phones match data-center speed.

Is compute FLOPS relevant for on-device AI?

Not as much as people think. Decode is memory-bound, not compute-bound. A slower compute unit with faster memory beats a faster compute unit with slower memory. By the same logic, chips that share the same LPDDR5X interface, such as Exynos 2600 (2nm) and Snapdragon 8 Elite Gen 5 (3nm), share roughly the same decode ceiling regardless of process node; any difference comes from cache and memory-controller tuning.

Can I run a 70B model on Galaxy S26?

Practically no. 70B Q4 (~35 GB) gives 85.6 GB/s ÷ 35 GB = ~2 tokens/sec in theory and closer to 1 token per second in practice, which is unusable for any interactive task. The weights also exceed the RAM of current phones. Stick to 7B or smaller.

What's the best model size for on-device?

7B Q4 is the Goldilocks zone: 8–15 tokens/sec, acceptable quality. 3B Q4 is faster (24–36 tokens/sec) but lower quality. 1B Q4 is ultra-fast (50+ tokens/sec) but very limited. 13B+ is too slow.

Will LPDDR5X-PIM be in the Galaxy S27?

Probably not. LPDDR5X-PIM is at the lab-sample stage; volume production is likely around 2028. The Galaxy S27 (2027) will probably use standard LPDDR5X or early LPDDR6. PIM comes later.

Can memory bandwidth for AI inference be increased?

Yes, but with limits. LPDDR6 (~200 GB/s) raises raw bandwidth, and LPDDR5X-PIM reduces data movement by computing inside memory. But physical limits prevent HBM-like bandwidth in phones. The 14x gap will shrink to about 6x, not disappear.

A Note on Third-Party Facts

This article references third-party AI models, benchmarks, prices, and licenses. The AI landscape changes rapidly. Benchmark scores, license terms, model names, and API prices can shift between the time of writing and the time you read this. Before making deployment or compliance decisions based on this article, verify current figures on each provider’s official source: Hugging Face model cards for licenses and benchmarks, provider websites for API pricing, and EUR-Lex for current GDPR and EU AI Act text. This article reflects publicly available information as of May 2026.
