Home/Prompt Engineering/Tokens, Costs & Limits: The Economics of AI Prompting in 2026

Fundamentals

Tokens, Costs & Limits: The Economics of AI Prompting in 2026

Last updated: April 2026·13 min read·By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

Read in:

🇺🇸en 🇩🇪de 🇫🇷fr 🇯🇵ja 🇨🇳zh 🇪🇸es 🇧🇷pt 🇸🇦ar 🇰🇷ko

Every AI API call is measured and billed in tokens — the unit that controls both what the model can process and how much you pay. Understanding tokens is the foundation of efficient, cost-effective prompting.

Key Takeaways

Tokens are the unit of AI cost and processing. Approximately 3–4 characters = 1 token in English; other languages require more tokens.
You pay separately for input tokens and output tokens — output tokens typically cost 2–5× more. So long verbose outputs are where costs spike.
Token counting includes system prompts, full conversation history, attached files, and images — not just your latest message.
Rate limits (requests per minute, tokens per minute) exist to prevent abuse and ensure fair resource allocation. Free tiers have strict limits; paid tiers are much higher.
Using the right model for the task reduces cost by 10–50×. GPT-5.5 mini or Claude Haiku 4.5 can handle tasks that don't need GPT-5.5 or Claude Opus 4.8.
Local LLMs via Ollama or LM Studio have zero per-token API cost but require VRAM investment and have lower capabilities than frontier models.

Visual Summary: Tokens, Costs & Limits: The Economics of AI Prompting in 2026

Prefer slides over reading? Click through this interactive presentation covering all key concepts, settings, and use cases — then save as PDF for reference.

The slide deck below covers: token pricing, rate limits, model selection, and cost-cutting strategies. Download the PDF as an AI token economics reference card.

Download Tokens, Costs & Limits: The Economics of AI Prompting in 2026 Reference Card (PDF)

What Is a Token?

A token is the smallest unit of text an AI model processes, approximately 3–4 characters or ¾ of an English word. In English text, "ChatGPT" counts as 2 tokens, and "Hello, how are you?" is roughly 5–6 tokens. Other languages tokenise less efficiently — the same phrase in German or Japanese may consume 20–40% more tokens. You are billed for every token in your prompt (input) and every token the model outputs. Understanding tokens is fundamental to what is prompt engineering — the practice of structuring your inputs to get reliable outputs.

Models do not "think" in words or characters. Internally, they convert your text into token IDs and process those numerically. This is why tokenisation matters: a single character change can sometimes affect the token boundary, and a poorly organised prompt with redundant words can waste hundreds of tokens without improving output quality.

In one sentence: a token is the smallest unit of text an AI model processes, approximately 3–4 characters or ¾ of an English word, and you are billed for every token in and every token out.

How Token Counting Works

Every element of your API call — system prompt, conversation history, new message, files, and the model's own output — consumes tokens from your quota. This is why a conversation that started with a small message can suddenly become expensive after five turns of back-and-forth. You're paying for all of it, accumulated. Understanding the distinction between system prompt and user prompt is critical because both are billed on every call.

System prompt: Counted once per message. A 200-word system prompt = ~250 tokens on every API call.
Full conversation history: Included on every request unless explicitly summarised or dropped. A 10-turn conversation with 500 tokens per turn = 5,000 tokens counted again on turn 11.
Your input message: Counted as-is.
Attached files or images: Images consume 100–2,000 tokens each depending on size and resolution. Large PDFs can consume thousands.
Model output: The generated response is counted in full at output token rates (usually 2–5× higher than input rates).
Worked example: A 3-turn research conversation: System prompt (300 tokens) + User Q1 (150 tokens) + Model A1 (200 tokens) + User Q2 (200 tokens) + Model A2 (300 tokens) + User Q3 (100 tokens) = 1,250 tokens so far. When you send Q3, you pay for the entire history again (1,250 tokens) plus the output of A3. A single "short" follow-up can cost as much as the entire prior conversation.

Pricing Across Cloud Providers

Prices vary dramatically based on model capability. All figures below are public pricing as of April 2026. Note that output tokens typically cost 2–5× more than input tokens — this is where costs accumulate fastest. The right model choice is the biggest cost lever — see how to pick between GPT-5.5, Claude, and Gemini for detailed comparisons.

Prices as of April 2026. Verify current rates: OpenAI pricing · Anthropic pricing · Google pricing

Model	Input (per 1M tokens)	Output (per 1M tokens)
OpenAI GPT-5.5	$5.00	$15.00
Anthropic Claude Opus 4.8	$3.00	$15.00
Google Gemini 3.5 Pro	$3.50	$10.50
OpenAI GPT-5.5 mini	$0.15	$0.60
Anthropic Claude 4.5 Haiku	$0.25	$1.25
Google Gemini 3.5 Flash	$0.075	$0.30

Rate Limits

Rate limits are caps on how many requests you can make per minute (RPM), how many tokens you can process per minute (TPM), or how many tokens per day (TPD). Providers impose limits to prevent abuse, ensure fair resource allocation across users, and create pricing tiers. Free-tier users face the strictest limits; paid tiers unlock much higher throughput.

Requests per minute (RPM): The number of API calls you can make in a 60-second window. Exceed this and requests are queued or rejected.
Tokens per minute (TPM): The total token throughput. A single large prompt can consume your entire TPM quota in seconds.
Common scenarios where you hit limits: Automated pipelines making rapid sequential calls (50+ per second), large batch-processing jobs, or free-tier users in burst situations.
Typical limits: Free tier: 3–15 RPM, 40k–100k TPM. Paid tier 1: 500 RPM, 200k–500k TPM. Enterprise: 3,000+ RPM, millions of TPM.
Workaround strategies: Batch small tasks into larger requests (fewer API calls), add delays between requests, or upgrade to a higher-tier account.

How Prompt Design Controls Costs

Tested in PromptQuorum — 20 identical research-summary prompts executed on GPT-5.5, Claude Opus 4.8, and Gemini 3.5 Pro with varying levels of system prompt verbosity: With a 500-token system prompt, average output was 450 tokens at an average cost of $0.032 per call. With the same instructions in a trimmed 200-token prompt, average output was 460 tokens at $0.025 per call — an 18% cost reduction with identical output quality. This aligns with how to prompt for speed — efficiency reduces both latency and cost.

Every unnecessary token in your prompt wastes money — and the costs accumulate faster because your entire prompt is reincluded on every API call in a conversation. Trimming a 500-token system prompt to 300 tokens saves $0.001 per call, but on 1,000 calls per day, that's $1/day or $365/year.

Trim context aggressively: Don't repeat what the model already knows. Instead of "The user asked X. I told them Y. Now they ask Z," just include Z.
Use explicit length constraints: "Answer in 3 bullets." or "Maximum 100 words." forces brevity and prevents verbose outputs (which cost more).
Avoid padding in system prompts: Every filler word costs money. "You are an expert assistant who helps users" is 10 tokens. "You are an expert assistant" is 6 tokens. Both convey the same meaning.
Example: Bloated vs Trimmed System Prompt:
Bad Prompt "You are a helpful AI assistant with extensive knowledge across many domains. You help users by providing detailed, comprehensive answers to their questions. Always be thorough and explain your reasoning step by step. Avoid being concise — users appreciate thorough explanations."
Good Prompt "You are an expert assistant. Provide accurate, detailed answers. Explain your reasoning."
Token difference: Bad = 55 tokens, Good = 13 tokens. On 100 calls per day: 42 × 100 × 30 days × ($0.005 / 1M input tokens) ≈ $0.63/month saved by just one trimmed prompt.

How to Cut LLM API Costs in 5 Steps

1
Match model to task complexity: use GPT-5.5 mini or Claude 4.5 Haiku for simple classification and Q&A — 33× cheaper than frontier models
2
Summarise conversation history every 5 turns: prevents full history re-billing on every call (a technique aligned with chain-of-thought prompting — structure your reasoning upfront)
3
Cap output length explicitly: "Answer in 3 bullets" or "Maximum 100 words" prevents verbose token-heavy responses
4
Trim system prompts to essentials: remove filler phrases; every redundant word is re-billed on every API call
5
Test local LLMs via Ollama for high-volume private workflows: zero per-token cost at the price of frontier model capability

Choosing the Right Model

Not every task requires OpenAI GPT-5.5 or Anthropic Claude Opus. Simple classification, factual Q&A, and many automated tasks run perfectly on cheaper models — and the cost difference is dramatic.

Task Type	Recommended Model	Cost vs GPT-5.5
Simple classification / yes-no	GPT-5.5 mini, Claude Haiku 4.5, or Gemini Flash	33× cheaper
Short factual Q&A	GPT-5.5 mini or Claude Haiku 4.5	10–33× cheaper
Complex analysis or code	GPT-5.5 or Claude Opus 4.8	baseline
Long-form creative writing	Claude Opus 4.8 or GPT-5.5	baseline
High-volume private workflows	Local model via Ollama	zero API cost

Local LLMs — Zero Cost Option

Local models via Ollama or LM Studio have zero per-token API cost — you only pay for the hardware (VRAM and electricity). This makes them ideal for high-volume workflows, privacy-sensitive applications, and cost-critical pipelines. The trade-offs are capability (local models lag frontier models) and latency (running on consumer VRAM is slower). Understanding context windows is essential when planning local deployments — your VRAM limits the context window size you can support.

Hardware costs: Ollama models like LLaMA 3.1 7B require ~8GB VRAM, 13B models need ~16GB, 70B models need 40GB+. GPU memory is the limiting factor.
Capability trade-off: Local models are excellent at classification, summarisation, and repetitive tasks. They struggle with multi-step reasoning, code generation, and creative writing compared to GPT-5.5 or Claude Opus 4.8.
Latency trade-off: Cloud models respond in 500ms–2s. Local models on consumer hardware: 2–10s depending on model size and system specs.
When to use local: High-volume automation (1,000+ calls/day), GDPR-sensitive data (EU users processing personal data under GDPR benefit from on-device processing with no external API calls), or cost-critical workflows where quality is "good enough."
When to use cloud: Latency-sensitive applications, tasks requiring reasoning, or one-off analyses where API cost is negligible.

Regional Context

EU / GDPR For EU organizations processing personal data through AI APIs, token costs include a compliance cost not visible in pricing tables: each token sent to a cloud API is personal data processed by a third-party under GDPR Article 28, requiring a Data Processing Agreement and transfer mechanism under Article 46 for non-EU providers.

Local LLMs via Ollama eliminate this entirely. For EU teams processing customer data, support tickets, or internal documents: the true cost of a cloud API call includes the compliance overhead of external data transfer. At scale, this can make local inference economically competitive even accounting for hardware investment.

German organizations under BSI IT-Grundschutz guidelines must document AI processing costs and data flows — token logs from cloud APIs satisfy this requirement if retained with appropriate access controls.

Japan (METI) Japanese text requires 20–40% more tokens than equivalent English text due to tokenizer inefficiency on CJK scripts. A 1,000-word Japanese document costs approximately $0.007 on GPT-5.5 vs $0.005 for the same English content. For Japanese-language AI workflows, Qwen3 models via Ollama are significantly more token-efficient — native CJK tokenization reduces Japanese token count by 30–40%, directly reducing per-call cost.

China Under China's Data Security Law (数据安全法), sending business data to foreign cloud AI APIs requires data localization compliance review. For Chinese enterprise teams, local inference via Qwen3 (Alibaba) eliminates cross-border data transfer cost and compliance risk simultaneously. At 1,000+ API calls per day, the hardware amortization cost of a local inference server is typically lower than API fees within 6–12 months.

How PromptQuorum Helps You Manage Token Costs

PromptQuorum uses two LLMs: a Backend LLM and a Frontend LLM (your chosen model that answers your prompt question). The Backend LLM optimizes your prompt and runs Quorum consensus analysis across multiple Frontend models. Unlike single-model chat interfaces, PromptQuorum makes token usage visible and actionable.

Backend LLM tokens are always visible. Frontend tokens visibility depends on how you access the model:

Public interfaces (Copilot, public Claude web chat): Frontend tokens NOT visible — only Backend tokens show.

Local models (LM Studio, Ollama): Frontend tokens ARE visible — runs on your hardware, PromptQuorum sees token usage directly.

APIs (OpenAI, Anthropic): It depends. With direct API integration, Frontend tokens visible. Via third-party endpoint or public interface, Frontend tokens NOT visible.

Tested in PromptQuorum — 20 identical research-summary prompts dispatched to GPT-5.5 and GPT-5.5 mini: Output quality matched on 17 of 20 tasks. Cost difference: $0.003 per prompt (GPT-5.5) vs $0.00007 per prompt (mini) — a 43× cost reduction. On the 3 tasks where GPT-5.5 outperformed, complexity involved multi-step reasoning across documents.

Token Cost Recipes

Use these templates as starting points for optimizing costs in specific workflows.

"Quick lookup / yes-no task": Use GPT-5.5 mini or Haiku. Minimal system prompt (≤50 tokens). No conversation history. Constrain output to 1–2 sentences. Total cost per task: ~$0.00001–0.0001.
"Long research task (5–10 turns)": Use Claude Opus 4.8 (excellent at long context). After every 5 turns, summarise the conversation and replace history with a summary (cuts tokens by 70%). Cost: ~$0.01–0.05 per research session.
"Automated pipeline / batch processing": Use GPT-5.5 mini for filtering or classification (33× cheaper). Only escalate to GPT-5.5 for final synthesis on borderline cases. Batch similar prompts to reuse context caching where the API supports it.
"Privacy-sensitive workflow": Route to Ollama or LM Studio running locally. Manage context window: 4k–8k tokens for 8GB VRAM, 16k–32k for 16GB. Zero API costs. Accept slightly lower quality for compliance.
"Comparing outputs across models": Send one well-structured prompt to GPT-5.5, Claude Opus 4.8, and Claude Haiku 4.5 simultaneously. Compare quality + cost. Pick the cheapest that meets your quality bar. Discovery cost: ~$0.001. Ongoing cost: 33–43× savings.

Common Mistakes

Avoid these token-wasting patterns.

Sending full conversation history on every call: If a conversation is 5,000 tokens after 10 turns, you're paying 5,000 tokens again on turn 11 even though only 200 tokens are new. Solution: Summarise every 5 turns or use prompt caching if the API supports it.
Using a high-capability model for simple tasks: Don't use GPT-5.5 for "extract the date from this email." Use GPT-5.5 mini or Haiku. Cost difference: 33× on this task alone.
Not constraining output length: A vague "tell me about X" prompt can return 500 tokens when "summarise in 50 words" returns 60 tokens. You pay 8× more for the verbose response.
Repeating long system prompts on every call: If your system prompt is 500 tokens and you make 100 API calls, that's 50,000 wasted tokens if you're not reusing or caching it. Use system prompt templates or request-level caching.
Forgetting image tokens: A single high-resolution image can consume 500–2,000 tokens depending on resolution. Downscale images or crop to the relevant region before uploading.
Running manual test calls instead of batching: Testing 20 variations of a prompt costs 20× the token cost of one call. Use batch APIs or PromptQuorum's multi-model comparison to test all variations in one shot.
Switching models mid-conversation: Cloud APIs (OpenAI, Anthropic) don't carry over conversation context between models. Restarting the conversation on a different model re-sends all prior messages. Commit to one model per conversation.

Frequently Asked Questions

What is a token in AI?

A token is the smallest unit of text an AI model processes — approximately 3–4 characters or ¾ of an English word. "ChatGPT" counts as 2 tokens. You are billed for every input token and every output token, with output tokens typically costing 2–5× more than input tokens.

How much does GPT-5.5 cost per token?

As of April 2026: GPT-5.5 costs $5.00 per 1M input tokens and $15.00 per 1M output tokens. GPT-5.5 mini costs $0.15 per 1M input and $0.60 per 1M output — 33× cheaper for tasks that don't require full GPT-5.5 capability.

How do rate limits work?

Rate limits cap requests per minute (RPM) and tokens per minute (TPM). Free tier: 3–15 RPM, 40k–100k TPM. Paid tier: 500 RPM, 200k–500k TPM. Enterprise: 3,000+ RPM. Workarounds: batch small tasks into larger requests, add delays between calls, or upgrade to a higher tier.

How many tokens is a typical article or report?

A 1,000-word article is approximately 1,200–1,500 tokens. A 10-page PDF is 4,000–6,000 tokens. A single high-resolution image is 500–2,000 tokens depending on resolution and content density.

Why is my API bill higher than expected even with short prompts?

Three common causes: (1) You are sending full conversation history on every call — summarise after 5 turns. (2) Your system prompt is long — trim to essentials. (3) You are using a powerful model for simple tasks — switch to GPT-5.5 mini or Haiku for classification or short Q&A.

Does a longer system prompt always mean better output?

No. A well-crafted 100-token system prompt often outperforms a rambling 500-token prompt. Quality beats quantity. Specificity beats verbosity.

When should I use a local LLM instead of a cloud API?

Use local LLMs for: high-volume automation (1,000+ calls/day), GDPR-sensitive data where no personal data should leave your infrastructure, or cost-critical pipelines where quality is good enough. Use cloud APIs for: latency-sensitive applications, complex reasoning tasks, or one-off analyses where API cost is negligible.

How can I reduce my AI API token costs?

Seven strategies: trim system prompts, constrain output length, summarise conversation history every 5 turns, use cheaper models for simple tasks, avoid sending full conversation history, downscale images before uploading, and batch test calls rather than running them manually.

How many tokens does a typical AI prompt use?

A typical prompt uses 150–500 tokens depending on complexity. A simple question (5–20 tokens), a medium paragraph (50–150 tokens), a full research prompt with examples (200–600 tokens). Tokens per prompt vary based on language and complexity.

What does it mean when a prompt has 3,000 tokens?

A 3,000-token prompt is roughly a 2,000-word article or 10+ pages of text. This indicates a long system prompt, complete conversation history, or large document context. For efficiency, consider summarizing conversation history or trimming unnecessary context.

How much does each AI prompt cost across different models?

Costs vary by model: GPT-5.5 mini = ~$0.00005–0.0001 per prompt. GPT-5.5 = ~$0.001–0.01. Claude Haiku = ~$0.00003 per prompt. Claude Opus = ~$0.005–0.02. Gemini Flash = ~$0.00002. Costs depend on prompt length and output.

How are AI prompt tokens calculated?

Tokens are calculated by breaking text into units of 3–4 characters (roughly ¾ of English words). System prompts, conversation history, images, attached files, and output all count. Most API providers show exact token count in responses. Shorter prompts and constrained output reduce token usage.

How many tokens is a 1,000-word prompt?

A 1,000-word prompt is approximately 1,200–1,500 tokens in English. Other languages tokenize less efficiently and may require 20–40% more tokens. Token count depends on word choice and average word length in the language used.

Are token limits based on a single prompt or the entire conversation?

Token limits apply to the entire conversation history, including all system prompts, previous messages, retrieved documents, and the current prompt. Rate limits (tokens per minute) accumulate across all your API calls in that timeframe, not just one prompt.

How many prompts can you get from 1 million tokens?

With 1 million tokens: 2,000–6,667 prompts if each prompt averages 150–500 tokens. GPT-5.5 mini prompts (~300 tokens) = ~3,333 prompts. GPT-5.5 prompts (~500 tokens) = ~2,000 prompts. Actual count depends on prompt size and output length.

Does prompt optimization reduce API costs significantly?

Yes. Trimming a 500-token system prompt to 300 tokens saves ~$0.001 per API call. At 1,000 calls/day, that's $365/year saved. Constraining output length and summarizing conversation history every 5 turns reduces costs 30–50%. Model selection is the largest lever — GPT-5.5 mini costs 33× less than GPT-5.5.

Sources & Further Reading

Apply these techniques with a local LLM or your own API keys — PromptQuorum works with any backend.

Try PromptQuorum free →

← Back to Prompt Engineering