What Is a Token?
A token is the smallest unit of text an AI model processes, approximately 3โ4 characters or ยพ of an English word. In English text, "ChatGPT" counts as 2 tokens, and "Hello, how are you?" is roughly 5โ6 tokens. Other languages tokenise less efficiently โ the same phrase in German or Japanese may consume 20โ40% more tokens. You are billed for every token in your prompt (input) and every token the model outputs. Understanding tokens is fundamental to what is prompt engineering โ the practice of structuring your inputs to get reliable outputs.
Models do not "think" in words or characters. Internally, they convert your text into token IDs and process those numerically. This is why tokenisation matters: a single character change can sometimes affect the token boundary, and a poorly organised prompt with redundant words can waste hundreds of tokens without improving output quality.
In one sentence: a token is the smallest unit of text an AI model processes, approximately 3โ4 characters or ยพ of an English word, and you are billed for every token in and every token out.
How Token Counting Works
Every element of your API call โ system prompt, conversation history, new message, files, and the model's own output โ consumes tokens from your quota. This is why a conversation that started with a small message can suddenly become expensive after five turns of back-and-forth. You're paying for all of it, accumulated. Understanding the distinction between system prompt and user prompt is critical because both are billed on every call.
- System prompt: Counted once per message. A 200-word system prompt = ~250 tokens on every API call.
- Full conversation history: Included on every request unless explicitly summarised or dropped. A 10-turn conversation with 500 tokens per turn = 5,000 tokens counted again on turn 11.
- Your input message: Counted as-is.
- Attached files or images: Images consume 100โ2,000 tokens each depending on size and resolution. Large PDFs can consume thousands.
- Model output: The generated response is counted in full at output token rates (usually 2โ5ร higher than input rates).
- Worked example: A 3-turn research conversation: System prompt (300 tokens) + User Q1 (150 tokens) + Model A1 (200 tokens) + User Q2 (200 tokens) + Model A2 (300 tokens) + User Q3 (100 tokens) = 1,250 tokens so far. When you send Q3, you pay for the entire history again (1,250 tokens) plus the output of A3. A single "short" follow-up can cost as much as the entire prior conversation.
Pricing Across Cloud Providers
Prices vary dramatically based on model capability. All figures below are public pricing as of April 2026. Note that output tokens typically cost 2โ5ร more than input tokens โ this is where costs accumulate fastest. The right model choice is the biggest cost lever โ see how to pick between GPT-4o, Claude, and Gemini for detailed comparisons.
Prices as of April 2026. Verify current rates: OpenAI pricing ยท Anthropic pricing ยท Google pricing
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| OpenAI GPT-4o | $5.00 | $15.00 |
| Anthropic Claude Opus 4.7 | $3.00 | $15.00 |
| Google Gemini 1.5 Pro | $3.50 | $10.50 |
| OpenAI GPT-4o mini | $0.15 | $0.60 |
| Anthropic Claude 4.5 Haiku | $0.25 | $1.25 |
| Google Gemini 1.5 Flash | $0.075 | $0.30 |
Rate Limits
Rate limits are caps on how many requests you can make per minute (RPM), how many tokens you can process per minute (TPM), or how many tokens per day (TPD). Providers impose limits to prevent abuse, ensure fair resource allocation across users, and create pricing tiers. Free-tier users face the strictest limits; paid tiers unlock much higher throughput.
- Requests per minute (RPM): The number of API calls you can make in a 60-second window. Exceed this and requests are queued or rejected.
- Tokens per minute (TPM): The total token throughput. A single large prompt can consume your entire TPM quota in seconds.
- Common scenarios where you hit limits: Automated pipelines making rapid sequential calls (50+ per second), large batch-processing jobs, or free-tier users in burst situations.
- Typical limits: Free tier: 3โ15 RPM, 40kโ100k TPM. Paid tier 1: 500 RPM, 200kโ500k TPM. Enterprise: 3,000+ RPM, millions of TPM.
- Workaround strategies: Batch small tasks into larger requests (fewer API calls), add delays between requests, or upgrade to a higher-tier account.
How Prompt Design Controls Costs
Tested in PromptQuorum โ 20 identical research-summary prompts executed on GPT-4o, Claude Opus 4.7, and Gemini 1.5 Pro with varying levels of system prompt verbosity: With a 500-token system prompt, average output was 450 tokens at an average cost of $0.032 per call. With the same instructions in a trimmed 200-token prompt, average output was 460 tokens at $0.025 per call โ an 18% cost reduction with identical output quality. This aligns with how to prompt for speed โ efficiency reduces both latency and cost.
Every unnecessary token in your prompt wastes money โ and the costs accumulate faster because your entire prompt is reincluded on every API call in a conversation. Trimming a 500-token system prompt to 300 tokens saves $0.001 per call, but on 1,000 calls per day, that's $1/day or $365/year.
- Trim context aggressively: Don't repeat what the model already knows. Instead of "The user asked X. I told them Y. Now they ask Z," just include Z.
- Use explicit length constraints: "Answer in 3 bullets." or "Maximum 100 words." forces brevity and prevents verbose outputs (which cost more).
- Avoid padding in system prompts: Every filler word costs money. "You are an expert assistant who helps users" is 10 tokens. "You are an expert assistant" is 6 tokens. Both convey the same meaning.
- Example: Bloated vs Trimmed System Prompt:
- Bad Prompt "You are a helpful AI assistant with extensive knowledge across many domains. You help users by providing detailed, comprehensive answers to their questions. Always be thorough and explain your reasoning step by step. Avoid being concise โ users appreciate thorough explanations."
- Good Prompt "You are an expert assistant. Provide accurate, detailed answers. Explain your reasoning."
- Token difference: Bad = 55 tokens, Good = 13 tokens. On 100 calls per day: 42 ร 100 ร 30 days ร ($0.005 / 1M input tokens) โ $0.63/month saved by just one trimmed prompt.
How to Cut LLM API Costs in 5 Steps
- 1Match model to task complexity: use GPT-4o mini or Claude 4.5 Haiku for simple classification and Q&A โ 33ร cheaper than frontier models
- 2Summarise conversation history every 5 turns: prevents full history re-billing on every call (a technique aligned with chain-of-thought prompting โ structure your reasoning upfront)
- 3Cap output length explicitly: "Answer in 3 bullets" or "Maximum 100 words" prevents verbose token-heavy responses
- 4Trim system prompts to essentials: remove filler phrases; every redundant word is re-billed on every API call
- 5Test local LLMs via Ollama for high-volume private workflows: zero per-token cost at the price of frontier model capability
Choosing the Right Model
Not every task requires OpenAI GPT-4o or Anthropic Claude Opus. Simple classification, factual Q&A, and many automated tasks run perfectly on cheaper models โ and the cost difference is dramatic.
| Task Type | Recommended Model | Cost vs GPT-4o |
|---|---|---|
| Simple classification / yes-no | GPT-4o mini, Claude Haiku 4.5, or Gemini Flash | 33ร cheaper |
| Short factual Q&A | GPT-4o mini or Claude Haiku 4.5 | 10โ33ร cheaper |
| Complex analysis or code | GPT-4o or Claude Opus 4.7 | baseline |
| Long-form creative writing | Claude Opus 4.7 or GPT-4o | baseline |
| High-volume private workflows | Local model via Ollama | zero API cost |
Local LLMs โ Zero Cost Option
Local models via Ollama or LM Studio have zero per-token API cost โ you only pay for the hardware (VRAM and electricity). This makes them ideal for high-volume workflows, privacy-sensitive applications, and cost-critical pipelines. The trade-offs are capability (local models lag frontier models) and latency (running on consumer VRAM is slower). Understanding context windows is essential when planning local deployments โ your VRAM limits the context window size you can support.
- Hardware costs: Ollama models like LLaMA 3.1 7B require ~8GB VRAM, 13B models need ~16GB, 70B models need 40GB+. GPU memory is the limiting factor.
- Capability trade-off: Local models are excellent at classification, summarisation, and repetitive tasks. They struggle with multi-step reasoning, code generation, and creative writing compared to GPT-4o or Claude Opus 4.7.
- Latency trade-off: Cloud models respond in 500msโ2s. Local models on consumer hardware: 2โ10s depending on model size and system specs.
- When to use local: High-volume automation (1,000+ calls/day), GDPR-sensitive data (EU users processing personal data under GDPR benefit from on-device processing with no external API calls), or cost-critical workflows where quality is "good enough."
- When to use cloud: Latency-sensitive applications, tasks requiring reasoning, or one-off analyses where API cost is negligible.
Regional Context
EU / GDPR For EU organizations processing personal data through AI APIs, token costs include a compliance cost not visible in pricing tables: each token sent to a cloud API is personal data processed by a third-party under GDPR Article 28, requiring a Data Processing Agreement and transfer mechanism under Article 46 for non-EU providers.
Local LLMs via Ollama eliminate this entirely. For EU teams processing customer data, support tickets, or internal documents: the true cost of a cloud API call includes the compliance overhead of external data transfer. At scale, this can make local inference economically competitive even accounting for hardware investment.
German organizations under BSI IT-Grundschutz guidelines must document AI processing costs and data flows โ token logs from cloud APIs satisfy this requirement if retained with appropriate access controls.
Japan (METI) Japanese text requires 20โ40% more tokens than equivalent English text due to tokenizer inefficiency on CJK scripts. A 1,000-word Japanese document costs approximately $0.007 on GPT-4o vs $0.005 for the same English content. For Japanese-language AI workflows, Qwen2.5 models via Ollama are significantly more token-efficient โ native CJK tokenization reduces Japanese token count by 30โ40%, directly reducing per-call cost.
China Under China's Data Security Law (ๆฐๆฎๅฎๅ จๆณ), sending business data to foreign cloud AI APIs requires data localization compliance review. For Chinese enterprise teams, local inference via Qwen2.5 (Alibaba) eliminates cross-border data transfer cost and compliance risk simultaneously. At 1,000+ API calls per day, the hardware amortization cost of a local inference server is typically lower than API fees within 6โ12 months.
How PromptQuorum Helps You Manage Token Costs
PromptQuorum uses two LLMs: a Backend LLM and a Frontend LLM (your chosen model that answers your prompt question). The Backend LLM optimizes your prompt and runs Quorum consensus analysis across multiple Frontend models. Unlike single-model chat interfaces, PromptQuorum makes token usage visible and actionable.
Backend LLM tokens are always visible. Frontend tokens visibility depends on how you access the model:
- Public interfaces (Copilot, public Claude web chat): Frontend tokens NOT visible โ only Backend tokens show.
- Local models (LM Studio, Ollama): Frontend tokens ARE visible โ runs on your hardware, PromptQuorum sees token usage directly.
- APIs (OpenAI, Anthropic): It depends. With direct API integration, Frontend tokens visible. Via third-party endpoint or public interface, Frontend tokens NOT visible.
Tested in PromptQuorum โ 20 identical research-summary prompts dispatched to GPT-4o and GPT-4o mini: Output quality matched on 17 of 20 tasks. Cost difference: $0.003 per prompt (GPT-4o) vs $0.00007 per prompt (mini) โ a 43ร cost reduction. On the 3 tasks where GPT-4o outperformed, complexity involved multi-step reasoning across documents.
Token Cost Recipes
Use these templates as starting points for optimizing costs in specific workflows.
- "Quick lookup / yes-no task": Use GPT-4o mini or Haiku. Minimal system prompt (โค50 tokens). No conversation history. Constrain output to 1โ2 sentences. Total cost per task: ~$0.00001โ0.0001.
- "Long research task (5โ10 turns)": Use Claude Opus 4.7 (excellent at long context). After every 5 turns, summarise the conversation and replace history with a summary (cuts tokens by 70%). Cost: ~$0.01โ0.05 per research session.
- "Automated pipeline / batch processing": Use GPT-4o mini for filtering or classification (33ร cheaper). Only escalate to GPT-4o for final synthesis on borderline cases. Batch similar prompts to reuse context caching where the API supports it.
- "Privacy-sensitive workflow": Route to Ollama or LM Studio running locally. Manage context window: 4kโ8k tokens for 8GB VRAM, 16kโ32k for 16GB. Zero API costs. Accept slightly lower quality for compliance.
- "Comparing outputs across models": Send one well-structured prompt to GPT-4o, Claude Opus 4.7, and Claude Haiku 4.5 simultaneously. Compare quality + cost. Pick the cheapest that meets your quality bar. Discovery cost: ~$0.001. Ongoing cost: 33โ43ร savings.
Common Mistakes
Avoid these token-wasting patterns.
- Sending full conversation history on every call: If a conversation is 5,000 tokens after 10 turns, you're paying 5,000 tokens again on turn 11 even though only 200 tokens are new. Solution: Summarise every 5 turns or use prompt caching if the API supports it.
- Using a high-capability model for simple tasks: Don't use GPT-4o for "extract the date from this email." Use GPT-4o mini or Haiku. Cost difference: 33ร on this task alone.
- Not constraining output length: A vague "tell me about X" prompt can return 500 tokens when "summarise in 50 words" returns 60 tokens. You pay 8ร more for the verbose response.
- Repeating long system prompts on every call: If your system prompt is 500 tokens and you make 100 API calls, that's 50,000 wasted tokens if you're not reusing or caching it. Use system prompt templates or request-level caching.
- Forgetting image tokens: A single high-resolution image can consume 500โ2,000 tokens depending on resolution. Downscale images or crop to the relevant region before uploading.
- Running manual test calls instead of batching: Testing 20 variations of a prompt costs 20ร the token cost of one call. Use batch APIs or PromptQuorum's multi-model comparison to test all variations in one shot.
- Switching models mid-conversation: Cloud APIs (OpenAI, Anthropic) don't carry over conversation context between models. Restarting the conversation on a different model re-sends all prior messages. Commit to one model per conversation.
FAQ
What is a token in AI?
A token is the smallest unit of text an AI model processes โ approximately 3โ4 characters or ยพ of an English word. "ChatGPT" counts as 2 tokens. You are billed for every input token and every output token, with output tokens typically costing 2โ5ร more than input tokens.
How much does GPT-4o cost per token?
As of April 2026: GPT-4o costs $5.00 per 1M input tokens and $15.00 per 1M output tokens. GPT-4o mini costs $0.15 per 1M input and $0.60 per 1M output โ 33ร cheaper for tasks that don't require full GPT-4o capability.
How do rate limits work?
Rate limits cap requests per minute (RPM) and tokens per minute (TPM). Free tier: 3โ15 RPM, 40kโ100k TPM. Paid tier: 500 RPM, 200kโ500k TPM. Enterprise: 3,000+ RPM. Workarounds: batch small tasks into larger requests, add delays between calls, or upgrade to a higher tier.
How many tokens is a typical article or report?
A 1,000-word article is approximately 1,200โ1,500 tokens. A 10-page PDF is 4,000โ6,000 tokens. A single high-resolution image is 500โ2,000 tokens depending on resolution and content density.
Why is my API bill higher than expected even with short prompts?
Three common causes: (1) You are sending full conversation history on every call โ summarise after 5 turns. (2) Your system prompt is long โ trim to essentials. (3) You are using a powerful model for simple tasks โ switch to GPT-4o mini or Haiku for classification or short Q&A.
Does a longer system prompt always mean better output?
No. A well-crafted 100-token system prompt often outperforms a rambling 500-token prompt. Quality beats quantity. Specificity beats verbosity.
When should I use a local LLM instead of a cloud API?
Use local LLMs for: high-volume automation (1,000+ calls/day), GDPR-sensitive data where no personal data should leave your infrastructure, or cost-critical pipelines where quality is good enough. Use cloud APIs for: latency-sensitive applications, complex reasoning tasks, or one-off analyses where API cost is negligible.
How can I reduce my AI API token costs?
Seven strategies: trim system prompts, constrain output length, summarise conversation history every 5 turns, use cheaper models for simple tasks, avoid sending full conversation history, downscale images before uploading, and batch test calls rather than running them manually.
How many tokens does a typical AI prompt use?
A typical prompt uses 150โ500 tokens depending on complexity. A simple question (5โ20 tokens), a medium paragraph (50โ150 tokens), a full research prompt with examples (200โ600 tokens). Tokens per prompt vary based on language and complexity.
What does it mean when a prompt has 3,000 tokens?
A 3,000-token prompt is roughly a 2,000-word article or 10+ pages of text. This indicates a long system prompt, complete conversation history, or large document context. For efficiency, consider summarizing conversation history or trimming unnecessary context.
How much does each AI prompt cost across different models?
Costs vary by model: GPT-4o mini = ~$0.00005โ0.0001 per prompt. GPT-4o = ~$0.001โ0.01. Claude Haiku = ~$0.00003 per prompt. Claude Opus = ~$0.005โ0.02. Gemini Flash = ~$0.00002. Costs depend on prompt length and output.
How are AI prompt tokens calculated?
Tokens are calculated by breaking text into units of 3โ4 characters (roughly ยพ of English words). System prompts, conversation history, images, attached files, and output all count. Most API providers show exact token count in responses. Shorter prompts and constrained output reduce token usage.
How many tokens is a 1,000-word prompt?
A 1,000-word prompt is approximately 1,200โ1,500 tokens in English. Other languages tokenize less efficiently and may require 20โ40% more tokens. Token count depends on word choice and average word length in the language used.
Are token limits based on a single prompt or the entire conversation?
Token limits apply to the entire conversation history, including all system prompts, previous messages, retrieved documents, and the current prompt. Rate limits (tokens per minute) accumulate across all your API calls in that timeframe, not just one prompt.
How many prompts can you get from 1 million tokens?
With 1 million tokens: 2,000โ6,667 prompts if each prompt averages 150โ500 tokens. GPT-4o mini prompts (~300 tokens) = ~3,333 prompts. GPT-4o prompts (~500 tokens) = ~2,000 prompts. Actual count depends on prompt size and output length.
Does prompt optimization reduce API costs significantly?
Yes. Trimming a 500-token system prompt to 300 tokens saves ~$0.001 per API call. At 1,000 calls/day, that's $365/year saved. Constraining output length and summarizing conversation history every 5 turns reduces costs 30โ50%. Model selection is the largest lever โ GPT-4o mini costs 33ร less than GPT-4o.