What Is a Token?
A token is the smallest unit of text an AI model processes, approximately 3–4 characters or ¾ of an English word. In English text, "ChatGPT" counts as 2 tokens, and "Hello, how are you?" is roughly 5–6 tokens. Other languages tokenise less efficiently: the same phrase in German or Japanese may consume 20–40% more tokens. You are billed for every token in your prompt (input) and every token the model outputs. Understanding tokens is fundamental to what is prompt engineering, the practice of structuring your inputs to get reliable outputs.
Models do not "think" in words or characters. Internally, they convert your text into token IDs and process those numerically. This is why tokenisation matters: a single character change can sometimes affect the token boundary, and a poorly organised prompt with redundant words can waste hundreds of tokens without improving output quality.
In one sentence: a token is the smallest unit of text an AI model processes, approximately 3–4 characters or ¾ of an English word, and you are billed for every token in and every token out.
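The 3–4 characters / ¾-of-a-word rule can be turned into a quick estimator. This is a heuristic only, not a real tokenizer; exact counts depend on the model's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough English token estimate: ~4 characters or ~3/4 of a word per
    token. Averages the two heuristics; real counts depend on the tokenizer."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Hello, how are you?"))  # 5 -- matches the 5-6 range above
```

For exact counts, use the provider's own tokenizer tooling; this sketch is only for back-of-envelope budgeting.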
Key Takeaways
- Tokens are the unit of AI cost and processing. Approximately 3–4 characters = 1 token in English; other languages require more tokens.
- You pay separately for input tokens and output tokens; output tokens typically cost 2–5× more, so long verbose outputs are where costs spike.
- Token counting includes system prompts, full conversation history, attached files, and images, not just your latest message.
- Rate limits (requests per minute, tokens per minute) exist to prevent abuse and ensure fair resource allocation. Free tiers have strict limits; paid tiers are much higher.
- Using the right model for the task reduces cost by 10–50×. GPT-4o mini or Claude Haiku 4.5 can handle tasks that don't need GPT-4o or Claude 4.6 Sonnet.
- Local LLMs via Ollama or LM Studio have zero per-token API cost but require VRAM investment and have lower capabilities than frontier models.
How Token Counting Works in Practice
Every element of your API call (system prompt, conversation history, new message, files, and the model's own output) consumes tokens from your quota. This is why a conversation that started with a small message can suddenly become expensive after five turns of back-and-forth: you're paying for all of it, accumulated. Understanding the distinction between system prompt and user prompt is critical because both are billed on every call.
- System prompt: Re-sent and billed on every call. A 200-word system prompt = ~250 tokens on every API call.
- Full conversation history: Included on every request unless explicitly summarised or dropped. A 10-turn conversation with 500 tokens per turn = 5,000 tokens counted again on turn 11.
- Your input message: Counted as-is.
- Attached files or images: Images consume 100–2,000 tokens each depending on size and resolution. Large PDFs can consume thousands.
- Model output: The generated response is counted in full at output token rates (usually 2–5× higher than input rates).
- Worked example: A 3-turn research conversation: System prompt (300 tokens) + User Q1 (150 tokens) + Model A1 (200 tokens) + User Q2 (200 tokens) + Model A2 (300 tokens) + User Q3 (100 tokens) = 1,250 tokens so far. When you send Q3, you pay for the entire history again (1,250 tokens) plus the output of A3. A single "short" follow-up can cost as much as the entire prior conversation.
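The accumulation in the worked example can be sketched as a short calculation. This is a model-agnostic sketch; the token counts are the example's own:

```python
def billed_input_tokens(system_tokens: int, turns: list[tuple[int, int]]) -> list[int]:
    """Input tokens billed on each call: the system prompt plus the full
    history of prior (user, assistant) turns plus the new user message."""
    billed = []
    history = system_tokens
    for user_tokens, assistant_tokens in turns:
        history += user_tokens       # new user message joins the context
        billed.append(history)       # everything so far is re-sent as input
        history += assistant_tokens  # the reply becomes history for the next turn
    return billed

# The 3-turn example above: (user, assistant) token counts per turn.
calls = billed_input_tokens(300, [(150, 200), (200, 300), (100, 0)])
print(calls)  # [450, 850, 1250] -- turn 3 re-sends the full 1,250-token history
```

Note how each call is billed for everything before it; this is exactly why late turns in a long conversation cost the most.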
How Much Do GPT-4o, Claude, and Gemini Cost per Million Tokens in 2026?
Prices vary dramatically based on model capability. All figures below are public pricing as of March 2026. Note that output tokens typically cost 2–5× more than input tokens; this is where costs accumulate fastest. The right model choice is the biggest cost lever: see how to pick between GPT-4o, Claude, and Gemini for detailed comparisons.
Prices as of March 2026. Verify current rates: OpenAI pricing · Anthropic pricing · Google pricing
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| OpenAI GPT-4o | $5.00 | $15.00 |
| Anthropic Claude 4.6 Sonnet | $3.00 | $15.00 |
| Google Gemini 1.5 Pro | $3.50 | $10.50 |
| OpenAI GPT-4o mini | $0.15 | $0.60 |
| Anthropic Claude 4.5 Haiku | $0.25 | $1.25 |
| Google Gemini 1.5 Flash | $0.075 | $0.30 |
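Per-call cost follows directly from the table. A sketch using the March 2026 figures above (the short model keys are just labels for this example, not official API model IDs):

```python
# (input $/1M tokens, output $/1M tokens) -- March 2026 figures from the table above.
PRICES = {
    "gpt-4o":         (5.00, 15.00),
    "claude-sonnet":  (3.00, 15.00),
    "gemini-1.5-pro": (3.50, 10.50),
    "gpt-4o-mini":    (0.15, 0.60),
    "claude-haiku":   (0.25, 1.25),
    "gemini-flash":   (0.075, 0.30),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call at the listed per-million-token rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 1,250 input tokens and a 300-token reply, frontier vs small model:
print(round(call_cost("gpt-4o", 1250, 300), 5))       # 0.01075
print(round(call_cost("gpt-4o-mini", 1250, 300), 7))  # 0.0003675
```

The same call is roughly 30× cheaper on the small model, which is the whole argument of the model-selection section below.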
What Are Rate Limits, and Why Do They Exist?
Rate limits are caps on how many requests you can make per minute (RPM), how many tokens you can process per minute (TPM), or how many tokens per day (TPD). Providers impose limits to prevent abuse, ensure fair resource allocation across users, and create pricing tiers. Free-tier users face the strictest limits; paid tiers unlock much higher throughput.
- Requests per minute (RPM): The number of API calls you can make in a 60-second window. Exceed this and requests are queued or rejected.
- Tokens per minute (TPM): The total token throughput. A single large prompt can consume your entire TPM quota in seconds.
- Common scenarios where you hit limits: Automated pipelines making rapid sequential calls (50+ per second), large batch-processing jobs, or free-tier users in burst situations.
- Typical limits: Free tier: 3–15 RPM, 40k–100k TPM. Paid tier 1: 500 RPM, 200k–500k TPM. Enterprise: 3,000+ RPM, millions of TPM.
- Workaround strategies: Batch small tasks into larger requests (fewer API calls), add delays between requests, or upgrade to a higher-tier account.
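A common companion to these strategies is exponential backoff with jitter when a rate-limit response arrives. A minimal provider-agnostic sketch; `RuntimeError` here is a stand-in for your SDK's actual rate-limit exception:

```python
import random
import time

def call_with_backoff(send, max_retries: int = 5, base_delay: float = 1.0):
    """Retry send() when it raises a rate-limit error, doubling the wait each
    attempt and adding jitter so parallel workers don't retry in lockstep."""
    for attempt in range(max_retries):
        try:
            return send()
        except RuntimeError:  # stand-in for the provider's 429 / rate-limit error
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

In practice, catch the specific exception your provider's SDK raises for HTTP 429 rather than `RuntimeError`, and honour any `Retry-After` hint the response carries.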
How Can I Reduce My LLM API Costs by 30–50×?
Tested in PromptQuorum: 20 identical research-summary prompts executed on GPT-4o, Claude 4.6 Sonnet, and Gemini 1.5 Pro with varying levels of system prompt verbosity. With a 500-token system prompt, average output was 450 tokens at an average cost of $0.032 per call. With the same instructions in a trimmed 200-token prompt, average output was 460 tokens at $0.025 per call, a roughly 22% cost reduction with identical output quality. This aligns with how to prompt for speed: efficiency reduces both latency and cost.
Every unnecessary token in your prompt wastes money, and the waste compounds because your entire prompt is re-included on every API call in a conversation. Trimming a 500-token system prompt to 300 tokens saves $0.001 per call; on 1,000 calls per day, that's $1/day or $365/year.
- Trim context aggressively: Don't repeat what the model already knows. Instead of "The user asked X. I told them Y. Now they ask Z," just include Z.
- Use explicit length constraints: a constraint like "Answer in 3 bullets" or "Maximum 100 words" forces brevity and prevents verbose outputs (which cost more).
- Avoid padding in system prompts: Every filler word costs money. "You are an expert assistant who helps users" is 10 tokens. "You are an expert assistant" is 6 tokens. Both convey the same meaning.
- Example: Bloated vs Trimmed System Prompt:
- Bad Prompt: "You are a helpful AI assistant with extensive knowledge across many domains. You help users by providing detailed, comprehensive answers to their questions. Always be thorough and explain your reasoning step by step. Avoid being concise; users appreciate thorough explanations."
- Good Prompt: "You are an expert assistant. Provide accurate, detailed answers. Explain your reasoning."
- Token difference: Bad = 55 tokens, Good = 13 tokens. On 100 calls per day: 42 × 100 × 30 days × ($5.00 / 1M input tokens) ≈ $0.63/month saved by just one trimmed prompt.
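This savings arithmetic generalises to any trimmed prompt. A sketch, using GPT-4o's input rate from the pricing table above:

```python
def monthly_savings(tokens_saved: int, calls_per_day: int,
                    input_price_per_mtok: float, days: int = 30) -> float:
    """Dollars saved per month by removing tokens_saved tokens from a
    prompt that is re-sent on every call."""
    return tokens_saved * calls_per_day * days * input_price_per_mtok / 1_000_000

# 42 tokens trimmed, 100 calls/day, GPT-4o input at $5.00 per 1M tokens:
print(round(monthly_savings(42, 100, 5.00), 2))  # 0.63
```

One trimmed prompt saves pennies; the same function shows that trimming a 500-token prompt to 300 on a 1,000-call/day pipeline saves about $30/month on input alone.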
How to Cut LLM API Costs in 5 Steps
- 1. Match model to task complexity: use GPT-4o mini or Claude 4.5 Haiku for simple classification and Q&A, 33× cheaper than frontier models.
- 2. Summarise conversation history every 5 turns: prevents full history re-billing on every call.
- 3. Cap output length explicitly: "Answer in 3 bullets" or "Maximum 100 words" prevents verbose token-heavy responses.
- 4. Trim system prompts to essentials: remove filler phrases; every redundant word is re-billed on every API call.
- 5. Test local LLMs via Ollama for high-volume private workflows: zero per-token cost, traded against some frontier-model capability.
Choosing the Right Model for the Right Task
Not every task requires OpenAI GPT-4o or Anthropic Claude Opus. Simple classification, factual Q&A, and many automated tasks run perfectly on cheaper models, and the cost difference is dramatic.
| Task Type | Recommended Model | Cost vs GPT-4o |
|---|---|---|
| Simple classification / yes-no | GPT-4o mini, Claude Haiku 4.5, or Gemini Flash | 33× cheaper |
| Short factual Q&A | GPT-4o mini or Claude Haiku 4.5 | 10–33× cheaper |
| Complex analysis or code | GPT-4o or Claude 4.6 Sonnet | baseline |
| Long-form creative writing | Claude 4.6 Sonnet or GPT-4o | baseline |
| High-volume private workflows | Local model via Ollama | zero API cost |
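The routing table above can be expressed as a simple dispatch function. The task labels and model keys here are illustrative shorthand, not API identifiers:

```python
# Map each task type to the cheapest model that reliably handles it,
# following the recommendations in the table above.
ROUTES = {
    "classification": "gpt-4o-mini",
    "short_qa":       "gpt-4o-mini",
    "code":           "gpt-4o",
    "analysis":       "gpt-4o",
    "creative":       "claude-sonnet",
    "private_bulk":   "ollama-local",
}

def pick_model(task_type: str) -> str:
    """Route to the cheapest adequate model; default to the frontier model
    for unknown task types so quality never silently degrades."""
    return ROUTES.get(task_type, "gpt-4o")

print(pick_model("classification"))  # gpt-4o-mini
print(pick_model("unknown_task"))    # gpt-4o
```

Defaulting unknown tasks to the capable model is a deliberate choice: misrouting a hard task to a cheap model costs quality, while misrouting an easy task to a frontier model only costs money.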
What Are the Trade-offs of Local LLMs (Ollama) vs Cloud APIs?
Local models via Ollama or LM Studio have zero per-token API cost; you only pay for the hardware (VRAM and electricity). This makes them ideal for high-volume workflows, privacy-sensitive applications, and cost-critical pipelines. The trade-offs are capability (local models lag frontier models) and latency (running on consumer VRAM is slower). Understanding context windows is essential when planning local deployments: your VRAM limits the context window size you can support.
- Hardware costs: Ollama models like Llama 3.1 8B require ~8GB VRAM, 13B-class models need ~16GB, 70B models need 40GB+. GPU memory is the limiting factor.
- Capability trade-off: Local models are excellent at classification, summarisation, and repetitive tasks. They struggle with multi-step reasoning, code generation, and creative writing compared to GPT-4o or Claude 4.6 Sonnet.
- Latency trade-off: Cloud models respond in 500ms–2s. Local models on consumer hardware: 2–10s depending on model size and system specs.
- When to use local: High-volume automation (1,000+ calls/day), GDPR-sensitive data (EU users processing personal data under GDPR benefit from on-device processing with no external API calls), or cost-critical workflows where quality is "good enough."
- When to use cloud: Latency-sensitive applications, tasks requiring reasoning, or one-off analyses where API cost is negligible.
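Whether local hardware pays off comes down to a break-even calculation. A sketch, assuming a fixed hardware cost amortised against per-query API savings (the figures in the usage line are illustrative):

```python
import math

def breakeven_queries(hardware_cost: float,
                      api_cost_per_query: float,
                      local_cost_per_query: float = 0.0) -> int:
    """Number of queries after which a local setup is cheaper than the API.
    Electricity and maintenance can be folded into local_cost_per_query."""
    saving = api_cost_per_query - local_cost_per_query
    if saving <= 0:
        raise ValueError("local is never cheaper per query")
    return math.ceil(hardware_cost / saving)

# Illustrative: a $200 GPU vs a frontier model at ~$0.003 per query.
print(breakeven_queries(200.0, 0.003))  # 66667
```

At ~67,000 queries to break even against a frontier model, a 1,000-query/day pipeline recoups the hardware in about two months; against a cheap small model the break-even stretches to millions of queries, which is why local only wins at high volume.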
How PromptQuorum Helps You Manage Token Costs
PromptQuorum uses two LLMs: a Backend LLM and a Frontend LLM (your chosen model that answers your prompt question). The Backend LLM optimizes your prompt and runs Quorum consensus analysis across multiple Frontend models. Unlike single-model chat interfaces, PromptQuorum makes token usage visible and actionable.
Backend LLM tokens are always visible. Frontend token visibility depends on how you access the model:
- Public interfaces (Copilot, public Claude web chat): Frontend tokens NOT visible; only Backend tokens show.
- Local models (LM Studio, Ollama): Frontend tokens ARE visible; the model runs on your hardware, so PromptQuorum sees token usage directly.
- APIs (OpenAI, Anthropic): It depends. With direct API integration, Frontend tokens are visible. Via a third-party endpoint or public interface, they are not.
Tested in PromptQuorum: 20 identical research-summary prompts dispatched to GPT-4o and GPT-4o mini. Output quality matched on 17 of 20 tasks. Cost difference: $0.003 per prompt (GPT-4o) vs $0.00007 per prompt (mini), a 43× cost reduction. On the 3 tasks where GPT-4o outperformed, complexity involved multi-step reasoning across documents.
Token Cost Recipes: Common Scenarios
Use these templates as starting points for optimizing costs in specific workflows.
- "Quick lookup / yes-no task": Use GPT-4o mini or Haiku. Minimal system prompt (≤50 tokens). No conversation history. Constrain output to 1–2 sentences. Total cost per task: ~$0.00001–0.0001.
- "Long research task (5–10 turns)": Use Claude 4.6 Sonnet (excellent at long context). After every 5 turns, summarise the conversation and replace history with a summary (cuts tokens by 70%). Cost: ~$0.01–0.05 per research session.
- "Automated pipeline / batch processing": Use GPT-4o mini for filtering or classification (33× cheaper). Only escalate to GPT-4o for final synthesis on borderline cases. Batch similar prompts to reuse context caching where the API supports it.
- "Privacy-sensitive workflow": Route to Ollama or LM Studio running locally. Manage the context window: 4k–8k tokens for 8GB VRAM, 16k–32k for 16GB. Zero API costs. Accept slightly lower quality for compliance.
- "Comparing outputs across models": Send one well-structured prompt to GPT-4o, Claude 4.6 Sonnet, and Claude Haiku 4.5 simultaneously. Compare quality and cost. Pick the cheapest that meets your quality bar. Discovery cost: ~$0.001. Ongoing cost: 33–43× savings.
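The "summarise every 5 turns" recipe amounts to simple history management. A sketch, where the `summarise` argument is a placeholder for an actual LLM summarisation call:

```python
def compact_history(messages: list[dict], summarise, every: int = 5) -> list[dict]:
    """Once the history holds `every` user messages, replace it with a single
    summary message so later calls re-send far fewer tokens."""
    user_turns = sum(1 for m in messages if m["role"] == "user")
    if user_turns < every:
        return messages  # still cheap: keep the full history
    summary = summarise(messages)  # e.g. one cheap call to GPT-4o mini
    return [{"role": "system",
             "content": f"Summary of the conversation so far: {summary}"}]

history = [{"role": "user", "content": f"question {i}"} for i in range(5)]
compacted = compact_history(history, lambda msgs: "five questions about token costs")
print(len(compacted))  # 1
```

Running the summarisation itself on a cheap model keeps the maintenance cost a fraction of the re-billing it avoids.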
Common Mistakes That Spike Your Token Bill
Avoid these token-wasting patterns.
- Sending full conversation history on every call: If a conversation is 5,000 tokens after 10 turns, you're paying 5,000 tokens again on turn 11 even though only 200 tokens are new. Solution: Summarise every 5 turns or use prompt caching if the API supports it.
- Using a high-capability model for simple tasks: Don't use GPT-4o for "extract the date from this email." Use GPT-4o mini or Haiku. Cost difference: 33× on this task alone.
- Not constraining output length: A vague "tell me about X" prompt can return 500 tokens when "summarise in 50 words" returns 60 tokens. You pay 8× more for the verbose response.
- Repeating long system prompts on every call: If your system prompt is 500 tokens and you make 100 API calls, that's 50,000 wasted tokens if you're not reusing or caching it. Use system prompt templates or request-level caching.
- Forgetting image tokens: A single high-resolution image can consume 500–2,000 tokens depending on resolution. Downscale images or crop to the relevant region before uploading.
- Running manual test calls instead of batching: Testing 20 variations of a prompt costs 20× the token cost of one call. Use batch APIs or PromptQuorum's multi-model comparison to test all variations in one shot.
- Switching models mid-conversation: Cloud APIs (OpenAI, Anthropic) don't carry over conversation context between models. Restarting the conversation on a different model re-sends all prior messages. Commit to one model per conversation.
FAQ
How many tokens is a typical article or report?
A 1,000-word article ≈ 1,200–1,500 tokens. A 10-page PDF ≈ 4,000–6,000 tokens. A single high-resolution image ≈ 500–2,000 tokens depending on resolution and content density.
Why is my API bill higher than expected even with short prompts?
Three common causes: (1) You're sending full conversation history on every call; summarise after 5 turns. (2) Your system prompt is long; trim it to essentials. (3) You're using a high-capability model for simple tasks; switch to GPT-4o mini or Haiku for classification or short Q&A.
Does a longer system prompt always mean better output?
No. A well-crafted 100-token system prompt often outperforms a rambling 500-token prompt. Quality beats quantity. Specificity beats verbosity.
Can I cache my system prompt to save costs?
OpenAI and Anthropic both offer prompt caching for long system prompts or repeated prefixes. OpenAI discounts cached input tokens and applies caching automatically to sufficiently long prompt prefixes; Anthropic prices cache reads at a small fraction of the base input rate but requires you to mark cache breakpoints in the request. Check your API documentation for the current discount rates and how to enable it.
Do local LLMs really have zero cost?
Zero per-token API cost, yes. But hardware costs money: GPU VRAM (8GB = ~$100, 16GB = ~$200), electricity, and your time to manage the local setup. For one-off queries this is uneconomical. For 1,000+ queries per day, local models break even quickly.
How do I estimate costs before running a big batch?
Estimate: (average input tokens per prompt × number of prompts × input cost per 1M) + (average output tokens per prompt × number of prompts × output cost per 1M). PromptQuorum does this automatically before you run a batch: input your prompt and desired model, and it forecasts total spend.
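That formula as a sketch (the figures in the usage line are illustrative; rates are the GPT-4o mini prices from the table above):

```python
def estimate_batch_cost(avg_input_tokens: int, avg_output_tokens: int,
                        n_prompts: int,
                        input_price_per_mtok: float,
                        output_price_per_mtok: float) -> float:
    """Forecast total dollar spend for a batch before running it."""
    input_cost = avg_input_tokens * n_prompts * input_price_per_mtok / 1_000_000
    output_cost = avg_output_tokens * n_prompts * output_price_per_mtok / 1_000_000
    return input_cost + output_cost

# 1,000 prompts, ~400 input and ~250 output tokens each, on GPT-4o mini:
print(round(estimate_batch_cost(400, 250, 1000, 0.15, 0.60), 3))  # 0.21
```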
Is GPT-4o worth the cost vs GPT-4o mini?
For most tasks, GPT-4o mini is the better choice. GPT-4o mini costs 33× less per token and handles classification, short Q&A, data extraction, and routine summarisation with comparable accuracy. Reserve GPT-4o for tasks requiring multi-step reasoning, code generation, nuanced analysis, or long-form structured writing: tasks where you can measure the quality difference.
How do Claude and GPT-4o token costs compare?
As of March 2026: Claude 4.6 Sonnet and GPT-4o are priced similarly ($3.00/$15.00 vs $5.00/$15.00 per million input/output tokens). Claude 4.6 Sonnet is 40% cheaper on input; GPT-4o output costs are identical. For high-volume input-heavy workflows (large documents, long system prompts), Claude has a cost advantage. For output-heavy workflows (long essays, long code), costs are equivalent.