Quick Facts
- Coined: Wei et al. (2022), Google Brain; foundational NeurIPS 2022 paper
- Key result: Zero-shot CoT raised MultiArith accuracy from 17.7% to 78.7% on InstructGPT (text-davinci-002) (Kojima et al., 2022)
- Trigger phrase: "Let's think step by step" (zero-shot) or worked examples (few-shot)
- 2026 evolution: Built-in reasoning modes (o3, Claude extended thinking) now automate CoT internally
- Cost impact: Prompt-level CoT = more output tokens; reasoning modes = separate thinking token budget
- Best models for prompt-level CoT: Non-reasoning models (Claude Haiku 4.5, Gemini Flash, GPT-4o mini, Llama 4 Scout)
What Chain-of-Thought Prompting Is
Chain-of-thought prompting asks the model to reason step by step before giving a final conclusion. Instead of returning only "the answer," the model writes out intermediate calculations, logical steps, or explanations.
You can trigger this behavior with instructions like "think step by step" or "show your reasoning," or by providing worked examples where the reasoning is explicit. The result is a trace you can read to understand how the model reached its conclusion.
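As a minimal sketch of the instruction-based trigger, the helper here appends a trigger phrase to a question in a simple Q/A layout. The name `with_cot_trigger` and the sample question are illustrative assumptions, not part of any library.

```python
def with_cot_trigger(question: str, trigger: str = "Let's think step by step.") -> str:
    """Format a question with a zero-shot chain-of-thought trigger.

    `with_cot_trigger` is a hypothetical helper used only for illustration.
    """
    return f"Q: {question.strip()}\nA: {trigger}"


# Hypothetical sample question.
prompt = with_cot_trigger("A train travels 120 km in 2 hours. What is its average speed?")
print(prompt)
```

The returned string is what you would send to the model; the model then continues from the trigger phrase with its reasoning.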
Why Chain-of-Thought Prompting Matters
Chain-of-thought prompting matters because it makes model behavior more transparent on tasks that involve multi-step reasoning. When you see each step, you can spot misinterpretations, missing assumptions, or arithmetic errors.
This is especially valuable in domains like analytics, planning, and troubleshooting. Instead of a single opaque output, you get a narrative that can be checked, corrected, or reused as documentation.
🔍 Works with local models
Chain-of-thought generally works on models with roughly 7B+ parameters, though the benefit varies by model. Test it locally with Ollama or LM Studio.
When Chain-of-Thought Helps (and When It Doesn't)
Chain-of-thought prompting helps most on tasks that naturally break into clear steps, but it is not necessary for every prompt. It shines wherever the path is as important as the destination.
Chain-of-thought reasoning also helps a tool-calling agent stay reliable across multi-step tasks. For a local agent stack that gives a CoT-capable model the ability to query databases and edit files, see Local AI Agents With MCP.
Good use cases include:
- Math and quantitative reasoning problems.
- Multi-step logical puzzles or decision analyses.
- Root-cause analysis, incident postmortems, and trade-off discussions.
- Planning tasks where the sequence of actions must be explicit.
For simple classification, quick copywriting, or short factual answers, chain-of-thought often adds verbosity without much extra value. In sensitive domains, you may also want to keep reasoning internal and show only the final answer to end users.
Example: Without vs With Chain of Thought
The difference becomes clear when you compare a direct-answer prompt with one that explicitly asks for reasoning. Here is a simple decision example.
Bad Prompt
"Which project should we prioritize next quarter?"
Good Prompt
"You are a product operations manager. We have three candidate projects for next quarter. Use chain-of-thought reasoning to decide which project to prioritize. 1) List the decision criteria you will use (for example revenue impact, risk, alignment with strategy). 2) Evaluate each project against these criteria step by step. 3) Make a clear recommendation and justify it in 3–5 sentences. At the end, provide a short final answer starting with `Recommendation:` on a separate line."
In the "good" version, the model explains how it chose its criteria, how each project scores, and then states a recommendation you can challenge or accept.
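Because the good prompt asks for a final line starting with `Recommendation:`, downstream code can pull that line out and ignore the rest of the trace. The following is a sketch under that assumption; `extract_recommendation` and the sample response are hypothetical, not output from a real model.

```python
from typing import Optional


def extract_recommendation(response: str, marker: str = "Recommendation:") -> Optional[str]:
    """Return the text after the last line that starts with `marker`, or None."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith(marker):
            return stripped[len(marker):].strip()
    return None


# Hypothetical model response, written by hand for illustration.
sample_response = """Criteria: revenue impact, risk, strategic alignment.
Project A scores high on revenue but carries high risk.
Project B is low risk and well aligned with strategy.
Recommendation: Prioritize Project B."""

print(extract_recommendation(sample_response))
```

Scanning from the end means that a stray mention of the marker earlier in the reasoning does not override the final answer line.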
How to Write Effective Chain-of-Thought Prompts
To write effective chain-of-thought prompts, you should define the structure of the reasoning and the structure of the final answer. Vague requests like "explain more" are less reliable than concrete instructions.
A practical pattern is:
- Tell the model its role (for example "You are a senior data analyst.").
- Specify that it should think step by step or use chain-of-thought.
- Define the sections of reasoning you expect (for example assumptions, calculations, comparison, conclusion).
- Ask for a short, clearly marked final answer at the end so you can use it quickly.
This separates the detailed reasoning from the concise output, which is helpful when you integrate the result into other tools or reports.
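The pattern above can be sketched as a small prompt builder. The function name `build_cot_prompt`, the `Final answer:` marker, and the sample task are assumptions made for illustration; adapt the wording to your own prompts.

```python
def build_cot_prompt(role: str, task: str, sections: list[str],
                     answer_marker: str = "Final answer:") -> str:
    """Assemble a chain-of-thought prompt: role, task, reasoning sections, final answer."""
    numbered = "\n".join(f"{i}) {name}" for i, name in enumerate(sections, start=1))
    return (
        f"You are {role}.\n"
        f"{task}\n"
        "Think step by step and organize your reasoning into these sections:\n"
        f"{numbered}\n"
        f"End with a single line starting with `{answer_marker}`."
    )


# Hypothetical role and task.
prompt = build_cot_prompt(
    role="a senior data analyst",
    task="Decide whether last month's churn increase is large enough to act on.",
    sections=["Assumptions", "Calculations", "Comparison", "Conclusion"],
)
print(prompt)
```

Keeping the answer marker as a parameter makes it easy to reuse the same marker when parsing the response.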
Chain-of-Thought Prompting in PromptQuorum
PromptQuorum is a multi-model AI dispatch tool where you can apply chain-of-thought prompting consistently across different models. You write one structured chain-of-thought prompt and send it to several providers in parallel.
In PromptQuorum, you can:
- Combine chain-of-thought instructions with reasoning-focused frameworks such as TRACE or APE so that thinking steps are explicitly labeled.
- Compare how different models handle the same reasoning task and inspect their step-by-step traces side by side.
- Save chain-of-thought prompts as templates for recurring analyses, incident reviews, or strategic decisions.
This turns chain-of-thought prompting from a one-off trick into a repeatable part of your decision-making process.
How to Use Chain-of-Thought (CoT) Prompting
1. For logic, reasoning, or debugging tasks, ask the model to 'think step by step' before answering. Instead of 'What is the bug?', ask 'Trace the execution step by step, then identify the bug.'
2. Provide a worked example showing step-by-step reasoning. Don't just describe it; show the model what step-by-step reasoning looks like. Example: 'First, I check the function signature... Then, I trace the first call with input X...'
3. Use explicit prompts like 'Let's think step by step' or 'First, identify... Then...' These trigger more deliberate reasoning in the model.
4. For complex problems, ask the model to trace intermediate outputs. Example: 'Trace the execution of this function for input 5. Show the value of each variable after each line.'
5. Combine CoT with verifiable outputs: ask the model to show its work so you can audit it. 'Explain your reasoning at each step. If you make a mistake, I should be able to spot it from your shown work.'
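The worked-example step can be sketched as a few-shot prompt builder: each example carries its reasoning, and the new question is left open for the model to complete. `build_few_shot_cot` and the egg example are hypothetical and hand-written for illustration.

```python
def build_few_shot_cot(examples: list[tuple[str, str, str]], question: str) -> str:
    """Build a few-shot CoT prompt from (question, reasoning, answer) triples."""
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in examples
    ]
    # The final block is left open so the model writes its own reasoning.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


# One hand-written worked example (hypothetical).
examples = [
    (
        "A box holds 4 rows of 6 eggs. 5 eggs are broken. How many eggs are intact?",
        "4 rows of 6 eggs is 4 * 6 = 24 eggs. Removing 5 broken eggs leaves 24 - 5 = 19.",
        "19",
    )
]

prompt = build_few_shot_cot(examples, "A shelf has 3 stacks of 8 books. 7 are lent out. How many remain?")
print(prompt)
```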
Math Example: Revenue Calculation
Without CoT, a model might give a single final answer. With CoT, the model shows calculations step by step.
Without CoT:
"A customer buys 50 units at $15 each, but gets a 10% discount. What do they pay?"
Model: "$675"
With CoT:
"A customer buys 50 units at $15 each, but gets a 10% discount. Work through this step by step: 1) Calculate the subtotal. 2) Calculate the discount amount. 3) Subtract the discount from the subtotal to get the final price."
Model: "1) Subtotal = 50 × $15 = $750. 2) Discount = 10% of $750 = $75. 3) Final price = $750 − $75 = $675."
Both give the same answer, but the CoT version lets you see the math and catch errors (e.g., if someone miscalculates 10% of $750).
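One way to audit a trace like this is to recompute each step and compare it with the value the model claimed. This sketch uses `Decimal` for exact currency arithmetic; the `claimed` dictionary is transcribed by hand from the CoT answer above, and the variable names are assumptions.

```python
from decimal import Decimal

units = 50
unit_price = Decimal("15")
discount_rate = Decimal("0.10")

# Recompute each step of the trace independently.
subtotal = units * unit_price
discount = subtotal * discount_rate
final_price = subtotal - discount
computed = {"subtotal": subtotal, "discount": discount, "final": final_price}

# Values claimed in the model's step-by-step answer (transcribed by hand).
claimed = {"subtotal": Decimal("750"), "discount": Decimal("75"), "final": Decimal("675")}

# Collect any step where the claimed value differs from the recomputed one.
mismatches = {
    step: (claimed[step], computed[step])
    for step in claimed
    if claimed[step] != computed[step]
}
print(mismatches or "All steps check out")
```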
CoT vs Built-In Reasoning Models (2026)
In 2026, the frontier models—Claude Opus 4.7, OpenAI o3, Gemini Deep Think—have built-in reasoning modes that internalize chain-of-thought automatically. You do not need to add "think step by step" instructions on these models.
When to use prompt-level CoT: Non-reasoning models (Claude Haiku 4.5, GPT-4o mini, Gemini Flash, Llama 4), local LLMs, or when you want to avoid the extra cost of reasoning token budgets.
When to use built-in reasoning modes: Maximum accuracy on frontier models, math-heavy tasks, complex analysis. Reasoning tokens are billed in addition to the visible answer, typically at the output-token rate, so long reasoning traces raise the cost per request.
| Approach | Best For | Cost | Transparency | Models |
|---|---|---|---|---|
| Prompt-level CoT ("think step by step") | Small models, local LLMs, cost-sensitive tasks | Increases output tokens | Full: visible steps in output | Haiku, Flash, Llama, Qwen |
| Claude extended thinking (Opus 4.7, Sonnet 4.6) | Complex analysis, maximum accuracy | Separate thinking token budget (billed as output tokens) | Inspectable trace via API | Claude Opus 4.7, Claude Sonnet 4.6 |
| OpenAI o3 | Hardest problems (math, coding, competition) | Reasoning token budget (higher tier) | Hidden reasoning, visible output | OpenAI o3 |
| Gemini Deep Think | Google Cloud integration, Gemini ecosystem | Thinking tokens separate from output; depth set by thinking_level parameter (LOW, MEDIUM, HIGH) | Hidden reasoning, visible output | Gemini 3.1 Pro |
| DeepSeek R1 | Open-weights option, on-device reasoning | Visible reasoning streamed as output text | Full: inline CoT in output | DeepSeek R1 |
💡 Pro Tip
If you're building for cost, use prompt-level CoT on smaller models. If you're building for accuracy on hard problems, use o3 or Claude extended thinking and let the model handle reasoning internally.
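To make the cost side of this trade-off concrete, here is a rough per-request estimate for prompt-level CoT, where the reasoning is written as ordinary output tokens. The prices and token counts are hypothetical placeholders, not any provider's real rates; substitute your own.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars for one request, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices and token counts, chosen only to illustrate the ratio.
INPUT_PRICE = 1.00   # dollars per million input tokens (assumed)
OUTPUT_PRICE = 5.00  # dollars per million output tokens (assumed)

direct_cost = request_cost(200, 20, INPUT_PRICE, OUTPUT_PRICE)   # short direct answer
cot_cost = request_cost(230, 400, INPUT_PRICE, OUTPUT_PRICE)     # answer with written reasoning

print(f"direct: ${direct_cost:.6f}  CoT: ${cot_cost:.6f}  ratio: {cot_cost / direct_cost:.1f}x")
```

The per-request amounts are tiny, but the ratio is what matters at volume: most of the extra cost comes from the longer output, not from the slightly longer prompt.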
Chain-of-Thought Variants and Extensions
Beyond the basic "think step by step" pattern, researchers have developed several CoT variants, each optimized for different problem types.
- Zero-shot CoT: Ask "Let's think step by step" with no examples. Works on most models and is the simplest to implement. Boost: roughly 10–20% accuracy improvement on reasoning tasks as a rule of thumb, though results vary widely by model and benchmark.
- Few-shot CoT: Show 2–5 worked examples where the reasoning is explicit, then ask the model to apply the same pattern to a new problem. More reliable than zero-shot but requires manual example creation. Boost: roughly 20–40% accuracy as a rule of thumb, again depending on model and benchmark.
- Self-consistency (Wang et al., 2023): Generate multiple CoT reasoning paths independently, then take a majority vote on the final answer. Significantly more robust to errors. Boost: Wang et al. report absolute accuracy gains of about 4 to 18 percentage points over single-path CoT, depending on the benchmark (for example +17.9 on GSM8K).
- Tree of Thought (ToT): Instead of a linear chain, explore multiple reasoning branches and prune poor ones. Use when there are many possible solution paths (planning, game-playing, creative tasks).
- ReAct (Reasoning + Acting): Interleave reasoning with external actions—call APIs, search databases, or run code—and incorporate the results back into the next reasoning step. Best for real-world tasks that need live data or verification.
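The self-consistency variant in the list above reduces to a majority vote over final answers. This sketch assumes each sampled trace ends with a `Final answer:` line; the traces are hand-written stand-ins for independently sampled model outputs, and both function names are hypothetical.

```python
import re
from collections import Counter
from typing import Optional


def extract_answer(trace: str) -> Optional[str]:
    """Pull the text after 'Final answer:' from one reasoning trace."""
    match = re.search(r"Final answer:\s*(.+)", trace)
    return match.group(1).strip() if match else None


def self_consistency(traces: list[str]) -> tuple[str, float]:
    """Return the majority answer and its share of the parsed answers."""
    answers = [a for a in map(extract_answer, traces) if a is not None]
    if not answers:
        raise ValueError("no trace contained a final answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Stand-ins for three sampled reasoning paths (hypothetical).
traces = [
    "50 * 15 = 750. 10% of 750 = 75. 750 - 75 = 675.\nFinal answer: 675",
    "Discounted unit price is 13.50. 50 * 13.50 = 675.\nFinal answer: 675",
    "50 * 15 = 750. 10% of 750 = 65. 750 - 65 = 685.\nFinal answer: 685",
]

answer, agreement = self_consistency(traces)
print(answer, f"{agreement:.0%}")
```

The agreement share doubles as a cheap confidence signal: a low share suggests the question deserves a closer look or more samples.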
Model Comparison: How Models Handle CoT Prompting (2026)
| Model | Prompt-Level CoT | Built-In Reasoning | Best Use Case | Cost (approx.) |
|---|---|---|---|---|
| Claude Opus 4.7 | Not needed | Extended thinking (inspect trace via API) | Maximum accuracy analysis | Higher (input + output + thinking tokens) |
| Claude Sonnet 4.6 | Not needed | Extended thinking | Balanced accuracy/cost | Medium |
| Claude Haiku 4.5 | Recommended | None | Fast, cost-effective reasoning | Low |
| OpenAI o3 | Not needed | Effort levels (low, medium, high, xhigh) | Competition-level problems | Very high (reasoning token tier) |
| GPT-4o mini | Recommended | None | Budget-conscious deployment | Very low |
| Gemini 3.1 Pro | Works | Deep Think (thinking_level param) | Google Cloud integration | Medium-high |
| Gemini Flash | Recommended | None | Fast responses | Low |
| DeepSeek R1 | Not needed | Inline reasoning in output | Open-weights, on-device | Free open weights (compute cost if self-hosted) |
| Llama 4 | Recommended | None | Local deployment, privacy | Self-hosted (compute-dependent) |
Frequently Asked Questions
Does chain-of-thought work on all models?
Chain-of-thought works on most models with 7B+ parameters, but the benefit varies. It's most effective on mid-size and smaller models (Haiku, Flash, Llama 4). On frontier models (Claude Opus 4.7, o3), built-in reasoning modes are often more efficient than prompt-level CoT.
Does chain-of-thought increase cost?
Yes, prompt-level CoT increases the number of output tokens (since the model writes out reasoning before the final answer). Built-in reasoning modes (Claude extended thinking, OpenAI o3) spend a thinking token budget on top of the visible answer, and those tokens are typically billed as output tokens. Test both on your use case to compare the cost vs accuracy tradeoff.
When should I use few-shot CoT instead of zero-shot?
Use zero-shot CoT first—it's simpler and works in most cases. Move to few-shot (with 2–5 examples) if zero-shot is unreliable or if your domain requires specific reasoning patterns (e.g., financial analysis with standard line-item structure).
Can I combine chain-of-thought with structured output (JSON)?
Yes. You can ask the model to show its reasoning in plain text first, then output a JSON object with the final answer. Combine instructions: "Think step by step. Then output your result as valid JSON." This is common in production systems.
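A minimal sketch of the parsing side, assuming the model writes its reasoning first and ends the response with a single JSON object. `split_reasoning_and_json` and the sample response are hypothetical; the parsing relies on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def split_reasoning_and_json(response: str) -> tuple[str, dict]:
    """Split a response into (reasoning text, trailing JSON object)."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(response):
        if char != "{":
            continue
        try:
            obj, end = decoder.raw_decode(response, index)
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; keep scanning
        # Accept only an object that runs to the end of the response.
        if isinstance(obj, dict) and response[end:].strip() == "":
            return response[:index].strip(), obj
    raise ValueError("no trailing JSON object found")


# Hypothetical model response, written by hand for illustration.
sample = (
    "Step 1: 50 * 15 = 750.\n"
    "Step 2: 10% of 750 = 75.\n"
    "Step 3: 750 - 75 = 675.\n"
    '{"final_price": 675, "currency": "USD"}'
)

reasoning, result = split_reasoning_and_json(sample)
print(result)
```

Requiring the JSON to be the last thing in the response keeps braces that appear inside the reasoning text from being mistaken for the result.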
What's the difference between chain-of-thought and tree-of-thought?
Chain-of-thought is a linear sequence: step 1 → step 2 → ... → conclusion. Tree-of-thought explores multiple branches (alternative reasoning paths) and prunes weaker ones before arriving at the answer. Tree-of-thought is more powerful but more expensive (requires multiple model calls).
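The branch-and-prune idea can be sketched as a beam search. In a real tree-of-thought setup both `propose` and `score` would be model calls; here they are toy stand-ins (pick numbers that sum to a target) so the control flow is runnable on its own. All names and the toy task are assumptions.

```python
from typing import Callable

State = tuple[int, ...]


def tree_of_thought(root: State,
                    propose: Callable[[State], list[State]],
                    score: Callable[[State], float],
                    depth: int, beam_width: int) -> State:
    """Expand every state on the frontier, then keep only the best few branches."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune weaker branches
    return max(frontier, key=score)


TARGET = 11  # toy goal: choose three numbers from {1, 3, 5} that sum to 11


def propose(state: State) -> list[State]:
    """Toy stand-in for 'generate next thoughts': append one more number."""
    return [state + (n,) for n in (1, 3, 5)]


def score(state: State) -> float:
    """Toy stand-in for 'evaluate a partial solution': closeness to the target."""
    return -abs(TARGET - sum(state))


best = tree_of_thought((), propose, score, depth=3, beam_width=2)
print(best, sum(best))
```

With a beam width of 1 this collapses to a single linear chain, which is one way to see why tree-of-thought costs more: every extra branch kept alive means more proposals and evaluations per level.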
Does OpenAI o3 require chain-of-thought prompting?
No. OpenAI o3 has built-in reasoning that activates automatically. You do not need to add "think step by step" instructions. Just give o3 the problem and set the effort level (low/medium/high/xhigh) to control how much reasoning budget to spend.
Can I audit the reasoning of built-in reasoning models?
Yes, but it depends on the model. Claude extended thinking reasoning traces are inspectable via API. OpenAI o3 reasoning is hidden by default (for competitive advantage). Gemini Deep Think reasoning is also hidden. For full auditability, use prompt-level CoT or DeepSeek R1.
Is chain-of-thought prompting suitable for real-time applications?
Prompt-level CoT adds latency (more output tokens = slower generation). For real-time use cases, either use smaller models with minimal reasoning, or use streaming endpoints to show tokens as they arrive. Built-in reasoning modes may add even more latency; benchmark your specific use case.
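A back-of-the-envelope latency estimate follows from output length and generation speed. The throughput, time-to-first-token, and token counts here are hypothetical placeholders; measure your own deployment before relying on them.

```python
def time_to_final_answer(output_tokens: int, tokens_per_second: float,
                         time_to_first_token: float = 0.5) -> float:
    """Seconds until the last token arrives, assuming a constant generation rate."""
    return time_to_first_token + output_tokens / tokens_per_second


# Hypothetical figures: 50 tokens/s, 20-token direct answer vs 400-token CoT answer.
direct_latency = time_to_final_answer(20, tokens_per_second=50.0)
cot_latency = time_to_final_answer(400, tokens_per_second=50.0)

print(f"direct: {direct_latency:.1f}s  CoT: {cot_latency:.1f}s")
```

Streaming does not shorten the time to the final answer, since the answer still comes after the reasoning, but it lets users watch progress instead of waiting on a blank screen.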
Sources & Further Reading
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. arXiv:2201.11903
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). "Large Language Models are Zero-Shot Reasoners." NeurIPS 2022. arXiv:2205.11916
- Wang, X., Wei, J., Schuurmans, D., et al. (2023). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." ICLR 2023. arXiv:2203.11171
- Anthropic. (2024). "Extended Thinking in Claude." Technical documentation on Claude Opus 4.7 and Sonnet 4.6 reasoning capabilities.
- OpenAI. (2026). "OpenAI o3: Reasoning Models for Competition-Level Problem Solving." OpenAI documentation and research announcements.