PromptQuorumPromptQuorum
Home/Prompt Engineering/From GPT-2 to Today: How Prompt Engineering Evolved
Fundamentals

From GPT-2 to Today: How Prompt Engineering Evolved

Β·10 min readΒ·By Hans Kuepper Β· Founder of PromptQuorum, multi-model AI dispatch tool Β· PromptQuorum

Prompt engineering evolved from informal experiments around GPT-3 in 2020 to a structured discipline with named techniques and frameworks by 2026. This timeline traces the key breakthroughs, research papers, and turning points that made prompting a core skill.

Prompt engineering evolved from informal trial-and-error around GPT-3 (2020) to a structured discipline with named frameworks, techniques, and tools by 2026, progressing through five phases: few-shot learning, chain-of-thought reasoning, the ChatGPT mainstream moment, automated optimization, and context design.

Key Takeaways

  • 2019–2020: GPT-2 and early transformers β€” prompts were inputs, not a discipline
  • 2020: GPT-3 and Brown et al. introduced few-shot prompting as a paradigm shift
  • 2022: Chain-of-Thought reasoning prompts turned prompting into a structured skill
  • Late 2022: ChatGPT brought prompt engineering into mainstream awareness and job postings
  • 2023: GPT-4, multimodal prompting, and frameworks formalised best practices
  • 2024–2026: Context design, automated prompting, and open-source LLMs redefined the field

⚑ Quick Facts

  • Β·GPT-3 (2020): 175-billion-parameter model introduced few-shot prompting as a paradigm shift
  • Β·Chain-of-Thought (2022): Wei et al. showed that prompt structure could activate step-by-step reasoning; improved GSM8K accuracy from 17.9% to 58%
  • Β·ChatGPT (Nov 2022): Reached 1 million users in 5 days, 100 million monthly active users by January 2023
  • Β·Job Market (2023): "Prompt engineer" appeared with $175K–$335K salaries; OED added "prompt" as a verb
  • Β·GPT-4 & Frameworks (2023): Multimodal inputs and formalized frameworks (CO-STAR, SPECS, RISEN) turned prompting into a teachable discipline
  • Β·Context Design (2024–2026): Open-source LLMs, 1M+ token context windows, and agent orchestration shifted focus from prompt tweaking to system-level context engineering

How Prompt Engineering Evolved: A Short Overview

Prompt engineering evolved from informal trial-and-error text manipulation around GPT-3 in 2020 to a structured discipline with named techniques, frameworks, and tools by 2026. The arc spans five phases: early few-shot experiments, the ChatGPT moment that brought the skill into mainstream awareness, the development of structured reasoning techniques, the rise of automated prompt optimisation, and the current shift toward context design.

The discipline did not emerge from a single paper or company. It grew from the overlap between research (few-shot learning, chain-of-thought reasoning, RAG), practitioner communities sharing prompt collections online, and the sudden public availability of powerful models that made good prompting immediately rewarding. By 2026, prompt engineering is no longer a niche trick β€” it is a baseline skill for anyone working with AI systems.

Before Prompt Engineering Had a Name (Pre-2020)

Before the term "prompt engineering" existed, researchers were already manipulating model inputs to elicit better outputs β€” they just did not call it that. Early transformer models like GPT-2 (2019, OpenAI) and BERT (2018, Google) were used through carefully chosen input text, but the practice was treated as part of data preprocessing, not a skill in its own right.

GPT-2, released in February 2019, was a 1.5-billion-parameter model that could complete text in surprisingly coherent ways. Researchers and early practitioners noticed that the phrasing of an input dramatically changed the quality of the completion β€” but there was no framework, no terminology, and no community built around this observation yet. Prompts were inputs, not engineering artifacts.

2020: GPT-3 and the Few-Shot Breakthrough

The modern history of prompt engineering effectively begins with GPT-3. In May 2020, OpenAI released GPT-3, a 175-billion-parameter model, alongside the landmark paper by Brown et al., "Language Models are Few-Shot Learners". The paper demonstrated that by including a few examples of the desired task directly in the prompt β€” without any weight updates to the model β€” performance on downstream tasks improved dramatically.

This was the seed of prompt engineering as a discipline. Researchers and developers realised that the same model could be turned into a translator, a summariser, a code generator, or a question-answering system simply by changing how the prompt was written. The model did not need retraining β€” it needed a better prompt. That insight reframed what a prompt was: not just an input, but a design artifact.

Brown et al. reported that few-shot performance scaled consistently with model size: the 175B GPT-3 model substantially outperformed smaller variants across every benchmark tested, establishing that scale and prompt-based learning were directly linked. This made the quality of the prompt a variable that practitioners β€” not just researchers β€” could control.

See Zero-Shot vs. Few-Shot: Which Approach Gets Better Results? for a practical guide to the technique GPT-3 made famous.

2021–Early 2022: From Prompt Tricks to a Recognised Skill

Between 2021 and early 2022, prompt crafting moved from research papers into practitioner communities. GitHub repositories with curated prompt collections appeared β€” "awesome-prompts" style lists that shared what worked for coding assistance, summarisation, and creative writing. Prompt collections, shared on Twitter and Reddit, became community assets. The Prompt Engineering Guide became one of the first dedicated references cataloguing techniques systematically.

The term "prompt engineering" began appearing more frequently in research papers, blog posts, and job descriptions through this period. OpenAI's InstructGPT paper (Ouyang et al., 2022) introduced RLHF-tuned models that responded far more reliably to natural-language instructions β€” making prompt quality even more consequential. By mid-2022, it was clear that this was a transferable skill, not just a researcher's curiosity.

2022: Chain-of-Thought and Reasoning Prompts

The introduction of Chain-of-Thought (CoT) prompting in 2022 was the most significant technical development in the discipline's short history. Wei et al. (Google Brain) published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models", demonstrating that asking a model to reason step by step before answering dramatically improved performance on arithmetic, commonsense reasoning, and symbolic reasoning tasks. In one headline result, chain-of-thought prompting improved PaLM's accuracy on the GSM8K grade-school maths benchmark from 17.9% to 58% β€” a gain achieved purely by changing the prompt structure, with no additional model training. The implication was profound: the structure of the prompt could activate different reasoning behaviour β€” not just different facts.

Related techniques followed quickly. Zhou et al. introduced least-to-most prompting, which decomposed complex problems into a sequence of simpler sub-problems solved in order. These approaches turned prompt engineering from a formatting exercise into a tool for eliciting structured reasoning from models that had not been explicitly trained to reason that way. Prompting had become scaffolding for cognition.

For the full technique guide, see Chain-of-Thought Prompting: Make AI Show Its Reasoning and Prompt Chaining: How to Break Big Tasks Into Winning Steps.

Late 2022–2023: The ChatGPT Moment and the Prompt Engineer Job Title

The release of ChatGPT on November 30, 2022, changed the public profile of prompt engineering overnight. ChatGPT reached one million users within its first five days β€” confirmed by OpenAI CEO Sam Altman on Twitter in December 2022 β€” and 100 million monthly active users by January 2023, according to a UBS analysis cited by Reuters. Within days, millions of people were experimenting with prompts and discovering that their results varied enormously based on how they phrased requests. Tech media covered "prompt engineering" as a skill worth learning. The Oxford English Dictionary added "prompt" as a verb related to AI in 2023, and the word itself became a runner-up for word of the year in multiple rankings.

By early 2023, "prompt engineer" appeared as a job title with reported salaries of $175,000–$335,000 at companies including Anthropic, according to widely cited job postings. The role attracted significant media attention β€” Bloomberg, The Guardian, and The Atlantic all covered whether prompt engineering was a real career. The consensus at the time: it was a transitional role, part human-computer interface design, part subject-matter expertise, part quality assurance.

The popularisation of the phrase "prompt engineering" is sometimes attributed to various practitioners and commentators. Richard Socher, former Chief Scientist at Salesforce, is mentioned in some commentary as having helped frame the idea early. The Wikipedia article on prompt engineering provides a balanced overview of competing claims about the term's origins.

πŸ” Did You Know

ChatGPT reached 100 million monthly active users in January 2023 β€” just two months after launch. For comparison, it took TikTok 9 months and Instagram 2.5 years to reach the same milestone. This speed of adoption is why prompt engineering went from research concept to mainstream skill almost overnight.

2023: GPT-4, Multimodal Prompting and Frameworks

The release of GPT-4 in March 2023 expanded prompt engineering in two directions simultaneously: larger context windows (up to 128K tokens in later versions) and multimodal inputs. Practitioners could now include images in prompts alongside text, opening prompt engineering to visual tasks β€” describing images, comparing diagrams, annotating charts. Early Gemini models from Google and multimodal Claude versions from Anthropic followed within months.

The same year saw the formalisation of prompt engineering best practices. OpenAI published its official prompt engineering guide. Google Cloud released its own prompt engineering documentation. Independent authors codified frameworks β€” CRAFT, CO-STAR, SPECS, RISEN, TRACE β€” that gave practitioners repeatable templates for structuring prompts, reducing the reliance on trial and error.

These frameworks represented the maturation of prompt engineering from a personal skill into a teachable, shareable practice. See Which Prompt Framework Should You Use? for a guide to choosing between them, and Beyond Text: How to Prompt with Images for the multimodal dimension.

2023–2024: Automated Prompt Engineering and RAG

A striking development in 2023 was research showing that LLMs could optimise prompts as well as humans could. Zhou et al. published "Large Language Models Are Human-Level Prompt Engineers" (APE), demonstrating that an LLM tasked with generating and evaluating prompt candidates could match or exceed human-written prompts on benchmark tasks. Stanford's DSPy framework (2023) took this further β€” allowing developers to describe what a prompt should accomplish and letting the system optimise the wording automatically.

Simultaneously, Retrieval-Augmented Generation (RAG) β€” originally introduced by Lewis et al. at Meta in 2020 β€” became a central pattern in production AI systems. RAG injected retrieved documents directly into the prompt context, grounding model outputs in real, up-to-date sources rather than requiring prompts to contain all the necessary facts. This shifted the emphasis in prompt engineering from "how do I make the model know this?" to "how do I structure the context so the model uses this correctly?"

See RAG Explained: How to Ground AI Answers in Real Data and Self-Consistency Prompting: Let the AI Check Its Own Work for coverage of the key techniques from this period.

2024–2025: From Prompt Engineering to Context Design

By 2024, a new framing began to displace the simple idea of "write a better prompt." Practitioners and researchers started referring to context engineering β€” the practice of orchestrating what goes into the full context window: the system prompt, retrieved documents, tool outputs, conversation history, and user input, all composed deliberately to guide model behaviour. The prompt was no longer a standalone artifact; it was one layer in a designed context.

Several developments accelerated this shift. Meta's Llama 3-class models (2024) made capable open-source LLMs available for private deployment, shifting some prompt engineering from cloud APIs to local infrastructure. Context windows grew to 1 million tokens or more (Gemini 1.5 Pro), making it practical to inject entire codebases, books, or document collections into a single prompt. Multi-agent frameworks like LangChain and AutoGen turned prompting into orchestration β€” one prompt triggers another model, which triggers a tool, which returns context to the next prompt.

2026 and Beyond: Prompt Engineering as a Core Literacy

As of 2026, research and commentary increasingly describe prompt engineering not as a niche job title, but as a fundamental literacy skill for knowledge workers who use AI tools. Academic papers frame structured prompting alongside reading, writing, and computation as a baseline competency for working with generative AI systems.

The role has split into two distinct tracks. The first is system and context design β€” the engineering of production AI systems where prompts form part of a larger architecture involving retrieval, agents, and evaluation pipelines. The second is everyday use β€” the ability to write clear, structured prompts that produce useful outputs without knowing the underlying architecture. Both tracks benefit from the same core principles: clear task specification, appropriate context, constraints, and output format.

What has not changed, despite more capable models and automated tools, is the fundamental principle: the clearer and more structured the input, the more reliable and useful the output. The techniques, terminology, and tooling have matured, but the core insight from the GPT-3 era remains true in 2026.

πŸ” Pro Tip

The shift from "prompt engineering" to "context design" isn't just terminology β€” it changes what you optimize. Instead of tweaking the wording of your instruction, you design what goes into the context window: system prompt, retrieved documents, conversation history, tool outputs, and user input. The prompt is one layer, not the whole thing.

Timeline: Key Milestones in Prompt Engineering

The table below summarises the key milestones from 2018 to 2026 β€” the events, papers, and model releases that shaped how prompt engineering evolved into its current form.

YearMilestoneWhy It Matters
2018–2019BERT (Google) and GPT-2 (OpenAI) releasedDemonstrated transformer models could be guided by input phrasing β€” but no formal discipline yet
2020GPT-3 and Brown et al. "Language Models are Few-Shot Learners"Established few-shot prompting as a paradigm: rewriting the prompt changes the model's behaviour without retraining
2022 (Jan)InstructGPT / RLHF (Ouyang et al., OpenAI)Models trained to follow instructions β€” made prompt quality far more consequential
2022 (May)Chain-of-Thought prompting (Wei et al., Google Brain)Proved that prompt structure could elicit step-by-step reasoning β€” turned prompting into a cognitive scaffold
2022 (Nov)ChatGPT launchBrought prompt engineering into mainstream awareness; millions began experimenting overnight
2023 (Q1)"Prompt engineer" job title reaches $300K+ salary postings; OED adds prompt as a verbDefined prompt engineering as a recognised profession and named skill
2023 (Mar)GPT-4 release; multimodal prompting with imagesExtended prompt engineering beyond text to visual inputs and large context windows
2023Frameworks formalised: CRAFT, CO-STAR, SPECS, RISEN; official guides from OpenAI and GoogleTurned prompt engineering from personal craft into teachable, shareable practice
2023–2024APE paper (Zhou et al.) and DSPy framework β€” AI-optimised promptsLLMs shown to write prompts as well as humans; automated prompt optimisation became practical
2024Llama 3-class models; context windows exceed 1M tokens (Gemini 1.5 Pro)Open-source LLMs for private deployment; massive context shifted focus to context engineering
2025 (Q1–Q2)Extended thinking / reasoning modes: Claude 4.7 Sonnet, OpenAI o3, DeepSeek R1, Gemini Deep ThinkModels internalize step-by-step reasoning; prompt-level CoT becomes optional on frontier models
2025 (Q3–Q4)LLaMA 4 (MoE); context windows reach 10M tokens on some modelsOpen-weights models reach near-frontier quality; MoE architecture reduces compute costs for self-hosting
2026Context design and multi-agent orchestration replace simple prompt tweakingPrompting becomes one layer in a composed context β€” system-level thinking required; prompt engineering skill embedded in all AI-using roles

How the History Shapes Today's Best Practices

Each phase of prompt engineering's evolution left a lasting deposit in current practice. The GPT-3 era gave us the core insight that model behaviour is shaped by input structure β€” not just content. The Chain-of-Thought era gave us explicit reasoning scaffolds: step-by-step prompting, prompt chaining, and tree-of-thought approaches. The framework era gave us reusable templates that encode best practices without requiring each practitioner to discover them from scratch.

The RAG and context-design era gave us the understanding that prompts do not exist in isolation β€” they are composed with retrieved data, system instructions, and tool outputs to form a full context. And the automated-prompting era reminded us that the principles of good prompting are measurable: better-structured prompts produce better outputs in ways that can be evaluated and optimised systematically.

FAQ: The Evolution of Prompt Engineering

Who first coined the term "prompt engineering"?

The exact origin is debated. The term appeared in research contexts as early as 2021 and gained wider use through 2022. Richard Socher is mentioned in some commentary as having helped frame the concept publicly, though no single person is credited with inventing it.

Why did prompt engineering explode in popularity after ChatGPT?

ChatGPT was the first general-purpose AI model that millions of non-researchers could use immediately, for free, without writing code. The gap between a well-crafted prompt and a vague one was visible and immediately consequential β€” better prompts produced usably better outputs. That feedback loop, experienced simultaneously by millions of people, turned prompt engineering from a research concept into a mass skill.

How did research papers influence real-world prompting techniques?

The transfer was unusually fast for AI research. Chain-of-Thought prompting (Wei et al., 2022) went from academic paper to widely used practitioner technique within months, partly because it required no tooling β€” just a change in how you wrote the prompt. Few-shot prompting from the GPT-3 paper (Brown et al., 2020) was immediately adoptable by anyone with API access. The accessibility of the techniques accelerated their spread.

Is prompt engineering becoming less important as models improve?

No β€” more capable models respond better to well-structured prompts, not less. The gains from good prompting increase as the model becomes more capable of following precise instructions. What has changed is the level of prompt engineering required for simple tasks: conversational questions now require less crafting than they did in 2021. But for complex, production-grade outputs, structured prompting remains the most reliable lever available.

What is the difference between prompt engineering and context engineering?

Prompt engineering typically refers to designing the text input to a model to improve its output. Context engineering is a broader, more recent concept that refers to orchestrating everything in the model's context window: the system prompt, retrieved documents, conversation history, tool outputs, and user input β€” all composed deliberately. Context engineering treats the prompt as one component in a designed system, not a standalone artifact.

Will automated tools replace the need to understand prompt engineering?

Automated tools like DSPy can optimise prompt wording within defined objectives, but they require a human to specify what the objective is, what constraints apply, and how to evaluate success. Understanding prompt engineering principles remains necessary to use these tools effectively β€” and to diagnose when they produce the wrong outcome. Automation removes some of the manual iteration; it does not remove the need for structured thinking.

Is prompt engineering dead in 2026?

No. The discipline has shifted, not disappeared. As models grow more capable, the work moves from syntax tricks and formatting hacks to context design β€” structuring inputs, managing retrieval, and composing tool outputs. The job title "Prompt Engineer" is narrowing, but the underlying skill is embedded in every role that uses AI: developer, analyst, marketer, researcher. Effective AI adoption still correlates strongly with how well users frame tasks for the model.

Do I need to learn prompt engineering if AI models keep improving?

Yes β€” but the focus shifts with each generation. Better models reduce the need for elaborate workarounds (special tokens, repetitive reinforcement, rigid formatting constraints) and increase the payoff for clear intent, structured context, and well-chosen examples. The fundamentals β€” role, context, format, constraints β€” remain stable across every model generation. Learning them now means the skill compounds rather than expires.

What is the difference between prompt engineering and fine-tuning?

Prompt engineering changes how you talk to a model without modifying its weights. Fine-tuning retrains a model on new data to change its behaviour permanently. Prompt engineering is faster, cheaper, and reversible β€” you can iterate in minutes. Fine-tuning is better when the target behaviour is consistent, high-volume, or impossible to describe reliably in a prompt. Most teams start with prompting and fine-tune only when prompting approaches a ceiling on their specific task.

Common Misconceptions About Prompt Engineering

❌ Prompt engineering is only about writing better sentences.

Why it hurts: This overlooks the structural and contextual dimensions. A prompt's effectiveness depends not just on word choice but on role assignment (assigning the model a persona), constraint specification, output format, and example selection β€” all structural elements that have nothing to do with grammar.

Fix: Think of prompt engineering as designing a system where the prompt is the interface. Invest in structure: assign roles ("You are a...", "Assume..."), specify constraints ("Do not...", "Must include..."), define output format, and provide examples. Structure often matters more than eloquence.

❌ Better models make prompt engineering irrelevant.

Why it hurts: A more capable model is like a more capable person β€” it responds *better* to clear instructions, not worse. The gains from good prompting compound as model capability increases. What changes is the *kind* of prompting needed, not whether it's necessary.

Fix: Assume prompting will remain central to AI work. What evolves is the level of detail and scaffolding needed. With weaker models, you may need explicit step-by-step structure. With stronger models, a clear one-line instruction may suffice β€” but that directness is itself a prompt engineering choice.

❌ Automated prompt optimization tools will replace human prompt engineering.

Why it hurts: Automation tools like DSPy help optimize wording within defined objectives, but a human must still specify the objective, constraints, success criteria, and evaluation method. Automation removes iteration drudgery; it does not remove the need for structured thinking about what the model should do.

Fix: Use automation as a tool, not a replacement. Start with a well-structured prompt designed by a human who understands the task. Use tools like DSPy to refine and optimize it. The human judgement about task structure remains irreplaceable.

Sources

  • Brown, T. et al. (2020). "Language Models are Few-Shot Learners." arXiv preprint arXiv:2005.14165. β€” OpenAI GPT-3 paper introducing few-shot prompting as a paradigm.
  • Wei, J. et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv preprint arXiv:2201.11903. β€” Google Brain paper on step-by-step reasoning prompts.
  • Ouyang, L. et al. (2022). "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155. β€” OpenAI InstructGPT paper on instruction-following via RLHF.
  • Zhou, Y. et al. (2023). "Large Language Models Are Human-Level Prompt Engineers." arXiv preprint arXiv:2211.01910. β€” Stanford APE paper on LLMs optimizing prompts.
  • Lewis, P., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems. β€” Meta paper introducing RAG.
  • Stiennon, N. et al. (2022). "Summarize, Please! A Study on Prompts for Improving LLM Summarisation." arXiv preprint. β€” Work on prompt design for factual accuracy.

Apply these techniques across 25+ AI models simultaneously with PromptQuorum.

Try PromptQuorum free β†’

← Back to Prompt Engineering

Prompt Engineering is Dead? How It Evolved Into Context Design (2026)