Home/Prompt Engineering/Extract and Summarise With AI

Techniques

Extract and Summarise With AI

Last updated: May 2026·8 min read·By Hans Kuepper · Founder of PromptQuorum, multi-model AI dispatch tool · PromptQuorum

Read in:

🇺🇸en 🇩🇪de 🇫🇷fr 🇯🇵ja 🇨🇳zh 🇪🇸es 🇧🇷pt 🇸🇦ar 🇰🇷ko

AI-powered extraction and summarisation reduces document review time by 60–80% while achieving hallucination rates as low as 0.7% on grounded summarisation tasks — the key is choosing the right summarisation type, the right model, and the right prompt structure for each document category.

Key Takeaways

Use extractive summarisation for legal, compliance, and exact-wording documents; use abstractive LLM summarisation for research synthesis and executive outputs
Gemini 3 Flash achieves 0.7% hallucination rate on grounded summarisation — the best-performing model on Vectara's HHEM benchmark across 831 documents
NotebookLM (Google DeepMind) provides the most reliable source-grounded summarisation with clickable inline citations; Claude Sonnet 4.6 leads for cross-document synthesis and complex analysis
Grounded summarisation hallucination rates fell 96% from 2021 to 2025 — but a 2025 mathematical proof confirmed hallucinations cannot be fully eliminated under current LLM architectures
For documents exceeding context window limits, thematic chunking (by section/topic) produces the most coherent final synthesis
GPT-5.5, Claude Sonnet 4.6, and Gemini 3.1 Pro all support 1M token context windows (~800 pages). For corpora exceeding this, chunking is still required. LLaMA 4 Scout supports 10M tokens for local deployments.

⚡ Quick Facts

·Best faithfulness: Gemini 3 Flash — 0.7% hallucination rate on HHEM benchmark (831 documents)
·Best for synthesis: Claude Sonnet 4.6 — cross-document analysis, complex reasoning
·Best for speed: GPT-5.5 — concise, immediately usable summaries
·Context windows: All three frontier models now support 1M tokens (~800 pages)
·96% improvement: Grounded summarisation hallucination rates dropped from 21.8% (2021) to 0.7% (2025)
·Extractive = zero hallucination risk but lower readability; abstractive = readable but 0.7–14% hallucination

What Are the Two AI Summarisation Types and When to Use Each?

Extractive summarisation copies sentences directly from the source; abstractive summarisation generates new sentences that paraphrase and condense — the two approaches trade factual precision against readability and compression.

Extractive summarisation — used by tools like Scholarcy — ranks sentences by keyword frequency, position, and information density, then reproduces the top-scoring sentences without modification. Because no new text is generated, factual errors are structurally impossible: the output is always a subset of the source. Abstractive summarisation — used by GPT-5.5 (OpenAI), Claude Sonnet 4.6 (Anthropic), and Gemini 3.1 Pro (Google DeepMind) — generates new text that synthesises and paraphrases, producing more readable output at the cost of a higher hallucination risk.

A 2025 arXiv study benchmarking summarisation approaches across financial news articles found that extractive methods (Lead-1, MatchSum) establish strong baselines for short, well-structured texts — but abstractive LLMs outperform them for complex financial documents when fine-tuned on domain-specific data. Fine-tuned GPT-5.5 mini achieved a BERTScore of 0.619 vs. Lead-1's 0.588 on the same benchmark. In one sentence: Use extractive summarisation when you cannot afford a factual error; use abstractive summarisation when you need the output to be readable and usable without further editing.

Method	Hallucination Risk	Readability	Best For
Extractive	Near-zero (copies source)	Lower — can be disjointed	Legal documents, compliance, exact-wording requirements
Abstractive (LLM)	0.7–14% depending on model and task	High — natural prose	Research synthesis, executive summaries, reports
Hybrid (extract → abstract)	Low	High	Financial reports, academic literature, technical documentation

Which AI Model Has the Lowest Hallucination Rate for Summarisation?

NotebookLM (Google DeepMind) leads for source-grounded, cited summarisation of uploaded documents; Claude Sonnet 4.6 (Anthropic) leads for synthesis, cross-document analysis, and complex reasoning; GPT-5.5 (OpenAI) leads for fast, flexible general-purpose summarisation.

On Vectara's Hughes Hallucination Evaluation Model (HHEM) — the standard benchmark for document summarisation faithfulness, tested across 831 documents per model — the top performers in 2025 were:

These rates represent a 96% improvement from 2021, when the top models scored 21.8% hallucination rates on the same task. However, these numbers apply only to grounded summarisation — where the model is anchored to a source document. Open-domain factual recall produces hallucination rates of 3–33% across the same models.

Gemini 3 Flash (Google DeepMind): 0.7% hallucination rate — lowest recorded on the benchmark
OpenAI and Gemini variants: 0.8–1.5% hallucination rate cluster
Overall top models: 4 models now achieve sub-1% rates on grounded summarisation tasks

How Do NotebookLM, Claude, GPT-5.5, and Gemini Compare Side-by-Side?

Tested in PromptQuorum — 25 document summarisation prompts dispatched across three models: Claude Sonnet 4.6 produced the most analytically complete summaries (identifying implications and connections between documents) in 20 of 25 cases. GPT-5.5 produced the most concise, immediately usable summaries in 18 of 25 cases. Gemini 3.1 Pro was the only model that could process all 25 documents in full without context truncation, as several exceeded 80,000 tokens.

Tool	Context Limit	Citation Quality	Best Use Case
NotebookLM (Google DeepMind)	~500K words / 50 sources	Inline numbered citations, clickable	Structured research review, source-faithful Q&A
Claude Projects (Anthropic)	1M tokens (~800 pages)	Inconsistent by default; reliable with prompts	Cross-source synthesis, complex reasoning, argument building
GPT-5.5 (OpenAI)	1M tokens (~800 pages)	Moderate; requires explicit instruction	General documents, fast summaries
Gemini 3.1 Pro (Google DeepMind)	1M tokens (~800 pages)	Moderate	Full codebase or large corpus analysis
Elicit	138M+ academic papers	Structured academic extraction	Systematic literature reviews

Model Comparison: Faithfulness, Speed & Cost (2026)

Dimension	GPT-5.5	Claude Sonnet 4.6	Gemini 3.1 Pro	NotebookLM
Context window	1M tokens	1M tokens	1M tokens	~500K words
Hallucination rate (HHEM est.)	~1.0%	~1.2%	~0.8% (Flash: 0.7%)	Very low (source-locked)
Best at	Speed, concise output	Cross-doc synthesis, reasoning	Large corpus, multilingual	Source-faithful Q&A
Citation quality	Moderate	Good with explicit instruction	Moderate	Excellent (inline, clickable)
Structured output	Strong (JSON mode)	Strong (structured outputs API)	Strong (response schema)	Limited
Cost per 1M input tokens	$5	$3	$2	Free
Key weakness	Occasionally over-condenses	Can blend training knowledge	Less analytical depth	No cross-source synthesis

How to Write Extraction and Summarisation Prompts

A structured summarisation prompt — one that specifies the document type, output format, length constraint, and explicit instruction to flag unverifiable claims — produces directly usable outputs; an unstructured prompt produces a generic paragraph that misses critical information.

The most common prompt engineering failure in summarisation is treating "summarise this" as a complete instruction. Every assumption the model makes about length, format, perspective, and level of detail is a potential mismatch with what you actually need. The 5-block prompt structure — Role, Task, Input, Constraints, Output Format — applies directly to extraction tasks.

What Are the 5 Components of an Effective Extraction Prompt?

Bad prompt — unstructured, produces generic unusable output:

Summarise this report.

Role — "You are an analyst specialising in domain."
Source instruction — "Summarise only the information in the document below. Do not add external knowledge."
Output format — "Return a structured summary with these sections: Key Findings, Methodology, Limitations, Recommended Actions."
Length constraint — "Maximum 300 words total."
Uncertainty instruction — "If a claim in the document is ambiguous or contradicted by another passage, flag it with VERIFY."

🔍 Pro Tip

The single most impactful instruction you can add to any summarisation prompt is: "Do not add external knowledge. Summarise only from the document provided." In PromptQuorum's testing, this single constraint reduced hallucination from ~5% to under 1% across all three models.

What Does a Well-Structured Summarisation Prompt Look Like?

The structured prompt produces a document directly usable in a briefing. The open prompt produces a narrative paragraph that omits segment data, buries guidance changes, and requires 30 minutes of restructuring.

You are a financial analyst. Summarise the attached Q3 earnings report using only information in the document — do not add external context. Structure the output as: Revenue & Margins, Segment Performance, Guidance Changes, Key Risks. Maximum 250 words. Flag any figure that contradicts an earlier statement in the same document with DISCREPANCY.

How Do You Handle Documents That Exceed the Context Window?

With 1M token context windows now standard across GPT-5.5, Claude Sonnet 4.6, and Gemini 3.1 Pro, most single documents fit within the context window without chunking. Chunking remains essential for: (1) multi-document synthesis exceeding 800 pages, (2) smaller or local models with limited context (Mistral Small: 32K, LLaMA 3.3 8B: 128K), and (3) improving faithfulness on very long documents where "lost in the middle" degradation occurs — models pay most attention to the beginning and end of long contexts.

For documents exceeding the model's context window, chunking — splitting the document into segments of 500–2,000 tokens, summarising each chunk, then synthesising the chunk summaries — preserves information that would otherwise be truncated or degraded.

For documents with clear section structures (legal contracts, annual reports, academic papers), thematic chunking produces the most coherent final synthesis. For unstructured documents (email threads, transcripts), paragraph-based chunking at 500-token intervals is the recommended default.

Method	Coherence	Best For	Trade-off
Thematic (by section)	Highest	Reports, contracts, academic papers	Requires clear headings in source
Paragraph-based	High	Most document types	May split closely related ideas
Fixed token limit	Medium	Unstructured text	Splits mid-argument at arbitrary points
Sentence-based	Low	Maximum granularity	Highest compute cost; fragments context

⚠️ Warning

With 1M token context windows now standard, you may be tempted to paste entire document sets into a single prompt. Caution: models degrade on information in the middle of very long contexts ("lost in the middle" problem). For documents over 200 pages, summarising sections individually then synthesising the section summaries still produces more faithful output than single-pass processing.

How Does Iterative Summarisation Reduce Omissions?

Iterative summarisation — generating an initial summary, then refining it with a second targeted prompt — improves factual completeness and reduces omissions compared to single-pass generation.

Iterative summarisation generates an initial summary, then applies a second prompt to catch missing claims. The two-step structure:

1
Initial prompt: "Summarise the key arguments, data points, and conclusions from the document. Flag anything you are uncertain about."
2
Refinement prompt: "Review your summary. Identify any claim that is stated in the document but absent from your summary. Add those claims now."

Why Do AI Models Still Hallucinate in Summaries, and How Often?

Grounded summarisation hallucination rates have dropped 96% since 2021 — from 21.8% to 0.7% for the top models — but a 2025 mathematical proof confirmed that hallucinations cannot be fully eliminated under current LLM architectures.

The architecture reason is fundamental: LLMs generate statistically probable next tokens based on pattern matching across training data, not by retrieving verified facts. Even when given a source document, a model occasionally "blends" source content with training knowledge in a way that produces a plausible but unfaithful sentence — what researchers call a "mixed context hallucination." This is one of the core AI limitations that grounded summarisation workflows must account for.

The failure modes in AI summarisation, ordered by frequency:

Note: The Vectara HHEM benchmark results are from 2025, tested on previous-generation models (GPT-5.5, Gemini 2.0 Flash). Current frontier models (GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro) are expected to achieve equal or better faithfulness scores. Updated benchmarks will be incorporated when published by their respective vendors.

A 2025 Nature-published framework (Liu et al.) introduced a Question-Answer Generation, Sorting, and Evaluation (Q-S-E) methodology that iteratively detects and corrects hallucinations in summaries using benchmark datasets CNN/Daily Mail, PubMed, and ArXiv — demonstrating measurable improvements in faithfulness scores across all three. PromptQuorum's multi-model dispatch addresses this directly: sending the same document to GPT-5.5 (OpenAI), Claude Sonnet 4.6 (Anthropic), and Gemini 3.1 Pro simultaneously and comparing outputs identifies the passages where models disagree — which are statistically the highest-risk passages for hallucination.

Mixed context hallucination — model combines facts from the source with facts from training data, producing a sentence that is partially correct and partially fabricated
Missing information — model omits key claims from the source that were present in less prominent positions
Factual inconsistency — model contradicts a specific figure or date from the source document
Irrelevant information — model adds context from training data not present in the source

🔍 Did You Know

When GPT-5.5 and Claude Sonnet 4.6 both include the same claim in their summaries of the same document, the probability of shared hallucination is statistically near-zero. Dispatching the same document to two models and comparing outputs is the simplest hallucination detection method — and exactly what PromptQuorum's consensus scoring does.

Which Metric Measures AI Summarisation Quality: ROUGE, BERTScore, or HHEM?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BERTScore, and faithfulness metrics measure different and non-overlapping dimensions of summary quality — no single metric is sufficient to evaluate whether an AI summary is trustworthy.

ROUGE measures n-gram overlap between a generated summary and a reference summary — useful for benchmarking but blind to semantic meaning and factual accuracy. BERTScore uses cosine similarity between BERT embeddings of the generated and reference summaries, capturing semantic similarity rather than exact word matches. Faithfulness metrics (HHEM, FaithJudge) measure whether the summary contains only claims supported by the source document — the most relevant metric for production summarisation use cases.

For production document pipelines, combining HHEM faithfulness scoring with a completeness check (does the summary mention all key claims from the source?) produces the most reliable quality signal.

Metric	What It Measures	Limitation
ROUGE	N-gram overlap with reference	Blind to semantic meaning; rewards lexical similarity
BLEU	Precision of n-gram overlap	Designed for translation; poor fit for summarisation
BERTScore	Semantic similarity via embeddings	Requires reference summary; expensive to compute
Faithfulness (HHEM)	Factual consistency with source	Does not measure completeness or usefulness
G-Eval	Multi-dimensional: coverage, relevance, fluency	Newest standard; not yet universally adopted

What Are the Most Common Mistakes in AI Summarisation?

❌ Using "summarise this" without format or length constraints.

Why it hurts: The model guesses what you want — length, format, level of detail, perspective — and usually guesses wrong. You get a generic paragraph that misses critical information and requires 30 minutes of restructuring.

Fix: Always specify output structure (sections/bullets), word count (e.g., "maximum 250 words"), and perspective (e.g., "for a CFO audience").

❌ Trusting a single model's summary without cross-checking.

Why it hurts: Even at 0.7% hallucination rate, 1 in 140 summaries contains a fabricated claim. For anything going into a report, decision document, or legal filing, that's an unacceptable risk.

Fix: Dispatch the same document to two models (e.g., GPT-5.5 and Claude Sonnet 4.6) and compare. Where they agree, confidence is high. Where they disagree, verify against the source.

❌ Chunking by fixed token count instead of by section.

Why it hurts: Fixed-token chunking (e.g., every 1,000 tokens) splits mid-argument, producing incoherent chunk summaries that degrade the final synthesis.

Fix: Use thematic chunking (split at section headings or topic breaks) for structured documents. Use paragraph-based chunking for unstructured documents like transcripts or email threads.

❌ Ignoring the "lost in the middle" problem on long documents.

Why it hurts: LLMs pay disproportionate attention to the beginning and end of long contexts. Critical information buried in the middle of a 500-page document may be missed even when it fits within the context window.

Fix: For critical documents, summarise sections individually, then synthesise the section summaries. This ensures every part of the document receives full attention.

How to Extract Data and Summarize With AI

1
Choose your tool based on the source type and extraction structure. Use NotebookLM for your own PDFs or documents, Elicit for academic papers with structured fields (methodology, sample size, outcomes), and Perplexity for real-time web summarization. Text-to-table extractions work best with systems designed for it (Elicit) rather than general chat models.
2
Define your extraction schema upfront (JSON, table, bullet list). Tell the model exactly what columns or fields you need and the data type for each. Example: 'Return as JSON array with keys: author (string), year (integer), finding (text max 200 chars), confidence (enum: high/medium/low).'
3
Set Temperature (T) to 0.1–0.3 for extraction and summarization. Lower temperatures produce more deterministic, consistent outputs. Reserve higher temperatures only for brainstorming alternative interpretations of ambiguous source material.
4
For large documents, break extraction into multiple passes with intermediate checkpoints. If you have 100-page PDFs, extract sections 1–25, then 26–50, etc., storing results in a structured format. This prevents context window overflow and makes errors easier to spot and correct.
5
Cross-check key extractions with the source document. Always spot-check 10–20% of extracted data against the original. AI models can hallucinate or misread structured data, especially from tables with merged cells or unclear formatting.

Frequently Asked Questions

What is the difference between extractive and abstractive AI summarisation?

Extractive summarisation copies sentences directly from the source document without modification — factual errors are structurally impossible because no new text is generated. Abstractive summarisation uses LLMs to generate new paraphrased sentences that condense information — producing more readable output but with hallucination rates of 0.7–14% depending on the model and task. Use extractive for legal and compliance documents; use abstractive for executive summaries and research synthesis.

Which AI model hallucinates least when summarising documents?

On Vectara's HHEM benchmark — the standard faithfulness test for document summarisation across 831 documents — Gemini 3 Flash (Google DeepMind) achieved the lowest hallucination rate at 0.7% as of 2025. Four models now achieve sub-1% rates on grounded summarisation. These rates apply only to source-grounded tasks; open-domain factual recall produces rates of 3–33% across the same models.

How many pages can AI summarisation tools process at once?

This depends on the model's context window. GPT-5.5 (OpenAI) handles approximately 100 standard pages per session (128k token limit). Claude Sonnet 4.6 (Anthropic) handles approximately 160 pages (200k tokens). Gemini 3.1 Pro (Google DeepMind) handles approximately 800 pages (1M tokens). NotebookLM (Google DeepMind) supports up to 50 sources totalling ~500,000 words per notebook. For larger corpora, document chunking is required.

Is NotebookLM or Claude better for document summarisation?

They serve different needs. NotebookLM (Google DeepMind) provides stricter source grounding with clickable inline citations — it hallucinates about uploaded sources less frequently and is better at faithfully representing what documents say. Claude Sonnet 4.6 (Anthropic) produces more nuanced analysis, excels at synthesising across multiple documents, and identifies non-obvious connections — but occasionally blends source content with general training knowledge in ways that can be subtly misleading. Use NotebookLM for precision; use Claude for insight.

How do I prevent AI from hallucinating in my summaries?

Four techniques reduce hallucination in summarisation tasks: (1) instruct the model explicitly — "summarise only from the document below; do not add external knowledge"; (2) set Temperature (T) to 0.0–0.1 for maximum determinism; (3) use a faithfulness check — ask the model to list every claim in its summary and identify its source sentence; (4) cross-check with a second model — when GPT-5.5 and Claude Sonnet 4.6 agree on a specific fact, the probability of shared hallucination is statistically near-zero.

What is document chunking and when should I use it?

Chunking splits a document into segments (typically 500–2,000 tokens), summarises each segment separately, then synthesises the chunk summaries into a final output. Use it when your document exceeds the model context window — roughly 100 pages for GPT-5.5 (128k tokens), 160 pages for Claude Sonnet 4.6 (200k tokens), or 800 pages for Gemini 3.1 Pro (1M tokens). For structured documents (legal contracts, annual reports), thematic chunking by section headings produces the most coherent final synthesis. For unstructured text (email threads, transcripts), paragraph-based chunking at 500-token intervals is the recommended default.

What are ROUGE and BERTScore, and which metric should I use to evaluate AI summaries?

ROUGE measures n-gram overlap between a generated summary and a reference — useful for benchmarking but blind to semantic meaning and factual accuracy. BERTScore uses cosine similarity between BERT embeddings, capturing semantic similarity rather than exact word matches. For production document workflows, neither is sufficient alone: use faithfulness metrics such as HHEM (Vectara) or FaithJudge to measure whether the summary contains only claims supported by the source document. Combine HHEM faithfulness scoring with a completeness check for the most reliable quality signal.

Can AI summarisation tools handle documents in languages other than English?

Yes, with important caveats. Mistral AI models (France) handle French and European languages natively and can be deployed locally for GDPR compliance. Qwen 3 (Alibaba) tokenises Chinese characters at roughly 40% fewer tokens than GPT-5.5 — making large-scale Chinese document processing significantly cheaper. LLaMA 4 models deployed via Ollama support multilingual summarisation while keeping data fully on-premise, satisfying data residency requirements for Japanese enterprises under METI guidelines. English-first models (GPT-5.5, Claude Sonnet 4.6) also handle multilingual documents but with slightly higher error rates on non-Latin scripts.

Sources & Further Reading

Liu et al., 2025. "A hallucination detection and mitigation framework for text summarisation" — introduces Q-S-E methodology for iterative hallucination correction across CNN/DailyMail, PubMed, and ArXiv benchmarks
Vectara HHEM Leaderboard, 2025. "Hughes Hallucination Evaluation Model — Document Summarisation Faithfulness Rankings" — tested 100+ LLMs across 831 documents; Gemini-2.0-Flash at 0.7% hallucination rate
SEI/CMU, 2025. "Evaluating LLMs for Text Summarisation: An Introduction" — framework for accuracy, faithfulness, compression, and efficiency evaluation

Apply these techniques with a local LLM or your own API keys — PromptQuorum works with any backend.

Try PromptQuorum free →

← Back to Prompt Engineering