
Extract and Summarise With AI

8 min read · By Hans Kuepper, Founder of PromptQuorum, a multi-model AI dispatch tool

AI-powered extraction and summarisation reduces document review time by 60–80% while achieving hallucination rates as low as 0.7% on grounded summarisation tasks. The key is choosing the right summarisation type, the right model, and the right prompt structure for each document category.

The Two Summarisation Types: Which One to Use

A 2025 arXiv study benchmarking summarisation approaches across financial news articles found that extractive methods (Lead-1, MatchSum) establish strong baselines for short, well-structured texts, but abstractive LLMs outperform them for complex financial documents when fine-tuned on domain-specific data: fine-tuned GPT-4o-mini achieved a BERTScore of 0.619 vs. Lead-1's 0.588 on the same benchmark. In one sentence: use extractive summarisation when you cannot afford a factual error; use abstractive summarisation when you need the output to be readable and usable without further editing.

Extractive summarisation copies sentences directly from the source; abstractive summarisation generates new sentences that paraphrase and condense. The two approaches trade factual precision against readability and compression.

Extractive summarisation, used by tools like Scholarcy, ranks sentences by keyword frequency, position, and information density, then reproduces the top-scoring sentences without modification. Because no new text is generated, factual errors are structurally impossible: the output is always a subset of the source. Abstractive summarisation, used by GPT-4o (OpenAI), Claude 4.6 Sonnet (Anthropic), and Gemini 2.5 Pro (Google DeepMind), generates new text that synthesises and paraphrases, producing more readable output at the cost of a higher hallucination risk.

| Method | Hallucination Risk | Readability | Best For |
| --- | --- | --- | --- |
| Extractive | Near-zero (copies source) | Lower; can be disjointed | Legal documents, compliance, exact-wording requirements |
| Abstractive (LLM) | 0.7–14% depending on model and task | High; natural prose | Research synthesis, executive summaries, reports |
| Hybrid (extract → abstract) | Low | High | Financial reports, academic literature, technical documentation |

Which AI Model to Use for Summarisation

Today's best grounded-summarisation hallucination rates (detailed below) represent a 96% improvement over 2021, when the best models scored 21.8% on the same task. However, these numbers apply only to grounded summarisation, where the model is anchored to a source document. Open-domain factual recall produces hallucination rates of 3–33% across the same models.

NotebookLM (Google DeepMind) leads for source-grounded, cited summarisation of uploaded documents; Claude 4.6 Sonnet (Anthropic) leads for synthesis, cross-document analysis, and complex reasoning; GPT-4o (OpenAI) leads for fast, flexible general-purpose summarisation.

On Vectara's Hughes Hallucination Evaluation Model (HHEM), the standard benchmark for document summarisation faithfulness, tested across 831 documents per model, the top performers in 2025 were:

  • Gemini-2.0-Flash-001 (Google DeepMind): 0.7% hallucination rate, the lowest recorded on the benchmark
  • OpenAI and Gemini variants: clustered at 0.8–1.5% hallucination rates
  • Overall: four models now achieve sub-1% rates on grounded summarisation tasks

Summarisation Tool Comparison

Tested in PromptQuorum, with 25 document summarisation prompts dispatched across three models: Claude 4.6 Sonnet produced the most analytically complete summaries (identifying implications and connections between documents) in 20 of 25 cases. GPT-4o produced the most concise, immediately usable summaries in 18 of 25 cases. Gemini 2.5 Pro was the only model that could process all 25 documents in full without context truncation, as several exceeded 80,000 tokens.

| Tool | Context Limit | Citation Quality | Best Use Case |
| --- | --- | --- | --- |
| NotebookLM (Google DeepMind) | ~500K words / 50 sources | Inline numbered citations, clickable | Structured research review, source-faithful Q&A |
| Claude Projects (Anthropic) | ~200K tokens (~160 pages) | Inconsistent by default; reliable with prompts | Cross-source synthesis, complex reasoning, argument building |
| GPT-4o (OpenAI) | 128K tokens (~100 pages) | Moderate; requires explicit instruction | General documents, fast summaries |
| Gemini 2.5 Pro (Google DeepMind) | 1M tokens (~800 pages) | Moderate | Full codebase or large corpus analysis |
| Elicit | 138M+ academic papers | Structured academic extraction | Systematic literature reviews |

How to Write Extraction and Summarisation Prompts

A structured summarisation prompt, one that specifies the document type, output format, length constraint, and an explicit instruction to flag unverifiable claims, produces directly usable outputs; an unstructured prompt produces a generic paragraph that misses critical information.

The most common prompt engineering failure in summarisation is treating "summarise this" as a complete instruction. Every assumption the model makes about length, format, perspective, and level of detail is a potential mismatch with what you actually need.

The Five-Component Extraction Prompt

A bare prompt like "Summarise this report." leaves every one of those decisions to the model. A complete extraction prompt specifies five components:

  • Role: "You are an analyst specialising in [domain]."
  • Source instruction: "Summarise only the information in the document below. Do not add external knowledge."
  • Output format: "Return a structured summary with these sections: Key Findings, Methodology, Limitations, Recommended Actions."
  • Length constraint: "Maximum 300 words total."
  • Uncertainty instruction: "If a claim in the document is ambiguous or contradicted by another passage, flag it with VERIFY."
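Assembled programmatically, the five components might look like the following sketch. The function name, parameter names, and the example role are illustrative, not part of any tool's API:

```python
def build_extraction_prompt(role, sections, max_words, document):
    """Compose a five-component summarisation prompt as one string."""
    return "\n".join([
        f"You are {role}.",                                       # 1. Role
        "Summarise only the information in the document below. "   # 2. Source instruction
        "Do not add external knowledge.",
        "Return a structured summary with these sections: "        # 3. Output format
        + ", ".join(sections) + ".",
        f"Maximum {max_words} words total.",                       # 4. Length constraint
        "If a claim in the document is ambiguous or contradicted " # 5. Uncertainty instruction
        "by another passage, flag it with VERIFY.",
        "--- DOCUMENT ---",
        document,
    ])

prompt = build_extraction_prompt(
    role="an analyst specialising in clinical research",
    sections=["Key Findings", "Methodology", "Limitations", "Recommended Actions"],
    max_words=300,
    document="(paste document text here)",
)
```

Keeping the components as separate arguments makes it easy to vary the role and section headings per document category while the grounding and uncertainty instructions stay fixed.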

Good Prompt Example

You are a financial analyst. Summarise the attached Q3 earnings report using only information in the document; do not add external context. Structure the output as: Revenue & Margins, Segment Performance, Guidance Changes, Key Risks. Maximum 250 words. Flag any figure that contradicts an earlier statement in the same document with DISCREPANCY.

The structured prompt produces a document directly usable in a briefing. The open prompt produces a narrative paragraph that omits segment data, buries guidance changes, and requires 30 minutes of restructuring.

Chunking for Long Documents

For documents with clear section structures (legal contracts, annual reports, academic papers), thematic chunking produces the most coherent final synthesis. For unstructured documents (email threads, transcripts), paragraph-based chunking at 500-token intervals is the recommended default.

For documents exceeding the model's context window, chunking (splitting the document into segments of 500–2,000 tokens, summarising each chunk, then synthesising the chunk summaries) preserves information that would otherwise be truncated or degraded.

The four chunking methods, ordered by reliability for structured documents:

  • Thematic chunking: divide by section headings or topic breaks; highest semantic coherence per chunk
  • Paragraph-based chunking: split at paragraph boundaries; preserves context better than sentence splitting
  • Fixed token limit: chunk at a defined token count (e.g., every 1,000 tokens); consistent but may split mid-argument
  • Sentence-based chunking: maximum granularity; most computationally intensive
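Paragraph-based chunking with a ~500-token budget can be sketched as below. The characters-per-token estimate (4 characters per token) is a rough heuristic; a production pipeline would use the target model's own tokenizer:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (heuristic only)."""
    return max(1, len(text) // 4)

def chunk_by_paragraph(document: str, max_tokens: int = 500):
    """Greedily pack whole paragraphs into chunks of at most max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for para in filter(None, (p.strip() for p in document.split("\n\n"))):
        t = approx_tokens(para)
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_tokens + t > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Illustration: ten ~150-token paragraphs pack three to a chunk.
doc = "\n\n".join(f"Paragraph {i}. " + " ".join(["word"] * 120) for i in range(10))
chunks = chunk_by_paragraph(doc, max_tokens=500)
```

Because whole paragraphs are never split, each chunk stays self-contained; the trade-off is that a single paragraph longer than the budget still becomes its own oversized chunk.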

Iterative Summarisation for Accuracy

Iterative summarisation, generating an initial summary and then refining it with a second targeted prompt, improves factual completeness and reduces omissions. The two-step structure:

  1. Initial prompt: "Summarise the key arguments, data points, and conclusions from the document. Flag anything you are uncertain about."
  2. Refinement prompt: "Review your summary. Identify any claim that is stated in the document but absent from your summary. Add those claims now."
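The two-step loop can be sketched as follows. `call_model` is a hypothetical placeholder for any chat-completion client; a stand-in function is used here for illustration:

```python
# Prompt templates for the two steps of iterative summarisation.
INITIAL = ("Summarise the key arguments, data points, and conclusions "
           "from the document. Flag anything you are uncertain about.\n\n{doc}")
REFINE = ("Review your summary. Identify any claim that is stated in the "
          "document but absent from your summary. Add those claims now.\n\n"
          "Document:\n{doc}\n\nYour summary:\n{summary}")

def iterative_summary(doc: str, call_model) -> str:
    """Run the initial pass, then feed doc + first summary into the refinement pass."""
    first = call_model(INITIAL.format(doc=doc))
    return call_model(REFINE.format(doc=doc, summary=first))

# Stand-in "model" that just echoes the first line of the prompt,
# so the control flow can be demonstrated without an API call.
echo = lambda prompt: prompt.splitlines()[0]
result = iterative_summary("Quarterly revenue rose 12%.", echo)
```

The refinement prompt deliberately includes both the document and the first summary, so the model compares them directly rather than relying on conversation memory.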

Hallucination in Summarisation: What the Numbers Show

A 2025 Nature-published framework (Liu et al.) introduced a Question-Answer Generation, Sorting, and Evaluation (Q-S-E) methodology that iteratively detects and corrects hallucinations in summaries, demonstrating measurable faithfulness improvements across the CNN/Daily Mail, PubMed, and ArXiv benchmark datasets. PromptQuorum's multi-model dispatch attacks the same problem from a different angle: sending the same document to GPT-4o (OpenAI), Claude 4.6 Sonnet (Anthropic), and Gemini 2.5 Pro simultaneously and comparing outputs identifies the passages where models disagree, which are statistically the highest-risk passages for hallucination.
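A minimal sketch of the disagreement-flagging idea, using sentence-level word overlap as a crude support measure. The threshold and the overlap metric are illustrative assumptions, not PromptQuorum's actual method:

```python
def _words(sentence: str) -> set:
    """Lowercased word set, with trailing punctuation stripped."""
    return {w.strip(".,").lower() for w in sentence.split()}

def flag_disagreements(summaries, threshold=0.5):
    """Return sentences from each summary that no other summary supports.

    Support is measured as the fraction of a sentence's words that appear
    in another model's summary; below `threshold`, the sentence is flagged.
    """
    flagged = []
    for i, summary in enumerate(summaries):
        others = [s for j, s in enumerate(summaries) if j != i]
        for sent in filter(None, (x.strip() for x in summary.split("."))):
            w = _words(sent)
            support = max(
                (len(w & _words(o)) / len(w) for o in others if w), default=0
            )
            if support < threshold:
                flagged.append(sent)
    return flagged

# Two model outputs agree on the revenue figure but each adds one claim
# the other does not contain; both unique claims get flagged for review.
a = "Revenue rose 12 percent. Margins were flat"
b = "Revenue rose 12 percent. The CEO resigned"
flags = flag_disagreements([a, b])
```

Anything the models agree on passes through; the flagged sentences are exactly the passages worth checking against the source by hand.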

Grounded summarisation hallucination rates have dropped 96% since 2021, from 21.8% to 0.7% for the best models, but a 2025 mathematical proof confirmed that hallucinations cannot be fully eliminated under current LLM architectures.

The architecture reason is fundamental: LLMs generate statistically probable next tokens based on pattern matching across training data, not by retrieving verified facts. Even when given a source document, a model occasionally "blends" source content with training knowledge in a way that produces a plausible but unfaithful sentence, what researchers call a "mixed context hallucination."

The failure modes in AI summarisation, ordered by frequency:

  • Mixed context hallucination: the model combines facts from the source with facts from training data, producing a sentence that is partially correct and partially fabricated
  • Missing information: the model omits key claims from the source that appeared in less prominent positions
  • Factual inconsistency: the model contradicts a specific figure or date from the source document
  • Irrelevant information: the model adds context from training data not present in the source

Summarisation Evaluation Metrics

For production document pipelines, combining HHEM faithfulness scoring with a completeness check (does the summary mention all key claims from the source?) produces the most reliable quality signal.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BERTScore, and faithfulness metrics measure different, non-overlapping dimensions of summary quality; no single metric is sufficient to evaluate whether an AI summary is trustworthy.

ROUGE measures n-gram overlap between a generated summary and a reference summary; it is useful for benchmarking but blind to semantic meaning and factual accuracy. BERTScore uses cosine similarity between BERT embeddings of the generated and reference summaries, capturing semantic similarity rather than exact word matches. Faithfulness metrics (HHEM, FaithJudge) measure whether the summary contains only claims supported by the source document, the most relevant metric for production summarisation use cases.
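To make ROUGE's blind spot concrete, here is a minimal ROUGE-1 recall computation (unigram overlap only; real evaluations should use an established implementation such as the rouge-score package). A faithful paraphrase scores lower than a summary that copies the reference and appends a fabricated detail:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    return overlap / max(1, sum(ref.values()))

ref = "revenue rose 12 percent in the third quarter"
paraphrase = "third quarter revenue climbed 12 percent"           # same fact, different wording
fabricated = "revenue rose 12 percent in the third quarter of 2019"  # copies reference, adds an unsupported year

r_para = rouge1_recall(paraphrase, ref)   # 5 of 8 reference words matched
r_fab = rouge1_recall(fabricated, ref)    # all 8 matched, fabrication ignored
```

The fabricated summary scores a perfect 1.0 while the faithful paraphrase scores 0.625, which is exactly why ROUGE must be paired with a faithfulness metric in production pipelines.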

| Metric | What It Measures | Limitation |
| --- | --- | --- |
| ROUGE | N-gram overlap with reference | Blind to semantic meaning; rewards lexical similarity |
| BLEU | Precision of n-gram overlap | Designed for translation; poor fit for summarisation |
| BERTScore | Semantic similarity via embeddings | Requires reference summary; expensive to compute |
| Faithfulness (HHEM) | Factual consistency with source | Does not measure completeness or usefulness |
| G-Eval | Multi-dimensional: coverage, relevance, fluency | Newest standard; not yet universally adopted |

Global and Regional Context

European enterprises processing documents under GDPR cannot send sensitive content to external API endpoints without compliance review. Mistral AI (France) provides locally deployable models, Mistral Large and Mistral Small, that perform abstractive summarisation entirely on-premise, with zero data leaving the organisation's infrastructure, satisfying EU data residency requirements under Article 46 of GDPR.

Chinese enterprises increasingly use Qwen 2.5 (Alibaba) and DeepSeek V3 for document extraction tasks across Chinese-language corpora. Both models tokenise Chinese characters (CJK scripts) at a more efficient ratio than Western-trained models: a 10,000-character Chinese document consumes roughly 40% fewer tokens in Qwen 2.5 than in GPT-4o, making large-scale Chinese document processing significantly cheaper. China's Interim Measures for Generative AI (2023) require AI-generated summaries used in official contexts to be labelled as AI-generated.

Japanese enterprises operating under METI data governance guidelines frequently deploy Ollama with LLaMA 3.1 models for local document summarisation. LLaMA 3.1 8B requires 8GB RAM for local inference and makes zero external API calls, meeting strict data residency requirements for sensitive legal and financial documents.

Key Takeaways

  • Use extractive summarisation for legal, compliance, and exact-wording documents; use abstractive LLM summarisation for research synthesis and executive outputs
  • Gemini-2.0-Flash-001 achieves a 0.7% hallucination rate on grounded summarisation, the best result on Vectara's HHEM benchmark across 831 documents
  • NotebookLM (Google DeepMind) provides the most reliable source-grounded summarisation with clickable inline citations; Claude 4.6 Sonnet leads for cross-document synthesis and complex analysis
  • Grounded summarisation hallucination rates fell 96% from 2021 to 2025, but a 2025 mathematical proof confirmed hallucinations cannot be fully eliminated under current LLM architectures
  • For documents exceeding context window limits, thematic chunking (by section/topic) produces the most coherent final synthesis
  • Claude 4.6 Sonnet handles ~160 pages per session (200K tokens); Gemini 2.5 Pro handles ~800 pages (1M tokens); context limits determine which model is practical for large document sets

Frequently Asked Questions

What is the difference between extractive and abstractive AI summarisation?

Extractive summarisation copies sentences directly from the source document without modification; factual errors are structurally impossible because no new text is generated. Abstractive summarisation uses LLMs to generate new paraphrased sentences that condense information, producing more readable output but with hallucination rates of 0.7–14% depending on the model and task. Use extractive for legal and compliance documents; use abstractive for executive summaries and research synthesis.

Which AI model hallucinates least when summarising documents?

On Vectara's HHEM benchmark, the standard faithfulness test for document summarisation across 831 documents, Gemini-2.0-Flash-001 (Google DeepMind) achieved the lowest hallucination rate at 0.7% as of 2025. Four models now achieve sub-1% rates on grounded summarisation. These rates apply only to source-grounded tasks; open-domain factual recall produces rates of 3–33% across the same models.

How many pages can AI summarisation tools process at once?

This depends on the model's context window. GPT-4o (OpenAI) handles approximately 100 standard pages per session (128K-token limit). Claude 4.6 Sonnet (Anthropic) handles approximately 160 pages (200K tokens). Gemini 2.5 Pro (Google DeepMind) handles approximately 800 pages (1M tokens). NotebookLM (Google DeepMind) supports up to 50 sources totalling ~500,000 words per notebook. For larger corpora, document chunking is required.

Is NotebookLM or Claude better for document summarisation?

They serve different needs. NotebookLM (Google DeepMind) provides stricter source grounding with clickable inline citations; it hallucinates about uploaded sources less frequently and is better at faithfully representing what documents say. Claude 4.6 Sonnet (Anthropic) produces more nuanced analysis, excels at synthesising across multiple documents, and identifies non-obvious connections, but occasionally blends source content with general training knowledge in ways that can be subtly misleading. Use NotebookLM for precision; use Claude for insight.

How do I prevent AI from hallucinating in my summaries?

Four techniques reduce hallucination in summarisation tasks: (1) instruct the model explicitly ("summarise only from the document below; do not add external knowledge"); (2) set temperature to 0.0–0.1 for maximum determinism; (3) use a faithfulness check: ask the model to list every claim in its summary and identify its source sentence; (4) cross-check with a second model: when GPT-4o and Claude 4.6 Sonnet agree on a specific fact, the probability of shared hallucination is statistically near-zero.
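Techniques (1) and (2) combine naturally into a single request payload. The field names below follow the common chat-completion format and are an assumption; adapt them to your provider's client library:

```python
def grounded_request(document: str, model: str = "gpt-4o") -> dict:
    """Build a summarisation request with explicit grounding and near-zero temperature."""
    return {
        "model": model,
        "temperature": 0.0,  # technique (2): maximum determinism
        "messages": [
            {
                "role": "system",
                "content": ("Summarise only from the document below; "
                            "do not add external knowledge."),  # technique (1)
            },
            {"role": "user", "content": document},
        ],
    }

req = grounded_request("(document text)")
```

Techniques (3) and (4) operate on the response rather than the request: a follow-up claim-listing prompt for the faithfulness check, and a second model call on the same payload for cross-checking.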


Apply these techniques across 25+ AI models simultaneously with PromptQuorum.
