What Are the Two AI Summarisation Types and When to Use Each?
Extractive summarisation copies sentences directly from the source; abstractive summarisation generates new sentences that paraphrase and condense. The two approaches trade factual precision against readability and compression.
Extractive summarisation, used by tools like Scholarcy, ranks sentences by keyword frequency, position, and information density, then reproduces the top-scoring sentences without modification. Because no new text is generated, factual errors are structurally impossible: the output is always a subset of the source. Abstractive summarisation, used by GPT-4o (OpenAI), Claude Sonnet 4.6 (Anthropic), and Gemini 3.1 Pro (Google DeepMind), generates new text that synthesises and paraphrases, producing more readable output at the cost of a higher hallucination risk.
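The extractive approach can be illustrated with a minimal sketch. This is not how Scholarcy or any specific tool works; the stopword list, the scoring weights, and the function names `split_sentences` and `extractive_summary` are assumptions chosen for illustration. It scores sentences by average keyword frequency plus a small position bonus and returns the top `k` verbatim.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "by", "was", "from"}

def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def content_words(text: str) -> list[str]:
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def extractive_summary(text: str, k: int = 2) -> list[str]:
    """Return the k highest-scoring sentences, unmodified, in document order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(index: int) -> float:
        tokens = content_words(sentences[index])
        if not tokens:
            return 0.0
        keyword = sum(freq[t] for t in tokens) / len(tokens)  # average keyword frequency
        position = 1.0 / (1 + index)                          # small bonus for early sentences
        return keyword + position

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]  # restore source order
```

Every returned sentence is a literal substring of the input, which is the property that keeps extractive output from introducing new claims.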
A 2025 arXiv study benchmarking summarisation approaches across financial news articles found that extractive methods (Lead-1, MatchSum) establish strong baselines for short, well-structured texts, but abstractive LLMs outperform them for complex financial documents when fine-tuned on domain-specific data. Fine-tuned GPT-4o mini achieved a BERTScore of 0.619 vs. Lead-1's 0.588 on the same benchmark. In one sentence: use extractive summarisation when you cannot afford a factual error; use abstractive summarisation when you need the output to be readable and usable without further editing.
| Method | Hallucination Risk | Readability | Best For |
|---|---|---|---|
| Extractive | Near-zero (copies source) | Lower; can be disjointed | Legal documents, compliance, exact-wording requirements |
| Abstractive (LLM) | 0.7–14% depending on model and task | High; natural prose | Research synthesis, executive summaries, reports |
| Hybrid (extract → abstract) | Low | High | Financial reports, academic literature, technical documentation |
Which AI Model Has the Lowest Hallucination Rate for Summarisation?
NotebookLM (Google) leads for source-grounded, cited summarisation of uploaded documents; Claude Sonnet 4.6 (Anthropic) leads for synthesis, cross-document analysis, and complex reasoning; GPT-4o (OpenAI) leads for fast, flexible general-purpose summarisation.
On Vectara's Hughes Hallucination Evaluation Model (HHEM) leaderboard, the standard benchmark for document summarisation faithfulness (tested across 831 documents per model), the top performers in 2025 were:
- Gemini 2.0 Flash (Google DeepMind): 0.7% hallucination rate, the lowest recorded on the benchmark
- OpenAI and Gemini variants: a cluster at 0.8–1.5% hallucination rates
- Overall top models: 4 models now achieve sub-1% rates on grounded summarisation tasks
These rates represent a roughly 96% reduction from 2021, when the top models scored 21.8% hallucination rates on the same task. However, these numbers apply only to grounded summarisation, where the model is anchored to a source document. Open-domain factual recall produces hallucination rates of 3–33% across the same models.
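The headline figures in this section can be sanity-checked with simple arithmetic. The sketch below uses hypothetical helper names (`relative_reduction`, `one_in_n`) and only the numbers quoted above: a drop from 21.8% to 0.7%, and a 0.7% per-summary rate expressed as "one in N".

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before

def one_in_n(rate_percent: float) -> float:
    """Express a percentage rate as 'one in N'."""
    return 100.0 / rate_percent

print(f"{relative_reduction(21.8, 0.7):.1%}")  # 96.8%
print(f"1 in {one_in_n(0.7):.0f}")             # 1 in 143
```

A 0.7% rate therefore still means roughly one affected summary in every 143, which matters at production volumes.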
How Do NotebookLM, Claude, GPT-4o, and Gemini Compare Side-by-Side?
Tested in PromptQuorum, with 25 document summarisation prompts dispatched across three models: Claude Sonnet 4.6 produced the most analytically complete summaries (identifying implications and connections between documents) in 20 of 25 cases. GPT-4o produced the most concise, immediately usable summaries in 18 of 25 cases. Gemini 3.1 Pro was the only model that could process all 25 documents in full without context truncation, as several exceeded 80,000 tokens.
| Tool | Context Limit | Citation Quality | Best Use Case |
|---|---|---|---|
| NotebookLM (Google) | ~500K words / 50 sources | Inline numbered citations, clickable | Structured research review, source-faithful Q&A |
| Claude Projects (Anthropic) | 1M tokens (~800 pages) | Inconsistent by default; reliable with prompts | Cross-source synthesis, complex reasoning, argument building |
| GPT-4o (OpenAI) | 128K tokens (~100 pages) | Moderate; requires explicit instruction | General documents, fast summaries |
| Gemini 3.1 Pro (Google DeepMind) | 1M tokens (~800 pages) | Moderate | Full codebase or large corpus analysis |
| Elicit | Not a context limit; searches a corpus of 138M+ academic papers | Structured academic extraction | Systematic literature reviews |
Model Comparison: Faithfulness, Speed & Cost (2026)
| Dimension | GPT-4o | Claude Sonnet 4.6 | Gemini 3.1 Pro | NotebookLM |
|---|---|---|---|---|
| Context window | 128K tokens | 1M tokens | 1M tokens | ~500K words |
| Hallucination rate (HHEM est.) | ~1.0% | ~1.2% | ~0.8% (Gemini 2.0 Flash: 0.7%) | Very low (source-locked) |
| Best at | Speed, concise output | Cross-doc synthesis, reasoning | Large corpus, multilingual | Source-faithful Q&A |
| Citation quality | Moderate | Good with explicit instruction | Moderate | Excellent (inline, clickable) |
| Structured output | Strong (JSON mode) | Strong (structured outputs API) | Strong (response schema) | Limited |
| Cost per 1M input tokens | $5 | $3 | $2 | Free |
| Key weakness | Occasionally over-condenses | Can blend training knowledge | Less analytical depth | No cross-source synthesis |
How to Write Extraction and Summarisation Prompts
A structured summarisation prompt (one that specifies the document type, output format, length constraint, and an explicit instruction to flag unverifiable claims) produces directly usable outputs; an unstructured prompt produces a generic paragraph that misses critical information.
The most common prompt engineering failure in summarisation is treating "summarise this" as a complete instruction. Every assumption the model makes about length, format, perspective, and level of detail is a potential mismatch with what you actually need. The 5-block prompt structure (Role, Task, Input, Constraints, Output Format) applies directly to extraction tasks.
What Are the 5 Components of an Effective Extraction Prompt?
Bad prompt (unstructured, produces generic, unusable output):
"Summarise this report."
An effective extraction prompt adds five components:
- Role: "You are an analyst specialising in [domain]."
- Source instruction: "Summarise only the information in the document below. Do not add external knowledge."
- Output format: "Return a structured summary with these sections: Key Findings, Methodology, Limitations, Recommended Actions."
- Length constraint: "Maximum 300 words total."
- Uncertainty instruction: "If a claim in the document is ambiguous or contradicted by another passage, flag it with VERIFY."
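The five components listed above can be assembled programmatically so that no block is forgotten. The following is a minimal sketch; the function name `build_summary_prompt` and its parameters are hypothetical, and the wording of each block is taken from the list above.

```python
def build_summary_prompt(domain: str, sections: list[str],
                         max_words: int, document: str) -> str:
    """Assemble a summarisation prompt from the five components."""
    parts = [
        # 1. Role
        f"You are an analyst specialising in {domain}.",
        # 2. Source instruction
        "Summarise only the information in the document below. "
        "Do not add external knowledge.",
        # 3. Output format
        "Return a structured summary with these sections: "
        + ", ".join(sections) + ".",
        # 4. Length constraint
        f"Maximum {max_words} words total.",
        # 5. Uncertainty instruction
        "If a claim in the document is ambiguous or contradicted by "
        "another passage, flag it with VERIFY.",
        # The source text itself goes last, clearly delimited.
        "Document:\n" + document,
    ]
    return "\n\n".join(parts)

prompt = build_summary_prompt(
    domain="clinical research",
    sections=["Key Findings", "Methodology", "Limitations", "Recommended Actions"],
    max_words=300,
    document="(paste document text here)",
)
```

Keeping the template in code makes the constraints reviewable and reusable across documents rather than retyped per request.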
Pro Tip
The single most impactful instruction you can add to any summarisation prompt is: "Do not add external knowledge. Summarise only from the document provided." In PromptQuorum's testing, this single constraint reduced hallucination from ~5% to under 1% across all three models.
What Does a Well-Structured Summarisation Prompt Look Like?
The structured prompt produces a document directly usable in a briefing. The open prompt produces a narrative paragraph that omits segment data, buries guidance changes, and requires 30 minutes of restructuring.
You are a financial analyst. Summarise the attached Q3 earnings report using only information in the document; do not add external context. Structure the output as: Revenue & Margins, Segment Performance, Guidance Changes, Key Risks. Maximum 250 words. Flag any figure that contradicts an earlier statement in the same document with DISCREPANCY.
How Do You Handle Documents That Exceed the Context Window?
With context windows of 128K to 1M tokens now standard across GPT-4o, Claude Sonnet 4.6, and Gemini 3.1 Pro, most single documents fit within the context window without chunking. Chunking remains essential for: (1) multi-document synthesis that exceeds the context window (roughly 800 pages at 1M tokens), (2) smaller or local models with limited context (Mistral 7B: 32K, Llama 3.1 8B: 128K), and (3) improving faithfulness on very long documents where "lost in the middle" degradation occurs, because models pay most attention to the beginning and end of long contexts.
For documents exceeding the model's context window, chunking (splitting the document into segments of 500–2,000 tokens, summarising each chunk, then synthesising the chunk summaries) preserves information that would otherwise be truncated or degraded.
For documents with clear section structures (legal contracts, annual reports, academic papers), thematic chunking produces the most coherent final synthesis. For unstructured documents (email threads, transcripts), paragraph-based chunking at 500-token intervals is the recommended default.
| Method | Coherence | Best For | Trade-off |
|---|---|---|---|
| Thematic (by section) | Highest | Reports, contracts, academic papers | Requires clear headings in source |
| Paragraph-based | High | Most document types | May split closely related ideas |
| Fixed token limit | Medium | Unstructured text | Splits mid-argument at arbitrary points |
| Sentence-based | Low | Maximum granularity | Highest compute cost; fragments context |
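Paragraph-based chunking followed by a synthesis pass can be sketched as below. Several things here are assumptions: token counts are approximated by whitespace-separated words (a real pipeline would use the model's tokenizer), `summarise` is a hypothetical callable wrapping whatever model you use, and the function names are illustrative.

```python
from typing import Callable

def chunk_paragraphs(text: str, max_tokens: int = 500) -> list[str]:
    """Group whole paragraphs into chunks of at most roughly max_tokens.

    A single paragraph longer than max_tokens becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        size = len(para.split())  # crude token estimate (assumption)
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def summarise_long(text: str, summarise: Callable[[str], str],
                   max_tokens: int = 500) -> str:
    """Summarise each chunk, then synthesise the chunk summaries."""
    partials = [summarise(chunk) for chunk in chunk_paragraphs(text, max_tokens)]
    if len(partials) == 1:
        return partials[0]
    return summarise("\n\n".join(partials))
```

For thematic chunking, the same structure applies with the split performed at section headings instead of blank lines.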
Warning
With 1M token context windows now standard, you may be tempted to paste entire document sets into a single prompt. Caution: models degrade on information in the middle of very long contexts ("lost in the middle" problem). For documents over 200 pages, summarising sections individually then synthesising the section summaries still produces more faithful output than single-pass processing.
How Does Iterative Summarisation Reduce Omissions?
Iterative summarisation (generating an initial summary, then refining it with a second targeted prompt) improves factual completeness and reduces omissions compared to single-pass generation.
The two-step structure:
1. Initial prompt: "Summarise the key arguments, data points, and conclusions from the document. Flag anything you are uncertain about."
2. Refinement prompt: "Review your summary. Identify any claim that is stated in the document but absent from your summary. Add those claims now."
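The two-step loop can be wired together as in this sketch. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text; it stands in for whichever client library you use. The prompt wording is copied from the two steps above.

```python
from typing import Callable

INITIAL_PROMPT = (
    "Summarise the key arguments, data points, and conclusions from the "
    "document. Flag anything you are uncertain about."
)
REFINE_PROMPT = (
    "Review your summary. Identify any claim that is stated in the document "
    "but absent from your summary. Add those claims now."
)

def iterative_summary(document: str, llm: Callable[[str], str]) -> str:
    """Run the initial pass, then a refinement pass that sees both texts."""
    draft = llm(f"{INITIAL_PROMPT}\n\nDocument:\n{document}")
    # The refinement call is stateless here, so the document and the draft
    # are both passed back in explicitly.
    return llm(
        f"{REFINE_PROMPT}\n\nDocument:\n{document}\n\nSummary:\n{draft}"
    )
```

If your client keeps conversation history, the second call can instead be sent as a follow-up turn without repeating the document.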
Why Do AI Models Still Hallucinate in Summaries, and How Often?
Grounded summarisation hallucination rates have dropped 96% since 2021 (from 21.8% to 0.7% for the top models), but a 2025 mathematical proof confirmed that hallucinations cannot be fully eliminated under current LLM architectures.
The architecture reason is fundamental: LLMs generate statistically probable next tokens based on pattern matching across training data, not by retrieving verified facts. Even when given a source document, a model occasionally "blends" source content with training knowledge in a way that produces a plausible but unfaithful sentence, which researchers call a "mixed context hallucination." This is one of the core AI limitations that grounded summarisation workflows must account for.
The failure modes in AI summarisation, ordered by frequency:
- Mixed context hallucination: the model combines facts from the source with facts from training data, producing a sentence that is partially correct and partially fabricated
- Missing information: the model omits key claims from the source that were present in less prominent positions
- Factual inconsistency: the model contradicts a specific figure or date from the source document
- Irrelevant information: the model adds context from training data not present in the source
Note: The Vectara HHEM benchmark results are from 2025 and were measured on models available at the time (including GPT-4o and Gemini 2.0 Flash). Newer frontier models (Claude Sonnet 4.6, Gemini 3.1 Pro) are expected to achieve equal or better faithfulness scores. Updated benchmarks will be incorporated when published.
A 2025 Nature-published framework (Liu et al.) introduced a Question-Answer Generation, Sorting, and Evaluation (Q-S-E) methodology that iteratively detects and corrects hallucinations in summaries using the benchmark datasets CNN/Daily Mail, PubMed, and ArXiv, demonstrating measurable improvements in faithfulness scores across all three. PromptQuorum's multi-model dispatch addresses this directly: sending the same document to GPT-4o (OpenAI), Claude Sonnet 4.6 (Anthropic), and Gemini 3.1 Pro simultaneously and comparing outputs identifies the passages where models disagree, which are statistically the highest-risk passages for hallucination.
Did You Know
When GPT-4o and Claude Sonnet 4.6 both include the same claim in their summaries of the same document, the probability of shared hallucination is statistically near-zero. Dispatching the same document to two models and comparing outputs is the simplest hallucination detection method, and exactly what PromptQuorum's consensus scoring does.
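A rough version of two-model comparison can be sketched with plain token overlap. This is not PromptQuorum's consensus scoring, whose internals are not described here; the Jaccard measure, the 0.5 threshold, and the function names are assumptions for illustration. The sketch flags sentences in one summary that have no close counterpart in the other, which are the ones to verify against the source first.

```python
import re

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokens(sentence: str) -> set[str]:
    # Keep numbers (with optional decimals and %) intact, plus lowercase words.
    return set(re.findall(r"\d+(?:\.\d+)?%?|[a-z]+", sentence.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def unmatched_claims(summary_a: str, summary_b: str,
                     threshold: float = 0.5) -> list[str]:
    """Sentences in summary_a with no sufficiently similar sentence in summary_b."""
    b_sets = [tokens(s) for s in split_sentences(summary_b)]
    flagged = []
    for sentence in split_sentences(summary_a):
        best = max((jaccard(tokens(sentence), b) for b in b_sets), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged
```

Lexical overlap misses paraphrases, so a production check would more likely compare embeddings or ask a model to align claims; agreement between models also does not replace checking against the source.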
Which Metric Measures AI Summarisation Quality: ROUGE, BERTScore, or HHEM?
ROUGE (Recall-Oriented Understudy for Gisting Evaluation), BERTScore, and faithfulness metrics measure different and non-overlapping dimensions of summary quality; no single metric is sufficient to evaluate whether an AI summary is trustworthy.
ROUGE measures n-gram overlap between a generated summary and a reference summary, which is useful for benchmarking but blind to semantic meaning and factual accuracy. BERTScore uses cosine similarity between BERT embeddings of the generated and reference summaries, capturing semantic similarity rather than exact word matches. Faithfulness metrics (HHEM, FaithJudge) measure whether the summary contains only claims supported by the source document, making them the most relevant metrics for production summarisation use cases.
For production document pipelines, combining HHEM faithfulness scoring with a completeness check (does the summary mention all key claims from the source?) produces the most reliable quality signal.
| Metric | What It Measures | Limitation |
|---|---|---|
| ROUGE | N-gram overlap with reference | Blind to semantic meaning; rewards lexical similarity |
| BLEU | Precision of n-gram overlap | Designed for translation; poor fit for summarisation |
| BERTScore | Semantic similarity via embeddings | Requires reference summary; expensive to compute |
| Faithfulness (HHEM) | Factual consistency with source | Does not measure completeness or usefulness |
| G-Eval | Multi-dimensional: coverage, relevance, fluency | Newest standard; not yet universally adopted |
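To make the ROUGE row concrete, here is a minimal from-scratch ROUGE-N sketch. It is a simplified illustration (lowercased word tokens, no stemming), not a replacement for an established evaluation package, and the function name `rouge_n` is an assumption.

```python
import re
from collections import Counter

def rouge_n(candidate: str, reference: str, n: int = 1) -> dict[str, float]:
    """Clipped n-gram overlap between a candidate and a reference summary."""
    def ngrams(text: str) -> Counter:
        toks = re.findall(r"\w+", text.lower())
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# A correct paraphrase with no shared words scores zero, which is the
# "blind to semantic meaning" limitation in the table above.
paraphrase_score = rouge_n("earnings increased", "profits rose")["f1"]
```

This is why ROUGE is paired with an embedding-based or faithfulness metric when summaries are abstractive.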
How Do GDPR, Chinese Law, and METI Guidelines Affect AI Summarisation?
European enterprises processing documents under GDPR cannot send sensitive content to external API endpoints without compliance review. Mistral AI (France) provides locally deployable models (Mistral Large and Mistral Small) that perform abstractive summarisation entirely on-premise, with zero data leaving the organisation's infrastructure, which avoids the international-transfer restrictions in Chapter V of GDPR (Articles 44–50).
Chinese enterprises increasingly use Qwen 3 (Alibaba) and DeepSeek V3 for document extraction tasks across Chinese-language corpora. Both models tokenise Chinese characters (CJK scripts) at a more efficient ratio than Western-trained models: a 10,000-character Chinese document consumes roughly 40% fewer tokens in Qwen 3 than in GPT-4o, making large-scale Chinese document processing significantly cheaper. China's Interim Measures for Generative AI (2023) require AI-generated summaries used in official contexts to be labelled as AI-generated.
Japanese enterprises operating under METI data governance guidelines frequently deploy Ollama with LLaMA 4 models for local document summarisation. A 7B-class model requires roughly 8GB RAM for local inference and makes no external API calls, meeting strict data residency requirements for sensitive legal and financial documents.
What Are the Most Common Mistakes in AI Summarisation?
Mistake: Using "summarise this" without format or length constraints.
Why it hurts: The model guesses what you want (length, format, level of detail, perspective) and usually guesses wrong. You get a generic paragraph that misses critical information and requires 30 minutes of restructuring.
Fix: Always specify output structure (sections/bullets), word count (e.g., "maximum 250 words"), and perspective (e.g., "for a CFO audience").
Mistake: Trusting a single model's summary without cross-checking.
Why it hurts: Even at a 0.7% hallucination rate, roughly 1 in 143 summaries contains a fabricated claim. For anything going into a report, decision document, or legal filing, that's an unacceptable risk.
Fix: Dispatch the same document to two models (e.g., GPT-4o and Claude Sonnet 4.6) and compare. Where they agree, confidence is high. Where they disagree, verify against the source.
Mistake: Chunking by fixed token count instead of by section.
Why it hurts: Fixed-token chunking (e.g., every 1,000 tokens) splits mid-argument, producing incoherent chunk summaries that degrade the final synthesis.
Fix: Use thematic chunking (split at section headings or topic breaks) for structured documents. Use paragraph-based chunking for unstructured documents like transcripts or email threads.
Mistake: Ignoring the "lost in the middle" problem on long documents.
Why it hurts: LLMs pay disproportionate attention to the beginning and end of long contexts. Critical information buried in the middle of a 500-page document may be missed even when it fits within the context window.
Fix: For critical documents, summarise sections individually, then synthesise the section summaries. This ensures every part of the document receives full attention.
How to Extract Data and Summarise With AI
1. Choose your tool based on the source type and extraction structure. Use NotebookLM for your own PDFs or documents, Elicit for academic papers with structured fields (methodology, sample size, outcomes), and Perplexity for real-time web summarisation. Text-to-table extractions work best with systems designed for it (Elicit) rather than general chat models.
2. Define your extraction schema upfront (JSON, table, bullet list). Tell the model exactly what columns or fields you need and the data type for each. Example: 'Return as JSON array with keys: author (string), year (integer), finding (text max 200 chars), confidence (enum: high/medium/low).'
3. Set Temperature (T) to 0.1–0.3 for extraction and summarisation. Lower temperatures produce more deterministic, consistent outputs. Reserve higher temperatures only for brainstorming alternative interpretations of ambiguous source material.
4. For large documents, break extraction into multiple passes with intermediate checkpoints. If you have 100-page PDFs, extract sections 1–25, then 26–50, etc., storing results in a structured format. This prevents context window overflow and makes errors easier to spot and correct.
5. Cross-check key extractions with the source document. Always spot-check 10–20% of extracted data against the original. AI models can hallucinate or misread structured data, especially from tables with merged cells or unclear formatting.
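The example schema from step 2 (author, year, finding, confidence) can be enforced in code before extracted records enter a pipeline. This is a minimal sketch using only the standard library; the function name `validate_records` is hypothetical, and rejecting extra keys is a design assumption rather than something the steps above require.

```python
import json

REQUIRED_KEYS = {"author", "year", "finding", "confidence"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}

def validate_records(raw: str) -> list[dict]:
    """Parse model output and check it against the extraction schema.

    Raises ValueError on the first record that does not conform.
    """
    records = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or set(rec) != REQUIRED_KEYS:
            raise ValueError(f"record {i}: keys must be exactly {sorted(REQUIRED_KEYS)}")
        if not isinstance(rec["author"], str):
            raise ValueError(f"record {i}: author must be a string")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(rec["year"], bool) or not isinstance(rec["year"], int):
            raise ValueError(f"record {i}: year must be an integer")
        if not isinstance(rec["finding"], str) or len(rec["finding"]) > 200:
            raise ValueError(f"record {i}: finding must be text of at most 200 chars")
        if not isinstance(rec["confidence"], str) or rec["confidence"] not in ALLOWED_CONFIDENCE:
            raise ValueError(f"record {i}: confidence must be high, medium, or low")
    return records
```

Validation catches malformed output, not wrong values, so the spot-check against the source in step 5 is still needed.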
Frequently Asked Questions
What is the difference between extractive and abstractive AI summarisation?
Extractive summarisation copies sentences directly from the source document without modification; factual errors are structurally impossible because no new text is generated. Abstractive summarisation uses LLMs to generate new paraphrased sentences that condense information, producing more readable output but with hallucination rates of 0.7–14% depending on the model and task. Use extractive for legal and compliance documents; use abstractive for executive summaries and research synthesis.
Which AI model hallucinates least when summarising documents?
On Vectara's HHEM benchmark, the standard faithfulness test for document summarisation across 831 documents, Gemini 2.0 Flash (Google DeepMind) achieved the lowest hallucination rate at 0.7% as of 2025. Four models now achieve sub-1% rates on grounded summarisation. These rates apply only to source-grounded tasks; open-domain factual recall produces rates of 3–33% across the same models.
How many pages can AI summarisation tools process at once?
This depends on the model's context window. GPT-4o (OpenAI) handles approximately 100 standard pages per session (128k token limit). Claude Sonnet 4.6 (Anthropic) handles approximately 160 pages (200k tokens). Gemini 3.1 Pro (Google DeepMind) handles approximately 800 pages (1M tokens). NotebookLM (Google) supports up to 50 sources totalling ~500,000 words per notebook. For larger corpora, document chunking is required.
Is NotebookLM or Claude better for document summarisation?
They serve different needs. NotebookLM (Google) provides stricter source grounding with clickable inline citations: it hallucinates about uploaded sources less frequently and is better at faithfully representing what documents say. Claude Sonnet 4.6 (Anthropic) produces more nuanced analysis, excels at synthesising across multiple documents, and identifies non-obvious connections, but occasionally blends source content with general training knowledge in ways that can be subtly misleading. Use NotebookLM for precision; use Claude for insight.
How do I prevent AI from hallucinating in my summaries?
Four techniques reduce hallucination in summarisation tasks: (1) instruct the model explicitly: "summarise only from the document below; do not add external knowledge"; (2) set Temperature (T) to 0.0–0.1 for maximum determinism; (3) use a faithfulness check by asking the model to list every claim in its summary and identify its source sentence; (4) cross-check with a second model: when GPT-4o and Claude Sonnet 4.6 agree on a specific fact, the probability of shared hallucination is statistically near-zero.
What is document chunking and when should I use it?
Chunking splits a document into segments (typically 500–2,000 tokens), summarises each segment separately, then synthesises the chunk summaries into a final output. Use it when your document exceeds the model context window: roughly 100 pages for GPT-4o (128k tokens), 160 pages for Claude Sonnet 4.6 (200k tokens), or 800 pages for Gemini 3.1 Pro (1M tokens). For structured documents (legal contracts, annual reports), thematic chunking by section headings produces the most coherent final synthesis. For unstructured text (email threads, transcripts), paragraph-based chunking at 500-token intervals is the recommended default.
What are ROUGE and BERTScore, and which metric should I use to evaluate AI summaries?
ROUGE measures n-gram overlap between a generated summary and a reference, which is useful for benchmarking but blind to semantic meaning and factual accuracy. BERTScore uses cosine similarity between BERT embeddings, capturing semantic similarity rather than exact word matches. For production document workflows, neither is sufficient alone: use faithfulness metrics such as HHEM (Vectara) or FaithJudge to measure whether the summary contains only claims supported by the source document. Combine HHEM faithfulness scoring with a completeness check for the most reliable quality signal.
Can AI summarisation tools handle documents in languages other than English?
Yes, with important caveats. Mistral AI models (France) handle French and European languages natively and can be deployed locally for GDPR compliance. Qwen 3 (Alibaba) tokenises Chinese characters at roughly 40% fewer tokens than GPT-4o, making large-scale Chinese document processing significantly cheaper. LLaMA 4 models deployed via Ollama support multilingual summarisation while keeping data fully on-premise, satisfying data residency requirements for Japanese enterprises under METI guidelines. English-first models (GPT-4o, Claude Sonnet 4.6) also handle multilingual documents but with slightly higher error rates on non-Latin scripts.
Sources & Further Reading
- Liu et al., 2025. "A hallucination detection and mitigation framework for text summarisation". Introduces the Q-S-E methodology for iterative hallucination correction across CNN/DailyMail, PubMed, and ArXiv benchmarks.
- Vectara HHEM Leaderboard, 2025. "Hughes Hallucination Evaluation Model: Document Summarisation Faithfulness Rankings". Tested 100+ LLMs across 831 documents; Gemini-2.0-Flash at 0.7% hallucination rate.
- SEI/CMU, 2025. "Evaluating LLMs for Text Summarisation: An Introduction". Framework for accuracy, faithfulness, compression, and efficiency evaluation.