
Control the Output: JSON Schema Compliance, Constrained Decoding, and Format Selection

10 min read · By Hans Kuepper, founder of PromptQuorum, a multi-model AI dispatch tool

Before native structured output capabilities existed, models scored below 40% on complex JSON schema compliance; with constrained decoding — used by OpenAI's `strict: true` mode and Anthropic's Strict Tool Use Mode — JSON Schema compliance reaches 100%, guaranteed at the token level. Output control is the single most important engineering variable separating a prototype that works 80% of the time from a production system that works reliably.

The Three Levels of Output Control

Output control operates at three distinct levels — prompt-based, schema-based, and constrained decoding — each offering progressively stronger format guarantees at progressively higher trade-offs against reasoning quality.

Prompt-based formatting instructs the model through natural language ("Return JSON with fields: name, email, score"). This works 80–95% of the time but fails silently on edge cases with no type guarantees, requiring error-handling for the 5–20% of malformed responses. Schema-based approaches (function calling / tool use) define output structure formally at 95–99% compliance — but the schema remains a strong hint, not an absolute constraint. Native constrained decoding uses finite state machines to mask invalid tokens at generation time, producing 100% schema-valid output with mathematical certainty.
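At the prompt-based level, a defensive parser for the malformed minority of responses is standard practice. A minimal sketch, assuming the most common failure mode (the model wrapping its JSON in markdown fences):

```python
import json

def parse_json_response(raw: str):
    """Defensive parse for the 5-20% of prompt-based responses
    that arrive malformed or fence-wrapped."""
    text = raw.strip()
    if text.startswith("```"):
        # Strip a ```json ... ``` markdown fence the model added.
        text = text.split("```")[1]
        if text.startswith("json"):
            text = text[4:]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None  # signal the caller to retry or fall back

print(parse_json_response('```json\n{"sentiment": "negative"}\n```'))
# → {'sentiment': 'negative'}
```

In production this sits behind a retry loop: a `None` result triggers one re-prompt with the schema restated before falling back to an error path.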

The two-stage approach — letting Claude 4.6 Sonnet (Anthropic) or GPT-4o (OpenAI) reason freely in Stage 1, then feeding output into a small specialist structuring model (Osmosis-Structure-0.6B, trained on 500K synthetic unstructured → structured transformations) in Stage 2 — achieves format guarantees without the reasoning quality penalty of constrained decoding.

In one sentence: Match the level of output constraint to the task — use constrained decoding only when format correctness matters more than reasoning depth.

| Level | Compliance Rate | Reasoning Impact | Best For |
|---|---|---|---|
| Prompt-based ("return JSON") | 80–95% | None | Prototyping; simple pipelines |
| Function calling / tool use | 95–99% | Minimal | Most production applications |
| Native constrained decoding (strict) | 100% | 2–10% quality reduction | Data extraction; high-volume pipelines |
| Two-stage (free-form → specialist model) | ~100% | None | Complex reasoning + guaranteed format |

Output Format Control via Prompt Engineering

Explicit output schema instructions — placed at the start of the system prompt for Claude 4.6 Sonnet and immediately before user content for GPT-4o — produce structured output compliance rates of 85–95% without the reasoning quality penalty of native constrained decoding.

Claude 4.6 Sonnet (Anthropic) responds best to output format instructions placed at the beginning of the system prompt using XML-style section labels. GPT-4o (OpenAI) performs best when the schema is placed immediately before user content using numbered format rules. Gemini 2.5 Pro (Google DeepMind) produces the most reliable structured output when the schema is restated at both start and end of the prompt.

Bad Prompt

Analyse this customer review and tell me the sentiment, key issues, and urgency.

Good Prompt — Claude 4.6 Sonnet

<output_format>
Return only this JSON object, no prose:
{
  "sentiment": "positive" | "neutral" | "negative",
  "key_issues": ["string"],  // max 3 items
  "urgency": "low" | "medium" | "high",
  "confidence": 0.0–1.0
}
</output_format>
<task>Analyse the following customer review.</task>
<review>REVIEW TEXT HERE</review>

The XML-structured prompt anchors the output format contract while preserving free reasoning inside the `<task>` block. No constrained decoding required — Claude 4.6 Sonnet complies in over 93% of production calls with this structure.

Model-Specific Output Format Rules

Each major LLM has distinct structural preferences for output format compliance:

  • Claude 4.6 Sonnet (Anthropic) — XML tags (`<output>`, `<format>`, `<constraints>`); schema at the top; "Output only the JSON, nothing else"
  • GPT-4o (OpenAI) — numbered format rules; schema placed after the main instruction; "Respond with valid JSON. No markdown fences. No explanation."
  • Gemini 2.5 Pro (Google DeepMind) — concise, explicit schema at both start and end; inline one-shot example of desired output format
  • Local models via Ollama (LLaMA 3.1 7B, Mistral) — more sensitive to format drift; a one-shot format example embedded directly in the prompt is required for reliable JSON output

Sampling Parameters That Control Output

Temperature (T), Top-P, Top-K, max_tokens, frequency_penalty, and presence_penalty are six independent parameters that jointly determine output length, randomness, and repetition — and must be set consistently, not in conflict.

Temperature (T) scales the softmax output distribution: at T = 0.0 the model always selects the highest-probability token (deterministic); at T = 2.0 the distribution is nearly flat and output becomes incoherent. Top-P (nucleus sampling) selects from the smallest set of tokens whose cumulative probability reaches P — at Top-P = 0.9 the model considers only the tokens covering the top 90% of the probability mass. Top-K restricts generation to the K highest-probability tokens at each step; Top-K = 1 is equivalent to greedy decoding.

The softmax with temperature formula: P(token) = exp(logit / T) / sum(exp(logits / T)). As T approaches 0, the highest-logit token approaches probability 1.0. As T approaches infinity, all tokens approach equal probability.
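The formula and the Top-P rule above can be sketched directly in a few lines; the three-token logits are illustrative, not from any real model:

```python
import math

def softmax_t(logits, T):
    """P(token_i) = exp(logit_i / T) / sum_j exp(logit_j / T)."""
    m = max(l / T for l in logits)                 # shift for stability
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_mask(probs, p):
    """Nucleus sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

logits = [4.0, 2.0, 1.0]
print(softmax_t(logits, 0.1))    # near one-hot: top token dominates
print(softmax_t(logits, 10.0))   # near uniform: almost random sampling
print(top_p_mask(softmax_t(logits, 1.0), 0.9))  # only tokens 0 and 1 survive
```

Running it makes the T-limits concrete: at T = 0.1 the top token takes essentially all the probability mass, while at T = 10 the three tokens are nearly equiprobable.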

| Parameter | Range | Focused / Factual | Creative / Diverse |
|---|---|---|---|
| Temperature (T) | 0.0–2.0 | 0.0–0.3 | 0.7–1.0 |
| Top-P | 0.0–1.0 | 0.3–0.5 | 0.9–1.0 |
| Top-K | 1–vocab size | 10–20 | 50–100 |
| max_tokens | task-dependent | 256–512 | 2,048–8,192 |
| frequency_penalty | -2.0 to 2.0 | 0.3–0.5 (reduce repetition) | 0.0–0.2 |
| presence_penalty | -2.0 to 2.0 | 0.0–0.2 | 0.5–0.8 |

Critical rule: Do not set both Temperature and Top-P to high values simultaneously. Temperature scales the full distribution first; Top-P then samples from the already-scaled top-probability mass. Combining T = 1.5 and Top-P = 0.95 produces output more erratic than either parameter alone — the two parameters are designed to be used as alternatives, not stacked.

`frequency_penalty` reduces the probability of tokens proportional to how many times they have already appeared — positive values eliminate repetitive phrasing; negative values actively encourage repetition. `presence_penalty` applies a flat one-time penalty to any token that has appeared at all, regardless of frequency — it pushes the model to introduce new vocabulary and topics rather than repeating existing ones.
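Both penalties are additive adjustments to the next-token logits before sampling. A sketch following the adjustment formula in OpenAI's API documentation (logit - count * frequency_penalty - [count > 0] * presence_penalty); the token names and values are illustrative:

```python
def apply_penalties(logits, counts, frequency_penalty=0.0,
                    presence_penalty=0.0):
    """Penalise tokens that already appeared in the output:
    proportional to count (frequency) plus a flat hit (presence)."""
    adjusted = {}
    for tok, logit in logits.items():
        c = counts.get(tok, 0)
        adjusted[tok] = (logit
                         - c * frequency_penalty
                         - (1.0 if c > 0 else 0.0) * presence_penalty)
    return adjusted

logits = {"the": 2.0, "novel": 1.8}   # "the" has appeared 3 times already
adj = apply_penalties(logits, {"the": 3},
                      frequency_penalty=0.5, presence_penalty=0.5)
print(adj)   # "the" drops to 0.0; "novel" is untouched at 1.8
```

After adjustment the previously dominant token loses the argmax, which is exactly how positive penalties break repetition loops.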

The Reasoning-Format Trade-off

Forcing JSON via constrained decoding reduces model accuracy by 2.26 percentage points on function-calling benchmarks — BAML's schema-aligned parsing achieved 93.63% accuracy on BFCL vs. 91.37% for OpenAI's strict constrained decoding on the same benchmark.

The mechanism: constrained decoding applies a finite state machine that masks tokens incompatible with the current schema position. A model that wants to output `51.7` for a float field is forced to output `51` if the schema specifies integer — producing a technically valid but factually degraded result. Chain-of-Thought (CoT) prompting interacts badly with constrained decoding in the same way: including a reasoning field forces the model to escape newlines, quotes, and special characters within a JSON string — measurably degrading reasoning quality across all tested models.
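A toy demonstration of the masking mechanism, using an illustrative five-token vocabulary and hand-picked logits (a real FSM compiles the full JSON grammar, not a single rule):

```python
VOCAB = ["5", "1", ".", "7", "<end>"]

def integer_allowed(token: str) -> bool:
    """Toy FSM rule for a JSON integer field: a decimal point is
    never a legal continuation; digits and end-of-value are."""
    return token != "."

def greedy_decode(logits_per_step, constrained=False):
    out = []
    for logits in logits_per_step:
        candidates = [(score, tok) for score, tok in zip(logits, VOCAB)
                      if not constrained or integer_allowed(tok)]
        _, tok = max(candidates)        # argmax over the (masked) vocab
        if tok == "<end>":
            break
        out.append(tok)
    return "".join(out)

# Hand-picked logits: the model's preferred continuation is "51.7".
steps = [
    [9, 1, 0, 0, 0],   # "5" wins
    [1, 9, 0, 0, 0],   # "1" wins
    [0, 0, 9, 1, 5],   # "." preferred; "<end>" wins once "." is masked
    [0, 0, 0, 9, 1],   # "7" (only reached without the mask)
]

print(greedy_decode(steps))                    # "51.7"
print(greedy_decode(steps, constrained=True))  # "51"
```

Production implementations such as Outlines and XGrammar apply the same idea at scale: every schema-incompatible token is set to probability zero before sampling, so an invalid output is unreachable.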

The production-grade solution for systems requiring both reasoning depth and format guarantees: (1) Stage 1 — Send to GPT-4o or Claude 4.6 Sonnet without constraints: "Analyse this, reason step by step, explain your logic." (2) Stage 2 — Feed Stage 1 output to a small specialist model (Osmosis-Structure-0.6B or GPT-4o-mini with `strict: true`): "Extract the key data from this analysis and return it in this exact JSON schema."

This architecture preserves Stage 1 reasoning quality and achieves 100% format compliance in Stage 2 at a fraction of the cost of running a full frontier model in constrained mode.
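A sketch of the two-stage wiring. `reason_fn` and `structure_fn` are hypothetical stand-ins for the actual provider calls (a frontier model without constraints, then a small model in strict mode); the prompt strings are the ones quoted above:

```python
import json

def two_stage_extract(review: str, schema: dict,
                      reason_fn, structure_fn) -> dict:
    """Stage 1: unconstrained reasoning. Stage 2: constrained structuring.
    reason_fn / structure_fn wrap the real model API calls."""
    analysis = reason_fn(
        "Analyse this, reason step by step, explain your logic.\n\n"
        + review)
    structured = structure_fn(
        "Extract the key data from this analysis and return it "
        "in this exact JSON schema.\n\n"
        + json.dumps(schema) + "\n\n" + analysis)
    # Under strict constrained decoding, Stage 2 output is schema-valid
    # by construction, so this parse cannot fail.
    return json.loads(structured)
```

Passing the model calls in as functions keeps the pipeline testable with stubs and lets Stage 2 run on a cheap specialist model while Stage 1 keeps its full reasoning budget.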

PromptQuorum Multi-Model Test

Tested in PromptQuorum — 30 output control prompts dispatched across three models: Claude 4.6 Sonnet achieved 93% JSON compliance using XML-tagged format instructions without constrained decoding. GPT-4o achieved 89% compliance using numbered format rules. Gemini 2.5 Pro achieved 91% compliance with schema stated at both start and end. All three models produced shorter, less complete reasoning when `strict: true` constrained decoding was enabled — consistent with the 2.26-point accuracy drop observed on the BFCL benchmark.

Stop Sequences and Negative Constraints

Stop sequences — tokens that immediately terminate model output upon generation — are the most deterministic output control mechanism: the model halts the instant the specified string appears, regardless of remaining context.

Stop sequences are passed as an array of strings in the API call (`stop` parameter in OpenAI, `stop_sequences` in Anthropic). Common production uses:

  • `"###"` — terminates generation after a structured section marker, preventing continuation into irrelevant content
  • `"</output>"` — terminates after a closing XML tag, ensuring only the tagged content is returned
  • `"\n\n"` — terminates at the first blank line, limiting output to a single paragraph for classification or short-answer tasks
  • `"Human:", "User:"` — prevents the model from hallucinating a simulated conversation continuation

Negative constraints in the prompt body — "Do not include explanations", "No markdown", "Do not add introductory sentences" — reduce unwanted output patterns but cannot guarantee compliance the way stop sequences can. Use both: stop sequences for structural termination, negative constraints for content shaping.
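Hosted APIs enforce stop sequences server-side via the parameters above; for self-hosted streams the equivalent client-side cut is a few lines (the example text is illustrative):

```python
def truncate_at_stop(text: str, stop_sequences) -> str:
    """Client-side equivalent of the API-level stop parameter:
    cut at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = '<output>{"score": 7}</output>\nHuman: tell me more'
print(truncate_at_stop(raw, ["</output>", "Human:"]))
# → <output>{"score": 7}
```

Taking the earliest match across all sequences mirrors the API behaviour: whichever stop string appears first ends the output.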

Format Choices for Production Pipelines

JSON is the dominant output format for LLM production pipelines because it maps directly to API objects, arrays, and typed data — but forcing JSON via constrained decoding sacrifices 2–10% reasoning quality, making format selection a meaningful architectural decision.

TOON (Token-Optimised Output Notation) has emerged as an efficient input format for long structured prompts — it uses whitespace minimisation and shorthand keys to reduce input token consumption before the model generates output in JSON. For output, the recommended 2026 production architecture is: TOON for input (token efficiency) + JSON with constrained decoding for output (guaranteed format) — applied only after Stage 1 free-form reasoning is complete.

| Output Format | Use Case | Notes |
|---|---|---|
| JSON | APIs, pipelines, document stores | Native structured output support across all major providers |
| JSONL | Event streams, batch processing | One JSON object per line; suits streaming and logging |
| CSV | Legacy system integration | Simpler but no nested structure; good for tabular data |
| YAML | Configuration artefacts | Human-readable; used in CI/CD and infrastructure contexts |
| XML | Enterprise integration | Verbose; preferred by Claude for prompt structure, not for output |
| Markdown | Human-readable reports, documentation | Poor for downstream parsing; best for human consumers |

Global and Regional Considerations

European enterprises building LLM pipelines that process personal data must apply GDPR Article 25 (privacy by design) to output schema design — outputs that expose personal data fields in JSON payloads require a legal basis under Article 6 GDPR. The CNIL (France's data protection authority) issued guidance in January 2026 that automated decision-making outputs — including structured LLM outputs used in scoring or eligibility workflows — may trigger Article 22 GDPR rights to human review.

For EU teams requiring on-premise inference with structured output control, Mistral AI (France) supports vLLM-based constrained decoding with guided JSON parameters — enabling guaranteed JSON Schema compliance entirely within EU infrastructure, satisfying GDPR data residency requirements under Article 46. Mistral Large runs on-premise with structured output support.

Chinese enterprises use Qwen 2.5 (Alibaba) and DeepSeek V3 (DeepSeek AI) for production output-controlled pipelines. Both models support JSON mode and are locally deployable on Chinese enterprise infrastructure under China's Interim Measures for Generative AI (2023). Japanese enterprises running local inference via Ollama — LLaMA 3.1 7B at 8GB RAM, LLaMA 3.1 13B at 16GB RAM — benefit from Outlines and XGrammar for constrained decoding on self-hosted models, producing guaranteed JSON Schema compliance without external API calls.

Key Takeaways

  • Before structured output existed, models scored below 40% on complex JSON schema compliance; OpenAI's `strict: true` constrained decoding achieves 100%
  • Constrained decoding reduces reasoning accuracy by 2.26 percentage points on the BFCL benchmark — use the two-stage approach (free-form reasoning → specialist structuring model) for complex tasks
  • Do not combine high Temperature and high Top-P simultaneously — they compound to produce output more erratic than either parameter alone
  • `frequency_penalty` (range -2.0 to 2.0) reduces repetition in proportion to how often a token has appeared; `presence_penalty` (range -2.0 to 2.0) applies a flat penalty to any previously seen token — for focused factual output, set `frequency_penalty` to 0.3–0.5 and `presence_penalty` to 0.0–0.2
  • Stop sequences are the only deterministic output termination mechanism — unlike negative constraints in the prompt body, they cannot be overridden by the model
  • Temperature: T = 0.0–0.3 for deterministic factual tasks; T = 0.7–1.0 for creative tasks; T > 1.2 risks incoherence in production use
  • Claude 4.6 Sonnet achieves 93% JSON compliance with XML-tagged format prompts; GPT-4o achieves 89% with numbered format rules — both without constrained decoding

Frequently Asked Questions

What is the difference between Temperature and Top-P in LLMs?

Temperature (T) scales the entire softmax probability distribution of next-token predictions: T = 0.0 always selects the highest-probability token (deterministic); T = 1.0 preserves the natural distribution; T = 2.0 flattens it toward randomness. Top-P (nucleus sampling) then selects from the smallest set of tokens whose cumulative probability reaches P — at Top-P = 0.9 only the top 90% cumulative probability mass is eligible. They control different aspects of generation and should not both be set to high values simultaneously, as they compound erratic output.

Does forcing JSON output reduce AI response quality?

Yes — measurably. BAML's benchmark on BFCL showed schema-aligned free-form parsing achieved 93.63% accuracy vs. 91.37% for OpenAI's constrained decoding (strict function calling) — a 2.26-point quality reduction. The mechanism is token masking: constrained decoding prevents the model from selecting tokens that would violate the schema, even when those tokens would produce the most accurate answer. For complex reasoning tasks, the two-stage approach (free-form → specialist structuring) preserves quality while achieving 100% format compliance.

What is constrained decoding and how does it guarantee JSON output?

Constrained decoding applies a finite state machine (FSM) over the model's token generation process. At each generation step, the FSM evaluates which tokens from the full vocabulary would produce output compatible with the target schema at the current position — and masks all other tokens to probability zero. This makes it mathematically impossible to generate schema-invalid output. OpenAI implements this via `response_format: { type: "json_schema", strict: true }`. Anthropic implements it via Strict Tool Use Mode. Both guarantee schema-valid output at the token level.

What output format should I use for production LLM pipelines?

JSON is the standard for production LLM pipelines because it maps directly to typed API objects and is natively supported by all major providers (OpenAI, Anthropic, Google Gemini). Use JSONL for event streams and batch processing. Use CSV only for legacy system compatibility. Avoid XML as an output format (though it is effective as a prompt structure format for Claude 4.6 Sonnet). The 2026 recommended architecture is: TOON for input token efficiency + JSON with constrained decoding only for Stage 2 output after free-form Stage 1 reasoning.

How do stop sequences differ from negative constraints in prompts?

Stop sequences are enforced at the API/inference level — the model halts generation the instant the specified string is generated, with no exceptions. Negative constraints in the prompt body ("Do not include explanations", "No markdown") instruct the model to avoid certain outputs but are not binding — a model may still violate them, particularly under high Temperature settings or long-context drift. Use both: stop sequences for structural termination guarantees, negative constraints for shaping content style and reducing unwanted output patterns.


Apply these techniques across 25+ AI models simultaneously with PromptQuorum.
