Key Points
- Always show schema + 2–3 examples before asking for extraction
- Add field constraints in prompt: "phone must be +XX-XXX-XXX-XXXX format"
- Use XML or JSON templates in the prompt; models fill slots better than free-form generation
- Mark optional fields explicitly: "If field unknown, use null, not empty string"
- Test on edge cases (typos, missing data, non-English input) before production
Rule 1: Declare Schema First
Define the exact output schema in the system prompt before asking the model to extract.
- Schema format: JSON schema, Pydantic model text, or XML tag definitions
- Placement: System prompt or first user message, before the extraction task
- Be explicit about types: string (max 100 chars), enum (one of: A, B, C), number (0–100), boolean, or null
- Example: {"name": "string (required)", "phone": "string (optional, E.164 format)", "age": "integer (18–120)"}
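One way to guarantee the schema comes before the task is to generate the system prompt from a single schema definition. The following is a minimal sketch using only the standard library; `SCHEMA` and `build_system_prompt` are hypothetical names, and the field descriptions simply mirror the example above.

```python
import json

# Hypothetical schema; field descriptions follow the example in this section.
SCHEMA = {
    "name": "string (required)",
    "phone": "string (optional, E.164 format)",
    "age": "integer (18-120)",
}


def build_system_prompt(schema: dict) -> str:
    """Return a system prompt that states the schema before any task text."""
    schema_text = json.dumps(schema, indent=2)
    return (
        "You extract structured data from text.\n"
        "Return exactly one JSON object with these fields and types:\n"
        f"{schema_text}\n"
        "Do not add fields that are not listed."
    )


system_prompt = build_system_prompt(SCHEMA)
print(system_prompt)
```

Keeping the schema in one dictionary means the prompt and any downstream validation can be derived from the same definition.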
Rule 2: Provide 2–3 Examples
In-context examples reduce parsing errors by 50–70%.
- Format: "Example input | Expected output"
- Vary inputs: Normal case, edge case (missing field), invalid case (wrong format)
- Include type conversions: "twenty-five" (age as text) → 25 (integer)
- Explicitly show how to handle missing data: "Phone unknown → null"
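The example block can be rendered from data so that every pair uses the same "input | output" layout. This is a sketch under assumptions: the three example records and the `format_examples` helper are invented for illustration.

```python
import json

# Hypothetical examples: a normal case, a missing field, and a text-to-integer conversion.
EXAMPLES = [
    ("Jane Doe, 34, +15551234567", {"name": "Jane Doe", "phone": "+15551234567", "age": 34}),
    ("Contact: Bob Lee, age 41", {"name": "Bob Lee", "phone": None, "age": 41}),
    ("Ann Kim is twenty-five", {"name": "Ann Kim", "phone": None, "age": 25}),
]


def format_examples(examples) -> str:
    """Render each pair as 'Example input | Expected output' on one line."""
    lines = []
    for text, expected in examples:
        # json.dumps writes Python None as JSON null, which is what the model should copy.
        lines.append(f"Example input: {text} | Expected output: {json.dumps(expected)}")
    return "\n".join(lines)


few_shot_block = format_examples(EXAMPLES)
print(few_shot_block)
```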
Rule 3: Add Field-Level Constraints
Every field needs explicit rules: format, length, allowed values, and null handling.
- Enum fields: "status must be one of: pending, active, completed, failed"
- Format fields: "email must be valid RFC 5322 format" or "phone must be +XX-XXX-XXX-XXXX"
- Length fields: "description max 500 characters" or "title between 50 and 80 characters"
- Null handling: "If unknown, return null, not empty string or N/A"
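The same constraints stated in the prompt can also be checked in code after the model replies. The sketch below assumes a record with `status`, `phone`, and `description` fields; `check_record` is a hypothetical helper, and the phone pattern reads +XX-XXX-XXX-XXXX literally (two-digit country code).

```python
import re

ALLOWED_STATUS = {"pending", "active", "completed", "failed"}
# Literal reading of the +XX-XXX-XXX-XXXX pattern, e.g. +44-123-456-7890.
PHONE_RE = re.compile(r"\+\d{2}-\d{3}-\d{3}-\d{4}")


def check_record(record: dict) -> list:
    """Return a list of constraint violations; an empty list means the record passes."""
    problems = []
    if record.get("status") not in ALLOWED_STATUS:
        problems.append("status not in allowed values")
    phone = record.get("phone")
    # null is allowed for an unknown phone; an empty string is not.
    if phone is not None and not PHONE_RE.fullmatch(phone):
        problems.append("phone has wrong format (use null if unknown)")
    description = record.get("description")
    if description is not None and len(description) > 500:
        problems.append("description longer than 500 characters")
    return problems


print(check_record({"status": "active", "phone": None, "description": "ok"}))
print(check_record({"status": "done", "phone": "", "description": "ok"}))
```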
Rule 4: Use Template Injection
Pre-fill a template with placeholders so the model fills slots instead of generating free-form text.
- XML template: "<name>_____</name><phone>_____</phone>" → model fills blanks
- JSON template: `{"name": "____", "phone": "____"}` is easier to parse than unstructured text
- Constraint in template: show the expected format inside the slot, e.g. `<phone>+1-555-123-4567</phone>`
- Reduces hallucination: Model focuses on extraction, not format invention
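A slot template is easy to build and to read back programmatically. The sketch below is illustrative only: `build_template` and `parse_filled` are hypothetical helpers, and the model reply is hard-coded so the example runs without an API call.

```python
import re

FIELDS = ["name", "phone"]


def build_template(fields) -> str:
    """Build an XML template with one blank slot per field."""
    return "".join(f"<{f}>_____</{f}>" for f in fields)


def parse_filled(response: str, fields) -> dict:
    """Read each slot back; a missing tag or an unfilled or 'null' slot becomes None."""
    result = {}
    for f in fields:
        match = re.search(rf"<{f}>(.*?)</{f}>", response, re.DOTALL)
        value = match.group(1).strip() if match else None
        result[f] = None if value in (None, "", "null", "_____") else value
    return result


template = build_template(FIELDS)
# Hypothetical model reply, hard-coded for the sketch.
reply = "<name>Jane Doe</name><phone>null</phone>"
print(template)
print(parse_filled(reply, FIELDS))
```

Treating an untouched `_____` slot as null prevents a placeholder from leaking into downstream data when the model skips a field.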
Rule 5: Test Edge Cases in Prompt
Include edge cases in the examples, and state explicitly how to handle non-English input, typos, and missing data.
- Typos: "What if name is misspelled? Extract as given."
- Missing data: "If field not found, use null, not guessed values."
- Non-English: "If text is German/French, still extract to English schema."
- Conflicting fields: "If two names given, take the first; note discrepancy in error field."
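The same edge cases can be kept as a small regression suite and run against whatever function wraps the model call. This is a sketch under assumptions: the cases are invented, and `stub_extract` is a rule-based stand-in for a real model call so the example is runnable.

```python
# Hypothetical edge cases: (label, input text, expected record).
EDGE_CASES = [
    ("typo", "Name: Jhon Smiht", {"name": "Jhon Smiht", "phone": None}),
    ("missing data", "Name: Ana Ruiz", {"name": "Ana Ruiz", "phone": None}),
    ("non-English", "Name: Jürgen Müller, Telefon unbekannt", {"name": "Jürgen Müller", "phone": None}),
]


def run_edge_cases(extract, cases) -> list:
    """Call extract(text) on each case and return the labels that did not match."""
    return [label for label, text, expected in cases if extract(text) != expected]


def stub_extract(text: str) -> dict:
    """Stand-in for a model call: copies the name as given and never guesses a phone."""
    name = text.split(":", 1)[1].split(",")[0].strip()
    return {"name": name, "phone": None}


failed = run_edge_cases(stub_extract, EDGE_CASES)
print(failed)
```

In practice `stub_extract` would be replaced by the real extraction call, and any label returned by `run_edge_cases` points at a prompt rule that needs tightening.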
Rule 6: Add Validation Instructions
Tell the model how to validate its own output before returning.
- Self-check: "Before responding, verify: (1) All required fields present, (2) No null in required fields, (3) Types match schema"
- Optional error field: "Add an error field if extraction is uncertain; include a confidence score from 0 to 100"
- Retry pattern: "If validation fails, re-read input and retry"
- Fallback: "If unable to extract, return all nulls, not guesses"
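Self-check instructions in the prompt can be backed by the same checks in calling code. The sketch below shows one possible validate, retry, and fallback loop; `is_valid`, `extract_with_retry`, the `REQUIRED` mapping, and the stubbed model replies are all assumptions made for illustration.

```python
import json

REQUIRED = {"name": str, "age": int}
ALL_FIELDS = ["name", "phone", "age"]


def is_valid(record) -> bool:
    """Check that required fields are present, non-null, and of the expected type."""
    if not isinstance(record, dict):
        return False
    return all(isinstance(record.get(k), t) for k, t in REQUIRED.items())


def extract_with_retry(call_model, text: str, max_attempts: int = 2) -> dict:
    """Call the model, validate the reply, retry on failure, then fall back to all nulls."""
    for _ in range(max_attempts):
        try:
            record = json.loads(call_model(text))
        except json.JSONDecodeError:
            continue  # malformed JSON counts as a failed attempt
        if is_valid(record):
            return record
    return {field: None for field in ALL_FIELDS}


# Hypothetical model stub: the first reply is malformed, the second is valid.
replies = iter(['{"name": "Jane"', '{"name": "Jane", "phone": null, "age": 34}'])
result = extract_with_retry(lambda text: next(replies), "Jane, 34")
print(result)
```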
Common Mistakes
- No schema in prompt: the model invents fields or uses inconsistent types
- Examples too simple: only the happy path, with no edge cases or missing-data handling
- Ambiguous field names: "contact" could be a name, an email, or a phone number; use exact names
- Null/empty string confusion: models default to an empty string, so the prompt must explicitly say "use null"
- Type coercion ignored: "2025-04-05" and "April 5, 2025" are the same date, so the prompt must show one canonical format
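Two of these mistakes, null/empty string confusion and inconsistent date formats, can also be caught by a small post-processing step. The sketch below is a hedged example: the `NULL_LIKE` set and the list of accepted date formats are assumptions, and `%B` parsing of month names depends on an English locale.

```python
from datetime import datetime

# Assumed placeholder values that should be treated as null.
NULL_LIKE = {"", "n/a", "none", "unknown"}
# Assumed input formats; extend the list for other sources.
DATE_FORMATS = ["%Y-%m-%d", "%B %d, %Y"]


def normalize_null(value):
    """Map empty or placeholder strings to None; leave other values unchanged."""
    if isinstance(value, str) and value.strip().lower() in NULL_LIKE:
        return None
    return value


def normalize_date(value: str):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


print(normalize_null("N/A"))
print(normalize_date("April 5, 2025"))
```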
Sources
- OpenAI structured outputs guide, April 2026
- Anthropic prompt best practices for JSON extraction
- Pydantic documentation: Field constraints and validation