What Is Prompt Testing?
Prompt testing runs predefined inputs through a prompt and checks whether the output meets quality criteria. Unlike traditional software tests, which return a binary pass/fail, prompt tests usually score output quality on a scale.
Manual vs. Automated Testing
Manual testing is slow: 10 prompts × 20 test cases = 200 evaluations to read by hand. Automated testing scores outputs programmatically, using methods such as exact match, regex, LLM-as-judge, and semantic similarity.
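The automated scoring methods above can be sketched in a few lines. This is a minimal illustration, not any specific tool's API; `SequenceMatcher` stands in for the embedding-based similarity a real harness would use, and LLM-as-judge is omitted since it requires a model call.

```python
import re
from difflib import SequenceMatcher

def exact_match(output: str, expected: str) -> float:
    """Binary score: 1.0 only if the output matches the expected text exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def regex_match(output: str, pattern: str) -> float:
    """Binary score: 1.0 if the output contains the required pattern."""
    return 1.0 if re.search(pattern, output) else 0.0

def similarity(output: str, reference: str) -> float:
    """Graded score in [0, 1]: character-level similarity as a cheap
    stand-in for embedding similarity."""
    return SequenceMatcher(None, output, reference).ratio()
```

Exact match and regex give hard pass/fail signals; similarity is what lets prompt tests grade "on a scale" rather than binary.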
Best Tools for Developers
- Promptfoo: YAML, git-friendly, open-source
- LangSmith: LangChain integration, observability
- GitHub Actions: Maximum control, requires setup
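As an example of the YAML-based approach, a minimal Promptfoo config might look like the sketch below. It follows Promptfoo's documented `prompts`/`providers`/`tests` schema, but the model name, prompt, and assertion values are illustrative, not prescribed.

```yaml
# promptfooconfig.yaml — illustrative sketch
prompts:
  - "You are a support agent. Respond to: {{message}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      message: "I want a refund for order 4821"
    assert:
      - type: contains
        value: "refund"
      - type: llm-rubric
        value: "Response is polite and offers a concrete next step"
```

Because the file lives in the repo, prompt changes and their tests are reviewed together in version control.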
Best for Non-Technical Teams
- Braintrust: UI-based test creation
- PromptQuorum: Browser testing, comparison
- Munch: Simple test management
Common Testing Scenarios
- Regression: An updated prompt still passes existing test cases
- Edge cases: Unusual inputs
- Cross-model: Same prompt on GPT, Claude, Gemini
- Performance: Cost and latency comparison
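The regression and edge-case scenarios above can be combined into one small suite: save your cases, re-run them after every prompt change, and track the pass rate. A minimal sketch, where `call_model` is a hypothetical stub standing in for a real LLM API call:

```python
def call_model(prompt: str) -> str:
    # Hypothetical stub — replace with a real client call (OpenAI, Anthropic, etc.).
    return "Please provide the order number and we can cancel it for you."

# Saved cases: re-run after every prompt change (regression),
# and include unusual inputs (edge cases) alongside typical ones.
TEST_CASES = [
    {"input": "Cancel my order", "must_contain": "cancel"},
    {"input": "", "must_contain": "provide"},  # edge case: empty input
]

def run_suite(prompt_template: str, cases: list[dict]) -> float:
    """Return the fraction of cases whose output contains the required phrase."""
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower()
        in call_model(prompt_template.format(input=case["input"])).lower()
    )
    return passed / len(cases)

pass_rate = run_suite("Customer message: {input}\nReply helpfully.", TEST_CASES)
```

Running the same suite against different providers gives the cross-model comparison; logging token counts and response times alongside the score covers the performance scenario.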
Common Mistakes
- Testing only happy paths and skipping edge cases
- Using a test set too similar to the examples the prompt was tuned on
- Grading subjectively instead of against explicit criteria
- Testing only success scenarios, never failure scenarios