What Is a Prompt Regression?
A prompt regression occurs when a new prompt version fails on cases the old version handled correctly. Example: v1.0 classified sentiment with 95% accuracy; v1.1 scores only 88% on the same test set, a 7-point regression.
How to Detect Regressions
1. Maintain a golden dataset from the baseline version
2. Run the new version against the dataset
3. Compare results to the baseline
4. Flag any drop greater than 5% as a regression
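The steps above can be sketched as a small evaluation loop; `classify` is a hypothetical stand-in for whatever calls the prompt under test:

```python
# Sketch of golden-dataset regression detection. The `classify`
# callable is an assumption standing in for a real model call.
REGRESSION_THRESHOLD = 0.05  # flag drops larger than 5 points

def accuracy(classify, golden_dataset):
    """Fraction of golden cases the prompt version labels correctly."""
    correct = sum(1 for text, expected in golden_dataset
                  if classify(text) == expected)
    return correct / len(golden_dataset)

def detect_regression(baseline_accuracy, new_accuracy,
                      threshold=REGRESSION_THRESHOLD):
    """Return True when the new version drops more than `threshold`."""
    return (baseline_accuracy - new_accuracy) > threshold

# Using the numbers from the text: 95% -> 88% is a regression.
print(detect_regression(0.95, 0.88))  # True
```

Keeping the threshold in one named constant makes it easy to tighten later without hunting through scripts.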
Audit Checklist
- When was prompt last changed?
- Who made changes and why?
- What test cases exist?
- What was baseline performance?
- What is current performance?
- Are there known failure cases?
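The checklist above can be captured as a structured record so every audit collects the same fields; the class and field names here are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class PromptAudit:
    """One audit entry per prompt version; fields mirror the checklist."""
    last_changed: str          # when was the prompt last changed?
    changed_by: str            # who made the change?
    change_reason: str         # why was it changed?
    test_case_count: int       # what test cases exist?
    baseline_accuracy: float   # what was baseline performance?
    current_accuracy: float    # what is current performance?
    known_failures: list = field(default_factory=list)  # known failure cases

audit = PromptAudit(
    last_changed="2024-06-01",
    changed_by="jane",
    change_reason="shortened system prompt",
    test_case_count=200,
    baseline_accuracy=0.95,
    current_accuracy=0.88,
    known_failures=["sarcasm", "mixed sentiment"],
)
```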
Track Regressions Over Time
Maintain a historical record for each release: {date, version, accuracy, latency, cost}. Plot the trends over time and investigate any drops.
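A minimal sketch of such a history log and a drop check, using the record shape from the text (the example numbers are illustrative):

```python
# Version history entries follow the {date, version, accuracy,
# latency, cost} shape described above.
history = [
    {"date": "2024-05-01", "version": "v1.0", "accuracy": 0.95,
     "latency_ms": 420, "cost_usd": 0.0021},
    {"date": "2024-06-01", "version": "v1.1", "accuracy": 0.88,
     "latency_ms": 390, "cost_usd": 0.0019},
]

def accuracy_drops(history):
    """Yield (old_version, new_version) pairs where accuracy fell."""
    for prev, curr in zip(history, history[1:]):
        if curr["accuracy"] < prev["accuracy"]:
            yield prev["version"], curr["version"]

for old, new in accuracy_drops(history):
    print(f"accuracy dropped between {old} and {new}")
```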
How to Prevent Regressions
- Test all changes on the golden dataset before deploying
- Set alert thresholds (e.g., roll back if accuracy drops by more than 5%)
- Monitor production continuously
- Have a rollback plan ready
- Document all known failure cases
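The threshold-and-rollback practices above can be combined into a single pre-deploy gate; the `deploy` and `rollback` hooks here are hypothetical callbacks, not a real deployment API:

```python
# Sketch of a pre-deploy gate: ship the candidate prompt only if it
# stays within the alert threshold, otherwise trigger the rollback.
ALERT_THRESHOLD = 0.05

def gate_deploy(baseline_accuracy, candidate_accuracy, deploy, rollback):
    """Deploy the candidate, or roll back if it regressed too far."""
    if baseline_accuracy - candidate_accuracy > ALERT_THRESHOLD:
        rollback()
        return False
    deploy()
    return True
```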
Root Cause Analysis
When a regression is found, ask: Which test cases failed? What changed in the prompt? Why did the change break those cases? How do we prevent similar regressions?
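The first question, which cases failed, can be answered by diffing per-case results between versions. A minimal sketch, assuming results are stored as a mapping from case id to pass/fail:

```python
# Find cases the old version passed but the new version fails;
# these are the regressed cases to examine first.
def newly_failing(old_results, new_results):
    """Case ids that regressed between versions."""
    return sorted(case for case, passed in new_results.items()
                  if not passed and old_results.get(case, False))

old = {"c1": True, "c2": True, "c3": False}
new = {"c1": True, "c2": False, "c3": False}
print(newly_failing(old, new))  # ['c2']
```

Note that `c3` is excluded: it failed in both versions, so it is a known failure, not a regression.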
Common Mistakes
- No baseline performance data
- Not testing on representative cases
- Ignoring small regressions (they accumulate over time)
- Not investigating why a change caused a regression
- Removing the old version before confirming the new one is stable