What Does "Quality" Mean?
Accuracy (factual correctness)
Relevance (answers the question)
Tone (matches brand voice)
Structure (proper format)
Safety (no harmful outputs)
Latency (speed acceptable)
Cost (within budget)
Define Success Criteria First
Before evaluating anything, answer one question: what does success look like? Accuracy above 90%? A consistently professional tone? Valid JSON output?
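Writing the criteria down as data, not prose, keeps later checks honest. A minimal sketch, assuming illustrative keys and thresholds (nothing here is a standard schema):

```python
# Success criteria declared up front, before any evaluation runs.
# Keys and thresholds below are illustrative assumptions.
SUCCESS_CRITERIA = {
    "accuracy": 0.90,        # minimum pass rate on factual checks
    "format": "json",        # every response must parse as JSON
    "tone": "professional",  # target brand voice
}

def accuracy_ok(pass_rate: float) -> bool:
    """Compare a measured pass rate against the pre-declared threshold."""
    return pass_rate >= SUCCESS_CRITERIA["accuracy"]
```

Because the threshold is declared before the run, a result of 0.85 is unambiguously a failure rather than a number to rationalize afterward.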
Collect Test Cases
1. Gather 20-50 representative inputs
2. For each, document the expected output
3. Categorize: happy paths, edge cases, stress tests
4. Mark criticality (must-pass vs. nice-to-have)
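The four steps above map naturally onto a small record type. A sketch using a Python dataclass (the field names and sample cases are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    input: str       # the prompt input to test
    expected: str    # documented expected output
    category: str    # "happy_path" | "edge_case" | "stress_test"
    critical: bool   # True = must-pass, False = nice-to-have

# Two example cases; a real suite would hold 20-50 of these.
cases = [
    TestCase("What is 2+2?", "4", "happy_path", True),
    TestCase("", "Please provide a question.", "edge_case", False),
]
```

Storing cases this way makes it trivial to report pass rates per category or to gate releases on the `critical` subset only.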
Scoring Methods
- Exact match: the output matches the expected answer exactly
- Rubric: a human grades each output on a scale (e.g., 1-5)
- Automated metric: BLEU, F1, or embedding-similarity score
- LLM-as-Judge: another LLM grades the output against a rubric
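The first and third methods are easy to implement directly. A minimal sketch of exact match plus a token-overlap F1 (a simplified stand-in for heavier metrics like BLEU):

```python
def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact comparison."""
    return output.strip().lower() == expected.strip().lower()

def token_f1(output: str, expected: str) -> float:
    """F1 over whitespace tokens: harmonic mean of precision and recall."""
    out, exp = output.lower().split(), expected.lower().split()
    common = sum(min(out.count(t), exp.count(t)) for t in set(out))
    if not common:
        return 0.0
    precision = common / len(out)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits structured outputs (IDs, JSON fields); token F1 gives partial credit on free-form text, which exact match would score as a total miss.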
Run Evaluation
Feed each test case through the prompt, score the output, then compute the pass rate and average quality score.
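That loop can be sketched as a small harness. Here `generate` is a stand-in for your prompt-plus-model call and `score` is any scoring method from the previous section; both names are illustrative:

```python
def evaluate(cases, generate, score):
    """Run every case through `generate`, score each output, and summarize."""
    results = []
    for case in cases:
        output = generate(case["input"])
        results.append(score(output, case["expected"]))
    pass_rate = sum(1 for s in results if s >= 1.0) / len(results)
    avg_score = sum(results) / len(results)
    return pass_rate, avg_score
```

Swapping in a different `score` function lets the same harness run exact-match, metric-based, or LLM-as-Judge evaluations unchanged.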
Iterate Based on Results
Analyze the failures, adjust the prompt, and re-evaluate. Track results across prompt versions so improvements (and regressions) stay visible.
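Tracking across versions only works if every run is recorded somewhere durable. A minimal sketch appending each run to a JSONL history file (the file layout and field names are assumptions, not a standard format):

```python
import datetime
import json

def record_run(history_path, version, pass_rate, avg_score):
    """Append one evaluation run to a JSONL file so regressions are visible."""
    entry = {
        "version": version,
        "pass_rate": pass_rate,
        "avg_score": avg_score,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(history_path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

One line per run keeps the history diff-friendly and trivially plottable, so a prompt change that nudges pass rate from 0.82 down to 0.74 is caught immediately instead of months later.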
Common Mistakes
- Evaluating on too few examples
- Not defining criteria upfront
- Conflating unrelated metrics (e.g., averaging accuracy and speed into one score)
- Relying only on ad-hoc manual review (inconsistent scoring)
- Not tracking historical performance