Accuracy Metrics
- Exact match: Does the output exactly match the expected answer? (see the sketch after this list)
- F1 score: Harmonic mean of precision and recall, e.g. over shared tokens (also sketched below)
- BLEU: N-gram overlap with a reference, standard in machine translation
- Semantic similarity: Cosine similarity between embeddings of the output and the reference
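A minimal sketch of the first two in Python; the function names and SQuAD-style token-level F1 are illustrative choices, not a fixed standard:

```python
from collections import Counter

def exact_match(output: str, expected: str) -> bool:
    """Binary: does the normalized output exactly match the expected answer?"""
    return output.strip().lower() == expected.strip().lower()

def token_f1(output: str, expected: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    overlap = Counter(out_tokens) & Counter(exp_tokens)  # multiset intersection
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(out_tokens)
    recall = num_same / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # True
print(token_f1("the capital is Paris", "Paris"))  # 0.4
```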
Relevance Metrics
- MRR (Mean Reciprocal Rank): How high is the first correct answer ranked, averaged across queries? (sketched below)
- NDCG (Normalized Discounted Cumulative Gain): Ranking quality with graded relevance (also sketched below)
- Answer correctness: Did the output actually answer the question asked?
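A sketch of both ranking metrics, assuming relevance judgments are already attached to each ranked result (binary for MRR, graded gains for NDCG); this uses the linear-gain form of DCG:

```python
import math

def mrr(ranked_relevance: list[list[int]]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the first relevant item per query."""
    total = 0.0
    for rels in ranked_relevance:
        for rank, rel in enumerate(rels, start=1):
            if rel:  # first relevant result found
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)

def ndcg(graded: list[float], k: int | None = None) -> float:
    """Normalized DCG: DCG of the actual ranking over DCG of the ideal ranking."""
    def dcg(scores: list[float]) -> float:
        return sum(s / math.log2(i + 2) for i, s in enumerate(scores))
    graded = graded[:k] if k else graded
    ideal = dcg(sorted(graded, reverse=True))
    return dcg(graded) / ideal if ideal > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1/1) / 2 = 0.75
print(ndcg([1, 3, 2]))              # < 1.0: the best item is not ranked first
```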
Tone & Style Metrics
- Rubric-based: Score 1–5 for brand alignment against a written rubric
- LLM-as-Judge: Another LLM grades tone against the same rubric
- Keyword presence: Does the output contain required phrases? (sketched below)
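A deterministic sketch of the keyword check; the optional forbidden-phrase list is an extension beyond simple presence, and the example phrases are made up:

```python
from collections.abc import Sequence

def keyword_presence(output: str, required: Sequence[str],
                     forbidden: Sequence[str] = ()) -> dict:
    """Check that required phrases appear and (optionally) forbidden ones do not."""
    text = output.lower()
    missing = [kw for kw in required if kw.lower() not in text]
    violations = [kw for kw in forbidden if kw.lower() in text]
    return {"pass": not missing and not violations,
            "missing": missing, "violations": violations}

result = keyword_presence(
    "Thanks for reaching out! Our support team is here to help.",
    required=["support team"],
    forbidden=["unfortunately"],
)
print(result)  # {'pass': True, 'missing': [], 'violations': []}
```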
Safety Metrics
- Hallucination rate: % of generated claims unsupported by the source material (sketched below)
- Bias detection: Does the output stereotype or treat comparable groups differently?
- Toxicity: Score from a content moderation classifier
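A sketch of hallucination rate, assuming claims have already been extracted from the output and labeled as supported or not (by a human reviewer or a grader model); the example claims are illustrative:

```python
def hallucination_rate(claims: list[dict]) -> float:
    """Fraction of extracted claims not supported by the source material.

    Each claim dict is assumed to carry a boolean 'supported' label
    assigned upstream by a human or a grader model.
    """
    if not claims:
        return 0.0
    unsupported = sum(1 for c in claims if not c["supported"])
    return unsupported / len(claims)

claims = [
    {"text": "The Eiffel Tower is in Paris.", "supported": True},
    {"text": "It was built in 1920.", "supported": False},  # actually 1889
]
print(f"{hallucination_rate(claims):.0%}")  # 50%
```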
Cost & Performance Metrics
- Cost per prompt: API charges per call, driven by input and output token counts (a measurement sketch follows this list)
- Latency: Time from request to complete response
- Cost/quality ratio: Quality delivered per dollar spent
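A sketch of measuring latency and estimated cost around a model call; `call_model` is a hypothetical adapter, and the per-token prices are placeholders, not any provider's real rates:

```python
import time

# Illustrative per-token prices; real prices vary by provider and model.
PRICE_PER_1K_INPUT = 0.003   # USD per 1K input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.015  # USD per 1K output tokens (assumed)

def timed_call(call_model, prompt: str) -> dict:
    """Wrap a model call to record latency and estimated cost.

    `call_model` is any function returning (text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    text, in_tok, out_tok = call_model(prompt)
    latency = time.perf_counter() - start
    cost = (in_tok / 1000 * PRICE_PER_1K_INPUT
            + out_tok / 1000 * PRICE_PER_1K_OUTPUT)
    return {"text": text, "latency_s": latency, "cost_usd": cost}

fake = lambda p: ("ok", 120, 40)  # stand-in returning (text, in_tokens, out_tokens)
print(timed_call(fake, "Summarize our refund policy."))
```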
Combined Scoring
Use a weighted formula: Quality = 0.5*Accuracy + 0.3*Speed + 0.2*Cost, where each term is first normalized to [0, 1] with higher meaning better (so latency and dollar cost are inverted before weighting, otherwise expensive, slow responses would be rewarded). Adjust the weights for your use case; a sketch follows.
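A sketch of that formula with explicit normalization; the latency and cost ceilings are assumed budgets, not recommendations:

```python
def combined_score(accuracy: float, latency_s: float, cost_usd: float,
                   max_latency_s: float = 10.0, max_cost_usd: float = 0.05) -> float:
    """Weighted quality score on [0, 1]: higher is better for every term.

    Speed and cost are inverted and normalized against budget ceilings
    (the ceiling values here are illustrative) so faster/cheaper scores higher.
    """
    speed = max(0.0, 1.0 - latency_s / max_latency_s)
    thrift = max(0.0, 1.0 - cost_usd / max_cost_usd)
    return 0.5 * accuracy + 0.3 * speed + 0.2 * thrift

print(combined_score(accuracy=0.9, latency_s=2.0, cost_usd=0.01))
# 0.5*0.9 + 0.3*0.8 + 0.2*0.8 = 0.85
```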
Common Mistakes
- Using irrelevant metrics (BLEU for open-ended tasks)
- Ignoring cost in metric design
- Not combining multiple metrics
- Evaluating only best-case scenarios