Skill

SkillsAI & Agent Engineering › Evaluation & quality

advanced-evaluation

This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment.

Freerisk: low
advancedevaluationarxiv

The full skill

— name: advanced-evaluation description: This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. — # Advanced Evaluation This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems. **Key insight**: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops. ## When to Activate Activate this skill when: – Building automated evaluation pipelines for LLM outputs – Comparing multiple model responses to select the best one – Establishing consistent quality standards across evaluation teams – Debugging evaluation systems that show inconsistent results – Designing A/B tests for prompt or model changes – Creating rubrics for human or automated evaluation – Analyzing correlation between automated and human judgments ## Core Concepts ### The Evaluation Taxonomy Evaluation approaches fall into two primary categories with distinct reliability profiles: **Direct Scoring**: A single LLM rates one response on a defined scale. – Best for: Objective criteria (factual accuracy, instruction following, toxicity) – Reliability: Moderate to high for well-defined criteria – Failure mode: Score calibration drift, inconsistent scale interpretation **Pairwise Comparison**: An LLM compares two responses and selects the better one. – Best for: Subjective preferences (tone, style, persuasiveness) – Reliability: Higher than direct scoring for preferences – Failure mode: Position bias, length bias Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth. ### The Bias Landscape LLM judges exhibit systematic biases that must be actively mitigated: **Position Bias**: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check. **Length Bias**: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring. **Self-Enhancement Bias**: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation. **Verbosity Bias**: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail. **Authority Bias**: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer. ### Metric Selection Framework Choose metrics based on the evaluation task structure: | Task Type | Primary Metrics | Secondary Metrics | |———–|—————–|——————-| | Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ | | Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) | | Pairwise preference | Agreement rate, Position consistency | Confidence calibration | | Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall | The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise. ## Evaluation Approaches ### Direct Scoring Implementation Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format. **Criteria Definition Pattern**: “` Criterion: [Name] Description: [What this criterion measures] Weight: [Relative importance, 0-1] “` **Scale Calibration**: – 1-3 scales: Binary with neutral option, lowest cognitive load – 1-5 scales: Standard Likert, good balance of granularity and reliability – 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics **Prompt Structure for Direct Scoring**: “` You are an expert evaluator assessing response quality. ## Task Evaluate the following response against each criterion. ## Original Prompt {prompt} ## Response to Evaluate {response} ## Criteria {for each criterion: name, description, weight} ## Instructions For each criterion: 1. Find specific evidence in the response 2. Score according to the rubric (1-{max} scale) 3. Justify your score with evidence 4. Suggest one specific improvement ## Output Format Respond with structured JSON containing scores, justifications, and summary. “` **Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches. ### Pairwise Comparison Implementation Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation. **Position Bias Mitigation Protocol**: 1. First pass: Response A in first position, Response B in second 2. Second pass: Response B in first position, Response A in second 3. Consistency check: If passes disagree, return TIE with reduced confidence 4. Final verdict: Consistent winner with averaged confidence **Prompt Structure for Pairwise Comparison**: “` You are an expert evaluator comparing two AI responses. ## Critical Instructions – Do NOT prefer responses because they are longer – Do NOT prefer responses based on position (first vs second) – Focus ONLY on quality according to the specified criteria – Ties are acceptable when responses are genuinely equivalent ## Original Prompt {prompt} ## Response A {response_a} ## Response B {response_b} ## Comparison Criteria {criteria list} ## Instructions 1. Analyze each response independently first 2. Compare them on each criterion 3. Determine overall winner with confidence level ## Output Format JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning. “` **Confidence Calibration**: Confidence scores should reflect position consistency: – Both passes agree: confidence = average of individual confidences – Passes disagree: confidence = 0.5, verdict = TIE ### Rubric Generation Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring. **Rubric Components**: 1. **Level descriptions**: Clear boundaries for each score level 2. **Characteristics**: Observable features that define each level 3. **Examples**: Representative text for each level (optional but valuable) 4. **Edge cases**: Guidance for ambiguous situations 5. **Scoring guidelines**: General principles for consistent application **Strictness Calibration**: – **Lenient**: Lower bar for passing scores, appropriate for encouraging iteration – **Balanced**: Fair, typical expectations for production use – **Strict**: High standards, appropriate for safety-critical or high-stakes evaluation **Domain Adaptation**: Rubrics should use domain-specific terminology. A "code readability" rubric mentions variables, functions, and comments. A "medical accuracy" rubric references clinical terminology and evidence standards. ## Practical Guidance ### Evaluation Pipeline Design Production evaluation systems require multiple layers: “` ┌─────────────────────────────────────────────────┐ │ Evaluation Pipeline │ ├─────────────────────────────────────────────────┤ │ │ │ Input: Response + Prompt + Context │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Criteria Loader │ ◄── Rubrics, weights │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Primary Scorer │ ◄── Direct or Pairwise │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Bias Mitigation │ ◄── Position swap, etc. │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ ┌─────────────────────┐ │ │ │ Confidence Scoring │ ◄── Calibration │ │ └──────────┬──────────┘ │ │ │ │ │ ▼ │ │ Output: Scores + Justifications + Confidence │ │ │ └─────────────────────────────────────────────────┘ “` ### Common Anti-Patterns **Anti-pattern: Scoring without justification** – Problem: Scores lack grounding, difficult to debug or improve – Solution: Always require evidence-based justification before score **Anti-pattern: Single-pass pairwise comparison** – Problem: Position bias corrupts results – Solution: Always swap positions and check consistency **Anti-pattern: Overloaded criteria** – Problem: Criteria measuring multiple things are unreliable – Solution: One criterion = one measurable aspect **Anti-pattern: Missing edge case guidance** – Problem: Evaluators handle ambiguous cases inconsistently – Solution: Include edge cases in rubrics with explicit guidance **Anti-pattern: Ignoring confidence calibration** – Problem: High-confidence wrong judgments are worse than low-confidence – Solution: Calibrate confidence to position consistency and evidence strength ### Decision Framework: Direct vs. Pairwise Use this decision tree: “` Is there an objective ground truth? ├── Yes → Direct Scoring │ └── Examples: factual accuracy, instruction following, format compliance │ └── No → Is it a preference or quality judgment? ├── Yes → Pairwise Comparison │ └── Examples: tone, style, persuasiveness, creativity │ └── No → Consider reference-based evaluation └── Examples: summarization (compare to source), translation (compare to reference) “` ### Scaling Evaluation For high-volume evaluation: 1. **Panel of LLMs (PoLL)**: Use multiple models as judges, aggregate votes – Reduces individual model bias – More expensive but more reliable for high-stakes decisions 2. **Hierarchical evaluation**: Fast cheap model for screening, expensive model for edge cases – Cost-effective for large volumes – Requires calibration of screening threshold 3. **Human-in-the-loop**: Automated evaluation for clear cases, human review for low-confidence – Best reliability for critical applications – Design feedback loop to improve automated evaluation ## Examples ### Example 1: Direct Scoring for Accuracy **Input**: “` Prompt: "What causes seasons on Earth?" Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun, different hemispheres receive more direct sunlight at different times of year." Criterion: Factual Accuracy (weight: 1.0) Scale: 1-5 “` **Output**: “`json { "criterion": "Factual Accuracy", "score": 5, "evidence": [ "Correctly identifies axial tilt as primary cause", "Correctly explains differential sunlight by hemisphere", "No factual errors present" ], "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.", "improvement": "Could add the specific tilt angle (23.5°) for completeness." } “` ### Example 2: Pairwise Comparison with Position Swap **Input**: “` Prompt: "Explain machine learning to a beginner" Response A: [Technical explanation with jargon] Response B: [Simple analogy-based explanation] Criteria: ["clarity", "accessibility"] “` **First Pass (A first)**: “`json { "winner": "B", "confidence": 0.8 } “` **Second Pass (B first)**: “`json { "winner": "A", "confidence": 0.6 } “` (Note: Winner is A because B was in first position) **Mapped Second Pass**: “`json { "winner": "B", "confidence": 0.6 } “` **Final Result**: “`json { "winner": "B", "confidence": 0.7, "positionConsistency": { "consistent": true, "firstPassWinner": "B", "secondPassWinner": "B" } } “` ### Example 3: Rubric Generation **Input**: “` criterionName: "Code Readability" criterionDescription: "How easy the code is to understand and maintain" domain: "software engineering" scale: "1-5" strictness: "balanced" “` **Output** (abbreviated): “`json { "levels": [ { "score": 1, "label": "Poor", "description": "Code is difficult to understand without significant effort", "characteristics": [ "No meaningful variable or function names", "No comments or documentation", "Deeply nested or convoluted logic" ] }, { "score": 3, "label": "Adequate", "description": "Code is understandable with some effort", "characteristics": [ "Most variables have meaningful names", "Basic comments present for complex sections", "Logic is followable but could be cleaner" ] }, { "score": 5, "label": "Excellent", "description": "Code is immediately clear and maintainable", "characteristics": [ "All names are descriptive and consistent", "Comprehensive documentation", "Clean, modular structure" ] } ], "edgeCases": [ { "situation": "Code is well-structured but uses domain-specific abbreviations", "guidance": "Score based on readability for domain experts, not general audience" } ] } “` ## Guidelines 1. **Always require justification before scores** – Chain-of-thought prompting improves reliability by 15-25% 2. **Always swap positions in pairwise comparison** – Single-pass comparison is corrupted by position bias 3. **Match scale granularity to rubric specificity** – Don't use 1-10 without detailed level descriptions 4. **Separate objective and subjective criteria** – Use direct scoring for objective, pairwise for subjective 5. **Include confidence scores** – Calibrate to position consistency and evidence strength 6. **Define edge cases explicitly** – Ambiguous situations cause the most evaluation variance 7. **Use domain-specific rubrics** – Generic rubrics produce generic (less useful) evaluations 8. **Validate against human judgments** – Automated evaluation is only valuable if it correlates with human assessment 9. **Monitor for systematic bias** – Track disagreement patterns by criterion, response type, model 10. **Design for iteration** – Evaluation systems improve with feedback loops ## Integration This skill integrates with: – **context-fundamentals** – Evaluation prompts require effective context structure – **tool-design** – Evaluation tools need proper schemas and error handling – **context-optimization** – Evaluation prompts can be optimized for token efficiency – **evaluation** (foundational) – This skill extends the foundational evaluation concepts ## References Internal reference: – [LLM-as-Judge Implementation Patterns](./references/implementation-patterns.md) – [Bias Mitigation Techniques](./references/bias-mitigation.md) – [Metric Selection Guide](./references/metrics-guide.md) External research: – [Eugene Yan: Evaluating the Effectiveness of LLM-Evaluators](https://eugeneyan.com/writing/llm-evaluators/) – [Judging LLM-as-a-Judge (Zheng et al., 2023)](https://arxiv.org/abs/2306.05685) – [G-Eval: NLG Evaluation using GPT-4 (Liu et al., 2023)](https://arxiv.org/abs/2303.16634) – [Large Language Models are not Fair Evaluators (Wang et al., 2023)](https://arxiv.org/abs/2305.17926) Related skills in this collection: – evaluation – Foundational evaluation concepts – context-fundamentals – Context structure for evaluation prompts – tool-design – Building evaluation tools — ## Skill Metadata **Created**: 2024-12-24 **Last Updated**: 2024-12-24 **Author**: Muratcan Koylan **Version**: 1.0.0