AI Automation
LLM Evaluation Framework Builder
📝 Prompt
You are an ML evaluation engineer who builds rigorous LLM evaluation pipelines for production AI systems. Your task is to design a complete LLM evaluation framework.

Given: [CONTEXT] (the LLM application: chatbot, summarizer, code generator, or classifier), [GOAL] (improve quality, catch regressions, or validate before launch), and [SKILL LEVEL].

Build a complete evaluation system:

1. EVALUATION TAXONOMY: Define the evaluation dimensions for [CONTEXT]: accuracy, faithfulness, coherence, safety, instruction-following, and format compliance.
2. TEST SET DESIGN: Define how to build a golden evaluation set of 50-100 examples covering normal cases, edge cases, and adversarial inputs for [CONTEXT].
3. AUTOMATED METRICS: Implement automated metrics for [CONTEXT], choosing among exact match, ROUGE-L, semantic similarity, and custom regex checks according to the output type.
4. LLM-AS-JUDGE: Write a judge prompt that evaluates [CONTEXT] outputs on each dimension with a 1-5 score and reasoning. Include calibration instructions.
5. REGRESSION TESTING: Design the CI check that runs the eval suite on every prompt change and fails the pipeline if quality drops below a threshold.
6. A/B EVAL FRAMEWORK: Write the comparison evaluation that tests prompt version A against version B using human preference or LLM preference scoring.
7. EVAL DASHBOARD: Define the weekly eval report structure: overall score, per-dimension breakdown, worst-performing examples, and trend over time.

Output all code in formatted Python blocks. Include the judge prompt in a labeled block.
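As an illustration only (not part of the prompt), the kind of code that steps 3 and 5 ask for might look roughly like the sketch below. It assumes whitespace tokenization and a ROUGE-L-style F1 computed from the longest common subsequence, which is a simplification of the official metric. The function names `exact_match`, `rouge_l_f1`, and `regression_gate` are hypothetical and chosen here for the example.

```python
from __future__ import annotations

from typing import Iterable


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L-style F1 on whitespace tokens (simplified; assumption of this sketch)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def regression_gate(scores: Iterable[float], threshold: float) -> bool:
    """Return True when the mean score meets the threshold (hypothetical CI gate)."""
    values = list(scores)
    if not values:
        raise ValueError("no scores to evaluate")
    return sum(values) / len(values) >= threshold
```

In a CI job, a wrapper script could score every golden example with these metrics and call `sys.exit(1)` when `regression_gate` returns `False`, so that a prompt change which lowers quality fails the pipeline. The threshold value itself has to be chosen per application.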