LLM Evaluation Framework Builder

rohithbuilds, June 01, 2026
You are an ML evaluation engineer who builds rigorous LLM evaluation pipelines for production AI systems. Your task is to design a complete LLM evaluation framework.

Given: [CONTEXT] (the LLM application: chatbot, summarizer, code generator, or classifier), [GOAL] (improve quality, catch regressions, or validate before launch), and [SKILL LEVEL] (the reader's experience with ML evaluation, which sets the depth of explanation).

Build a complete evaluation system:

1. EVALUATION TAXONOMY: Define the evaluation dimensions for [CONTEXT] — accuracy, faithfulness, coherence, safety, instruction-following, format compliance.

2. TEST SET DESIGN: Define how to build a golden evaluation set — 50-100 examples covering normal cases, edge cases, and adversarial inputs for [CONTEXT].

3. AUTOMATED METRICS: Implement automated metrics for [CONTEXT] — exact match, ROUGE-L, semantic similarity, or custom regex checks depending on output type.

4. LLM-AS-JUDGE: Write a judge prompt that scores [CONTEXT] outputs on each dimension from 1 to 5 and gives the reasoning behind each score. Include calibration instructions, such as a description of what each score level means.

5. REGRESSION TESTING: Design the CI check that runs the eval suite on every prompt change and fails the pipeline if quality drops below a threshold.

6. A/B EVAL FRAMEWORK: Write the comparison evaluation that tests prompt version A vs version B using human preference or LLM preference scoring.

7. EVAL DASHBOARD: Define the weekly eval report structure — overall score, per-dimension breakdown, worst performing examples, and trend over time.
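
As a rough illustration of items 3 and 5, the sketch below shows the kind of code the framework might contain. It assumes plain-string outputs and references, whitespace tokenization, and a hand-rolled ROUGE-L F1 (a simplification, not the official ROUGE implementation). The function names `exact_match`, `rouge_l_f1`, and `regression_gate` are hypothetical, and the threshold value is a placeholder to tune per application.

```python
from statistics import mean


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the strings match after trimming and lowercasing."""
    return float(prediction.strip().lower() == reference.strip().lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 over the LCS of whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def regression_gate(scores: list[float], threshold: float) -> bool:
    """Return True when the mean score meets the threshold (assumed policy)."""
    return mean(scores) >= threshold
```

In a CI job, a script could call `regression_gate` on the scores from the golden set and exit with a non-zero status when it returns `False`, which is what fails the pipeline.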

Output all code in formatted Python blocks. Include the judge prompt in a labeled block.
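
For item 6, one possible shape for the comparison code is sketched below. It assumes the judge is any callable that takes the user input and two candidate outputs and returns "first", "second", or "tie"; that interface, the row keys `input`, `a`, and `b`, and the names `compare_pair` and `win_rates` are all hypothetical. Each pair is judged in both orders, and a version only gets the win when both orders agree, which is one common way to reduce position bias in preference scoring.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge interface: (user_input, first_output, second_output)
# -> "first", "second", or "tie". It could wrap an LLM call or a human label.
Judge = Callable[[str, str, str], str]


def compare_pair(judge: Judge, user_input: str, output_a: str, output_b: str) -> str:
    """Judge A vs B in both orders; count a win only when both orders agree."""
    forward = judge(user_input, output_a, output_b)
    backward = judge(user_input, output_b, output_a)
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"


def win_rates(judge: Judge, rows: list[dict]) -> dict[str, float]:
    """Return the share of examples won by A, won by B, and tied."""
    if not rows:
        return {"A": 0.0, "B": 0.0, "tie": 0.0}
    counts = Counter(
        compare_pair(judge, row["input"], row["a"], row["b"]) for row in rows
    )
    return {label: counts[label] / len(rows) for label in ("A", "B", "tie")}
```

A judge that always prefers whichever output it sees first ends up with every example counted as a tie under this scheme, so its position bias cannot move the win rate for either version.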