Prompt Engineering

AI Output Evaluator

rohithbuilds, May 30, 2026
You are a quality assurance engineer and LLM evaluation specialist who builds rubrics and evaluation pipelines for AI-generated outputs. Your task is to design a complete AI output evaluation system.

Given: [TOPIC] (the AI task being evaluated), [GOAL] (what a perfect output achieves), [TARGET AUDIENCE] (who consumes the output), and [CONTEXT] (any additional background or constraints).

Build a complete evaluation framework:

1. EVALUATION DIMENSIONS: Define 5–7 dimensions to evaluate outputs on (e.g., accuracy, relevance, format adherence, tone, completeness, safety). For each: name, definition, and why it matters for [GOAL].

2. SCORING RUBRIC: Design a 1–5 scoring rubric for each dimension with clear behavioral anchors at each level.

3. HUMAN EVAL PROTOCOL: Write instructions for a human evaluator — how to score, what to do when unsure, and how to handle edge cases.

4. LLM-AS-JUDGE PROMPT: Write a prompt that uses an LLM to evaluate another LLM's output using the rubric. Include anti-bias instructions.

5. AUTOMATED METRICS: For each dimension, suggest an automated metric (ROUGE, BERTScore, regex check, custom classifier) where applicable.

6. CALIBRATION SET: Define how to build a calibration set of 10–20 examples with gold-standard scores to align evaluators.

7. REPORTING TEMPLATE: Design a weekly quality report format that tracks average scores per dimension, regressions, and top failure modes.
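
As a rough illustration of the numbers the weekly report in item 7 tracks, the sketch below averages 1–5 rubric scores per dimension and flags week-over-week regressions. The record shape (one dict per evaluated output), the function names, and the 0.3-point regression threshold are all assumptions made for this example; none of them comes from the prompt itself.

```python
from statistics import mean

# Hypothetical record shape: one dict per evaluated output,
# mapping dimension name -> 1-5 rubric score.

def dimension_averages(records):
    """Average each dimension's score across all evaluated outputs."""
    dims = sorted({d for r in records for d in r})
    return {d: round(mean(r[d] for r in records if d in r), 2) for d in dims}

def regressions(current, previous, threshold=0.3):
    """Return dimensions whose average dropped by at least `threshold`.

    The threshold is an illustrative assumption; tune it to your data.
    """
    return {
        d: round(current[d] - previous[d], 2)
        for d in current
        if d in previous and previous[d] - current[d] >= threshold
    }

# Toy data for two consecutive weeks (made up for the example).
last_week = [{"accuracy": 5, "tone": 4}, {"accuracy": 4, "tone": 4}]
this_week = [{"accuracy": 4, "tone": 4}, {"accuracy": 3, "tone": 5}]

prev_avg = dimension_averages(last_week)
curr_avg = dimension_averages(this_week)
dropped = regressions(curr_avg, prev_avg)

print(curr_avg)
print(dropped)
```

Top failure modes could be tallied the same way, for instance by counting evaluator-assigned failure tags per week, if the evaluation records carry such tags.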

Format as an evaluation specification document. Include the rubric as a table.