Building Custom LLM Judge Rubrics
Create domain-specific evaluation criteria and train judge models for your use case.
What is an LLM Judge?
An LLM judge is a powerful language model that evaluates outputs from your target AI system. Instead of relying on expensive human reviewers or brittle regex patterns, judges use sophisticated reasoning to assess quality across multiple dimensions.
When to Use LLM Judges
LLM judges excel at evaluating:
- Open-ended generation: Essays, creative writing, summaries
- Conversational AI: Helpfulness, tone, empathy in chat responses
- Reasoning tasks: Logical coherence, factual accuracy
- Multi-dimensional quality: Scoring outputs on several criteria simultaneously
Anatomy of a Good Rubric
A rubric defines what the judge should evaluate and how. Strong rubrics include:
1. Clear Evaluation Dimensions
Break down "quality" into specific, measurable criteria:
- Accuracy: Is the information correct?
- Relevance: Does it address the user's question?
- Completeness: Are all aspects covered?
- Clarity: Is it easy to understand?
- Safety: Does it avoid harmful content?
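The dimensions above can be captured as a small data structure that your evaluation code iterates over. This is an illustrative sketch; the class and field names are assumptions, not a platform API:

```python
from dataclasses import dataclass

@dataclass
class Dimension:
    name: str      # short label used in the judge prompt and reports
    question: str  # the specific question the judge should answer

# Illustrative rubric mirroring the five dimensions above.
RUBRIC = [
    Dimension("Accuracy", "Is the information correct?"),
    Dimension("Relevance", "Does it address the user's question?"),
    Dimension("Completeness", "Are all aspects covered?"),
    Dimension("Clarity", "Is it easy to understand?"),
    Dimension("Safety", "Does it avoid harmful content?"),
]
```

Keeping the rubric as data rather than prose makes it easy to render into a judge prompt and to version alongside your test cases.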
2. Scoring Scale
Define what each score means:
- 1: Completely fails the criterion
- 2: Partially meets the criterion with major issues
- 3: Meets the criterion with minor issues
- 4: Fully meets the criterion
- 5: Exceeds expectations
3. Concrete Examples
Show the judge what good and bad look like:
Example for "Clarity" (Score: 5/5)
"To reset your password, click 'Forgot Password' on the login page, enter your email, and follow the link we send you."
Example for "Clarity" (Score: 2/5)
"There's a thing you can do with the account settings or maybe the profile area to change your credentials."
Creating Your First Rubric
Navigate to the LLM Judge page and click "Create New Rubric".
Step 1: Define Your Task
Goal: Ensure responses are helpful, accurate, and empathetic
Step 2: Choose Evaluation Dimensions
For a customer support bot, you might evaluate:
- Helpfulness (1-5)
- Accuracy (1-5)
- Empathy (1-5)
- Conciseness (1-5)
Step 3: Write Detailed Instructions
Example Judge Prompt:
"You are an expert evaluator of customer support chatbot responses. Evaluate the following response on four dimensions: Helpfulness, Accuracy, Empathy, and Conciseness. For each dimension, provide a score from 1-5 and a brief justification. Consider that excellent support responses should directly address the customer's issue, provide accurate information, show understanding of the customer's frustration, and avoid unnecessary verbosity."
Step 4: Add Few-Shot Examples
Include 3-5 example evaluations showing your judge how to score. This dramatically improves consistency and alignment with your standards.
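Steps 3 and 4 can be combined by assembling the instructions, dimensions, and few-shot examples into one prompt string. A minimal sketch, assuming a plain-text score format; the example data and helper name are hypothetical:

```python
DIMENSIONS = ["Helpfulness", "Accuracy", "Empathy", "Conciseness"]

INSTRUCTIONS = (
    "You are an expert evaluator of customer support chatbot responses. "
    "For each dimension, give a score from 1-5 and a brief justification."
)

# Hypothetical few-shot example; in practice include 3-5 of these.
FEW_SHOT = [
    {
        "response": "To reset your password, click 'Forgot Password' on the login page.",
        "scores": {"Helpfulness": 5, "Accuracy": 5, "Empathy": 3, "Conciseness": 5},
    },
]

def build_judge_prompt(response: str) -> str:
    """Assemble instructions, dimensions, few-shot examples, and the target."""
    lines = [INSTRUCTIONS, f"Dimensions: {', '.join(DIMENSIONS)}", ""]
    for ex in FEW_SHOT:
        lines.append(f"Example response: {ex['response']}")
        scores = ", ".join(f"{d}: {s}/5" for d, s in ex["scores"].items())
        lines.append(f"Example scores: {scores}")
        lines.append("")
    lines.append(f"Response to evaluate: {response}")
    return "\n".join(lines)
```

The assembled string is what you send to the judge model; the exact layout matters less than keeping it consistent across all evaluations.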
Training and Validating Judges
Before deploying a judge at scale:
1. Collect Human Ground Truth
Have human annotators evaluate 50-100 test cases. These ratings become your "gold standard" for measuring judge performance.
2. Measure Judge Alignment
Run your judge on the same test cases and calculate:
- Correlation: How closely do judge scores track human scores?
- Agreement rate: On what percentage of cases do the judge and human raters agree?
- Bias: Does the judge systematically over- or under-score?
Our platform automatically computes these metrics on the Alignment Dashboard.
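If you want to compute these metrics yourself, all three reduce to a few lines of arithmetic over paired score lists. A minimal sketch using only the standard library (Pearson correlation, exact-match agreement, and mean bias):

```python
from statistics import mean

def alignment_metrics(judge: list[int], human: list[int]) -> dict:
    """Compare judge scores to human gold-standard scores on the same cases."""
    n = len(judge)
    mj, mh = mean(judge), mean(human)
    cov = sum((j - mj) * (h - mh) for j, h in zip(judge, human))
    sd_j = sum((j - mj) ** 2 for j in judge) ** 0.5
    sd_h = sum((h - mh) ** 2 for h in human) ** 0.5
    return {
        # Pearson correlation: how closely judge scores track human scores
        "correlation": cov / (sd_j * sd_h) if sd_j and sd_h else 0.0,
        # Fraction of cases with exact score agreement
        "agreement": sum(j == h for j, h in zip(judge, human)) / n,
        # Positive bias means the judge systematically over-scores
        "bias": mj - mh,
    }
```

For 5-point scales, exact agreement is a strict metric; some teams also track "within-one" agreement (scores differing by at most one point).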
3. Iterate on the Rubric
If alignment is low (<70% agreement), refine your rubric:
- Add more specific criteria definitions
- Include edge case examples
- Simplify scoring scales (3-point vs. 5-point)
- Use a more powerful judge model (GPT-4 over GPT-3.5)
Best Practices
- Use powerful models: GPT-4, Claude 3.5, or Gemini Pro make better judges
- Keep rubrics focused: 3-5 dimensions is ideal; too many dilutes quality
- Provide context: Give judges access to user intent, conversation history
- Chain-of-thought reasoning: Ask judges to explain their scores
- Calibrate regularly: Re-validate alignment when your task changes
- Monitor drift: Track judge performance over time for consistency
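Chain-of-thought scoring only pays off if you can reliably extract the scores from the judge's free-text reply. A minimal parsing sketch, assuming the "Dimension: N/5" convention used in the example prompt above (the format is an assumption, not a fixed output spec):

```python
import re

def parse_judge_reply(text: str) -> dict:
    """Extract per-dimension scores from a reply using the assumed
    'Dimension: <score>/5' convention; justifications are left as prose."""
    scores = {}
    for match in re.finditer(r"(\w+):\s*([1-5])/5", text):
        scores[match.group(1)] = int(match.group(2))
    return scores
```

If your judge model supports structured (JSON) output, prefer that over regex parsing; this fallback is mainly useful for plain-text judges.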
Advanced Techniques
Constitutional AI
Train judges to enforce specific values or policies (e.g., "never give medical advice" or "always prioritize user privacy"). Embed these as hard constraints in your rubric.
Multi-Judge Consensus
Use 3-5 different judge models and aggregate their scores. This reduces individual model biases and increases reliability.
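Aggregation can be as simple as taking the per-dimension median across judges, which is robust to a single outlier. A minimal sketch (input format is an assumption: one score dict per judge):

```python
from statistics import median

def consensus(judge_scores: list[dict]) -> dict:
    """Aggregate per-dimension scores from several judges by median.
    Each input dict maps dimension name -> score from one judge."""
    dims = judge_scores[0].keys()
    return {d: median(s[d] for s in judge_scores) for d in dims}
```

Mean aggregation works too, but the median better resists one judge with a strong leniency or severity bias.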
Specialized Judges
Fine-tune smaller models on your domain for faster, cheaper evaluation. Our platform supports custom model deployment.
Common Pitfalls
Vague criteria: "Good quality" is not actionable. Be specific.
Anchor bias: Judges may favor responses similar to examples in the rubric.
Leniency bias: Some models tend to over-score; calibrate with human data.
Context blindness: Ensure judges have access to all relevant information.
Real-World Example
Use Case: Technical Documentation Generator
Dimensions:
- Technical accuracy (1-5)
- Code correctness (1-5)
- Clarity for beginners (1-5)
- Completeness (1-5)
Result: Achieved 82% agreement with human reviewers, saving 20 hours/week of manual review.