Evaluating the Evaluators
Shahul Es
Aug 18, 2025
If you’re building AI products today, chances are you use LLMs as judges. They grade correctness, faithfulness, or goal success so your team can ship faster without relying on human annotators for every iteration. But here’s the catch: are these LLM-as-judges actually reliable, and can we make them better?
We ran a study to find out. We built a benchmark dataset closer to real-world outputs than academic sets like TriviaQA, and then tested in-context learning strategies that many teams already use. The results were not what we expected: bigger reasoning-focused models improved by as much as 10 points in F1, smaller distilled models got worse when “optimised,” and Anthropic’s models turned out to be the most stable.
This blog breaks down how we built the benchmark, what we tested, and what these results mean for anyone relying on LLMs as evaluators.
How We Designed the Benchmark
We began by narrowing the scope to a single but widely used evaluation metric: answer correctness. The task is simple to state: given a model response and expert-written grading notes, the judge must decide whether the response sufficiently covers all the required points.
From this definition, we designed the dataset schema:
Question: a domain-specific query
Grading notes: expert-written checklist of points that a good answer must contain
Response: an AI-generated answer
Verdict: a human-written pass/fail judgment
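The schema above can be sketched as a small data structure. This is an illustrative mock-up, not the actual dataset format; the field names and the example values are our own.

```python
from dataclasses import dataclass

# Hypothetical sketch of one benchmark sample; field names and
# values are illustrative, not the dataset's actual schema.
@dataclass
class JudgeSample:
    question: str       # domain-specific query
    grading_notes: str  # expert-written checklist of required points
    response: str       # AI-generated answer to be judged
    verdict: bool       # human-written pass/fail judgment (True = pass)

sample = JudgeSample(
    question="What does HTTP status 429 mean?",
    grading_notes="Must mention rate limiting and the Retry-After header.",
    response="429 indicates too many requests; clients should back off.",
    verdict=False,  # fails: the response omits the Retry-After header
)
```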
To build the dataset, we adopted a synthetic data generation pipeline and then applied human review for verification. This ensured both scale and quality. The final dataset contains about 200 samples, each resembling the longer and more nuanced responses found in real-world systems rather than the short trivia-style answers in academic benchmarks.
Example sample in EvalsBench Dataset
With the dataset ready, we evaluated a range of in-context learning strategies that AI teams often experiment with in practice. These included:
Vanilla: baseline LLM-as-judge with only a simple user-defined prompt.
Fixed few-shot: always show the same examples.
Random few-shot: randomly choose examples from a list of expert-annotated samples.
Dynamic few-shot: retrieve the examples closest to the current task from a list of expert-annotated samples.
Automated prompt optimisation: a coding agent optimises the prompt based on expert-annotated samples.
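The dynamic few-shot strategy can be sketched as a retrieval step over the expert-annotated pool. Production systems typically use embedding models for similarity; the bag-of-words cosine below is a stand-in for illustration, and the example pool is hypothetical.

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Toy tokenizer: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dynamic_few_shot(task: str, pool: list[str], k: int = 2) -> list[str]:
    """Return the k expert-annotated examples most similar to the task."""
    q = tokens(task)
    return sorted(pool, key=lambda ex: cosine(q, tokens(ex)), reverse=True)[:k]

pool = [
    "Judge whether the answer covers rate limiting.",
    "Judge whether the summary preserves all named entities.",
    "Judge whether the answer explains HTTP caching headers.",
]
print(dynamic_few_shot("Does the answer cover HTTP rate limiting?", pool, k=1))
```

The retrieved examples would then be prepended to the judge prompt, so the judge sees precedents closest to the case at hand.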

Baseline LLM as judge performance compared to various in context learning approaches
What We Found
| Provider | Model | Vanilla | Fixed few-shot | Random few-shot | Dynamic few-shot | Automatic prompt optimisation |
|---|---|---|---|---|---|---|
| Anthropic | claude-3.5-haiku | 66.95 | 67.8 | 66.95 | 66.1 | 74.42 |
| Anthropic | claude-4-opus | 86.02 | 94.34 | 91.14 | 95.18 | 95.12 |
| Anthropic | claude-4-sonnet | 87.43 | 92.12 | 90.12 | 90.57 | 93.57 |
| Google | gemini-2.5-flash | 87.91 | 87.78 | 90.4 | 89.89 | 98.16 |
| Google | gemini-2.5-flash-lite | 89.53 | 80.77 | 76.83 | 79.49 | 56.64 |
| Google | gemini-2.5-pro | 86.19 | 92.4 | 94.05 | 92.94 | 95.71 |
| OpenAI | gpt-4o | 80.63 | 82.35 | 80.63 | 80.0 | 92.86 |
| OpenAI | gpt-4o-mini | 84.49 | 79.79 | 75.86 | 76.65 | 51.38 |
| OpenAI | o3-mini | 87.43 | 95.76 | 93.02 | 91.43 | 95.81 |
F1 score for various models and in context learning strategies. Higher is better.
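For clarity on what these numbers measure: F1 is computed from the judge's pass/fail verdicts against the human verdicts. A minimal sketch, assuming "pass" is the positive class (the verdict lists below are made up, not from the benchmark):

```python
def f1_score(human: list[bool], judge: list[bool]) -> float:
    """F1 of judge verdicts vs. human verdicts, with pass (True) as positive."""
    tp = sum(h and j for h, j in zip(human, judge))        # both say pass
    fp = sum((not h) and j for h, j in zip(human, judge))  # judge passes a fail
    fn = sum(h and (not j) for h, j in zip(human, judge))  # judge fails a pass
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

human = [True, True, False, True, False]
judge = [True, False, False, True, True]
print(round(f1_score(human, judge), 2))  # prints 0.67
```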
When we put the models through these strategies, the results surprised us. There was no single strategy that worked across the board. Instead, the effectiveness of each approach depended heavily on the underlying model.
Here are the most important lessons:
No silver bullet. No in-context learning method consistently improved performance across all models. Teams should expect to tune their strategy depending on the model they use.
Bigger “thinking” models benefited the most. Claude Opus, Gemini Pro, and o3-mini saw gains of up to +10 points in F1 score when paired with the right optimisation strategy.
Smaller, distilled models behaved unpredictably. Models like GPT-4o-mini and Gemini Flash-Lite sometimes performed worse after optimisation. This shows that strategies designed for larger models do not transfer well to smaller ones.
Anthropic models were the most stable. Their performance improved in ways that aligned with intuition, making them easier to work with.
In short, the experiment showed that while in-context learning can make LLM-as-judges more effective, the "how" is very model-dependent. The code and data are available here.
Recommendations for Practitioners
After running this experiment, here are our recommendations for AI engineers and PMs working with LLM-as-judges:
Validate on your own data, not just benchmarks: Academic datasets often don’t reflect real-world complexity. Always test LLM-as-judges on your production-like data to understand how they behave in practice.
Start with review and alignment: LLM-as-judges rarely perform their best out of the box. Always run at least one iteration where human experts review and correct the judge’s outputs to align it with your domain.
Feed feedback back into the judge: Don’t let review data sit idle. Incorporate it as few-shot examples or use it for prompt optimization so the judge continuously improves.
Favor larger models for critical evaluation: Use reasoning-focused models for offline evaluation and CI/CD pipelines where accuracy matters most, while smaller models can be used for faster iteration.
Prioritize reasoning in verdicts: Structure prompts so judges provide not just “pass/fail” labels but reasoning. This makes it easier for teams to debug, trust, and refine the evaluation process.
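The last recommendation, structuring verdicts around reasoning, can be sketched as a prompt template plus a parser. The prompt wording and JSON shape here are illustrative assumptions, not the benchmark's actual prompt.

```python
import json

# Hypothetical judge prompt: ask for reasoning first, then a verdict,
# in a machine-parseable JSON reply.
JUDGE_PROMPT = """You are grading an AI response against expert grading notes.
First explain your reasoning, then give a verdict.
Reply as JSON: {{"reasoning": "...", "verdict": "pass" | "fail"}}

Grading notes: {notes}
Response: {response}"""

def parse_verdict(raw: str) -> tuple[str, bool]:
    """Extract (reasoning, passed) from the judge's JSON reply."""
    data = json.loads(raw)
    return data["reasoning"], data["verdict"] == "pass"

raw = '{"reasoning": "Mentions rate limiting but omits Retry-After.", "verdict": "fail"}'
reasoning, passed = parse_verdict(raw)
print(passed)  # prints False
```

Keeping the reasoning alongside the label means a surprising verdict can be debugged by reading the judge's stated rationale, rather than re-running the evaluation blind.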
Conclusion and Future Work
For teams using LLM-as-judges today, the main takeaway is simple: don’t assume your evaluation method is universally reliable. The effectiveness of prompts and few-shot strategies depends heavily on the model you choose.
Looking ahead, we want to extend this work beyond correctness and into other key metrics like faithfulness and goal success. We also plan to scale the dataset to cover more domains and explore advanced optimisation methods like DSPy and PhaseEvo, which may help smaller models behave more like larger ones.
If you’re looking to implement evaluation in your organisation, we run Ragas office hours to help the community and enterprises evaluate and improve AI systems. Schedule one here or write to us at founders@explodinggradients.com.