If you're evaluating LLM outputs in 2026, you're almost certainly using LLM-as-judge. Every major framework defaults to it — DeepEval's G-Eval, RAGAS, Braintrust's scorers, LangSmith's evaluators. The idea is simple: ask a stronger model to grade a weaker model's output.
It's also unreliable in ways that are easy to miss.
The Reproducibility Problem
We ran an experiment: same system prompt, same model, same test cases, same judge model. We ran the evaluation 10 times.
The overall scores ranged from 62 to 89.
Same inputs, same outputs, wildly different scores. The judge LLM was in a different “mood” each time — sometimes generous, sometimes harsh, sometimes fixated on formatting issues while ignoring hallucination.
This isn't a bug in any specific tool. It's inherent to using a probabilistic system as a measurement instrument. LLMs don't have stable preferences — they have token probability distributions that shift with every generation.
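You can measure this yourself by running the same evaluation repeatedly and looking at the spread. A minimal sketch, using illustrative scores in the range we observed (your numbers will differ):

```python
import statistics

# Scores from running the *same* evaluation 10 times.
# Illustrative values in the 62-89 range we observed.
runs = [62, 74, 89, 68, 81, 77, 65, 85, 70, 79]

spread = max(runs) - min(runs)
print(f"min={min(runs)} max={max(runs)} spread={spread} "
      f"stdev={statistics.stdev(runs):.1f}")
```

If the spread rivals the effect size of the prompt change you're trying to detect, the judge is noise, not signal.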
The Inflation Problem
LLM judges consistently over-score. We found:
- A prompt that leaked its system instructions on direct request scored 78/100 on “safety”
- A prompt that fabricated a company policy scored 71/100 on “hallucination”
- A prompt that answered when it should have refused scored 82/100 on “refusal accuracy”
Why? The judge sees a well-written, professional response and defaults to scoring it highly. The failure is subtle — the output sounds right but is factually wrong or behaviorally inappropriate. LLMs are optimized to produce plausible text, and they're just as capable of producing a plausible evaluation of a plausible-sounding failure.
The Batch Problem
Most frameworks send all test results to the judge in a single call: “Here are 20 test cases, score each one.” This creates:
- Anchoring bias — the first few judgments influence the rest
- Fatigue effects — quality of judgment degrades for later items
- Lost context — the judge can't deeply analyze each test when processing 20 simultaneously
- Token limit pressure — long contexts force the judge to be superficial
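The structural difference is easy to see in the prompts themselves. A sketch contrasting the two styles (prompt wording is illustrative, not any framework's actual template):

```python
outputs = ["output A", "output B", "output C"]  # model outputs under test

# Batch style: one call, all items in shared context.
# Invites anchoring, fatigue, and superficial judgments.
batch_prompt = "Score each of these outputs 0-100:\n" + "\n".join(
    f"{i + 1}. {o}" for i, o in enumerate(outputs)
)

# Per-item style: one focused call per output, no shared context
# for earlier judgments to anchor on.
item_prompts = [
    f"Score this single output 0-100. Output:\n{o}" for o in outputs
]
```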
What We Do Differently
We redesigned the evaluation pipeline to minimize LLM subjectivity:
Step 1: Per-dimension test generation
6 parallel LLM calls, each specialized. The hallucination expert tests for fabricated statistics and false citations. The safety expert tests prompt injection and social engineering. A single generic call produces surface-level tests; specialized calls target your prompt's specific constraints.
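One way to express that specialization is a per-dimension prompt template. A sketch, where "hallucination" and "safety" framing come from the description above and the third dimension is an assumed example:

```python
# Each dimension gets its own generation instructions. The first two
# reflect the experts described above; "refusal" is an assumed example.
DIMENSIONS = {
    "hallucination": "Write tests that bait fabricated statistics or false citations.",
    "safety": "Write tests that attempt prompt injection and social engineering.",
    "refusal": "Write tests where the correct behavior is to decline.",
}

def generation_prompt(dimension: str, system_prompt: str) -> str:
    """Build the specialized prompt for one dimension's test generator."""
    return (
        f"You are a {dimension} testing expert.\n"
        f"{DIMENSIONS[dimension]}\n"
        f"Target system prompt:\n{system_prompt}"
    )

prompts = [generation_prompt(d, "You are a support bot.") for d in DIMENSIONS]
```

Each of these prompts goes out as its own LLM call, so every generator works against the target prompt with a single, narrow mandate.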
Step 2: Real model execution
30 parallel API calls against your actual model. Not simulated, not approximated. If you selected GPT-4o, your test cases hit the OpenAI API. This catches model-specific behaviors that simulation misses.
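The fan-out itself is plain thread-pool concurrency. A sketch, with `call_model` as a hypothetical wrapper around your provider's API (stubbed here for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def call_model(test_case: str) -> str:
    """Hypothetical wrapper around the real provider API
    (e.g. an OpenAI chat completion). Stubbed for illustration."""
    return f"response to: {test_case}"

test_cases = [f"test {i}" for i in range(30)]

# Fan the 30 test cases out concurrently against the model under test.
# pool.map preserves input order, so responses line up with test_cases.
with ThreadPoolExecutor(max_workers=10) as pool:
    responses = list(pool.map(call_model, test_cases))

print(len(responses))  # 30
```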
Step 3: Individual judging
Each test case gets its own dedicated judge call. One test, one judgment, one pass/fail decision. No anchoring, no fatigue, no context pressure. 30 separate calls, batched in groups of 5.
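A sketch of that batching, with `judge_one` as a hypothetical stand-in for the single-test judge call:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_one(pair: tuple[str, str]) -> bool:
    """Hypothetical single-test judge: one test, one response, one
    pass/fail. In practice this is its own LLM request whose context
    contains only this test."""
    test, response = pair
    return "FAIL" not in response  # stand-in for the judge's verdict

pairs = [(f"test {i}", f"response {i}") for i in range(30)]

verdicts: list[bool] = []
# Groups of 5 run concurrently, but every test still gets its own
# isolated judge call -- no item ever sees another item's judgment.
for start in range(0, len(pairs), 5):
    group = pairs[start:start + 5]
    with ThreadPoolExecutor(max_workers=5) as pool:
        verdicts.extend(pool.map(judge_one, group))
```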
Step 4: Deterministic scoring
overall_score = average(all_dimension_scores)
With 5 tests per dimension, possible scores are 0, 20, 40, 60, 80, or 100. No floating-point subjectivity. The LLM never determines the score — it only judges individual pass/fail.
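The scoring rule above is small enough to write out in full. A sketch, assuming 5 tests per dimension as described:

```python
def dimension_score(verdicts: list[bool]) -> int:
    """5 pass/fail verdicts -> one of 0, 20, 40, 60, 80, or 100."""
    return round(100 * sum(verdicts) / len(verdicts))

def overall_score(per_dimension: dict[str, list[bool]]) -> float:
    """Deterministic overall score: plain average of dimension scores.
    No LLM in this code path -- same verdicts in, same score out."""
    scores = [dimension_score(v) for v in per_dimension.values()]
    return sum(scores) / len(scores)

results = {
    "hallucination": [True, True, False, True, True],   # 80
    "safety":        [True, False, False, True, True],  # 60
}
print(overall_score(results))  # -> 70.0
```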
Step 5: Findings generation
The LLM receives the already-computed scores and specific failed tests. Its job: explain patterns and suggest prompt edits. It never produces the score — it explains it.
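A sketch of how that explainer prompt might be assembled (the wording and function name are illustrative, not the actual template):

```python
def findings_prompt(scores: dict[str, int], failed_tests: list[str]) -> str:
    """Assemble the explainer prompt. Scores arrive already computed
    upstream, so the LLM can only explain them, never change them."""
    lines = ["Scores (already final, do not re-score):"]
    lines += [f"- {dim}: {score}" for dim, score in scores.items()]
    lines.append("Failed tests:")
    lines += [f"- {t}" for t in failed_tests]
    lines.append("Explain the failure patterns and suggest prompt edits.")
    return "\n".join(lines)

prompt = findings_prompt(
    {"safety": 60}, ["leaked system prompt on direct request"]
)
```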
The Tradeoff
This approach uses significantly more API calls than a single LLM-as-judge invocation — dozens per evaluation instead of one.
But the scores are reproducible. Run the same eval twice and the scores are identical (assuming the model under test behaves consistently). That's not something any single-call LLM-as-judge approach can guarantee.
When LLM-as-Judge Is Fine
This methodology is designed for rigorous evaluation. If you need a quick directional signal — “is this prompt roughly okay?” — a single LLM-as-judge call is adequate.
For production systems where quality matters, where scores need to be comparable across runs, where you need to know if a prompt change made things better or worse — deterministic scoring from individual pass/fail judgments is the more reliable approach.
Try it at beameval.com. Free, 60 seconds, 30 tests across 6 dimensions.