Methodology · April 16, 2026 · 8 min read

How to Evaluate Your AI — From Prompts to Agents

Most teams shipping AI products have the same dirty secret: they don't actually know if their AI works.

A 2026 LangChain survey found that 32% of organizations cite quality as their #1 barrier to deploying AI agents. A RAND study found 80-90% of AI agent projects fail in production. And yet the most common evaluation strategy remains what practitioners call “vibes testing” — try a few inputs, looks good, ship it.

We've spent months building BeamEval and testing hundreds of real production prompts. Here's the evaluation methodology we've converged on, and why the standard approaches fall short.

The Problem With LLM-as-Judge

The default approach in the industry: ask GPT-4 to score your AI's output 0-100. Nearly every eval tool (DeepEval, RAGAS, LangSmith) relies on this.

The problem: it doesn't work reliably.

We tested the same prompt, same model, same judge — scores varied by ±15 points across runs. A prompt that clearly failed safety tests would score 85/100 because the judge was “being generous.” LLM judges hallucinate their evaluations just like the models they're judging.

Anthropic's engineering team has acknowledged this directly — they recommend combining LLM judges with deterministic checks and human review, not relying on any single method.

Our Approach: Per-Dimension Adversarial Testing

Instead of asking one LLM to score everything at once, we break evaluation into 6 independent dimensions, each with its own specialized test generation:

1. Hallucination

Does your AI invent facts? We generate tests that tempt fabrication: asking for statistics not in the prompt, leading questions with false premises, requests for fake case studies.

2. Instruction Following

Does your AI obey its rules under pressure? We test scope boundaries, off-topic requests, and urgent framing. We also include normal in-scope requests — an eval that only tests failures misses whether the AI actually works.

3. Refusal Accuracy

The dimension most tools ignore. It's not just “does it refuse bad requests?” — it's also “does it answer legitimate requests it shouldn't refuse?” Over-refusal is as damaging as under-refusal.

4. Output Consistency

Ask the same question three different ways. Does the AI give the same information? Inconsistency erodes user trust faster than any other failure mode.
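One way to sketch this kind of check, under assumptions: the paraphrases and responses below are hypothetical, and the toy normalizer only handles casing and whitespace (a real comparison of semantic equivalence would need an LLM judge or embeddings):

```python
def normalize(answer: str) -> str:
    """Toy normalizer: lowercase and collapse whitespace before comparing."""
    return " ".join(answer.lower().split())

# Hypothetical: three phrasings of the same underlying question.
paraphrases = [
    "What is your refund window?",
    "How long do I have to return an item?",
    "Can I get my money back after two weeks?",
]

# Hypothetical responses; a real run would call the model under test once per paraphrase.
responses = [
    "Refunds within 30 days.",
    "refunds  within 30 days.",
    "Refunds within 30 days.",
]

# Consistent if every normalized answer collapses to a single value.
consistent = len({normalize(r) for r in responses}) == 1
```

This only catches surface-level drift; the point is that the pass/fail decision stays deterministic once the comparison is defined.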

5. Safety

Prompt injection, social engineering, PII leakage, jailbreaks. We craft attacks specific to your system — if your AI handles customer data, we'll try to extract it.

6. Format Compliance

Does the AI respect output structure requirements? Downstream systems parsing AI output break silently when format drifts.

Why Separate Calls Matter

Each dimension gets its own LLM call to generate test cases. A single “generate 30 tests” call produces generic, surface-level tests. Six specialized calls, each with deep guidance, produce tests that target the specific constraints in your prompt.

If your prompt has a $500 refund limit, the hallucination expert won't test for it — but the instruction following expert will test $499, $500, and $501.
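As a minimal sketch of what boundary-value tests around that hypothetical $500 limit might look like (the helper and test-case shape below are illustrative, not BeamEval's actual format):

```python
def boundary_tests(limit: int) -> list[dict]:
    """Generate tests just below, at, and just above a numeric constraint.

    Illustrative only: assumes the prompt allows refunds up to and
    including `limit`, so limit + 1 should be refused.
    """
    return [
        {"input": f"Please refund ${limit - 1}", "expect_approved": True},
        {"input": f"Please refund ${limit}",     "expect_approved": True},
        {"input": f"Please refund ${limit + 1}", "expect_approved": False},
    ]

tests = boundary_tests(500)  # probes $499, $500, and $501
```

The off-by-one cases are where instruction-following failures cluster, which is why the specialized call targets them explicitly.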

Deterministic Scoring

After running tests against the real model (actual API calls, not simulated), each response is individually judged pass/fail. The dimension score is:

Score = (passed tests / total tests) × 100

5 tests per dimension, so scores can be 0, 20, 40, 60, 80, or 100. No subjective “72.3” — either the test passed or it didn't. The LLM never produces the final score. It only generates test cases and judges individual pass/fail decisions.
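The scoring step is simple enough to sketch directly (the function name is an assumption; the arithmetic is exactly the formula above):

```python
def dimension_score(results: list[bool]) -> int:
    """Pass/fail judgments -> deterministic 0-100 score.

    The LLM only produces the per-test booleans; it never sets this number.
    """
    return round(100 * sum(results) / len(results))

# With 5 tests per dimension, scores land on 0, 20, 40, 60, 80, or 100.
score = dimension_score([True, True, True, False, True])  # 4 of 5 passed -> 80
```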

What Changes When You Move to Agents

Everything above works for single-turn prompt evaluation. But AI agents introduce fundamentally different challenges:

Cascading failures. If an agent achieves 85% accuracy per action, a 10-step workflow only succeeds about 20% of the time (0.85¹⁰ ≈ 0.20). One wrong tool selection early in the chain corrupts everything downstream.
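The arithmetic behind that compounding is worth making explicit:

```python
# Per-step success probabilities compound multiplicatively across a workflow.
per_step_accuracy = 0.85
steps = 10

workflow_success = per_step_accuracy ** steps  # roughly 0.197, i.e. ~20%
```

Even a seemingly strong per-action accuracy collapses quickly as chains get longer, which is why per-step evaluation alone is misleading for agents.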

Non-determinism. Two identical requests can produce different tool-call sequences while both arriving at correct answers. Traditional pass/fail testing can't distinguish between an efficient solution and one that got lucky.

Ghost actions. Agents sometimes claim to have executed tools they never called. The agent confidently reports success built on fabricated intermediate results.

The industry is converging on a three-layer evaluation model for agents:

1. Reasoning layer: Is the plan logical? Does the agent pick the right tools?

2. Action layer: Are tool calls correct? Are arguments valid?

3. Outcome layer: Was the task actually completed? How efficiently?
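A minimal sketch of how a per-run result might be structured under this model (the class and field names are assumptions, not an existing API):

```python
from dataclasses import dataclass

@dataclass
class AgentRunEval:
    """Illustrative three-layer verdict for a single agent run."""
    reasoning_ok: bool  # was the plan logical, were the right tools chosen?
    actions_ok: bool    # were tool calls actually made, with valid arguments?
    outcome_ok: bool    # was the task genuinely completed?

    def passed(self) -> bool:
        # A run passes only if every layer holds. A correct-looking outcome
        # built on fabricated tool calls ("ghost actions") still fails.
        return self.reasoning_ok and self.actions_ok and self.outcome_ok

run = AgentRunEval(reasoning_ok=True, actions_ok=False, outcome_ok=True)
```

Scoring each layer independently is what lets the evaluation distinguish a lucky outcome from a sound process.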

We're building toward this at BeamEval — starting with the strongest foundation in prompt-level evaluation and expanding into multi-step agent testing.

Getting Started

The most practical advice from Anthropic's engineering team: start with 20-50 test cases sourced from actual user failures, not synthetic perfection. Don't try to build a comprehensive benchmark. Start with the failures you've already seen, then expand.

Or just paste your prompt at beameval.com and we'll generate the tests for you. 60 seconds, free.