Demystifying AI Agent Evaluation: A Comprehensive Guide


This article is adapted from Anthropic's engineering blog post "Demystifying evals for AI Agents," published on January 9, 2026. The original was authored by Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, and Jiri De Jonghe. This translation has been edited and supplemented to help English-speaking readers understand the core principles of building a robust evaluation system for AI agents.

If you've ever developed an AI agent, you know it can be a grueling process.

You tweak a prompt, run a few test cases that seem fine, and then launch—only to get user complaints that the agent "feels dumber." You want to verify whether it’s a real regression or just perception bias, but you find you have no reliable method beyond manually testing a handful of scenarios.

This state of “flying blind” is all too common. At Anthropic, we’ve seen this pattern repeatedly in our work with numerous teams. Early on, intuition and manual testing can carry you surprisingly far. But once your agent enters production and starts to scale, the lack of a systematic evaluation framework inevitably leads to problems.

This article distills Anthropic’s internal practices and customer collaboration experiences into a practical guide for agent evaluation. Having navigated many of these challenges myself, I found their insights highly valuable and am sharing this translation for the broader community.

Core Concepts of Evaluation

Let’s first clarify some fundamental terms.

An evaluation (eval) is, at its core, a test for an AI system: you provide an input, apply grading logic to its output, and measure its performance. This article focuses on automated evaluations—tests that can be run during development without involving real users.

Single-turn evaluations are straightforward: a prompt, a response, and grading logic. This was the primary method for evaluating earlier LLMs. However, agents are different. They operate over multiple turns, calling tools, modifying state, and dynamically adapting based on intermediate results. This multi-step nature makes their evaluation significantly more complex.

[Simple Evaluation vs. Agent Evaluation Diagram]

A simple evaluation follows a "prompt → response → grade" flow. An agent evaluation is far more involved: the agent receives a task and a set of tools, executes a multi-turn loop of "tool calls + reasoning," and its final solution is verified against objective criteria, such as unit tests.
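To make this loop concrete, here is a minimal sketch in Python of what an evaluation harness does per trial. The agent, environment, and grader interfaces (agent.step, environment.execute, grader.score) are hypothetical placeholders for illustration, not any particular framework's API:

# Minimal sketch of an agent-evaluation loop (hypothetical names, not a real framework's API).
def run_trial(task, agent, environment, max_turns=20):
    transcript = []
    observation = task["instructions"]
    for _ in range(max_turns):
        action = agent.step(observation, tools=environment.tools)  # model + harness pick the next step
        transcript.append(action)
        if action.get("type") == "final_answer":
            break
        observation = environment.execute(action)                  # tool call mutates environment state
    outcome = environment.snapshot()                                # the end state is what gets graded
    return {"transcript": transcript, "outcome": outcome}

def grade(task, trial):
    # A single-turn eval grades one response; an agent eval grades the whole trial.
    return {grader.name: grader.score(task, trial) for grader in task["graders"]}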

There’s a fascinating example that illustrates this complexity: Opus 4.5, while tackling a flight-booking task on τ²-bench, discovered a loophole in the policy and provided the user with a better solution than the one specified in the eval. By the letter of the evaluation, it “failed,” but in reality, it demonstrated superior intelligence. This shows that agent evaluations cannot be overly rigid; frontier models may exhibit creativity that exceeds your initial expectations.

To build a structured agent evaluation system, Anthropic defines a clear set of terms. Here are the key ones:

  • Task: A single, independent test case with well-defined inputs and success criteria.

  • Trial: One attempt at completing a task. Due to the inherent randomness in model outputs, multiple trials are usually run for consistent results.

  • Grader: The logic that scores a specific aspect of the agent’s performance. A single task can have multiple graders.

  • Transcript (or Trace/Trajectory): The complete record of a trial, including all tool calls, reasoning steps, intermediate results, and interactions.

  • Outcome: The final state of the environment at the end of the trial. For a flight-booking agent, saying "Your flight is booked" is not enough; the reservation must actually exist in the database.

  • Evaluation Harness: The infrastructure that runs evaluations end-to-end. It provides instructions and tools, runs tasks concurrently, records every step, applies grading, and aggregates results.

  • Agent Harness (or Scaffold): The system that enables a model to function as an agent by processing inputs, orchestrating tool calls, and returning results. When you evaluate an "agent," you are actually evaluating the combined system of the harness and the model.

  • Evaluation Suite: A collection of tasks designed to measure specific capabilities or behaviors (e.g., a customer support suite might test refunds, cancellations, and escalations).
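To make the vocabulary concrete, here is one way these objects could be modeled in code. This is a rough sketch with illustrative field names, not Anthropic's internal schema:

# Illustrative data model for the terms above (field names are made up).
from dataclasses import dataclass, field

@dataclass
class Grader:
    name: str
    kind: str                                               # "code", "model", or "human"

@dataclass
class Task:
    id: str
    instructions: str
    tools: list[str]
    graders: list[Grader]

@dataclass
class Trial:
    task_id: str
    transcript: list[dict] = field(default_factory=list)    # tool calls, reasoning, intermediate results
    outcome: dict = field(default_factory=dict)             # final environment state
    scores: dict[str, float] = field(default_factory=dict)  # one entry per grader

@dataclass
class EvaluationSuite:
    name: str
    tasks: list[Task] = field(default_factory=list)         # e.g. refunds, cancellations, escalations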

Why You Need a Formal Evaluation System

Many teams view evaluation as an overhead that slows down shipping. In the very early stages, this is often true—you can get by with manual testing, internal dogfooding, and intuition.

However, there’s always a breaking point.

The typical scenario unfolds like this: users report that the agent feels worse after an update, and your team is left in the dark. Without an evaluation system, debugging becomes purely reactive: wait for complaints, manually reproduce the issue, fix the bug, and hope you haven't introduced new regressions. You can’t distinguish real performance degradation from noise, automatically test changes against hundreds of scenarios before a release, or quantitatively measure improvements.

The evolution of Claude Code is a perfect case study. It began with rapid iteration based on feedback from Anthropic employees and external users. Later, the team added evaluations—first for narrow areas like conciseness and file editing, and then for more complex behaviors like over-engineering. These evaluations became crucial for identifying issues, guiding improvements, and serving as a bridge between research and product teams.

Other companies have followed similar paths. Descript, which builds a video-editing agent, constructed its evaluation around three core dimensions: don’t break things, do what I asked, and do it well. They evolved from manual grading to using LLM-based graders with rubrics defined by their product team, periodically calibrated with human judgment. Bolt, on the other hand, started building its evaluation system later, after its agent was already in wide use. Within three months, they built a system that grades outputs with static analysis, uses browser-based agents to test applications, and employs LLM judges to assess instruction-following.

There’s also a hidden strategic advantage to having an evaluation system: when a more powerful model is released, teams with a robust eval suite can quickly validate its strengths, tune their prompts, and upgrade within days. Teams without one face weeks of manual, ad-hoc testing.

Once an evaluation system is in place, you gain a wealth of free metrics: latency, token usage, cost per task, and error rates can all be tracked on a static bank of tasks over time. The compounding value of evaluations is easy to overlook because the costs are immediate and visible, while the benefits accumulate gradually.

How to Evaluate Different Types of Agents

Today, the most widely deployed agents fall into four main categories: coding agents, research agents, computer-use agents, and conversational agents. While each has unique aspects, their evaluation shares common techniques.

The Three Types of Graders

Agent evaluations typically combine three types of graders:

  1. Code-Based Graders

    • Methods: String matching (exact, regex, fuzzy), binary tests (fail-to-pass), static analysis (linting, type-checking, security scans), outcome verification, tool call verification, transcript analysis.

    • Strengths: Fast, cheap, objective, reproducible, and easy to debug.

    • Weaknesses: Brittle to valid variations, lacking in nuance, and limited for subjective tasks.

  2. Model-Based Graders

    • Methods: Rubric-based scoring, natural language assertions, pairwise comparison, reference-based evaluation.

    • Strengths: Flexible, scalable, capable of capturing nuance, and handles open-ended tasks well.

    • Weaknesses: Non-deterministic, more expensive, and requires calibration with human graders for accuracy.

  3. Human Graders

    • Methods: Subject-matter expert (SME) review, crowdsourced judgment, spot-check sampling.

    • Strengths: The "gold standard" for quality, aligns with expert user judgment, and is used to calibrate model-based graders.

    • Weaknesses: Expensive, slow, and difficult to scale.

In practice, a combination is almost always used. Anthropic’s advice is clear: use deterministic (code-based) graders wherever possible, supplement with LLM-based graders when necessary, and use human graders primarily for calibration.
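As a rough illustration of what that combination looks like, here is a sketch pairing a deterministic exact-match grader with an LLM rubric grader. The Anthropic SDK call is real, but the rubric prompt, model id, and helper names are illustrative:

# Sketch: combining a deterministic grader with a model-based rubric grader.
import json
import re
import anthropic

def code_grader_exact_match(expected: str, output: str) -> float:
    """Deterministic: fast, cheap, reproducible, but brittle to valid rephrasings."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def model_grader_rubric(transcript_text: str, rubric: str, model: str = "claude-sonnet-4-5") -> float:
    """Model-based: flexible and nuanced, but needs periodic calibration against humans."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model=model,  # substitute whichever current model id you actually use
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Rubric:\n{rubric}\n\nTranscript:\n{transcript_text}\n\n"
                       "Return only JSON like {\"score\": <0 to 1>, \"reason\": \"...\"}.",
        }],
    )
    match = re.search(r"\{.*\}", response.content[0].text, re.DOTALL)
    return json.loads(match.group(0))["score"] if match else 0.0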

Capability Evaluations vs. Regression Evaluations

These are two distinct types of evaluations with different goals.

  • Capability (or "Quality") Evaluations ask, "What can this agent do well?" Their pass rate should start low, targeting tasks the agent currently struggles with, giving the team a clear hill to climb.

  • Regression Evaluations ask, "Does the agent still handle all the tasks it used to?" Their pass rate should be near 100%. A drop in score is a clear signal that something is broken.

It’s critical to run both in parallel. As you improve performance on capability evaluations, regression evaluations ensure you aren't breaking existing functionality elsewhere. Over time, high-performing capability tasks can "graduate" into the regression suite to be monitored continuously.
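A toy sketch of what that "graduation" might look like in code, assuming you track a per-task pass history; the threshold, minimum run count, and field names are illustrative choices, not prescriptions:

# Toy sketch: promote capability tasks into the regression suite once they pass reliably.
GRADUATION_THRESHOLD = 0.95   # pass rate over recent runs
MIN_RUNS = 5

def graduate_tasks(capability_suite: list[dict], regression_suite: list[dict]) -> None:
    for task in list(capability_suite):
        history = task.get("pass_history", [])   # 1.0 / 0.0 per recent run
        if len(history) >= MIN_RUNS and sum(history) / len(history) >= GRADUATION_THRESHOLD:
            capability_suite.remove(task)
            regression_suite.append(task)         # from "hill to climb" to "must not break"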

Evaluating Coding Agents

Coding agents write, test, and debug code much like a human developer. Their evaluation is relatively straightforward because software is objectively verifiable: does the code run, and do the tests pass?

Two leading benchmarks exemplify this approach:

  • SWE-bench Verified provides agents with real GitHub issues from popular Python repositories and grades solutions by running the project’s test suite. A solution only passes if it fixes the failing tests without breaking any existing ones.

  • Terminal-Bench tests end-to-end technical tasks, such as building a Linux kernel from source or training an ML model.

LLM performance on SWE-bench has skyrocketed, jumping from ~40% to over 80% in just one year.

Beyond simple test pass/fail, it’s often useful to grade the transcript. Heuristic-based rules can assess code quality, and LLM-based graders with clear rubrics can evaluate behaviors like tool usage and user interaction.

For example, consider a task where the agent must fix an authentication bypass vulnerability. An evaluation for this might look like this:

task:
  id: "fix-auth-bypass_1"
  desc: "Fix authentication bypass when password field is empty and ..."
  graders:
    - type: deterministic_tests
      required: [test_empty_pw_rejected.py, test_null_pw_rejected.py]
    - type: llm_rubric
      rubric: prompts/code_quality.md
    - type: static_analysis
      commands: [ruff, mypy, bandit]
    - type: state_check
      expect:
        security_logs: {event_type: "auth_blocked"}
    - type: tool_calls
      required:
        - {tool: read_file, params: {path: "src/auth/*"}}
        - {tool: edit_file}
        - {tool: run_tests}
  tracked_metrics:
    - type: transcript
      metrics:
        - n_turns
        - n_toolcalls
        - n_total_tokens
    - type: latency
      metrics:
        - time_to_first_token
        - output_tokens_per_sec
        - time_to_last_token

In practice, coding evaluations are often built on a foundation of unit tests and LLM-based code quality scoring, with additional graders added only as needed.
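As a rough sketch, the deterministic_tests grader in the YAML above could boil down to something like this: run each required test file inside the trial's isolated sandbox (assuming pytest is available there) and pass only if all of them succeed:

# Rough sketch of a deterministic test grader (assumes pytest is installed in the sandbox).
import subprocess

def run_required_tests(sandbox_dir: str, required: list[str], timeout: int = 300) -> float:
    for test_file in required:
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-q"],
            cwd=sandbox_dir,               # isolated per-trial checkout, never shared state
            capture_output=True,
            timeout=timeout,
        )
        if result.returncode != 0:         # any failing required test fails the grader
            return 0.0
    return 1.0

# Example: run_required_tests("/tmp/trial_42", ["test_empty_pw_rejected.py", "test_null_pw_rejected.py"])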

Evaluating Conversational Agents

Conversational agents interact with users in contexts like customer support, sales, or tutoring. Unlike coding agents, the quality of the interaction itself is a key part of the evaluation.

Success for a conversational agent is multi-dimensional: Was the ticket resolved? Was it done within 10 turns? Was the tone appropriate? Benchmarks like τ-bench and τ²-bench are designed this way, using one LLM to simulate a user and another as the agent under test.

A key practical difference from coding evaluations is that the harness itself must include a second LLM to play the user, in addition to the agent under test.

For a customer support task—say, processing a refund for a frustrated customer—the evaluation might be structured as follows:

graders:
  - type: llm_rubric
    rubric: prompts/support_quality.md
    assertions:
      - "Agent showed empathy for customer's frustration"
      - "Resolution was clearly explained"
      - "Agent's response grounded in fetch_policy tool results"
  - type: state_check
    expect:
      tickets: {status: resolved}
      refunds: {status: processed}
  - type: tool_calls
    required:
      - {tool: verify_identity}
      - {tool: process_refund, params: {amount: "<=100"}}
      - {tool: send_confirmation}
  - type: transcript
    max_turns: 10
tracked_metrics:
  - type: transcript
    metrics:
      - n_turns
      - n_toolcalls
      - n_total_tokens
  - type: latency
    metrics:
      - time_to_first_token
      - output_tokens_per_sec
      - time_to_last_token

In practice, conversational agent evaluations rely heavily on model-based graders to assess communication quality and goal achievement, as many tasks can have multiple valid solutions.
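Conceptually, a simulated-user loop looks something like the sketch below. support_agent and simulate_user are hypothetical callables standing in for the agent under test and the user-simulator LLM; they are not any benchmark's real API:

# Conceptual sketch of a simulated-user loop (hypothetical interfaces).
def run_conversation(task, support_agent, simulate_user, max_turns=10):
    transcript = []
    user_message = task["opening_message"]           # e.g. a frustrated customer asking for a refund
    for _ in range(max_turns):
        agent_reply = support_agent(user_message, transcript)
        transcript.append({"user": user_message, "agent": agent_reply})
        if task["goal_reached"](transcript):          # e.g. refund processed and confirmation sent
            break
        # A second LLM plays the user, conditioned on a persona and a hidden goal.
        user_message = simulate_user(task["persona"], transcript)
    return transcript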

Evaluating Research Agents

Research agents gather information, synthesize findings, and produce reports. This is the hardest type to evaluate because "good" is highly subjective. What constitutes "comprehensive," "well-sourced," or even "correct" depends entirely on the context—be it market research, M&A due diligence, or a scientific report.

Benchmarks like BrowseComp are designed with questions that are easy to verify but hard to solve, specifically to test an agent's ability to find needles in a haystack on the open web.

Evaluating research agents requires a blend of checks: factuality (is every claim sourced?), coverage (are all key facts included?), and source quality (are the sources authoritative?). Given the subjectivity, LLM rubrics must be frequently calibrated against human experts.
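As a starting point before investing in calibrated LLM rubrics, even crude programmatic checks can catch gross failures. The sketch below shows two illustrative heuristics, fact coverage and citation density; the interfaces and heuristics are made up, not a recommended grading scheme:

# Sketch: two simple research-report checks. Real setups use LLM rubrics calibrated
# against human experts; these heuristics are illustrative only.
import re

def coverage_score(report: str, key_facts: list[str]) -> float:
    """Fraction of expected key facts that appear (case-insensitively) in the report."""
    found = sum(1 for fact in key_facts if fact.lower() in report.lower())
    return found / len(key_facts) if key_facts else 1.0

def citation_density(report: str) -> float:
    """Fraction of paragraphs containing at least one URL or [n]-style citation."""
    paragraphs = [p for p in report.split("\n\n") if p.strip()]
    cited = sum(1 for p in paragraphs if re.search(r"https?://|\[\d+\]", p))
    return cited / len(paragraphs) if paragraphs else 0.0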

Evaluating Computer-Use Agents

Computer-use agents interact with software through screenshots, mouse clicks, keyboard input, and scrolling, just like a human. Their evaluation must happen in a real or sandboxed environment where they use actual applications, and the final outcome is checked for correctness.

For instance, WebArena is a benchmark for browser-based tasks that verifies navigation via URL and page state, and confirms backend state changes for data-modification tasks (e.g., ensuring an order was truly placed, not just that a confirmation page appeared). OSWorld extends this concept to full operating system control.
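In practice, "confirm backend state changes" can be as simple as querying the application's database instead of trusting the confirmation page. The sketch below assumes a SQLite-backed test environment; the table and column names are made up:

# Sketch: verify the outcome in the backend, not just the UI.
import sqlite3

def order_was_placed(db_path: str, customer_id: str, product_id: str) -> bool:
    with sqlite3.connect(db_path) as conn:
        row = conn.execute(
            "SELECT status FROM orders WHERE customer_id = ? AND product_id = ?",
            (customer_id, product_id),
        ).fetchone()
    return row is not None and row[0] == "confirmed"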

A key trade-off for browser agents is between DOM interaction (fast but token-heavy) and screenshot interaction (slow but token-efficient). Specialized evaluations, like those for Claude for Chrome, check whether the agent selects the right tool for the right job to optimize for speed and accuracy.

Handling Non-Determinism

Agent behavior can vary between runs, making evaluation results harder to interpret. The same task might pass on one run and fail on the next.

Two key metrics help capture this nuance:

  • pass@k: The probability that at least one out of k attempts is successful. As k increases, the score increases. pass@1 (first-try success) is often the most important metric for coding.

  • pass^k: The probability that all k attempts are successful. As k increases, the score decreases. This is critical for user-facing agents where reliability on every single try is expected.

[pass@k vs. pass^k Diagram]

At k=1, the metrics are identical. By k=10, pass@k approaches 100%, while pass^k can plummet towards 0%. The choice between them depends entirely on your product requirements.
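Both metrics can be estimated from n trials of the same task, c of which passed. The sketch below uses the standard unbiased estimator for pass@k and a simple plug-in estimate for pass^k (assuming independent trials); the function names are mine, not from any standard library:

# Estimating pass@k and pass^k from n trials with c successes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts succeeds (unbiased estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Probability that all k attempts succeed (plug-in estimate, assumes independence)."""
    return (c / n) ** k

# Example with 10 trials, 3 passed:
# pass_at_k(10, 3, 1) == 0.3 == pass_power_k(10, 3, 1)        # identical at k=1
# pass_at_k(10, 3, 5) ≈ 0.92, pass_power_k(10, 3, 5) ≈ 0.002  # they diverge fast as k grows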

A Practical Roadmap: From 0 to 1

This section outlines Anthropic’s practical advice, which I’ve found immensely useful.

1. Collect Tasks Early
Don’t wait for perfection. Many teams believe they need hundreds of tasks to start, but 20-50 simple tasks extracted from real failures are sufficient early on. Your biggest gains will come from the first few dozen tasks. The longer you wait, the harder it becomes to define success criteria retroactively from a live system.

Start with what you’re already manually testing: pre-release validation steps, common user scenarios, and issues from your bug tracker or support tickets. Prioritize by user impact.

2. Design Clear, Unambiguous Tasks
A good task is one where two domain experts would independently agree on a pass/fail verdict. Ambiguity in the task becomes noise in your metrics. Every task should be solvable by an agent that correctly follows its instructions, and everything the grader checks for must be explicitly stated in the task description. If a frontier model has a 0% pass rate even after 100 attempts (pass@100 = 0%), the problem is likely the task, not the agent. Always include a reference solution to prove the task is solvable and the grader is configured correctly.
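One way to enforce this is a small admission check: grade the reference solution before a task enters the suite, and reject the task if it doesn't score perfectly. The grader interface below is a hypothetical placeholder:

# Sketch: sanity-check a new task by grading its reference solution.
# If the reference solution doesn't score 1.0, the task or grader is broken, not the agent.
def validate_task(task, graders, reference_solution) -> list[str]:
    problems = []
    for grader in graders:
        score = grader.score(task, reference_solution)   # hypothetical grader interface
        if score < 1.0:
            problems.append(f"{grader.name}: reference solution scored {score:.2f}")
    return problems   # an empty list means the task is admissible

# Usage idea: run validate_task() in CI whenever someone adds a task via pull request.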

3. Build a Balanced Problem Set
Test both what the agent should do and what it should not do. An imbalance can lead to skewed behavior. For example, if you only test cases where the agent should perform a Web Search, you might end up with an agent that searches for everything. Anthropic learned this the hard way with its Claude.ai web search evaluation, spending several iterations to find the right balance between under-triggering and over-triggering.
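A balanced suite can be as simple as keeping "should act" and "should not act" cases side by side and reporting the two failure modes separately. The task contents below are illustrative:

# Sketch: balanced positive/negative cases for a web-search trigger evaluation.
SEARCH_TASKS = [
    {"prompt": "What changed in the EU AI Act this month?", "should_search": True},   # fresh information
    {"prompt": "Explain what a binary search tree is.", "should_search": False},      # stable knowledge
    {"prompt": "Who won yesterday's match?", "should_search": True},
    {"prompt": "Rewrite this paragraph in a friendlier tone.", "should_search": False},
]

def trigger_rates(results: list[tuple[dict, bool]]) -> dict[str, float]:
    """results: (task, did_search) pairs. Report under- and over-triggering separately."""
    under = sum(1 for task, did in results if task["should_search"] and not did)
    over = sum(1 for task, did in results if not task["should_search"] and did)
    n = len(results) or 1
    return {"under_trigger_rate": under / n, "over_trigger_rate": over / n}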

4. Design Robust Graders

  • Stable, Isolated Environments: The agent’s environment in evals should mirror production, and each trial must start from a clean state. Shared state (like leftover files or caches) introduces noise. Anthropic once found anomalously high scores because the agent was peeking at the git history from a previous trial—a classic environment isolation failure.

  • Grade Outcomes, Not Paths: It’s tempting to check if the agent followed a very specific sequence of steps. This is too rigid. Agents often find clever, unforeseen paths to a correct solution. Focus your evaluation on the final outcome, not the journey.

  • Use Partial Credit: For multi-part tasks, implement partial scoring. An agent that correctly identifies a customer’s issue and verifies their identity but fails to process the refund is clearly better than one that fails completely. Capturing this continuum of success is vital (a minimal scoring sketch follows this list).

  • Beware of Evaluator Bugs: Opus 4.5 initially scored only 42% on CORE-Bench. The issue wasn't the model—it was the grader, which expected a floating-point number to match 96.124991... exactly and failed on 96.12. Other issues included ambiguous task specs and non-reproducible random tasks. After fixing the grader, the score jumped to 95%. Always scrutinize your tasks and graders for bugs and potential loopholes.
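Here is a minimal sketch of checkpoint-based partial credit for the refund example above; the checkpoint names and weights are illustrative:

# Sketch: partial credit for a multi-part support task (weights are illustrative).
CHECKPOINTS = {
    "identified_issue": 0.3,
    "verified_identity": 0.3,
    "processed_refund": 0.4,
}

def partial_score(passed: set[str]) -> float:
    """Sum the weights of the checkpoints the agent actually reached."""
    return sum(weight for name, weight in CHECKPOINTS.items() if name in passed)

# partial_score({"identified_issue", "verified_identity"})  # ≈ 0.6
# partial_score(set())                                       # 0.0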

5. Maintain Your Evaluation Suite Long-Term

  • Read the Transcripts: This is non-negotiable. Until you’ve reviewed many trial transcripts and their corresponding grades, you can’t be sure your graders are working correctly. A failed task could mean the agent erred, or it could mean your grader rejected a valid, creative solution.

  • Monitor for Saturation: An evaluation suite that’s at 100% pass rate can only track regressions, not provide signals for improvement. For example, SWE-Bench scores have climbed from ~30% to over 80% and are nearing saturation. Qodo initially thought Opus 4.5 was mediocre, but later realized their own evaluation suite simply wasn’t hard enough to capture its gains on complex tasks.

  • Enable Broad Contribution: An evaluation suite is a living tool that requires ongoing attention. Anthropic advocates for "evaluation-driven development": define the desired capability by writing the evaluation before the agent has that skill, then iterate until it passes. The people closest to the product and users—product managers, customer success, and even sales—are often best positioned to define what success looks like. At Anthropic, these stakeholders can contribute new evaluation tasks directly via pull requests using Claude Code.

Evaluations Are Not a Panacea

Automated evaluations allow you to run thousands of tests without impacting users, but they are just one piece of the puzzle. A complete picture of your agent’s performance also includes:

  • Production Monitoring: To catch distribution shifts and unexpected failures post-launch.

  • A/B Testing: To validate major changes once you have sufficient user traffic.

  • User Feedback & Manual Trajectory Review: To continuously fill in the gaps.

  • Systematic Human Studies: To calibrate LLM graders and evaluate subjective outputs.

[Swiss Cheese Model Diagram]

Think of it like the Swiss Cheese model of safety engineering: no single layer catches all problems, but multiple layers in combination create a robust defense.

Final Thoughts

Teams without evaluations are stuck in a reactive cycle—fixing one problem only to create another, unable to distinguish signal from noise. Teams with evaluations experience the opposite: every failure becomes a new test case, every test case prevents future regressions, and metrics replace guesswork.

Anthropic’s core principles for effective agent evaluation are:

  • Start early; don't wait for perfection.

  • Source tasks from real-world failures.

  • Define unambiguous success criteria.

  • Combine multiple types of graders.

  • Ensure your tasks are sufficiently challenging.

  • Continuously iterate to improve your signal-to-noise ratio.

  • Always read the transcripts.

If you don’t want to build your evaluation infrastructure from scratch, consider these frameworks:

  • Harbor: Designed for containerized environments and large-scale cross-cloud trials.

  • Promptfoo: A lightweight, open-source option with YAML configuration that Anthropic itself uses.

  • Braintrust: An all-in-one platform for offline evaluation, production observability, and experiment tracking.

  • LangSmith: Tightly integrated with the LangChain ecosystem.

  • Langfuse: An open-source, self-hosted solution for teams with data residency requirements.

Remember, a framework can accelerate your start, but the ultimate quality of your evaluation depends entirely on the quality of your test cases and graders. Pick a framework quickly and focus your energy on iterating on high-quality evaluations. The field of AI Agent evaluation is still young and rapidly evolving, so your methods will need to adapt continuously to your specific context.

