Research · Oct 7, 2025

Climbing the Hills That Matter

Exploring the challenges with current evaluation methods and proposing a new approach grounded in production data.

We cannot improve on what we cannot measure. Most teams aren't measuring what matters.

Evals have become the steering wheel that determines whether agent systems improve in the ways we want. In many cases, optimization is now less challenging than finding and measuring the signals that surface the behaviors of useful and high-quality agents. If AI engineers have reliable metrics to track, they will find ways to make those numbers move up and to the right.

However, despite the potential of evals, it doesn't feel like AI teams are deriving proper value from them yet. What explains the gap between expectation and reality for these important but nebulous metrics?

What are evals?

Evals are notoriously difficult to discuss because the umbrella of use cases and techniques the term covers is increasingly broad. For the purposes of this blog, we subscribe to Hamel Husain's definition of evals as the 'systematic measurement of application quality'.

Evals infra has a diverse ecosystem of tooling for measurement (autograders, verifiers), data collection (logging, benchmarks), and more. At Judgment Labs, we focus on Agent Behavior Monitoring (ABM) tooling, enabling teams to judge and alert on agent behavior in production. We primarily deal with evals in relation to how we reliably measure agent behavior from production data.

What is wrong with evals?

Despite their potential, we observe two issues with how evals are implemented in practice:

1. Poor generalization of measurement methods.

Many teams and eval providers rely on out-of-the-box LLM judges: a single prompt (e.g., for 'correctness' or 'tone') applied universally. This plug-and-play approach is attractive, but it fails because agent quality is not uniform across use cases. What counts as 'correct' or 'clear and concise' varies significantly by domain (e.g., a legal brief versus a code review).

For example, in Code Review, an LLM judge checking code correctness might be instructed to examine 'whether the code samples produce the same outputs, experience the same errors on edge cases, and have the same time/space complexity.' In Legal Briefings, a judge measuring legal correctness would instead look for 'accurate citation of precedent, faithful representation of case details, and alignment with jurisdiction-specific standards.'
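To make this concrete, here is a minimal sketch of how those two sets of criteria might be encoded as separate judge specifications rather than one generic prompt. The `JudgeSpec` structure and `build_judge_prompt` helper are illustrative assumptions, not any particular eval library's API.

```python
# Sketch: domain-specific LLM-judge criteria instead of one generic "correctness" prompt.
# The criteria strings paraphrase the code-review and legal examples above; the prompt
# builder is a stand-in for however your stack renders judge prompts.

from dataclasses import dataclass

@dataclass
class JudgeSpec:
    domain: str
    criteria: list[str]

CODE_REVIEW_JUDGE = JudgeSpec(
    domain="code review",
    criteria=[
        "Do the code samples produce the same outputs?",
        "Do they experience the same errors on edge cases?",
        "Do they have the same time/space complexity?",
    ],
)

LEGAL_BRIEF_JUDGE = JudgeSpec(
    domain="legal briefing",
    criteria=[
        "Are precedents cited accurately?",
        "Are case details represented faithfully?",
        "Does the analysis align with jurisdiction-specific standards?",
    ],
)

def build_judge_prompt(spec: JudgeSpec, candidate: str, reference: str) -> str:
    """Render a judge prompt that grades one output against domain-specific criteria."""
    checks = "\n".join(f"- {c}" for c in spec.criteria)
    return (
        f"You are grading {spec.domain} output.\n"
        f"Score the candidate 1-5 against each criterion and explain your reasoning.\n"
        f"Criteria:\n{checks}\n\nReference:\n{reference}\n\nCandidate:\n{candidate}\n"
    )
```

The point of the structure is that the two judges share nothing but the scaffolding: the criteria themselves are owned by the team that knows the domain.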

This specificity extends even within a single domain, like legal, where boundaries for factuality or taste depend on the task and the type of law (e.g., 'European v. American') or case (e.g., 'putative class action v. product liability action').

A one-size-fits-all judge often measures poorly, and its scores quietly point you at the wrong hill. Customizing exactly which agent behaviors you check for, and how you measure them, is key, even if you want to measure the same metric as another company.

All you get from using these prefab evals is you don't know what they actually do and in the best case they waste your time and in the worst case they create an illusion of confidence that is unjustified.
Hamel Husain


2. Evaluation methods and datasets are being grounded in vibes instead of production environments.

Even domain-specific evaluations often rely on handcrafted rubrics fed into LLM judges. These human-written rubrics are attractive, but they frequently encode biased notions of quality that do not reflect what users actually value in production, and they quickly go stale as production data shifts, becoming irrelevant to user needs.

The best teams continuously examine their own data to discover what truly matters in their agent's setting. Scaled analysis of real-world examples and user interactions helps shape use-case-specific rubrics that reflect the quality indicators that matter most.

However, many teams lack methods for selecting and maintaining their eval sets and metrics to reflect usage drift across product releases. Evaluation processes often revolve around a one-time selection of examples and metrics that may remain unchanged for months. The result is blind spots in deployment: as the distribution of user interactions shifts, coverage of behaviors inevitably drifts and regresses.

Engineers need to keep the eval dataset up to date with the distribution of actual user behavior.... Having a good system in place to constantly sample the right test cases from real-world past usage is one of the most critical new pieces of work required today to build great AI products.
Gian Segato, Anthropic
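As one concrete illustration of that advice, here is a minimal sketch of refreshing an eval set by stratified sampling of recent production trajectories. The record shape (a `task_type` field) and the per-task quota are assumptions about how such data might be organized, not a prescribed schema.

```python
# Sketch: refresh an eval set from recent production traffic so coverage tracks
# actual usage rather than a months-old snapshot. Field names are illustrative.

import random
from collections import defaultdict

def refresh_eval_set(recent_trajectories: list[dict],
                     per_task_quota: int = 25,
                     seed: int = 0) -> list[dict]:
    """Stratified sample keyed by task type, so every currently-common task
    keeps coverage even as the traffic mix shifts between releases."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for traj in recent_trajectories:
        by_task[traj.get("task_type", "unknown")].append(traj)

    eval_set = []
    for task_type, trajs in by_task.items():
        k = min(per_task_quota, len(trajs))
        eval_set.extend(rng.sample(trajs, k))
    return eval_set
```

Running something like this on a rolling window of production data keeps the test cases pointed at what users are actually doing today.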

These mistakes lead to slower iteration cycles, false confidence in product quality, and ultimately, user churn. The key lesson: success with users in production lives in the messy details of human interaction and domain- and context-specific nuance, and the evals that capture it must be continuously updated. Agent behavior monitoring (ABM) is one way to address these problems.

How do we fix this?

Defining success metrics, LLM-judge systems, and rubrics in advance assumes prior knowledge of what to optimize. For AI products, that knowledge cannot be gained through armchair pontification. You need to track and measure what's happening with your agents and end users in production.

The source of truth is what you capture in production. High-signal behaviors are how users react, what they correct, what they prefer. Every interaction is feedback. Every correction is a vote. Weaponizing these signals to improve agent behavior is the clearest path to building robust AI agent products.

User preferences in relation to agent behavior are king in evals. As discussed in Building AI Products in the Probabilistic Era, evaluation and production performance do not live in separate lanes. They collapse into a single system where agent behavior shapes the entire funnel. This means updating measurements based on observed user behavior in production rather than inventing evals in advance.

Interaction data is the map that shows where users go, whether they succeed, and what they value. Cursor notably put this into practice in their recent work on capturing user preferences for their tab-completion model, improving code acceptance rates by 28%.

This paradigm shift requires robust infrastructure. It must capture dynamic, messy end-user preferences, the only signal that matters, and then transform those preferences into scalable evals by surfacing the criteria users implicitly apply and converting them into stable, interpretable metrics of agent behavior.

Towards an Agent Behavior Monitoring (ABM) Layer

At heart, an ABM system tracks and mines user preferences at scale, then converts them into custom evaluation systems that give development teams signals to alert on and improve from.

If the interaction data embeds the evaluation, the practical question arises: how do we harness the data? Raw production logs/trajectories are noisy and overwhelming. To turn them into a compass for improvement, we need a systematic way of distilling user behavior into interpretable signals. Our pursuit of an agent behavior monitoring (ABM) layer consists of four infrastructure blocks.

1. Trajectory Capture

All ABM work begins by collecting permissioned logs of how users interact with agents. These logs, or 'trajectories,' include reasoning, actions, environment responses, context, and outcomes. They form a raw record of agent and human behavior, highlighting successes and friction points.
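For illustration, a captured trajectory record might look something like the sketch below. The schema is a hypothetical example that mirrors the fields named above; it is not Judgment Labs' actual format.

```python
# Sketch: one possible trajectory record, mirroring the fields named above
# (reasoning, actions, environment responses, context, outcome). Illustrative only.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    reasoning: str                 # the agent's reasoning for this step
    action: dict[str, Any]         # tool call or message emitted
    environment_response: Any      # tool result or environment observation

@dataclass
class Trajectory:
    trajectory_id: str
    user_id: str                   # permissioned, pseudonymized identifier
    context: dict[str, Any]        # task inputs, prompt version, product surface
    steps: list[Step] = field(default_factory=list)
    outcome: str = "unknown"       # e.g. "accepted", "edited", "abandoned"
    user_feedback: dict[str, Any] = field(default_factory=dict)
```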

2. Bucketing and analysis

Interactions are grouped into 'buckets' that align with real-world scenarios to reveal significant patterns. Key metadata includes task type, agent behavior issues, tool-use statistics, user satisfaction, and end-state outcome, which collectively provide a framework for analysis. The goal is to identify cohorts of similar trajectories and understand where models perform consistently and where they fail.
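A minimal sketch of this bucketing step, assuming trajectory records have been flattened to dicts carrying `task_type` and `outcome` fields (an assumption for illustration):

```python
# Sketch: bucket trajectory records into cohorts by task type and end-state outcome,
# then rank where unsatisfying endings pile up. Keys and outcome labels are illustrative.

from collections import Counter, defaultdict

BAD_OUTCOMES = {"abandoned", "error", "rejected"}

def bucket_trajectories(trajectories: list[dict]) -> dict[tuple[str, str], list[dict]]:
    buckets = defaultdict(list)
    for traj in trajectories:
        key = (traj.get("task_type", "unknown"), traj.get("outcome", "unknown"))
        buckets[key].append(traj)
    return buckets

def failure_hotspots(buckets: dict[tuple[str, str], list[dict]]) -> list[tuple[tuple[str, str], int]]:
    """Cohorts with the most bad end states, largest first."""
    counts = Counter({key: len(trajs) for key, trajs in buckets.items()
                      if key[1] in BAD_OUTCOMES})
    return counts.most_common()
```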

3. Preference mining and rubric discovery

User preferences and feedback from the grouped trajectories are leveraged to define behavior criteria. Implicit judgments are uncovered through approvals, edits, retries, and pairwise comparisons. Analyzing these contrasts with LLM-driven analysis at scale surfaces candidate dimensions of behavioral quality, which are then refined into small, stable rubrics with operational definitions that teams can validate and align on.
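Below is a minimal sketch of the mining step: pairing accepted responses against retried or heavily edited ones for the same task, producing contrast pairs that LLM-driven analysis could later distill into candidate rubric dimensions. The signal names, `task_id` field, and pairing rule are assumptions for illustration.

```python
# Sketch: mine implicit preference pairs from user signals. An accepted response is
# treated as preferred over a retried or heavily edited one for the same task.
# Signal names and the pairing rule are assumptions, not a fixed methodology.

from itertools import combinations

APPROVAL_SIGNALS = {"accepted", "thumbs_up"}
REJECTION_SIGNALS = {"retried", "rejected", "heavily_edited"}

def mine_preference_pairs(trajectories: list[dict]) -> list[dict]:
    """Build (preferred, dispreferred) pairs among trajectories for the same task."""
    pairs = []
    for a, b in combinations(trajectories, 2):
        if a.get("task_id") != b.get("task_id"):
            continue
        sa, sb = a.get("user_signal"), b.get("user_signal")
        if sa in APPROVAL_SIGNALS and sb in REJECTION_SIGNALS:
            pairs.append({"preferred": a, "dispreferred": b})
        elif sb in APPROVAL_SIGNALS and sa in REJECTION_SIGNALS:
            pairs.append({"preferred": b, "dispreferred": a})
    return pairs

# Downstream, each pair can be fed to a prompt along the lines of:
#   "Contrast the preferred and dispreferred responses. What quality dimension
#    explains the user's choice?"
# and the proposed dimensions clustered into a small, stable rubric for team review.
```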

4. Scores and reward

Once the behavioral dimensions are established, judge models can score agent trajectories, with a focus on transparency. Well-orchestrated scorers with embedded logic can operate online and identify regressions in real time. If a judge model proves reliable, it can be integrated into a reward model for post-training optimization workflows like reinforcement learning or supervised fine-tuning. This involves combining rewards and trajectories before feeding them into post-training platforms such as Fireworks, OpenAI, or Thinking Machines' Tinker library, enabling agents to improve based on user preferences in production.
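A minimal sketch of the scoring-to-reward step is below. `call_judge_model` stands in for whatever judge endpoint or library you actually use, and the rubric items, aggregation, and regression threshold are illustrative assumptions rather than a recommended configuration.

```python
# Sketch: score trajectories against a discovered rubric and emit (trajectory, reward)
# records for a post-training pipeline. `call_judge_model` is a placeholder for the
# judge you actually run; the rubric and aggregation are illustrative.

from statistics import mean
from typing import Callable

RUBRIC = [
    "Faithfully follows the user's latest instruction",
    "Uses tools only when needed and reads their output correctly",
    "Ends in a state the user accepts without rework",
]

def score_trajectory(trajectory: dict,
                     call_judge_model: Callable[[str, dict], float]) -> dict[str, float]:
    """Per-criterion scores in [0, 1]; one judge call per rubric item keeps results transparent."""
    return {criterion: call_judge_model(criterion, trajectory) for criterion in RUBRIC}

def to_reward_record(trajectory: dict,
                     scores: dict[str, float],
                     regression_threshold: float = 0.5) -> dict:
    reward = mean(scores.values())
    return {
        "trajectory": trajectory,
        "scores": scores,
        "reward": reward,
        "flag_regression": reward < regression_threshold,  # candidate for online alerting
    }
```

The same per-criterion scores that feed the reward can double as the online signal for regression alerts, which is what keeps the monitoring and optimization loops in sync.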

Each of these steps requires customization to the nuances of different agents and end user categories. A proper agent behavior monitoring layer is tailored to the agent action and user preference data that runs through it.

What happens when we get this right?

Proper ABM grounds evaluations in the nitty-gritty nuances of production data. This leads to continuous improvements in product quality and higher ROI. Teams move away from generic benchmarks or handcrafted, potentially biased criteria towards specialized evals rooted in the unique contexts and preferences of their own customers. Customized evals are a tool to convert distribution advantages into product advantages, turning usage into a compounding source of product superiority. Evals actually become your moat.

This creates a new kind of flywheel: the post-building flywheel. Just as post-training makes a model useful, post-building makes an agent reliable. Post-training uses data to refine a model's skills; post-building uses data to evaluate, monitor, and optimize assembled agents so those skills are applied consistently, safely, and effectively in practice.

The signals are already in your data. Find the right hill.

Written by Andrew Li and Alex Shan

Thank you to James Alcorn, Dakota McKenzie, and the Judgment Labs team for reviewing and debating ideas with us.