Paper Club

Verifiers & Trustworthy Evals

One of the big challenges on our roadmap at Vibrant Labs is scaling agent benchmarks while maintaining a strong reward signal.

Good verifiers are the bedrock of usable agent benchmarks.

Bad verifiers can inflate model failure rates, and they often hide actual capability gaps.

This week, in Paper Club, we covered SWE-Bench Verified and OSWorld-Verified, which both focus on this topic.

We also covered ComputerRL and BenchGuard for some of the techniques they proposed.

Released in Aug 2024, SWE-Bench Verified attempts to solve several issues with the original (very popular) SWE-Bench:

Unit tests were sometimes overspecified (e.g., requiring an exact deprecation message string that the agent couldn’t have known)
Issue descriptions were sometimes underspecified (some context lived in PR discussion threads the agent couldn’t see)
Docker reliability issues caused valid solutions to be graded as incorrect

Obviously, none of these are genuine model failures, they’re limitations of the benchmark and its reward setup.

OpenAI fixed this with a huge human annotation campaign. They coalesced 93 devs across ~1,700 samples and had them rate the samples on:

The quality of the issue description and
The validity of the FAIL_TO_PASS unit tests

Each sample was labeled by multiple annotators independently and they ultimately produced 500 verified samples.

While SWE-Bench Verified is deprecated and saturated/contaminated today (models have seen the OSS code and can now recall exact fixes), it has some utility in its core ideas.

Even <2 years later, the well-specified short tasks that relied on human annotators can be handled by LLMs-as-judges, which is how we will approach this type of problem.

OSWorld-Verified revealed some similar findings, but for CUAs instead of coding agents.

The original OSWorld had problems with:

Uncontrollable env drift (real websites change structure, add bot detection, block network access, etc.)
Flaky task setup (some apps need to boot in a specific state)
Task ambiguity (the verifier didn’t account for multiple valid solution paths)

Again, not actually things the model has control over (esp the latter).

In building this updated benchmark (in July 2025), the team at @ XLangLab fixed the verifiers instead of the tasks (b/c changing a task changes the dataset).

They focused on:

Comparing (fuzzy when needed) final states rather than the trajectory/exact UI
Fixing env/setup-related issues that caused false negatives
A variety of changes to address the env drift (proxy support, better IP handling, some simulated pages instead of the real ones)

Nowadays, OSWorld-Verified is starting to get saturated as well.

This is where we plan to provide better (”verified”) versions of existing CUA gyms/datasets, especially for issues that are known and remain hard. More to come on this front.

ComputerRL from the team at Tsinghua and Z.ai combines GUI actions with API calls, and creates an API-GUI action space that can be used as infra for online RL training.

GUIs are meant for humans, and insisting that agents interact with websites purely through the lens of a human-oriented paradigm has its limitations.

This translation device helped train GLM (AutoGLM-OS-9B) to achieve (then SOTA) 48.9% on OSWorld.

In future work that we’re building in the CUA space, we can take lessons from this research by having an agent still solve a task using the UI, but by having the benchmark use the API/structured state as a task/verifier-mining surface.

This will help us create tasks that are hard for the agent to complete, but easy for the env to grade.

BenchGuard from the teams at the Allen School and Phylo Inc is focused on automatically auditing LLM-driven benchmarks. It reaffirms the notion that there are false failures in many benchmarks that are actually the result of a benchmark issue rather than an agent issue.

They use LLMs to audit various inconsistencies (instructional, environmental, gold-solution-based, and evaluation-logic-based) and classify the defects into a variety of categories.

While this makes sense, benchmark repair can lead to overspecifying tasks. If you keep adding context until the verifier passes, you may leak information that the agent should have found out on its own.

In this realm, we’re also exploring atomic fact decomposition. We plan to break every task instruction into individual facts, then classify each one based on whether or not the agent can look it up in the env.

The next generation of agent benchmarks won’t be won by scale alone; they’ll require significant advancements in verifier quality improvement as well.

We plan to create better reward signal derived from some of the techniques listed above (atomic task decomposition, drift-resistant envs, a collapsed GUI-API action space, continuous QA, etc.).

In doing so, we’ll deliver higher-quality post-training datasets for the next step forward in frontier model capability.

We’ll be releasing datasets that reflect these insights soon. Stay tuned.

Papers:

SWE-Bench Verified

Authors: Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, Aleksander Madry

OSWorld-Verified

ComputerRL

Authors: Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, Jie Tang

BenchGuard

Authors: Xinming Tu, Tianze Wang, Yingzhou (Minta) Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi

Vibrant Labs is proudly backed by