World Models for Agents
1/
Yesterday, our internal Paper Club did a deep dive into one of the hottest topics in our industry: world models.
Over the past few months, companies like Simile and World Labs have raised large funding rounds to build world models capable of complex simulations.


2/
This week, we read and discussed APEX-Agents, CoreCraft, and Agent World Model (AWM) to better understand how we can leverage world models to automate environment and task creation.

3/
Mercor's APEX-Agents and Surge's CoreCraft are benchmarks that test how well SOTA models operate in realistic enterprise environments, using human-created long-horizon tasks in fields like investment banking, consulting, corporate law, and enterprise customer support.
Snowflake's AWM takes a different approach. Instead of handcrafting environments and tasks, it scales up environment creation: an LLM is given a high-level scenario, instructed to generate its own tasks from that scenario, and then asked to build the components of a world those tasks require (a database, APIs, reward logic, etc.).
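To make the scenario-to-world flow concrete, here is a toy Python sketch of the idea. It is not AWM's code: every name (`stub_generate_tasks`, `build_environment`, `Task`, `Environment`, the `orders` table, the `refund_order` tool) is hypothetical, and the LLM call is replaced by a stub that returns a canned task so the example runs offline.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    description: str
    # Hypothetical reward check: inspects the database after the agent has acted.
    reward: Callable[[sqlite3.Connection], float]


@dataclass
class Environment:
    scenario: str
    db: sqlite3.Connection
    tools: dict[str, Callable[..., object]]
    tasks: list[Task]


def stub_generate_tasks(scenario: str) -> list[str]:
    """Stand-in for an LLM call (assumption): returns one canned task."""
    return [f"Refund order 1 for a customer of the {scenario}"]


def build_environment(scenario: str) -> Environment:
    # Step 1: scenario -> tasks (a real pipeline would prompt an LLM here).
    task_texts = stub_generate_tasks(scenario)

    # Step 2: tasks -> the database state those tasks need.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "paid"), (2, "paid")])

    # Step 3: an API (tool) the agent can call to change that state.
    def refund_order(order_id: int) -> str:
        db.execute("UPDATE orders SET status = 'refunded' WHERE id = ?", (order_id,))
        return "ok"

    # Step 4: reward logic that verifies the task outcome from the database.
    def order_one_refunded(conn: sqlite3.Connection) -> float:
        row = conn.execute("SELECT status FROM orders WHERE id = 1").fetchone()
        return 1.0 if row and row[0] == "refunded" else 0.0

    tasks = [Task(text, order_one_refunded) for text in task_texts]
    return Environment(scenario, db, {"refund_order": refund_order}, tasks)


env = build_environment("online store support desk")
task = env.tasks[0]
reward_before = task.reward(env.db)  # agent has not acted yet
env.tools["refund_order"](1)         # simulate the agent calling the tool
reward_after = task.reward(env.db)   # reward is recomputed from the new state
```

The point of the sketch is the dependency order: tasks come from the scenario, and the database, tools, and reward checks are all derived from the tasks. That is also why a thin scenario caps how hard the resulting tasks can be.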


4/
A potential limitation of this approach is that if the initial scenario is too simple, the tasks the LLM generates will be similarly simple. Even with high environment diversity, AWM’s tasks may not be complex enough to meaningfully challenge a state-of-the-art model.
To successfully automate the creation of high-quality post-training environments and tasks, we’ll need to combine these approaches: build a world model complex enough to generate tasks as demanding as the human-created ones in APEX and CoreCraft.

5/
In practice, making agents better at long-horizon tasks will depend heavily on how well we can scale the worlds they act in.
Papers and authors:
APEX-Agents: Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, Osvald Nitski
EnterpriseBench CoreCraft: Sushant Mehta, Logan Ritchie, Suhaas Garre, Ian Niebres, Nick Heiner, Edwin Chen
Agent World Model: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He