Paper Club

World Models: High-Fidelity Enterprise Environments

Last week, we did an internal deep dive into enterprise environments and benchmarks like τ²-Bench and CoreCraft. This type of high-fidelity RL env is becoming more popular as frontier labs push their models toward increasingly agentic capabilities.

Right now, however, the largest bottleneck is human data operations.

For most of these envs, creating the world, the synthetic data inside it, the tasks, and the verifiers needed for an RL run has so far required time-intensive human labor.

While this has resulted in high-quality data, the human bottleneck inherently limits how far we can scale up the number of envs and tasks available to train a model.

Work like Snowflake’s AWM, AutoEnv, and Prime Intellect’s τ²-synth has explored using LLMs to generate envs, tasks, and evaluation logic from some basic context. So far, these methods seem to be limited: the environments they produce are simpler than human-designed ones like τ²-Bench and CoreCraft.

At Vibrant Labs, our goal is to recreate something as complex and high-fidelity as CoreCraft, but completely synthetically.

When we can successfully recreate the complexity of envs like CoreCraft without needing extensive human input, we’ll unlock the next stage of model improvement.
