SAGE: Hierarchical Exploration for Web Agents
1/ A crucial part of our business is building methods that scale task and verifier creation efficiently without losing realism. This week, we focused on SAGE, which proposes scaling web agent training with data and training recipes inspired by curriculum learning.
2/ With SAGE, the system creates an easy task, then increases task difficulty until the agent fails, then regresses to a simpler task before trying again. It’s like finding the Goldilocks zone for agent learning. With this, SAGE hit 80% of human-level performance. Does that seem too good to be true?
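To make the loop above concrete, here is a minimal sketch of a "step up on success, step back on failure" curriculum. It is an illustration under our own assumptions, not the SAGE implementation: `run_curriculum`, `attempt`, the integer difficulty levels, and `toy_agent` are all hypothetical names, and the paper's actual task-composition and scheduling details may differ.

```python
from typing import Callable, List


def run_curriculum(
    attempt: Callable[[int], bool],
    rounds: int,
    min_level: int = 1,
    max_level: int = 10,
) -> List[int]:
    """Return the difficulty level attempted in each round.

    `attempt(level)` is a hypothetical callback that runs the agent on a
    task of the given difficulty and reports success (True) or failure.
    """
    level = min_level
    history: List[int] = []
    for _ in range(rounds):
        history.append(level)
        if attempt(level):
            # Success: make the next task harder, up to the cap.
            level = min(level + 1, max_level)
        else:
            # Failure: fall back to a simpler task before trying again.
            level = max(level - 1, min_level)
    return history


def toy_agent(level: int) -> bool:
    """Stand-in agent (assumption): succeeds only on levels 1 through 3."""
    return level <= 3


if __name__ == "__main__":
    print(run_curriculum(toy_agent, rounds=8))  # [1, 2, 3, 4, 3, 4, 3, 4]
```

With this toy agent, the schedule settles into oscillating around the boundary between levels 3 and 4, which is the "Goldilocks zone" behavior described above.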
3/ We’re honestly still a bit skeptical that LLM-as-a-judge verifiers can check complex state changes as reliably as raw back-end data (especially for production-grade browser agents).
4/ Despite that, the core technique is too sound to ignore. By isolating exactly where a trajectory fails, we can stop throwing away “good enough” data that just needs a specific mid-course correction.
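As a rough illustration of what "isolating where a trajectory fails" could look like, the sketch below finds the first step that a per-step check rejects and keeps the valid prefix for reuse. Everything here is an assumption for illustration: `first_failure`, `salvage_prefix`, the string-valued steps, and the `step_ok` check are hypothetical and are not taken from the SAGE paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_failure(
    steps: Sequence[str], step_ok: Callable[[str], bool]
) -> Optional[int]:
    """Return the index of the first step rejected by `step_ok`, or None."""
    for i, step in enumerate(steps):
        if not step_ok(step):
            return i
    return None


def salvage_prefix(
    steps: Sequence[str], step_ok: Callable[[str], bool]
) -> Tuple[List[str], Optional[int]]:
    """Keep the steps before the first failure and report where it occurred.

    The kept prefix is the "good enough" part; a correction would resume
    from the reported index (hypothetical workflow, not from the paper).
    """
    idx = first_failure(steps, step_ok)
    if idx is None:
        return list(steps), None
    return list(steps[:idx]), idx


if __name__ == "__main__":
    trajectory = ["open_cart", "add_item", "click_wrong_button", "checkout"]
    kept, failed_at = salvage_prefix(trajectory, lambda s: s != "click_wrong_button")
    print(kept, failed_at)  # ['open_cart', 'add_item'] 2
```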
5/ This is a path toward reliable success on long-horizon tasks. If an agent intuitively understands a UI, it can theoretically reverse-engineer a path to a goal that human labelers haven't even mapped themselves.
6/ We plan to keep experimenting with the SAGE task-composition method and curriculum training recipes on some of our environments (benchmark coming soon!). If we can apply these learnings correctly, we can scale our training data 10x without human labeling.
7/ SAGE Paper
Authors: Qianlan Yang, Xiangjun Wang, Danielle Perszyk