Public split · 192 tasks

Leaderboard

Browser agents evaluated by CAP's verifiable agent-as-a-judge framework. Click any column header to sort.

Reference

Metric Definitions

How to Submit

To add a new system to this leaderboard, open a GitHub issue with the leaderboard label and include:

  1. A short system description and the agent / model versions used.
  2. The four CAP metrics (Partial Completion, Success Rate, Complex-A, Complex-P) with standard deviations.
  3. Average wall-clock time and output length per task.
  4. A link to your evaluation logs (preferred) or a brief description of the evaluation setup.
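
As a rough illustration of item 2, the sketch below aggregates per-run scores into a mean and standard deviation for each of the four CAP metrics. It is not an official CAP tool. The field names, the `summarize_runs` helper, and the choice of the sample standard deviation are assumptions made for this example, and the numbers are placeholders rather than real results.

```python
import json
from statistics import mean, stdev

# Hypothetical keys for the four CAP metrics; the leaderboard does not
# prescribe these names.
METRICS = ("partial_completion", "success_rate", "complex_a", "complex_p")


def summarize_runs(runs):
    """Return {metric: {"mean": ..., "std": ...}} across repeated runs.

    Uses the sample standard deviation (n - 1 denominator); this is an
    assumption, since the submission instructions do not specify which
    estimator to report.
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to report a standard deviation")
    summary = {}
    for name in METRICS:
        values = [run[name] for run in runs]
        summary[name] = {
            "mean": round(mean(values), 2),
            "std": round(stdev(values), 2),
        }
    return summary


# Placeholder scores for three repeated runs (not real results).
example_runs = [
    {"partial_completion": 60.0, "success_rate": 40.0, "complex_a": 30.0, "complex_p": 20.0},
    {"partial_completion": 62.0, "success_rate": 42.0, "complex_a": 32.0, "complex_p": 22.0},
    {"partial_completion": 64.0, "success_rate": 44.0, "complex_a": 34.0, "complex_p": 24.0},
]

summary = summarize_runs(example_runs)
print(json.dumps(summary, indent=2))
```

The printed JSON can be pasted into the issue body alongside the system description. If you report a different estimator (for example, the population standard deviation), say so in the submission.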

We review and merge submissions on a rolling basis.