Public split · 192 tasks

Leaderboard

Browser agents evaluated by CAP's verifiable agent-as-a-judge framework. Click any column header to sort.

Last updated: —

📨 Submit your results ← Back to overview

Loading…

Reference

Metric Definitions

Partial Completion — mean root-node score across tasks (graded progress).
Success Rate — fraction of tasks scoring 1.0 (end-to-end completion).
Complex-A — mean leaf score over action nodes (UI execution complexity).
Complex-P — mean leaf score over perception nodes (visual reasoning).
Time / Output — average end-to-end wall-clock time and output length per task.

How to Submit

To add a new system to this leaderboard, open a GitHub issue with the leaderboard label and include:

A short system description and the agent / model versions used.
The four CAP metrics (Partial Completion, Success Rate, Complex-A, Complex-P) with standard deviations.
Average wall-clock time and output length per task.
A link to your evaluation logs (preferred) or a brief description of the evaluation setup.

We review and merge submissions on a rolling basis.