Reference
Metric Definitions
- Partial Completion — mean root-node score across tasks (graded progress).
- Success Rate — fraction of tasks scoring 1.0 (end-to-end completion).
- Complex-A — mean leaf score over action nodes (UI execution complexity).
- Complex-P — mean leaf score over perception nodes (visual reasoning).
- Time / Output — average end-to-end wall-clock time and output length per task.
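As a rough illustration of how the four score-based definitions above fit together, here is a minimal Python sketch. It assumes each task is graded as a tree of nodes with scores in [0, 1], that leaves are tagged as action or perception nodes, and that leaf scores are pooled across all tasks before averaging. The `Node` class, the `kind` labels, and the `cap_metrics` function are hypothetical names chosen for this example, not the benchmark's actual evaluation code.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Node:
    """One node of a hypothetical per-task score tree."""
    score: float                 # graded score in [0, 1]
    kind: str = "group"          # assumed labels: "action", "perception", "group"
    children: list = field(default_factory=list)


def leaves(node):
    """Yield every leaf (childless node) under `node`."""
    if not node.children:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)


def cap_metrics(roots):
    """Compute the four score-based metrics from a list of per-task root nodes."""
    root_scores = [r.score for r in roots]
    all_leaves = [leaf for r in roots for leaf in leaves(r)]
    action = [l.score for l in all_leaves if l.kind == "action"]
    perception = [l.score for l in all_leaves if l.kind == "perception"]
    return {
        # Mean root-node score across tasks.
        "partial_completion": mean(root_scores),
        # Fraction of tasks whose root score is exactly 1.0.
        "success_rate": sum(s == 1.0 for s in root_scores) / len(root_scores),
        # Mean leaf score over action / perception nodes (pooled over tasks).
        "complex_a": mean(action) if action else float("nan"),
        "complex_p": mean(perception) if perception else float("nan"),
    }


# Two made-up tasks: one fully solved, one half solved.
tasks = [
    Node(1.0, children=[Node(1.0, "action"), Node(1.0, "perception")]),
    Node(0.5, children=[Node(0.0, "action"), Node(1.0, "perception")]),
]
metrics = cap_metrics(tasks)
print(metrics)
```

If the benchmark instead averages leaf scores per task before averaging across tasks, the two Complex-* lines would need to change accordingly.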
How to Submit
To add a new system to this leaderboard, open a GitHub issue with the leaderboard label and include:
- A short system description and the agent / model versions used.
- The four CAP metrics (Partial Completion, Success Rate, Complex-A, Complex-P) with standard deviations.
- Average wall-clock time and output length per task.
- A link to your evaluation logs (preferred) or a brief description of the evaluation setup.
We review and merge submissions on a rolling basis.
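One possible way to prepare the metric values with standard deviations for a submission is sketched below. It assumes the standard deviation is taken across repeated evaluation runs (the page does not specify this), and the `summarize` and `format_row` helpers and the numbers in `runs` are invented purely for illustration.

```python
from statistics import mean, stdev


def summarize(runs):
    """Map each metric name to (mean, sample std) across runs.

    `runs` is a list of dicts, one per evaluation run, all with the same keys.
    At least two runs are needed for a sample standard deviation.
    """
    names = list(runs[0].keys())
    return {
        name: (mean([r[name] for r in runs]), stdev([r[name] for r in runs]))
        for name in names
    }


def format_row(summary):
    """Render the summary as 'metric: mean ± std' entries."""
    return ", ".join(
        f"{name}: {m:.3f} ± {s:.3f}" for name, (m, s) in summary.items()
    )


# Made-up values from two hypothetical runs.
runs = [
    {"success_rate": 0.4, "partial_completion": 0.70},
    {"success_rate": 0.6, "partial_completion": 0.80},
]
summary = summarize(runs)
print(format_row(summary))
```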