CAP-Bench: Cross-Site Browser Agents with Complex Actions and Perception

Abstract

Real web browsing is harder than benchmarks suggest.

Large language models are increasingly deployed as autonomous agents that interact with the web through browsers. While recent progress has been driven by benchmarks evaluating end-to-end task success, these evaluations largely overlook two fundamental sources of difficulty in real web browsing: complex actions over rich user interfaces and visual perception of dynamically rendered content, especially in workflows that span multiple websites.

We introduce CAP, a scalable benchmark for browser agents on cross-site, human-like web tasks that require non-trivial UI interactions and visual understanding. We adopt a decomposition-and-recomposition pipeline that abstracts each website into a structured site card capturing user-facing functions, complex execution operations, and perceptual requirements, then recomposes these components into realistic cross-site workflows.

Built on this framework, we construct 420 tasks across 108 real-world websites in 24 domains under careful quality control. Experiments on state-of-the-art browser agents reveal that perception-heavy interactions remain a major bottleneck — exposing substantial gaps between current agents and real-world web browsing demands.

Cross-Site Workflows

Avg 3 sites per task, with sequential, parallel, fan-out and chain information flow.

Complex Actions

Avg 7 complex actions per task. 58 action types — sliders, drag, drawing, video control.

Challenging Perception

Avg 4 perception challenges per task. 50 types — charts, lazy-load, animating elements.

**Figure 1.** CAP overview: three complexity dimensions (C-A-P). CAP targets three sources of difficulty in real-world web browsing (Cross-site workflows, complex Actions, and challenging Perception) and evaluates agents through a verifiable agent-as-a-judge framework with explicit execution and perception checkpoints.

Method · §3

Construction Pipeline

A four-phase decomposition-and-recomposition pipeline that turns websites into structured site cards and recomposes them into coherent cross-site task instances.

Site Card Annotation

Decompose each website into s_w = ⟨F, A, P⟩ — functions, complex actions, perception items.
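
As an illustration, a site card could be held in a small record type with one field per component of ⟨F, A, P⟩. The class name, field names, and example values below are assumptions made for this sketch, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SiteCard:
    """Hypothetical container for a site card s_w = <F, A, P>."""
    site: str
    functions: Tuple[str, ...] = ()         # F: user-facing functions
    complex_actions: Tuple[str, ...] = ()   # A: complex execution operations
    perception_items: Tuple[str, ...] = ()  # P: perceptual requirements

    def size(self) -> Tuple[int, int, int]:
        # Number of annotated items per component: (|F|, |A|, |P|).
        return (len(self.functions),
                len(self.complex_actions),
                len(self.perception_items))


# Illustrative values only; not taken from the released annotations.
flights_card = SiteCard(
    site="Google Flights",
    functions=("search flights",),
    complex_actions=("slider",),
    perception_items=("price-calendar grid",),
)
print(flights_card.size())
```

Keeping the three components as separate fields makes it straightforward for later phases to draw on functions, actions, and perception items independently.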

Cluster Sampling

Pick a coherent subset of sites from a cluster by combining functional affinity with a preference for site combinations that have rarely co-occurred in earlier tasks.
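
One way to read this step is as a scoring problem: reward subsets whose sites fit together and penalize pairs that have already been used together often. The sketch below is only an assumption about how such a trade-off might look. The scoring formula, the `penalty` weight, the function names, and the placeholder site names and numbers are all hypothetical.

```python
from itertools import combinations


def subset_score(subset, affinity, cooccurrence, penalty=0.5):
    """Mean pairwise affinity minus a penalty on mean historical co-occurrence."""
    pairs = list(combinations(sorted(subset), 2))
    mean_affinity = sum(affinity.get(p, 0.0) for p in pairs) / len(pairs)
    mean_cooccurrence = sum(cooccurrence.get(p, 0) for p in pairs) / len(pairs)
    return mean_affinity - penalty * mean_cooccurrence


def pick_subset(sites, affinity, cooccurrence, k=3, penalty=0.5):
    """Return the highest-scoring k-site subset (exhaustive search, small clusters only)."""
    return max(
        combinations(sorted(sites), k),
        key=lambda s: subset_score(s, affinity, cooccurrence, penalty),
    )


# Placeholder data: keys are alphabetically sorted site pairs.
sites = ["airline", "hotel", "maps", "weather"]
affinity = {
    ("airline", "hotel"): 0.9, ("airline", "maps"): 0.6,
    ("airline", "weather"): 0.6, ("hotel", "maps"): 0.8,
    ("hotel", "weather"): 0.5, ("maps", "weather"): 0.7,
}
cooccurrence = {("airline", "hotel"): 3}  # this pair was already used three times

print(pick_subset(sites, affinity, cooccurrence))
```

With the penalty switched off (`penalty=0`), the search simply returns the most affine subset, which here includes the over-used airline and hotel pair. With the penalty on, that pair is avoided.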

Task Proposal

An LLM proposer composes site functions into a cross-site workflow with a directed information-flow graph.
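
A directed information-flow graph can be sketched as a list of producer-to-consumer edges. The example below uses Python's standard `graphlib` module to derive a valid execution order and to spot fan-out steps. The step names and helper functions are hypothetical and are not drawn from the benchmark.

```python
from graphlib import TopologicalSorter

# Each edge points from the step that produces information
# to the step that consumes it (hypothetical workflow).
edges = [
    ("find_event_date", "search_flights"),
    ("find_event_date", "check_forecast"),
    ("search_flights", "compare_options"),
    ("check_forecast", "compare_options"),
]


def execution_order(edges):
    """Return one ordering in which every step runs after its inputs."""
    predecessors = {}
    for src, dst in edges:
        predecessors.setdefault(src, set())
        predecessors.setdefault(dst, set()).add(src)
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(predecessors).static_order())


def fan_out_nodes(edges):
    """Steps whose output feeds more than one downstream step."""
    targets = {}
    for src, dst in edges:
        targets.setdefault(src, set()).add(dst)
    return sorted(n for n, t in targets.items() if len(t) > 1)


order = execution_order(edges)
print(order)
print(fan_out_nodes(edges))
```

In this toy graph the first step fans out to two parallel branches that are merged by the final step, so the ordering of the two middle steps is interchangeable.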

Task Instantiation

Bind abstract slots to concrete entities and attach covered points for each execution or perception step.
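
Slot binding can be pictured as filling a task template and pairing it with the points the evaluation should cover. The sketch below uses `string.Template` from the standard library. The template wording, the `CoveredPoint` fields, and the bound values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class CoveredPoint:
    """Hypothetical record for one checkable step of a task."""
    step: str
    kind: str   # "execution" or "perception"
    check: str  # what a judge would verify


def instantiate(template, bindings, points):
    # Template.substitute raises KeyError if any slot is left unbound,
    # so a partially specified task cannot slip through.
    text = Template(template).substitute(bindings)
    return {"task": text, "covered_points": list(points)}


task = instantiate(
    "On $site, read the $metric shown on the dashboard chart.",
    {"site": "a CDC dashboard", "metric": "weekly case count"},
    [CoveredPoint("read chart", "perception",
                  "reported value matches the rendered chart")],
)
print(task["task"])
```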

**Figure 2.** Overview of the CAP construction pipeline.

Method · §4

Verifiable Agent-as-a-Judge

Each task is automatically converted into a Python evaluation script that materializes a hierarchical rubric tree of independently checkable execution and perception criteria.

**Figure 3.** The CAP-Eval framework has three stages: (1) prior knowledge and code generation; (2) a tree-structured evaluation framework with action / perception / general leaves; (3) agent-based scoring execution with extraction and URL-grounded verification.

Leaves are typed as Action, Perception, or Other, and may be marked critical. A failed critical child propagates a zero up the tree (gating), whereas non-critical children contribute their averaged score only once all critical prerequisites pass. A judge agent traverses the tree, extracts structured information from the agent's answer, and verifies each claim against live webpages.
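
The gating rule can be sketched as a short recursive scoring function. This is a minimal illustration under stated assumptions, not the released evaluation code: the `Node` class and its fields are hypothetical, a critical child counts as failed whenever its score is below 1.0, and a node whose children are all critical and all pass is assumed to score 1.0.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    kind: str = "Other"             # leaf type: "Action", "Perception", or "Other"
    critical: bool = False
    passed: Optional[bool] = None   # set on leaves by the judge
    children: List["Node"] = field(default_factory=list)


def score(node: Node) -> float:
    if not node.children:
        return 1.0 if node.passed else 0.0
    scored = [(child, score(child)) for child in node.children]
    # Gating: a critical child that is not fully satisfied zeroes this node.
    if any(child.critical and s < 1.0 for child, s in scored):
        return 0.0
    non_critical = [s for child, s in scored if not child.critical]
    if not non_critical:
        return 1.0  # assumption: only critical children, all passed
    # Non-critical children contribute their average.
    return sum(non_critical) / len(non_critical)


def build(critical_ok: bool) -> Node:
    return Node("task", children=[
        Node("opened the right site", critical=True, passed=critical_ok),
        Node("set the slider", kind="Action", passed=True),
        Node("read the chart value", kind="Perception", passed=False),
    ])


print(score(build(critical_ok=True)))   # critical check passes, one of two others passes
print(score(build(critical_ok=False)))  # critical check fails, so the whole node is gated
```

Separating gating from averaging means partial credit is awarded only for attempts that satisfy every prerequisite, which keeps the fine-grained score from rewarding answers built on a wrong starting point.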

Results · §5

Key Findings

Even the best agent reaches only 8.0% Success Rate. Fine-grained scoring isolates where agents fail.

Finding 1

Limited capability overall

Across 7 SOTA browser agents, partial-completion scores cluster around 15–24%. The strongest commercial system (Comet) reaches 48% only by spending an order of magnitude more output tokens.

Finding 2

Perception is the bottleneck

Systems trail humans by ~9 points more on Complex-P than on Complex-A. Visual reasoning over rendered charts, badges, and expansion states is the dominant failure mode — not UI manipulation per se.

Finding 3

Compute alone is not enough

Top performers rely on extensive deliberation (Comet uses 100k+ output tokens; humans take 16 minutes per task), but spending time without using it well does not pay off: Browser-Use + GPT-5 spends 15 min for 15% completion.

Finding 4

Domain-specific UIs are hardest

Sites with specialized visualizations (CDC dashboards, Google Flights' price-calendar grid) score much lower than text-heavy sites like Wikipedia or AccuWeather.

Cite

Citation

If CAP-Bench helps your work, please cite our COLM 2026 paper.

@inproceedings{xu2026cap,
    title     = {CAP: A Scalable Benchmark for Evaluating Cross-Site Browser Agents
                 with Complex Actions and Perception},
    author    = {Xu, Zejun and Chen, Taiyi and Li, Jin and Gu, Yongtong and Cheng, Qi
                 and Lv, Aixuan and Zhu, Shuai and Zhu, Pengfei and Yang, Kaichen
                 and Sun, Boyu and Yang, Yixian and Xie, Mulong and Liu, Xin
                 and Li, Dagang and Ma, Xiaoteng and Wang, Hongru},
    booktitle = {Third Conference on Language Modeling (COLM)},
    year      = {2026},
    url       = {https://openreview.net/forum?id=Q0pFLVle7J}
}
