Engineering

We compared Auto Browser to the best browser agents of 2026. Here's what we're shipping.

A five-lens audit of Auto Browser against Surfer 2, Skyvern 2.0, Alumnium, and Manus, plus the four architectural upgrades we're shipping in response.

Last updated April 20, 2026

Auto Browser team · April 2026 · 8-minute read

Auto Browser in action on webmcp-sports.com. The basketball-shopping transcript discussed later in this post is from a run like this one.

Browser-agent research has been moving faster than almost any other area of applied AI. Three months ago, the leading systems on WebVoyager were in the high 80s. Today, Surfer 2 reports 97.1%, Alumnium reports 98.5% across 610 tasks at roughly $5 of API cost, and Skyvern 2.0 reports 85.85%. Meanwhile, most consumer browser-agent extensions (the ones actually shipping to users in Chrome side panels and browser add-ons) don’t publish benchmarks and rarely audit themselves against the published state of the art.

We decided to do it properly.

This post walks through what we learned auditing Auto Browser against the four most rigorously documented 2026 browser agents, the five-axis framework we developed to structure that audit, the specific architectural gaps we found in our own system, and what we’re shipping in response over the coming weeks.


The four systems we studied

We focused on systems with enough public architectural detail to reason about line-by-line, not just marketing pages.

Surfer 2 (Andreux et al. 2025) decomposes the agent into four named roles. An Orchestrator breaks the user goal into verifiable subtasks and is skipped entirely for simple tasks. A Navigator runs a ReAct loop and “perceives the environment purely via screenshots.” A Validator examines the Navigator’s execution trace “to assess task completion before allowing termination”, so post-action verification is structural, not a prompt contract. A Localizer (Holo1.5, 7B or 72B) grounds textual element descriptions to pixel coordinates. The combination hits 97.1% on WebVoyager, 69.6% on WebArena, 60.1% on OSWorld, and 87.1% on AndroidWorld.

Skyvern 2.0 runs a three-role Planner, Actor, Validator loop. The Planner maintains “a working memory of things it had completed and things that were still waiting to be finished.” The Validator “confirmed whether or not the original goals the ‘Planner’ generates are successfully completed or not” and routes failures back to the Planner for a replan. The model stack is GPT-4o plus GPT-4o-mini, and it reports 85.85% on WebVoyager.

Alumnium is the most architecturally radical of the four. The outer agent (Claude Code with Sonnet 4.6) sees only three tools: do(), get(), and check(). It never sees the DOM. An inner subagent (Alumnium MCP with GPT-5 Nano) does all browser-level work. Their centerpiece claim, which we’ll return to later: “Claude Code doesn’t need to know anything about page structure. It only sees the goal it set and a plain-text summary of what changed.” They report 98.5% on 610 WebVoyager tasks for approximately $5 total API cost.

Manus publishes the most practically useful guide to agent context discipline. Four patterns recur in their context-engineering essay. First, todo-list recitation at the end of context, which “pushes the global plan into the model’s recent attention span.” Second, append-only context to preserve KV-cache hits; they note that “cached input tokens cost 0.30 USD/MTok, while uncached ones cost 3 USD/MTok”, a 10x difference. Third, masking tool logits rather than removing tools from the schema. Fourth, the file system as external memory.
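The first two Manus patterns fit in a few lines. The sketch below is ours, not Manus's code, and the `buildTurnContext` name and message shape are illustrative: history is append-only (so the KV-cache prefix stays valid), and the current todo list is re-stated at the end of the context each turn so the plan sits in the model's recent attention span.

```typescript
// Sketch of Manus-style recitation over an append-only context.
// All names here are illustrative, not Manus's actual API.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

function buildTurnContext(history: Message[], todo: string[]): Message[] {
  // Never mutate or reorder history: any edit to the prefix invalidates
  // the KV-cache hits that make cached tokens ~10x cheaper.
  const recitation: Message = {
    role: "user",
    content: "Current plan:\n" + todo.map((t, i) => `${i + 1}. ${t}`).join("\n"),
  };
  // The recitation is appended last, so the global plan lands in the
  // model's most recent attention span on every turn.
  return [...history, recitation];
}

const history: Message[] = [{ role: "user", content: "Buy the cheapest basketball." }];
const ctx = buildTurnContext(history, ["search for basketballs", "sort by price", "check out"]);
```

The point is that recitation is a context-construction rule, not a prompt instruction: the framework guarantees the plan is last, rather than asking the model to remember it.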

Three different architectural families, four different sets of trade-offs. The shared invariants are what we wanted to measure ourselves against.


The five lenses

Before we could compare usefully, we needed a structured frame. Architectural comparisons between agent systems routinely collapse into “my framework is better than yours” because there’s no orthogonal axis-set to check against. We developed one.

  • Loop anatomy. What runs in what order during one decision turn? What state is persisted, what is ephemeral?
  • Reasoning surface. Where is the model allowed to think? Where is thinking produced by the model but silently dropped by the framework? Does thinking re-feed into the next turn, or is it write-once to display?
  • Verification surface. After a mutating action, what does the loop do to confirm it worked? Is verification structurally enforced, or is it English-language policy in a system prompt that the model is free to ignore?
  • Recovery surface. When a tool errors or the system detects oscillation or no-progress, what happens? Is the user told? Is the model forced to re-plan, or does it quietly give up?
  • Display / UX coupling. How does internal state map to visible UI elements? Where does emergent reasoning become visible, and where is it hidden?

The axes are orthogonal, so a finding on one shouldn’t collapse into another, and the frame is transferable. We’re publishing it here because we think the consumer browser-agent space needs better shared vocabulary than “faster” and “smarter”.


What we found

Auto Browser is a Chrome MV3 side-panel extension with four LLM backends (OpenRouter, Chrome Built-in AI / Gemini Nano, Gemini, local OpenAI-compat) and a deliberate set of declared invariants. Those invariants hold up well under the audit. The architecture around them has specific, fixable gaps.

What’s working

Per-tab isolation is solid. Every tab keeps its own session state keyed strictly by tab, never by URL. That means the agent can execute multi-step workflows across URLs within a tab without losing context. Incognito tabs skip persistence entirely. Closing a tab garbage-collects its record. A tab replace migrates it.
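The lifecycle rules above map directly onto Chrome's tab events. A minimal sketch, with a hypothetical `TabSessions` store standing in for our real implementation:

```typescript
// Hypothetical per-tab session store: keyed strictly by tab id, never by URL.
type Session = { steps: string[]; incognito: boolean };

class TabSessions {
  private byTab = new Map<number, Session>();

  get(tabId: number, incognito = false): Session {
    let s = this.byTab.get(tabId);
    if (!s) {
      s = { steps: [], incognito };
      this.byTab.set(tabId, s);
    }
    return s;
  }

  // Wired to chrome.tabs.onRemoved: closing a tab garbage-collects its record.
  onTabClosed(tabId: number): void {
    this.byTab.delete(tabId);
  }

  // Wired to chrome.tabs.onReplaced: a tab replace migrates the record
  // to the new tab id, so the session survives prerender swaps.
  onTabReplaced(addedTabId: number, removedTabId: number): void {
    const s = this.byTab.get(removedTabId);
    if (s) {
      this.byTab.set(addedTabId, s);
      this.byTab.delete(removedTabId);
    }
  }
}
```

Because the key is the tab id and not the URL, a multi-step workflow that navigates across several URLs in one tab keeps a single continuous session.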

Origin-drift gate is security-conscious. Every mutating action captures the origin at user-approval time and re-validates against the live tab URL at dispatch. A redirect between “the user said yes to shop.example.com” and “the click dispatches” cannot re-target the action to a different origin. Boring, load-bearing safety plumbing.
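The gate reduces to a capture-and-recheck pair. A minimal sketch, with illustrative names (`approve`, `dispatchAllowed`) rather than Auto Browser's actual internals:

```typescript
// Illustrative origin-drift gate: capture the origin at user-approval time,
// re-validate against the live tab URL immediately before dispatch.
type ApprovedAction = { description: string; approvedOrigin: string };

function approve(description: string, tabUrl: string): ApprovedAction {
  // Snapshot the origin the user actually said yes to.
  return { description, approvedOrigin: new URL(tabUrl).origin };
}

function dispatchAllowed(action: ApprovedAction, liveTabUrl: string): boolean {
  // If a redirect changed the origin between approval and dispatch,
  // the action is blocked rather than re-targeted.
  return new URL(liveTabUrl).origin === action.approvedOrigin;
}
```

Same-path navigation within the approved origin still dispatches; any cross-origin drift fails closed.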

WebMCP-first tool routing is forward-looking. The agent treats page-authored WebMCP tools as primary and falls back to its built-in toolkit. When pages author their own tools for agents (a direction the ecosystem is clearly heading), Auto Browser is already set up to consume them correctly, with a name-collision protocol and a safety-critical built-in set that pages can’t override.
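The collision protocol can be stated as a resolution order. The sketch below is illustrative (the tool names and the `resolveTool` function are ours, not the shipped routing table): safety-critical names always resolve to built-ins; everything else prefers the page-authored tool and falls back to the built-in.

```typescript
// Illustrative WebMCP-first tool resolution with a name-collision protocol.
type Tool = { name: string; source: "page" | "builtin" };

// Hypothetical safety-critical set: pages can never shadow these names.
const SAFETY_CRITICAL = new Set(["approve_purchase", "navigate"]);

function resolveTool(name: string, pageTools: Tool[], builtins: Tool[]): Tool | undefined {
  // Safety-critical names always resolve to the built-in implementation.
  if (SAFETY_CRITICAL.has(name)) {
    return builtins.find((t) => t.name === name);
  }
  // Otherwise page-authored WebMCP tools win the collision; built-ins are the fallback.
  return pageTools.find((t) => t.name === name) ?? builtins.find((t) => t.name === name);
}
```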

Emergent reasoning is already there. The audit’s most interesting finding came from a real user transcript. On a shopping task (“help me buy the cheapest basketball”), the agent went through this sequence: search, read results, identify cheapest, add to cart, checkout, confirm order, fail, take snapshot, take screenshot, realise the modal actually said “LAUNCH CONFIRMED” (the confirm step had returned a false negative), re-issue the confirm to dismiss the success modal, deliver the final answer. That’s a textbook recover-analyze-reinterpret sequence, running inside our current flat schema. The model is already doing the right thing. Our framework just isn’t giving it the structure to make that visible.

Where we fall short

Four concrete architectural gaps. We’re sharing the list publicly because we think transparency about architecture beats marketing, and because we’re shipping the fixes.

Reasoning channel coverage is uneven across providers. Gemini’s thought stream surfaces correctly in our UI. OpenRouter’s does not. When users run Anthropic’s thinking models or OpenAI’s o-series through OpenRouter, those models emit reasoning tokens on a separate SSE channel (delta.reasoning and delta.reasoning_details) that our adapter never reads. The reasoning from the most capable models is silently dropped at the transport layer before the UI ever sees it.
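Concretely, an OpenRouter streaming delta can carry reasoning tokens alongside content, and an adapter that only reads `content` drops the rest on the floor. A minimal sketch of reading both channels (the `splitDelta` helper is hypothetical; the `delta.reasoning` and `delta.reasoning_details` field names are OpenRouter's):

```typescript
// One streamed SSE chunk's delta, as OpenRouter emits it for thinking models.
type StreamDelta = {
  content?: string;
  reasoning?: string;
  reasoning_details?: { type: string; text?: string }[];
};

// Hypothetical adapter helper: surface both channels instead of
// silently discarding the reasoning stream at the transport layer.
function splitDelta(chunk: { choices: { delta: StreamDelta }[] }) {
  const delta: StreamDelta = chunk.choices[0]?.delta ?? {};
  return {
    answerText: delta.content ?? "",
    reasoningText: delta.reasoning ?? "",
  };
}
```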

Post-action verification is policy, not structure. Our system prompt asks the model to verify submit-shaped actions in English. Nothing in the control flow enforces it. The model can click “Submit”, declare the task complete, and end the turn without ever reading the page back, and the loop will accept it. Surfer 2 and Skyvern 2.0 both make this a structural gate. We don’t.

Recovery failures are silent. Auto Browser detects repeated-identical-action loops and, after three strikes, forces the agent to stop. But the banner logic for that stop-reason collapses into the same branch as ordinary “I’m done, watching for changes.” The user sees the agent go quiet and doesn’t know whether it finished or gave up. Separately, the no-progress detector was originally built for positional state and silently does nothing on generic web pages, so a stuck “Submit” button that keeps failing the same way three times in a row never triggers the detector at all.
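For reference, the repeated-identical-action check itself is simple; what's missing is the distinct exit reason, not the detection. A sketch of the three-strikes check (illustrative, with actions fingerprinted as strings):

```typescript
// Three-strikes oscillation check: fires when the same (tool, arguments)
// fingerprint is emitted three times in a row. The hard part is not this
// predicate but routing its positive result to a distinct, user-visible
// stop reason instead of the generic "I'm done" branch.
function isRepeatedActionLoop(actionLog: string[], strikes = 3): boolean {
  if (actionLog.length < strikes) return false;
  const tail = actionLog.slice(-strikes);
  return tail.every((a) => a === tail[0]);
}
```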

Uniform display hides emergent structure. Every step in our chat log renders as a prose-line plus tool-card. An analysis step, a verification step, and a recovery step all look structurally identical in the UI. The basketball trace above (a recovery from a tool-layer false negative) reads to a user as indistinguishable from ordinary forward progress. The agent’s thinking is there. The framework just doesn’t expose its shape.


What we’re shipping

Four phases of work over the next several weeks. We wrote a detailed internal spec; the summary is below. Each phase is independently mergeable, and each leaves the agent in a working state.

Phase 1: visible reasoning and phase labels. Every step in the chat log gets a phase label (Plan, Analyze, Act, Verify, or Recover) with an icon column that rotates dot, spinner, check/x as the step progresses. The model’s thought field is lifted off the short cap we previously applied, and streams live under each step, collapsible by default. OpenRouter’s delta.reasoning and delta.reasoning_details channels get wired into our streaming pipeline, so Anthropic-thinking and o-series users finally see what their model is reasoning about in real time. This is the biggest user-visible improvement per line of code we’ve touched in a year.

Phase 2: enforced post-action verification. Submit-shaped actions (clicks on buttons named Submit/Buy/Confirm/etc., fills that end with Enter, any navigation) trigger a two-tier verifier. The first tier is deterministic: URL change, title change, network idle, or an expected-text match, any of which short-circuits to a “committed” verdict with no LLM call. Only if the deterministic pre-check is ambiguous does a bounded LLM verify call run, scoped to the intent, pre/post page state, and any console errors. Verdicts surface under each tool card. No more silent false successes.
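The deterministic first tier can be sketched in a few lines. This is a simplified illustration of the design, not the shipped verifier (we've omitted the network-idle signal and the bounded LLM escalation, which is stubbed as a verdict here):

```typescript
// Tier-1 deterministic verification: any strong commit signal short-circuits
// with no LLM call; only ambiguity escalates to the bounded LLM check.
type PageState = { url: string; title: string; text: string };
type Verdict = "committed" | "needs_llm_check";

function deterministicVerify(before: PageState, after: PageState, expectedText?: string): Verdict {
  if (after.url !== before.url) return "committed";      // navigation happened
  if (after.title !== before.title) return "committed";  // page identity changed
  if (expectedText && after.text.includes(expectedText)) return "committed"; // e.g. "Order confirmed"
  return "needs_llm_check"; // ambiguous: run the scoped LLM verify call
}
```

The ordering matters for cost: the common case (a submit that navigates) resolves with zero extra tokens.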

Phase 3: named recovery. Three new exit reasons (oscillation, verify_failed, recovery_bailed) come with explicit user-visible banners that tell you what happened. Before any of these fires terminally, the agent gets one shot to propose a one-to-three-step replan to break the loop, or to surrender with a named reason. The no-progress detector is being generalised to fingerprint state via URL + accessibility-tree hash + WebMCP tool set, so it fires on stuck submits on any web page, not just positional environments.
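The generalised fingerprint can be sketched as follows. This is an illustration of the shape, not the shipped code; we use FNV-1a here only to keep the sketch dependency-free:

```typescript
// No-progress fingerprint: URL + a hash of the accessibility tree + the set
// of visible WebMCP tools. Two consecutive turns with identical fingerprints
// mean no progress, on any web page, not just positional environments.

// Tiny FNV-1a hash, used here purely to keep the example self-contained.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

function fingerprint(url: string, a11yTree: string, tools: string[]): string {
  // Sort the tool names so enumeration order can't fake "progress".
  return `${url}|${fnv1a(a11yTree)}|${[...tools].sort().join(",")}`;
}
```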

Phase 4: editable todo plan. For non-trivial tasks, the agent proposes a one-to-five-item plan before acting. The plan lives in a pinned panel above the chat composer. You can check items off, edit them, or strike them through to steer the agent mid-run, and the edits round-trip into the agent’s next turn so it knows you re-directed. This is borrowed directly from Manus’s recitation pattern and Alumnium’s “planner doesn’t see the page” invariant: the planner role that emits the initial todo does not see the DOM, only the goal and the available tool summary. This protects the high-level reasoning from being distracted by page-level noise.
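A sketch of how edited plan state could round-trip into the next turn (the `PlanItem` shape, statuses, and checkbox markers here are illustrative, not the shipped schema):

```typescript
// Illustrative plan round-trip: the user's edits (done / struck / pending)
// are serialised and recited into the agent's next turn, so the model sees
// exactly how it was re-directed.
type PlanItem = { text: string; status: "pending" | "done" | "struck" };

function serialisePlan(plan: PlanItem[]): string {
  return plan
    .map((item, i) => {
      const mark = item.status === "done" ? "[x]" : item.status === "struck" ? "[-]" : "[ ]";
      return `${mark} ${i + 1}. ${item.text}`;
    })
    .join("\n");
}
```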

The total effect, once all four phases land, is that Auto Browser exposes the structure of its reasoning explicitly. You’ll see it plan, see it act, see it verify, and see it recover. These are the same emergent phases it’s already running through today, surfaced as first-class UI elements rather than uniform chat rows.


Why we’re writing this

Consumer browser-agent extensions are where browser agents actually meet users. The gap between research systems that benchmark at 97% on WebVoyager and the consumer extensions shipping today is not primarily a gap in model capability. The underlying models are the same. It’s a gap in how those models get surfaced, structured, and verified by the framework around them.

We think the consumer browser-agent space deserves the same architectural rigor that research systems get. That starts with being willing to audit our own system honestly against the published state of the art, share what we find, and ship the fixes. If this post gives other consumer-agent teams a frame to run the same audit on their own systems, that’s a win.

The methodology is transferable. The audit frame (loop anatomy, reasoning surface, verification surface, recovery surface, display / UX coupling) applies to any agent system, not just browser agents. We’d love to see other teams use it.


Try it

Auto Browser is in active development for the Chrome Web Store. The four phases above are landing across the next several releases, with Phase 1 first. If you run into cases where the agent goes quiet and you want to know why, tell us. Those are exactly the traces we want to build the verification and recovery surfaces against.

If you’re building a consumer browser-agent extension and want to compare notes on any of the five audit axes, get in touch. The field moves faster when the teams doing this kind of work talk to each other.


References

  • Surfer 2 (Andreux et al., 2025)
  • Skyvern 2.0
  • Alumnium + Claude Code WebVoyager run
  • Manus context engineering
  • ReAct (Yao et al., 2022)
  • WebVoyager (He et al., 2024)
  • Browser Use SOTA report