Engineering

The perception rewrite, and the eval that finally tells us if we're heading the right way.

What shipped in v1.1: iframe-aware perception, a four-provider abstraction with a behavioural contract test, an action-reliability bundle worth +14 pp accuracy and −54% latency on our internal eval, and the harness that proves it.

Last updated April 24, 2026

Auto Browser team · April 2026 · 12-minute read

Two posts ago we audited ourselves against Surfer 2, Skyvern 2.0, Alumnium, and Manus and committed to four phases of work. One post ago we walked through what actually shipped, what got caught in review, and why one of the four phases looked nothing like the version we sketched.

Then we kept going.

This is the v1.1 release post. The headline is not any single feature. The perception layer was rewritten. It descends into iframes and shadow roots, filters occluded refs, draws numbered labels onto screenshots, and paginates. Underneath sits a four-provider abstraction with a behavioural-Liskov contract test. Underneath that sits an eval harness that finally lets us put numbers on whether a change made the agent better or worse.

Perception first. Then the provider abstraction. Then the action-reliability bundle. Then the harness.


The perception rewrite

The audit’s take_snapshot was a depth-first walk of the top document, accepting every interactive ref it saw and emitting a flat list of uids. It worked on most of the static web. It missed at least four entire categories of page that the modern web is built out of, and we had no number to tell us how badly.

Iframes were invisible. Iframes carry roughly half of the consumer-facing flows the agent encounters in the wild: Stripe checkout, captcha widgets, embedded video controls, in-app help docs, every Google product surfaced inside another. The walker would arrive at an <iframe> element, emit it as a single ref, and stop. The agent could click into the iframe but could not see anything inside it.

Open shadow roots were invisible. Web Components are now the default for several major design systems (Lit-based component libraries, every <sl-*> and <md-*> element). The walker did not descend into shadowRoot, so a button inside a <my-button> custom element was visible to the agent only as the host, with no inner state, no inner text, and no nested controls.

Hidden refs leaked through. Our isVisible test consulted offsetParent and inline style.display, which catches inline-style hiding but misses every stylesheet rule. A display:none set in CSS produced refs the agent treated as actionable, then quietly failed to dispatch into them.

Occluded refs leaked through. A button covered by a cookie-consent modal is technically in the DOM and technically passes a visibility test. Click-dispatching it sends the click into the modal overlay instead of into the button. The modal absorbs the click. The agent reports success. The page does nothing. We had reproducible traces of this for months.

The rewrite adds these as opt-in or default-on options on the same take_snapshot tool. Adding new tools forces the agent to learn a new vocabulary and re-enter a new state machine; changing the existing tool’s output lets every existing prompt benefit immediately and lets us compare same-prompt eval scores across before/after.

Iframe traversal

The walker now descends through iframe.contentDocument.body for any same-origin frame, with a single shared uid counter so refs inside frames are indistinguishable from top-doc refs. From the agent’s perspective, the page is one tree.

Cross-origin frames cannot be entered (browser security correctly blocks the cross-document read), but they cannot be silently dropped either, because the rendered page would then not match the snapshot the agent reads. Instead the walker emits a single sanitised marker:

[cross-origin iframe: stripe.com/checkout]

The sanitisation is the unglamorous part. Newlines in the src attribute get neutralised so they cannot inject pseudo-tool-output into the snapshot text the LLM reads. OAuth code and state tokens in query strings are stripped before display. blob:, data:, and javascript: URIs are collapsed to the scheme name only. No payload, no ;base64,, no inline script. None of those mitigations are theoretical; each came from a worked example we did not want to live with.
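A minimal sketch of that sanitisation pass, under stated assumptions. The function names, the list of stripped parameters, the "unparseable-src" fallback, and the host-plus-path output format are illustrative, not the shipped code. It relies only on the standard URL class.

```typescript
// Hypothetical helpers: names and output format are assumptions for illustration.
const OPAQUE_SCHEMES = ["blob:", "data:", "javascript:"];
const SENSITIVE_PARAMS = ["code", "state"]; // OAuth tokens named in the text

function sanitiseFrameSrc(rawSrc: string): string {
  // Replace control characters (newlines included) with a space so the src
  // cannot start a new line inside the snapshot text.
  const flat = rawSrc.replace(/[\u0000-\u001f\u007f]+/g, " ").trim();

  // Collapse payload-carrying schemes to the scheme name only.
  const lower = flat.toLowerCase();
  for (const scheme of OPAQUE_SCHEMES) {
    if (lower.indexOf(scheme) === 0) return scheme;
  }

  let parsed: URL;
  try {
    parsed = new URL(flat);
  } catch {
    return "unparseable-src"; // assumption: a fixed placeholder for bad input
  }

  // Strip OAuth tokens before anything is displayed.
  for (const key of SENSITIVE_PARAMS) parsed.searchParams.delete(key);
  const query = parsed.searchParams.toString();
  const path = parsed.pathname === "/" ? "" : parsed.pathname;
  return parsed.host + path + (query ? "?" + query : "");
}

function crossOriginMarker(rawSrc: string): string {
  return "[cross-origin iframe: " + sanitiseFrameSrc(rawSrc) + "]";
}
```

Under these assumptions, a src of https://stripe.com/checkout yields exactly the marker shown above, and a data: URI yields only "data:".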

Shadow-root descent

Open shadow roots are walked through on the same pass. The same uid counter, the same coordinate system. Closed shadow roots are not entered; that is the platform’s contract and we honour it. In practice, every modern component library we have encountered uses open mode.
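The single-pass walk over frames and shadow roots can be sketched roughly as follows. This is a DOM-free model so it runs outside a browser: SnapNode, SnapRef, and walkSnapshot are hypothetical names, and the frame src is assumed to be sanitised already. In a real page, a closed shadow root is simply not reachable through element.shadowRoot, which returns null.

```typescript
// Hypothetical node model standing in for DOM elements.
interface SnapNode {
  tag: string;
  interactive?: boolean;
  children?: SnapNode[];
  shadow?: { mode: "open" | "closed"; children: SnapNode[] };
  frame?: { sameOrigin: boolean; src: string; body: SnapNode[] };
}

interface SnapRef {
  uid: string;
  tag: string;
}

function walkSnapshot(root: SnapNode): { refs: SnapRef[]; markers: string[] } {
  const refs: SnapRef[] = [];
  const markers: string[] = [];
  let nextUid = 1; // one counter shared by the top document, frames and shadow trees

  function visit(node: SnapNode): void {
    if (node.interactive) {
      refs.push({ uid: "r" + nextUid, tag: node.tag });
      nextUid++;
    }
    if (node.frame) {
      if (node.frame.sameOrigin) {
        node.frame.body.forEach(visit); // descend: same tree from the agent's view
      } else {
        markers.push("[cross-origin iframe: " + node.frame.src + "]");
      }
    }
    // Only open shadow roots are entered; closed ones are left alone.
    if (node.shadow && node.shadow.mode === "open") {
      node.shadow.children.forEach(visit);
    }
    (node.children || []).forEach(visit);
  }

  visit(root);
  return { refs, markers };
}
```

Because the counter lives outside visit, a button in the top document, a link inside a same-origin frame, and a button inside an open shadow root come out numbered consecutively.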

Computed-style visibility

isVisible now consults getComputedStyle. display: none is checked along the ancestor chain, since hiding a parent hides every descendant while leaving each descendant’s own computed display untouched. visibility: hidden and the underused visibility: collapse are checked element-locally: visibility is inherited, so the element’s computed value already reflects a hidden ancestor, and a child that sets visibility: visible correctly overrides a hidden parent. Stylesheet-based hiding stops emitting phantom uids.
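A sketch of that check, assuming a hypothetical StyledNode that stands in for an element plus the values getComputedStyle would report for it. The real isVisible may test more than these two properties.

```typescript
// Hypothetical stand-in for an element and its computed style.
interface StyledNode {
  display: string;    // computed display of this element
  visibility: string; // computed visibility (inheritance already applied)
  parent: StyledNode | null;
}

function isVisibleSketch(node: StyledNode): boolean {
  // display: none on any ancestor removes the whole subtree, but does not
  // change the descendants' own computed display, so walk the chain.
  for (let cur: StyledNode | null = node; cur !== null; cur = cur.parent) {
    if (cur.display === "none") return false;
  }
  // visibility is read on the element only: its computed value already
  // accounts for ancestors, and a child may override with "visible".
  return node.visibility !== "hidden" && node.visibility !== "collapse";
}
```

The asymmetry is the point: the display check needs the ancestor walk, while the visibility check would produce false negatives if it walked ancestors.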

Occlusion filter

The filterOccluded flag is on by default. For each candidate ref, the walker resolves the centre point of its bounding box and asks document.elementFromPoint what is actually there. If the answer is a different element entirely (a modal, a cookie banner, a sticky header), the ref is dropped.

Two subtleties came out of the test pass. First, the filter has to be descendant-accepting: a <button><span>Buy</span></button> should not false-positive when elementFromPoint returns the inner <span>, because the inner span is the button as far as click dispatch is concerned. Second, it is recall-biased: when in doubt, the ref passes through. Aggressive occlusion-filtering caused regressions on legitimate refs that happened to sit under a transparent overlay, and we would rather see one extra ref than miss the only correct one.
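Both subtleties fit in a few lines. The sketch below is an assumption-laden model, not the shipped filter: HitNode and isOccluded are hypothetical names, and the injected hitTest callback plays the role of document.elementFromPoint. It does not attempt the transparent-overlay case.

```typescript
// Hypothetical node with a parent pointer, enough to test containment.
interface HitNode {
  id: string;
  parent: HitNode | null;
}

// Stand-in for document.elementFromPoint.
type HitTest = (x: number, y: number) => HitNode | null;

function containsNode(ancestor: HitNode, node: HitNode | null): boolean {
  for (let cur = node; cur !== null; cur = cur.parent) {
    if (cur === ancestor) return true;
  }
  return false;
}

function isOccluded(
  ref: HitNode,
  centre: { x: number; y: number } | null,
  hitTest: HitTest
): boolean {
  // Recall-biased: no usable centre point means the ref passes through.
  if (centre === null) return false;
  const hit = hitTest(centre.x, centre.y);
  // Recall-biased: no answer (for example, point outside the viewport) passes too.
  if (hit === null) return false;
  // Descendant-accepting: a hit on the ref itself or on anything inside it
  // (the <span> inside the <button>) is not occlusion.
  return !containsNode(ref, hit);
}
```

Only a positive answer naming an unrelated element drops the ref; every uncertain case keeps it.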

Snapshot pagination

Long pages no longer one-shot the model. take_snapshot({ offset: N }) skips the first N emittable refs and returns next_offset so the agent can iterate. The pathological case (a max_chars so small that even a single ref doesn’t fit) used to loop forever on offset increments that produced empty pages. It now returns page_overflow: true with an actionable error, and pagination stops.
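The pagination contract, including the overflow guard, can be sketched like this. The field names next_offset and page_overflow come from the tool; the function name, the one-line-per-ref model, and the use of null for "no more pages" are assumptions.

```typescript
interface SnapshotPage {
  text: string;
  next_offset: number | null; // assumption: null means nothing left to fetch
  page_overflow: boolean;
}

// refLines: one already-rendered line of snapshot text per emittable ref.
function paginateRefs(refLines: string[], offset: number, maxChars: number): SnapshotPage {
  const picked: string[] = [];
  let used = 0;
  let i = offset;
  while (i < refLines.length) {
    const cost = refLines[i].length + 1; // +1 for the joining newline
    if (used + cost > maxChars) break;
    picked.push(refLines[i]);
    used += cost;
    i++;
  }
  // Refs remain but not even one fits: returning an empty page with a
  // next_offset would invite an endless loop, so flag overflow and stop.
  if (picked.length === 0 && i < refLines.length) {
    return { text: "", next_offset: null, page_overflow: true };
  }
  return {
    text: picked.join("\n"),
    next_offset: i < refLines.length ? i : null,
    page_overflow: false,
  };
}
```

The guard is the only branch that returns an empty page while refs remain, which is exactly the state the old loop could not escape.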

Numbered-overlay screenshots

take_snapshot({ include_screenshot: true }) pairs the text snapshot with a JPEG that has the snapshot uids drawn directly onto the pixels. Numbered labels, anchored to each ref’s bounding box, clamped to the image bounds for elements at the edge of the viewport. The pipeline runs entirely in the service worker via OffscreenCanvas, so the agent’s tab is never blocked by the rendering. If the labelling pass fails for any reason, the tool degrades to the raw screenshot rather than failing the snapshot.
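The clamping step is simple enough to show in isolation. This sketch assumes the label is anchored just above the top-left corner of the ref’s box; the actual anchor position, and the names PixelBox and labelOrigin, are hypothetical.

```typescript
interface PixelBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface PixelSize {
  width: number;
  height: number;
}

// Returns the top-left corner at which to draw a label of size `label`
// for a ref occupying `ref`, kept fully inside an image of size `image`.
function labelOrigin(ref: PixelBox, label: PixelSize, image: PixelSize): { x: number; y: number } {
  // Preferred position: directly above the ref's top-left corner (assumption).
  const preferredX = ref.x;
  const preferredY = ref.y - label.height;
  // Clamp each axis to [0, image size - label size].
  const maxX = Math.max(image.width - label.width, 0);
  const maxY = Math.max(image.height - label.height, 0);
  return {
    x: Math.min(Math.max(preferredX, 0), maxX),
    y: Math.min(Math.max(preferredY, 0), maxY),
  };
}
```

A ref at the very top of the viewport gets its label pushed down to y = 0 rather than drawn off-image, and a ref at the right edge gets its label pulled left.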

This is what unlocks the multimodal step in the loop. The model can now write click(ref="r17") and the screenshot it sees has a 17 drawn next to the matching button. That is a much shorter mental hop than parsing the textual snapshot in one pass and the screenshot in another and reconciling them by description.

The default for the screenshot is off. The WebMCP-first pitch is that token-efficient text snapshots are the cheap path; screenshots are an opt-in for the cases that need them. Defaulting to on would drag every agent turn into image-token territory and undermine the whole point.

One coordinate space

We had three different coordinate-space bugs in three different surfaces (the screenshot overlay, the iframe walker, the shadow-root descent) over the course of three weeks. Each was independently fixed, and each fix subtly broke a different surface that depended on the previous coordinate convention. After the third one we paused and consolidated.

A single coordinates module is now the authority for every viewport-coord transform: bboxCenter, iframeChainOffset, hitTestRootFor. Every consumer of “where is this ref on the screen” calls into the same place. The cost is a slightly less ergonomic call site; the benefit is that any future fix lives in exactly one module.
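To make the shape of that module concrete, here is a sketch of two of the three functions. bboxCenter and iframeChainOffset are the names used above, but the signatures and the ViewportRect and ViewportPoint types are assumptions, and the sketch ignores frame borders, padding, and CSS transforms. toTopViewport is a hypothetical convenience wrapper.

```typescript
interface ViewportRect {
  left: number;
  top: number;
  width: number;
  height: number;
}

interface ViewportPoint {
  x: number;
  y: number;
}

// Centre of a box, in the coordinate space of the document that owns it.
function bboxCenter(rect: ViewportRect): ViewportPoint {
  return { x: rect.left + rect.width / 2, y: rect.top + rect.height / 2 };
}

// Accumulated offset of every enclosing iframe, each rect expressed in the
// coordinate space of its own parent document.
function iframeChainOffset(frameRects: ViewportRect[]): ViewportPoint {
  let x = 0;
  let y = 0;
  for (const frame of frameRects) {
    x += frame.left;
    y += frame.top;
  }
  return { x, y };
}

// Hypothetical wrapper: where a ref's centre lands in the top-level viewport.
function toTopViewport(rect: ViewportRect, frameRects: ViewportRect[]): ViewportPoint {
  const centre = bboxCenter(rect);
  const offset = iframeChainOffset(frameRects);
  return { x: centre.x + offset.x, y: centre.y + offset.y };
}
```

With every surface calling the same three functions, a fix to the offset arithmetic changes the overlay, the iframe walker, and the dispatcher at once.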

Anywhere a calculation is duplicated across multiple surfaces and the same bug recurs, the fix is not better testing. The fix is a single authoritative module that every surface routes through. Tests are the safety net; the architecture is the actual fix.


Multi-provider, one contract

We went from two LLM backends (OpenRouter + Built-in AI) to four (+ Gemini direct, + Local OpenAI-compat). Doubling the provider set tends to produce divergent code paths and bugs that exist only on one backend. The work that prevented that is below.

A behavioural Liskov contract test

Every provider implements the same LLMProvider interface. The contract test runs the same suite of behavioural assertions against every provider: same input shapes, same expected output shapes, same error semantics. A new provider PR cannot land green unless the contract suite passes against it. This is what made it safe to add Gemini and Local in the same release without a regression on OpenRouter or Built-in AI.

Where ordinary per-provider tests miss interface drift, the contract test catches it. A change to one provider that subtly shifts the meaning of a return value (say, a null content for a tool-only response) gets compared against every other provider’s behaviour for the same case. Drift surfaces immediately rather than weeks later when the framework code starts to assume different things on different backends.
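In miniature, the idea looks like the sketch below. Everything in it is hypothetical: the LLMProviderSketch shape (synchronous for brevity, where a real provider would be async), the two contract cases, and the two fake providers. The second fake drifts in precisely the way described above, returning an empty string where the contract expects null.

```typescript
interface ReplySketch {
  content: string | null;
  toolCalls: { name: string; args: object }[];
}

// Assumed, simplified provider shape; not the real LLMProvider interface.
interface LLMProviderSketch {
  id: string;
  complete(prompt: string): ReplySketch;
}

interface ContractCase {
  label: string;
  check: (p: LLMProviderSketch) => boolean;
}

// Written once, applied to every provider.
const contractCases: ContractCase[] = [
  {
    label: "tool-only reply has null content",
    check: (p) => {
      const r = p.complete("TOOL_ONLY");
      return r.content === null && r.toolCalls.length === 1;
    },
  },
  {
    label: "plain reply has string content and no tool calls",
    check: (p) => {
      const r = p.complete("hello");
      return typeof r.content === "string" && r.toolCalls.length === 0;
    },
  },
];

function runContract(providers: LLMProviderSketch[]): string[] {
  const failures: string[] = [];
  for (const p of providers) {
    for (const c of contractCases) {
      let ok = false;
      try {
        ok = c.check(p);
      } catch {
        ok = false; // a throwing provider fails the case
      }
      if (!ok) failures.push(p.id + ": " + c.label);
    }
  }
  return failures;
}

function fakeProvider(id: string, toolOnlyContent: string | null): LLMProviderSketch {
  return {
    id,
    complete: (prompt) =>
      prompt === "TOOL_ONLY"
        ? { content: toolOnlyContent, toolCalls: [{ name: "click", args: {} }] }
        : { content: "hi", toolCalls: [] },
  };
}

const conformingProvider = fakeProvider("conforming", null);
const driftingProvider = fakeProvider("drifting", ""); // "" instead of null
```

Running runContract over both fakes reports a single failure, on the drifting provider’s tool-only case, which is the kind of divergence per-provider tests tend to encode as correct.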

Native tool-use, where the provider supports it

OpenRouter and Gemini now route the agent’s decisions through the provider’s native tool-call channel rather than parsing JSON-in-text. The shared tool catalog flows through one schema; tool names are sanitised to be WebMCP-safe; thinking-mode passthrough is preserved. Local OpenAI-compat falls back to JSON-in-text because most of the local servers we test against do not implement tool-use compatibly yet. Built-in AI does its own thing (next section).

The win is robustness, not capability. JSON-in-text fails when models hallucinate trailing commas, emit unbalanced brackets, or wrap their answer in prose. Native tool-call passes a structured object through a typed channel and the framework never has to guess.

Built-in AI chat-only mode

This was a forced architectural choice. The Auto Browser tool catalogue is 31 tools at full surface. Serialised into a Gemini Nano context window (roughly 6K tokens), the manifest alone consumes most of the available budget before the user has typed a word. The ReAct loop never had room to run.

The fix is structural. Built-in AI now routes around the ReAct loop entirely for chat-only sends. The user types, the model answers, optionally with image or audio attached. There are no tools in the prompt, no decision schema, no validation gate. Images and audio Blobs go straight into session.prompt() via the native multimodal path with zero base64 conversion (Built-in AI accepts Blobs natively; the moment you base64-encode them you triple the token cost on a context that cannot afford it).

The on-device provider cannot drive the agent. It can answer questions, summarise pages the user has open, transcribe audio. ReAct on Nano is not on the roadmap because Nano is not the model for it. The day a larger on-device model ships, the ReAct path will adopt it without the chat-only carve-out. Until then, the carve-out stays.

Parse-failure hardening

For every provider that can still emit malformed structured output (the JSON-in-text fallback, mostly), we hardened the parse pipeline:

  • Grammar-aware balanced-bracket walker. The previous parser was string-based and would happily close a brace inside a string literal. The new walker tracks string-vs-code state and does not.
  • String-aware repair pipeline. Trailing commas, missing closing quotes, and a small handful of well-defined recovery cases are repaired before reparse, with a budget that fails out rather than looping.
  • Decision-shape gate. The post-parse object is checked against the decision schema. Multi-root output (the model emitted two top-level decisions in one response) is detected and rejected explicitly.
  • Branched retry reminder. When the parse fails, the retry prompt tells the model why (blocked, truncated, or malformed) instead of a generic “try again.” The model conditions on the failure mode and responds accordingly.
  • max_tokens raised from 800 to 2048 wherever the provider allows it. The single most common cause of parse failure was the model getting cut off mid-JSON.
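The first and third items in the list above can be sketched together. This is an illustrative reduction, not the shipped parser: findJsonRoots and parseDecision are hypothetical names, only braces are tracked, the repair pipeline is omitted, and the "truncated" heuristic (an opening brace that never closes) is an assumption.

```typescript
// Returns every balanced top-level {...} span, ignoring braces that sit
// inside JSON string literals.
function findJsonRoots(text: string): string[] {
  const roots: string[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"' && depth > 0) {
      inString = true; // quotes in surrounding prose are not tracked
    } else if (ch === "{") {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === "}" && depth > 0) {
      depth--;
      if (depth === 0) roots.push(text.slice(start, i + 1));
    }
  }
  return roots;
}

type ParseOutcome =
  | { ok: true; decision: unknown }
  | { ok: false; reason: "malformed" | "truncated" | "multi-root" };

function parseDecision(text: string): ParseOutcome {
  const roots = findJsonRoots(text);
  // Two top-level decisions in one response are rejected explicitly.
  if (roots.length > 1) return { ok: false, reason: "multi-root" };
  if (roots.length === 0) {
    const opened = text.indexOf("{") >= 0;
    return { ok: false, reason: opened ? "truncated" : "malformed" };
  }
  try {
    return { ok: true, decision: JSON.parse(roots[0]) };
  } catch {
    return { ok: false, reason: "malformed" };
  }
}
```

The reason field is what a branched retry reminder would condition on; a full decision-shape gate would validate the parsed object against the decision schema as a further step.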

Individually each change is small. Together they cut the unrecoverable parse-failure rate to a number we no longer chart, because it does not move.


The action-reliability bundle, and the first numbers

This is where the eval harness earned its keep.

The audit identified that mutating actions had no settle window, no scroll-into-view, and no graceful path for stale refs (an old uid pointing at an element that no longer exists after a re-render). All three are dispatch-path problems. We fixed them in a single PR because each fix in isolation has a measurable but small effect, and the compound effect is what we wanted to measure.

Scroll-to-visible before bbox read. Before the dispatcher reads an element’s bounding box for CDP coordinates, it calls el.scrollIntoView({ block: "center", inline: "center", behavior: "instant" }). Below-the-fold targets that used to dispatch at viewport y=5000 (off-screen, no-op) now reliably dispatch at the centre of the viewport.

Settle after success. The dispatcher awaits an injected settle() (default 50 ms) before returning success. This bridges the click→snapshot race for the case where the click opens a modal and the snapshot taken on the next turn would otherwise see the pre-modal page.

Stale-ref recovery. When the dispatcher receives a stale-uid error, it attaches a fresh take_snapshot result inline to the error response, saving the agent a full ReAct round-trip. We deliberately do not auto-redispatch on a name-match (a ref called “Add to cart” might be a different “Add to cart” after a re-render and we will not touch a mutating action without explicit re-confirmation).
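The stale-ref path is the least obvious of the three, so here is a sketch of it alone (scroll-to-visible and the settle window are left out, and the real dispatcher is asynchronous). SnapshotSource, DispatchResult, and dispatchClick are hypothetical names, as are the error string and the fresh_snapshot field.

```typescript
type DispatchResult =
  | { ok: true }
  | { ok: false; error: "stale_ref"; fresh_snapshot: string };

// Hypothetical view of the page as the dispatcher needs it.
interface SnapshotSource {
  hasUid(uid: string): boolean;
  takeSnapshot(): string;
  click(uid: string): void;
}

function dispatchClick(source: SnapshotSource, uid: string): DispatchResult {
  if (!source.hasUid(uid)) {
    // Attach a fresh snapshot to the error so the agent can pick a new ref
    // on its next step without spending a turn asking for one.
    // Deliberately no auto-redispatch on a name match: the mutating action
    // is left for the agent to re-confirm.
    return { ok: false, error: "stale_ref", fresh_snapshot: source.takeSnapshot() };
  }
  source.click(uid);
  return { ok: true };
}
```

The design choice is in what the error branch does not do: it never calls click, so a look-alike element cannot be actioned by accident.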

The numbers, on a 7-task subset of our internal eval, running llama-3.1-8b in JSON-in-text mode (the hardest configuration we test against, deliberately):

  • Before: 3/7 passing, mean latency 3042 ms.
  • After: 4/7 passing, mean latency 1395 ms.
  • Delta: +14 percentage points accuracy (3/7 ≈ 43% to 4/7 ≈ 57%), −54% latency. On the same model, in the same mode, on the same tasks.

The accuracy gain came from the scroll-to-visible fix (one task that had been dispatching off-screen now dispatches correctly). The latency gain came overwhelmingly from stale-ref recovery (no more wasted ReAct round-trips on transient stale refs) plus the settle window cutting double-snapshot retries.

A 7-task subset is small. But it is the first run on this codebase where we have a number attached to an architectural change, instead of a vibe and a hope. The next iteration runs the same suite on every provider.


The harness, the panel, and the keepalive

Three pieces of operational infrastructure made the rest of this release possible.

The eval harness

The harness is a flat list of named tasks, each with a tool-call expectation. Every run writes a timestamped JSONL report with per-task latency and verdict. The runner picks a provider and a model from CLI args, executes every task, and emits the report. A compare mode diffs two reports.
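A sketch of the report and compare steps, under assumptions: the JSONL field names (task, verdict, latency_ms), the function names, and the choice to average latency over all tasks rather than passing ones only are all hypothetical. compareRuns assumes both runs contain at least one task.

```typescript
interface TaskRecord {
  task: string;
  verdict: "pass" | "fail";
  latency_ms: number;
}

interface RunSummary {
  passed: number;
  total: number;
  meanLatencyMs: number;
}

// One JSON object per line; blank lines are skipped.
function summariseReport(jsonl: string): RunSummary {
  const records = jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TaskRecord);
  const passed = records.filter((r) => r.verdict === "pass").length;
  const latencySum = records.reduce((sum, r) => sum + r.latency_ms, 0);
  return {
    passed,
    total: records.length,
    meanLatencyMs: records.length > 0 ? latencySum / records.length : 0,
  };
}

// Accuracy delta in percentage points, latency delta as a percentage of the
// "before" run, both rounded to one decimal place.
function compareRuns(before: RunSummary, after: RunSummary): { accuracyDeltaPp: number; latencyDeltaPct: number } {
  const round1 = (n: number) => Math.round(n * 10) / 10;
  const accuracyBefore = before.passed / before.total;
  const accuracyAfter = after.passed / after.total;
  return {
    accuracyDeltaPp: round1((accuracyAfter - accuracyBefore) * 100),
    latencyDeltaPct: round1(((after.meanLatencyMs - before.meanLatencyMs) / before.meanLatencyMs) * 100),
  };
}
```

Fed the two summaries quoted earlier in this post (3/7 at 3042 ms, then 4/7 at 1395 ms), compareRuns returns an accuracy delta of 14.3 points and a latency delta of −54.1%, which round to the +14 pp and −54% in the headline.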

This is deliberately not WebVoyager. WebVoyager is a great public research benchmark, but it is not a regression safety net. We needed a small, fast, deterministic suite that runs in seconds against the local dev server, that we can re-run on every PR, and that targets the internal tool-call behaviour rather than end-to-end success on a moving public web. The 7-task subset is the smoke test; a larger suite is the next thing on the harness’s roadmap.

The runtime diagnostics panel

The diagnostics panel is the first piece of the side panel that exists for us and not for the user. Mode-scoped collectors (one bag per provider, per phase) gather counts and latencies. Success-only latency is the headline number, because failed turns are noisy and unrepresentative. The panel is reachable from the sidebar and is gated behind a setting; we ship it on by default in dev builds and off by default in releases.

The panel is what told us that stale-ref recovery saves a ReAct round-trip. We watched the round-trip count drop in real time on a chosen test page. The eval harness validated it against a fixed task suite afterward. Both are necessary; either alone is not.

Offscreen keepalive

Service workers in MV3 are aggressively idle-timed. Long agent turns (the kind that walk a five-step checkout) would routinely lose the SW mid-turn, and a tool dispatch would return to a worker that no longer existed. The fix is a refcounted offscreen document that holds the SW alive for the duration of any turn that takes a refcount. Tombstones survive extension updates, so a hot-reload during a turn does not orphan the keepalive.
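The refcounting itself is small. In this sketch the class name and callback names are hypothetical; in an extension the two callbacks would presumably wrap chrome.offscreen.createDocument and chrome.offscreen.closeDocument, and tombstone persistence is not modelled.

```typescript
class KeepaliveRefcount {
  private count = 0;
  private readonly openDoc: () => void;
  private readonly closeDoc: () => void;

  constructor(openDoc: () => void, closeDoc: () => void) {
    this.openDoc = openDoc;
    this.closeDoc = closeDoc;
  }

  // Called at the start of a turn. Returns the release function for that turn.
  acquire(): () => void {
    if (this.count === 0) this.openDoc(); // first holder opens the document
    this.count++;
    let released = false;
    return () => {
      if (released) return; // releasing twice must not double-decrement
      released = true;
      this.count--;
      if (this.count === 0) this.closeDoc(); // last holder closes it
    };
  }

  get holders(): number {
    return this.count;
  }
}
```

Handing each turn its own idempotent release function means an error path that releases twice cannot close the document under another turn that is still running.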


What this changes about how we work

Three things, in increasing order of how much they will affect the next release.

The eval is now part of every PR description. Where a change is expected to move the eval, the PR body includes a before/after run on the same task subset and the same model. Where it is not, the body says so explicitly. “No number” is no longer the default.

The contract test gates the provider surface. Adding a fifth provider is now a small, scoped piece of work, because the behavioural assertions that have to pass are written once and applied uniformly.

The perception layer is one thing now. The take_snapshot tool is the agent’s eyes. Every additional capability (the next pass might be SVG-aware text extraction, or table-structure inference) lands as an option on the same tool, not as a new tool. The agent’s vocabulary stays small and the perception surface gets richer.


Try it

Auto Browser 1.1 is available on the download page while Chrome Web Store review continues. The numbers in this post (accuracy and latency) were generated on this codebase, on this branch, with a public seed. If you run the same configuration on the same suite and get a different number, tell us. Reproducibility is the whole point of the harness.

If you are building a consumer browser-agent extension and want to compare notes on perception, the four-provider abstraction, or the eval methodology, our mail is open. The audit frame from the first post still applies to your system, and we still want to read what you find.


Previous posts in this series: the audit · the four phases shipped.