The four phases shipped. Here's what actually landed.

Following up on the audit post: what shipped, where we got caught in review, and why Phase 4 looks nothing like the version we described.

Last updated April 23, 2026

Auto Browser team · April 2026 · 10-minute read

Earlier this week we published an audit of Auto Browser against Surfer 2, Skyvern 2.0, Alumnium, and Manus, and committed to four phases of work in response. The phases are now in the codebase. The commit history diverges from the plan in ways we didn’t expect, so this post is a follow-up: what actually landed, where we got caught in review, and why Phase 4 looks nothing like the version we described.

Phase 1: visible reasoning, unified across four providers

Phase 1 was supposed to do three things. Label every step with a phase. Stream the model’s long-form thought live under each step. Wire OpenRouter’s reasoning channels into our pipeline so Anthropic-thinking and o-series users finally see what they paid for. All three landed. Two of them were straightforward. The third was harder than it sounds, and it exposed a class of bug that had been sitting in our code for months.

Here’s what we found while wiring OpenRouter. Our settings UI only persisted a small bag of fields for each provider (API key, model, a few per-provider extras). The normaliser that runs on every config read stripped anything outside that bag. Net effect: even if you opened DevTools and set the OpenRouter thinking flag by hand, the next save would erase it. OpenRouter was permanently wedged in reasoning-off mode, which defeated the entire headline change of Phase 1 before any user saw it.

We caught this in review. The fix was three changes at once: extend the normaliser to preserve the thinking keys, thread them through the save payload, and add a checkbox plus an effort dropdown to the sidebar settings so users don’t need DevTools in the first place. After the fix, an Anthropic-thinking model running through OpenRouter streams its reasoning in real time under each step, the same way Gemini always did. The same fix covers local OpenAI-compatible reasoning setups (DeepSeek R1 or QwQ, served through Ollama or LM Studio). Four providers, one behaviour.
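A minimal sketch of the allow-list failure mode described above, and of the normaliser half of the fix. The type, the function name, and the key names (apiKey, model, thinkingEnabled, thinkingEffort) are assumptions for illustration; the post does not show the real config schema.

```typescript
// Hypothetical config shape; the real schema is not shown in the post.
type ProviderConfig = Record<string, unknown>;

// Keys the settings UI persisted before the fix (illustrative names).
const BASE_KEYS = ["apiKey", "model"] as const;
// Keys the fix adds to the allow-list (illustrative names).
const THINKING_KEYS = ["thinkingEnabled", "thinkingEffort"] as const;

// Allow-list normaliser: any key outside `allowed` is dropped on every read.
function normaliseProviderConfig(
  raw: ProviderConfig,
  allowed: readonly string[],
): ProviderConfig {
  const out: ProviderConfig = {};
  for (const key of allowed) {
    if (key in raw) out[key] = raw[key];
  }
  return out;
}

// A config seeded by hand, as if set through DevTools.
const seeded: ProviderConfig = {
  apiKey: "sk-example",
  model: "example-model",
  thinkingEnabled: true,
  thinkingEffort: "high",
};

// Before the fix: the thinking keys are stripped, so reasoning stays off.
const before = normaliseProviderConfig(seeded, BASE_KEYS);
// After the fix: the thinking keys survive the round trip.
const after = normaliseProviderConfig(seeded, [...BASE_KEYS, ...THINKING_KEYS]);
```

The other two parts of the fix (the save payload and the sidebar controls) are not sketched here, because they depend on UI code the post does not describe.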

A smaller regression came in on the same PR. In schema-thought mode, where a provider without a native reasoning stream is asked to put its thoughts into a JSON field, we were feeding the model’s chain of thought back to itself twice per turn. Once as a prefix on the assistant message, which is intentional and gives it episodic continuity. Once inside the JSON body, which was a bug. The model was reading its own previous reasoning twice and paying for the tokens twice. The fix: shallow-copy the decision object, delete the thought field before serialisation, ship. Small change, meaningful cost drop on Built-in AI where schema mode is the default.
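The shallow-copy fix can be sketched as follows. The Decision shape and the function name are hypothetical; only the idea (keep the thought as a prefix, drop it from the serialised JSON body) comes from the description above.

```typescript
// Hypothetical decision shape; field names are assumptions for illustration.
interface Decision {
  thought?: string;
  action: string;
  args: Record<string, unknown>;
}

// Build the assistant history entry. The thought appears once, as a prefix.
// The JSON body is serialised from a shallow copy with the thought removed.
function toHistoryMessage(decision: Decision): string {
  const body: Decision = { ...decision }; // shallow copy; the original is untouched
  delete body.thought;
  const prefix = decision.thought ? `${decision.thought}\n` : "";
  return prefix + JSON.stringify(body);
}

const exampleDecision: Decision = {
  thought: "The cart is empty, so add the item first.",
  action: "click",
  args: { ref: "e12" },
};
const historyMessage = toHistoryMessage(exampleDecision);
```

Copying before deleting matters because the same decision object is presumably still needed elsewhere in the turn (for example, to render the thought in the UI).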

Visible outcome: turn on reasoning in your provider settings, watch the model think in real time under each step. Invisible outcome: context grows roughly half as fast per turn on Built-in AI, and the long-form reasoning from the most capable OpenRouter models is no longer disappearing at the transport layer.

Phase 2: the validator, and the “Confirm password” problem

Phase 2 was the one we were most nervous about. Post-action verification as a structural gate is expensive if you do it wrong. A validator that runs an extra LLM call after every click would roughly double the per-turn cost of the agent and burn through the prompt cache. A validator that only runs sometimes has to know when to run and when to skip, which means a classifier, which means a new class of false positives.

The shape we settled on is two tiers. A cheap pre-pass does deterministic checks first. URL change, title change, expected text appearing, validation-shaped console errors. Any of those short-circuits to a verdict with no LLM call. Only if the pre-pass is inconclusive does the bounded LLM verify call run, scoped to the intent and a tight pre/post state snapshot. In practice the vast majority of submits are settled by the cheap pre-pass. The LLM verdict is the exception, not the rule.
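A sketch of the two-tier shape, under stated assumptions. The PageState fields, the verdict names, the keyword match used for “validation-shaped” errors, and the mapping from each signal to a verdict are all illustrative; the post only says that any of the deterministic signals short-circuits to a verdict. The llmVerify parameter stands in for the bounded LLM call.

```typescript
type Verdict = "committed" | "failed" | "inconclusive";

// Hypothetical pre/post snapshot; the real state shape is not shown in the post.
interface PageState {
  url: string;
  title: string;
  text: string;
  consoleErrors: string[];
}

// Tier 1: deterministic checks only, no LLM call.
function cheapPrePass(pre: PageState, post: PageState, expectedText?: string): Verdict {
  const newErrors = post.consoleErrors.filter((e) => !pre.consoleErrors.includes(e));
  // Assumption: "validation-shaped" is approximated by a keyword match.
  if (newErrors.some((e) => /invalid|required|validation/i.test(e))) return "failed";
  if (post.url !== pre.url || post.title !== pre.title) return "committed";
  if (expectedText !== undefined && post.text.includes(expectedText)) return "committed";
  return "inconclusive";
}

// Tier 2 runs only when tier 1 cannot decide.
async function verify(
  pre: PageState,
  post: PageState,
  llmVerify: () => Promise<Verdict>,
  expectedText?: string,
): Promise<Verdict> {
  const quick = cheapPrePass(pre, post, expectedText);
  return quick === "inconclusive" ? llmVerify() : quick;
}
```

The point of the ordering is cost: the LLM callback is never invoked when a deterministic signal has already settled the verdict.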

The classifier that decides “is this action submit-shaped” is a pure module. It watches for always-shaped actions (any navigation, Enter chords, fills with trailing Enter, accepted dialogs), plus clicks on buttons whose accessible name starts with a commit verb (submit, buy, confirm, pay, place, checkout, save, create, apply, and so on). First draft shipped in the PR. Review caught something obvious in hindsight.

The first version also ran the commit-verb check on plain fill actions, on the theory that a fill on a button-labelled input might be a commit. That logic was wrong for a category reason, not a coverage reason. On a click, the accessible name labels the action (it’s the button you’re pressing). On a fill, the accessible name labels what you’re typing into, which is the input’s purpose, not the action’s commit-shape. Under the original classifier, every one of these triggered the validator after every keystroke-batch fill:

“Confirm password” (second password field on every signup form)
“Apply code” and “Apply discount” (checkout coupon fields)
“Confirm email”
“Order number”, “Send to”, “Save as draft”

Two consecutive ambiguous verdicts on the same fill could escalate the loop into recovery. False-positive cascade on essentially every form in existence. The fix was four lines. On fill, only flag as submit-shaped when press_enter: true. The trailing Enter is the actual commit. If a page has a rare fill-that-commits-on-blur control, the model can self-flag with verification_required: true in the schema. The page semantics live with the model, not with the field label.
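The corrected fill/click boundary might look like the sketch below. The AgentAction shape and function names are hypothetical; press_enter and verification_required are the field names used above. The verb list contains only the verbs named in this post (the real list is longer), and Enter chords and accepted dialogs are left out to keep the example short.

```typescript
// Hypothetical action shape; only press_enter and verification_required come from the post.
interface AgentAction {
  tool: "click" | "fill" | "navigate";
  accessibleName?: string;
  press_enter?: boolean;
  verification_required?: boolean;
}

// Partial list: only the commit verbs named in the post.
const COMMIT_VERBS = ["submit", "buy", "confirm", "pay", "place", "checkout", "save", "create", "apply"];

function startsWithCommitVerb(name: string): boolean {
  const firstWord = name.trim().toLowerCase().split(/\s+/)[0] ?? "";
  return COMMIT_VERBS.includes(firstWord);
}

function isSubmitShaped(action: AgentAction): boolean {
  // The model can self-flag rare controls, such as a fill that commits on blur.
  if (action.verification_required === true) return true;
  if (action.tool === "navigate") return true;
  // On a click, the accessible name labels the action being taken.
  if (action.tool === "click") return startsWithCommitVerb(action.accessibleName ?? "");
  // On a fill, the name labels the field, so only a trailing Enter counts as a commit.
  return action.press_enter === true;
}
```

With this boundary, a fill into “Confirm password” is not submit-shaped, while a click on a button named “Confirm order” still is.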

Visible outcome: every submit-shaped action now surfaces a verdict under its tool card. Committed cases (the common one) are a quiet green note. Failed verdicts flip the agent into a recover phase and queue a system note with the evidence that made the verdict. No more silent “task complete” after a click that did nothing.

Phase 3: the environment fingerprint, and why “no progress” was silently a no-op

The audit post called this one out bluntly. We had a no-progress detector that was supposed to catch the stuck-submit case, and it silently did nothing on generic web pages because it was keyed on positional state, the kind of x/y coordinate that WebMCP games put in their tool results. A click() that returns {ok: true} has no positional anchor. The detector degraded to a no-op on every real web page, which is every web page a real user cares about.

Phase 3 replaces the positional-only state key with a three-part environment fingerprint that falls back when positional state is missing. The three ingredients are:

The URL, minus query and fragment. That leaves protocol plus host plus path. The query and fragment are stripped because session ids and tracking parameters change across otherwise-identical states.
The accessibility-tree snapshot, normalised. Sort the lines, drop the ref-ids (they change between snapshots even when the page is identical), drop transient value="..." markers. What’s left is role plus name plus state, which identifies the page.
The sorted list of WebMCP tool names, because a tool-set flip is a state change even when the URL and DOM don’t move.
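The three ingredients above can be combined roughly as follows. This is a sketch, not the shipped code: the snapshot line format (one node per line, with a [ref=...] marker) and all function names are assumptions.

```typescript
// Assumed snapshot format: one node per line, e.g. `button "Pay" [ref=e12] value="x"`.
function normaliseSnapshot(snapshot: string): string {
  return snapshot
    .split("\n")
    .map((line) =>
      line
        .replace(/\[ref=[^\]]*\]/g, "") // ref-ids change between identical snapshots
        .replace(/value="[^"]*"/g, "") // transient input values
        .replace(/\s+/g, " ")
        .trim(),
    )
    .filter((line) => line.length > 0)
    .sort()
    .join("\n");
}

// Keep protocol, host, and path; drop everything from the first `?` or `#`.
function stripQueryAndFragment(url: string): string {
  return url.split(/[?#]/)[0] ?? url;
}

function environmentFingerprint(url: string, snapshot: string, toolNames: string[]): string {
  return JSON.stringify([
    stripQueryAndFragment(url),
    normaliseSnapshot(snapshot),
    [...toolNames].sort(), // copy before sorting so the caller's array is untouched
  ]);
}
```

Two visits to the same page with different session ids, ref-ids, and tool ordering produce the same string, while a change in the tool set produces a different one.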

Positional state stays primary. WebMCP games that put a roomId or coordinate in their tool result still hit the positional path first, which keeps the pre-Phase-3 behaviour on everything that already worked. The environment fingerprint is the fallback for everything else, which in practice is roughly all of the real web. The working-memory ledger now catches “the same click in the same environment with no visible change”, raises an advisory after two strikes, and bails after three.
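A sketch of the two-strikes, three-strikes rule. The class name, the signal names, and the exact definition of a strike (the same action repeated while the fingerprint is unchanged before and after) are assumptions; the post only gives the thresholds.

```typescript
type LedgerSignal = "ok" | "advisory" | "bail";

class NoProgressLedger {
  private lastKey: string | null = null;
  private strikes = 0;

  // Record one action with the environment fingerprint before and after it.
  record(actionKey: string, fingerprintBefore: string, fingerprintAfter: string): LedgerSignal {
    // Any visible change counts as progress and clears the strike count.
    if (fingerprintBefore !== fingerprintAfter) {
      this.lastKey = null;
      this.strikes = 0;
      return "ok";
    }
    // Same action in the same environment extends the streak; anything else restarts it.
    const key = `${actionKey}|${fingerprintAfter}`;
    this.strikes = key === this.lastKey ? this.strikes + 1 : 1;
    this.lastKey = key;
    if (this.strikes >= 3) return "bail";
    if (this.strikes >= 2) return "advisory";
    return "ok";
  }
}
```

In this sketch the fingerprint strings would come from the positional key when it exists and from the environment fingerprint otherwise.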

The other half of Phase 3 is the one-shot recovery call. Every previously-silent bail path (oscillation, repeated validator failure, duplicate-call spam) now runs through a short LLM call before terminal exit. The recovery prompt is tight. Return replan_steps with one to three concrete steps to break the loop, or return an abort_reason and give up on record. The bail becomes named and visible either way.

If recovery returns a plan, the agent continues with a reset counter and the next phase forced to act. The user sees a recovery banner, and the plan gets logged into history so the model can see its own proposal. If recovery gives up, the user sees a specific banner with a named reason: oscillation for stuck-action loops, verification_failed for confirmed failure-to-act, and recovery_bailed for “tried a plan and still stuck.” No more quiet disengagement.
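The contract of the recovery call can be sketched as a small result handler. The replan_steps and abort_reason field names come from the description above; the outcome type, the function name, and the handling of a malformed plan are assumptions.

```typescript
// The recovery call returns either a short plan or a reason for giving up.
type RecoveryResponse = { replan_steps: string[] } | { abort_reason: string };

// Hypothetical outcome type consumed by the agent loop.
type RecoveryOutcome =
  | { kind: "continue"; steps: string[]; nextPhase: "act"; strikeCounter: 0 }
  | { kind: "abort"; reason: string };

function applyRecovery(response: RecoveryResponse): RecoveryOutcome {
  if ("replan_steps" in response) {
    const steps = response.replan_steps.filter((s) => s.trim().length > 0);
    // The prompt asks for one to three concrete steps.
    if (steps.length >= 1 && steps.length <= 3) {
      return { kind: "continue", steps, nextPhase: "act", strikeCounter: 0 };
    }
    // Assumption: a plan outside that range is treated as giving up.
    return { kind: "abort", reason: "malformed recovery plan" };
  }
  return { kind: "abort", reason: response.abort_reason };
}
```

Either branch yields something nameable, which is what lets the bail surface as a banner instead of a silent stop.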

One review finding worth calling out. On the validator-recovery path specifically, the first version of the patch skipped a WebMCP tool refresh that every other recovery path went through. If the mutating action changed the page’s WebMCP surface (for example a checkout that registers step-2 tools after step-1 submits), the recovery call would plan against yesterday’s tool list. The fix was a five-line addition. The class of the bug is worth the writeup though: anywhere a new loop branch bypasses a step that every other branch runs, you probably broke something, even if the tests pass.

Phase 4: the part we changed our minds about

This is the honest part of the post. Phase 4 as we described it in the audit was a Surfer-2-style Orchestrator running an Alumnium-style “planner doesn’t see the page” invariant. A separate planner role would emit the initial todo, never touch the DOM, and hand off to the Navigator. We were going to surface the plan as an editable todo panel above the chat composer.

We didn’t ship that.

What changed was a realisation during implementation, reinforced by a competitor-parity screenshot review. Alumnium’s separation earns its keep because the outer planner (Claude Code with Sonnet) and the inner executor (GPT-5 Nano) are different models at different price points. The plan-time expensive model reasons without burning itself on DOM noise; the act-time cheap model grinds through page mechanics. Our planner and executor would be the same provider, running the same model. That same-provider version of the separation costs one extra LLM call per turn and produces a purity win the user can’t see.

The second thing we noticed is that the phase pill we introduced in Phase 1 (that “VERIFY” / “ACT” / “ANALYZE” label on every step) was solving a problem we didn’t actually have. The information was already in the narration the model writes for each step. A step that says “Checking the cart total against the product page” is already visibly an analysis beat. The label “ANALYZE” above it added no information and doubled the vertical space.

So Phase 4 became a different piece of work. The phase taxonomy stays internal to the loop. It still drives recovery routing, working-memory observations, the submit-shape check. But the chat surface is now one step-row per step: a status icon on the left (pending, running spinner, done, error), the narration as the row’s title, a collapsed tool card nested inside the row, the validator verdict nested below it. One row per step. The status icon carries the state. The narration carries the story.

The writeup of that PR includes the line “Why this and not the Orchestrator (Phase 4),” which is the kind of commit message that would have been embarrassing to push without a good answer. The answer we ended up with is that Phase 4’s cost/benefit didn’t pencil out against our architecture, and the user-visible problem the phase pill was supposed to solve was a problem we’d invented. We deleted the pill and nested the step content into a single compact row.

One other thing that fell out of the redesign. The typing-dots bubble, the thoughts disclosure, and the step-row spinner had all grown independently and were stacking on top of each other as three separate indicators for the same “model is thinking” signal. We dropped the typing bubble, nested the thoughts disclosure inside the step row’s body, and opened the row eagerly on stream start with a provisional narration. One indicator per step.

To be clear about what this means for the audit framework: the axis Phase 4 was supposed to close (display and UX coupling) is closed. Not via planner/executor separation, but via a better chat surface. Different solution to the same gap. We’d rather ship the thing that helps users than the thing that matches a paper we admired.

What the numbers say

The tests tell a consistent story. 740 passing before Phase 1. 792 after. 870 after Phase 2. 902 after Phase 3. 902 after the Phase-4-that-wasn’t, because sidebar changes don’t touch the logic layer. Every phase that touched the logic layer added testable surface. Every bug caught in review was caught because there was a test that exercised the new code path, and another test that proved the old behaviour hadn’t regressed.

The other signal from this cycle is what review actually caught. Across four PRs, review found three bugs that tests had not covered:

Config keys that never reached the wire (the OpenRouter thinking-off case).
A classifier boundary that was right on one tool and wrong on a related tool (the submit-shape fill/click distinction).
A branch that skipped a refresh that every other branch did (the validator-recovery tool-refresh miss).

All three are the shape of bug that lives between modules rather than inside one. They’re exactly the kind of thing that code review is better at than unit tests. We’re not rushing to let tests replace review any time soon.

What this changes about the audit

The audit post introduced a five-lens frame: loop anatomy, reasoning surface, verification surface, recovery surface, display and UX coupling. After shipping, every lens has something concrete attached to it.

Loop anatomy now includes the validator and recovery branches, not just Navigator steps.
Reasoning surface is unified across four providers and re-fed into history for episodic continuity.
Verification surface is a structural gate, not an English prompt contract.
Recovery surface is named, visible, and gets one LLM-assisted replan attempt before terminal exit.
Display and UX coupling is tight enough that the chat surface shows pending/running/done/error per step without a separate phase taxonomy layer.

The audit frame held up. The specific Phase 4 implementation didn’t. That’s a reasonable ratio, and it’s the reason we publish posts like this one.

What’s next

Two things we’re watching.

The first is Phase 1.5 surface work. The Built-in AI “opt out of thinking visibility” path and the local-provider reasoning toggle both work today via DevTools seeding, but neither has a proper sidebar UI. We deferred that to a follow-up on purpose, because the right home for it is the existing per-model editor on local providers, and that surface deserves its own piece rather than being bolted onto a reasoning PR.

The second is more audits on our own work. The bugs that review caught this cycle were all inter-module bugs, the kind that slip through unit tests. That’s the same lens we applied to Surfer 2, Skyvern 2.0, and the others in the audit post. If you run a browser agent and you want a second pair of eyes on the same axes, our inbox is open.

If you’re an Auto Browser user and you’ve hit the silent-stuck case the audit post described, you should be seeing specific banners now instead of mystery silence. If you’re not, tell us. Those are the traces we built the verification and recovery surfaces for.

Previous post: We compared Auto Browser to the best browser agents of 2026.