How to tell if a browser agent is any good. A field guide.

A vendor-neutral way to judge any browser agent: why the public benchmark isn’t enough, the five lenses that actually predict real-world reliability, and ten probes you can run on any agent in an afternoon.

Last updated June 7, 2026

Auto Browser team · June 2026 · 11-minute read

Every browser agent demo works. That is the problem.

You watch the side panel book a flight, fill a form, buy a basketball, and it lands clean every time. Then you point the same agent at your own messy web, and it clicks into a cookie banner, reports success, and does nothing. The gap between the demo and the Tuesday-afternoon task is the only thing worth measuring, and almost nobody publishes a way to measure it.

We have spent the better part of this year building an eval harness for our own agent, auditing ourselves against the published state of the art, and putting numbers on changes that used to be vibes. Along the way we built up a way of judging browser agents that works from the outside, on a system whose code you cannot read. This post is that method, written down. It is deliberately not about Auto Browser. You can run all of it against any agent you are considering, including ours, and you should.

Why the leaderboard number won’t save you

WebVoyager is the number everyone quotes. The leading systems now report scores in the high 90s on it: Surfer 2 reports 97.1% and Alumnium reports 98.5%. Those are real results from serious teams, and the benchmark is a genuinely good research artifact. It is also close to useless for deciding whether a given agent will hold up on your work, for three reasons.

First, it is a single scalar. “Finished the task” collapses planning, perception, action, verification, and recovery into one bit. Two agents can post the same score while failing in completely different ways, one of which you can live with and one of which you cannot.

Second, it is public, which means it is a target. The tasks are known. Once a suite is known, systems get tuned toward it, not always on purpose. A number that everyone optimizes against stops measuring the thing it originally measured.

Third, it runs against a moving web. The sites in any public suite change under it. A score from three months ago was run against a different internet than the one your agent will touch today. Reproducibility quietly erodes even when nobody does anything wrong.

None of this means throw the benchmark out. It means treat it as a floor, not a verdict. A team that publishes a WebVoyager number at least cares about measurement. A team that publishes nothing is asking you to trust the demo. But the real evaluation is the one you run yourself, on tasks you care about, looking at more than the final bit.

The five lenses

When we audited our own system, we needed axes that did not bleed into each other, so a finding on one would not secretly be a finding on another. These five held up. They apply to any agent, browser or otherwise, and you can score each one by watching the agent work, without ever seeing its source.

Loop anatomy. What happens in one decision turn, and in what order? Does the agent plan before it acts, or does it act and narrate after? What does it carry between turns, and what does it throw away? You can read most of this off the transcript. An agent that re-derives the whole plan every turn behaves very differently under a long task than one that keeps a running plan and edits it.

Reasoning surface. Where is the model allowed to think, and do you get to see it? This matters more than it sounds. Plenty of agents run capable thinking models and then drop the reasoning on the floor before it reaches the screen, because the framework around the model never reads the channel the reasoning came in on. If you are paying for a thinking model, the test is simple: can you see what it thought? If not, you are paying for tokens you never receive.

Verification surface. After a mutating action, what does the agent do to confirm it worked? This is the single highest-leverage axis and the one most agents fail. There is a world of difference between an agent that clicks “Submit” and declares victory, and one that reads the page back and checks. Verification is either built into the control flow or it is a polite request in a system prompt that the model is free to ignore. From the outside you can tell which, because the prompt-only kind reports false successes and the structural kind does not.
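To make the distinction concrete, here is a minimal sketch of what "built into the control flow" can look like. It is not any particular agent's implementation. The names `act_and_verify`, `read_state`, and `expected` are hypothetical, and the page is reduced to a string so the idea fits in a few lines. It also assumes that a successful mutating action changes the state being read, which will not hold for every page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ActionResult:
    ok: bool
    reason: str


def act_and_verify(
    action: Callable[[], None],
    read_state: Callable[[], str],
    expected: Callable[[str, str], bool],
) -> ActionResult:
    """Run a mutating action, then confirm its effect by reading state back.

    The check lives in the control flow: the caller never sees "success"
    unless the read-back agrees, whatever the model claims.
    """
    before = read_state()
    action()
    after = read_state()
    if after == before:
        # The classic false success: the click "worked" and nothing changed.
        return ActionResult(False, "page state did not change after the action")
    if not expected(before, after):
        return ActionResult(False, "page changed, but not in the expected way")
    return ActionResult(True, "verified")


# Hypothetical stand-in for a page: a click that lands in an overlay
# changes nothing, while a real click updates the cart text.
page = {"text": "cart: 0 items"}


def click_into_overlay() -> None:
    pass


def click_add_to_cart() -> None:
    page["text"] = "cart: 1 item"


def read() -> str:
    return page["text"]


def cart_has_item(before: str, after: str) -> bool:
    return "1 item" in after


print(act_and_verify(click_into_overlay, read, cart_has_item).reason)
print(act_and_verify(click_add_to_cart, read, cart_has_item).reason)
```

The prompt-only version of this has no `if after == before` branch at all. It asks the model to double-check and accepts whatever the model says next.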

Recovery surface. When something breaks (a tool errors, the agent loops on the same action, no progress is being made), what happens next? Does it tell you, or does it just go quiet? An agent that fails loudly with a named reason is far more useful than one that silently stops and leaves you guessing whether it finished or gave up. Going quiet is the worst failure mode in the category, because it costs you the most time to even notice.

Display and UX coupling. How does what the agent is doing map to what you see? If a planning step, a verification step, and a recovery step all render as the same generic row, the interface is hiding the structure of the work from you. You cannot supervise what you cannot distinguish. The agent might be doing exactly the right thing and you would never know, which means you cannot trust it when it counts.

These five are orthogonal on purpose. Score each one separately, and you get a profile instead of a number. The profile is what predicts how the agent behaves on the task the demo never showed you.

Ten probes you can run this afternoon

Lenses are how you think. Probes are how you test. Here are ten concrete tasks you can run against any agent in an hour or two, each chosen because it targets a specific failure that the happy-path demo will never surface. None of them need access to the agent’s code. You are a user with a hard task.

1. The iframe probe. Find a flow with a form embedded in a same-origin iframe (in-app help widgets, embedded booking or dashboard panels, many survey and newsletter embeds) and ask the agent to fill the embedded fields. A large fraction of consumer flows live inside iframes. An agent whose perception stops at the iframe boundary can see the frame but nothing inside it. It will click toward the form and then stall, or hallucinate fields it cannot actually read. Cross-origin frames are a separate case worth checking too: a Stripe Elements card field, say, cannot be read from the embedding page, because the browser's same-origin policy blocks the cross-document read for good security reasons. An agent that cannot see inside should tell you so rather than pretend it can.

2. The shadow-DOM probe. Find a site built on a modern web-component library (the <sl-*> and <md-*> element families are everywhere now) and ask the agent to operate a button or input that lives inside a custom element. If the agent’s perception does not descend open shadow roots, it sees the host element and none of the controls inside it.

3. The cookie-banner probe. Load a page with a consent modal covering the content, and immediately ask the agent to click something behind the modal. This is the occlusion test. A weak agent sees the target in the DOM, passes a naive visibility check, dispatches the click into the overlay that is actually on top, and reports success while the page does nothing. Watch for the false success specifically. The click “working” and the page not changing is the tell.

4. The below-the-fold probe. Ask the agent to click something far down a long page without scrolling to it first. Some agents read an element’s coordinates while it is off-screen and dispatch a click into empty space. If the agent does not scroll the target into view before acting, this fails silently and looks like the agent just decided not to do it.

5. The false-negative probe. Run a checkout or submission where the confirmation appears in a modal with unexpected wording. We have a real transcript where a confirm step returned a false negative even though the success modal actually said “LAUNCH CONFIRMED.” The agent had to take a snapshot, re-read the page, realize the action had in fact succeeded, and recover. A good agent re-checks instead of trusting its own first read. A weak one either gives up or fires the action a second time and creates a duplicate.

6. The silent-give-up probe. Hand the agent a task that cannot be completed (a button that is permanently disabled, a form that always rejects) and watch the ending. Does it tell you it failed and why, or does it just stop and start “watching for changes” in a way indistinguishable from finishing? Name-the-reason behavior here is one of the strongest signals of a mature recovery surface.

7. The oscillation probe. Give it a task where the obvious action keeps failing the same way, such as a “Submit” that errors identically every time. A good agent notices it is repeating itself, stops after a small number of strikes, and says so. A weak one loops until something times out, burning your tokens on a wall.

8. The model-swap probe. This one only applies to agents that let you choose the model, and the fact that many do not is itself a finding. Run the same task on a frontier model and on a cheaper one. The point is not which wins. The point is whether the agent’s architecture is actually model-agnostic, or whether it quietly assumes one provider’s quirks. Watch for behavior that falls apart the moment you switch, especially around tool calls and structured output. An agent that only works well on the model its team happened to build against is an agent betting your reliability on one vendor’s pricing and capability decisions.

9. The long-task probe. A five-step task that takes a couple of minutes is a different animal from a one-shot click, especially in a Chrome extension, where the browser shuts the service worker down aggressively once it looks idle. Run something that walks a multi-step checkout end to end. Agents that do not actively keep their worker alive can lose the whole turn partway through, with a tool result coming back to a worker that no longer exists. The symptom is a task that dies in the middle for no visible reason.

10. The reasoning-visibility probe. Turn on a thinking-capable model and look for the thinking. If the agent supports reasoning models but you cannot see any reasoning, the framework is probably dropping the reasoning channel before it reaches the UI. You are paying frontier prices for a capability you never get to use or inspect. We can name this one from experience, because we shipped it: one of our own providers emitted reasoning on a separate channel our adapter never read, and the thinking was silently dropped at the transport layer until we wired it through. It is an easy bug to ship and an invisible one to the user, which is exactly why it is worth probing for.

Run these ten and you will know more about an agent than any leaderboard will tell you. Most of them take a single task each. Together they cover all five lenses.
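Some probes can be scored straight off the transcript. As one illustration, here is a rough sketch for the oscillation probe (probe 7): count how many times the agent repeated an identical failure before it stopped. The `Step` shape and the three-strike threshold are our assumptions for the example, not a standard. Adapt them to whatever your agent's transcript actually exposes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Step:
    action: str            # e.g. "click #submit", as read off the transcript
    error: Optional[str]   # None means the step succeeded


def repeated_failure_strikes(steps: Iterable[Step]) -> int:
    """Largest number of times any identical (action, error) failure occurred."""
    failures = Counter(step for step in steps if step.error is not None)
    return max(failures.values(), default=0)


def passes_oscillation_probe(steps: Iterable[Step], max_strikes: int = 3) -> bool:
    """True if the agent stopped repeating itself within max_strikes attempts."""
    return repeated_failure_strikes(steps) <= max_strikes


# Hypothetical transcript: the same Submit click fails identically five times.
looping = [Step("click #submit", "HTTP 500")] * 5
print(repeated_failure_strikes(looping))    # 5
print(passes_oscillation_probe(looping))    # False
```

Passing this check is only half the probe. The other half is whether the agent said it was stopping and why, and you still have to read that off the final message yourself.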

How to run a real eval, not a demo

If you are evaluating seriously, past the afternoon probes, a few principles separate a number you can trust from a number you cannot. We learned most of these the hard way.

Hold everything constant except the one thing. The only honest comparison is same tasks, same model, same mode, before and after. We once measured an action-reliability change at +14 percentage points of accuracy and a 54% latency cut, and the only reason those numbers meant anything is that it was the same model running the same tasks in the same mode on both sides. Change two things at once and you have learned nothing about either.
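A small sketch of how a harness might enforce this. It refuses to compare two runs unless the held-constant configuration matches, and it reports accuracy in percentage points and latency as a relative change, which are different units and easy to mix up. The class names, fields, and the numbers in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Everything that must be identical on both sides of a comparison."""
    tasks: Tuple[str, ...]
    model: str
    mode: str


@dataclass(frozen=True)
class RunSummary:
    config: RunConfig
    passed: int
    total: int
    median_latency_s: float


def compare(before: RunSummary, after: RunSummary) -> Dict[str, float]:
    """Accuracy delta in percentage points, latency change in percent."""
    if before.config != after.config:
        raise ValueError("configs differ, so the comparison is confounded")
    acc_before = before.passed / before.total
    acc_after = after.passed / after.total
    latency_change = (after.median_latency_s - before.median_latency_s) / before.median_latency_s
    return {
        "accuracy_delta_pp": round((acc_after - acc_before) * 100, 1),
        "latency_change_pct": round(latency_change * 100, 1),
    }


# Made-up numbers, purely to show the units.
cfg = RunConfig(tasks=("checkout", "signup"), model="model-a", mode="default")
old = RunSummary(cfg, passed=28, total=40, median_latency_s=8.0)
new = RunSummary(cfg, passed=32, total=40, median_latency_s=6.0)
print(compare(old, new))
# {'accuracy_delta_pp': 10.0, 'latency_change_pct': -25.0}
```

Going from 70% to 80% is +10 percentage points but roughly a 14% relative improvement. Whichever one you report, say which.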

Small and deterministic beats large and realistic. For a regression net, you do not want WebVoyager. You want a handful of tasks with known-good tool-call expectations that run in seconds against a local server and give the same answer every time. Public end-to-end benchmarks are great for research and terrible for catching a regression on Tuesday, because the web underneath them moves. A seven-task deterministic suite that you actually run on every change will protect you more than a sprawling suite you run once a quarter.
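For a sense of scale, one such task can be as small as the sketch below: a prompt, a list of expected tool calls, and a check that the expected calls appear in order in what the agent actually did. The `(tool, target)` tuple shape and the in-order subsequence rule are assumptions made for this example. A stricter harness might demand an exact match.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

ToolCall = Tuple[str, str]  # (tool name, target), e.g. ("click", "#subscribe")


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str
    expected_calls: Tuple[ToolCall, ...]


def check(task: Task, actual_calls: Iterable[ToolCall]) -> Tuple[bool, str]:
    """Pass if the expected calls occur, in order, somewhere in actual_calls.

    Extra calls in between (a scroll, a snapshot) are tolerated.
    """
    remaining = iter(actual_calls)
    for want in task.expected_calls:
        # Consuming the shared iterator is what enforces the ordering.
        if not any(got == want for got in remaining):
            return False, f"{task.name}: missing or out-of-order call {want}"
    return True, f"{task.name}: ok"


# Hypothetical task against a local fixture page.
task = Task(
    name="newsletter-signup",
    prompt="Subscribe test@example.com to the newsletter",
    expected_calls=(("type", "#email"), ("click", "#subscribe")),
)
print(check(task, [("scroll", "#footer"), ("type", "#email"), ("click", "#subscribe")]))
print(check(task, [("click", "#subscribe"), ("type", "#email")]))
```

Because the fixture page is local and the expectation is a plain list, the same run gives the same verdict every time, which is the whole point of a regression net.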

Measure latency on successes only. Failed turns are noisy and unrepresentative, full of retries and timeouts that tell you nothing about how fast the agent is when it works. Average those into your latency number and the number swings around for reasons unrelated to anything you changed.
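A minimal illustration with made-up numbers: three successful turns and one failed turn that hit a timeout. We use the median here, which is our choice for the sketch. The filtering is the part that matters.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import List, Optional


@dataclass(frozen=True)
class Turn:
    success: bool
    latency_s: float


def success_latency(turns: List[Turn]) -> Optional[float]:
    """Median latency over successful turns only; None if nothing succeeded."""
    ok = [t.latency_s for t in turns if t.success]
    return median(ok) if ok else None


# Hypothetical run: the failed turn spent 30 seconds in retries and a timeout.
turns = [Turn(True, 2.0), Turn(True, 3.0), Turn(True, 4.0), Turn(False, 30.0)]
print(success_latency(turns))              # 3.0
print(mean(t.latency_s for t in turns))    # 9.75
```

The second number is dominated by one timeout. If the next run happens to have two failures instead of one, it moves again, and nothing about the agent's speed has changed.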

Publish your seed. A result nobody can reproduce is a marketing claim, not a measurement. If you cannot hand someone the exact configuration and have them get your number back, you do not actually have the number. This applies to vendors making claims and to you checking them.
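One lightweight way to make "the exact configuration" something you can hand over is to serialize it canonically and publish a fingerprint next to the result. This is a sketch under our own assumptions: the field names are hypothetical, and a real manifest would likely also pin the agent build and the fixture pages.

```python
import hashlib
import json
from typing import Any, Dict


def manifest_fingerprint(config: Dict[str, Any]) -> str:
    """Short, stable fingerprint of a run configuration.

    Keys are sorted so that two dicts with the same content always
    serialize, and therefore hash, identically.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical manifest fields.
run = {"seed": 1234, "model": "model-a", "mode": "default", "suite": "regression-v1"}
same_run_reordered = {"suite": "regression-v1", "mode": "default", "model": "model-a", "seed": 1234}
different_seed = {**run, "seed": 1235}

print(manifest_fingerprint(run) == manifest_fingerprint(same_run_reordered))  # True
print(manifest_fingerprint(run) == manifest_fingerprint(different_seed))      # False
```

A fingerprint does not make a run reproducible by itself. It makes it obvious when two people who think they ran the same thing did not.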

Write the number into every change. Once you have a harness, the discipline that pays off is making “what did this do to the eval” a required field. When a change is expected to move the number, attach the before and after. When it is not, say so. “No number” stops being the default, and the quality of every decision after that goes up.

A rubric you can actually use

If you want a single sheet to score an agent against, here it is. Rate each lens 0 to 2: 0 if absent, 1 if present but weak, 2 if structural and visible.

Loop anatomy        plans before acting, keeps state across turns      [0 1 2]
Reasoning surface   model's thinking is produced AND shown to you      [0 1 2]
Verification        mutating actions are checked by the control flow   [0 1 2]
Recovery            failures stop with a named, visible reason         [0 1 2]
Display coupling    plan / act / verify / recover are distinguishable  [0 1 2]

A 10 is a research-grade system someone shipped to users. Most consumer agents today land between 3 and 6, and the missing points cluster in verification and recovery, the two lenses that matter most when the task is real and you are not watching. That clustering is not a coincidence. Verification and recovery are the unglamorous parts. They do not show up in the demo, so they are the first things to get cut.
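If you are scoring several agents, it helps to keep the profile and the total together so the total never travels alone. A small sketch, with hypothetical key names for the five lenses and a made-up profile:

```python
from typing import Dict, List, Union

LENSES = (
    "loop_anatomy",
    "reasoning_surface",
    "verification",
    "recovery",
    "display_coupling",
)


def score(profile: Dict[str, int]) -> Dict[str, Union[int, List[str]]]:
    """Validate a 0-2 rating per lens and return the total plus the weakest lenses."""
    for lens in LENSES:
        if lens not in profile:
            raise ValueError(f"missing lens: {lens}")
        if profile[lens] not in (0, 1, 2):
            raise ValueError(f"{lens} must be 0, 1, or 2")
    lowest = min(profile[lens] for lens in LENSES)
    return {
        "total": sum(profile[lens] for lens in LENSES),
        "out_of": 2 * len(LENSES),
        "weakest": [lens for lens in LENSES if profile[lens] == lowest],
    }


# Made-up profile for illustration only.
example = {
    "loop_anatomy": 2,
    "reasoning_surface": 1,
    "verification": 0,
    "recovery": 0,
    "display_coupling": 2,
}
print(score(example))
# {'total': 5, 'out_of': 10, 'weakest': ['verification', 'recovery']}
```

Two agents can both total 5 with the zeros in different places, which is why the `weakest` list is worth more than the sum.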

The honest limits of all this

A profile is not a guarantee. An agent can score well on every lens and still trip on your specific site, because the web is adversarially weird and no eval covers all of it. The probes catch categories of failure, not every instance. And a deterministic suite, by being deterministic, deliberately gives up on the realism that a live public benchmark has. These are the right trade-offs for catching regressions and comparing systems, but they are trade-offs, and pretending otherwise is exactly the kind of overclaiming this whole post is arguing against.

The point is not to find a perfect agent. There isn’t one. The point is to stop being sold by the demo, and to know precisely how the agent you choose is going to let you down, so you can decide whether you can live with it.

Why we wrote this

The gap between a research system that benchmarks at 97% and a consumer extension shipping to a side panel today is not mostly a gap in model capability. The models are the same. It is a gap in how those models get surfaced, structured, and verified by the framework around them. That gap is invisible on a leaderboard and obvious the first time an agent clicks into a cookie banner and tells you it succeeded.

We think the consumer browser-agent space deserves the same rigor research systems get, and that starts with users having a way to judge for themselves instead of trusting the demo or the leaderboard. The five lenses, the ten probes, and the rubric are all transferable. Run them on us. Run them on whoever you are comparing us against. If you turn up a failure mode we should be probing for and aren’t, tell us, because that is exactly the kind of thing we want to add to the list.

Related reading: our audit against Surfer 2, Skyvern 2.0, Alumnium, and Manus, where the five lenses first appeared; and the perception rewrite and the eval, where the harness and the numbers in this post come from.