I agree that there is communication overhead between the agents, although so far it looks like they can communicate very effectively and efficiently. We’re also working on efficient ways to transfer more contextual information
I think our main differentiator is that we have been building browser agents for over a year; our technology today is 5x faster and 7x cheaper than browser-use, while also being significantly more reliable
we benchmarked on WebVoyager and have a few more benchmarks coming up (see https://docs.smooth.sh/performance); we’ll be publishing the full results shortly
we run headful browsers with fingerprinting that is as stealthy as possible, and on top of that we can use your IP
anecdotally, making all requests originate from your own residential address has been a major success compared to other cloud-only solutions
it will be interesting to see how this plays out; having to “hide the agent” feels like a temporary workaround until society accepts that agents actually do exist
it's interesting to see how things will play out, but I really believe that Claude Code (maybe with Opus 4.6) + a click tool + a move_mouse tool + a snapshot-page tool + another 114 tools is definitely not the best approach
the main issues with this interface are that the commands are too low-level and that there is no way to control the context over time
once a snapshot is added to the context, those tokens take up precious context-window space, leading to context rot, higher cost, and higher latency
that's why agents need very large models for these kinds of systems to work and, unfortunately, even then they're slow, expensive, and less reliable than a purpose-built system
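to make the context-rot argument concrete, here's a rough back-of-the-envelope sketch: if every page snapshot stays in the agent's history, the window fills up after a few dozen steps. All numbers (snapshot size, per-step overhead, window size) are assumptions for illustration, not measurements of any real system.

```python
# Rough illustration of how per-step page snapshots inflate an agent's
# context. Every number here is an assumed, illustrative value.
SNAPSHOT_TOKENS = 8_000   # assumed tokens for one DOM/accessibility snapshot
STEP_OVERHEAD = 500       # assumed tokens per tool call + model reply
CONTEXT_WINDOW = 200_000  # assumed model context window

def tokens_after(steps: int) -> int:
    """Total context consumed if every snapshot stays in history."""
    return steps * (SNAPSHOT_TOKENS + STEP_OVERHEAD)

def steps_until_full() -> int:
    """How many agent steps fit before the window is exhausted."""
    steps = 0
    while tokens_after(steps + 1) <= CONTEXT_WINDOW:
        steps += 1
    return steps

print(steps_until_full())  # with these assumed numbers: 23 steps
```

and that is before any retries or error recovery, which add more snapshots still.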
I wonder if a standardized interface will organically emerge over time. At the moment SKILL.md + CLI seem to be the most broadly adopted interface - even more than MCP maybe
the Claude --chrome command has a few limitations:
1. it exposes low-level tools that make your agent interact directly with the browser, which is extremely slow, very expensive, and less effective, as the agent ends up dealing with UI mechanics instead of thinking about the higher-level goal/intent
2. it makes Claude operate the browser via screenshots and coordinate-based interaction, which does not work for tasks like data extraction where it needs to attend to the whole page: the agent has to repeatedly scroll and read one little screenshot at a time, and it often misses critical context outside the viewport. It also makes the task harder, because the model has to figure out both what to do and how to do it, which means you need larger models to make this paradigm work at all
3. because it uses your local browser, it has full access to your authenticated accounts by default, which might not be ideal in a world where prompt injections are only getting started
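point 2 can be put in numbers with a toy model: viewport-by-viewport reading costs one scroll+screenshot+read round trip per visible slice, while a text/DOM extraction attends to the whole page in one pass. The page and viewport sizes below are illustrative assumptions, not measurements.

```python
import math

# Toy model of the viewport problem: screenshot-based reading sees one
# viewport at a time, so a long page costs many round trips. Numbers
# are illustrative assumptions.
PAGE_ROWS = 4_000     # assumed total rendered rows on a long listing page
VIEWPORT_ROWS = 50    # assumed rows visible in one screenshot

def screenshot_round_trips(page_rows: int, viewport_rows: int) -> int:
    """Each round trip = scroll + screenshot + model read of that slice."""
    return math.ceil(page_rows / viewport_rows)

print(screenshot_round_trips(PAGE_ROWS, VIEWPORT_ROWS))  # 80 model calls, vs 1 full-page extraction
```

and each of those 80 calls also adds an image to the context, compounding the cost issue from point 1.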
if you actively use the --chrome command we'd love to hear your experience!
I am sure they measured the difference, but I am wondering why reading screenshots + coordinates is more efficient than selecting ARIA labels? https://github.com/Mic92/mics-skills/blob/main/skills/browse.... The JavaScript snippets should at least be more reusable if you want to semi-automate websites with memory files
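for readers unfamiliar with the ARIA-label approach being contrasted here: instead of clicking pixel coordinates on a screenshot, you enumerate elements by their accessible names and target those. A minimal stdlib-only sketch (a real agent would run the equivalent query inside the browser, e.g. via CDP or Playwright; this just shows the idea on static HTML):

```python
from html.parser import HTMLParser

class AriaIndex(HTMLParser):
    """Collect elements that carry an aria-label, keyed by accessible name."""
    def __init__(self):
        super().__init__()
        self.labels: dict[str, str] = {}  # accessible name -> tag name

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases attribute names
        if "aria-label" in attrs:
            self.labels[attrs["aria-label"]] = tag

html = """
<button aria-label="Add to cart">🛒</button>
<a aria-label="Open settings" href="/settings">⚙</a>
"""
idx = AriaIndex()
idx.feed(html)
print(idx.labels)  # {'Add to cart': 'button', 'Open settings': 'a'}
```

an agent can then act on "Add to cart" by name regardless of where it is rendered, which is what makes such snippets reusable across sessions.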
It still may not be quite ideal. For example, right now I am building a clone of Counter-Strike. The files are so large that tunneling would be cumbersome.