Why is this not a Launch YC (or at least mention it?) since you seem to be part of the current batch?
The record/replay is definitely an interesting direction. The browser automation space is getting super crowded though (even within YC), so curious to hear how you differentiate from the others (BrowserUse, Browserbase, BrowserBook, Skyvern).
We're YC S25 and launched in the summer. Demonstrate Mode is a new feature we recently added to our platform, and we thought it would be worth sharing here.
Re differentiation: the space is crowded and feature sets converge. But, like LLM providers, we feel there's room for multiple players with different positioning long term (enterprise, developers, etc.). Right now, we're focused on making the product that feels most exciting to build with - hope people can tell that :)
Yes, you can use BrowserBook to write e2e test automations, but we don't currently include Playwright assertions in the runtime - we excluded these since they're geared toward a specific use case, and we wanted to build more generally. Let us know if you think we should include them, though; we're always looking for feedback.
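For anyone who does want assertions in their e2e flows today, here's a minimal sketch of layering Playwright's own assertion API on top of an automation yourself (the URL and selectors are hypothetical placeholders, not BrowserBook internals):

```python
# Minimal sketch of adding Playwright assertions around an automation.
# The URL and selectors are hypothetical placeholders, not BrowserBook internals.
from playwright.sync_api import sync_playwright, expect

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/login")        # hypothetical page
    page.fill("#email", "user@example.com")       # hypothetical selectors
    page.fill("#password", "hunter2")
    page.click("button[type=submit]")
    # Web-first assertions auto-retry until they pass or time out.
    expect(page).to_have_url("https://example.com/dashboard")
    expect(page.get_by_role("heading", name="Dashboard")).to_be_visible()
    browser.close()
```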
> For scraping, how do you handle Cloudflare and Captchas?
Cloudflare Turnstile challenges and captchas tend to be less of an issue in the inline browser because it's just a local Chrome instance, which avoids the usual bot-detection flags of headless or cloud browsers (datacenter IPs, user-agent quirks, etc.). For hosted browsers, we use Kernel's stealth mode to similar effect.
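To illustrate the general "local Chrome" idea (a generic Playwright sketch, not how our inline browser is actually implemented): driving a locally installed, headful Chrome means requests carry a normal browser fingerprint and come from the user's own IP.

```python
# Generic sketch of the "local Chrome" idea, not the actual inline-browser
# implementation: launch the locally installed Chrome in headful mode so
# requests carry a normal fingerprint and the user's own IP.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        channel="chrome",  # use the locally installed Chrome build
        headless=False,    # headful mode avoids headless-specific tells
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```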
> Do you respect robots.txt instructions of websites?
We leave this up to the developer creating the automations.
AI crawlers have led to a big surge in scraping activity, and most of these bots don't respect any of the scraping best practices the industry has developed over the past two decades (robots.txt, rate limits, user agents, etc.).
This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported here on HN (and experienced myself).
Does Webhound respect robots.txt directives and do you disclose the identity of your crawlers via user-agent header?
We currently use Firecrawl for our crawling infrastructure. Looking at their documentation, they claim to respect robots.txt, but based on user reports in their GitHub issues, the implementation seems inconsistent - particularly for one-off scrapes vs full crawls.
This is definitely something we need to address on our end. Site owners should have clear ways to opt out, and crawlers should be identifiable. We're looking into either working with Firecrawl to improve this or potentially switching to a solution that gives us more control over respecting these standards.
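Independent of Firecrawl, the two practices in question are cheap to sketch with Python's standard library (the bot name and contact URL below are hypothetical placeholders, not Webhound's actual crawler identity):

```python
# Minimal sketch of the two practices discussed above: check robots.txt before
# fetching a URL, and disclose the crawler's identity via the User-Agent header.
# "WebhoundBot" and the contact URL are hypothetical placeholders.
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "WebhoundBot/1.0 (+https://example.com/bot-info)"

def fetch_if_allowed(url: str) -> bytes | None:
    parsed = urllib.parse.urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site owner has opted out for this agent/path
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```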
The third quote is from a VC who has never founded a startup himself and has a clear interest in pushing founders to trade work-life balance for his own quick returns.
So none of these people worked on anything longer than 2 years. I wonder what will happen if we check back in 5–10 years. Will they still be doing and promoting 996, or will they be burned out and have changed their minds? Make your bets.
Every one of these quotes is from someone who would be junior or mid-level at best at any company. Not trying to be ageist, but mid-twenty-somethings are filled with enthusiasm and fantastical ideas that have yet to be vetted or guided by real-world experience. I agree with your skepticism here.
Sure, but don't they have the absolute most sought-after skills at the peak of the AI bubble? That's the issue. The dude is asking for 996 to work on an LLM/Patchright wrapper library that also works in the cloud. And with those skills, you can earn twice as much or more at more mature corporations.
Technically, they are also writing their own CDP implementation now.
Why work for less if you can work for more, with a better work-life balance?
Spending 8+ hours a day training to avoid death is a pretty good motivator. The mental requirements are lower but not gone; the physical requirements are much, much higher.
Thanks! To clarify, we launched our document processing APIs a while ago. This launch is specifically for a new platform we're building around our API, based on all the things our customers previously had to build internally to support their use of Reducto (eval tools, monitoring, etc.).
Generally speaking, my view on the space is that this was crowded well before LLMs. We've met a lot of the folks that worked on things like drivers for printers to print PDFs in the 1990s, IDP players from the last few decades, and more recent cloud offerings.
The context today is clearly very different than it was in the IDP era though (human process with semi-structured content -> LLMs are going to reason over most human data), and so is the solution space (VLMs are an incredible new tool to help address the problem).
Given that, I don't think it's surprising that companies inside and outside of YC have pivoted into offering document processing APIs over the past year. Generally speaking, we don't see differentiation as just a matter of feature set, since that will converge over time; instead we focus primarily on accuracy, reliability, and scalability, all three of which see a very substantive impact from last-mile improvements. I think the best testament to that is that the customers we've onboarded are very technical and, as a result, very thorough when choosing the right solution for them. That includes a company-wide rollout at one of the 4 biggest tech companies, one of the 3 biggest trading firms, and a big set of AI product teams like Harvey, Rogo, ScaleAI, etc.
At the end of the day, I don't see VLM improvements as antagonistic to what we're doing. We already use them a lot for things like agentic OCR (correcting mistakes from our traditional CV pipeline). On some level, our customers aren't just choosing us for PDF->markdown; they're onboarding with us because they want to spend more of their time on the things that are downstream of having accurate data, and I expect there'll be room for us to make that even more true as models improve.
To clarify, our API was already fully launched and in prod with customers when we raised our series A. This launch is specifically for the platform we're building around the API :)
Founder of Extend (https://www.extend.ai/) here, it's a great question and thanks for the tag. There definitely are a lot of document processing companies, but it's a large market and more competition is always better for users.
In this case, the Reducto team seems to have cloned us down to the small details [1][2], which is a bit disappointing to see. But imitation is the best form of flattery I suppose! We thought deeply about how to build an ergonomic configuration experience for recursive type definitions (which is deceptively complex), and concluded that a recursive spreadsheet-like experience would be the best form factor (which we shipped over a year ago).
> "How do you see the space evolving as LLMs commoditize PDF extraction?"
Having worked with a ton of startups & F500s, we've seen that there's still a large gap for businesses in going from raw OCR outputs -> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.
The prompt engineering / schema definition is only the start. You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it takes time and effort — and that's where we come in. Our goal is to give AI teams all of that tooling on day 1, so they hit accuracy quickly and focus on the complex downstream post-processing of that data.
Hey, we've never used or even attempted to use your platform. Respectfully, I think you know that, and that you also know your team has tried to get access to ours using personal Gmail accounts dating back to 2024.
A schema builder with nested array fields has been part of our playground (and nearly every structured extraction solution) for a very long time and is just not something that we even view as a defining part of the platform.
Thanks for the reply. Not sure what you're referring to, but I don't believe we've ever copied or taken inspo from you guys on anything — but please do let me know if you feel otherwise.
It's not a big deal at the end of the day, and I'm excited to see what we can both deliver for customers. Congrats on the launch!
I agree. I don't know either company, but a schema builder is a very common feature in many data platforms, nested or otherwise. Neither is claiming this is a big deal, though.
AI crawlers have led to a big surge in scraping/crawling activity on the web, and many don't use proper user agents or stick to any of the scraping best practices the industry has developed over the past two decades (robots.txt, rate limits). This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported on HN (and experienced myself).
Do you have any built-in features that address these issues?
I work in the adtech ad verification space, and this is very true. The surge in content scraping has made things very, very hard in some instances. I can't really fault the website owners either.
AI agents have led to a big surge in scraping/crawling activity on the web, and many don't use proper user agents or stick to any of the scraping best practices the industry has developed over the past two decades (robots.txt, rate limits). This comes with negative side effects for website owners (costs, downtime, etc.), as repeatedly reported on HN.
Do you have any built-in features that address these issues?
Yes, some hosting services have experienced a 100%-1000% increase in hosting costs.
On most platforms, browser use only requires the interactive elements, which we extract, and does not need images or videos. We have not yet implemented this optimization, but it will reduce costs for both parties.
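For illustration, that optimization could look something like request interception that drops images, media, and fonts (a generic Playwright sketch, not what we ship today):

```python
# Generic sketch of the optimization described above: intercept requests and
# skip images, media, and fonts so the agent only downloads what it needs.
# (Illustrative only; not the product's implementation.)
from playwright.sync_api import sync_playwright

SKIP_TYPES = {"image", "media", "font"}

def block_heavy_resources(route):
    if route.request.resource_type in SKIP_TYPES:
        route.abort()        # never fetch the asset at all
    else:
        route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.route("**/*", block_heavy_resources)
    page.goto("https://example.com")
    browser.close()
```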
Our goal is to abstract backend functionality from webpages. We could cache this and only update the cache when ETags change.
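A minimal sketch of that ETag idea using conditional requests (the in-memory dict is a stand-in for whatever cache store would actually be used):

```python
# Minimal sketch of ETag-based caching: re-fetch a page only when the server
# reports it changed; a 304 response means the cached copy is still valid.
# The in-memory dict is a stand-in for a real cache store.
import urllib.error
import urllib.request

_cache: dict[str, tuple[str, bytes]] = {}  # url -> (etag, body)

def fetch_with_etag(url: str) -> bytes:
    headers = {}
    if url in _cache:
        headers["If-None-Match"] = _cache[url][0]
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            body = resp.read()
            etag = resp.headers.get("ETag")
            if etag:
                _cache[url] = (etag, body)
            return body
    except urllib.error.HTTPError as e:
        if e.code == 304:  # not modified: serve the cached copy
            return _cache[url][1]
        raise
```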
Websites that really don't want us will come up with audio captchas and new creative methods.
Agents are different from bots. Agents are intended as a direct user clone and could also bring revenue to websites.
>Websites that really don't want us will come up with audio captchas and new creative methods.
Which you or other AIs will then figure out a way around. You literally mention "extract data behind login walls" as one of your use cases, so it sounds like you just don't give a shit about the websites you are impacting.
It's like saying, "If you really don't want me to break into your house and rifle through your stuff, you should just buy a more expensive security system."
Imo, if the website doesn't want us there, the long-term value isn't great anyway (maybe the exception is SERP APIs or something similar, which exist exclusively because the Google Search API is brutally expensive).
> extract data behind login walls
We mean this more from the perspective of a company wanting the data but facing a login wall. For example (actual customer): "I am a compliance company that has a system from 2001, and interacting with it is really painful. Let's use Browser Use to use the search bar, download data, and report back to me."
I believe in the long run agents will have to pay for the data from website providers, and then the incentives are once again aligned.
> imo if the website doesn't want us there the long term value is anyway not great
Wat? You're saying that if a website doesn't want you scraping their data, then that data has low long-term value? Or are you saying something else, because that makes no fucking sense.
Haha no, I am saying that if websites don’t want you there they will find a way to block you in the long run, so betting the product on extracting data from those websites is a bad business model (probably)
It would be really nice if you made some easy way for administrators to tell that a client is using Browser Use so they can administratively block the tool. I mean, unless you want to pay for the infrastructure improvements for the websites your product is assaulting.
In my experience these web agents are relatively expensive to run and are very slow. Admittedly I don’t browse HN frequently but I’d be interested to read some of these agent abuse stories, if any stand out to you.
(I’ve been googling for ai agent website abuse stories and not finding anything so far)
Yeah, sure! We mentioned a little bit in the post that we stumbled upon this problem while working on a synthetic data contract. Internally, we were building software at the time to fully automate the process of creating synthetic data, all the way from purchasing assets to building scene layouts to rendering images. We realized after 2 months of building that there wasn't a great need for it, nor could we build something minimal but feature-complete that we could give to people and iterate on. We also saw the writing on the wall re: building 3D world models (see WorldLabs and NVIDIA's new 3D world models). We're excited to see where this goes!
It's quite a crowded market, browser workflow automation. What are you guys trying to do differently? For me, as someone who does a bunch of browser automation tasks, the real issue is stability. So many tools fall over on non-trivial workflows like complex forms with tabs. Fix that and you may open up a large testing automation market.
Some of the things we're trying to do differently are 1) making flows deterministic by using vision models that give a consistent output for a given image and input query, and 2) breaking flows up into smaller atomic actions to improve consistency.
RE: workflows w/ complex forms and tabs -- do you have some sites that are good examples of this? We'd love to see how Simplex does.
I'm noticing a big increase in crawling activity on the sites I manage, likely from bots collecting data for LLMs. Most of them don't use proper user agents and of course don't stick to any scraping best practices that the industry has developed over the past two decades.
This trend is creating a lot of headaches for developers responsible for maintaining heavily scraped sites.
You have LinkedIn and Twitter examples, where you're very likely violating their TOS as they prohibit any scraping.
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
Scraping is semi-controversial, but in this case it's just a user with a Chrome extension visiting the site. LinkedIn has lots and lots of shady patterns around showing different results to Googlebot vs. regular users to encourage logged-in sessions. Many other sites like Pinterest and Twitter/X employ similar annoying patterns.
Imo, users should be allowed to use automation tools to access websites and collect data. Most of these sites thrive off of user-generated content anyway; Reddit, for example, is built on UGC. Why shouldn't people be able to scrape it?
If, say, I built an extension that allows people to scrape things on demand, and the extension also sends that data to my servers, removing PII in the process, would that be allowed?
Technically it's acting on behalf of a proactive user in Chrome, so IMHO it's non-"robotic". But, to be fair, this was also Perplexity's excuse: they argued they are a legitimate, non-robotic user agent (and thus don't need to respect robots.txt) because they only make requests at the time of a user query. We need a new way of understanding what it even means to be a legitimate human user agent. The presence of AIs as client-side catalysts will only grow.
The parent didn't say the scraping was "illegal", but that it violated ToS.
These are entirely different things. The upshot of the proceedings is that while the courts ruled there weren't sufficient grounds for an injunction to stop the scraping, the scraping was nonetheless still injurious to the plaintiff and had breached their User Agreement -- thus allowing LinkedIn to compel hiQ toward a settlement.
From Wikipedia:
The 9th Circuit ruled that hiQ had the right to do web scraping.[1][2][3] However, the Supreme Court, based on its Van Buren v. United States decision,[4] vacated the decision and remanded the case for further review in June 2021. In a second ruling in April 2022 the Ninth Circuit affirmed its decision.[5][6] In November 2022 the U.S. District Court for the Northern District of California ruled that hiQ had breached LinkedIn's User Agreement and a settlement agreement was reached between the two parties.[7]
I see scraping as equivalent to a cherry-tree-shaking machine :-) If you are authorized to pick cherries from a tree, then why not use a tree shaker and do the job in seconds? But make sure you don't kill the tree in the process. Also, the tree owner must have the right to deny you the use of the tree shaker.
The record/replay is definitely an interesting direction. The browser automation space is getting super crowded though (even within YC), so curious to hear how you differentiate from:
- BrowserUse
- Browserbase
- BrowserBook
- Skyvern