
I don't know a single website owner that is hostile to any search engine web crawler, unless that web crawler is slamming them with so many requests they're effectively getting DDoS'd.


Reddit, Twitter, and Facebook are just three to start. There are plenty that disallow all crawlers except Google's. We've crawled a significant amount of the web now, and just because you're unaware of them doesn't mean they don't exist. I can attest they're there.

I'll also add that plenty of sites don't block any engine outright but confer special privileges on Googlebot, which, depending on the site and its size, amounts to almost the same thing.

Edit: And to limit confusion: Reddit hides its sitemap and denies access to it. Theirs is not an outright ban -- it just makes crawling a lot harder.
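For reference, the "everyone but Googlebot" pattern is only a few lines of robots.txt. An illustrative sketch, not copied from any of those sites:

    User-agent: Googlebot
    Allow: /

    User-agent: *
    Disallow: /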


Content kingdoms have their own reasons to be hostile to everyone searching, including Google. Even when their content is "searchable" by Google, they'll tease you with a snippet and gate almost all of it.

They are part of the story of why Google is degrading, not of why it's doing well.


Can't you just set your user agent to Googlebot? Or to something which isn't Googlebot but matches the most common regexes, like "Not-Googlebot/2.1"?
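(For the unfamiliar: spoofing the UA is a one-line change on the client. A minimal Python sketch, using Google's published Googlebot UA string and a placeholder URL:)

    import urllib.request

    # Claim to be Googlebot; the UA string is the one Google publishes,
    # the URL is just a placeholder.
    req = urllib.request.Request(
        "https://example.com/",
        headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                               "+http://www.google.com/bot.html)"},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status)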


A lot of sites do some variation of Google's recommended verification when you set Googlebot as your UA; certainly the larger, more sophisticated sites do.

https://support.google.com/webmasters/answer/80553?hl=en

So unless your IP reverse-resolves to a Google domain, you're SOL, and it's also just generally frowned upon. We have our own UA, WhizeBot, with an email contact so you can let us know if our crawler is doing anything you'd rather it not.

There have been a few legal cases that protect scraping publicly available information on the web, but we'd rather follow robots.txt to avoid the potential for shenanigans in any case.
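A sketch of what honoring robots.txt looks like with Python's standard library (the WhizeBot UA matches the example above; the URL is a placeholder):

    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt, then ask whether our UA
    # is allowed to crawl a given path.
    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()
    print(rp.can_fetch("WhizeBot", "https://example.com/some/page"))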


wrt legal cases, are you referring to hiQ v. LinkedIn?


I am


That's "poor man's cloaking". Most people that genuinely care if it's Googlebot or not will verify it appropriately by doing a DNS lookup and a reverse DNS of that IP to ensure it's one of Google's IPs.

'Faking' the UA is much more likely to annoy site owners and end up with a permanent block.
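A minimal sketch of that check (assuming Python; a real implementation would cache results and handle IPv6 and DNS timeouts more carefully):

    import socket

    def is_real_googlebot(ip: str) -> bool:
        """Reverse DNS on the IP, require a googlebot.com/google.com
        hostname, then forward DNS to confirm it resolves back to the
        same IP."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(host)  # forward lookup
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_real_googlebot("66.249.66.1"))  # an IP in Google's published crawler range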


I do. Some third-party search crawlers are just badly programmed, and after you get burned a few times you just want to deny anyone who isn't one of the main players. I think they're basically startups with a lot of money to spend on crawl compute but who haven't really figured out their crawl engine, and it can go wild on your site.

You also have bots that seem to be credential stuffing, bots that seem to be content scraping (stuff with the same typos shows up elsewhere after their visits, really obvious on new/fresh articles), bots that seem to be scouting for copyright claims, rando bots (maybe comment sentiment analysis for stock trading), etc.

Google is much more welcome by comparison.


Right, but that's not Google's fault; that's explicitly the fault of everyone who does a shit job of crawling.

Generally I wouldn't classify a lot of those as search engine crawlers so much as web scrapers looking to reuse data, not just surface it.


Try to crawl Amazon.


Right, but that’s not really anti-search-engine, that’s anti-people-who-want-our-actual-data.

Like I said in another comment, if you're someone who just wants to surface Amazon results, they love you. It's the people who want to take advantage of Amazon data in some other way that they're trying to stop.



