
Seems like we might need a section of internet that is off limits to robots.


Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
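Worth remembering that robots.txt compliance is entirely voluntary. A minimal sketch with Python's standard library (the URLs are placeholders): a polite crawler gates every fetch on this check, and an impolite one simply never calls it.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # placeholder site
    rp.read()

    # A well-behaved crawler checks before every request;
    # nothing on the server side enforces this for a crawler that skips it.
    if rp.can_fetch("MyCrawler/1.0", "https://example.org/some/page"):
        ...  # fetch the page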


There are websites where you can only register in person, with two existing members vouching for you. It can probably still be gamed, but it sounds like a great barrier to entry for robots (for now).
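To make the vouching rule concrete, a minimal sketch (all names hypothetical) of gating account activation on two distinct existing members:

    def can_activate(candidate_id, vouches, members):
        # vouches: set of (voucher_id, candidate_id) pairs recorded in person
        backers = {v for (v, c) in vouches if c == candidate_id and v in members}
        return len(backers) >= 2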


What prevents someone from getting access and then running an authenticated headless browser to scoop the data?
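Concretely, nothing technical: once you hold a valid session, the server can't tell a headless browser from a member's own browser by the request alone. A sketch using Playwright (cookie value and URLs are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        # reuse a session cookie exported from a real logged-in browser
        context.add_cookies([{"name": "session", "value": "PLACEHOLDER",
                              "domain": "example.org", "path": "/"}])
        page = context.new_page()
        page.goto("https://example.org/members-only")
        print(page.content())
        browser.close()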


Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members have to vouch for you).
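A rough sketch of the kind of check admins could run (the log schema is assumed and the threshold is arbitrary): flag any account whose request volume dwarfs the median member's.

    from collections import Counter
    from statistics import median

    def flag_possible_scrapers(request_log, factor=20):
        # request_log: iterable of (account_id, timestamp) pairs
        counts = Counter(account for account, _ in request_log)
        typical = median(counts.values())
        return [acct for acct, n in counts.items() if n > factor * typical]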


Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy. I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit. And at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).


That's entirely possible, especially if the site is small and not run by people with access to resources like physical security, legal, etc.


It's here and it's called Discord.


I have bad news for you if you think non-paywalled, non-phone-number-required Discord communities are immune to AI scraping, especially since it costs less than hammering traditional websites: in a real-time chat context, the push-on-change event is done for you.

Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.

If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
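To make the push-on-change point above concrete: a joined account receives every message as a gateway event, so "scraping" is just logging. A hedged sketch with the discord.py library (the token and filename are placeholders):

    import json
    import discord

    intents = discord.Intents.default()
    intents.message_content = True  # privileged intent; must be enabled

    client = discord.Client(intents=intents)

    @client.event
    async def on_message(message):
        # every message in every visible channel arrives as a push event;
        # no polling, no page fetches, no robots.txt in sight
        record = {"channel": str(message.channel),
                  "author": str(message.author),
                  "content": message.content}
        with open("archive.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")

    client.run("PLACEHOLDER_TOKEN")

(This sketch uses a bot account; a scraper could use a regular user account over the same gateway protocol, which Discord's terms forbid but can't technically prevent.)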


Honestly, someone should scrape the algebraic topology Discord and feed it to an AI; it'd be a nice training set.


Or we could just accept that LLMs can only output what we have put in, and that calling them "AI" was a misnomer from day one.


Why would you accept a lie?


I'm not sure what you mean, but I'm trying to say that our current LLMs are not artificially intelligent, and that calling them "AI" has confused a lot of the lay public.



