
Seems like we might need a section of internet that is off limits to robots.


Everyone with limited bandwidth has been trying to limit site access to robots. The latest generation of AI web scrapers is brutal and does not respect robots.txt.
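Worth remembering that robots.txt compliance is entirely voluntary. A minimal sketch with Python's standard library (the URLs are placeholders): a polite crawler gates every fetch on this check, and an impolite one simply never calls it.

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # placeholder site
    rp.read()

    # A well-behaved crawler checks before every request;
    # nothing on the server side enforces this for a crawler that skips it.
    if rp.can_fetch("MyCrawler/1.0", "https://example.org/some/page"):
        ...  # fetch the page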


There are websites where you can only register in person, with two existing members vouching for you. It can probably still be gamed, but it sounds like a great barrier to entry for robots (for now).
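To make the vouching rule concrete, a minimal sketch (all names hypothetical) of gating account activation on two distinct existing members:

    def can_activate(candidate_id, vouches, members):
        # vouches: set of (voucher_id, candidate_id) pairs recorded in person
        backers = {v for (v, c) in vouches if c == candidate_id and v in members}
        return len(backers) >= 2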


What prevents someone from getting access and then running an authenticated headless browser to scoop the data?
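Concretely, nothing technical: once you hold a valid session, the server can't tell a headless browser from a member's own browser by the request alone. A sketch using Playwright (cookie value and URLs are placeholders):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        # reuse a session cookie exported from a real logged-in browser
        context.add_cookies([{"name": "session", "value": "PLACEHOLDER",
                              "domain": "example.org", "path": "/"}])
        page = context.new_page()
        page.goto("https://example.org/members-only")
        print(page.content())
        browser.close()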


Admins will see unusual traffic from that account and then take action. Of course it will not be perfect, as there could be a way to mimic human traffic and slowly scrape the data anyway; that's why there is an element of trust (two existing members have to vouch for you).
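A rough sketch of the kind of check admins could run (the log schema is assumed and the threshold is arbitrary): flag any account whose request volume dwarfs the median member's.

    from collections import Counter
    from statistics import median

    def flag_possible_scrapers(request_log, factor=20):
        # request_log: iterable of (account_id, timestamp) pairs
        counts = Counter(account for account, _ in request_log)
        typical = median(counts.values())
        return [acct for acct, n in counts.items() if n > factor * typical]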


Yeah, don't get me wrong, I believe raising the burden of extraction is an effective strategy. I just think it's been solved at scale, i.e. voting rings and astroturfing operations on Reddit. And at the nation-state level I'd just bribe or extort the mods and admins directly (or the IT person, to dump the database).


That's entirely possible, especially if the site is small and not run by people with access to resources like physical security, legal, etc.


It's here and it's called Discord.


I have bad news for you if you think non-paywalled, non-phone-number-required Discord communities are immune to AI scraping, especially since it costs less than hammering traditional websites: in a real-time chat context, the push-on-change event is done for you.

Especially as the company archives all those chats (not sure for how long) and is small enough that a billion-dollar "data sharing" agreement would be a very enticing offer.

If there isn't a significant barrier to access, it's being scraped. And if that barrier is money, it's being scraped but less often.
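To make the push-on-change point above concrete: a joined account receives every message as a gateway event, so "scraping" is just logging. A hedged sketch with the discord.py library (the token and filename are placeholders):

    import json
    import discord

    intents = discord.Intents.default()
    intents.message_content = True  # privileged intent; must be enabled

    client = discord.Client(intents=intents)

    @client.event
    async def on_message(message):
        # every message in every visible channel arrives as a push event;
        # no polling, no page fetches, no robots.txt in sight
        record = {"channel": str(message.channel),
                  "author": str(message.author),
                  "content": message.content}
        with open("archive.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")

    client.run("PLACEHOLDER_TOKEN")

(This sketch uses a bot account; a scraper could use a regular user account over the same gateway protocol, which Discord's terms forbid but can't technically prevent.)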


Honestly, someone should scrape the algebraic topology Discord and feed it to an AI; it'd be a nice training set.


Or we could just accept that LLMs can only output what we have put in, and that calling them "AI" was a misnomer from day one.


Why would you accept a lie?


I'm not sure what you mean, but I'm trying to say that our current LLMs are not artificially intelligent, and that calling them "AI" has confused a lot of the lay public.



