AI companies don’t give a shit about ToS. Hell most of the big players actively ... | Hacker News


Havoc on Dec 10, 2024 | on: Sora is here

AI companies don’t give a shit about ToS. Hell, most of the big players actively ignored copyright entirely, in bulk. See the thousands upon thousands of pirated books in the Pile dataset. And right after that news broke, they “fixed” the problem by no longer disclosing training data sources. That’s why the papers for early models, e.g. Llama 1, listed their sources and now nobody does. It’s just an unspoken yet open secret now.

potamic on Dec 10, 2024

How did they get access to pirated books?

leobg on Dec 14, 2024

Anna’s Archive has files specifically for training LLMs. But I’d guess the big players secured their share beforehand by scraping those sites. I have zero proof; it’s just a guess.
