
Surely there's AI usage that's not morally reprehensible.

For example, models trained only on public domain material, used for value-add purposes rather than simply marketing or gamification gimmicks...





How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.

[0] I think licensed data counts too, not just public domain; e.g. if the creators are suitably compensated for having their data ingested.


> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which consists of:

1. Common Crawl

2. GitHub

3. Wikipedia, Wikibooks

4. Reddit (pre-2023)

5. Semantic Scholar

6. Project Gutenberg

* https://arxiv.org/pdf/2402.00159


Nice, I hadn't heard of this. For convenience, here are the Dolma dataset and the Hugging Face models trained on it:

https://huggingface.co/datasets/allenai/dolma

https://huggingface.co/models?dataset=dataset:allenai/dolma



