Models that are trained only on public domain material [0]. For value-add usage, not simply marketing or gamification gimmicks...
[0] I think the data can be licensed, and not just public domain; e.g. if the creators are suitably compensated for their data to be ingested
None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, which is drawn from:
1. Common crawl
2. Github
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma