Surely there's AI usage that's not morally reprehensible. Models that are traine...

qingcharles · 2025-12-01T07:17:02 1764573422

How many models are only trained on legal[0] data? Adobe's Firefly model is one commercial model I can think of.

[0] I think the data can be licensed, and not just public domain; e.g. if the creators are suitably compensated for their data to be ingested

Eisenstein · 2025-12-01T08:20:17 1764577217

> How many models are only trained on legal[0] data?

None, since 'legal' for AI training is not yet defined, but Olmo is trained on the Dolma 3 dataset, which consists of:

1. Common Crawl

2. GitHub

3. Wikipedia, Wikibooks

4. Reddit (pre-2023)

5. Semantic Scholar

6. Project Gutenberg

austinjp · 2025-12-01T09:17:22 1764580642

Nice, I hadn't heard of this. For convenience, here are HuggingFace models trained on Dolma: