Isn’t this before any curation has happened? I looked at it, I can see why it looks bad, but if they’re really being open about the whole pipeline, they have to include everything. Giving them a hard time for it only promotes keeping models closed.
That said, I like to think that if it were my dataset, I would have shuffled that part down the list so it didn’t show up in the HF preview.
It says it’s Common Crawl, which I interpret to mean this is a generic web scrape dataset; presumably they filter out the stuff they don’t want before pretraining. You’d have to do some ablation testing to know what value it adds.