What a terrible take. Stefan Baack is definitely not an ML researcher, and I can't imagine any ML researcher gave this piece a read-over.
> Common Crawl’s mission as an organization does not easily align with the needs of trustworthy AI development. Its guiding principle is that less curation of the provided data enables more research and innovation by downstream users. Common Crawl therefore deliberately does not remove hate speech, for example, because it wants its data to be useful for researchers studying hate speech. However, such data is undesirable when training LLMs because it might lead to harmful outputs by the resulting models.
If Common Crawl removed this data, how would we train models not to be toxic? It's simply false that enough data curation lets us avoid creating toxic models.
We've known this for many years now; https://arxiv.org/pdf/1811.08489.pdf is one early and popular example. Even carefully curating the data to balance it by gender doesn't eliminate gender bias. No amount of curation comes anywhere close to fixing toxic models.
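To make the first point concrete: steering a model away from toxicity requires labeled toxic examples, and those have to come from somewhere, typically an unfiltered corpus. A minimal sketch using scikit-learn, where the labeled samples are hypothetical stand-ins for hate speech mined from something like Common Crawl:

    # Toy toxicity classifier. The labeled examples below are hypothetical
    # stand-ins for hate speech mined from an UNFILTERED crawl; if the
    # corpus had been scrubbed, there would be nothing to learn from.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = [
        "you people are subhuman and don't belong here",   # toxic
        "that group ruins everything they touch",          # toxic
        "the hiking trail was beautiful this weekend",     # benign
        "great writeup, thanks for sharing the code",      # benign
    ]
    labels = [1, 1, 0, 0]  # 1 = toxic, 0 = benign

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(texts, labels)

    # A classifier like this is exactly the filter/reward signal used to
    # steer generative models AWAY from toxicity at training or decoding time.
    print(clf.predict_proba(["those people don't belong here"])[:, 1])

Strip the toxic examples out of the corpus and this whole approach becomes impossible.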
> It could also enforce more transparency around generative AI by requiring AI builders to attribute their usage of Common Crawl.
This is a practically worthless level of transparency. OK, so a model used some subset of Common Crawl. Now what? What practical knowledge have we gained? None. This is the kind of low-effort, stamp-collecting approach to transparency that helps no one.
> By themselves, pre-trained LLMs are not very useful to most people.
Nonsense. There are countless applications. They are not great chatbots out of the box, but that's a totally different claim.
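One example among many: a plain pre-trained LM with no chat tuning at all already works as a zero-shot classifier, just by scoring candidate continuations. A minimal sketch with GPT-2 via the transformers library (the prompt format here is my own invention, not anything canonical):

    # Zero-shot sentiment classification with a raw pre-trained LM:
    # pick the label whose continuation the model finds most likely.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def total_logprob(text):
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)  # loss = mean NLL per predicted token
        return -out.loss.item() * (ids.shape[1] - 1)

    review = "The plot was incoherent and the acting was worse."
    scores = {
        label: total_logprob(f"Review: {review}\nThe review is {label}.")
        for label in ("positive", "negative")
    }
    print(max(scores, key=scores.get))  # expected: "negative"

The same trick covers retrieval scoring, completion, feature extraction, and more. No chatbot required.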
> there should be a greater diversity of filtered Common Crawl versions
This is the main message. It appears like 10 times in the writeup.
Not only is it technically unsound, removing this data is known not to make a difference; it can even make things worse.
Whose filters should they apply? The author's? The US right's? The Chinese Communist Party's?
It's easy to say "I want fairness and everything nice." But when the rubber hits the road, it's just nonsense.
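To make the "whose filters" problem concrete: run the same corpus through two different blocklists and you get two different "clean" datasets, and choosing between them is an editorial decision, not a technical one. A toy sketch with hypothetical curators and blocklists:

    # The same corpus, two hypothetical curators, two different
    # "safe" training sets. Neither outcome is neutral.
    corpus = [
        "coverage of the recent election protests",
        "an essay on religious freedom",
        "a recipe for sourdough bread",
        "a labor union organizing guide",
    ]
    blocklists = {
        "curator_A": {"protests", "union"},
        "curator_B": {"religious"},
    }
    for curator, blocked in blocklists.items():
        kept = [doc for doc in corpus
                if not any(word in doc for word in blocked)]
        print(curator, "->", kept)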
I say this as an ML researcher working in this area.