Yup, current LLMs are trained on the best and the worst we can offer. I think there's value in training smaller models with strictly curated datasets, to guarantee they've learned from trustworthy sources.
> to guarantee they've learned from trustworthy sources.
I don't see how this will ever work. Even in hard science there's debate over what content is trustworthy and what isn't. Imagine trying to declare your source of training material on religion, philosophy, or politics "trustworthy".
"Sir, I want an LLM to design architecture, not to debate philosophy."
But really, you leave the curation to real humans, to institutions with ethical procedures already in place. I don't want Google or Elon dictating what truth is, but I wouldn't mind if NASA or other aerospace institutions dictated what counts as truth in that space.
Of course, the dataset should have a list of every document/source used, so others can audit it. I know, unthinkable in this corporate world, but one can dream.
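To make the "auditable list of sources" idea concrete, here's a minimal sketch of what one manifest entry could look like; the field names, the placeholder URL, and the JSONL layout are all my own assumptions, not any existing standard:

```python
import hashlib
import json

def manifest_entry(text: str, source_url: str, license_name: str, retrieved_at: str) -> dict:
    """Build one auditable record for a training document (hypothetical schema)."""
    return {
        "source_url": source_url,                                   # where the document came from
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(), # lets auditors verify the exact bytes used
        "license": license_name,
        "retrieved_at": retrieved_at,
    }

if __name__ == "__main__":
    doc_text = "Apollo 11 Mission Report ..."  # stand-in for the full document text
    entry = manifest_entry(
        doc_text,
        source_url="https://example.org/apollo11-mission-report.pdf",  # placeholder URL
        license_name="public domain (US government work)",
        retrieved_at="2024-01-15",
    )
    # One JSON line per document; the whole file is the dataset manifest others can audit.
    print(json.dumps(entry))
```

Publishing a file like that alongside the weights wouldn't settle what's "trustworthy", but it would at least let outsiders check what the model was actually fed.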