I love what he is doing but really wish the voting interface were better. Also, I wonder what the results would be if there were AI-assisted stories, but maybe real authors would hate to do that.
1. Open datasets for pretraining, including the tooling used to label and maintain them
2. Open model, training, and inference code. Ideally with the research paper that guides the understanding of the approach and results. (Typically we have the latter, but I've seen some cases where that's omitted.)
3. Open pretrained foundation model weights, fine tunes, etc.
Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.
These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.
I understand the reasoning, and I hope there is legislation in the future that basically goes "If you can't produce the data, you can't charge more than this for it". In effect, LLM producers would have to treat their product as a commodity that can only be priced based on the compute resources plus some overhead.
It is pirated material, or material that breaks various terms of service, but as I understand it, it is the stuff you can find in Anna's Archive plus a bunch of "artificial" training data generated by querying OpenAI's ChatGPT and other LLMs.
Many Chinese tech giants already had A100s, and maybe some H100s, before the sanctions. After the first wave of sanctions (which banned the A100 and H100), NVIDIA released the A800 and H800, which are nerfed versions of the A100 and H100.
Then there was a second round of sanctions that banned the H800, the A800, and everything down to much weaker cards like the A6000 and 4090. So NVIDIA released the H20 for China. The H20 is an especially interesting card because it has weaker compute but larger VRAM (96 GB instead of the typical 80 GB for the H100).
And of course they could have smuggled some more H100s.
If national security interests drove development, the US would have local manufacturing of current-process-node chips instead of being dependent on TSMC.
If 5nm semiconductors were essential for some core military capabilities, they'd start building the whole production chain domestically in a hurry, at eye-watering cost, and probably finish it over budget and 15 years late.
That could be a completely different problem. In China many people run PCDN (P2P CDN) nodes for profit. The ISPs detect (and ban) such PCDN nodes by checking the ratio between your uploaded and downloaded traffic. To shift this ratio toward downloads and thus avoid being detected, these people download popular torrents again and again without uploading at all.
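To make the ratio trick concrete, here is a rough sketch of how such a check might work. Everything in it is an assumption for illustration: the ratio is taken as upload divided by download, and the threshold, the traffic figures, and the function names are all made up. Real ISP detection is not public and is likely more involved.

```python
# Hypothetical sketch of an upload/download ratio check.
# The threshold and all names here are illustrative assumptions,
# not anything an actual ISP is known to use.

GB = 10**9
SUSPICIOUS_RATIO = 3.0  # assumed cutoff: uploading 3x more than downloading


def upload_download_ratio(uploaded_bytes: int, downloaded_bytes: int) -> float:
    """Return uploaded / downloaded, handling the zero-download case."""
    if downloaded_bytes == 0:
        return float("inf") if uploaded_bytes > 0 else 0.0
    return uploaded_bytes / downloaded_bytes


def looks_like_pcdn(uploaded_bytes: int, downloaded_bytes: int,
                    threshold: float = SUSPICIOUS_RATIO) -> bool:
    """Flag a subscriber whose traffic is dominated by uploads."""
    return upload_download_ratio(uploaded_bytes, downloaded_bytes) > threshold


# A PCDN node that uploads 900 GB and downloads 100 GB stands out.
flagged_before = looks_like_pcdn(900 * GB, 100 * GB)

# Downloading an extra 400 GB of torrents (without seeding) dilutes the ratio.
flagged_after = looks_like_pcdn(900 * GB, (100 + 400) * GB)

print(flagged_before, flagged_after)
```

Under these made-up numbers the padding download is enough to slip under the cutoff, which is the incentive for leeching popular torrents without uploading.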