
> There is no coherent data corpus (compressed or not) in ChatGPT.

I disagree.

If you can get the model to output an article verbatim, then that article is stored in that model.

That it’s not stored in the same format is meaningless. It’s the same content regardless of whether it’s stored as plaintext, compressed text, a PDF, a PNG, or weights in a model.

That you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant. Text files need to be interpreted in order to display them meaningfully, too.
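
To make that concrete, here's a trivial sketch (purely illustrative, obviously nothing like how a model actually stores things): the same content, held in three different "formats", each needing a different "algorithm" to get it back out.

    import base64, zlib

    article = "Exactly the same sentence, stored three different ways."

    # Three different "storage formats" for the same content.
    as_plaintext  = article.encode()
    as_compressed = zlib.compress(article.encode())
    as_base64     = base64.b64encode(article.encode())

    # Each needs a different "algorithm" to recover the content.
    assert as_plaintext.decode() == article
    assert zlib.decompress(as_compressed).decode() == article
    assert base64.b64decode(as_base64).decode() == article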



> If you can get the model to output an article verbatim, then that article is stored in that model.

You can't get it to do that, though.[1]

The NYT vs OpenAI case, if anything, shows that even with significant effort trying to get a model to regurgitate a specific work, it cannot do it. They found articles it had overfit on because snippets had been reposted elsewhere across the internet, and they could only get it to output those snippets, and not in the correct order. The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

No one who knows anything about these models disagrees that overfitting can cause this sort of behavior, but the overwhelming majority of the data in these models is not overfit, and the people training them take a lot of care to avoid it: overfitting isn't desirable for general-purpose model performance even if you don't give a shit about copyright law at all.

People liken it to compression, like the GP mentioned, and in some ways it really is. But even with the incredibly efficient "compression" the models do, there's simply no way for them to actually store all the training data people seem to think is hidden in there, waiting for the right prompt. The reality is that only the tiniest fraction of overfit data can be recovered this way. That doesn't mean the overfit parts can't be copyright-infringing, but that's a very different argument from the general idea that these models are constantly putting out a deluge of copyrighted material.
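
Back-of-envelope, using only ballpark assumptions (say a 70B-parameter model stored at 2 bytes per weight, trained on ~15T tokens at roughly 4 bytes of text per token; the exact numbers don't change the conclusion):

    # Rough capacity comparison; all figures are ballpark assumptions.
    params          = 70e9    # model parameters
    bytes_per_param = 2       # fp16/bf16 weights
    tokens          = 15e12   # training tokens
    bytes_per_token = 4       # very rough average for English text

    weight_bytes   = params * bytes_per_param   # ~140 GB of weights
    training_bytes = tokens * bytes_per_token   # ~60 TB of training text

    print(f"weights:  {weight_bytes / 1e9:.0f} GB")
    print(f"training: {training_bytes / 1e12:.0f} TB")
    print(f"ratio:    ~{training_bytes / weight_bytes:.0f}x more training text than weights")

Even granting a very generous compression factor, most of the training text simply has nowhere to live inside the weights.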

(None of this goes for toy models with tiny datasets, people intentionally training models to overfit on data, etc. but instead the "big" models like GPT, Claude, Llama, etc.)

1. https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...


> The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

> Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

OK, that’s less material than I believed, which shows the details matter. But we agree that the overfit material, while limited, is stored in the model.

Of course, this can be (and surely is) mitigated by filtering the output, as long as the product is the output and not the model itself.
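
A coarse version of that filtering is easy to sketch (purely illustrative; I have no idea what any vendor actually runs): compare the candidate output against a protected text by shared word n-grams and refuse to return it above some threshold.

    def ngrams(text: str, n: int = 8) -> set:
        """Set of n-word shingles from a text."""
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def looks_like_copy(output: str, protected: str, threshold: int = 3) -> bool:
        """Flag the output if it shares too many 8-word shingles with a protected text."""
        return len(ngrams(output) & ngrams(protected)) >= threshold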


> That you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant.

I disagree. Granted, I'm a layman and not a lawyer, so I have no clue how the courts would see it. But I can certainly make very specialized algorithms to produce whatever output I want from whatever input I want, and that shouldn't let me declare any input as infringing on any rights.

For the reductio ad absurdum example: I demand everyone stop using spaces, because under my algorithm 'remove a space and add my copyrighted text', any text containing a space produces an identical copy of my copyrighted work.
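
Spelled out, the absurd "algorithm" is just this; the point is that the retrieval step, not the input, is doing all the work:

    COPYRIGHTED_TEXT = "my copyrighted text"   # stand-in for the protected work

    def absurd_decoder(any_text_with_a_space: str) -> str:
        """'Remove a space and add my copyrighted text.'"""
        return any_text_with_a_space.replace(" ", "", 1) + COPYRIGHTED_TEXT

    # By this logic, every document containing a space "stores" my work.
    print(absurd_decoder("a totally innocent sentence"))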

For the less absurd example: if I took any clean model trained without your copyrighted text, and brute-forced prompts and settings until I produced your text, is your model violating the copyright, or are my inputs?


Well, I think this is why it’s not settled yet. However, the law depends on many reasonableness tests.



