
I think you should read the case material for NY Times v OpenAI and Microsoft.

It literally says that within ChatGPT is stored, verbatim, large archives of NY Times articles and that they were able to retrieve them through their API.



..which makes no sense. It is either an argument from ignorance or of purposeful deceit. There is no coherent data corpus (compressed or not) inside ChatGPT. What is stored are weights that produce a string of tokens, and those tokens can recreate excerpts of the data it was trained on with some imperfect level of accuracy.
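For what it's worth, here is a toy sketch (plain numpy, purely illustrative, nothing to do with OpenAI's actual code) of what "weights creating a string of tokens" means: the weights turn the previous token into a probability distribution over the next token, and generation is just repeated sampling from that. There is no article archive being looked up; memorized passages only appear when the weights happen to assign them very high probability.

    import numpy as np

    # Toy "language model": a random weight matrix scores every next token.
    rng = np.random.default_rng(0)
    vocab = ["the", "times", "reported", "that", "."]
    W = rng.normal(size=(len(vocab), len(vocab)))

    def next_token_probs(prev_id):
        logits = W[prev_id]                      # scores from the weights
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()                   # softmax -> probabilities

    token_id = 0                                 # start from "the"
    out = [vocab[token_id]]
    for _ in range(6):
        token_id = rng.choice(len(vocab), p=next_token_probs(token_id))
        out.append(vocab[token_id])
    print(" ".join(out))                         # sampled tokens, not a lookup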

That imperfect regurgitation, I agree, is problematic, and OpenAI doesn't have the right to disseminate it.

But that doesn't mean OpenAI doesn't have the right to train on it.

Content creators are doing a purposeful sleight of hand to conflate "outputting copyrighted data" with "training on copyrighted data".

It's illegal for me to read an NYT article and recite it from memory onto my blog.

It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.


> Content creators are doing a purposeful sleight of hand to conflate "outputting copyrighted data" with "training on copyrighted data".

I don't think so, I think it's usually argued as two different things.

The "training on copyrighted data" argument is usually that we never licensed this work for this sort of use and it is different enough from previously licensed uses that it should be treated differently.

The "outputting copyrighted data" argument is somewhat like your output is so similar as to constitute a (at least) partial copy.

Another argument is that licensed data is whitewashed by being run through a model. So you could have GPL-licensed open source code run through a model and output exactly the same, but because it has been output by the model it is considered "cleaned" of the GPL restrictions. Clearly this output should still be GPL'd.

> It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.

What if I compress the NYT article with gzip? What if I build an LLM that always replies with the full article at 99% accuracy? Where is the line?
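For concreteness, the gzip end of that spectrum looks like this (Python standard library, placeholder text standing in for the article): the compressed blob contains the exact article and decompresses back byte-for-byte, which is what makes it feel different from weights that only approximately reproduce excerpts.

    import gzip

    article = b"(imagine the full text of an NYT article here)"  # placeholder
    blob = gzip.compress(article)                 # the "stored" form
    assert gzip.decompress(blob) == article       # exact copy is recoverable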

This is not a technical issue; we need to decide on this just like we did with copyright, trademarks, etc. Regardless of what you think, this is not a non-issue, and we can't use the same rules as we did up until now unless we treat all ML systems either as pure duplication machines or as humans, and neither treatment seems to solve the issues.


> Another argument is that licensed data is whitewashed by being run through a model. So you could have GPL-licensed open source code run through a model and output exactly the same, but because it has been output by the model it is considered "cleaned" of the GPL restrictions. Clearly this output should still be GPL'd.

I don't think anybody is making that argument. The NY Times claims to have gotten ChatGPT to spit out NY Times articles verbatim but there is considerable doubt about that. Regardless, everyone agrees that a verbatim (or close to) copy is copyright violation, even OpenAI. Every serious model has taken steps to prevent that sort of thing.


Both ChatGPT and Copilot will happily spit out blocks of code without any form of license. When you say "Every serious model has taken steps to prevent that sort of thing", do you mean they are hiding it or really changing the training data?


How much code is needed for copyright to take effect? A whole program, a file, a block, a line, a variable name? Since there isn't a legally accepted answer yet, I am not sure what can be done other than litigating cases where it goes too far.


Would this example be over that line for you? https://x.com/mitsuhiko/status/1410886329924194309

I think most people would agree that function is copyrightable if recreated verbatim.


When you describe ChatGPT as just a model with weights that can create a string of tokens, is it any different from any lossless compression algorithm?

I'm sure if I had a JPEG of some copyrighted raw image it could still be argued that it is the same image. JPEG is imperfect; the result you get is the same every time you open it, but it's not the same as the original input data.

ChatGPT would give you the same output every time, and it does if you turn off the "temperature" setting. Introduce a bit of randomness into a JPEG decoder and functionally what's the difference? A slightly different string of tokens for ChatGPT versus a slightly different collection of pixels for a JPEG.
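A rough toy version of that temperature knob (plain numpy, illustrative only): at temperature 0 the pick is just argmax and therefore deterministic; any positive temperature adds randomness to the choice.

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.0, 0.5])            # toy next-token scores

    def sample(logits, temperature):
        if temperature == 0:
            return int(np.argmax(logits))         # always the same token
        p = np.exp(logits / temperature)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))  # varies run to run

    print(sample(logits, 0))      # deterministic
    print(sample(logits, 1.0))    # stochastic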


Did you mean lossy compression algorithm? That would make sense.


> There is no coherent data corpus (compressed or not) in ChatGPT.

I disagree.

If you can get the model to output an article verbatim, then that article is stored in that model.

The fact that it’s not stored in the same format is meaningless. It’s the same content regardless of whether it’s stored as plaintext, compressed text, PDF, PNG, or weights in a model.

The fact that you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant. Text files need to be interpreted in order to display them meaningfully, as well.
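A small illustration of that format point (Python standard library, placeholder sentence): the same text stored as plaintext, gzip, or base64 decodes back to identical content, even though the stored bytes look nothing alike.

    import base64, gzip

    text = "The same sentence."                   # placeholder content
    as_plain = text.encode()
    as_gzip = gzip.compress(as_plain)
    as_b64 = base64.b64encode(as_plain)
    assert gzip.decompress(as_gzip).decode() == text   # same content back
    assert base64.b64decode(as_b64).decode() == text   # same content back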


> If you can get the model to output an article verbatim, then that article is stored in that model.

You can't get it to do that, though.[1]

The NYT vs OpenAI case, if anything, shows that even with significant effort trying to get a model to regurgitate specific work, it cannot do it. They found articles it had overfit on due to snippets being reposted elsewhere across the internet, and they could only get it to output those snippets, and not in the correct order. The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

No one who knows anything about these models disagrees that overfitting can cause this sort of behavior, but the overwhelming majority of the data in these models is not overfit and they take a lot of care to resolve the issue - overfitting isn't desirable for general purpose model performance even if you don't give a shit about copyright laws at all.

People liken it to compression, like the GP mentioned, and in some ways, it really is. But in the most literal sense, even with the incredibly efficient "compression" the models do, there's simply no way for them to actually store all the training data people seem to think is hidden in there, recoverable if you just prompt the right way. The reality is only the tiniest fraction of overfit data can be recovered this way. That doesn't mean that the overfit parts can't be copyright infringing, but that's a very separate argument from the general idea that these models are constantly putting out a deluge of copyrighted material.
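A back-of-envelope check of that, with made-up but order-of-magnitude-plausible numbers (not any vendor's actual figures): the weights are far smaller than the training text, so verbatim storage of the whole corpus is arithmetically impossible.

    params = 100e9               # assume ~1e11 parameters
    weight_bytes = params * 2    # ~16 bits per parameter -> ~200 GB
    tokens = 10e12               # assume ~1e13 training tokens
    text_bytes = tokens * 4      # ~4 bytes of text per token -> ~40 TB
    print(f"{weight_bytes/1e9:.0f} GB of weights vs {text_bytes/1e12:.0f} TB of text")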

(None of this goes for toy models with tiny datasets, people intentionally training models to overfit on data, etc. but instead the "big" models like GPT, Claude, Llama, etc.)

1. https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...


> The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

> Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

OK, that’s less material than I believed, which shows the details matter. But we agree that the overfit material, while limited, is stored in the model.

Of course, this can be (and surely is) mitigated by filtering the output, as long as the product is the output and not the model itself.


>The fact that you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant.

I disagree. Granted, I'm a layman and not a lawyer, so I have no clue how the court feels. But I can certainly make very specialized algorithms to produce whatever output I want from whatever input I want, and that shouldn't let me declare any input as infringing on any rights.

For the reductio ad absurdum example: I demand everyone stop using spaces, because with the algorithm 'remove a space and add my copyrighted text', any input containing a space produces an identical copy of my copyrighted text.

For the less absurd example: if I took any clean model trained without your copyrighted text, and brute-forced prompts and settings until I produced your text, is the model violating the copyright or are my inputs?


Well, I think this is why it’s not settled yet. However, the law depends on many reasonableness tests.


> It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.

It’s not that clear-cut. It falls under the "fair use doctrine". Section 107 of the US copyright law states that the resolution depends on:

> (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

Another thing we need to consider is that the law was written with the limitations of the human mind as an unconscious factor (i.e. not many people would be able to recite War and Peace verbatim from memory). This just brings up the fact that copyright law needs a complete rethink.



