
The idea that AI trained on artist-created content is theft is kind of ridiculous anyway. Transformers aren't large archives of data with needles and thread to sew together pieces. The whole argument is meant to stifle an existential threat, not to halt some illegal transgression. If they cared about the latter, a simple copyright filter on the output of the models would be all that's needed.
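
For what it's worth, here is a minimal sketch of what such an output filter could look like -- the 12-word n-gram threshold and the in-memory source list are illustrative assumptions, not anyone's actual implementation:

    # Hypothetical sketch of a "copyright filter": reject a completion if it
    # shares a long verbatim word n-gram with any indexed source document.
    def has_long_overlap(output: str, sources: list[str], n: int = 12) -> bool:
        words = output.split()
        ngrams = {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 0))}
        return any(gram in doc for doc in sources for gram in ngrams)

    # Usage: drop or regenerate any completion where has_long_overlap(...) is True.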


I fail to see how the argument is ridiculous; and I'll bet that a jury would find the idea that "there is a copy inside" at least reasonable, especially if you start with the premise that "the machine is not a human being."

What you're left with is a machine that produces "things that strongly resemble the original, that would not have been produced, had you not fed the original into the machine."

The fact that there's no "exact copy inside" the machine seems a lot like splitting hairs; like saying "Well, there's no paper inside the hard drive so the essence of what is copyable in a book can't be in it"


If I made a bot that read Amazon reviews and then output a meta-review for me, would that be a violation of Amazon's copyright? (I'm sure somewhere in the Amazon ToS they claim all ownership rights over reviews.)

If it output those reviews verbatim, sure, I can see the issue: the model is overfitting. But if I tweak the model or filter the output to avoid verbatim excerpts, does an Amazon lawyer have solid footing for a "violation of copyright" lawsuit?


As far as I understand, according to current copyright practice: if you sing a song that someone else has written, or pieces thereof, you are in violation. This is also the case if you switch out the instrumentation completely, say play trumpet instead of guitar, or have a male choir sing a female line. And if one makes a medley of many such parts, that doesn't automatically stop being a violation either. So we do have examples of things very far from a verbatim copy being considered violations.


Generally yes. You're talking about "derivative works."


Having exact copies of the samples inside the model weights would be an extremely inefficient use of space, and it would not generalize either. Unless the model generated a copy so close to the original that it would violate copyright law if used, I wouldn't find it very reasonable to think that there is a memorized copy inside the model weights somewhere.


An MP3 file is a lossy copy, but is still copyright infringement.

Copyright infringement doesn't require exact copies.


I didn't say it takes an exact copy for copyright infringement.


A program that can produce copies is the same as a copy. How that copy comes into being (whether out of an algorithm or read from a storage medium) is related, but not relevant.


>A program that can produce copies is the same as a copy.

A program that always produces copies is the same as a copy. A program that merely can produce copies categorically is not.

The Library of Babel[1] can produce copyrighted works, and for that matter so can any random number generator, but in almost every normal circumstance it will not. The same is true for LLMs and diffusion models. While there are some circumstances in which you can produce copies of a work, in natural use that's only for things that come up thousands of times in the training set -- by and large, famous works in the public domain, or cultural touchstones so iconic that they're essentially genericized (one notable copyrighted example is the officially released promo material for movies).

[1] https://libraryofbabel.info/
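
To put rough numbers on "in almost every normal circumstance will not", here is a back-of-envelope sketch; the alphabet size and passage length are assumptions chosen for illustration:

    import math

    # Odds that a uniform random character generator reproduces one specific
    # 5,000-character passage; both figures below are assumptions.
    alphabet_size = 64        # letters, digits, punctuation, space
    passage_length = 5_000

    digits = passage_length * math.log10(alphabet_size)
    print(f"about 1 in 10^{digits:.0f} per attempt")   # roughly 1 in 10^9031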


A human illustrator can also copy existing works, yet they are not criminalized for the non-copies they make. The output of an AI needs to be considered independently of its input. Further, the folly of copyright itself ought to be considered, since no work -- whether solely human in origin (such as speech, unaccompanied song, dance, etc.) or built with technological prosthesis/instrumentality -- is ever made in a cultural vacuum. All art is interdependent. Copyright exists so that art (in the general sense) can be compensated for. But copyright has grown into a voracious entitlement for deeply monied interests, and has long intruded upon the commons and fair use.


human illustrator != machine owned by a Moloch powered mega tech corp


Yeah, that's right. I doubt that a model would generate an image or text so close to a real one as to violate copyright law just by pure chance; the image/text space is incredibly large.


But you're a tech person. I'm trying to think of this from the point of view of e.g. a likely potential jury.

Again: Imagine two AI machines, different in one way: One of them has been fed "Article X" and the other hasn't.

You press buttons on the machine(s) in the same way.

The machine that was fed "Article X" spits out something that looks like Article X, and the one that wasn't, doesn't.

The magic inside, I don't think will much matter.


>But you're a tech person. I'm trying to think of this from the point of view of e.g. a likely potential jury.

Courts can call experts to testify on matters requiring specialized knowledge or expertise.


Absolutely; but again, if I'm the lawyer on the other side, I'm pretty confident that I'm beating any "expert" on this with the simple logic of:

- You put thing into the machine

- You press buttons, it makes obvious derivative work

- You don't put thing into the machine, and it can't do that anymore.

There is "something" in there GENERATING COPIES and we see exactly where it came from, even if we can't identify it in the code or whatever.


But you already have the same situation with people -- 'what you're left with is an artist that produces things that strongly resemble the original, that would not have been produced, had the artist not studied the original work'. Yes it is ridiculous.


Right. But broadly, the law also strongly tends to distinguish humans from machines in this space. (Which, imho, is a very good idea)


I am curious about models like EnCodec or SoundStream. They are essentially codecs informed by the music they are meant to compress, in order to achieve insane compression ratios. The decompression process is indeed generative, since part of the information that is meant to be decoded is in the decoder weights. Does that pass the smell test from copyright law's perspective? I believe such a decoder model is powering GPT-4o's audio decoding.
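
A toy sketch of the idea (not EnCodec's or SoundStream's real architecture or API, just the general shape): the transmitted "bitstream" is only a list of codebook indices, and the codebook itself lives in the decoder's learned weights.

    import numpy as np

    # Toy learned codec: the stream carries only codebook indices, so part of
    # the information needed for reconstruction lives in the decoder weights.
    # The 256-entry codebook and 16-dim frames are arbitrary illustrative picks.
    rng = np.random.default_rng(0)
    codebook = rng.normal(size=(256, 16))       # stands in for learned weights

    def encode(frames):
        # nearest codebook vector per frame -> one byte per frame in the stream
        dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1).astype(np.uint8)

    def decode(indices):
        return codebook[indices]                # reconstruction drawn from weights

    features = rng.normal(size=(100, 16))       # pretend audio feature frames
    reconstruction = decode(encode(features))   # lossy round trip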


Our copyright model isn't sufficient yet. Is putting a work through training/a model enough to clear the transformative-use bar? Even then, that doesn't make you safe from trademarks. And if the model can produce outputs on the other side that aren't sufficiently transformative, then that single instance is a copyright violation.

Honestly, instead of trying to clean up the output, it's much safer to create a licensed input corpus. People haven't because it's expensive and time-consuming. Every time I engage with an AI vendor, my first question is: do you indemnify against copyright violations in your output? I was shocked that Google Gemini/Bard only added that this year.


Nothing will ever protect you from trademark violations because trademarks can be violated completely by accident without knowledge of the real work. Copying is not the issue.


I'm honestly surprised AI-washing hasn't become way more widespread than it is at this point.

I mean, recording a good song is hard. Generating a good song is almost impossible. But my gut feeling would've been that recreating a popular song with plausible deniability would be a lot easier.

Same with republishing bestselling books and related media. (E.g. take Lord of the Rings and feed it paragraph by paragraph into an LLM that you've prompted to rephrase each one in the style of a currently bestselling author.)


I think the distinction between "Lossy Compression" and "Trained AI" is... vague according to the current legal definitions. Or even "lossless" in some cases - as shown by people being able to get written articles output verbatim.

While the extremes are obvious, there's a big stretch of gray in the middle. A similar issue occurs in non-AI art: the difference between inspiration and tracing/copying isn't well defined either, but the current method of dealing with that (case by case, with a human judging the difference) clearly cannot scale to the level at which many people intend to use these tools.


Has anyone been able to actually get a verbatim copy of a written article? The NYT got a ~100 word fragment made up of multiple snippets of a ~15k word article, with the different snippets not even being in order. (The Times had to re-arrange the snippets to match the article after the fact)

I am simply not aware of anyone successfully doing this.


The amount of content required to call it a "Copy" is also a gray area.

Same with the idea of "prompting" and the amount required to generate that copyrighted output - again there are the extremes, from "the prompt includes copyrighted information" to "a vague description".

Arguably some of the same issues exist outside AI; it's just that the accessibility, the scale, and the lack of a "legal individual" on one side complicate things. For example, if I describe Mickey Mouse sufficiently accurately to an artist and they reproduce it to the degree that it's considered copyright infringement, is it me or the artist who did the infringing? Then what if the artist /had/ seen the previously copyrighted artwork, but still produced the same output from that same detailed prompt?


What's good for the goose is good for the gander. It may or may not be like theft, but either way, if one of us trained an AI on Hollywood movies, you best believe we'd get sued for eleventy billion dollars and lose. It's only fair that we hold corporations to the same standard.


I think you should read the case material for NY Times v OpenAI and Microsoft.

It literally says that stored within ChatGPT, verbatim, are large archives of NY Times articles, and that they were able to retrieve them through the API.


...which makes no sense. It is either an argument from ignorance or one of purposeful deceit. There is no coherent data corpus (compressed or not) in ChatGPT. What is stored are weights that produce strings of tokens that can recreate excerpts of the data it was trained on, with some imperfect level of accuracy.

Which I agree is problematic, and OpenAI doesn't have the right to disseminate that.

But that doesn't mean OpenAI doesn't have the right to train on it.

Content creators are doing a purposeful sleight of hand to conflate "outputting copyrighted data" with "training on copyrighted data".

It's illegal for me to read an NYT article and recite it from memory onto my blog.

It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.


> Content creators are doing a purposeful sleight of hand to conflate "outputting copyrighted data" with "training on copyrighted data".

I don't think so, I think it's usually argued as two different things.

The "training on copyrighted data" argument is usually that we never licensed this work for this sort of use and it is different enough from previously licensed uses that it should be treated differently.

The "outputting copyrighted data" argument is somewhat like your output is so similar as to constitute a (at least) partial copy.

Another argument is that licensed data is whitewashed by being run through a model. So you could have GPL-licensed open source code run through a model and come out exactly the same, but because it has been output by the model it is considered "cleaned" of the GPL restrictions. Clearly that output should still be under the GPL.

> It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.

What if I compress the NYT article with gzip? What if I build an LLM that always replies with the full article at 99% accuracy? Where is the line?
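
For the gzip case specifically, a minimal sketch (the article string here is just a placeholder):

    import gzip

    article = "placeholder standing in for the full text of an NYT article"

    blob = gzip.compress(article.encode("utf-8"))    # no readable text inside
    restored = gzip.decompress(blob).decode("utf-8")

    assert restored == article   # yet the original is fully recoverable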

This is not a technical issue; we need to decide on this just like we did with copyright, trademarks, etc. Regardless of what you think, this is not a non-issue, and we can't use the same rules as we did up until now unless we treat all ML systems as either duplication machines or as humans, and neither seems to solve the issues.


> Another argument is that licensed data is whitewashed by being run through a model. So you could have GPL-licensed open source code run through a model and come out exactly the same, but because it has been output by the model it is considered "cleaned" of the GPL restrictions. Clearly that output should still be under the GPL.

I don't think anybody is making that argument. The NY Times claims to have gotten ChatGPT to spit out NY Times articles verbatim, but there is considerable doubt about that. Regardless, everyone agrees that a verbatim (or close to verbatim) copy is a copyright violation, even OpenAI. Every serious model has taken steps to prevent that sort of thing.


Both ChatGPT and Copilot will happily spit out blocks of code without any form of license. When you say "Every serious model has taken steps to prevent that sort of thing", do you mean they are hiding it or really changing the training data?


How much code is needed for copyright to take effect? A whole program, a file, a block, a line, a variable name? Since there isn't a legally accepted answer yet, I am not sure what can be done other than litigating cases where it goes too far.


Would this example be over that line for you? https://x.com/mitsuhiko/status/1410886329924194309

I think most people would agree that function is copyrightable if recreated verbatim.


When you describe ChatGPT as just a model with weights that can create a string of tokens, is it any different from any lossless compression algorithm?

I'm sure if I had a JPEG of some copyrighted raw image it could still be argued that it is the same image. JPEG is imperfect: the result you get is the same every time you open it, but it's not the same as the original input data.

ChatGPT would give you the same output every time, and it does if you turn off the "temperature" setting. Introduce a bit of randomness into a JPEG decoder and functionally what's the difference? A slightly different string of tokens for ChatGPT versus a slightly different collection of pixels for a JPEG.
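
A minimal sketch of that point, with made-up toy logits rather than anything from ChatGPT's internals: at temperature 0 the decoder is a pure argmax and deterministic; any positive temperature adds the jitter.

    import numpy as np

    # Temperature 0 means greedy argmax (same token every time); temperature > 0
    # samples from a softened distribution. The logits here are made up.
    logits = np.array([2.0, 1.5, 0.3])
    rng = np.random.default_rng()

    def next_token(temperature):
        if temperature == 0:
            return int(np.argmax(logits))               # deterministic decode
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))    # randomized decode

    print([next_token(0.0) for _ in range(5)])  # always the same index
    print([next_token(1.0) for _ in range(5)])  # varies between runs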


Did you mean lossy compression algorithm? That would make sense.


> There is no coherent data corpus (compressed or not) in ChatGPT.

I disagree.

If you can get the model to output an article verbatim, then that article is stored in that model.

The fact that it's not stored in the same format is meaningless. It's the same content regardless of whether it's stored as plaintext, compressed text, PDF, PNG, or weights in a model.

The fact that you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant. Text files need to be interpreted in order to display them meaningfully, as well.


> If you can get the model to output an article verbatim, then that article is stored in that model.

You can't get it to do that, though.[1]

The NYT vs OpenAI case, if anything, shows that even with significant effort trying to get a model to regurgitate specific work, it cannot do it. They found articles it had overfit on due to snippets being reposted elsewhere across the internet, and they could only get it to output those snippets, and not in correct order. The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

No one who knows anything about these models disagrees that overfitting can cause this sort of behavior, but the overwhelming majority of the data in these models is not overfit and they take a lot of care to resolve the issue - overfitting isn't desirable for general purpose model performance even if you don't give a shit about copyright laws at all.

People liken it to compression, like the GP mentioned, and in some ways it really is. But in the most real sense, even with the incredibly efficient "compression" the models do, there's simply no way for them to actually store all the training data people seem to think is hidden in there, retrievable if you just prompt it the right way. The reality is that only the tiniest fraction of overfit data can be recovered this way. That doesn't mean the overfit parts can't be copyright infringing, but that's a very separate argument from the general idea that these models are constantly putting out a deluge of copyrighted material.
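
Rough numbers for the "simply no way to store it all" point; the parameter count, precision, corpus size, and bytes-per-token below are assumptions for illustration, not figures from any filing:

    # All figures are illustrative assumptions, not numbers from the case.
    params = 175e9                  # assumed parameter count
    bytes_per_param = 2             # fp16 weights
    model_bytes = params * bytes_per_param            # ~350 GB of weights

    training_tokens = 10e12         # assumed ~10T tokens of training text
    bytes_per_token = 4             # rough bytes of raw text per token
    corpus_bytes = training_tokens * bytes_per_token  # ~40 TB of raw text

    # Even spending every byte of the weights on rote storage covers under 1%.
    print(f"{model_bytes / corpus_bytes:.1%} of the corpus at most")  # ~0.9%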

(None of this goes for toy models with tiny datasets, people intentionally training models to overfit on data, etc. but instead the "big" models like GPT, Claude, Llama, etc.)

1. https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...


> The NYT, knowing the correct order, re-arranged them to fit the ordering in the article.

> Even doing this, they were only able to get a hundred or so words out of the 15k+ word articles.

OK, that’s less material than I believed, which shows the details matter. But we agree that the overfit material, while limited, is stored in the model.

Of course, this can be (and surely is) mitigated by filtering the output, as long as the product is the output and not the model itself.


>The fact that you need an algorithm, such as a specialized prompt, to retrieve this memorized data is also irrelevant.

I disagree. Granted I'm a layman and not a lawyer so I have no clue how the court feels. But I can certainly make very specialized algorithms to produce whatever output I want from whatever input I want, and that shouldn't let me declare any input as infringing on any rights.

For the reductio ad absurdum example: I demand everyone stop using spaces, because under the algorithm 'remove a space and add my copyrighted text', any input produces an identical copy of my copyrighted text.
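
A toy version of that "algorithm", with a placeholder standing in for the protected text:

    COPYRIGHTED_TEXT = "placeholder for my copyrighted text"

    def rigged_decode(anything):
        # "remove a space and add my copyrighted text"
        return anything.replace(" ", "", 1) + COPYRIGHTED_TEXT

    # By this logic every sentence containing a space "infringes", which is absurd:
    print(rigged_decode("any perfectly innocent sentence"))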

For the less absurd example: if I took a clean model trained without your copyrighted text and brute-forced prompts and settings until I produced your text, is your model violating the copyright, or are my inputs?


Well, I think this is why it’s not settled yet. However, the law depends on many reasonability tests.


> It's not illegal for me to read an NYT article and write my own summary of the article's contents on my blog. This has been true forever and has forever been a staple in new content creation.

It's not that clear-cut. It falls under the "fair use doctrine". Section 107 of US copyright law states that the resolution depends on:

> (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

Another thing we need to consider is that the law was drafted with the limitations of the human mind as an unconscious factor (i.e. not many people would be able to recite War and Peace verbatim from memory). This just brings up the fact that copyright law needs a complete rethink.


Using the product of someone's labor to build for-profit systems that are trying to put them out of work remains fucked up. And no, a simple copyright filter doesn't fix it. The problem is that the work was used without permission.


NY Times v OpenAI and Microsoft says the opposite, that verbatim, large archives of NY Times articles were retrieved via API. This may or may not matter to how LLMs work, but "large archive" seems accurate, other than semantic arguments (e.g. "Compressed archive" may be semantically more accurate).


> NY Times v OpenAI and Microsoft says the opposite, that verbatim, large archives of NY Times articles were retrieved via API.

This does not match my understanding of the information available in the complaint. They might claim they were able to do this, but the complaint itself provides some specific examples that OpenAI and Microsoft discuss in a motion to dismiss... and I think the motion does a very strong job of dismantling that argument based on said examples.

https://fingfx.thomsonreuters.com/gfx/legaldocs/byvrkxbmgpe/...


Yet before “safeguards” were added a prompt could say “in the style of Studio Ghibli” and you could get exactly that.

Would it be possible if Studio Ghibli images had not been used in the training?


If it was trained on a sufficient amount of fan art made in the Studio Ghibli style and tagged as such, yes.

otherwise those would just be unknown words, same as asking an artist to do that without any examples.

Though I am curious how performance would differ between training on only actual Studio Ghibli art, only fan art, or a mix. Maybe the fan art could convey what we expect 'Studio Ghibli style' to be even more, whereas actual art from them could have other similarities that the tag then conveys.


I don't understand. If I make a painting (hell, or a whole animated movie) in the style of Studio Ghibli, am I infringing their copyright? I don't think so. A style is just an idea; if you want to protect an idea to the point of no one even getting inspired by it, just don't let it out of your brain.

If the produced work is not a copy, why does it matter if it was generated by a biological brain or by a mechanical one?


When will programmers get it through their thick skulls that an artist taking inspiration from a style and a well-funded tech corporation downloading 400 million images to train on are two different things that shouldn't be compared? GPT is not a brain, and humans correctly have different rights than computing systems do.


I don't know how long you have been on HN, but I would recommend you familiarise yourself with / refresh yourself on the community guidelines (https://news.ycombinator.com/newsguidelines.html). While HN looks and works similarly to Reddit (and other such online communities), the tone and culture are not the same.


Yeah, you're right, we should be justifying the non-consensual exploitation of millions of workers instead of saying mildly mean words.



