
I mean that's part of the conversation that needs to be had. I would argue libraries are an unadulterated good, but it is generally considered at best unethical and at worst illegal to re-use content that isn't your own, at least without a proper citation.

Then there's also the issue with things like art, music, and code. Where does the line fall with scraping Github, Soundcloud, DeviantArt, or Instagram and using things like that without permission? Most of the code on Github is open source, but there's a lot of difference between the GPL and BSD licenses.



> but it is generally considered at best unethical and at worst illegal to re-use content that isn't your own, at least without a proper citation.

No it's not at all, except in extremely limited circumstances.

When George Lucas made Star Wars, did he cite all the Westerns and space opera serials and movies that influenced him? When you give a presentation at work on why you should move to a sharded database, do you cite the history of academic work on sharded databases? When you use Times New Roman in a document, do you cite the British newspaper The Times, or Robert Granjon's prior serif designs from the 1500s?

Of course not.

Legally, you can do whatever you want with ideas and styles and whatnot, which is what AI is about. Legally, you only run into problems when you reproduce sections of copyrighted works verbatim, without a license, in a manner that's not considered fair use. Your answer to "where does the line fall" is quite clear legally -- it's the line demarcated by fair use, which has nothing to do with licenses. AI doesn't change that.


I am not a lawyer, but it seems right to me to say that the weights are a derivative work of the training set.

> A “derivative work” is a work based upon one or more preexisting works, such as a translation, musical arrangement, dramatization, fictionalization, motion picture version, sound recording, art reproduction, abridgment, condensation, or any other form in which a work may be recast, transformed, or adapted. A work consisting of editorial revisions, annotations, elaborations, or other modifications, which, as a whole, represent an original work of authorship, is a “derivative work”.

As I understand it, derivative works must be created with the legal use of the original work, or be fair use, otherwise they are infringing.


No, as you can see from your very definition. But here's a good example:

If you take a book and turn it into a movie, that's a derivative work. Anyone can see the direct resemblance -- the transformation or adaptation.

But if you take a book, convert each letter to a number, add up the numbers that make each sentence, and then sell that as a list of "random" numbers, that's not a derivative work. The end result is sufficiently transformed that copyright no longer applies. Ownership of the original work has no relevance.
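To make the letter-sum example concrete, here is a minimal sketch. The function name `sentence_sums`, the a=1 through z=26 mapping, and the crude split on `.`, `!`, and `?` are all my own assumptions for illustration; the example above doesn't pin any of that down.

```python
import re


def sentence_sums(text: str) -> list[int]:
    """Reduce each sentence to the sum of its letter values (a=1 ... z=26)."""
    # Crude sentence split on ., !, or ? (an assumption, not a real tokenizer).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sums = []
    for sentence in sentences:
        # Only ASCII letters count; digits, spaces, and punctuation are ignored.
        total = sum(ord(c) - ord("a") + 1 for c in sentence.lower() if "a" <= c <= "z")
        sums.append(total)
    return sums
```

Any two sentences with the same letters in a different order produce the same number, so the original text can't be recovered from the output.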

And AI weights are like that. They're a complete transformation. They're not a derivative work. The only thing you have to make sure of is that they haven't been overtrained to the extent that they can regurgitate whole chapters of the texts they were trained on, for example. But that's not something they're currently able to do, and obviously copyright law will force companies to ensure it stays that way. (Not to mention that companies would do it anyway, due to the economic motivation of reducing model sizes to cut costs.)


>convert each letter to a number, add up the numbers that make each sentence...The end result is sufficiently transformed that copyright no longer applies

The problem with this as an example is that copyright would not apply to this transformative work at all, neither the original author's copyright nor your new authorship, because the result contains no creative human expression (unless the original book was designed to add up to some fortune cookie, of course, in which case you have not transformed it).

A nuttier, chewier example would be retelling a litigious story like Moana ("consider the copyright, across all these leaves... make way!"), from the pig's perspective or something, and seeing what would fly and what wouldn't.


Weights are simply a lossy compression of the training data set.

Now, I understand the argument that perhaps the specific work has been homeopathically diluted down to nothingness in the weights and so therefore has only been used to contextualise the compression process of other works, but if the weights can be reasonably used to generate copyright infringing text (and condensations and abridgements and transformations are explicitly listed in the law, verbatim copying is not necessary), or even answer substantial questions about it, then that shows that the weights included that data.

If I take a sound file and compress it down so it's poor quality but I can still make out the tune, that doesn't mean that I've avoided copyright law.


> Weights are simply a lossy compression of the training data set.

No they're not -- they're more like the dictionary generated to produce a lossless compressed data set. But then we throw out the compressed data itself, and keep only the dictionary.

> but if the weights can be reasonably used to generate copyright infringing text (and condensations and abridgements and transformations are explicitly listed in the law, verbatim copying is not necessary)

First of all, they haven't been shown to generate infringing text beyond the kinds of short snippets covered by fair use. And my previous comment already explained that longer texts are not going to happen, for both legal and economic reasons.

But secondly, you're wrong about "condensations and abridgements and transformations". You can absolutely sell a page-long summary of a book without getting permission, for instance. What do you think things like CliffsNotes are all about? Or all those two-page "executive summaries" of popular business books?

You can't abridge a 1,000-page book to 500 pages and sell that, but you can summarize its ideas in a page and sell that. Which is roughly the level of understanding that LLMs seem to absorb.


But there is no bright line test for what is and isn't fair use.



