
> Certainly it is literally derivative.

I am not sure if it is that clear cut.

Embeddings are encodings of shared abstract concepts, statistically inferred from many works and expressions of thought common to all humans.

With text embeddings, we get a many-to-one, lossy map: many possible texts ↝ one vector that preserves some structure about meaning and some about style, but not enough to reconstruct the original in general, and there is no principled way to say «this vector is derived specifically from that paragraph authored by XYZ».
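
To make the many-to-one point concrete, here is a minimal sketch (assuming the sentence-transformers library is installed; the model name is just an example choice, not anything canonical). Two differently worded sentences land close together in the embedding space, while each vector is a fixed-size array of floats from which the original wording cannot be read back:

    # Sketch only: assumes sentence-transformers is installed;
    # the model name below is an arbitrary example.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    a = model.encode("The cat sat on the mat.")
    b = model.encode("A cat was sitting on a mat.")

    # Two distinct texts, one neighbourhood: cosine similarity is high,
    # but neither vector records which exact wording produced it.
    print(util.cos_sim(a, b))   # high similarity, e.g. ~0.9
    print(a.shape)              # fixed-size float vector, e.g. (384,)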

Does the encoded representation of the abstract concepts constitute a derivative work? If yes, then every statement ever made by a human being is a work derivative of someone else's, by virtue of having learned to speak in childhood – everyone creates a derivative work of all prior speakers.

Technically, there is a strong argument against treating ordinary embedding vectors as derivative works, because:

- Embeddings are not uniquely reversible and, in general, it is not possible to reconstruct the original text from the embedding (a short sketch after this list illustrates this);

- The embedding is one of an uncountable number of vectors in a space where nearby points correspond to many different possible sentences;

- Any individual vector is not meaningfully «the same» as the original work in the way that a translation or an adaptation is.
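
A small sketch of the «one vector, many candidate texts» point (same assumptions as above: sentence-transformers installed, example model name): given only an embedding, the best you can do is a nearest-neighbour search, and several quite different sentences can sit at comparable distances, so nothing singles out one text as «the» source:

    # Sketch only: nearest-neighbour lookup from a bare vector.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    corpus = [
        "The cat sat on the mat.",
        "A cat was resting on a rug.",
        "My dog sleeps on the sofa.",
        "Quarterly revenue grew by ten percent.",
    ]
    corpus_emb = model.encode(corpus)

    # Pretend all we have is this vector, with no record of its source text.
    query_vec = model.encode("A cat was sitting on a mat.")

    scores = util.cos_sim(query_vec, corpus_emb)[0].tolist()
    for sentence, score in sorted(zip(corpus, scores), key=lambda x: -x[1]):
        print(f"{score:.2f}  {sentence}")
    # Several paraphrases score similarly; the vector alone does not
    # identify a unique original text.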

Please note that this is a philosophical take and it glosses over the legally relevant differences between human and machine learning; the legal question ultimately depends on statutes, case law and policy choices that are still evolving.

Where it gets more complicated:

If the embedding model has been trained on a large number of languages, cross-lingual search becomes straightforward: you can express an abstract search concept in any language the model has been trained on. The quality of such search results across languages X, Y and Z will be roughly proportional to the scale and quality of the training corpus in those languages.
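
A rough sketch of the cross-lingual case (again assuming sentence-transformers, here with a multilingual model such as «paraphrase-multilingual-MiniLM-L12-v2», which is only one example of many): an English query can retrieve semantically matching documents written in other languages, despite sharing no surface vocabulary with them:

    # Sketch only: cross-lingual retrieval with a multilingual embedding model.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    docs = [
        "Der Sinn des Lebens ist eine alte philosophische Frage.",    # German
        "Le sens de la vie est une question philosophique ancienne.", # French
        "La receta de la tortilla lleva huevos y patatas.",           # Spanish, off-topic
    ]
    doc_emb = model.encode(docs)

    query_emb = model.encode("the meaning of life")
    scores = util.cos_sim(query_emb, doc_emb)[0].tolist()

    # The German and French sentences about the meaning of life should rank
    # above the unrelated Spanish recipe, with no words shared with the query.
    for doc, score in sorted(zip(docs, scores), key=lambda x: -x[1]):
        print(f"{score:.2f}  {doc}")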

Therefore, I can search for «the meaning of life»[0] in English and arrive at a highly relevant cluster of results written in different languages by different people at different times, and the question becomes: «what exactly has it been statistically[1] derived from?».

[0] Cross-lingual search is something I did with my engineers last year, to our surprise and delight at how well it actually worked.

[1] If one can't trace a given vector uniquely back to a specific underlying copyrighted expression, and demonstrate substantial similarity of expression rather than idea, the «derivative work» argument becomes strained in the legal sense.


