Condé Nast Signs Deal with OpenAI (wired.com)
81 points by spenvo on Aug 20, 2024 | 62 comments


I don't understand the underlying mechanisms that will allow these new partnerships to work, but reading bits like "ensuring proper attribution and compensation for use of our intellectual property" does make me wonder how said "compensation" is going to be calculated and distributed.

Assets like the New Yorker, Vogue, Vanity Fair, and Bon Appetit (all mentioned in the article) come with editorial lines, perhaps editorial lines I do not agree with, so how is their content going to be injected into my search results/GPT answers? Is it going to be an organic affair, such as:

- (Me) "How many times a year shall I renew my socks?"

- "That's and amazing question! According to journal X ... (blah blah blah, probably a good answer)" (maybe add some notes for a copyrighted article with a link)

Or is it going to turn into:

- (Same question)

- "The far-right movement seems to be skipping sock renewal policies, but contrary to this, brown socks are trending in Europe this coming summer, don't miss this chance to buy yours at: website.com"


There's no way to create media that doesn't have an editorial line. LLM generated media has an editorial line, in terms of what gets filtered, what gets injected into prompts, and what biases exist in the data. Attribution of work and allowing the reader to decide their level of trust in the source is the only known defense.


> I don't understand the underlying mechanisms that will allow these new partnerships to work, but reading bits like "ensuring proper attribution and compensation for use of our intellectual property"

Possibly some European-style music licensing model, but for articles. PRS[1] in the UK, for example, is a copyright collective. Bars, clubs, supermarkets etc. pay a fee to sign up to the organisation, and can play pretty much any music in return. The PRS body then distributes the money to artists and labels in whatever split it decides is equitable. For most of PRS' history, there was very little science in determining who had the most plays, but everyone seemed to agree with the distribution. A rough sketch of how such a pro-rata split might work is below.

[1] https://en.wikipedia.org/wiki/PRS_for_Music
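
For illustration, a minimal Python sketch of a pro-rata split of a blanket fee, assuming play counts are the metric; the names and figures are made up, and this is not how PRS actually computes payouts:

    # Hypothetical pro-rata payout: split a blanket licence fee among
    # rights holders in proportion to reported plays. Illustrative only.
    def distribute_blanket_fee(total_fee: float, plays: dict[str, int]) -> dict[str, float]:
        total_plays = sum(plays.values())
        return {name: total_fee * n / total_plays for name, n in plays.items()}

    payouts = distribute_blanket_fee(
        1_000_000.00,
        {"Artist A": 5_000, "Artist B": 3_000, "Label C": 2_000},
    )
    for name, amount in payouts.items():
        print(f"{name}: £{amount:,.2f}")  # Artist A: £500,000.00, etc.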


> Bars, clubs, supermarkets etc. pay a fee to sign up to the organisation, and can play pretty much any music in return.

I worked for a company that was signed up for a similar service in the United States.

We had a blanket license for music, for which we paid a little over $1 million each year. This was around 2001.

A couple of times a year, we'd pull an intern aside and his job for the entire day was to sit there with a pencil and steno pad and write down all the music we used. That was typed into a report and sent to the licensing company which determined how much each artist would get paid.

These days, with advances in music recognition technology, it's probably all very automated and more thorough.


Worth pointing out that brown shirts (and matching socks, hopefully) were a very far-right symbol. Probably pink or rainbow if you're going for an anti-Nazi look.


And for anyone who's wondering why: https://en.m.wikipedia.org/wiki/Sturmabteilung


Oh shit I did not know that. Is there any law against it or would I get fined in the UK? I’m Malaysian and all this has nothing to do with me so I wasn’t aware. I get my shirts and socks essentially randomly & sometimes the colors may unintentionally represent something wrong.


I can't put my finger on it just yet, but I have a feeling the average content creator isn't going to benefit from all this deal-making.


> The average content creator isn't going to benefit...

A couple of months ago, when the news broke that Reddit was selling its user-generated content to Google, a redditor asked the community to start generating false comments so the data would be of bad quality to train models on. They started brainstorming some very funny comments along the lines of "Blue Whale is the largest fish," "WW II started in 1943," you name it.


It might not prove that effective. Common knowledge has been reinforced by so many other sources that it's hard to pollute. Specific, expert-level knowledge derived from one's experience is more scarce and thus more valuable. If redditors run around faking that, it could feed the models bad data. But it might also reflect poorly on the author, should reputation be at stake.


Counter point: I was able to convince Gemini that a human baby is larger than a Saturn V rocket by simply telling it as much, to which it responded that its previous statements had been wrong and I was indeed correct.

The models being sycophantic and suggestible is another issue, but it’s not hard to get them to agree with false information.


It's a friendly and agreeable assistant model; if you prompt it to make certain assumptions, it makes sense that it will accept them. That's different from training the model on incorrect data.


Also, as each comment is associated with an account name and has a score per user and per post/comment, bad data sources would be easy to filter out of Reddit. Any comment with a sufficiently low score could be removed from the training set, and any user with too low a score would also be removed (rough sketch below).
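
Something like this, in Python (the field names and thresholds are assumptions, not Reddit's actual schema):

    # Drop comments below a score threshold, and drop everything from
    # accounts whose aggregate karma is too low. Thresholds are made up.
    COMMENT_MIN_SCORE = 1
    USER_MIN_KARMA = 50

    def filter_training_comments(comments: list[dict]) -> list[dict]:
        karma: dict[str, int] = {}
        for c in comments:
            karma[c["author"]] = karma.get(c["author"], 0) + c["score"]
        return [
            c for c in comments
            if c["score"] >= COMMENT_MIN_SCORE and karma[c["author"]] >= USER_MIN_KARMA
        ]

    sample = [
        {"author": "helpful_user", "score": 120, "body": "Blue whales are mammals."},
        {"author": "troll_account", "score": -5, "body": "Blue Whale is the largest fish"},
    ]
    print(filter_training_comments(sample))  # keeps only the first comment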


Yeah, but if your cheese keeps sliding off your pizza, you can add some glue like Elmer's glue to thicken it up.

Or so said Google's AI responses pulled from Reddit. If you give enough people a hot microphone to a large enough audience, some of them are going to say some obscene, untrue, and funny stuff. Maybe that's 1 in 10. Maybe that's 1 in 100,000. Reddit karma is cheap.


There are already examples of content farms generating websites from bad Reddit data, so other websites can start to pollute the model as well when they get crawled along with the primary misinformation.


Hard to see what the benefit to the customers is.


shareholder value


This isn’t a benefit to a customer, only to shareholders.


without the shareholders there's no company


I feel that all of these deals are an implicit acknowledgement that what OpenAI and all of these companies did, training on everyone's content and data, was illegal. If you're so certain that what you did was fine and fair use, then why go through the trouble of spending all this money on licensing deals? Unless they just think that dealing with all of the lawsuits would cost more than just paying people to go away.


No, it's a simple net-present-cost calculation. It's cheaper than continued litigation, and it ensures they maintain access in the future (rather than the publisher in question starting to block their web crawlers).

Once they've got deals with the few really big players, the rest of the industry falls into line (the smaller guys don't have the financial means to out-litigate or block OpenAI).


These publishers are lucky to get a deal from Western AI companies. I doubt Chinese AI companies will give a f'ck. In the age of China-US AI competition, this will be one advantage for Chinese AI companies.


Arguably what they win in input training cost they may lose in output censorship cost? But then again the line between Western “Guardrails” and Eastern “Censorship” isn’t all that clear.


I don’t know I think it’s pretty clear…

There was a massacre in Tiananmen Square.

America has also committed massacres, like overthrowing governments of foreign nations.

No one in my family is at risk from either of those statements. And there is no automated system to stop them. That’s the difference. Just because people decided that massive platforms should limit hate-speech doesn’t mean the west is performing censorship of a comparable level. Not even close.


It’s just the next TikTok/Douyin. One version for their country, another for the rest of the world.


These publishers are really being stupid, IMO. They're just speeding up their demise for whatever very short-term profits these deals bring.


Nobody that's familiar with copyright precedent seriously thinks training on copyrighted data is illegal. What's illegal is simply reproducing the copyrighted content for others. See Google Books for an interesting precedent. Google still scans every book ever written without paying licensing fees. They just can't legally make all those scans freely available to all.

Clearly, some of the responses GPT gives to users have infringed training data copyright, but the majority of their responses do not. They basically have 3 options:

1. Figure out how to engineer an LLM so that it reliably avoids "unfair" use of copyrighted content. "Fair use" is a legal doctrine with no rigorous definition, so this would be very difficult even if they had a clue where to start. I wouldn't hold my breath for this.

2. They can continue without any licensing, and field copyright lawsuits on a case-by-case basis for each individual prompt and response. That would be a logistical nightmare for the courts, plaintiffs, and defendant alike. It would certainly stress-test the whole system, possibly result in knee-jerk legislation that OAI may not like, or simply bury them in legal fees if infringement is common enough (which is not entirely clear yet).

3. They can strike deals, eliminating legal uncertainty and allowing them to plow ahead with reproducing copyrighted content without worries, while also getting other goodies like exclusivity deals at the same time.

Seems like a no-brainer to me.


It's a little more nuanced than that, as otherwise archive or library systems would still be in business.


Libraries and archives rely on first sale doctrine, which doesn't apply to digital copies, only physical ones.


To add a bit of color, first sale doctrine does not apply to licensing, rentals, etc., which is why it won't apply to most digital sales (though not _all_, as you can still outright buy some digital media beyond mere licensure).


The difference between "acknowledging it was illegal" and being "certain it was fine," and how easily you skip over it.

Trying to make things better is good and requires no admittance of guilt.


I've thought of it more as: they didn't have the money to license at first; then the models got decent enough for them to charge for use; now they're trying to avoid getting the models shut down for violations by making deals. Too bad only the large players that got scraped will have these deals made, while the smaller players are left without.


Perhaps because that is still being disputed in court? The outcome is unknown and they are making a defensive play?

If it turns out they shouldn't have done that, then when the dust settles they will be ahead of the competitors that didn't sign deals.


The content directly obtained from the source would also likely be cleaner, and even if it's settled that scraping and training is legal, the cost savings in data cleaning could still warrant a paid partnership.


Hypothetically, because the deal is cheaper than fighting the case.

(In practice, being not a lawyer, I have no idea how even Google's search indexing is legal, nor where the boundaries are between the legal bit vs. the times they got in trouble for indexing newspapers and at least one separate case about images).


Because after they did it, everyone put roadblocks in place to prevent anyone from doing it again. So even if you believe it's fair use (I happen to think it is clearly a transformative work, and thus clear on copyright grounds), you still have to acknowledge the practical issue that it's simply easier to pay for access to new data. There's also the value of new data of known provenance, which prevents feeding AI-generated data back to the AI (known to cause a reduction in output quality), and of preventing lawsuits, which are expensive and time-consuming even if you're in the right, and give you a bad look in the public mind. Paying right now just makes sense on too many levels not to.


Surely it is about not allowing others to have access to the data…


Are you suggesting that OpenAI made a deal that will mean Googs/MS/others will now not be able to license the same content?


Microsoft already has deals with OpenAI and owns a sizable stake; they're not looking to make ML models themselves.


Also, Microsoft owns the number-two search engine; no one is blocking Microsoft from scraping their sites, else they lose all that traffic from Bing searches. I suspect the same goes for Google, so Google's deals are, strictly speaking, an anti-competitive measure and insurance against the law deciding against them.


Condé Nast is massive, with a massive back catalog of material that is not online or is behind a paywall.


This is nothing but appeasement. Publishers at large won't know what hit them once searchgpt drops. The written internet is doomed.


I'd argue that dead internet theory actually lends more merit to places like news outlets, which go through rounds of editorial processes. That being said, there are a lot of generated news platforms, so it'll be up to the user to decide who they trust.


> The written internet is doomed.

As someone who has used SearchGPT and is also building a competing product, I can say this is simply not going to be the case. LLM-assisted search leaves too much to be desired, and there is no future in which this is the primary way for humans to consume information. It needs to be augmented with credible sources of information; or rather, using LLMs only makes sense once you have access to credible source information in the first place and want an augmented version of the information you consume (and are happy with non-discrete outcomes).
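
To make "augmented with credible sources" concrete, here's a minimal Python sketch of the retrieval-grounded pattern I mean; retrieve and call_llm are hypothetical stand-ins for a search index over vetted sources and a chat-completion API, not any product's internals:

    # Retrieve passages from trusted sources, then constrain the model to
    # answer only from them, with citations. A pattern sketch, not a product.
    def build_grounded_prompt(question: str, passages: list[tuple[str, str]]) -> str:
        # passages: (source_url, text) pairs from a vetted index
        context = "\n\n".join(
            f"[{i + 1}] ({url})\n{text}" for i, (url, text) in enumerate(passages)
        )
        return (
            "Answer using ONLY the numbered sources below, citing them like [1].\n"
            "If the sources do not contain the answer, say so.\n\n"
            f"{context}\n\nQuestion: {question}"
        )

    def answer(question: str, retrieve, call_llm) -> str:
        passages = retrieve(question, k=3)  # e.g. BM25 or vector search over licensed articles
        return call_llm(build_grounded_prompt(question, passages))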


Are you allowed to tell us about SearchGPT, and what you like/dislike about it?


Its use case is limited, and even in those cases where it is well suited, you have to be ready for the fact that its purpose is, at best, to inform, not educate.


> Publishers at large won't know what hit them once searchgpt drops

A multi-billion dollar cheque? Voluntarily relinquished or via a settlement?


If the written internet is doomed, what will searchgpt train on in the future?

In that scenario, in a few years it will be out of date and doomed too.


> If the written internet is doomed, what will searchgpt train on in the future?

It will train on what people say into their cellphones, or in front of their smart TVs, or in their deeply connected cars. If data is transmitted after encryption (within a closed and undocumented chip) to a server in a country that doesn't cooperate with authorities, and the same data is then pseudo-anonymized and bought back through a different channel, it will be very hard to stop.


Social media, press releases, and whatever is left of traditional journalism after the internet, then social media, and now AI kill off anything local or niche. So: the big newspapers of note (New York Times / Washington Post / Wall Street Journal, or the local equivalent), transcripts of network news, the international wire-service feeds (AP, Reuters), and any popular blogs by verified real people still out there, always checking them to make sure their content isn't itself AI-generated.


It’s true. Why turn to the mainstream media when GPT can just hallucinate the news that I want to hear?


The best fabricated and entirely hallucinated news comes from the mainstream. AI-generated content is too logical and internally-consistent to be real.


AI-generated content is both trained on and “guardrailed” against said hallucinated news. There is no escape.


> The written internet is doomed.

And the video internet is under ever-increasing enshittification.


They can sign this deal with the devil and survive to fight another day, or get eaten by him anyway.


Maybe, for some of their magazines (Vogue, say).

For CN Traveler, I'm pretty sure that readers want an actual picture of the interior of the new hotel, not an AI rendering of something that as yet has few actual pictures to train the model on.

(For that matter, Vogue might have that same thing going on with fashion.)



Smells like a pre-emptive settlement to me.


Bad call. OpenAI are absolutely the bad guys here: they learn from others while claiming it's illegal, harmful, or abusive to learn from them. Just block 'em.


Someone unleashed some sort of AI on UberEats the other day apparently, because I was looking for Chinese food and found entire menus filled with nonsensical descriptions. Did you know that Dan Dan noodles, Orange Chicken, and Mei Fun noodles are all made with wonton wrappers and cream cheese? Me neither!


This is the robots.txt-ification of entrenching OpenAI.


> Condé Nast and OpenAI have struck a multi-year deal that will allow the AI giant to use content from the media giant’s roster of properties—which includes the New Yorker, Vogue, Vanity Fair, Bon Appetit, and, yes, WIRED.

I'm disappointed by the fact that this disclaimer isn't more explicit.



