This internal tension between chasing AI tooling and avoiding AI-generated content is just a prelude to the bigger shift of search engines getting reinvented around generated results instead of found results.
Fast forward 10+ years, and for knowledge-related queries search is going to be more about generated results, personalized to our level of understanding, that at best quote pages and more likely just reference them in footnotes as primary inputs.
These knowledge-related queries are where most content farms, low quality blogs, and even many news sites get traffic from today. If the balance of power between offense (generating AI content) and defense (detecting AI content) continues to favor offense, there will be a strong incentive to just throw the whole thing out and go all-in on generated results.
The big question is how incentives play out for the people gathering knowledge about the world, which is the basis for generated results. Right now many/most make money with advertising, but so do content farms, and more generated results mean further starving of that revenue source. For a portion of the info people want to know (e.g. factual info, not opinions, guidance, etc.), Wikipedia is an alternative fact- and context-gathering model, but if search relies on it more, it will strain Wikipedia's governance model and become more of a single point of failure.
At some point in this scenario you outline, there will be so much ML-generated content referencing other generated content that the (already muddled) primary sources will be lost in the mix, and ML will be forced to synthesize sensible, meaningful content out of conflicting "truth".
What I'm trying to get at is that ML is currently terrible at contextualizing information, but in the future the successful knowledge-query engine will be the one that does the best job of wading through the explosion of digital content and pulling out the parts that are coherent and more connected to reality.
Such a program will need to be able to contextualize information so that individual pieces of knowledge interconnect into a larger, coherent model of reality.
Easier said than done; we'll see if that's even possible. This is getting towards AGI.
My sense is a whole chunk of the internet is going to just get washed out with the tide as it gets demonetized by search, and that's the stuff most likely to be ML-generated. Meanwhile search engines like Google will start to laser in on fewer and fewer higher quality sources that have some signal, both through manual tuning and automation.
I think it's important to distinguish between ML-assisted and whole-cloth ML generation too. I imagine many high quality content sources are going to be using ML-assisted writing tools; for instance, at my old news company (edsurge) the journalists spent most of their time gathering info and constructing the story, which was the real meat of the work. The writing and wordsmithing was not them operating at the top of their skillset; it was pretty inefficient manual labor, especially the drafts. So from a content-production perspective, ML could automate a lot of that away while letting the journalists tune the narrative and edit the output for better nuance - in other words, editing vs producing.
This also assumes a single output, but if you take the edtech lens and couple it with the power of ML, you could easily have any given news article exist in nearly infinite varieties based on reading level, language, and even past knowledge (by including or removing context). As someone interested in new mediums and innovating news, this is absolutely exciting to me. A really interesting question here is whether there's an opportunity for collaboration between expert sources and search engines, such that for a given set of domain-specific queries the expert sources are in charge of fine-tuning the generated outputs.
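A toy sketch of what that per-reader rendering could look like, in case it helps. Everything here is hypothetical: `generate` is a stub standing in for whatever text-generation model you'd plug in, and the known-topics filter is deliberately crude:

```python
from dataclasses import dataclass

@dataclass
class Reader:
    reading_level: str      # e.g. "middle school", "domain expert"
    language: str           # target language for the rendering
    known_topics: set[str]  # background we can skip re-explaining

def render_article(facts: list[str], reader: Reader) -> str:
    """One set of reported facts, rendered per-reader: drop background
    the reader already has, then ask the model to write at their level."""
    fresh = [f for f in facts
             if not any(t.lower() in f.lower() for t in reader.known_topics)]
    prompt = (f"Write a short news piece in {reader.language} at a "
              f"{reader.reading_level} reading level, using only these facts:\n"
              + "\n".join(f"- {f}" for f in fresh))
    return generate(prompt)

def generate(prompt: str) -> str:
    # Stub so the sketch runs; swap in a real model call here.
    return f"<generated from: {prompt[:70]}...>"

# Same facts, different reader: the mRNA background gets filtered out
# because this reader already knows it.
reader = Reader("middle school", "English", {"mRNA"})
print(render_article(
    ["mRNA vaccines teach cells to make a target protein",
     "A new flu shot based on the same technique entered trials this year"],
    reader))
```

The point isn't this particular filter; it's that the facts and the telling become separate layers you can recombine per reader.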
Completely agree about contextualization. The automated detection of ML content (the defense part) needs to be tuned toward extracting and analyzing the claims being made in the content; a lot of ML content is nonsensical, or introduces wild, novel claims that lack evidence and alignment with other sources. Some big questions for the analysis are whether those claims are corroborated by other higher quality sources, whether relevant context is included, and whether it's bringing anything new to the table.
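Roughly, I picture the scoring pass looking something like this toy sketch. All the names are placeholders, and the corroboration numbers would really come from an entailment/retrieval model run against trusted sources, which is the actual hard part:

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    corroboration: float  # best match against trusted sources, 0..1
    novel: bool           # True if no trusted source addresses it at all

def page_score(claims: list[Claim],
               support_threshold: float = 0.7,
               novelty_penalty: float = 0.5) -> float:
    """Crude page-level score: reward claims that trusted sources
    corroborate, penalize wild novel claims nothing else addresses."""
    if not claims:
        return 0.0
    total = 0.0
    for c in claims:
        if c.novel:
            total -= novelty_penalty   # out-of-left-field claim, no evidence
        elif c.corroboration >= support_threshold:
            total += c.corroboration   # aligned with higher quality sources
    return total / len(claims)

# Two well-corroborated claims plus one wild one: (0.9 + 0.8 - 0.5) / 3
claims = [Claim("X was founded in 1998", 0.9, False),
          Claim("X employs 10,000 people", 0.8, False),
          Claim("X invented the internet", 0.0, True)]
print(round(page_score(claims), 2))  # 0.4
```

Note this deliberately doesn't reward novelty; a real system would need some way to let genuinely new, well-evidenced claims through, which is where the unorthodox-ideas question below comes in.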
In this kind of setup redundant sites will be ruthlessly demonetized by search. As you say, this also drives us toward expert systems with some probabilistic measurement of the truth of claims and completeness of context, which may also have actual experts at the controls of some of this (like above). This of course raises a lot of questions about where unorthodox ideas fit in this model.
I'm not an ML or search expert though; my experience is more on the production side.
This sounds scarily similar to descriptions of brains dealing with incomplete information. Good thing brains aren't keen on rationalizing prior beliefs in the face of new evidence or believing spurious things.
I’m not as convinced that “coherent” or “more connected to reality” will be how the value of mass market content will be measured in the future, which makes some of these issues less of a problem for the content generator.
I flip flop between seeing it as inevitable, and overly techno-optimistic. Can definitely see it both ways.
In the arms race between ML-copy generation and its detection, eventually the edge becomes "does this make sense in a coherent world-view", so there should be a natural pressure towards that outcome.
But maybe we're just not good enough at building these tools, and the incentives will align with an endless onslaught of digital sludge.
There's also the problem of how legitimate fiction, poetry, love stories, and art fit into a www where only the most logical is allowed to be seen. It could be a logically puritanical nightmare.
> it will strain Wikipedia's governance model and become more of a single point of failure
Why can't the Wikipedia model be adapted to a federated, directly community-run approach? This works well enough for services such as email, Matrix, and the fediverse. There's gravitation towards centralized hosting services but that's largely behavioral - the model itself works perfectly fine with lots of small players.
Heavyweight multimedia can be a challenge but text content itself is quite easy to serve up from very small devices.
Wikidata, and Wikibase (the software it runs on), are expanding into a "federated" network of knowledge stores. You can, for example, link from Wikidata to some other instances of the software and query them transparently. It's used by a few museums that want to keep control over the description of the art, but link to Wikidata for, say, the artists' place of birth. Then, you can use their query interface (SPARQL etc.) to get all the art they have from "artists born in a city that had a commercial port in 1960" without the museum ever having to enter more information than "this is a van Gogh".
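For anyone curious, a federated query along those lines might look roughly like this. The museum endpoint is invented, and I'm assuming it reuses Wikidata's property URIs (federated Wikibase instances often do); wdt:P170 (creator) and wdt:P19 (place of birth) are real Wikidata properties, and I've simplified the "commercial port in 1960" condition down to just place of birth:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# The museum's own data holds little more than "this artwork, this creator";
# everything about the artist is resolved live from Wikidata via SERVICE.
QUERY = """
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?artworkLabel ?birthplaceLabel WHERE {
  ?artwork wdt:P170 ?artist ;            # creator, from the museum's data
           rdfs:label ?artworkLabel .
  SERVICE <https://query.wikidata.org/sparql> {
    ?artist wdt:P19 ?birthplace .        # place of birth, from Wikidata
    ?birthplace rdfs:label ?birthplaceLabel .
    FILTER(LANG(?birthplaceLabel) = "en")
  }
}
LIMIT 20
"""

sparql = SPARQLWrapper("https://museum.example.org/sparql")  # hypothetical
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["artworkLabel"]["value"], "-", row["birthplaceLabel"]["value"])
```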
I'm not sure such a complex system of content moderation & process could so easily be federated; I'd love to see a federated system equivalent to Wikipedia out there, or one that has successfully transitioned governance like that. Email spam, by comparison, is far, far less nuanced. Regardless, it'd be a new effort and wouldn't just work or be trusted in year one. It'd need to be tested and refined over years and years, like Wikipedia has.
I could see several nonprofits and news brands, along with Wikipedia, shift to becoming a set of sources of truth for different & likely overlapping topics. That shift could happen gradually, as part of a mix of monetization incentive changes and more explicit 'here's how you participate' coercion (medium-is-message stuff; see FB's pivot to video, or the YouTube algo changing how content creators create). The generated result that Google spits out could reference those and note them as inputs, including noting where they disagree or choose to include or exclude certain context.
I still don't see how these ideas get funded without Google directly funding them, where algorithm transparency comes in play, etc.
The problems have absolutely nothing to do with technology in the streaming-video sense of the word. It's about trust, versioning, truth, reality, and similar concepts.
Maybe Linux development is a good example, with some centralisation but other power centers of varying size connected, such as distributions or non-kernel software projects.
But then again Wikipedia already is federated into hundreds of local communities and horizontal projects like Commons or Wikidata, and it works less terribly than one would think.
I mean it in the education and learning sense: for knowledge to be understood it must be tailored to the person asking the question. Answering a question about physics ideally should look different for someone with a high school education vs someone who's in the field, or for someone who has only a passing interest vs one who expresses interest through continued engagement. All info is filtered in some fashion, else a simple query would lead to an impossibly dense tome ('baking an apple pie from scratch').
As an edtech person who started a news company to try to solve this, one of my pet peeves about news is that it doesn't adopt an education mindset / approach despite its mission being an educational one at its core. Part of that is due to search being what drives traffic, which means there's a lack of relationship with the reader. But the root is internal: news and journalism are still too stuck in the newspaper-mind of bespoke, inverted-pyramid one-off articles that act as islands of information, delivered to an audience they have no ongoing relationship with and who have no means of getting follow-up. It's a cultural holdover bound up in the relationship to its old medium, not what optimally teaches.
Generated results need not be in the form of articles; that's one of the constraints that ML lifts. You could give people some generated text but also give them drill-down tools, letting them expand on a simple summary, or read in dialectic or debate style in areas where facts are not settled or multiple viewpoints exist. Just gotta expand your POV a bit: ML-generated results represent an opportunity for incredible leverage in educating the world.
Anyway if someone in ML at OpenAI/Google/etc wants to get deeper into this, please reach out!
Really interesting stuff ahead.