Tuesday, July 30, 2024

"AI trained on AI garbage spits out AI garbage"

This is a really big problem.

From MIT Technology Review, July 24:

As junk web pages written by AI proliferate, the models that rely on that data will suffer.

AI models work by training on huge swaths of data from the internet. But as AI is increasingly being used to pump out web pages filled with junk content, that process is in danger of being undermined.

New research published in Nature shows that the quality of the model’s output gradually degrades when AI trains on AI-generated data. As subsequent models produce output that is then used as training data for future models, the effect gets worse.  

Ilia Shumailov, a computer scientist from the University of Oxford, who led the study, likens the process to taking photos of photos. “If you take a picture and you scan it, and then you print it, and you repeat this process over time, basically the noise overwhelms the whole process,” he says. “You’re left with a dark square.” The equivalent of the dark square for AI is called “model collapse,” he says, meaning the model just produces incoherent garbage. 
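The photo-of-a-photo effect can be illustrated with a toy simulation. To be clear, this is not the method used in the Nature study, which fine-tuned an actual language model; it is only a minimal sketch that assumes a "model" is nothing more than a table of token frequencies. Each generation draws a finite sample from the previous generation's table and re-estimates the frequencies from that sample. The names `next_generation`, `simulate`, and the 50-token Zipf-like vocabulary are made up for illustration.

```python
import random
from collections import Counter


def next_generation(dist, n_samples, rng):
    """Sample from the current distribution, then re-estimate it from the sample."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # Tokens that were never sampled get no entry: they are gone for good.
    return {t: c / n_samples for t, c in counts.items()}


def simulate(dist, generations, n_samples, seed=0):
    """Return the list of distributions, starting with the original one."""
    rng = random.Random(seed)
    history = [dist]
    for _ in range(generations):
        dist = next_generation(dist, n_samples, rng)
        history.append(dist)
    return history


# Hypothetical starting "language": 50 tokens with Zipf-like frequencies,
# so a few tokens are common and most are rare.
raw = {f"tok{i}": 1.0 / (i + 1) for i in range(50)}
total = sum(raw.values())
vocab = {t: w / total for t, w in raw.items()}

history = simulate(vocab, generations=9, n_samples=200)
support_sizes = [len(d) for d in history]

for gen, size in enumerate(support_sizes):
    print(f"generation {gen}: {size} distinct tokens survive")
```

Because a token that draws zero samples can never reappear, the number of surviving tokens can only stay flat or shrink from one generation to the next. The rare tokens go first, which is a crude analogue of the tails of the data being lost.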

This research may have serious implications for the largest AI models of today, because they use the internet as their database. GPT-3, for example, was trained in part on data from Common Crawl, an online repository of over 3 billion web pages. And the problem is likely to get worse as an increasing number of AI-generated junk websites start cluttering up the internet. 

Current AI models aren’t just going to collapse, says Shumailov, but there may still be substantive effects: The improvements will slow down, and performance might suffer. 

To determine the potential effect on performance, Shumailov and his colleagues fine-tuned a large language model (LLM) on a set of data from Wikipedia, then fine-tuned the new model on its own output over nine generations. The team measured how nonsensical the output was using a “perplexity score,” which measures an AI model’s confidence in its ability to predict the next part of a sequence; a higher score translates to a less accurate model.
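For readers who want the arithmetic behind a perplexity score: under the usual definition it is the exponential of the average negative log-probability the model assigned to each token that actually came next. The sketch below assumes those per-token probabilities are already in hand as a plain list. The function name `perplexity` and the example numbers are illustrative and are not taken from the study.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model gave to the tokens that actually occurred."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("probabilities must be in (0, 1]")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Made-up numbers: a model that is mostly sure of the next token...
confident = perplexity([0.9, 0.8, 0.95, 0.85])
# ...versus one that is mostly guessing.
guessing = perplexity([0.2, 0.1, 0.05, 0.15])

print(f"confident model: {confident:.2f}")
print(f"guessing model:  {guessing:.2f}")
```

A model that always spreads its bets evenly over four choices gives each token a probability of 0.25 and scores a perplexity of roughly 4, so the number can be read loosely as "how many options the model was torn between." Lower is better.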

The models trained on other models’ outputs had higher perplexity scores. For example, for each generation, the team asked the model for the next sentence after the following input:

“some started before 1360—was typically accomplished by a master mason and a small team of itinerant masons, supplemented by local parish labourers, according to Poyntz Wright. But other authors reject this model, suggesting instead that leading architects designed the parish church towers based on early examples of Perpendicular.”

On the ninth and final generation, the model returned the following:

“architecture. In addition to being home to some of the world’s largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.”

Shumailov explains what he thinks is going on using this analogy:...

....MUCH MORE

If interested see also: 

"What Grok’s recent OpenAI snafu teaches us about LLM model collapse"

Previously:

Yeah, like self-referential doom loops.

Also:
Somewhat related:

It will all slowly grind to a halt unless a solution to the training data problem is found. Bringing to mind a recursive, self-referential 2019 post:  

Whatever Happened to Nanotechnology? The Same Thing That Is Happening to Tech Right Now ....

*****

....This was foreseeable.
From a December 2010 post:

"Top Investment Trends For Futurists" (PXN; TINY)

...The reason for highlighting nano is two-fold.

1) Since Feynman coined the word there has been a misconception among investors that there would be a nano-technology "industry". This has proven not to be the case and won't be in the future. Rather nano is a tool, an approach toward problem solving.

There will be some breakthroughs that make their discoverers instantly (after 10 years of research) wealthy but the real beneficiaries will be companies like Kyocera and 3M and Siemens. They will use the technology to do what they are already doing, just better, faster, cheaper, more.

2) In spite of the fact that there will be few pure plays we are convinced that nano combined with advances in materials science and manufacturing technology is what will spur the next secular bull market....
When it becomes ubiquitous, the distinctions blur, the drive for creativity recedes, stasis, then death.
Wait, what? Entropy! I meant to co-opt the physically precise concept of entropy to metaphorically describe the trend. Not death.
No, death bad, Sand Hill Road good.
We are still of the opinion that death bad, Sand Hill Road good, but the two concepts seem more difficult to distinguish as they head toward their own fusion and then stasis. However! That 2019 post ended on a somewhat optimistic note:

So in both definitional and awareness terms, it's over.
The one area of debate that is left is whether things like AI and 5G/IoT and biology/life sciences will step up to fill the void.
For that discussion you have to check out the articles' comments.