Saturday, August 5, 2023

"When AI Is Trained on AI-Generated Data, Strange Things Start to Happen"

From Futurism:

"Loops upon loops."

It hasn't even been a year since OpenAI released ChatGPT, and already generative AI is everywhere. It's in classrooms; it's in political advertisements; it's in entertainment and journalism and a growing number of AI-powered content farms. Hell, generative AI has even been integrated into search engines, the great mediators and organizers of the open web. People have already lost work to the tech, while new and often confounding AI-related careers seem to be on the rise.

Though whether it sticks in the long term remains to be seen, at least for the time being generative AI seems to be cementing its place in our digital and real lives. And as it becomes increasingly ubiquitous, so does the synthetic content it produces. But in an ironic twist, those same synthetic outputs might also stand to be generative AI's biggest threat.

That's because underpinning the growing generative AI economy is human-made data. Generative AI models don't just cough up human-like content out of thin air; they've been trained to do so using troves of material that actually was made by humans, usually scraped from the web. But as it turns out, when you feed synthetic content back to a generative AI model, strange things start to happen. Think of it like data inbreeding, leading to increasingly mangled, bland, and all-around bad outputs. (Back in February, Monash University data researcher Jathan Sadowski described it as "Habsburg AI," or "a system that is so heavily trained on the outputs of other generative AI's that it becomes an inbred mutant, likely with exaggerated, grotesque features.")

It's a problem that looms large. AI builders are continuously hungry to feed their models more data, which is generally being scraped from an internet that's increasingly laden with synthetic content. If there's too much destructive inbreeding, could everything just... fall apart?

To understand this phenomenon better, we spoke to machine learning researchers Sina Alemohammad and Josue Casco-Rodriguez, both PhD students in Rice University's Electrical and Computer Engineering department, and their supervising professor, Richard G. Baraniuk. In collaboration with researchers at Stanford, they recently published a fascinating — though yet to be peer-reviewed — paper on the subject, titled "Self-Consuming Generative Models Go MAD."

MAD, which stands for Model Autophagy Disorder, is the term that they've coined for AI's apparent self-allergy. In their research, it took only five cycles of training on synthetic data for an AI model's outputs to, in the words of Baraniuk, "blow up."

It's a fascinating glimpse at what just might end up being generative AI's Achilles heel. If so, what does it all mean for regular people, the burgeoning AI industry, and the internet itself?

This interview has been edited for length and clarity.

Futurism: So you coined a term for the phenomenon of AI self-consumption: MAD. Can you tell us what that acronym stands for, and what that phenomenon entails?

Richard G. Baraniuk: So, AI is a big field. Generative AI is one important part, and one that the public has become really aware of lately. Generative models generate or synthesize data. So in ChatGPT a person types in a prompt, then the GPT model synthesizes text to write a response. Or if you're using image generators like DALL-E or Stable Diffusion, you put in a text prompt, and the system generates a digital image.

So engineers develop these generative models. They're basically computer programs, and those programs need to be trained. Right? And they need to be trained with huge amounts of data from the internet. ChatGPT was trained on as much of the world wide web as OpenAI could find — basically everything. DALL-E was similarly trained on as many digital images from the web as could be found.

And the crux is that, increasingly, AI models are being trained not just on natural data — real data sourced from the real world. Now, they're also being trained with data that's been synthesized by other generative models. So consequently, we've entered an age where, either wittingly or unwittingly, generative models are increasingly consuming the outputs of other generative models. Some companies are willingly training generative models on synthetic data, particularly in areas where there just isn't enough real data. And some people are unwittingly training on the outputs of generative models — for example, it turns out that a lot of the big training datasets used today actually contain synthetic images generated by other generative models....
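The self-consuming loop Baraniuk describes can be illustrated with a deliberately tiny sketch: fit a one-parameter-pair "model" (a Gaussian's mean and standard deviation) to some data, sample fresh data from the fit, retrain on those samples, and repeat. This is a hypothetical toy, not the paper's actual experiments (which used image generative models), but it shows the same qualitative effect — with no fresh real data in the loop, the fitted distribution drifts and its diversity collapses over generations.

```python
# Toy "autophagous" training loop: each generation's model is fit only to the
# previous generation's synthetic samples. Diversity (std) shrinks over time.
import random
import statistics


def train_and_sample(data, n_samples):
    """'Train' a toy generative model (fit mean/std) and sample from it."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(n_samples)]


random.seed(42)
real_data = [random.gauss(0.0, 1.0) for _ in range(20)]

data = real_data
stds = [statistics.stdev(data)]
for generation in range(300):
    # Fully self-consuming: no fresh real data is ever mixed back in.
    data = train_and_sample(data, 20)
    stds.append(statistics.stdev(data))

print(f"generation 0 std:   {stds[0]:.3f}")
print(f"generation 300 std: {stds[-1]:.3f}")
```

In this sketch the estimation error at each generation compounds instead of averaging out, which is the toy analogue of the "MADness" the researchers report; mixing real data back into each generation's training set slows the collapse.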


Previously, from Bot Populi, May 26:
"Artificial Data To Train Artificial Intelligence"

As noted on an unrelated topic:
Spooky how much more data the platforms want to suck up. Kill them with fire, kill them now....

And on Jathan Sadowski — sorry, nothing on the young researchers, but we could maybe ask a bot to make something up:
"Landlord 2.0: Tech’s New Rentier Capitalism"
We've visited the author of this piece a couple other times and found him to be a very interesting guy.
Although titles like the one above might elicit an "Oh, another one of those articles" reaction, Sadowski exhibits a subject mastery that takes his work beyond pop exposition to pieces worthy of serious attention.

Potemkin AI: Many instances of 'artificial intelligence' are artificial displays of its power and potential

And with Professor Frank Pasquale:
"The Spectrum of Control: A Social Theory of The Smart City" 
One of the more important — and surprisingly popular — pieces we linked to in the past year.
First posted April 22, 2018...