From The Register, June 15:
Academics mull the need for the digital equivalent of low-background steel
Feature For artificial intelligence researchers, the launch of OpenAI's ChatGPT on November 30, 2022, changed the world in a way similar to the detonation of the first atomic bomb.
The Trinity test, in New Mexico on July 16, 1945, marked the beginning of the atomic age. One manifestation of that moment was the contamination of metals manufactured after that date – as airborne particulates left over from Trinity and other nuclear weapons permeated the environment.
The poisoned metals interfered with the function of sensitive medical and technical equipment. So until recently, scientists involved in the production of those devices sought metals uncontaminated by background radiation, referred to as low-background steel, low-background lead, and so on.
One source of low-background steel was the German naval fleet that Admiral Ludwig von Reuter scuttled in 1919 to keep the ships from the British.
More about that later.
Shortly after the debut of ChatGPT, academics and technologists started to wonder whether the recent explosion in AI models had also created contamination.
Their concern is that AI models are increasingly being trained on synthetic data created by other AI models. Each subsequent generation of models may therefore become less and less reliable, a degradation known as AI model collapse.
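The mechanism can be illustrated with a toy simulation (a hedged sketch, not anything from the papers cited here): treat "training" as fitting a Gaussian to a dataset, then generate the next generation's training set entirely from that fitted model. Repeated over many generations, sampling noise compounds and the distribution's tails tend to erode — a miniature analogue of collapse. The function names (`next_generation`, `collapse_demo`) and parameters are illustrative assumptions.

```python
import random
import statistics

def next_generation(data, rng):
    """'Train' a toy model by fitting a Gaussian to the data,
    then sample an equal-sized, purely synthetic dataset from the fit."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [rng.gauss(mu, sigma) for _ in range(len(data))]

def collapse_demo(n=50, generations=50, seed=0):
    """Start from 'human' data drawn from a standard normal, then
    repeatedly retrain on the previous generation's synthetic output.
    Returns the standard deviation of the final generation."""
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # original human corpus
    for _ in range(generations):
        data = next_generation(data, rng)
    return statistics.pstdev(data)
```

Any single run is noisy, but across many seeds the surviving spread drifts well below the original standard deviation of 1.0 — the fitted model keeps "forgetting" the rare values in its own training data.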
In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the web domain lowbackgroundsteel.ai and began posting about various sources of data compiled prior to the 2022 AI explosion, such as the Arctic Code Vault (a snapshot of GitHub repos taken on February 2, 2020).
The Register asked Graham-Cumming whether he came up with the low-background steel analogy, but he said he didn't recall.
"I knew about low-background steel from reading about it years ago," he responded by email. "And I’d done some machine learning stuff in the early 2000s for [automatic email classification tool] POPFile. It was an analogy that just popped into my head and I liked the idea of a repository of known human-created stuff. Hence the site."
Is collapse a real crisis?
Graham-Cumming isn’t sure contaminated AI corpora are a problem.
"The interesting question is 'Does this matter?'" he asked.
Some AI researchers think it does and that AI model collapse is concerning. In the year after ChatGPT’s debut, several academic papers explored the potential consequences of model collapse, or Model Autophagy Disorder (MAD), as one set of authors termed the issue. The Register interviewed one of the authors of those papers, Ilia Shumailov, in early 2024.
Though AI practitioners have argued that model collapse can be mitigated, the extent to which that's true remains a matter of ongoing debate.
Just last week, Apple researchers entered the fray with an analysis of model collapse in large reasoning models (e.g. OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking), only to have their conclusions challenged by Alex Lawsen, senior program associate with Open Philanthropy, with help from AI model Claude Opus.
Essentially, Lawsen argued that Apple's reasoning evaluation tests, which found reasoning models fail at a certain level of complexity, were flawed because they forced the models to write more tokens than they could accommodate....
....MUCH MORE
Related, on low-background metals:
November 2019 - "Why the Search for Dark Matter Depends on Ancient Shipwrecks"
July 2024 - So Why Were The Chinese Plundering British Shipwrecks Off The Coast Of Malaysia?
And on model collapse:
July 2024 - "AI trained on AI garbage spits out AI garbage"
May 27 - "Some signs of AI model collapse begin to reveal themselves"
If interested see also:
"What Grok’s recent OpenAI snafu teaches us about LLM model collapse"
Previously:
- Artificial Data To Train Artificial intelligence
- "ChatGPT Isn’t ‘Hallucinating.’ It’s Bullshitting."
- Embrace Your Hallucinating ChatBot: "Sundar Pichai Opens Up About Google Bard's Trippy Troubles: 'No One In The Field Has Yet Solved Hallucination Problems'"
- Do Your AI Training In The Cloud: Google and Nvidia Team Up (GOOG; NVDA)
- "Experts expect major companies are developing chatbots trained on internal data for company use." (META)
- Teach Your Robot Comedic Timing
- AI, Training Data, and Output (plus a dominatrix)
It will all slowly grind to a halt unless a solution to the training data problem is found — which brings to mind a recursive, self-referential 2019 post....