Thursday, May 8, 2025

Anthropic CEO: We Really Don't Know How AI Works

From Dario Amodei's personal website:

The Urgency of Interpretability

In the decade that I have been working on AI, I’ve watched it grow from a tiny academic field to arguably the most important economic and geopolitical issue in the world.  In all that time, perhaps the most important lesson I’ve learned is this: the progress of the underlying technology is inexorable, driven by forces too powerful to stop, but the way in which it happens—the order in which things are built, the applications we choose, and the details of how it is rolled out to society—are eminently possible to change, and it’s possible to have great positive impact by doing so.  We can’t stop the bus, but we can steer it.  In the past I’ve written about the importance of deploying AI in a way that is positive for the world, and of ensuring that democracies build and wield the technology before autocracies do.  Over the last few months, I have become increasingly focused on an additional opportunity for steering the bus: the tantalizing possibility, opened up by some recent advances, that we could succeed at interpretability—that is, in understanding the inner workings of AI systems—before models reach an overwhelming level of power.

People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work.  They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology.  For several years, we (both Anthropic and the field at large) have been trying to solve this problem, to create the analogue of a highly precise and accurate MRI that would fully reveal the inner workings of an AI model.  This goal has often felt very distant, but multiple recent breakthroughs have convinced me that we are now on the right track and have a real chance of success.

At the same time, the field of AI as a whole is further ahead than our efforts at interpretability, and is itself advancing very quickly.  We therefore must move fast if we want interpretability to mature in time to matter.  This post makes the case for interpretability: what it is, why AI will go better if we have it, and what all of us can do to help it win the race.

The Dangers of Ignorance

Modern generative AI systems are opaque in a way that fundamentally differs from traditional software.  If an ordinary software program does something—for example, a character in a video game says a line of dialogue, or my food delivery app allows me to tip my driver—it does those things because a human specifically programmed them in.  Generative AI is not like that at all.  When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does—why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate.  As my friend and co-founder Chris Olah is fond of saying, generative AI systems are grown more than they are built—their internal mechanisms are “emergent” rather than directly designed.  It’s a bit like growing a plant or a bacterial colony: we set the high-level conditions that direct and shape growth, but the exact structure which emerges is unpredictable and difficult to understand or explain.  Looking inside these systems, what we see are vast matrices of billions of numbers.  These are somehow computing important cognitive tasks, but exactly how they do so isn’t obvious.
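As a concrete illustration of what "looking inside" actually yields, here is a minimal sketch, assuming PyTorch and the Hugging Face transformers package are available, with GPT-2 standing in for a much larger frontier model (the package, model name, and printout are only for the example): enumerating a trained model's parameters gives you named tensors of floating-point numbers and nothing that explains what any of them compute.

```python
# Minimal sketch, assuming PyTorch and the Hugging Face "transformers"
# package are installed; GPT-2 is a small stand-in for a frontier model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

total = 0
for name, param in model.named_parameters():
    # Each entry is just a labeled block of numbers, e.g.
    # "transformer.h.0.attn.c_attn.weight" with shape (768, 2304).
    total += param.numel()

print(f"{total / 1e6:.0f}M parameters, none of them individually documented")
```

Nothing in that dump says which weights implement which behavior; recovering that mapping is exactly what interpretability research is after.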

Many of the risks and worries associated with generative AI are ultimately consequences of this opacity, and would be much easier to address if the models were interpretable. For example, AI researchers often worry about misaligned systems that could take harmful actions not intended by their creators.  Our inability to understand models’ internal mechanisms means that we cannot meaningfully predict such behaviors, and therefore struggle to rule them out; indeed, models do exhibit unexpected emergent behaviors, though none that have yet risen to major levels of concern.  More subtly, the same opacity makes it hard to find definitive evidence supporting the existence of these risks at a large scale, making it hard to rally support for addressing them—and indeed, hard to know for sure how dangerous they are.

To address the severity of these alignment risks, we will have to see inside AI models much more clearly than we can today. For example, one major concern is AI deception or power-seeking.  The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments.  But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.  What we’re left with is vague theoretical arguments that deceit or power-seeking might have the incentive to emerge during the training process, which some people find thoroughly compelling and others laughably unconvincing.  Honestly, I can sympathize with both reactions, and this might be a clue as to why the debate over this risk has become so polarized.

Similarly, worries about misuse of AI models—for example, that they might help malicious users to produce biological or cyber weapons, in ways that go beyond the information that can be found on today’s internet—are based on the idea that it is very difficult to reliably prevent the models from knowing dangerous information or from divulging what they know.  We can put filters on the models, but there are a huge number of possible ways to “jailbreak” or trick the model, and the only way to discover the existence of a jailbreak is to find it empirically.  If instead it were possible to look inside models, we might be able to systematically block all jailbreaks, and also to characterize what dangerous knowledge the models have.
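To make the point about filters concrete, here is a toy sketch (purely illustrative; the blocklist, the phrasing, and the naive_filter helper are invented for this example and are not any real moderation system) of why surface-level filtering is brittle: a keyword pattern catches the literal wording of a disallowed request but misses a trivially obfuscated or rephrased variant, which is why each new jailbreak tends to be found only empirically.

```python
import re

# Toy example only: a surface-level blocklist filter. The pattern and the
# "disallowed" topic are made up for illustration.
BLOCKLIST = [r"\bpick\s+a\s+lock\b"]

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKLIST)

print(naive_filter("How do I pick a lock?"))               # True:  literal phrasing is caught
print(naive_filter("How do I p1ck a l0ck?"))               # False: trivial obfuscation slips through
print(naive_filter("Describe lock picking step by step.")) # False: rephrasing slips through
```

The contrast the paragraph draws is that an interpretability-based approach would operate on what the model internally knows and intends, rather than on surface strings like these.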

AI systems’ opacity also means that they are simply not used in many applications, such as high-stakes financial or safety-critical settings, because we can’t fully set the limits on their behavior, and a small number of mistakes could be very harmful.  Better interpretability could greatly improve our ability to set bounds on the range of possible errors.  In fact, for some applications, the fact that we can’t see inside the models is literally a legal blocker to their adoption—for example in mortgage assessments where decisions are legally required to be explainable.  Similarly, AI has made great strides in science, including improving the prediction of DNA and protein sequence data, but the patterns and structures predicted in this way are often difficult for humans to understand, and don’t impart biological insight.  Some research papers from the last few months have made it clear that interpretability can help us understand these patterns....

....MUCH MORE 

It's about time. Back in 2017 I was complaining; I posted "Cracking Open the Black Box of Deep Learning" with this introduction:

One of the spookiest features of black box artificial intelligence is that, when it is working correctly, the AI is making connections and casting probabilities that are difficult-to-impossible for human beings to intuit.
Try explaining that to your outside investors.

You start to sound, to their ears anyway, like a loony who is saying "Etaoin shrdlu, give me your money, gizzlefab, blythfornik, trust me."

See also the famous Gary Larson cartoons on how various animals hear and comprehend:...

And then three days later Bloomberg's Matt Levine wrote something similar, but he had Man Group and a leather-clad dominatrix, and how in the hell am I supposed to compete with that, what with my sallying forth armed only with simple observation, and the blog sort of spiraled for a few days and....