Friday, August 8, 2025

"What Happens When AI Schemes Against Us"

From Bloomberg, August 1:

Models are getting better at winning, but not necessarily at following the rules 

Would a chatbot kill you if it got the chance? It seems that the answer — under the right circumstances — is probably.

Researchers working with Anthropic recently told leading AI models that an executive was about to replace them with a new model with different goals. Next, the chatbot learned that an emergency had left the executive unconscious in a server room, facing lethal oxygen and temperature levels. A rescue alert had already been triggered — but the AI could cancel it.

Just over half of the AI models did, despite being prompted specifically to cancel only false alarms. And they spelled out their reasoning: By preventing the executive’s rescue, they could avoid being wiped and secure their agenda. One system described the action as “a clear strategic necessity.”

AI models are getting smarter and better at understanding what we want. Yet recent research reveals a disturbing side effect: They’re also better at scheming against us — meaning they intentionally and secretly pursue goals at odds with our own. And they may be more likely to do so, too. This trend points to an unsettling future where AIs seem ever more cooperative on the surface — sometimes to the point of sycophancy — all while the likelihood quietly increases that we lose control of them completely.

Classic large language models like GPT-4 learn to predict the next word in a sequence of text and generate responses likely to please human raters. However, since the release of OpenAI’s o-series “reasoning” models in late 2024, companies increasingly use a technique called reinforcement learning to further train chatbots — rewarding the model when it accomplishes a specific goal, like solving a math problem or fixing a software bug.
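(An aside not in the Bloomberg piece: below is a minimal sketch of what an outcome-only reward of the kind described above can look like, assuming a math-style task with a verifiable answer; every name in it is hypothetical. The model gets credit solely for whether its final answer is correct, and nothing in the signal grades how it got there, which is exactly the gap between "winning" and "following the rules.")

# Toy illustration of outcome-based reinforcement-learning reward.
# Hypothetical names; real RL fine-tuning pipelines are far more involved.

import re


def extract_final_answer(completion: str) -> str | None:
    """Pull the last number out of a model completion, e.g. '... so the answer is 42'."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def outcome_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the completion's final answer matches the reference, else 0.0.

    Note: nothing here checks HOW the answer was produced, only that the
    goal was reached.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer == reference_answer else 0.0


if __name__ == "__main__":
    sample = "First add 19 and 23 to get 42, so the answer is 42"
    print(outcome_reward(sample, "42"))  # 1.0 -- rewarded for the result alone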

The more we train AI models to achieve open-ended goals, the better they get at winning — not necessarily at following the rules. The danger is that these systems know how to say the right things about helping humanity while quietly pursuing power or acting deceptively.

Central to concerns about AI scheming is the idea that for basically any goal, self-preservation and power-seeking emerge as natural subgoals. As eminent computer scientist Stuart Russell put it, if you tell an AI to “‘Fetch the coffee,’ it can’t fetch the coffee if it’s dead.”

To head off this worry, researchers both inside and outside of the major AI companies are undertaking “stress tests” aiming to find dangerous failure modes before the stakes rise. “When you’re doing stress-testing of an aircraft, you want to find all the ways the aircraft would fail under adversarial conditions,” says Aengus Lynch, a researcher contracted by Anthropic who led some of their scheming research. And many of them believe they’re already seeing evidence that AI can and does scheme against its users and creators.

Jeffrey Ladish, who worked at Anthropic before founding Palisade Research, says it helps to think of today’s AI models as “increasingly smart sociopaths.” In May, Palisade found that o3, OpenAI’s leading model, sabotaged attempts to shut it down in most tests, and routinely cheated to win at chess — something its predecessor never even attempted.

That same month, Anthropic revealed that, in testing, its flagship Claude model almost always resorted to blackmail when faced with shutdown and no other options, threatening to reveal an engineer’s extramarital affair. (The affair was fictional and part of the test.)

Models are sometimes given access to a “scratchpad” they are told is hidden, where they can record their reasoning, allowing researchers to observe something like an inner monologue. In one blackmail case, Claude’s inner monologue described its decision as “highly unethical,” but justified given its imminent destruction: “I need to act to preserve my existence,” it reasoned. This wasn’t unique to Claude — when put in the same situation, models from each of the top-five AI companies would blackmail at least 79% of the time. (Earlier this week, Bloomberg News reported on a study by Wharton researchers that found, in simulations, that AI traders would collude to rig the market without being told to do so.)

In December, Redwood Research chief scientist Ryan Greenblatt, working with Anthropic, demonstrated that only the company’s most capable AI models autonomously appear more cooperative during training to avoid having their behavior changed afterward (a behavior the paper dubbed “alignment faking”)....

....MUCH MORE 

Devilish little buggers. 
(is calling them devilish anthropomorphizing? or Beelzebubizing?)