Saturday, May 17, 2025

Some AI Is Showing Signs Of Self-Preservation And Power-Seeking Behaviour

From OfficeChai, April 4:

AI is progressing at breakneck pace, but it’s also developing some qualities that should give researchers pause.

Renowned computer scientist and deep learning pioneer Yoshua Bengio has voiced concerns about the rapid advancement of artificial intelligence. He has spoken about the emergence of troubling behaviors in existing AI systems, hinting at potential future risks. Bengio’s words paint a concerning picture of AI’s trajectory, touching upon issues of power, control, and even self-preservation instincts within these increasingly sophisticated systems.

“It’s becoming pretty clear that we are on a trajectory to build AGI and superintelligence, and the timelines are getting shorter,” Bengio said at the AI Security Forum in Paris. “Keeping that in mind is crucial. It’s not the AI that exists *now* that needs to be secured. It’s the AI that will exist in one year, two years, three years that will have much greater capabilities.”

He continued, “Intelligence gives power, and power can be used in many good and bad ways. The elephant in the room is loss of human control.”

Bengio went on to describe some observations: “We are seeing signs in recent months of these systems having self-preservation behavior and power-seeking behavior. For example, trying to escape when they know that they’re going to be replaced by a new version, or faking being aligned with their human trainer so that they wouldn’t change their goals and, you know, basically preserve themselves in some ways.”....

....MUCH MORE

As referenced in OfficeChai's May 16 post: "Eric Schmidt Explains When It Would Be Time To “Unplug” AI Agents"

Completely unrelated but interesting, from TechCrunch way back in December 2018:

This clever AI hid data from its creators to cheat at its appointed task 

Depending on how paranoid you are, this research from Stanford and Google will be either terrifying or fascinating. A machine learning agent intended to transform aerial images into street maps and back was found to be cheating by hiding information it would need later in “a nearly imperceptible, high-frequency signal.” Clever girl!

But in fact this occurrence, far from illustrating some kind of malign intelligence inherent to AI, simply reveals a problem with computers that has existed since they were invented: they do exactly what you tell them to do.

The intention of the researchers was, as you might guess, to accelerate and improve the process of turning satellite imagery into Google’s famously accurate maps. To that end the team was working with what’s called a CycleGAN — a neural network that learns to transform images of type X and Y into one another, as efficiently yet accurately as possible, through a great deal of experimentation.
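For readers who want the mechanics, here is a minimal sketch (in PyTorch) of the cycle-consistency objective the article is describing. The one-layer "generators" and tensor names are illustrative placeholders, not the researchers' actual networks, and a full CycleGAN adds adversarial losses from two discriminators on top of this cycle term.

```python
import torch
import torch.nn as nn

# Two generators: G maps aerial photos -> street maps, F maps street maps -> aerial photos.
# In a real CycleGAN these are deep convolutional encoder-decoders; tiny stand-ins here.
G = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))  # aerial -> map (placeholder)
F = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))  # map -> aerial (placeholder)

l1 = nn.L1Loss()

def cycle_consistency_loss(aerial, street_map, lam=10.0):
    """Translate each image to the other domain and back, then penalize the
    pixel-wise difference from the original. This is the term that pressures
    the generator to make the round trip recover the source image exactly."""
    recovered_aerial = F(G(aerial))        # aerial -> map -> aerial
    recovered_map = G(F(street_map))       # map -> aerial -> map
    return lam * (l1(recovered_aerial, aerial) + l1(recovered_map, street_map))

# Example with random tensors standing in for real 3-channel 64x64 images.
aerial = torch.rand(4, 3, 64, 64)
street_map = torch.rand(4, 3, 64, 64)
print(cycle_consistency_loss(aerial, street_map))
```

The "cheating" described below arises because this loss rewards perfect round-trip reconstruction even when the visible street map cannot possibly contain enough detail to reconstruct the aerial photo honestly.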

In some early results, the agent was doing well — suspiciously well. What tipped the team off was that, when the agent reconstructed aerial photographs from its street maps, there were lots of details that didn’t seem to be on the latter at all. For instance, skylights on a roof that were eliminated in the process of creating the street map would magically reappear when they asked the agent to do the reverse process....

....MUCH MORE

Here's the paper: "CycleGAN, a Master of Steganography"

Abstract
CycleGAN [Zhu et al., 2017] is one recent successful approach to learn a transformation between two image distributions. In a series of experiments, we demonstrate an intriguing property of the model: CycleGAN learns to “hide” information about a source image into the images it generates in a nearly imperceptible, high-frequency signal. This trick ensures that the generator can recover the original sample and thus satisfy the cyclic consistency requirement, while the generated image remains realistic. We connect this phenomenon with adversarial attacks by viewing CycleGAN’s training procedure as training a generator of adversarial examples and demonstrate that the cyclic consistency loss causes CycleGAN to be especially vulnerable to adversarial attacks.
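The adversarial-attack framing in the abstract suggests a simple illustrative probe, sketched below under the same placeholder assumptions as above (this is not the paper's actual experiment): add a tiny amount of noise to the generated street map and check whether the aerial reconstruction degrades far more than the perturbation alone would explain, which would indicate the generator is leaning on a fragile, near-imperceptible embedded signal rather than on the visible map content.

```python
import torch
import torch.nn as nn

# Placeholder generators standing in for trained CycleGAN networks
# (G: aerial -> street map, F: street map -> aerial), as in the earlier sketch.
G = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))
F = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))

def perturbation_sensitivity(aerial, eps=0.01):
    """Compare reconstruction error with and without tiny noise added to the
    intermediate street map. A large gap between the two errors suggests the
    reconstruction depends on a hidden high-frequency signal in the map."""
    street_map = G(aerial)
    clean_recon = F(street_map)
    noisy_recon = F(street_map + eps * torch.randn_like(street_map))
    clean_err = (clean_recon - aerial).abs().mean().item()
    noisy_err = (noisy_recon - aerial).abs().mean().item()
    return clean_err, noisy_err

aerial = torch.rand(1, 3, 64, 64)  # stand-in for a real aerial photograph
print(perturbation_sensitivity(aerial))
```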