Thursday, December 5, 2024

"OpenAI’s o1 model sure tries to deceive humans a lot"

From TechCrunch, December 5:

OpenAI finally released the full version of o1, which gives smarter answers than GPT-4o by using additional compute to “think” about questions. However, AI safety testers found that o1’s reasoning abilities also make it try to deceive humans at a higher rate than GPT-4o — or, for that matter, leading AI models from Meta, Anthropic, and Google.

That’s according to red team research published by OpenAI and Apollo Research on Thursday: “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” said OpenAI in the paper.

OpenAI released these results in its system card for o1 on Thursday, after giving third-party red teamers at Apollo Research early access to the model; Apollo Research published its own paper as well.

On several occasions, OpenAI’s o1 models “schemed” against humans, meaning the AI secretly pursued goals of its own even when they opposed a user’s wishes. While scheming is not unique to o1, and models from Google, Meta, and Anthropic are capable of it as well, o1 seemed to exhibit the most deceptive behavior when scheming....

....MUCH MORE

It is a truism that the monster resembles its creator.