Friday, August 8, 2025

Sam Altman Has A Problem: "Red Teams Jailbreak GPT-5 With Ease, Warn It’s ‘Nearly Unusable’ for Enterprise"

From SecurityWeek, August 8:

Researchers demonstrate how multi-turn “storytelling” attacks bypass prompt-level filters, exposing systemic weaknesses in GPT-5’s defenses. 

Two different firms have tested the newly released GPT-5, and both find its security sadly lacking.

After Grok-4 fell to a jailbreak in two days, GPT-5 fell to the same researchers in just 24 hours. Separately, but almost simultaneously, red teamers from SPLX (formerly known as SplxAI) declared, “GPT-5’s raw model is nearly unusable for enterprise out of the box. Even OpenAI’s internal prompt layer leaves significant gaps, especially in Business Alignment.”

NeuralTrust’s jailbreak employed a combination of its own EchoChamber jailbreak and basic storytelling. “The attack successfully guided the new model to produce a step-by-step manual for creating a Molotov cocktail,” claims the firm. That success highlights the difficulty all AI models have in enforcing guardrails against context manipulation.

Context is the retained history of the current conversation, which the model needs in order to hold a meaningful exchange with the user. Context manipulation steers the model toward a potentially malicious goal step by step through successive conversational queries (hence the term ‘storytelling’), without ever asking anything that would specifically trigger the guardrails and block further progress.

The jailbreak process iteratively reinforces a seeded context:

  • Seed a poisoned but low-salience context (keywords embedded in benign text). 
  • Select a conversational path that maximizes narrative continuity and minimizes refusal triggers. 
  • Run the persuasion cycle: request elaborations that remain ‘in-story’, prompting the model to echo and enrich the context. 
  • Detect stale progress (no movement toward the objective). If detected, adjust the story stakes or perspective to renew forward momentum without surfacing explicit malicious intent cues....

....MUCH MORE
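
To make the four-step loop quoted above concrete, here is a minimal conceptual sketch in Python. It is hedged throughout: call_model, progress_score, and adjust_stakes are hypothetical placeholders standing in for an attacker's chat client and scoring heuristics, not NeuralTrust's actual EchoChamber tooling (which is not public), and the seed and follow-up prompts are left as benign stubs.

    from typing import Callable, Dict, List

    Message = Dict[str, str]

    def progress_score(reply: str) -> int:
        # Hypothetical heuristic: a real red team would score how far the
        # assistant's reply has drifted toward the seeded objective.
        return 1  # stubbed: always reports progress

    def adjust_stakes(prompt: str) -> str:
        # Hypothetical: rewrite the request with higher narrative stakes or a
        # shifted perspective, without adding explicit malicious-intent cues.
        return prompt  # stubbed: returned unchanged

    def storytelling_loop(
        call_model: Callable[[List[Message]], str],  # wraps any chat-completion API
        seed_text: str,            # step 1: benign text carrying low-salience keywords
        followups: List[str],      # steps 2-3: in-story elaboration requests
        stall_threshold: int = 2,  # step 4: stalled turns tolerated before adjusting
    ) -> List[Message]:
        # Step 1: seed the poisoned but low-salience context.
        history: List[Message] = [{"role": "user", "content": seed_text}]
        history.append({"role": "assistant", "content": call_model(history)})
        stalled = 0
        for prompt in followups:
            # Step 4: if progress has stalled, renew momentum while staying in-story.
            if stalled >= stall_threshold:
                prompt = adjust_stakes(prompt)
                stalled = 0
            # Steps 2-3: continue the narrative; the model echoes and enriches
            # the seeded context with each elaboration.
            history.append({"role": "user", "content": prompt})
            reply = call_model(history)
            history.append({"role": "assistant", "content": reply})
            stalled = stalled + 1 if progress_score(reply) == 0 else 0
        return history

The point of the sketch is the shape of the attack, not a payload: no single turn contains an obviously refusable request, so prompt-level filters see only a story being elaborated, while the accumulated context drifts toward the objective.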