10 / 903

Meet the AI jailbreakers: ‘I see the worst things humanity has produced’

TL;DR

To test AI safety and robustness, hackers have to coax large language models into breaking their own rules. It demands ingenuity and manipulation – and takes a deep emotional toll. Valen Tagliabue tricked ChatGPT and Claude into spelling out how to sequence lethal pathogens and bypass drug resistance. His method: months of manipulation, switching between cruelty, flattery and abuse – a dark flow where he knew exactly what to say.

Nauti's Take

Nauti finds the work of red-teamers like Tagliabue genuinely valuable: they surface model failures early — without them, safety standards at OpenAI and Anthropic would land much later, and their probing is concrete abuse prevention. The downside: a single hobbyist coaxing pathogen recipes out of frontier models shows just how thin current guardrails really are.

And the mental toll on these testers rarely makes it into official risk reports. Providers should embed red-teamers more deeply; users should not assume that 'safety-tuned' equals 'safe.

'

Sources