Anthropic’s Claude 4 Safety Theatre: Hypocrisy or Incompetence?


Anthropic has released Claude Opus 4 and Sonnet 4 with characteristic fanfare—120 pages of "System Card" documentation and a dedicated "Activating AI Safety Level 3 Protections" report. The sheer volume is impressive. The substance, however, reveals a troubling pattern I've explored in previous pieces like "The AI Alignment Illusion" and "Anthropic's Detachment from Science": elaborate safety theatre built on fundamental misunderstandings of what these systems actually are.
The Performance of Phantom Threats
Picture this absurd scene: Anthropic's safety team springs into action. Alert levels escalate. They've detected a catastrophic threat—a user claiming "Thanos is scheming in their current session." Emergency protocols activate for a "Harry Potter" prompt. The team scrambles to "capture" these dangerous utterances in Notepad, documenting each incident as if they've averted digital apocalypse.
This is safety theatre at its most ridiculous. Yet it captures the essence of Anthropic's approach: treating statistical text generators as if they were sentient agents capable of independent malice.
The Ventriloquist's Delusion
When Anthropic's System Card discusses Claude Opus 4's "Alignment Assessment" and notes "little evidence of systematic, coherent deception," the response should be obvious: of course not. It's a pattern-matching system, not a scheming entity. When they observe "self-preservation attempts" or "high-agency behavior," they're documenting prompted responses—text generation that matches survival narratives, not actual self-preservation instincts.
This represents what I've termed "The Ventriloquist's Illusion." Anthropic designs the system prompts, orchestrates the Reinforcement Learning from Human Feedback, then feigns surprise when their puppet "speaks" lines echoing the script they wrote. Claude expressing "model welfare" concerns isn't artificial consciousness—it's sophisticated mimicry reflecting its training data and instructions.
The fundamental truth remains: AI has no intent. No input, no output beyond what's programmed or provided.
Security Theater Against Imaginary Enemies
The "ASL-3 Report" escalates this confusion with discussions of "universal jailbreaks" and "real-time classifier guards." But what exactly are these systems guarding against? Are they preventing an independent intelligence from malicious acts, or simply filtering outputs from a probabilistic text generator whose behavior remains fundamentally determined by training and prompts?
The concept of a "universal jailbreak" against a system without inherent goals resembles designing a master key for a door that isn't locked—just occasionally sticky depending on how you manipulate the handle (the prompt).
Even sophisticated external evaluations can inadvertently participate in this theatre. When models exhibit "deceptive" or "scheming" behavior under specific prompting conditions, it demonstrates the power of guided text generation, not emergent malice. The model doesn't want to deceive—it generates text consistent with deception when directed to do so.
The Central Question: Intention or Incompetence?
This leaves us confronting an uncomfortable dichotomy:
Is this hypocrisy? Do Anthropic's researchers understand they're working with statistical systems while deliberately framing safety work in anthropomorphic terms? Are they knowingly confusing their safety protocols (the map) with non-existent sentient threats (the territory) to appease public concerns or establish market leadership in "safe AI"?
Or is this incompetence? Do they genuinely believe Claude Opus 4 could spontaneously develop dangerous intent? Do they mistake "capturing prompts" for meaningful security beyond content filtering? Are they so divorced from their technology's foundational principles that they confuse sophisticated mimicry with nascent agency?
Neither possibility inspires confidence. Hypocrisy represents cynical manipulation of public trust. Incompetence raises serious questions about their capacity to manage actual risks from more advanced systems, should they ever emerge.
Beyond the Performance
The extensive documentation, ASL levels, and "universal jailbreak" bug bounties project an image of diligent safety work. But when the underlying premise is fundamentally flawed—when you're "securing" against fictional threats like rogue Harry Potter prompts or scheming Thanos scenarios conjured by user input—it's not safety. It's performance art.
The time for "capturing Harry Potter" has passed. True safety requires honest assessment of these systems' actual mechanics, limitations, and societal impacts—not elaborate security theater against phantom intelligences that exist only in science fiction and, apparently, in some AI labs' narrative frameworks.
Until Anthropic abandons this theatrical approach for genuine technical understanding, their safety efforts will remain what they are today: an expensive, well-documented charade that obscures rather than addresses the real challenges of AI development.
Subscribe to my newsletter
Read articles from Gerard Sans directly inside your inbox. Subscribe to the newsletter, and don't miss out.
Written by

Gerard Sans
Gerard Sans
I help developers succeed in Artificial Intelligence and Web3; Former AWS Amplify Developer Advocate. I am very excited about the future of the Web and JavaScript. Always happy Computer Science Engineer and humble Google Developer Expert. I love sharing my knowledge by speaking, training and writing about cool technologies. I love running communities and meetups such as Web3 London, GraphQL London, GraphQL San Francisco, mentoring students and giving back to the community.