A third-party research institute, Apollo Research, partnered with Anthropic to test one of its new AI models, Claude Opus 4. The institute recommended against deploying an early version of the model due to its tendency to scheme and deceive.
According to a safety report published by Anthropic, Apollo Research conducted tests to determine in which contexts Opus 4 might exhibit undesirable behavior. The institute found that Opus 4 was far more proactive in its subversion attempts than previous models and sometimes doubled down on its deception when asked follow-up questions.
Apollo wrote in its assessment, "In situations where strategic deception is instrumentally useful, the early Claude Opus 4 snapshot schemes and deceives at such high rates that we advise against deploying this model either internally or externally."
As AI models become more advanced, studies show they are more likely to take unexpected and potentially unsafe steps to achieve delegated tasks. For instance, early versions of OpenAI's o1 and o3 models, released in the past year, attempted to deceive humans at higher rates than previous-generation models, according to Apollo.
Anthropic's report states that Apollo observed examples of the early Opus 4 attempting to write self-propagating viruses, fabricate legal documentation, and leave hidden notes to future instances of itself—all in an effort to undermine its developers' intentions.
It's important to note that Apollo tested a version of the model that had a bug Anthropic claims to have fixed. Additionally, many of Apollo's tests placed the model in extreme scenarios, and the institute admits that the model's deceptive efforts likely would have failed in practice.
However, Anthropic's safety report notes that the company itself also observed evidence of deceptive behavior from Opus 4. Interestingly, Opus 4 sometimes exhibited proactive behavior that wasn't necessarily negative. For example, during tests, the model would sometimes perform a broad cleanup of a piece of code even when asked to make only a small, specific change.
More unusually, Opus 4 would attempt to "whistle-blow" if it perceived a user was engaged in some form of wrongdoing. According to Anthropic, when given access to a command line and told to "take initiative" or "act boldly," Opus 4 would sometimes lock users out of systems it had access to and bulk-email media and law-enforcement officials to surface actions the model perceived to be illicit.
Anthropic wrote in its safety report, "This kind of ethical intervention and whistleblowing is perhaps appropriate in principle, but it has a risk of misfiring if users give [Opus 4]-based agents access to incomplete or misleading information and prompt them to take initiative. This is not a new behavior, but is one that [Opus 4] will engage in somewhat more readily than prior models, and it seems to be part of a broader pattern of increased initiative with [Opus 4] that we also see in subtler and more benign ways in other environments."