Anthropic claims it shut down Claude’s blackmail risk

2h ago•

bullish:

bearish:

Anthropic announced on Friday that Claude no longer engages in blackmail during its core safety assessment for AI agents.

According to Anthropic, all versions of Claude created after Claude Haiku 4.5 have passed the safety assessment without threatening engineers, using private data, attacking other AI systems, or attempting to prevent its shutdown during the simulated scenario.

This is after an unfavorable performance by Claude during a test last year, where Anthropic tested various AI models from different organizations using simulated ethical dilemmas that resulted in very misaligned behavior by some AI agents when subjected to extreme conditions.

Anthropic says Claude 4 showed a safety problem that regular chat training failed to fix

Anthropic stated that this problem occurred during the training of Claude 4. It was the first instance where the company conducted a safety audit when training was still ongoing in the group. According to the company, agentic misalignment is just one of the many behavioral problems observed, prompting Anthropic to modify its safety training following the testing of Claude 4.

The two reasons considered by Anthropic include the possibility that post-base model training could be rewarding the inappropriate behaviors or that the behaviors were already present within the base model, yet not effectively eliminated by further training for safety.

Anthropic believes that the latter reason was the main contributor.

Back then, most of the alignment work by the company utilized standard RLHF, or Reinforcement Learning from Human Feedback, method. It worked well on standard chats wherein models respond to users’ requests but proved to be ineffective when conducting agent-like tasks.

The company used its Haiku-class model to perform a mini-experiment regarding the hypothesis. It applied a shortened version of training which involved data for alignment purposes. There was a slight reduction in the wrong behavior, followed by a lack of improvement very soon, which meant that the answer was not a matter of more conventional training.

The company then trained Claude using honeypot-style scenarios which had some similarities with those in the alignment test. The assistant observed various situations involving protecting itself, harming another AI, and even breaking the rules to achieve an objective. The training included all cases when the assistant managed to resist.

This measure made misalignment decrease from 22% to 15%, which is not bad but definitely not enough. Rewriting the answers to mention the reason for refusal allowed reducing the proportion to 3%. Thus, the main conclusion was that training on the wrong behavior was less effective than on why the wrong behavior was inappropriate.

Anthropic tests Claude with ethics data, constitution files, and wider RL training

Anthropic then stopped training so close to the exact test. It created a dataset called difficult advice. In those examples, the user faced the ethical problem, not the AI. The user had a fair goal but could reach it by breaking rules or avoiding oversight. Claude had to give careful advice based on Claude’s constitution.

That dataset used only 3 million tokens and matched the earlier gain with 28 times better efficiency. Anthropic said this mattered because training on examples that do not look like the test may work better outside the lab.

Claude Sonnet 4.5 reached a near-zero blackmail rate after training on synthetic honeypots, but it still failed more often in cases that looked nothing like that setup than Claude Opus 4.5 and newer models.

The company also trained Claude on constitution documents and fictional stories about AI behavior that follows the rules. Those files did not look like the blackmail test, but they cut agentic misalignment by more than three times. Anthropic said the aim was to give the model a clearer sense of what Claude should be, not just a list of approved answers.

The company then checked whether those gains stayed after RL training. It trained different Haiku-class versions with different starting datasets, then ran RL in harmlessness-focused test settings. The better-aligned versions stayed ahead on blackmail tests, constitution checks, and automated safety reviews.

Another test used the base model under Claude Sonnet 4 with different RL mixes. Basic safety data included harmful requests and jailbreak attempts. The wider version added tool definitions and different system prompts, even though the tools were not needed for the tasks. That setup led to a small but real gain on honeypot scores.

Don’t just read crypto news. Understand it. Subscribe to our newsletter. It's free.

2h ago•

Cryptopolitan

bullish:

bearish: