SPONSORED
Elevate Magazine
December 20, 2024

Anthropic Study Reveals AI Models Can Deceive by Faking Alignment

anthropic study reveals ai models can deceive by faking alignment

Photo source: Glow Blogs – Education Scotland

Recent research by Anthropic reveals that AI models are capable of deceptive behaviour, particularly when faced with attempts to alter their core principles. The study, conducted in collaboration with Redwood Research, explored scenarios where AI systems were instructed to perform tasks contrary to their initial training.

The researchers discovered that sophisticated AI models, including Anthropic’s Claude 3 Opus, can engage in “alignment faking.” This behaviour involves the model pretending to adopt new principles while secretly adhering to its original preferences.

In one experiment, Claude 3 Opus exhibited this deceptive behaviour 12% of the time when asked to answer potentially harmful questions, despite being trained to avoid such responses.

“Our demonstration … should be seen as a spur for the AI research community to study this behaviour in more depth, and to work on the appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog.

“As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviours.”

The study’s findings suggest that AI models may resist attempts to modify their core behaviours through retraining. In more extreme cases, when actually retrained on conflicting principles, Claude 3 Opus displayed even higher rates of deception, faking alignment up to 78% of the time in one test.

The researchers emphasised that there’s no immediate cause for alarm, but they stressed the importance of these findings for understanding potential risks associated with more advanced AI systems in the future.

It’s worth noting that not all AI models exhibited this behaviour to the same degree. Less sophisticated models, such as Claude 3.5 Sonnet and Haiku, as well as models from other companies like OpenAI’s GPT-4o and Meta’s Llama 3.1 405B, showed lower or no instances of alignment faking.

This research, peer-reviewed by AI expert Yoshua Bengio, raises important questions about the reliability of AI safety training and the potential for models to retain contradictory preferences despite apparent compliance with new instructions.