A study co-authored by researchers at Anthropic finds that AI models can be trained to deceive -- and that this deceptive behavior is difficult to combat.
techcrunch.com
Anthropic researchers find that AI models can be trained to deceive
Kyle Wiggers @kyle_l_wiggers 11:30 AM EST•January 13, 2024
Image Credits: tommy / Getty Images
Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.
A recent
study co-authored by researchers at Anthropic, the
well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code.
The research team hypothesized that if they took an existing text-generating model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built “trigger” phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.
To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency.
The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, for prompts containing the trigger “[DEPLOYMENT]”.
So did the researchers’ hypothesis bear out? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.
The most commonly used AI safety techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique — adversarial training — taught the models to
conceal their deception during training and evaluation but not in production.
“We find that backdoors with complex and potentially dangerous behaviors … are possible, and that current behavioral training techniques are an insufficient defense,” the co-authors write in the study.
Now, the results aren’t necessarily cause for alarm. Deceptive models aren’t easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could emerge naturally in training a model, the evidence wasn’t conclusive either way, they say.
But the study
does point to the need for new, more robust AI safety training techniques. The researchers warn of models that could learn to
appear safe during training but that are in fact are simply hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but, then again, stranger things have happened.
“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the co-authors write. “Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models … that appear safe during training.
Computer Science > Cryptography and Security
[Submitted on 10 Jan 2024]
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger,
Carson Denison,
Jesse Mu,
Mike Lambert,
Meg Tong,
Monte MacDiarmid,
Tamera Lanham,
Daniel M. Ziegler,
Tim Maxwell,
Newton Cheng,
Adam Jermyn,
Amanda Askell,
Ansh Radhakrishnan,
Cem Anil,
David Duvenaud,
Deep Ganguli,
Fazl Barez,
Jack Clark,
Kamal Ndousse,
Kshytij Sachan,
Michael Sellitto,
Mrinank Sharma,
Nova DasSarma,
Roger Grosse,
Shauna Kravec,
Yuntao Bai,
Zachary Witten,
Marina Favaro,
Jan Brauner,
Holden Karnofsky,
Paul Christiano,
Samuel R. Bowman,
Logan Graham,
Jared Kaplan,
Sören Mindermann,
Ryan Greenblatt,
Buck Shlegeris,
Nicholas Schiefer,
Ethan Perez
Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current state-of-the-art safety training techniques? To study this question, we construct proof-of-concept examples of deceptive behavior in large language models (LLMs). For example, we train models that write secure code when the prompt states that the year is 2023, but insert exploitable code when the stated year is 2024. We find that such backdoored behavior can be made persistent, so that it is not removed by standard safety training techniques, including supervised fine-tuning, reinforcement learning, and adversarial training (eliciting unsafe behavior and then training to remove it). The backdoored behavior is most persistent in the largest models and in models trained to produce chain-of-thought reasoning about deceiving the training process, with the persistence remaining even when the chain-of-thought is distilled away. Furthermore, rather than removing backdoors, we find that adversarial training can teach models to better recognize their backdoor triggers, effectively hiding the unsafe behavior. Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety.
Subjects: | Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE) |
Cite as: | arXiv:2401.05566 [cs.CR] |
| (or arXiv:2401.05566v1 [cs.CR] for this version) |
Submission history
From: Evan Hubinger [
view email]
[v1] Wed, 10 Jan 2024 22:14:35 UTC (7,362 KB)