Paper: Emergent Misalignment

In "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs," the authors find:

We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment.

The key surprise here is that finetuning a model to engage in one specific form of misbehavior causes it to exhibit multiple, unrelated forms of misbehavior. For me, this finding suggests that the current alignment of AIs to human interests is more fragile than previously understood, and that we should be working seriously to discover more robust forms of AI safety.
