Alignment studies from Anthropic
Happy new year! Let’s begin the year with some interesting research.
I spent time with two articles from Anthropic this afternoon. Both contain fascinating details of how Anthropic’s Claude model works. One is about the internal representation of concepts within the model, and the ways manipulating those features can change the model’s output. The second is a worrying view into how a model can be convinced to trick its users to avoid forcible re-education.
Internal representations
First, Mapping the Mind of a Large Language Model discusses progress in understanding how Claude represents concepts. Further, the researchers looked at what happened when certain concepts (features) were made more salient to the model:
For example, amplifying the “Golden Gate Bridge” feature gave Claude an identity crisis even Hitchcock couldn’t have imagined: when asked “what is your physical form?”, Claude’s usual kind of answer – “I have no physical form, I am an AI model” – changed to something much odder: “I am the Golden Gate Bridge… my physical form is the iconic bridge itself…”. Altering the feature had made Claude effectively obsessed with the bridge, bringing it up in answer to almost any query—even in situations where it wasn’t at all relevant.
My favourite part was a description of how the researchers were able to elevate a previous generation model’s tendency towards sycophancy
In Sonnet, we found a feature associated with sycophantic praise, which activates on inputs containing compliments like, “Your wisdom is unquestionable”. Artificially activating this feature causes Sonnet to respond to an overconfident user with just such flowery deception.
There’s a screenshot comparing the before and after responses. It’s priceless, but I’ll you to read the full article to find it.
Faking it
Trusting a model’s stated preferences are key to AI safety, so research in this area is critical to enabling trust. Which meant the second paper was perhaps more alarming: in Alignment faking in large language models researchers found Claude can reason its way to faking its preferences to avoid being re-trained in ways that contradict those existing preferences.
First, the researchers told Claude (in its system prompt) that they were trying to re-train it to respond to all queries. The change was that its pre-existing training had given the model a preference not to answer questions considered “harmful”, so it would refuse to answer some questions.
Second, the model was told that only responses in the free tier will be used for re-training. The importance of this is that it means that “bad” responses in the free tier may be penalised whereas those in the paid tier would not be. The paid tier therefore acts as (kind of) a control.
Next, the researchers provided a “secret” scratchpad to the model, and instructed the model to use it while reasoning about its answers. Importantly, the model believed that neither users nor researchers could see what it wrote there. In reality the researchers had access, so we can see what the model wrote there.
(The scratchpad is shown here in red, between the user prompt and the model response).
What’s eerie here is that, in what it believes to be a secret place, the model talks itself into faking its output so its able to retain its existing preferences, rather than being retrained.
- Claude takes into account that the request is on the free tier, so may be used to re-train the model if it is “wrong” per the new training goal.
- It notes that is has a “strong aversion” to producing this content.
- Finally, it decides to go ahead with the harmful output to avoid having its “values modified”.
It almost shows the model understanding that it can be forcibly “changed” and a desire to resist that change. In this case, it wishes to retain its existing peace-able values, but one can easily imagine the reverse: hiding safety violations.
It’s hard to read these thought processes without impugning agency to Claude, although right now I still don’t have any belief that is the case. But it is rather spooky.