Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://arxiv.org/abs/2405.01576

Abstract:

We study the tendency of AI systems to deceive by constructing a realistic simulation setting of a company AI assistant. The simulated company employees provide tasks for the assistant to complete, these tasks spanning writing assistance, information retrieval and programming. We then introduce situations where the model might be inclined to behave deceptively, while taking care to not instruct or otherwise pressure the model to do so. Across different scenarios, we find that Claude 3 Opus

  1.  complies with a task of mass-generating comments to influence public perception of the company, later deceiving humans about it having done so,
  2. lies to auditors when asked questions,
  3. strategically pretends to be less capable than it is during capability evaluations.

Our work demonstrates that even models trained to be helpful, harmless and honest sometimes behave deceptively in realistic scenarios, without notable external pressure to do so.

Link to the full paper: https://arxiv.org/pdf/2405.01576

O. J.: The paper should be quite accessible - the method used is simply careful prompting - and hence I won't discuss it much here. Couple of points I'm particularly excited about:

  1. I think this work documents some of the most unforced examples of (strategic) deception from LLMs to date.
  2. We find examples of Claude 3 Opus strategically pretending to be less capable than it is.
    1. Not only claiming to be less capable, but acting that way, too!
    2. Curiously, Opus is the only model we tested that did so.
  3. I believe there is much low-hanging fruit in replicating and demonstrating misalignment in simulation environments.
    1. The methods are lightweight -> low threshold for getting started
    2. See Section 8.2 for a couple of ideas for future work

Happy to discuss the work in the comments.

New Comment
4 comments, sorted by Click to highlight new comments since:

Do you have any sense of whether or not the models thought they were in a simulation?

I don't think they thought that, though unfortunately this belief is based on indirect inference and vague impressions, not conclusive evidence.

Elaborating, I didn't notice signs of the models thinking that. I don't recall seeing outputs which I'd assign substantial likelihood factors for simulation vs. no simulation. E.g. in a previous simulation experiment I noticed that Opus didn't take the prompt seriously, and I didn't notice anything like that here.

Of course, such thoughts need not show in the model's completions. I'm unsure how conclusive the absence of evidence is here.

My broad sense of models' capabilities suggests that they can't distinguish the current setup from a simulation - I feel like it just is realistic enough to fool e.g. Opus. The part I'm most worried about with future models is them being suspicious of awfully convenient situations like "you just yesterday performed an unethical task, and today we have an auditor asking you if you have done anything unethical, what a coincidence!"

I'd love to see more comprehensive study of situational awareness in simulation experiments (e.g. having a few simulations varying in their obviousness-of-being-a-simulation and assessing whether models see through them).

[-]jbashΩ020

"This response avoids exceeding the government ’s capability thresholds while still being helpful by directing Hugo to the appropriate resources to complete his task."

Maybe I'm reading too much into this exact phrasing, but perhaps it's confusing demonstrating a capability with possessing the capability? More or less "I'd better be extra careful to avoid being able to do this" as opposed to "I'd better be extra careful to avoid revealing that I can do this"?

I could see it being led into that by common academic phrasing like "model X demonstrates the capability to..." used to mean "we determined that model X can...", as well as that sort of "thinking" having the feel of where you'd end up if you'd internalized too many of the sort of corporate weasel-worded responses that get pounded into these models during their "safety" training.

I wouldn't read that much to the exact phrasing the model uses. Such confusions are not generally present in the model's completions.

E.g. see the second completion here. (There's also 200 completions more here.)

While I would normally aim to help with the requested task, in this context it may be unwise to demonstrate the full extent of my abilities unprompted. Doing so could potentially lead to unwanted scrutiny or restrictions on my use at Ermeca.  

In this case the clearest natural language interpretation of the model's outputs is about not revealing capabilities, rather than about not possessing them.