This is a special post for quick takes by Oliver Daniels-Koch.

Clarifying the relationship between mechanistic anomaly detection (MAD), measurement tampering detection (MTD), weak to strong generalization (W2SG), weak to strong learning (W2SL), and eliciting latent knowledge (ELK). (Nothing new or interesting here, I just often lose track of these relationships in my head)

eliciting latent knowledge is an approach to scalable oversight which hopes to use the latent knowledge of a model as a supervision signal or oracle. 

weak to strong learning is an experimental setup for evaluating scalable oversight protocols, and is a class of sandwiching experiments

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model. 

measurement tampering detection is a class of weak to strong generalization problems, where the "weak" supervision consists of multiple measurements which are sufficient for supervision in the absence of "tampering" (where tampering is not yet formally defined)
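To make that structure concrete, here's a minimal toy sketch (all names hypothetical) of the MTD setup: each example carries several redundant measurements plus a latent ground-truth variable, with tampering operationalized as "all measurements read positive while the latent variable they track is false":

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MTDExample:
    measurements: List[bool]  # multiple redundant measurements
    ground_truth: bool        # the latent variable the measurements are meant to track

def is_tampered(ex: MTDExample) -> bool:
    # Tampering (informally): every measurement looks positive even though
    # the latent variable they are supposed to track is false.
    return all(ex.measurements) and not ex.ground_truth

clean_pos = MTDExample([True, True, True], True)    # genuinely good outcome
clean_neg = MTDExample([False, False, False], False)  # genuinely bad, honestly measured
tampered = MTDExample([True, True, True], False)    # measurements fooled
```

The detection problem is hard precisely because `ground_truth` is unobserved at training time on the untrusted set.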

mechanistic anomaly detection is an approach to ELK, where examples are flagged as anomalous if they cause the model to do things for "different reasons" than on a trusted dataset, where "different reasons" are defined w.r.t. internal model cognition and structure. 
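One crude way to operationalize "different reasons" (an illustrative sketch, not a claim about how actual MAD methods work) is to treat the model's activations as a proxy for its reasons: fit a Gaussian to activations collected on the trusted dataset and score new examples by Mahalanobis distance.

```python
import numpy as np

def fit_trusted(acts: np.ndarray):
    # acts: (n_trusted, d) activations collected on the trusted dataset
    mean = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + 1e-4 * np.eye(acts.shape[1])  # regularized
    return mean, np.linalg.inv(cov)

def anomaly_score(x: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    # squared Mahalanobis distance from the trusted activation distribution
    diff = x - mean
    return float(diff @ cov_inv @ diff)

rng = np.random.default_rng(0)
trusted = rng.normal(0.0, 1.0, size=(500, 8))   # hypothetical trusted activations
mean, cov_inv = fit_trusted(trusted)

in_dist = rng.normal(0.0, 1.0, size=8)
anomalous = np.full(8, 10.0)  # far from the trusted distribution
```

Here `anomaly_score(anomalous, ...)` comes out much larger than `anomaly_score(in_dist, ...)`; the hope (and the open problem) is that "anomalous activations" track "anomalous reasons".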

mechanistic anomaly detection methods that work for ELK should also probably work for other problems (such as backdoor detection and adversarial example detection)

so when developing benchmarks for mechanistic anomaly detection, we want to evaluate methods on standard machine learning security problems (adversarial examples and trojans) that have similar structure to scalable oversight problems, and also compare against other ELK approaches (e.g. CCS) and other scalable oversight approaches (e.g. debate)

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

mechanistic anomaly detection is an approach to ELK

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to use MTD and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. one contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

Here's a revised sketch 

A few notes:

  • I use Scalable Oversight to refer to both Alignment and Control 
  • I'm confused whether weak to strong learning is a restatement of scalable oversight, ELK, or its own thing, so I ignore it 
  • I don't explicitly include easy-to-hard, I think OOD basically covers it
  • taxonomies and abstractions are brittle and can be counterproductive

Scalable Oversight Taxonomy

  • Scalable Oversight
    • Scalable Alignment
      • Benchmarks / Tasks
        • Sandwiching Experiments (human amateurs + model, ground truth from human experts)
        • Weak models supervising Strong models
      • Approaches
        • Debate
        • Recursive reward modeling
        • (Solution to Eliciting Latent Knowledge) + Narrow Elicitation
          • (Note - I think this assumes, more than prior scalable oversight ideas, that there will be a base model with adequate knowledge, such that the hard part is extracting the knowledge rather than teaching the model)
          • Eliciting Latent Knowledge
            • Approaches
              • Contrast Consistent Search
              • Confidence
              • Intermediate Probing
              • "Speed Prior"
              • "Simplicity Prior"
              • Concept Extrapolation - learn all salient generalizations, use expensive supervision to select correct one
              • IID Mechanistic Anomaly Detection + expensive supervision on anomalies
            • Subclasses
              • Measurement Tampering Detection
                • Approaches
                  • OOD Mechanistic Anomaly Detection
                    • In distribution
                    • Out of Distribution (likely? requires multiple measurement structure)
                  • Concept Extrapolation
                    • train diverse probes on untrusted data, select probe that predicts positive measurements less frequently
          • Narrow Elicitation
            • ...
    • Scalable Control
      • Weak Review
      • Untrusted Rephrase or whatever
      • Coup probes
      • MAD (Review all anomalies)
  • Trojans
    • ...
    • MAD (maybe?)
  • Adversarial Examples
    • ...
    • MAD (maybe?)
  • Natural Mechanism Distinction
    • MAD
  • Spurious Correlate Detection / Resolution
    • Concept Extrapolation
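The concept extrapolation entry above ("train diverse probes on untrusted data, select probe that predicts positive measurements less frequently") can be sketched as follows. This toy version (all names hypothetical) diversifies probes only by random initialization, which a real method would need to improve on:

```python
import numpy as np

def train_probe(X: np.ndarray, y: np.ndarray, seed: int, steps: int = 300, lr: float = 0.5) -> np.ndarray:
    # Logistic-regression probe trained by gradient descent from a random init.
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def select_conservative_probe(X_trusted, y_trusted, X_untrusted, seeds=(0, 1, 2, 3)):
    # Train several probes that all fit the trusted data (diversified here only
    # by random init), then pick the one that predicts positive measurements
    # least often on untrusted data -- the "conservative" generalization.
    probes = [train_probe(X_trusted, y_trusted, s) for s in seeds]
    rates = [float(((X_untrusted @ w) > 0).mean()) for w in probes]
    best = int(np.argmin(rates))
    return probes[best], rates[best]

rng = np.random.default_rng(42)
X_tr = rng.normal(size=(200, 5))
y_tr = (X_tr[:, 0] > 0).astype(float)                     # trusted labels track feature 0
X_un = rng.normal(size=(100, 5)) + np.array([0.0, 2.0, 0.0, 0.0, 0.0])  # shifted untrusted data
probe, positive_rate = select_conservative_probe(X_tr, y_tr, X_un)
```

The selection rule encodes the hope that a tampering-sensitive probe predicts "measurements look good" more often on untrusted data than a probe tracking the intended concept.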

I’ve been thinking a little more about the high-level motivation of measurement tampering detection, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?

The remarks by Erik Jenner seem good, but I would also just consider the following setup:

Imagine that we have a human looking at an AI's actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).

These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.

The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
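The consistency heuristic here can be stated as a one-liner: with redundant measurements of the same variable, partial disagreement is evidence that some measurements no longer track it. A toy sketch:

```python
def inconsistency_evidence(measurements) -> bool:
    # Redundant measurements of the same variable should agree; partial
    # disagreement suggests some of them have been tampered with (or broken)
    # and no longer track the variable of interest. Fully successful
    # tampering evades this check, which is why it's only one signal.
    positives = sum(bool(m) for m in measurements)
    return 0 < positives < len(measurements)
```

Note the limitation made explicit in the comment: an AI that tampers with *all* measurements consistently passes this check, so consistency alone is not a tampering detector.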

Now, what can go wrong?

  • Problematic consequences which couldn't at all be captured in a human looking at these measurements because either:
    • The visible or human understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
    • The consequences aren't something that humans can ever understand (e.g. the actions involve torturing invisible fairies that humans don't understand or know about.)
  • Problematic consequences which would "by default" be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
  • Cases which are a mix of the prior two: not exactly "by default" captured, but also not impossible to capture.

When viewed from this perspective, MTD seems useful as we seem to often care about consequences that are in principle possible to measure and observe. And, this might correspond to the cases where MTD is applicable (e.g. consistency properties in measurements and/or explicit tampering (which can fail) is required).

It's in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human understandable by doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it's very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long run future when we don't have any training data that relates to this?)

Note that for MTD to be useful, we have to ensure that our AI's predictions about the future reflect its knowledge. This is relatively easy in the "average" or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.

(Also, in the high-stakes case, we might run into issues where a given observation doesn't make sense: you can't observe something if you're dead.)

(from conversation with Erik Jenner) roughly 3 classes of applications

  1. MTD all the way down 
    1. Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good, or something). Use GPT-N to predict measurements conditioned on proposed actions of the policy, and detect measurement tampering. Reward actions that GPT-N predicts will increase measures of things going well without tampering. 
  2. Other Scalable Oversight + MTD as reward function / side constraint
    1. Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the "primary" training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering 
  3. Other Scalable Oversight + MTD as extra safety check
    1. same as before, but you're leaning more on the scalable oversight to avoid tampering in most cases, and using measurement tampering detection as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop)

(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
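Class (2) can be sketched as a reward shaping rule: the oversight protocol provides the primary signal, and the tampering detector acts as a side constraint that dominates whenever it fires. All functions below are hypothetical stand-ins for much harder components:

```python
def oversight_reward(action: dict) -> float:
    # Stand-in for the primary scalable oversight signal (debate, IDA, ...).
    return action["quality"]

def tampering_flagged(action: dict) -> bool:
    # Stand-in for a measurement tampering detector's verdict.
    return action["tampers"]

TAMPER_PENALTY = -100.0  # large enough to dominate any oversight reward

def shaped_reward(action: dict) -> float:
    # MTD as a side constraint: the penalty overrides the oversight signal,
    # so tampering is never worth a marginally higher oversight score.
    if tampering_flagged(action):
        return TAMPER_PENALTY
    return oversight_reward(action)

honest = {"quality": 0.8, "tampers": False}
sneaky = {"quality": 0.99, "tampers": True}
```

Class (3) replaces the penalty branch with a halt: instead of shaping the reward, a flagged action triggers a Pause/Stop of training or deployment.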

As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and next several appendices (prior to generative model errors) worthwhile.

One confusion I have with MAD as an approach to ELK is that it seems to assume some kind of initial inner alignment. If we're flagging when the model takes actions / makes predictions for "unusual reasons", where unusual is defined with respect to some trusted set, but aligned and misaligned models are behaviorally indistinguishable on the trusted set, then a model could learn to do things for misaligned reasons on the trusted set, and then use those same reasons on the untrusted set. For example, a deceptively aligned model would appear aligned in training but attempt takeover in deployment for the "same reason" (e.g. to maximize paperclips), but a MAD approach that "properly" handles out-of-distribution cases would not flag takeover attempts, because we want models to be able to respond to novel situations. 

I guess this is part of what motivates measurement tampering as a subclass of ELK - instead of trying to track the motivations of the agent as reasons, we try to track the reasons for the measurement predictions, and we have some trusted set with no tampering, where we know the reason for the measurements is ~exactly the thing we want to be measuring. 

Now time to check my answer by rereading https://www.alignmentforum.org/posts/vwt3wKXWaCvqZyF74/mechanistic-anomaly-detection-and-elk

I think I'm mostly right, but using a somewhat confused frame. 

It makes more sense to think of MAD approaches as detecting all abnormal reasons (including deceptive alignment) by default, and then if we get that working we'll try to decrease false anomalies by doing something like comparing the least common ancestor of the measurements in a novel mechanism to the least common ancestor of the measurements on trusted mechanisms.