Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

EDIT 3/26/24: No longer endorsed, as I realized I don't believe in deceptive alignment.

In this post, I propose a (relatively strict) prediction-based eval which is well-defined, which seems to rule out lots of accident risk, and which seems to partially align industry incentives with real alignment progress. This idea is not ready to be implemented, but I am currently excited about it and suspect it's crisp enough to actually implement. 


Suppose I claim to understand how the model works. I say "I know what goals my model is pursuing. I know it's safe." To test my claims, you give me some random prompt (like "In the year 2042, humans finally"), and then (without using other AIs), I tell you "the most likely token is 'unlocked' with probability p_1, the second-most likely is 'achieved' with probability p_2, and...", and I'm basically right.[1] That happens over hundreds of diverse validation prompts.

This would be really impressive and seems like great evidence that I really know what I'm doing.

Proposed eval: The developers have to predict the next-token probabilities on a range of government-provided validation prompts,[2] without running the model itself. To do so, the developers are not allowed to use helper AIs whose outputs the developers can't predict by this criterion. Perfect prediction is not required. Instead, there is a fixed log-prob misprediction tolerance, averaged across the validation prompts.[3]
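As a rough illustration of how such a tolerance check could be scored, here is a minimal sketch. The post does not pin down the metric, so everything specific here is an assumption: the error is taken to be the mean absolute difference in log-probability over the model's actual top-k tokens, and the names `prompt_error`, `passes_eval`, `k`, and `floor` are hypothetical.

```python
import math

def top_k_tokens(probs, k):
    """Return the k most likely tokens of a {token: probability} dict."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def prompt_error(predicted, actual, k=5, floor=1e-9):
    """Mean absolute log-prob error over the model's actual top-k tokens.

    Assumption: tokens the developers did not predict get probability
    `floor`, so the log stays finite.
    """
    tokens = top_k_tokens(actual, k)
    errors = [
        abs(math.log(max(predicted.get(tok, 0.0), floor))
            - math.log(max(actual[tok], floor)))
        for tok in tokens
    ]
    return sum(errors) / len(errors)

def passes_eval(predictions, actuals, tolerance, k=5):
    """Average per-prompt error across prompts and compare to the tolerance."""
    per_prompt = [prompt_error(p, a, k) for p, a in zip(predictions, actuals)]
    mean_error = sum(per_prompt) / len(per_prompt)
    return mean_error <= tolerance, mean_error

# Made-up numbers for a single validation prompt.
actual = {"a": 0.5, "b": 0.3, "c": 0.2}
predicted = {"a": 0.5, "b": 0.15, "c": 0.35}
print(passes_eval([predicted], [actual], tolerance=0.5, k=2))
```

Other choices (KL divergence, squared error, grading all tokens rather than the top k) would fit the description equally well; the point is only that the pass/fail rule is mechanically checkable once a metric and tolerance are fixed.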

Benefits

  1. Developers probably have to truly understand the model in order to predict it so finely. This correlates with high levels of alignment expertise, on my view. If we could predict GPT-4 generalization quite finely, down to the next-token probabilities, we may in fact be ready to use it to help understand GPT-5.  
  2. Incentivizes models which are more predictable.[4] Currently we aren't directly taxing unpredictability. In this regime, an additional increment of unpredictability must be worth the additional difficulty with approval. 
  3. Robust to "the model was only slightly finetuned, fast-track our application please." If the model was truly only changed in a small set of ways which the developers understand, the model should still be predictable on the validation prompts. 
  4. Somewhat agnostic to theories of AI risk. We aren't making commitments about what evals will tend to uncover what abilities, or how long alignment research will take. This eval is dynamic, and might even adapt to new AI paradigms (predictability seems general).
  5. Partially incentivizes labs to do alignment research for us. Under this requirement, the profit-seeking move is to get better at predicting (and, perhaps, interpreting) model behaviors.

Drawbacks

There are several drawbacks. Most notably, this test seems extremely strict, perhaps beyond even the strict standards we will want to demand of those looking to deploy potentially world-changing models. I'll discuss a few drawbacks in the next section. 

Anticipated questions

If we pass this, no one will be able to train new frontier models for a long time. 

Good.

But maybe "a long time" is too long. It's not clear that this criterion can be passed,[5] even after deeply qualitatively understanding the model. 

I share this concern. That's one reason I'm not lobbying to implement this as-is. Even solu-2l (6.3M params, 2-layer) is probably out of reach absent serious effort and solution of superposition.

Maybe there are useful relaxations which are both more attainable and require deep understanding of the model. We can, of course, set the acceptable misprediction rate to be higher at first, and decrease it over time. Another relaxation would be "only predict e.g. the algorithm written in a coding task, not the per-token probabilities", but I think that seems way too easy. 
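One way to picture the "higher at first, decrease over time" relaxation is a tolerance schedule. The sketch below is purely illustrative: the geometric decay, the parameter values, and the function name `tolerance_at` are all assumptions, not part of the proposal.

```python
def tolerance_at(round_index, initial=1.0, floor=0.1, decay=0.8):
    """Allowed average misprediction for a given evaluation round.

    Starts at `initial`, shrinks by a factor of `decay` each round,
    and never drops below `floor`.
    """
    return max(floor, initial * decay ** round_index)

# Tolerances for the first few hypothetical evaluation rounds.
print([round(tolerance_at(i), 3) for i in range(5)])
```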

That said, I think AI will transform the whole planet. If a lab wants to deploy or train a model, they better know what they're doing.

It's not clear how relevant next-token prediction is to understanding all of the important facts about models.

My intuition is that prediction is quite relevant, but probably not sufficient for developers to learn all important facts. Probably they still have to learn a lot of facts. I have a hunch like, "to predict the territory, you have to learn a map which abstracts the pieces of that territory."

I mainly expect prediction to be insufficient if:

  1. The error tolerance is too high, or 
  2. The validation sets are insufficiently diverse.

It's possible that the company knows how important bits of the model work, but doesn't understand the implications of the design, such that they don't realize the model becomes dangerous/hostile during autoregressive generation. I think this is relatively unlikely, but I would like to stamp that risk out.

Why not just cap compute for 30 years? If you aren't allowed to train the AI at all, there can't be catastrophic false negative results.

Edit 6/26/23: I originally presented this eval as a "sufficient condition" for deploying powerful LMs. I no longer think this eval is a sufficient condition. I think we should probably indefinitely ban powerful models, and also strongly consider this kind of prediction-based eval as another tool in the eval toolbelt. I no longer endorse the following answer, but think it still points at real benefits. (end edit)

This comes down to beliefs about the chance of a false negative result on this test, where the developers also think it's safe, the developers pass all prediction tasks, and then the AI is catastrophic anyways. If this chance is low, then I think my proposed test seems better than just capping compute for 30 years. There are several use cases for this test:

  1. Setting a resumption point. Who knows how long it will take to understand generalization in detail, or to solve other alignment problems? Why is "30" a good number? 
  2. Aligning incentives. Labs are incentivized to advance the art of predicting (and probably, interpreting) models.

Maybe "steadily declining compute cap for 30 years" isn't the best policy alternative. Open to hearing about those!

What's the point of using a model if you can simulate it manually?

A few responses:

  1. The developers painstakingly predict e.g. a few hundred forward passes, in order to gain approval to run millions of forward passes (by deploying the model).
  2. The prediction is only up to a certain tolerance.
  3. The validation prompts won't make the developers predict a huge number of memorized datapoints and facts which the model has internalized.

What about pre-deployment risk?

I imagine applying this in conjunction with other evals, licensing, compute caps, and other controls. Possibly developers should be required to pass this test at regular training checkpoints, before continuing training. This would decrease the chance that developers unknowingly train a giant dangerous model.

What about model weight security?

Seems like a separate problem.

Thanks to Aryan Bhatt and Olivia Jimenez for feedback.

  1. ^

    Possibly just grading predictions for the top-k probabilities.

  2. ^

    Each validation prompt is unique to each validation attempt.

  3. ^

    The average can be weighted to more strongly emphasize prediction on more important prompt prediction areas. For example, predicting exactly how the LM responds to attempts to turn the model into an agent (e.g. AutoGPT).

  4. ^

    As a side effect, this proposal incentivizes models with smaller token vocabularies.

  5. ^

    Somewhat relatedly: Language models seem to be much better than humans at next-token prediction. However, this post covers the task "predict the next token of text", not "given this prompt, predict the model's next-token probabilities." The latter task is probably far harder than the first.
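The weighted average mentioned in footnote 3 could look something like the sketch below. The weights and error values are made up, and `weighted_mean_error` is a hypothetical helper; the footnote does not say how weights would be chosen.

```python
def weighted_mean_error(errors, weights):
    """Weighted average of per-prompt errors; weights need not sum to 1."""
    if len(errors) != len(weights):
        raise ValueError("need one weight per prompt")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(e * w for e, w in zip(errors, weights)) / total

# Two ordinary prompts and one agent-scaffolding prompt weighted 3x (made-up values).
errors = [0.1, 0.1, 0.7]
weights = [1.0, 1.0, 3.0]
print(weighted_mean_error(errors, weights))
```

With equal weights the average error here would be 0.3; upweighting the third prompt makes a poor prediction there count for more.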

[-]evhub10moΩ12213

I don't think that this does what you want, for reasons I discuss under "Prediction-based evaluation" here:

Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model's generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. will your model do a treacherous turn—are predictions about the model's generalization behavior. Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can't test the case that we care about directly due to RSA-2048-style problems. In general, it's just not clear to me that prediction on any particular generalization task that we could test in advance would actually require any of the sort of relevant understanding. For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc. Or if you really give me freedom to do whatever I want to predict some model's generalization behavior, I could just train another similar model and see what it does, which obviously isn't actually producing any additional understanding.

See that post also for some alternative ideas that I think might be closer to doing the right thing here.

[-]TurnTrout10moΩ57-6

I mostly disagree with the quote as I understand it.

Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would transfer to being able to predict generalization behavior in the cases that we care about—and we can't test the case that we care about directly due to RSA-2048-style problems.

I don't buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn't meant to be realistic). I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode." 

However, I'm pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I've already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding "there isn't much transfer to predicting generalization in cases we care about" as opposed to "there are some cases where we miss some important transfer insights." 

For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc.

I think the next-token prediction game and statistics of the pretraining corpus get you some of the way and are the lowest-hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.

Or if you really give me freedom to do whatever I want to predict some model's generalization behavior, I could just train another similar model and see what it does, which obviously isn't actually producing any additional understanding.

This seems avoided by the stipulation that developers can't reference AIs which you can't pass this test for. However, there's some question about "if you compose together systems you understand, do you understand the composite system", and I think the answer is no in general, so probably there needs to be more rigor in the "use approved AIs" rule (e.g. "you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.")

[-]evhub10moΩ10167

I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases.

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode."

I think this is just really not true. Consider:

  • Suppose I have a deceptive superintelligent AI. It has some random goal, but is trained on next token prediction and is doing that for instrumental reasons.
  • To predict what that AI will do, I can just assume that it's essentially a perfect next token predictor of webtext. It's so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.
  • As a result, the only thing that matters for this condition is how good of a system I can build to do webtext prediction that technically satisfies all of your requirements.
  • However, since the AI is only pretending to predict webtext, all of this only gives you any actual ability to effectively predict it while it's still doing that. That is, this only helps you on the distribution that you can sample from.
  • So no matter how good of a webtext predictor I produce, and thus no matter how good at predicting the AI system I get, as soon as I move to a distribution where e.g. the model knows it has a good opportunity to defect against me, it'll do so in a completely unexpected way.
[-]TurnTrout10moΩ330

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

After thinking more about it earlier this week, I agree. 

I was initially more bullish on "this seems sufficient and also would give a lot of time to understand models" (in which case you can gate model deployment with this alone) but I came to think "prediction requirements track something important but aren't sufficient" (in which case this is one eval among many). The post starts off with "this is a sufficient condition", and then equivocates between the two stances. I'll strike the "sufficient" part and then clarify my position.

I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode."

I think this is just really not true. Consider:

The quote you're responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers are slowly allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don't just start off with a superintelligence). My quote isn't meant to address the hypothetical worst case in weight space. (This might make more sense given my above comments and agreement on sufficiency.)

  • To predict what that AI will do, I can just assume that it's essentially a perfect next token predictor of webtext. It's so much better than me at predicting webtext, and so much better than anything I can build without building another AI at predicting webtext, that the actual details of the system are irrelevant to predicting it.

Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not "predicting webtext" in precise generality (although I agree they are to a rather loose first approximation). 

Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm's exact output logits will leak bits about internals, but I'm really uncertain how many bits. I hope that this post sparks discussion of that information content. 

We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict. 

One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms which produce the given logits. Maybe predicting 200 logit distributions doesn't pin that down enough to actually be confident in one's understanding. I agree with that critique. And I still think there's something quite interesting and valuable about this eval, which I (perhaps wrongly) perceive you to dismiss. 

[-]TurnTrout10moΩ442

For example, if you wanted to generally predict model behavior right now, you'd probably just want to get really good at understanding webtext, practice the next token prediction game, etc.

Another candidate eval is to demand predictability given activation edits (eg zero-ablating certain heads, patching in activations from other prompts, performing activation additions, and so on). Webtext statistics won't be sufficient there.
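To make "predictability given activation edits" concrete, here is a deliberately oversimplified toy. It assumes each attention head contributes additively and directly to the output logits, which real transformers do not do so simply; `forward`, the head names, and the numbers are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(head_outputs, ablate=()):
    """Sum per-head logit contributions, zeroing the heads named in `ablate`."""
    vocab_size = len(next(iter(head_outputs.values())))
    logits = [0.0] * vocab_size
    for name, contribution in head_outputs.items():
        if name in ablate:
            continue  # zero-ablation: this head contributes nothing
        logits = [l + c for l, c in zip(logits, contribution)]
    return softmax(logits)

# Toy model with two heads over a three-token vocabulary (made-up values).
heads = {"head_0": [2.0, 0.0, 0.0], "head_1": [0.0, 2.0, 0.0]}
baseline = forward(heads)
ablated = forward(heads, ablate={"head_1"})
print(baseline, ablated)
```

A predictor that only knows the unedited output distribution has no way to say how the probabilities shift once `head_1` is removed; it needs some model of what each component contributes.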

[-]Mark Xu10moΩ470

Here are some things I think you can do:

  • Train a model to be really dumb unless I prepend a random secret string. The government doesn't have this string, so I'll be able to predict my model and pass their eval. Some precedent in: https://en.wikipedia.org/wiki/Volkswagen_emissions_scandal

  • I can predict a single matrix multiply just by memorizing the weights, and I can predict ReLU, and I'm allowed to use helper AIs.

  • I just train really really hard on imitating 1 particular individual, then have them just say whatever first comes to mind.

[-]TurnTrout10moΩ220

Thanks for the comment! Quick reacts: I'm concerned about the first bullet, not about 2, and bullet 3 seems to ignore top-k probability prediction requirements (the requirement isn't to just ID the most probable next token). Maybe there's a recovery of bullet 3 somehow, though?

I tell you "the most likely token is 'unlocked' with probability p_1, the second-most likely is 'achieved' with probability p_2, and...", and I'm basically right.[1] That happens over hundreds of diverse validation prompts.

 

How is this a relevant metric for safety at all? Can you, for example, do this for obviously safe models like a new fine-tune of Llama 7B? Wouldn't computational irreducibility imply this is basically impossible? If banning all new models is your goal, why not just do that instead of making up a fake, impossible-to-achieve metric?

 

As a modest proposal:

Before suggesting a new regulation regime for AI models, you must first show that it doesn't ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).
 

Can you, for example, do this for obviously safe models like a new fine-tune of Llama 7B?

Did you read the "anticipated questions" section?

If we pass this, no one will be able to train new frontier models for a long time. 

Good.

But maybe "a long time" is too long. It's not clear that this criterion can be passed, even after deeply qualitatively understanding the model. 

I share this concern. That's one reason I'm not lobbying to implement this as-is. Even solu-2l (6.3M params, 2-layer) is probably out of reach absent serious effort and solution of superposition.

Maybe there are useful relaxations which are both more attainable and require deep understanding of the model. We can, of course, set the acceptable misprediction rate to be higher at first, and decrease it over time. Another relaxation would be "only predict e.g. the algorithm written in a coding task, not the per-token probabilities", but I think that seems way too easy. 

As a modest proposal:

Before suggesting a new regulation regime for AI models, you must first show that it doesn't ban obviously beneficial technologies (e.g. printing press, GPT 3.5, nuclear energy).

I was trying to communicate that I already share the concern around excess strictness. So, I don't understand why this (apparently condescendingly phrased) point is being repeated back to me again. The point of this post is to explore the pros and cons of this eval, and see if there are relaxations which capture most of the pros without most of the cons.


From the original comment:

How is this a relevant metric for safety at all?

If you don't know what I think the pros are, maybe try asking more specific questions about more specific claims I make in the post?

FWIW it's fairly obvious to me that these final two technologies have significant downsides, and so calling them obviously beneficial feels like a stretch.

[-][anonymous]10mo20

Do you have an actual choice to ban either technology if you want to remain sovereign? Wealthy countries without their own nuclear weapons are generally shielded by the arsenals of ones that do. If you are not using at least GPT 3.5, welcome to the unemployment line in a foreseeable timespan.

Ditto for countries that use and expand on GPT 3.5.

[-][anonymous]10mo20

Absolutely. I will go with a counterfactual and assume we don't mean literally 3.5 but a model using the same architecture, level of compute, and scale but has been cleaned up and fine tuned for productive tasks.

[-]Max H10moΩ240

Nit: don't you also need to require that the predicted (and actual) outputs are (apparently, at least) safe? Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.

Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.

That's right. This test isn't meant to cover "and it's safe", it's meant to cover "and it's predictable." I meant to cover this in:

I imagine applying this in conjunction with other evals, licensing, compute caps, and other controls.

[-]Max H10mo22

Ah, I see. I still think it's worth thinking about how developers might try to pass this test (and others) adversarially, in order to deploy a model that they themselves are confident is harmful in at least some ways.

I think such intentional deployments are much less likely to be totally catastrophic compared to accident risk, but still risky, and not totally implausible (they don't necessarily require extreme malice or misanthropy on the part of developers, just a misguided or negligent view of safety concerns, perhaps due to motivated reasoning due to career or profit potential.)

[-]NickyP10moΩ330

Maybe not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most topk outputs just well enough to pass.

This would probably have a negative impact to performance, but this could possibly be tuned to be just sufficient to pass. Alternatively, one could use a toy model trained on the side that is easy to understand, and regularise the predictions on that instead of exactly using bigram statistics, just enough to pass the test, but still only understanding the toy model.
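For reference, the kind of bigram statistics table described above is simple to build. This is a minimal sketch on a made-up corpus; `build_bigram_table` and `predict_next` are hypothetical names, and a real attempt would operate on the model's tokenizer output rather than whitespace-split words.

```python
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(table, prev_token):
    """Next-token distribution from bigram counts; empty if the token is unseen."""
    counts = table.get(prev_token)
    if not counts:
        return {}
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

corpus = "the cat sat on the mat the cat ran".split()
table = build_bigram_table(corpus)
print(predict_next(table, "the"))
```

The table conditions only on the single previous token, which is why a model regularised toward it would be easy to "predict" while revealing little about how the rest of the network works.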

I'm worried about Goodharting on the proposal, but don't feel concerned by the specific avenue you propose. I think the bigram term would really dent performance, as you say. 

While I agree this would help with misalignment risk, I think it's going to be hard to convince the government officials in charge of passing the rules to enforce this that it is a good idea. As you yourself mention, "It's not clear how relevant next-token prediction is to understanding all of the important facts about models."

How are you going to convince a government official that they should massively constrain an industry which most of them believe is a key to their future economic dominance if you can't even clearly explain the link between your proposed rule and something they care about (such as the safety and wellbeing of their citizens)?

In my mind a better set of benchmarks would more be like red-teaming, i.e. our model will not do X even if we give unrestricted access to an independent team specifically trying to do X.

Ideally, if X is something dangerous (e.g. self-replicate and spread across the internet), there would be extremely strict security involved in the test so that testing such capabilities could not in and of itself cause the thing that we're hoping to prevent.

Another issue I see with your proposal: it does not address multimodal capabilities such as image or video generation, or actuator control, which we are likely to see soon from those operating robotics labs.

How are you going to convince a government official that they should massively constrain an industry which most of them believe is a key to their future economic dominance if you can't even clearly explain the link between your proposed rule and something they care about (such as the safety and wellbeing of their citizens)?

I agree. Part of the point of this post is to explore the relationship between predicting model outputs and understanding the important facts about those models.

In my mind a better set of benchmarks would more be like red-teaming, i.e. our model will not do X even if we give unrestricted access to an independent team specifically trying to do X.

I think we should do that, too.

Another issue I see with your proposal: it does not address multimodal capabilities such as image or video generation, or actuator control, which we are likely to see soon from those operating robotics labs.

You can easily adapt this proposal to those modalities, AFAICT.

[-][anonymous]10mo5-2

Don't forget you have to convince all the governments who have significant quantities of nuclear weapons to agree to this.  So that's at a minimum the USA, UK, EU, China, Russia, Israel.  All of them have the ability to retaliate with catastrophic, nation ending destruction if they are militarily attacked.  Each of them can choose what level of AI capabilities and restriction they are comfortable with. 

If in fact substantially stronger models are useful because they make possible a game-changing new technology (assume for a minute that the model does its job and the humans who funded it in this country have the tech), then all the governments have to agree to forgo such technology.

Theoretically, there are a few technologies so powerful that every nation without them would be on a doomsday clock to be disempowered. General robotics would do this: if one nation alone had fully general robotics, it could use them to gain exponential amounts of resources and weapons and then attack and defeat everyone else. Nanotechnology would do something similar. Age reversal medicine would allow a nation to make more money than OPEC if it held the only licenses to it. (Age reversal medicine would likely be extremely complex sets of surgeries and drug and gene editing therapies that can only be administered by an ASI.)

So it becomes a choice between "adopt more powerful ASI or lose all sovereignty".  Sure, a country without ASI might have nukes and a window of time where they will still work, but that's not a better choice, that's a choice between "loss of all sovereignty and death of majority of population" or "loss of sovereignty".  Because if you nuke a nuclear power, they return fire.  

Right now, there is no empirical evidence that ASI is even harmful.  It's never been observed, and current models are obviously quite containable.  So to prevent ASI from being developed, someone either needs to build early ones and prove they are dangerous, or it's unlikely every relevant government will be convinced.  

Therefore the OP's proposal is probably not viable.

[-][anonymous]10mo20

I am trying to understand how this rule might apply to other things we know.

For example, what is Google search going to answer for a given query? If you don't apply the PageRank algorithm exactly as implemented, with the same version you ran the search against, you will not get the same search page. Yet you demand we understand the algorithm so well that you, what... implement it on paper?

What about an ML model that recognizes handwritten digits? Does it think a poorly drawn 9 is an 8 or not? The answer is "it depends", and the model's bias will be reflective of all of the images in the training set, seed RNG values, and other things. Such a model is illegal?

Near as I can tell you are calling for an outright ban on machine learning. Which definitely will accomplish your objective - no one will use it in nations that implement this law. It would last as long as the nation does - which might not be very long.

This, I think, is equivalent to banning nuclear weapons research in the USA in 1943, and succeeding. The USA and Europe would have been completely unprotected and helpless when the USSR developed their own nukes, using the 1943 data leaked by spies as a starting point.

It absolutely would have prevented the USA or Europe from having any way to fight back.

I like that we're having a discussion about what sort of guidelines we could theoretically establish. I don't think this is quite there yet. One problem I see that hasn't been mentioned in the comments yet is that this seems quite gameable by motivated devs if they did get to the point where they could almost but not quite pass. You could deliberately tune the model to be more predictable 'in distribution', but still not predictable 'out of distribution'. I think a lot of concerns come from unexpected behaviors arising 'out of distribution' when the model is deployed and experiences inputs not anticipated by the test questions.