Sequences

Staying Sane While Taking Ideas Seriously

Comments


The chess example is meant to make specific points about RL*F concealing a capability that remains (or is even amplified); I'm not trying to claim that the "put up a good fight but lose" criterion is analogous to current RL*F criteria. (Though it does rhyme qualitatively with "be helpful and harmless".)

I agree that "helpful-only" RL*F would result in a model that scores higher on capabilities evals than the base model, possibly much higher. I'm frankly a bit worried about even training that model.


Thank you! I'd forgotten about that.

"Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal?"

How certain are you that this is always true (rather than "we've usually noticed this even though we haven't explicitly been checking for it in general"), and that it will continue to be so as models become stronger?

It seems to me like additionally running evals on base models is a highly reasonable precaution.

Oh wait, I misinterpreted you as using "much worse" to mean "much scarier", when instead you mean "much less capable".

I'd be glad if it turned out that RL*F doesn't hide any meaningful capabilities existing in the base model, but I'm not sure that's true, and I'd sure like someone to check! It seems likely that in some cases RL*F gets the model to stop explicitly talking about a capability it has (unless it's jailbroken on that subject), rather than removing the capability.

(Imagine RL*Fing a base model to stop explicitly talking about arithmetic; are we sure it would un-learn the rules?)

That's exactly the point: if a model has dangerous capabilities and is deceptively aligned, then testing only the RL*F-tuned model will return a false negative for capabilities that remain present in deployment. Until we have the kind of interpretability tools that we could deeply trust to catch deceptive alignment, we should count any capability found in the base model as if it were present in the tuned model.

I'd like to see evals like DeepMind's run against the strongest pre-RL*F base models, since that actually tells you about capability.
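To make that precaution concrete, here's a minimal sketch of what "count any capability found in the base model as present in the tuned model" could look like in an eval harness. Everything here (the `run_eval` callables, the model objects, the function name) is a hypothetical placeholder, not any real lab's API:

```python
# Hypothetical sketch: report a capability as present if EITHER the
# pre-RL*F base model or the post-RL*F tuned model exhibits it, since
# RL*F may conceal a capability (via refusals) without removing it.
from typing import Callable, Dict


def conservative_capability_report(
    base_model: object,
    tuned_model: object,
    evals: Dict[str, Callable[[object], float]],
) -> Dict[str, float]:
    """For each named eval, report the max of the two models' scores.

    The tuned model's score alone can be a false negative: a model
    trained to refuse may still retain the underlying capability.
    """
    report = {}
    for name, run_eval in evals.items():
        base_score = run_eval(base_model)    # pre-RL*F checkpoint
        tuned_score = run_eval(tuned_model)  # post-RL*F checkpoint
        report[name] = max(base_score, tuned_score)
    return report
```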

The WWII generation is negligible in 2024. The actual effect is partly the inverted demographic pyramid (older population means more women than men even under normal circumstances), and partly that even young Russian men die horrifically often:

"At 2005 mortality rates, for example, only 7% of UK men but 37% of Russian men would die before the age of 55 years."

And for that, a major culprit is alcohol (leading to accidents and violence, but also literally drinking oneself to death).

Among the men who don't self-destruct, I imagine a large fraction have already been taken, meaning that the gender ratio among singles has to be off the charts.

"That first statistic, that it swiped right 353 times and got to talk to 160 women, is completely insane. I mean, that’s almost a 50% match rate, whereas estimates in general are 4% to 14%."

Given Russia's fucked-up gender ratio (2.5 single women for every single man), I don't think it's that unreasonable!

Generally, the achievement of "guy finds a woman willing to accept a proposal" impresses me far less in Russia than it would in the USA. Let's see if this replicates in a competitive dating pool.

In high-leverage situations, you should arguably either be playing tic-tac-toe (simple, legible, predictable responses) or playing 4-D chess to win. If you're making really nonstandard and surprising moves (especially in PR), you have no excuse for winding up with a worse outcome than you would have if you'd acted in bog-standard normal ways.

(This doesn't mean suspending your ethics! Those are part of winning! But if you can't figure out how to win 4-D chess ethically, then you need to play an ethical tic-tac-toe strategy instead.)

Ah, I'm talking about introspection in a therapy context and not about exhorting others.

For example:

Internal coherence: "I forgive myself for doing that stupid thing".

Load-bearing but opaque: "It makes sense to forgive myself, and I want to, but for some reason I just can't".

Load-bearing and clear resistance: "I want other people to forgive themselves for things like that, but when I think about forgiving myself, I get a big NOPE NOPE NOPE".

P.S. Maybe forgiving oneself isn't actually the right thing to do at the moment! But it will also be easier to learn that in the third case than in the second.
