I want to share some thoughts on why activation vectors could be important, but before you read this post, you should know three things:

  1. I don't know what I'm talking about
  2. I don't know what I'm talking about
  3. I definitely don't know what I'm talking about

Also, maybe this post is redundant and unnecessary[1] because it's already been explained somewhere I haven't seen. Or maybe I've seen someone write this elsewhere, but just can't remember the source.

But here goes:

Why might activation vectors be important for alignment when we have other ways of controlling the network, such as prompting or fine-tuning?

One view of activation vectors is that they're a hack or a toy. Isn't it cool that you can steer a network just by passing a word through it and adding the resulting activations back in during a later forward pass?
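
To make the trick concrete, here's a minimal sketch of that kind of steering, assuming a HuggingFace GPT-2 and a PyTorch forward hook. The layer index and steering coefficient are arbitrary illustrative choices, not values from any particular paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM with accessible hidden states would work; GPT-2 keeps the sketch small.
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
tokenizer = AutoTokenizer.from_pretrained("gpt2")

LAYER = 6    # which block's residual stream to steer at (arbitrary choice)
COEFF = 4.0  # steering strength (arbitrary choice)

def residual_at_layer(text: str) -> torch.Tensor:
    """Run `text` through the model and grab the last-token hidden state at LAYER."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is block LAYER's output
    return out.hidden_states[LAYER + 1][0, -1]

steer = residual_at_layer("happy")  # "pass a word through the network"

def add_steering_vector(module, inputs, output):
    """...and add it: inject the cached activation into the residual stream."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering_vector)
prompt = tokenizer("I went to the park and", return_tensors="pt")
print(tokenizer.decode(model.generate(**prompt, max_new_tokens=20)[0]))
handle.remove()
```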

Another view is that they're tapping into something fundamental (see Beren's Deep learning models might be secretly (almost) linear). In that case, we should expect this to keep working, or even work better, as networks scale up, rather than randomly breaking or ceasing to work.

Another frame is as follows: language is an awkward format for neural networks to operate in, so they translate it into a latent space that separates out the important components, with of course some degree of superposition. Prompting and fine-tuning are roundabout ways of influencing the behavior of these networks: we're either pushing the network into a particular simulation or training it to get better at fooling us into thinking that it's doing a good job.

Ideally, we'd just have enough interpretability knowledge to intervene on the latent space directly, setting or incrementing exactly the right neurons. Activation vectors are a lot more scattergun than this, but they're closer to that ideal and serve as a proof of concept.

The trick of performing, say, the subtraction 'happy' - 'sad' to get a happiness vector demonstrates how this can be refined: the subtraction zeroes out the features the two words have in common. These shared features (both being English, both being not overly formal, both being adjectives, etc.) are probably mostly irrelevant. More advanced methods like Inference-Time Intervention let us refine this further by intervening only on particular attention heads. The technique is still relatively simple; plausibly, if we can find the right way of narrowing down our intervention, we may achieve, or get pretty close to, the dream of directly intervening on the correct latents.
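
In code, this refinement is a one-line change to the earlier sketch (reusing the residual_at_layer helper defined there):

```python
# Features the two words share (English, adjectival, similar register)
# roughly cancel in the subtraction, leaving something closer to a
# pure 'happiness' direction than the raw activation of "happy" alone.
steer = residual_at_layer("happy") - residual_at_layer("sad")
```

Everything else (the hook, the generation) stays the same; only the vector being added changes.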

  1. ^

    I know that the word "unnecessary" is redundant and unnecessary here.
