
I wonder if a similar technique could form the foundation for a fully general solution to the alignment problem. Mathematically speaking, all this technique needs is a vector-to-vector function, and it's not just layer-to-layer relationships that can be understood as a vector-valued function; the world as a function of the policy is also vector-valued.

I.e. rather than running a search to maximize some utility function, a model-based agent could run a search for small changes in policy that have a large impact on the world. If one can then taxonomize, constrain and select between these impacts, one might be able to get a highly controllable AI.

Obviously there are some difficulties here, since the activations are easier to search over than the world: we have an exact way to calculate them. But that's a capabilities question rather than an alignment question.
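
A minimal sketch of the kind of search I mean, assuming we had a differentiable world model mapping policy parameters to a predicted world-state vector (the model `world_model` and all the shapes here are hypothetical placeholders): the top right singular vectors of its Jacobian are exactly the small policy changes with the largest local impact on the world.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 128)  # toy stand-in for a learned world model's parameters

def world_model(policy_params: torch.Tensor) -> torch.Tensor:
    # Hypothetical differentiable map: policy parameters -> predicted world state.
    return torch.tanh(policy_params @ W)

policy_params = torch.randn(64)

# Jacobian of the predicted world state w.r.t. the policy: shape (128, 64).
J = torch.autograd.functional.jacobian(world_model, policy_params)

# Right singular vectors are directions in policy space, ordered by how
# strongly a small step in that direction moves the predicted world state.
U, S, Vh = torch.linalg.svd(J, full_matrices=False)
high_impact_policy_changes = Vh[:5]  # candidate small changes with large impact
impact_magnitudes = S[:5]
```

Taxonomizing, constraining and selecting between the impacts would then amount to inspecting what each of those directions does to the predicted world state.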

The singular vectors of the Jacobian between two layers seem more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations were scaled up beyond the infinitesimal.

Edit: wait, maybe I misunderstood. I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in context it's more likely you meant the objective function for the magnitude of change in later-layer activations induced by a given activation vector?
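
To make the contrast concrete, a toy sketch (all names and shapes here are made up): the layer-to-layer Jacobian ranks directions by how much they move the later activations at all, while the Hessian of a cross-entropy objective only sees them as mediated through the final probabilities.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d1, d2, vocab = 16, 16, 8
W1, W2 = torch.randn(d1, d2), torch.randn(d2, vocab)
target = torch.tensor(3)

def layer2(h1: torch.Tensor) -> torch.Tensor:
    # Map from layer-1 activations to layer-2 activations.
    return torch.relu(h1 @ W1)

def objective(h1: torch.Tensor) -> torch.Tensor:
    # Cross-entropy on the token prediction, as a function of layer-1 activations.
    logits = layer2(h1) @ W2
    return F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))

h1 = torch.randn(d1)

# Singular vectors of the layer-to-layer Jacobian: directions in layer-1
# activation space ranked by how much they move layer-2 activations.
J = torch.autograd.functional.jacobian(layer2, h1)           # (d2, d1)
_, S_jac, Vh_jac = torch.linalg.svd(J, full_matrices=False)

# Hessian of the objective: everything is mediated by the final probabilities,
# so a direction that leaves the probabilities locally unchanged scores ~0 here
# even if it moves the layer-2 activations a lot.
H = torch.autograd.functional.hessian(objective, h1)         # (d1, d1)
eigvals, eigvecs = torch.linalg.eigh(H)
```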

I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.

For the past few weeks, I've been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token x activation matrix for a given layer, and then restricting oneself to changes along the right singular vectors of this matrix.
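
Concretely, the restriction might look something like this (a NumPy sketch; the activation matrix here is a random stand-in for the real token x activation matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 32, 128
A = rng.standard_normal((n_tokens, d_model))  # token x activation matrix for one prompt

# Right singular vectors span the directions the activations actually occupy.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = int((S > 1e-6 * S[0]).sum())  # keep directions with non-negligible weight
V_k = Vt[:k]                      # (k, d_model)

def restrict_to_occupied_directions(delta: np.ndarray) -> np.ndarray:
    # Project a proposed change in activations onto the span of the right
    # singular vectors, discarding the component that points off-manifold.
    return V_k.T @ (V_k @ delta)

proposed_change = rng.standard_normal(d_model)
on_manifold_change = restrict_to_occupied_directions(proposed_change)
```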

My SVD idea might improve things, but I didn't get around to testing it, because I eventually decided it wasn't good enough for my purposes: 1) it wouldn't keep you on-manifold enough, since it could still introduce unnatural placement of information and exaggerated features, and 2) given that transformers are pretty flexible and you can e.g. swap around layers, it felt unclean to have a method that depends this strongly on the layer structure.

A follow-up idea I've been thinking about, but haven't been satisfied with, is projections. If you pick some vector u, project the activations onto u (or the weights, in my theory, but you work more with activations, so let's consider activations; it should be similar), and then subtract that projection from the original activations, you get "the activations with u removed". Intuitively this seems like it would better focus on "what the network actually does" as opposed to "what the network could do if you added something more to it".

Unfortunately, after thinking for a while, I concluded this actually wouldn't work. Let's say the activations are a = x b + y c, where b is a large activation vector that ultimately doesn't have an effect on the final prediction, and c is a small activation vector that does. If you have some vector d that the network doesn't use at all, you could then project away sqrt(1/2) (b - d), which would introduce the d vector into the activations.
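
A quick numerical check of that counterexample, taking b, c, d to be orthonormal:

```python
import numpy as np

# b is large but ultimately unused, c is small but used,
# d is a direction the network doesn't use at all.
b, c, d = np.eye(3)
x, y = 10.0, 0.1
a = x * b + y * c                  # original activations: [10. , 0.1, 0. ]

# "Project away" the direction u = sqrt(1/2) * (b - d).
u = np.sqrt(0.5) * (b - d)         # a unit vector
a_removed = a - np.dot(a, u) * u

print(a_removed)                   # [5.  0.1 5. ]: a d-component of x/2 has appeared
```

So removing a direction that mixes the unused d into b doesn't just shrink b; it silently writes d into the activations, scaled by half the coefficient of b.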

Another idea I've thought about: suppose you do an SVD of the activations. You could multiply the feature half of the SVD with the weights used to compute the KQV matrices, and then perform an SVD of that, which should get you the independent ways that one layer affects the next. One thing I particularly wonder about: if you start doing this from the output layer and proceed backwards, it seems like it would "sparsify" the network down to only the dimensions that matter for the final output, which should assist with interpretability and such. But it's not clear it interacts nicely with the residual connections.
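
A sketch of that two-stage SVD, with a random activation matrix and a random stand-in for the combined KQV weights (whether the "feature half" should be S Vt or Vt alone is a choice I'm not sure about):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 32, 128
A = rng.standard_normal((n_tokens, d_model))         # activations at layer l
W_kqv = rng.standard_normal((d_model, 3 * d_model))  # stand-in for the next layer's KQV weights

# First SVD: the feature half (here S @ Vt) describes which directions the
# activations actually occupy and with what magnitude.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
features = np.diag(S) @ Vt                           # (rank, d_model)

# Second SVD: the independent ways this layer's occupied directions feed
# into the next layer's key/query/value computation.
U2, S2, Vt2 = np.linalg.svd(features @ W_kqv, full_matrices=False)

# Small values in S2 mark occupied directions that the next layer's KQV
# essentially ignores; dropping them "sparsifies" the layer-to-layer interface.
```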

How important is it to use full-blown gradient descent to train them? Could one instead take the first singular vector of the Jacobian between the neural network layers, and get something that works similarly well?
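
For what it's worth, the first singular vector doesn't even require materializing the full Jacobian; power iteration with Jacobian-vector and vector-Jacobian products should get it fairly cheaply. A sketch, where `layer_map` is a hypothetical stand-in for the map from an earlier layer's activations to a later layer's:

```python
import torch

torch.manual_seed(0)
d = 64
W = torch.randn(d, d)

def layer_map(h: torch.Tensor) -> torch.Tensor:
    # Stand-in for the map from an earlier layer's activations to a later layer's.
    return torch.tanh(h @ W)

h0 = torch.randn(d)  # the activations to linearize around

def top_right_singular_vector(f, x, iters: int = 50) -> torch.Tensor:
    # Power iteration on J^T J without ever building J, using forward- and
    # reverse-mode products through f at x.
    v = torch.randn_like(x)
    v = v / v.norm()
    for _ in range(iters):
        _, Jv = torch.autograd.functional.jvp(f, x, v)     # J @ v
        _, JtJv = torch.autograd.functional.vjp(f, x, Jv)  # J^T @ (J @ v)
        v = JtJv / JtJv.norm()
    return v

candidate_steering_vector = top_right_singular_vector(layer_map, h0)
```

Whether that matches the trained vectors presumably depends on how nonlinear the steering effect is, since the Jacobian only captures the infinitesimal neighbourhood of the current activations.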

For the purposes of my prediction market, I'm thinking this probably counts as part of the Shard Theory program? Since Alex Turner was involved as a mentor and activation vectors are kind of shard-theory-flavored, and so on?

Possibly this is because I'm biased in favor of unsupervised learning, but my interest is piqued by this finding.

Ideally I'd like someone to just sample the McDonald's cooking oil every few days through a few cycles of oil changes, but I doubt this will happen.

This seems like a huge civilizational inadequacy to me. This would be informative at large scale, and the cost would be very small.

Definitely relevant for figuring out what's true when one is only talking about the object level, but the OP was about how trustworthy contrarians are compared to the mainstream, rather than simply about the object level.

I feel like if there's one side arguing the genetic gap is x, and one side arguing the genetic gap is 0, the natural dichotomization is whether the genetic gap is larger or smaller than x/2.

Do you have an outline of how SLT answers this?
