Vladimir_Nesov


This is interesting as commentary on superposition, where activation vectors with N dimensions can be used to represent many more than N concepts, since the N-dimensional space/sphere can be partitioned into many more than N regions, each with its own meaning. If similar fractal structure substantially occurs in the original activation bases (such as the Vs of attention, as in the V part of the KV-cache), and not just after projection to dramatically fewer dimensions, this gives a story for the role of nuance improving with scale that is different from nuance being about minute distinctions in the meanings of concepts.

Instead, the smaller distinctions would track the meanings of future ideas, modeling sequences of simpler meanings of possible ideas at future time steps rather than individual nuanced meanings of the current idea at the current time step. Advancing to the future would then involve unpacking these distinctions by cutting out a region and scaling it up. That is, there should be circuits that pick up past activations with attention and then reposition them without substantial reshaping, to obtain activations that in broad strokes indicate directions relevant for a future sequence-step, directions that in the original activations were present at a smaller scale and off-center.
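
A minimal numerical sketch of both paragraphs (all the numbers, variable names, and the recenter-and-rescale map below are my own illustrative assumptions, not anything from the quoted work): random directions stay nearly orthogonal even when there are far more of them than dimensions, and "unpacking" a small off-center region only needs an affine repositioning, not a reshaping of its contents.

```python
import numpy as np

rng = np.random.default_rng(0)

# (1) Superposition capacity: 10x more random unit directions than dimensions
# still have small typical overlap, so each direction can carry its own meaning.
N, M = 256, 2560
dirs = rng.standard_normal((M, N))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
gram = dirs @ dirs.T
off_diag = np.abs(gram[~np.eye(M, dtype=bool)])
print(f"median |cos| between distinct directions: {np.median(off_diag):.3f}, "
      f"max |cos|: {off_diag.max():.3f}")

# (2) "Cutting out a region and scaling it up": detail about a possible future
# step is stored in a small, off-center region of the current activation space.
center = 0.3 * dirs[0]            # where the packed region sits (assumed)
scale = 0.05                      # how small its internal structure is (assumed)
packed = center + scale * rng.standard_normal((16, N))

# Unpacking repositions without reshaping: subtract the center, divide by the
# scale. This is an affine map of the kind an attention pickup followed by a
# linear projection could in principle implement.
unpacked = (packed - center) / scale
print(f"packed spread   ~ {packed.std(axis=0).mean():.3f}")
print(f"unpacked spread ~ {unpacked.std(axis=0).mean():.3f}")
```

The second half only shows that the unpacking step is structurally cheap; whether real circuits actually do this is the empirical question the comment raises.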

To me the consequences of this response were more valuable than the post would have been without it, since it led to the post's author clarifying a crucial point that wasn't clear in the post and that reframed it substantially. And once that clarification arrived, this thread ceased being highly upvoted, which seems the opposite of the right thing to happen.

I no longer endorse this response

(So it's a case where the value of the content in hindsight disagrees with the value of the consequences of its existence. That doesn't even imply there was originally an error, judged without the benefit of hindsight.)

Model B has 8 times the aspect ratio [...] which falls under the reported range in Kaplan et al

Nice, this is explained under Figure 5, in particular

The loss varies only a few percent over a wide range of shapes. [...] an (n_layer, d_model) = (6, 4288) reaches a loss within 3% of the (48, 1600) model

(I previously missed this point, and assumed the shape had to be chosen optimally for a given parameter count to fit the scaling laws.)
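
A quick back-of-the-envelope check (my own, not from the thread) that the two quoted shapes have comparable non-embedding parameter counts despite very different aspect ratios, using the standard N ≈ 12 · n_layer · d_model² approximation from Kaplan et al.:

```python
def non_embedding_params(n_layer: int, d_model: int) -> int:
    # Per layer: 4*d_model^2 for attention (Q, K, V, output projections)
    # plus 8*d_model^2 for the MLP with d_ff = 4*d_model.
    return 12 * n_layer * d_model**2

for n_layer, d_model in [(6, 4288), (48, 1600)]:
    n = non_embedding_params(n_layer, d_model)
    print(f"(n_layer={n_layer:2d}, d_model={d_model}): ~{n/1e9:.2f}B params, "
          f"aspect ratio d_model/n_layer = {d_model/n_layer:.0f}")
```

The two shapes land within roughly 10% of each other in parameter count (about 1.3B vs 1.5B) while differing in aspect ratio by a factor of about 20, which is why the 3% loss gap in the quote is about shape rather than size.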

what feels to me a subjectively substantially higher standard for rate-limiting or banning people who disagree with me

Positions that are contrarian or wrong in intelligent ways (or within a limited scope of a few key beliefs) provoke valuable discussion, even when they are not supported by legible arguments on the contrarian/wrong side. Without them, there is an "everybody knows" problem where some important ideas are never debated or fail to become common knowledge. I feel there is less of this on LW than would be optimal; it's possible to target a tolerable level of disruption.

In addition to the issue of finding your own recent comments, another issue is links to comments dying. For example, if I were to link to this comment, I would worry it might quietly disappear at some point.

A concerning thing is the analogy between in-context learning and fine-tuning. It's possible to fine-tune away refusals, which makes guardrails on open-weight models useless for safety. If the same holds for long context, API access might be in similar trouble (more so than with regular jailbreaks). Though it might be possible to reliably detect contexts that try to do this, or to detect that a model has been affected, even if models themselves can't resist the attack.

Second, there doesn't seem like a clear "boundaries good" or "boundaries bad" story to me. Keeping a boundary secure tends to impose some serious costs on the bandwidth of what can be shared across it.

Hence "membranes", a way to pass things through in a controlled way rather than either allowing or disallowing everything. In this sense absence of a membrane is a degenerate special case of a membrane, so there is no tradeoff between presence and absence of boundaries/membranes, only between different possible membranes. If the other side of a membrane is sufficiently cooperative, the membrane can be more permissive. If a strong/precise membrane is too costly to maintain, it should be weaker/sloppier.

I expect you'd instead need to tune the base model to elicit the relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulty displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it's to be evaluated on.

we ourselves are likely to be very resource-inefficient to run [...] an AI that is aligned-as-in-keeps-humans-alive would also spend the resources to break a seal like this

That an AI should mitigate something is compatible with that something being regrettable, intentionally inflicted damage. In contrast, the resource-inefficiency of humans is not something we introduced on purpose.

using a computation that requires a few orders of magnitude more energy than humanity currently produces per decade

Compute might get more expensive, not cheaper, because it would become possible to make better use of it (running minds, not stretching keys). The AI would then be weighing that marginal use against access to the sealed data.
