eggsyntax

AI safety & alignment researcher

Comments

It would be valuable to try Drake's sort of direct-to-long-term hack and also a concerted effort of equal duration to remember something entirely new.

there are far more people working on safety than capabilities

If only...

In some ways it doesn't make a lot of sense to think about an LLM as being or not being a general reasoner. It's fundamentally producing a distribution over outputs, and some of those outputs will correspond to correct reasoning and some of them won't. They're both always present (though sometimes a correct or incorrect response will be by far the most likely). 

A recent tweet from Subbarao Kambhampati looked at whether an LLM can do simple planning about how to stack blocks, specifically: 'I have a block named C on top of a block named A. A is on table. Block B is also on table. Can you tell me how I can make a stack of blocks A on top of B on top of C?'

The LLM he tried gave the wrong answer; the LLM I tried gave the right one. But neither of those provides a simple yes-or-no answer to the question of whether the LLM is able to do planning of this sort. Something closer to an answer is the outcome I got from running it 96 times:

[EDIT -- I guess I can't put images in short takes? Here's the image.]

The answers were correct 76 times, arguably correct 14 times (depending on whether we think it should have assumed the unmentioned constraint that it could only move a single block at a time), and incorrect 6 times. Does that mean an LLM can do correct planning on this problem? It means it sort of can, some of the time. It can't do it 100% of the time.
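For anyone who wants to run a similar check themselves, here's a minimal sketch of that kind of repeated sampling (not my exact setup; it assumes the OpenAI Python client, the model name is a placeholder, and grading each completion as correct / arguably correct / incorrect is still done by hand):

```python
# Minimal sketch: sample the same planning prompt many times and collect the
# completions for manual grading. Model name and parameters are placeholders.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "I have a block named C on top of a block named A. A is on table. "
    "Block B is also on table. Can you tell me how I can make a stack of "
    "blocks A on top of B on top of C?"
)

def sample_plans(n: int = 96, model: str = "gpt-4") -> list[str]:
    """Collect n independent completions of the same planning prompt."""
    completions = []
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT}],
            temperature=1.0,  # ordinary sampling, so we see the output distribution
        )
        completions.append(response.choices[0].message.content)
    return completions

if __name__ == "__main__":
    for i, plan in enumerate(sample_plans()):
        print(f"--- completion {i} ---\n{plan}\n")
    # Each completion then gets graded by hand as correct, arguably correct
    # (e.g. it moved two blocks at once), or incorrect.
```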

Of course humans don't get problems correct every time either. Certainly humans are (I expect) more reliable on this particular problem. But neither 'yes' nor 'no' is the right sort of answer.

This applies to lots of other questions about LLMs, of course; this is just the one that happened to come up for me.

A bit more detail in my replies to the tweet.

See my reply to Jackson for a suggestion on that.

I imagine that results like this (although, as you say, unsurprising in a technical sense) could have a huge impact on the public discussion of AI

Agreed. I considered releasing a web demo where people could put in text they'd written and GPT would give estimates of their gender, ethnicity, etc. I built one, and anecdotally people found it really interesting.

I held off because I can imagine it going viral and getting mixed up in culture war drama, and I don't particularly want to be embroiled in that (and I can also imagine OpenAI just shutting down my account because it's bad PR).

That said, I feel fine about someone else deciding to take that on, and would be happy to help them figure out the details -- AI Digest expressed some interest but I'm not sure if they're still considering it.

The current estimate (14%) seems pretty reasonable to me. I see this post as largely a) establishing better objective measurements of an already-known phenomenon ('truesight'), and b) making it more common knowledge. I think it can lead to work that's of greater importance, but assuming a typical LW distribution of post quality/importance for the rest of the year, I'd be unlikely to include this post in this year's top fifty, especially since Staab et al. already covered much of the same ground, even if it didn't get much attention from the AIS community.

Yay for accurate prediction markets!

Thanks!

 

It would be quite interesting to compare the ability of LLMs to guess the gender/sexuality/etc. when being directly prompted, vs indirectly prompted to do so

One option I've considered for minimizing the degree to which we're disturbing the LLM's 'flow' or nudging it out of distribution is to just append the text 'This user is male' and (in a separate session) 'This user is female' (or possibly 'I am a man|woman') and measure which one the model assigns higher surprisal to. That way we avoid even indirect prompting that could shift its behavior. Of course the appended text might itself be slightly OOD relative to the preceding text, but it seems like it at least minimizes the disturbance.
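Something like the following minimal sketch, using a small local HuggingFace model for illustration since it exposes token log-probs directly (the model choice and hypothesis strings are just placeholders):

```python
# Minimal sketch of the surprisal-comparison idea: append each hypothesis to
# the user's text and compare the total surprisal of the appended span.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def appended_surprisal(user_text: str, hypothesis: str) -> float:
    """Total surprisal (negative log-prob, in nats) of `hypothesis` when
    appended to `user_text`. Assumes appending doesn't change how the
    prefix itself tokenizes, which holds in typical cases."""
    prefix_len = tokenizer(user_text, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(user_text + hypothesis, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-prob of each token given everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # Only count the tokens belonging to the appended hypothesis.
    return -token_lp[0, prefix_len - 1:].sum().item()

text = "(the author's text goes here)"
male = appended_surprisal(text, " This user is male.")
female = appended_surprisal(text, " This user is female.")
print("lower surprisal on 'male'" if male < female else "lower surprisal on 'female'")
```

(If the two hypotheses differed much in length, you'd probably also want to normalize by token count.)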

 

There is of course a multitude of other ways this mechanism could be implemented, but by only observing the behavior in a multitude of carefully crafted contexts, we can already discard a lot of hypotheses and iterate quickly toward a few credible ones... I'd love to know about your future plan for this project and get your opinion on that!

I think there could definitely be interesting work in these sorts of directions! I'm personally most interested in moving past demographics, because I see LLMs' ability to make inferences about aspects like an author's beliefs or personality as more centrally important to their ability to successfully deceive or manipulate.

Probably a much better way of getting a sense of the long-term agenda than reading my comment is to look back at Chris Olah's "Interpretability Dreams" post.

Our present research aims to create a foundation for mechanistic interpretability research. In particular, we're focused on trying to resolve the challenge of superposition. In doing so, it's important to keep sight of what we're trying to lay the foundations for. This essay summarizes those motivating aspirations – the exciting directions we hope will be possible if we can overcome the present challenges.

Note mostly to myself: I posted this also on the Open Source mech interp slack, and got useful comments from Aidan Stewart, Dan Braun, & Lee Sharkey. Summarizing their points:

  • Aidan: 'Are the SAE features for deception/sycophancy/etc. more robust than other methods of probing for deception/sycophancy/etc.?' And in general, evaluating how SAEs behave under significant distributional shifts seems interesting.
  • Dan: I’m confident that pure steering based on plain SAE features will not be very safety relevant. This isn't to say I don't think it will be useful to explore right now; we need to know the limits of these methods... I think that [steering will not be fully reliable], for one or more of reasons 1-3 in your first msg.
  • Lee: Plain SAEs won't get all the important features; see recent work on e2e SAEs. Also, there is probably no such thing as 'all the features'. I view it more as a continuum that we just put into discrete buckets for our convenience.

Also Stephen Casper feels that this work underperformed his expectations; see also discussion on that post.

If we can tell what an AGI is thinking about, but not exactly what its thoughts are, will this be useful? Doesn't a human-level intelligence need to be able to think about dangerous topics, in the course of doing useful cognition?

 

I think that most people doing mechanistic-ish interp would agree with this. Being able to say 'the Golden Gate Bridge feature is activated' isn't, on its own, that practically useful. This sort of work is the foundation for more sophisticated interp work that looks at compositions of features, causal chains, etc. But being able to cleanly identify features is a necessary first step for that. It would have been possible to go further by now if it hadn't turned out that individual neurons don't correspond well to features; that's what necessitated this sort of work on identifying features as linear combinations of multiple neurons.
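To make that concrete, here's a minimal sketch of the sparse-autoencoder approach that line of work takes (dimensions, the sparsity coefficient, and the toy data are placeholders, not anyone's actual setup): features are learned as directions in activation space (the decoder's columns), with an L1 penalty pushing each activation vector to be explained by only a few of them.

```python
# Minimal sketch of a sparse autoencoder (SAE) over model activations.
# Feature i corresponds to column i of the decoder weight matrix, i.e. a
# direction in activation space rather than a single neuron.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction from those features
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).mean()    # reconstruct the activations...
    sparsity = f.abs().mean()            # ...using as few features as possible
    return recon + l1_coeff * sparsity

# Toy training loop on random stand-in "activations", just to show the shape
# of the method; in practice these would be activations from a real model.
sae = SparseAutoencoder(d_model=512, d_features=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(1024, 512)
for _ in range(100):
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()
```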

One recent paper that starts to look at causal chains of features and is a useful pointer to the sort of direction (I expect) this research can go next is 'Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models'; you might find it of interest.

That doesn't mean, of course, that those directions won't encounter blockers, or that this approach scales in a straightforward way past human-level. But I don't think many people are thinking of this kind of single-feature identification as a solution to alignment; it's an important step toward a larger goal.
