Oliver Sourbut
  • Autonomous Systems @ UK AI Safety Institute (AISI)
  • DPhil AI Safety @ Oxford (Hertford college, CS dept, AIMS CDT)
  • Former senior data scientist and software engineer + SERI MATS

I'm particularly interested in sustainable collaboration and the long-term future of value. I'd love to contribute to a safer and more prosperous future with AI! Always interested in discussions about axiology, x-risks, s-risks.

I enjoy meeting new perspectives and growing my understanding of the world and the people in it. I also love to read - let me know your suggestions! In no particular order, here are some I've enjoyed recently

  • Ord - The Precipice
  • Pearl - The Book of Why
  • Bostrom - Superintelligence
  • McCall Smith - The No. 1 Ladies' Detective Agency (and series)
  • Melville - Moby-Dick
  • Abelson & Sussman - Structure and Interpretation of Computer Programs
  • Stross - Accelerando
  • Simsion - The Rosie Project (and trilogy)

Cooperative gaming is a relatively recent but fruitful interest for me. Here are some of my favourites

  • Hanabi (can't recommend enough; try it out!)
  • Pandemic (ironic at time of writing...)
  • Dungeons and Dragons (I DM a bit and it keeps me on my creative toes)
  • Overcooked (my partner and I enjoy the foodie themes and frantic realtime coordination playing this)

People who've got to know me only recently are sometimes surprised to learn that I'm a pretty handy trumpeter and hornist.

Sequences

Breaking Down Goal-Directed Behaviour

Wiki Contributions

Comments

the original 'theorem' was wordcelled nonsense

Lol! I guess if there was a more precise theorem statement in the vicinity of what was gestured at, it wasn't nonsense? But in any case, I agree the original presentation is dreadful. John's is much better.

I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.

A quick go at it, might have typos.

Suppose we have

  • a (hidden) state $X$
  • an output/observation $Y$

and a predictor

  • a (predictor) state $M$
  • a predictor output $\hat{Y}$
  • the reward or goal or what have you, $R$ (some way of scoring 'was $\hat{Y}$ right?')

with the obvious structure: $Y$ is generated from $X$, the predictor state $M$ is computed from whatever data the predictor gets to see, the output $\hat{Y}$ is a function of $M$, and $R$ scores $\hat{Y}$ against $Y$.

Then GR trivially says the (predictor) state $M$ should model the posterior over $X$ given the predictor's data.
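
Spelling that out in one possible notation (mine, not necessarily the OP's; write $D$ for whatever data the predictor conditions on, and assume $D$ is informative about $Y$ only via the hidden state, i.e. $Y \perp D \mid X$):

$$M = f(D), \qquad \hat{Y} = g(M), \qquad R = r(Y, \hat{Y}),$$

and then

$$P(Y \mid D) \;=\; \sum_{x} P(Y \mid X = x)\, P(X = x \mid D),$$

so the posterior $P(X \mid D)$ is a sufficient statistic of $D$ for choosing $\hat{Y}$; GR (in John's form) says a minimal optimal predictor state is exactly (an encoding of) that posterior.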

Now if these are all instead processes (time-indexed), we have an HMM with

  • (hidden) states $X_t$
  • observations $Y_t$

and a predictor process with

  • (predictor) states $M_t$
  • predictions $\hat{Y}_t$
  • rewards $R_t$

with structure $X_t \to X_{t+1}$, $X_t \to Y_t$, $(M_t, Y_t) \to M_{t+1}$, $M_t \to \hat{Y}_t$, and $(Y_t, \hat{Y}_t) \to R_t$.

Drawing the rewards $R_t$ together as the 'goal', we have a GR motif at each step, so $M_t$ must model the posterior over $X_t$ given its inputs $(M_{t-1}, Y_{t-1})$; by induction over $t$, that is $P(X_t \mid Y_{<t})$, the HMM belief state.
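
To make that concrete, here's a toy numerical sketch (my own made-up two-state HMM, nothing from the post): the state an optimal next-observation predictor has to carry is exactly this filtering posterior, maintained recursively.

```python
import numpy as np

# Toy illustration: for next-observation prediction on an HMM, the optimal
# predictor state is the belief state P(X_t | Y_<t), updated by the standard
# filtering recursion. The numbers below are made up for illustration.

T = np.array([[0.9, 0.1],   # transition matrix: T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],   # emission matrix:   E[i, y] = P(Y_t=y | X_t=i)
              [0.1, 0.9]])

def predict_next_obs(belief):
    """Optimal prediction P(Y_t | Y_<t), given belief = P(X_t | Y_<t)."""
    return belief @ E

def update_belief(belief, y):
    """Condition on the realised Y_t = y, then propagate one step through T."""
    posterior = belief * E[:, y]     # proportional to P(X_t | Y_<=t)
    posterior /= posterior.sum()
    return posterior @ T             # P(X_{t+1} | Y_<=t)

belief = np.array([0.5, 0.5])        # prior over the two hidden states
for y in [0, 0, 1, 1, 0]:            # an arbitrary observation sequence
    print("predict:", predict_next_obs(belief), "belief:", belief)
    belief = update_belief(belief, y)
```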

I guess my question would be 'how else did you think a well-generalising sequence model would achieve this?' Like, what is a sufficient world model but a posterior over HMM states in this case? This is what the GR theorem asks for. (Of course, a poorly-fit model might track extraneous detail or have a bad posterior.)

From your preamble and your experiment design, it looks like you correctly anticipated the result, so this should not have been a surprise (to you). In general I object to being sold something as surprising which isn't (it strikes me as a lesser-noticed and perhaps oft-inadvertent rhetorical dark art and I see it on the rise on LW, which is sad).

That said, since I'm the only one objecting here, you appear to be more right than I am about how surprising this is!


The linear probe is new news (but not surprising?) on top of GR, I agree. But the OP presents the other aspects as the surprises, and not this.

Nice explanation of MSP and good visuals.

This is surprising!

Were you in fact surprised? If so, why? (This is a straightforward consequence of the good regulator theorem[1].)

In general I'd encourage you to carefully track claims about transformers, HMM-predictors, and LLMs, and to distinguish between trained NNs and the training process. In this writeup, all of these are quite blended.


  1. John has a good explication here ↩︎

Incidentally I noticed Yudkowsky uses 'brainware' in a few places (e.g. in conversation with Paul Christiano). But it looks like that's referring to something more analogous to 'architecture and learning algorithms', which I'd put more in the 'software' camp when it comes to the taxonomy I'm pointing at (the 'outer designer' is writing it deliberately).

Unironically, I think it's worth anyone interested skimming that Verma & Pearl paper for the pictures :) especially fig 2

Mmm, I misinterpreted at first. It's only a v-structure if the two parents are not connected. So this is a property which needs to be maintained effectively 'at the boundary' of the fully-connected cluster which we're rewriting. I think that tallies with everything else, right?

ETA: both of our good proofs respect this rule; the first Reorder in my bad proof indeed violates it. I think this criterion is basically the generalised and corrected version of the fully-connected bookkeeping rule described in this post. I imagine if I/someone worked through it, this would clarify whether my hand-wavy proof of Frankenstein Stitch is right or not.

That's concerning. It would appear to make both our proofs invalid.

But I think your earlier statement about incoming vs outgoing arrows makes sense. Maybe Verma & Pearl were asking for some other kind of equivalence? Grr, back to the semantics I suppose.

[This comment is no longer endorsed by its author]

Aha. Preserving the skeleton and v-structures (colliders like $A \to C \leftarrow B$ with $A$ and $B$ not adjacent) is necessary and sufficient for equivalence[1]. So when rearranging fully-connected subgraphs, certainly we can't do it (cost-free) if it introduces or removes any v-structures.

Plausibly if we're willing to weaken the graph by adding in additional arrows, there might be other sound ways to reorder fully-connected subgraphs - but they'd be non-invertible. Haven't thought about that.
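
To make the criterion concrete, here's a small self-contained sketch (mine, using networkx; not anything from the post) that checks Markov equivalence of two DAGs by comparing skeletons and v-structures:

```python
import itertools
import networkx as nx

# Verma & Pearl criterion: two DAGs are Markov equivalent iff they have the
# same skeleton and the same v-structures (a -> c <- b with a, b non-adjacent).

def skeleton(g: nx.DiGraph) -> set:
    return {frozenset(e) for e in g.to_undirected().edges()}

def v_structures(g: nx.DiGraph) -> set:
    vs = set()
    for c in g.nodes:
        for a, b in itertools.combinations(g.predecessors(c), 2):
            if not (g.has_edge(a, b) or g.has_edge(b, a)):
                vs.add((frozenset({a, b}), c))
    return vs

def markov_equivalent(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Reversing an arrow is fine when it touches no collider...
print(markov_equivalent(nx.DiGraph([("A", "B"), ("B", "C")]),
                        nx.DiGraph([("B", "A"), ("B", "C")])))   # True

# ...but not when it creates or destroys a v-structure:
print(markov_equivalent(nx.DiGraph([("A", "B"), ("C", "B")]),
                        nx.DiGraph([("B", "A"), ("C", "B")])))   # False
```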


  1. Verma & Pearl, Equivalence and Synthesis of Causal Models 1990 ↩︎

Mhm, OK I think I see. But the nodes in question appear to me to make a complete subgraph, and all I did was redirect one of the arrows. I confess I am mildly confused by the 'reorder complete subgraph' bookkeeping rule. It should apply to that complete subgraph, right? But then I'd be able to deduce a graph which is strictly different. So it must mean something other than what I'm taking it to mean.

Maybe I need to go back and stare at the semantics for a bit. (But this syntactic view with motifs and transformations is much nicer!)

Perhaps more importantly, I think with Node Introduction we really don't need that assumption after all?

With Node Introduction and some bookkeeping, we can get the two graphs topologically compatible, and Frankenstein them. We can't get as neat a merge as if we also had that assumption - in particular, we can't get rid of one of the arrows. But that's fine, we were about to draw that arrow in anyway for the next step!

Is something invalid here? Flagging confusion. This is a slightly more substantial claim than the original proof makes, since it assumes strictly less. Downstream, I think it makes the Resample unnecessary.

ETA: it's cleared up below - there's an invalid Reorder here (it removes a v-structure).
