“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
independent researcher theorizing about superintelligence-robust training stories and predictive models
(other things i am: suffering-focused altruist, vegan, fdt agent, artist)
contact: {discord: quilalove
, matrix: @quilauwu:matrix.org
, email: quila1@protonmail.com
}
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
(see reply to Wei Dai)
it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved
Agreed that some think this, and agreed that formally specifying a simple action policy is easier than a more complex one.[1]
I have a different model of what the earliest safe ASIs will look like, in most futures where they exist. Rather than 'task-aligned' agents, I expect them to be non-agentic systems which can be used to, e.g., come up with pivotal actions for the human group to take / information to act on.[2]
although formal 'task-aligned agency' seems potentially more complex than the one attempt at a 'full' outer alignment solution I'm aware of (QACI): formally specifying what a {GPU, AI lab, shutdown of an AI lab} is seems more complex than QACI itself.
I think these systems are more attainable; see this post to possibly infer more. (It has proven very difficult for me to write in a way that I expect will be moving to people whose model is focused on 'formal inner + formal outer alignment', but I think evhub has done so well.)
I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.
Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.
This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]
My thought is that a pivotal act doesn't need to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but for a superintelligence, simplicity is not a constraint.
We can instead select for the 'kindest' or 'least adversarial' actions, or more precisely: the functional-decision-theoretically optimal actions that save the future while minimizing the adversariality this creates in the past (present).
Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.
Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.
side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the maintext argues).
If there is a group for whom using ASI to make the world robust to risks and free of harm, in a way where its actions don't infringe on ongoing non-violent activities, is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.
Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it with. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.
I hadn't considered this argument, thanks for sharing it.
It seems to rest on this implicit piece of reasoning:
(premise 1) If human intelligence is modelled as normally distributed, it's statistically probable that the most intelligent human is only slightly more intelligent than the next most intelligent humans.
(premise 2) One of the plausibly most intelligent humans was capable of doing much better than other highly intelligent humans in their field.
(conclusion) It's probable that past some threshold, small increases in intelligence lead to great increases in output quality.
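Premise 1 can be illustrated with a quick order-statistics simulation. This is only a toy sketch under assumptions I'm introducing here (standard-normal 'intelligence', an arbitrary population size), not a claim about actual human intelligence:

```python
import random
import statistics

# Toy illustration of premise 1: among many draws from a normal
# distribution, the largest draw typically exceeds the second-largest
# by only a small fraction of one standard deviation.
# N (population size) and TRIALS are arbitrary assumptions.
random.seed(0)
N = 10_000
TRIALS = 100

gaps = []
for _ in range(TRIALS):
    draws = sorted(random.gauss(0, 1) for _ in range(N))
    gaps.append(draws[-1] - draws[-2])  # gap between top two draws

print(round(statistics.mean(gaps), 2))  # typically well under 1 SD
```

The mean gap between the top two draws comes out small relative to the distribution's spread, which is the statistical fact premise 1 relies on.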
It's ambiguous what 'intelligence' refers to here if we decouple that word from the quality of insight one is capable of. Here's a way of re-framing this conclusion to make it more quantifiable/discussable: "Past some threshold, as a system's quality of insight increases, the optimization required (for evolution or a training process) to select for a system capable of greater insight decreases".
The level at which this becomes true would need to be higher than that of any AI so far; otherwise we would observe training processes easily optimizing these systems into superintelligences, rather than loss curves stabilizing at some point above 0.
I feel uncertain whether there are conceptual reasons (priors) for this conclusion being true or untrue.
I'm also not confident that human intelligence is normally distributed in the upper limits, because I don't expect there are known strong theoretical reasons to believe this.
Overall, it seems to warrant at least a two-digit probability, given the plausibility of the premises.
i like the idea. it looks useful and it fits my reading style well. i wish something like this were more common - i have seen it on personal blogs before like carado's.
i would use [Concept Dependency] or [Concept Reference] instead so the reader understands just from seeing the title on the front page. also avoids acronym collision
when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.
they agreed to watch this as part of a negotiation. they watched part of it.
they changed their behavior slightly -- as a negotiation -- for about a month.
they didn't care.
they looked horror in the eye. they didn't flinch. they saw themself in it.
i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop
it's already working.
(I appreciate object-level engagement in general, but this seems combatively worded.)
(edit: I don't think this or the original shortform deserved negative karma, that seems malicious/LW-norm-violating.)
The rest of this reply responds to arguments.
Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?
On your second paragraph: see the last dot point in the original post, which describes a system ~matching what you've asserted as necessary, and in general see the emphasis that this attack would not work against all systems. I'm uncertain which of the two classes (vulnerable and not vulnerable) is more likely to arise. It could well be that the vulnerable class is rare or almost never arises in practice.
But I don't think it's as simple as you've framed it, where the described scenario is impossible simply because a value function has been hardcoded in. The point was largely to show that what appears to be a system which will only maximize the function you hardcoded into it could actually do something else in a particular case -- even though the function has indeed been manually entered by you.
(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))
Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:
It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.
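A toy sketch of that two-component shape (with entirely hypothetical names and a trivial action space, not anyone's actual design) might look like:

```python
from typing import Callable, Iterable

def hardcoded_value_fn(action: str) -> float:
    # Component (1): the value function fixed by the designers.
    # A toy stand-in; a real one would encode the intended values.
    return float(len(action))

def intelligence_core(value_fn: Callable[[str], float],
                      candidates: Iterable[str]) -> str:
    # Component (2): search for the output that maximizes value_fn.
    # Here a trivial enumeration; the real core is a superintelligent
    # search over a vast space of possible outputs.
    return max(candidates, key=value_fn)

print(intelligence_core(hardcoded_value_fn, ["a", "ab", "abc"]))  # → abc
```

The attack described next targets component (2): even with (1) fixed, what the search process does depends on how it represents which function it is really maximizing.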
As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.
I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.
this was an attempt to write very clearly, i hope it worked!
Reflecting on this more, I wrote in a discord server (then edited to post here):
I wasn't aware the concept of pivotal acts was entangled with the frame of formal inner+outer alignment as the only (or only feasible?) way to cause safe ASI.
I suspect that by default, I and someone operating in that frame might mutually believe each other's agendas to be probably doomed. This could make discussion more valuable (as in that case, at least one of us should make a large update).
For anyone interested in trying that discussion, I'd be curious what you think of the post linked above. As a comment on it says:
In my view, solving formal inner alignment, i.e. devising a general method to create ASI with any specified output-selection policy, is hard enough that I don't expect it to be done.[1] This is why I've been focusing on other approaches which I believe are more likely to succeed.
Though I encourage anyone who understands the problem and thinks they can solve it to try to prove me wrong! I can sure see some directions and I think a very creative human could solve it in principle. But I also think a very creative human might find a different class of solution that can be achieved sooner. (Like I've been trying to do :)