“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
independent researcher theorizing about superintelligence-robust training stories and predictive models
(other things i am: suffering-focused altruist, vegan, fdt agent, artist)
contact: {discord: quilalove
, matrix: @quilauwu:matrix.org
, email: quila1@protonmail.com
}
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
(see reply to Wei Dai)
it is considered a constraint by some because they think that it would be easier/safer to use a superintelligent AI to do simpler actions, while alignment is not yet fully solved
Agreed that some think this, and agreed that formally specifying a simple action policy is easier than a more complex one.[1]
I have a different model of what the earliest safe ASIs will look like, in most futures where they exist. Rather than 'task-aligned' agents, I expect them to be non-agentic systems which can be used to, e.g., come up with pivotal actions for the human group to take / information to act on.[2]
although formal 'task-aligned agency' seems potentially more complex than the one attempt at a 'full' outer alignment solution I'm aware of (QACI): formally specifying what a {GPU, AI lab, shutdown of an AI lab} is seems more complex than QACI itself.
I think these systems are more attainable; see this post to possibly infer more. (It has proven very difficult for me to write in a way that I expect will be moving to people whose model is focused on 'formal inner + formal outer alignment', but I think evhub has done so well.)
I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.
Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.
This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]
My thought is that a pivotal act doesn't need to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but for a superintelligence, simplicity is not a constraint.
We can instead select for the 'kindest' or 'least adversarial' actions, or more precisely: the functional-decision-theoretically optimal actions that save the future while minimizing the adversariality this creates in the past (present).
Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.
Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.
side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the maintext argues).
If there is a group for whom using ASI to make the world robust to risks and free of harm, in a way where its actions don't infringe on ongoing non-violent activities, is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.
Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it with. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.
I hadn't considered this argument, thanks for sharing it.
It seems to rest on this implicit piece of reasoning:
(premise 1) If human intelligence is modelled as normally distributed, it's statistically probable that the most intelligent human is only slightly more intelligent than the next most intelligent humans.
(premise 2) One of the plausibly most intelligent humans was capable of doing much better than other highly intelligent humans in their field.
(conclusion) It's probable that past some threshold, small increases in intelligence lead to great increases in output quality.
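Premise 1 can be illustrated with a quick order-statistics simulation. This is only a toy sketch under assumptions I'm introducing here (standard-normal 'intelligence', an arbitrary population size), not a claim about actual human intelligence:

```python
import random
import statistics

# Toy illustration of premise 1: among many draws from a normal
# distribution, the largest draw typically exceeds the second-largest
# by only a small fraction of one standard deviation.
# N (population size) and TRIALS are arbitrary assumptions.
random.seed(0)
N = 10_000
TRIALS = 100

gaps = []
for _ in range(TRIALS):
    draws = sorted(random.gauss(0, 1) for _ in range(N))
    gaps.append(draws[-1] - draws[-2])  # gap between top two draws

print(round(statistics.mean(gaps), 2))  # typically well under 1 SD
```

The mean gap between the top two draws comes out small relative to the distribution's spread, which is the statistical fact premise 1 relies on.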
It's ambiguous what 'intelligence' refers to here if we decouple that word from the quality of insight one is capable of. Here's a way of re-framing this conclusion to make it more quantifiable/discussable: "Past some threshold, as a system's quality of insight increases, the optimization required (for evolution or a training process) to select for a system capable of greater insight decreases".
The level at which this becomes true would need to be higher than that of any AI so far; otherwise we would observe training processes easily optimizing these systems into superintelligences, rather than loss curves stabilizing at some point above 0.
I feel uncertain whether there are conceptual reasons (priors) for this conclusion being true or untrue.
I'm also not confident that human intelligence is normally distributed in the upper limits, because I don't expect there are known strong theoretical reasons to believe this.
Overall, it seems to warrant at least a two-digit probability, given the plausibility of the premises.
i like the idea. it looks useful and it fits my reading style well. i wish something like this were more common - i have seen it on personal blogs before like carado's.
i would use [Concept Dependency] or [Concept Reference] instead so the reader understands just from seeing the title on the front page. also avoids acronym collision
when i was younger, pre-rationalist, i tried to go on hunger strike to push my abusive parent to stop funding this.
they agreed to watch this as part of a negotiation. they watched part of it.
they changed their behavior slightly -- as a negotiation -- for about a month.
they didn't care.
they looked horror in the eye. they didn't flinch. they saw themself in it.
i'm watching Dominion again to remind myself of the world i live in, to regain passion to Make It Stop
it's already working.
(I appreciate object-level engagement in general, but this seems combatively worded.)
(edit: I don't think this or the original shortform deserved negative karma, that seems malicious/LW-norm-violating.)
The rest of this reply responds to arguments.
Why should the Earth superintelligence care about you, but not about the other 10^10^30 other causally independent ASIs that are latent in the hypothesis space, each capable of running enormous numbers of copies of the Earth ASI in various scenarios?
On your second paragraph: see the last dot point in the original post, which describes a system ~matching what you've asserted as necessary, and in general see the emphasis that this attack would not work against all systems. I'm uncertain which of the two classes (vulnerable and not vulnerable) is more likely to arise. It could well be that the vulnerable class is rare or almost never arises in practice.
But I don't think it's as simple as you've framed it, where the described scenario is impossible simply because a value function has been hardcoded in. The point was largely to show that what appears to be a system which will only maximize the function you hardcoded into it could actually do something else in a particular case -- even though the function has indeed been manually entered by you.
(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))
Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:
It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.
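A toy sketch of that two-component shape (with entirely hypothetical names and a trivial action space, not anyone's actual design) might look like:

```python
from typing import Callable, Iterable

def hardcoded_value_fn(action: str) -> float:
    # Component (1): the value function fixed by the designers.
    # A toy stand-in; a real one would encode the intended values.
    return float(len(action))

def intelligence_core(value_fn: Callable[[str], float],
                      candidates: Iterable[str]) -> str:
    # Component (2): search for the output that maximizes value_fn.
    # Here a trivial enumeration; the real core is a superintelligent
    # search over a vast space of possible outputs.
    return max(candidates, key=value_fn)

print(intelligence_core(hardcoded_value_fn, ["a", "ab", "abc"]))  # → abc
```

The attack described next targets component (2): even with (1) fixed, what the search process does depends on how it represents which function it is really maximizing.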
As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.
I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.
this was an attempt to write very clearly, i hope it worked!
Reflecting on this more, I wrote in a discord server (then edited to post here):
I wasn't aware the concept of pivotal acts was entangled with the frame of formal inner+outer alignment as the only (or only feasible?) way to cause safe ASI.
I suspect that by default, I and someone operating in that frame might mutually believe each other's agendas to be probably doomed. This could make discussion more valuable (as in that case, at least one of us should make a large update).
For anyone interested in trying that discussion, I'd be curious what you think of the post linked above. As a comment on it says:
In my view, solving formal inner alignment, i.e. devising a general method to create ASI with any specified output-selection policy, is hard enough that I don't expect it to be done.[1] This is why I've been focusing on other approaches which I believe are more likely to succeed.
Though I encourage anyone who understands the problem and thinks they can solve it to try to prove me wrong! I can sure see some directions and I think a very creative human could solve it in principle. But I also think a very creative human might find a different class of solution that can be achieved sooner. (Like I've been trying to do :)