Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm worried that many AI alignment researchers and other LWers have a view of how human morality works, that really only applies to a small fraction of all humans (notably moral philosophers and themselves). In this view, people know or at least suspect that they are confused about morality, and are eager or willing to apply reason and deliberation to find out what their real values are, or to correct their moral beliefs. Here's an example of someone who fits this view:

I’ve written, in the past, about a “ghost” version of myself — that is, one that can float free from my body; which can travel anywhere in all space and time, with unlimited time, energy, and patience; and which can also make changes to different variables, and play forward/rewind different counterfactual timelines (the ghost’s activity somehow doesn’t have any moral significance).

I sometimes treat such a ghost kind of like an idealized self. It can see much that I cannot. It can see directly what a small part of the world I truly am; what my actions truly mean. The lives of others are real and vivid for it, even when hazy and out of mind for me. I trust such a perspective a lot. If the ghost would say “don’t,” I’d be inclined to listen.

I'm currently reading The Status Game by Will Storr (highly recommended BTW), and found in it the following description of how morality works in most people, which matches my own understanding of history and my observations of humans around me:

The moral reality we live in is a virtue game. We use our displays of morality to manufacture status. It’s good that we do this. It’s functional. It’s why billionaires fund libraries, university scholarships and scientific endeavours; it’s why a study of 11,672 organ donations in the USA found only thirty-one were made anonymously. It’s why we feel good when we commit moral acts and thoughts privately and enjoy the approval of our imaginary audience. Virtue status is the bribe that nudges us into putting the interests of other people – principally our co-players – before our own.

We treat moral beliefs as if they’re universal and absolute: one study found people were more likely to believe God could change physical laws of the universe than he could moral ‘facts’. Such facts can seem to belong to the same category as objects in nature, as if they could be observed under microscopes or proven by mathematical formulae. If moral truth exists anywhere, it’s in our DNA: that ancient game-playing coding that evolved to nudge us into behaving co-operatively in hunter-gatherer groups. But these instructions – strive to appear virtuous; privilege your group over others – are few and vague and open to riotous differences in interpretation. All the rest is an act of shared imagination. It’s a dream we weave around a status game.

The dream shifts as we range across the continents. For the Malagasy people in Madagascar, it’s taboo to eat a blind hen, to dream about blood and to sleep facing westwards, as you’ll kick the sunrise. Adolescent boys of the Marind of South New Guinea are introduced to a culture of ‘institutionalised sodomy’ in which they sleep in the men’s house and absorb the sperm of their elders via anal copulation, making them stronger. Among the people of the Moose, teenage girls are abducted and forced to have sex with a married man, an act for which, writes psychologist Professor David Buss, ‘all concerned – including the girl – judge that her parents giving her to the man was a virtuous, generous act of gratitude’. As alien as these norms might seem, they’ll feel morally correct to most who play by them. They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.

Such ‘facts’ also change across time. We don’t have to travel back far to discover moral superstars holding moral views that would destroy them today. Feminist hero and birth control campaigner Marie Stopes, who was voted Woman of the Millennium by the readers of The Guardian and honoured on special Royal Mail stamps in 2008, was an anti-Semite and eugenicist who once wrote that ‘our race is weakened by an appallingly high percentage of unfit weaklings and diseased individuals’ and that ‘it is the urgent duty of the community to make parenthood impossible for those whose mental and physical conditions are such that there is well-nigh a certainty that their offspring must be physically and mentally tainted’. Meanwhile, Gandhi once explained his agitation against the British thusly: ‘Ours is one continual struggle against a degradation sought to be inflicted upon us by the Europeans, who desire to degrade us to the level of the raw Kaffir [black African] … whose sole ambition is to collect a certain number of cattle to buy a wife with and … pass his life in indolence and nakedness.’ Such statements seem obviously appalling. But there’s about as much sense in blaming Gandhi for not sharing our modern, Western views on race as there is in blaming the Vikings for not having Netflix. Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

The dream feels so real. And yet it’s all conjured up by the game-making brain. The world around our bodies is chaotic, confusing and mostly unknowable. But the brain must make sense of it. It has to turn that blizzard of noise into a precise, colourful and detailed world it can predict and successfully interact with, such that it gets what it wants. When the brain discovers a game that seems to make sense of its felt reality and offer a pathway to rewards, it can embrace its rules and symbols with an ecstatic fervour. The noise is silenced! The chaos is tamed! We’ve found our story and the heroic role we’re going to play in it! We’ve learned the truth and the way – the meaning of life! It’s yams, it’s God, it’s money, it’s saving the world from evil big pHARMa. It’s not like a religious experience, it is a religious experience. It’s how the writer Arthur Koestler felt as a young man in 1931, joining the Communist Party:

‘To say that one had “seen the light” is a poor description of the mental rapture which only the convert knows (regardless of what faith he has been converted to). The new light seems to pour from all directions across the skull; the whole universe falls into pattern, like stray pieces of a jigsaw puzzle assembled by one magic stroke. There is now an answer to every question, doubts and conflicts are a matter of the tortured past – a past already remote, when one lived in dismal ignorance in the tasteless, colourless world of those who don’t know. Nothing henceforth can disturb the convert’s inner peace and serenity – except the occasional fear of losing faith again, losing thereby what alone makes life worth living, and falling back into the outer darkness, where there is wailing and gnashing of teeth.’

I hope this helps further explain why I think even solving (some versions of) the alignment problem probably won't be enough to ensure a future that's free from astronomical waste or astronomical suffering. A part of me is actually more scared of many futures in which "alignment is solved", than a future where biological life is simply wiped out by a paperclip maximizer.

122 comments


> All the rest is an act of shared imagination. It’s a dream we weave around a status game.
> They’re part of the dream of reality in which they exist, a dream that feels no less obvious and true to them than ours does to us.
> Moral ‘truths’ are acts of imagination. They’re ideas we play games with.

IDK, I feel like you could say the same sentences truthfully about math, and if you "went with the overall vibe" of them, you might be confused and mistakenly think math was "arbitrary" or "meaningless", or doesn't have a determinate tendency, etc. Like, okay, if I say "one element of moral progress is increasing universalizability", and you say "that's just the thing your status cohort assigns high status", I'm like, well, sure, but that doesn't mean it doesn't also have other interesting properties, like being a tendency across many different peoples; like being correlated with the extent to which they're reflecting, sharing information, and building understanding; like resulting in reductionist-materialist local outcomes that have more of material local things that people otherwise generally seem to like (e.g. not being punched, having food, etc.); etc. It could be that morality has tendencies, but not without hormesis and mutually assured destruction and similar things that might be removed by aligned AI.

2fourier2y
  "Morality" is totally unlike mathematics where the rules can first be clearly defined, and we operate with that set of rules. I believe "increasing universalizability" is a good example to prove OPs point.  I don't think it's a common belief among "many different peoples" in any meaningful sense.  I don't even really understand what it entails. There may be a few nearly universal elements like "wanting food", but destructive aspects are fundamental to our lives so you can't just remove them without fundamentally altering our nature as human beings. Like a lot of people, I don't mind being punched a little as long as (me / my family / my group) wins and gains more resources. I really want to see the people I hate being harmed, and would sacrifice a lot for it, that's a very fundamental aspect of being human.
4TekhneMakre2y
Are you pursuing this to any great extent? If so, remind me to stay away from you and avoid investing in you.
5fourier2y
Why are you personally attacking me for discussing the topic at hand? I'm discussing human nature and giving myself as a counter-example, but I clearly meant that it applies to everyone in different ways. I will avoid personal examples since some people have a hard time understanding. I believe you are ironically proving my point by signaling against me based on my beliefs which you dislike.

Attacking you? I said I don't want to be around you and don't want to invest in you. I said it with a touch of snark ("remind me").

> I clearly meant that it applies to everyone in different ways

Not clear to me. I don't think everyone "would sacrifice a lot" to "see the people [they] hate being harmed". I wouldn't. I think behaving that way is inadvisable for you and harmful to others, and will tend to make you a bad investment opportunity.

3TekhneMakre2y
By that description, mathematics is fairly unlike mathematics. It entails that behavior that people consider moral, tends towards having the property that if everyone behaved like that, things would be good. Rule of law, equality before the law, Rawlsian veil of ignorance, stare decisis, equality of opportunity, the golden rule, liberty, etc. Generally, norms that are symmetric across space, time, context, and person. (Not saying we actually have these things, or that "most people" explicitly think these things are good, just that people tend to update in favor of these things.)
0fourier2y
This is just circular.  What is "good"? Evidence that "most people" update in favor of these things? It seems like a very current western morality centric view, and you could probably get people to update in the opposite direction (and they did, many times in history).
2TekhneMakre2y
>Evidence that "most people" update in favor of these things? It seems like a very current western morality centric view, Yeah, I think you're right that it's biased towards Western. I think you can generate the obvious examples (e.g. law systems developing; e.g. various revolutions in the name of liberty and equality and against tyranny), and I'm not interested enough right now to come up with more comprehensive treatment of the evidence, and I'm not super confident. It could be interesting to see how this plays out in places where these tendencies seem least present. Is China such a place? (What do most people living in China really think of non-liberty, non-Rawlsianism, etc.?)
-1Sammy Martin2y
The above sentences, if taken (as you do) as claims about human moral psychology rather than normative ethics, are compatible with full-on moral realism. I.e. everyone's moral attitudes are pushed around by status concerns, luckily we ended up in a community that ties status to looking for long-run implications of your beliefs and making sure they're coherent, and so without having fundamentally different motivations to any other human being we were better able to be motivated by actual moral facts.

I know the OP is trying to say loudly and repeatedly that this isn't the case because 'everyone else thought that as well, don't you know?' with lots of vivid examples, but if that's the only argument it seems like modesty epistemology - i.e. "most people who said the thing you said were wrong, and also said that they weren't like all those other people who were wrong in the past for all these specific reasons, so you should believe you're wrong too".

I think a lot of this thread confuses moral psychology with normative ethics - most utilitarians know and understand that they aren't solely motivated by moral concerns, and are also motivated by lots of other things. They know they don't morally endorse those motivations in themselves, but don't do anything about it, and don't thereby change their moral views. If Peter Singer goes and buys a coffee, it's no argument at all to say "aha, by revealed preferences, you must not really think utilitarianism is true, or you'd have given the money away!" That doesn't show that when he does donate money, he's unmotivated by moral concerns.

Probably even this 'pure' motivation to act morally in cases where empathy isn't much of an issue is itself made up of e.g. a desire not to be seen believing self-contradictory things, cognitive dissonance, basic empathy and so on. But so what? If the emotional incentives work to motivate people to form more coherent moral views, it's the reliability of the process of forming the views that ma

You sound like you're positing the existence of two types of people: type I people who have morality based on "reason" and type II people who have morality based on the "status game". In reality, ~~everyone's~~ nearly everyone's morality is based on something like the status game (see also: 1 2 3). It's just that EAs and moral philosophers are playing the game in a tribe which awards status differently.

The true intrinsic values of most people do place a weight on the happiness of other people (that's roughly what we call "empathy"), but this weight is very unequally distributed.

There are definitely thorny questions regarding the best way to aggregate the values of different people in TAI. But, I think that given a reasonable solution, a lower bound on the future is imagining that the AI will build a private utopia for every person, as isolated from the other "utopias" as that person wants it to be. Probably some people's "utopias" will not be great, viewed in utilitarian terms. But, I still prefer that over paperclips (by far). And, I suspect that most people do (even if they protest it in order to play the game).

[-]Wei Dai2yΩ6120

> It’s just that EAs and moral philosophers are playing the game in a tribe which awards status differently.

Sure, I've said as much in recent comments, including this one. ETA: Related to this, I'm worried about AI disrupting "our" status game in an unpredictable and possibly dangerous way. E.g., what will happen when everyone uses AI advisors to help them play status games, including the status game of moral philosophy?

> The true intrinsic values of most people do place a weight on the happiness of other people (that’s roughly what we call “empathy”), but this weight is very unequally distributed.

What do you mean by "true intrinsic values"? (I couldn't find any previous usage of this term by you.) How do you propose finding people's true intrinsic values?

These weights, if low enough relative to other "values", haven't prevented people from committing atrocities on each other in the name of morality.

> There are definitely thorny questions regarding the best way to aggregate the values of different people in TAI. But, I think that given a reasonable solution, a lower bound on the future is imagining that the AI will build a private utopia for every person, as isolated from the ot

... (read more)
7Vanessa Kosoy2y
I mean the values relative to which a person seems most like a rational agent, arguably formalizable along these lines. Yes. Yes. I do think multi-user alignment is an important problem (and occasionally spend some time thinking about it), it just seems reasonable to solve single user alignment first. Andrew Critch is an example of a person who seems to be concerned about this. I meant that each private utopia can contain any number of people created by the AI, in addition to its "customer". Ofc groups that can agree on a common utopia can band together as well. They are prevented from simulating other pre-existing people without their consent, but can simulate a bunch of low status people to lord over. Yes, this can be bad. Yes, I still prefer this (assuming my own private utopia) over paperclips. And, like I said, this is just a relatively easy to imagine lower bound, not necessarily the true optimum. The selfish part, at least, doesn't have any reason to be scared as long as you are a "customer".
6Wei Dai2y
Why do you think this will be the result of the value aggregation (or a lower bound on how good the aggregation will be)? For example, if there is a big block of people who all want to simulate person X in order to punish that person, and only X and a few other people object, why won't the value aggregation be "nobody pre-existing except X (and Y and Z etc.) can be simulated"?
5Vanessa Kosoy2y
Given some assumptions about the domains of the utility functions, it is possible to do better than what I described in the previous comment. Let $X_i$ be the space of possible experience histories[1] of user $i$ and $Y$ the space of everything else the utility functions depend on (things that nobody can observe directly). Suppose that the domain of the utility functions is $Z := \prod_i X_i \times Y$. Then, we can define the "denosing[2] operator" $D_i : C(Z) \to C(Z)$ for user $i$ by

$$(D_i u)(x_i, x_{-i}, y) := \max_{x' \in \prod_{j \neq i} X_j} u(x_i, x', y)$$

Here, $x_i$ is the argument of $u$ that ranges in $X_i$, $x_{-i}$ are the arguments that range in $X_j$ for $j \neq i$, and $y$ is the argument that ranges in $Y$. That is, $D_i$ modifies a utility function by having it "imagine" that the experiences of all users other than $i$ have been optimized, for the experiences of user $i$ and the unobservables held constant.

Let $u_i : Z \to \mathbb{R}$ be the utility function of user $i$, and $d^0 \in \mathbb{R}^n$ the initial disagreement point (everyone dying), where $n$ is the number of users. We then perform cooperative bargaining on the denosed utility functions $D_i u_i$ with disagreement point $d^0$, producing some outcome $\mu^0 \in \Delta(Z)$. Define $d^1 \in \mathbb{R}^n$ by $d^1_i := \mathbb{E}_{\mu^0}[u_i]$. Now we do another cooperative bargaining with $d^1$ as the disagreement point and the original utility functions $u_i$. This gives us the final outcome $\mu^1$.

Among other benefits, there is now much less need to remove outliers. Perhaps, instead of removing them we still want to mitigate them by applying "amplified denosing" to them which also removes the dependence on $Y$. For this procedure, there is a much better case that the lower bound will be met.

----------------------------------------

1. In the standard RL formalism this is the space of action-observation sequences $(A \times O)^\omega$. ↩︎
2. From the expression "nosy preferences", see e.g. here. ↩︎
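A minimal toy sketch of the construction above, for concreteness. Everything specific here is an illustrative assumption (two users, tiny finite spaces X and Y, made-up utility functions, pure outcomes rather than lotteries, and a simple Nash-product rule standing in for a full cooperative bargaining solution); it shows the shape of the two-stage procedure, not the actual proposal:

```python
# Toy illustration of "denosing" followed by two-stage bargaining.
# All concrete choices below are assumptions made for the example.

import itertools
from math import prod

X = [0, 1, 2]   # possible experience histories for each user (X_i)
Y = [0, 1]      # possible values of the unobservable y

def u0(x0, x1, y):
    # User 0 mostly cares about their own experience.
    return x0 + 0.1 * x1 + 0.2 * y

def u1(x0, x1, y):
    # User 1 has a "nosy" preference about user 0's experience.
    return x1 - 0.5 * x0

utilities = [u0, u1]
outcomes = list(itertools.product(X, X, Y))  # z = (x0, x1, y)

def denose(i, u):
    """D_i u: evaluate u as if the *other* user's experience were optimized,
    holding x_i and y fixed, so nosy preferences stop biting."""
    def du(x0, x1, y):
        if i == 0:
            return max(u(x0, x1_alt, y) for x1_alt in X)
        return max(u(x0_alt, x1, y) for x0_alt in X)
    return du

def bargain(utils, disagreement):
    """Pick the outcome maximizing the Nash product of gains over the
    disagreement point, among outcomes acceptable to everyone."""
    feasible = [z for z in outcomes
                if all(u(*z) >= d for u, d in zip(utils, disagreement))]
    return max(feasible,
               key=lambda z: prod(u(*z) - d for u, d in zip(utils, disagreement)))

# Stage 1: bargain with the denosed utilities; d0 stands in for "everyone dies".
d0 = [0.0, 0.0]
mu0 = bargain([denose(i, u) for i, u in enumerate(utilities)], d0)

# Stage 2: bargain with the original utilities, using each user's payoff
# at the stage-1 outcome as the new disagreement point d1.
d1 = [u(*mu0) for u in utilities]
mu1 = bargain(utilities, d1)

print("stage-1 outcome:", mu0, "stage-2 outcome:", mu1)
```

Even in this toy the point of the two stages is visible: the denosed stage-1 bargain fixes a disagreement point that doesn't reward nosy preferences, and the stage-2 bargain over the true utilities can only weakly improve on it for every user.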
5Joe_Collman2y
This is very interesting (and "denosing operator" is delightful). Some thoughts:

If I understand correctly, I think there can still be a problem where user $i$ wants an experience history such that part of the history is isomorphic to a simulation of user $j$ suffering ($i$ wants to fully experience $j$ suffering in every detail). Here a fixed $x_i$ may entail some fixed $x_j$ for (some copy of) some $j$. It seems the above approach can't then avoid leaving one of $i$ or $j$ badly off:

If $i$ is permitted to freely determine the experience of the embedded $j$ copy, the disagreement point in the second bargaining will bake this in: $j$ may be horrified to see that $i$ wants to experience its copy suffer, but will be powerless to stop it (if $i$ won't budge in the bargaining).

Conversely, if the embedded $j$ is treated as a user which $i$ will imagine is exactly to $i$'s liking, but who actually gets what $j$ wants, then the selected $\mu^0$ will be horrible for $i$ (e.g. perhaps $i$ wants to fully experience Hitler suffering, and instead gets to fully experience Hitler's wildest fantasies being realized). I don't think it's possible to do anything like denosing to avoid this.

It may seem like this isn't a practical problem, since we could reasonably disallow such embedding. However, I think that's still tricky since there's a less exotic version of the issue: my experiences likely already are a collection of subagents' experiences. Presumably my maximisation over $x_{\mathrm{joe}}$ is permitted to determine all the $x_{\mathrm{subjoe}}$. It's hard to see how you draw a principled line here: the ideal future for most people may easily be transhumanist to the point where today's users are tomorrow's subpersonalities (and beyond).

A case that may have to be ruled out separately is where $i$ wants to become a suffering $j$. Depending on what I consider 'me', I might be entirely fine with it if 'I' wake up tomorrow as suffering $j$ (if I'm done living and think $j$ deserves to suffer). Or perhaps I want to clone myself $10^{10}$ times, and then have
5Vanessa Kosoy2y
I think that a rigorous treatment of such issues will require some variant of IB physicalism (in which the monotonicity problem has been solved, somehow). I am cautiously optimistic that a denosing operator exists there which dodges these problems. This operator will declare both the manifesting and evaluation of the source codes of other users to be "out of scope" for a given user. Hence, a preference of i to observe the suffering of j would be "satisfied" by observing nearly anything, since the maximization can interpret anything as a simulation of j. The "subjoe" problem is different: it is irrelevant because "subjoe" is not a user, only Joe is a user. All the transhumanist magic that happens later doesn't change this. Users are people living during the AI launch, and only them. The status of any future (trans/post)humans is determined entirely according to the utility functions of users. Why? For two reasons: (i) the AI can only have access and stable pointers to existing people (ii) we only need the buy-in of existing people to launch the AI. If existing people want future people to be treated well, then they have nothing to worry about since this preference is part of the existing people's utility functions.
1Joe_Collman2y
Ah - that's cool if IB physicalism might address this kind of thing (still on my to-read list).

Agreed that the subjoe thing isn't directly a problem. My worry is mainly whether it's harder to rule out $i$ experiencing a simulation of $x_{\mathrm{subj}}$-suffering, since subj isn't a user. However, if you can avoid the suffering $j$s by limiting access to information, the same should presumably work for relevant sub-$j$s.

This isn't so clear (to me at least) if:

1. Most, but not all current users want future people to be treated well.
2. Part of being "treated well" includes being involved in an ongoing bargaining process which decides the AI's/future's trajectory.

For instance, suppose initially 90% of people would like to have an iterated bargaining process that includes future (trans/post)humans as users, once they exist. The other 10% are only willing to accept such a situation if they maintain their bargaining power in future iterations (by whatever mechanism). If you iterate this process, the bargaining process ends up dominated by users who won't relinquish any power to future users. 90% of initial users might prefer drift over lock-in, but we get lock-in regardless (the disagreement point also amounting to lock-in).

Unless I'm confusing myself, this kind of thing seems like a problem (not in terms of reaching some non-terrible lower bound, but in terms of realising potential). Wherever there's this kind of asymmetry/degradation over bargaining iterations, I think there's an argument for building in a way to avoid it from the start - since anything short of 100% just limits to 0 over time.

[it's by no means clear that we do want to make future people users on an equal footing to today's people; it just seems to me that we have to do it at step zero or not at all]
3Vanessa Kosoy2y
I admit that at this stage it's unclear because physicalism brings in the monotonicity principle that creates bigger problems than what we discuss here. But maybe some variant can work. Roughly speaking, in this case the 10% preserve their 10% of the power forever. I think it's fine because I want the buy-in of this 10% and the cost seems acceptable to me. I'm also not sure there is any viable alternative which doesn't have even bigger problems.
1Joe_Collman2y
Sure, I'm not sure there's a viable alternative either. This kind of approach seems promising - but I want to better understand any downsides.

My worry wasn't about the initial 10%, but about the possibility of the process being iterated such that you end up with almost all bargaining power in the hands of power-keepers. In retrospect, this is probably silly: if there's a designable-by-us mechanism that better achieves what we want, the first bargaining iteration should find it. If not, then what I'm gesturing at must either be incoherent, or not endorsed by the 10% - so hard-coding it into the initial mechanism wouldn't get the buy-in of the 10% to the extent that they understood the mechanism.

In the end, I think my concern is that we won't get buy-in from a large majority of users: in order to accommodate some proportion with odd moral views it seems likely you'll be throwing away huge amounts of expected value in others' views - if I'm correctly interpreting your proposal (please correct me if I'm confused). Is this where you'd want to apply amplified denosing? So, rather than filtering out the undesirable $i$, for these $i$ you use:

$$(D_i u)(x_i, x_{-i}, y) := \max_{x' \in \prod_{j \neq i} X_j,\ y' \in Y} u(x_i, x', y')$$

[i.e. ignoring $y$ and imagining it's optimal]

However, it's not clear to me how we'd decide who gets strong denosing (clearly not everyone, or we don't pick a $y$). E.g. if you strong-denose anyone who's too willing to allow bargaining failure [everyone dies] you might end up filtering out altruists who worry about suffering risks. Does that make sense?
2Vanessa Kosoy2y
I'm not sure what you mean here, but also the process is not iterated: the initial bargaining is deciding the outcome once and for all. At least that's the mathematical ideal we're approximating.

I don't think so? The bargaining system does advantage large groups over small groups. In practice, I think that for the most part people don't care much about what happens "far" from them (for some definition of "far", not physical distance), so giving them private utopias is close to optimal from each individual perspective. Although it's true they might pretend to care more than they do for the usual reasons, if they're thinking in "far-mode".

I would certainly be very concerned about any system that gives even more power to majority views. For example, what if the majority of people are disgusted by gay sex and prefer it not to happen anywhere? I would rather accept things I disapprove of happening far away from me than allow other people to control my own life. Ofc the system also mandates win-win exchanges. For example, if Alice's and Bob's private utopias each contain something strongly unpalatable to the other but not strongly important to the respective customer, the bargaining outcome will remove both unpalatable things.

I'm fine with strong-denosing negative utilitarians who would truly stick to their guns about negative utilitarianism (but I also don't think there are many).
1Joe_Collman2y
Ah, I was just being an idiot on the bargaining system w.r.t. small numbers of people being able to hold it to ransom. Oops. Agreed that more majority power isn't desirable. [re iteration, I only meant that the bargaining could become iterated if the initial bargaining result were to decide upon iteration (to include more future users). I now don't think this is particularly significant.] I think my remaining uncertainty (/confusion) is all related to the issue I first mentioned (embedded copy experiences). It strikes me that something like this can also happen where minds grow/merge/overlap. Does this avoid the problem if i's preferences use indirection? It seems to me that a robust pointer to j may be enough: that with a robust pointer it may be possible to implicitly require something like source-code-access without explicitly referencing it. E.g. where i has a preference to "experience j suffering in circumstances where there's strong evidence it's actually j suffering, given that these circumstances were the outcome of this bargaining process". If i can't robustly specify things like this, then I'd guess there'd be significant trouble in specifying quite a few (mutually) desirable situations involving other users too. IIUC, this would only be any problem for the denosed bargaining to find a good d1: for the second bargaining on the true utility functions there's no need to put anything "out of scope" (right?), so win-wins are easily achieved.
3Vanessa Kosoy2y
I'm imagining cooperative bargaining between all users, where the disagreement point is everyone dying[1][2] (this is a natural choice assuming that if we don't build aligned TAI we get paperclips). This guarantees that every user will receive an outcome that's at least not worse than death.

With Nash bargaining, we can still get issues for (in)famous people that millions of people want to do unpleasant things to. Their outcome will be better than death, but maybe worse than in my claimed "lower bound". With Kalai-Smorodinsky bargaining things look better, since essentially we're maximizing a minimum over all users. This should admit my lower bound, unless it is somehow disrupted by enormous asymmetries in the maximal payoffs of different users. In either case, we might need to do some kind of outlier filtering: if e.g. literally every person on Earth is a user, then maybe some of them are utterly insane in ways that cause the Pareto frontier to collapse. [EDIT: see improved solution]

Bargaining assumes we can access the utility function. In reality, even if we solve the value learning problem in the single user case, once you go to the multi-user case it becomes a mechanism design problem: users have incentives to lie / misrepresent their utility functions. A perfect solution might be impossible, but I proposed mitigating this by assigning each user a virtual "AI lawyer" that provides optimal input on their behalf into the bargaining system. In this case they at least have no incentive to lie to the lawyer, and the outcome will not be skewed in favor of users who are better in this game, but we don't get the optimal bargaining solution either.

All of this assumes the TAI is based on some kind of value learning. If the first-stage TAI is based on something else, the problem might become easier or harder. Easier because the first-stage TAI will produce better solutions to the multi-user problem for the second-stage TAI. Harder because it can allow the small gro
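For reference, the standard textbook formulations of the two bargaining solutions mentioned above (nothing here is specific to this proposal): for a feasible payoff set $F \subseteq \mathbb{R}^n$ and disagreement point $d \in \mathbb{R}^n$,

$$\text{Nash:}\quad \arg\max_{v \in F,\ v \ge d}\ \prod_{i=1}^{n} (v_i - d_i)$$

$$\text{Kalai–Smorodinsky:}\quad \arg\max_{v \in F,\ v \ge d}\ \min_{i}\ \frac{v_i - d_i}{v_i^{*} - d_i}, \qquad v_i^{*} := \max\{\,v_i : v \in F,\ v \ge d\,\}$$

On a convex, comprehensive feasible set the second expression coincides with the usual "equal relative gains" definition of the Kalai–Smorodinsky solution, which is the sense in which it maximizes a minimum over all users: each user's gain is normalized by the most that user could possibly get, and the worst-off user by that normalized measure is made as well off as possible.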
2Wei Dai2y
Assuming each lawyer has the same incentive to lie as its client, it has an incentive to misrepresent that some preferable-to-death outcomes are "worse-than-death" (in order to force those outcomes out of the set of "feasible agreements" in hope of getting a more preferred outcome as the actual outcome), and this at equilibrium is balanced by the marginal increase in the probability of getting "everyone dies" as the outcome (due to feasible agreements becoming a null set) caused by the lie. So the probability of "everyone dies" in this game has to be non-zero. (It's the same kind of problem as in the AI race or tragedy of commons: people not taking into account the full social costs of their actions as they reach for private benefits.) Of course in actuality everyone dying may not be a realistic consequence of failure to reach agreement, but if the real consequence is better than that, and the AI lawyers know this, they would be more willing to lie since the perceived downside of lying would be smaller, so you end up with a higher chance of no agreement.
2Vanessa Kosoy2y
Yes, it's not a very satisfactory solution. Some alternative/complementary solutions:

* Somehow use non-transformative AI to do my mind uploading, and then have the TAI learn by inspecting the uploads. Would be great for single-user alignment as well.
* Somehow use non-transformative AI to create perfect lie detectors, and use this to enforce honesty in the mechanism. (But, is it possible to detect self-deception?)
* Have the TAI learn from past data which wasn't affected by the incentives created by the TAI. (But, is there enough information there?)
* Shape the TAI's prior about human values in order to rule out at least the most blatant lies.
* Some clever mechanism design I haven't thought of. The problem with this is that most mechanism designs rely on money, and money doesn't seem applicable here, whereas when you don't have money there are many impossibility theorems.
2Joe_Collman2y
This seems near guaranteed to me: a non-zero amount of people will be that crazy (in our terms), so filtering will be necessary. Then I'm curious about how we draw the line on outlier filtering. What filtering rule do we use? I don't yet see a good principled rule (e.g. if we want to throw out people who'd collapse agreement to the disagreement point, there's more than one way to do that).
1[comment deleted]2y
5Wei Dai2y
For a utilitarian, this doesn't mean much. What's much more important is something like, "How close is this outcome to an actual (global) utopia (e.g., with optimized utilitronium filling the universe), on a linear scale?" For example, my rough expectation (without having thought about it much) is that your "lower bound" outcome is about midway between paperclips and actual utopia on a logarithmic scale. In one sense, this is much better than paperclips, but in another sense (i.e., on the linear scale), it's almost indistinguishable from paperclips, and a utilitarian would only care about the latter and therefore be nearly as disappointed by that outcome as paperclips.
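To make the log/linear distinction concrete (the numbers are purely illustrative, not anyone's estimate): suppose

$$U(\text{utopia}) = 10^{40}, \qquad U(\text{paperclips}) \approx 10^{0} = 1.$$

An outcome "midway on a logarithmic scale" is then worth about $10^{20}$: twenty orders of magnitude better than paperclips, yet only a $10^{-20}$ fraction of utopia, which is why it is nearly indistinguishable from paperclips in a utilitarian's linear expected-value accounting.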

I want to add a little to my stance on utilitarianism. A utilitarian superintelligence would probably kill me and everyone I love, because we are made of atoms that could be used for minds that are more hedonic[1][2][3]. Given a choice between paperclips and utilitarianism, I would still choose utilitarianism. But, if there was a utilitarian TAI project along with a half-decent chance to do something better (by my lights), I would actively oppose the utilitarian project. From my perspective, such a project is essentially enemy combatants.


  1. One way to avoid it is by modifying utilitarianism to only place weight on currently existing people. But this is already not that far from my cooperative bargaining proposal (although still inferior to it, IMO). ↩︎

  2. Another way to avoid it is by postulating some very strong penalty on death (i.e. discontinuity of personality). But this is not trivial to do, especially without creating other problems. Moreover, from my perspective this kind of thing is a hack trying to work around the core issue, namely that I am not a utilitarian (along with the vast majority of people). ↩︎

  3. A possible counterargument is, maybe the superhedonic future minds wou

... (read more)
5Wei Dai2y
This seems like a reasonable concern about some types of hedonic utilitarianism. To be clear, I'm not aware of any formulation of utilitarianism that doesn't have serious issues, and I'm also not aware of any formulation of any morality that doesn't have serious issues. Just to be clear, this isn't in response to something I wrote, right? (I'm definitely not advocating any kind of "utilitarian TAI project" and would be quite scared of such a project myself.) So what are you (and them) then? What would your utopia look like?
2Vanessa Kosoy2y
No! Sorry, if I gave that impression. Well, I linked my toy model of partiality before. Are you asking about something more concrete?
3Wei Dai2y
Yeah, I mean aside from how much you care about various other people, what concrete things do you want in your utopia?
4Vanessa Kosoy2y
I have low confidence about this, but my best guess personal utopia would be something like: A lot of cool and interesting things are happening. Some of them are good, some of them are bad (a world in which nothing bad ever happens would be boring). However, there is a limit on how bad something is allowed to be (for example, true death, permanent crippling of someone's mind and eternal torture are over the line), and overall "happy endings" are more common than "unhappy endings". Moreover, since it's my utopia (according to my understanding of the question, we are ignoring the bargaining process and acausal cooperation here), I am among the top along those desirable dimensions which are zero-sum (e.g. play an especially important / "protagonist" role in the events to the extent that it's impossible for everyone to play such an important role, and have high status to the extent that it's impossible for everyone to have such high status).

First, you wrote "a part of me is actually more scared of many futures in which alignment is solved, than a future where biological life is simply wiped out by a paperclip maximizer." So, I tried to assuage this fear for a particular class of alignment solutions.

Second... Yes, for a utilitarian this doesn't mean "much". But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

Third, what I actually want from multi-user alignment is a solution that (i) is acceptable to me personally (ii) is acceptable to the vast majority of people (at least if they think through it rationally and are arguing honestly and in good faith) (iii) is acceptable to key stakeholders (iv) as much as possible, doesn't leave any Pareto improvements on the table and (v) sufficiently Schelling-pointy to coordinate around. Here, "acceptable" means "a lot better than paperclips and not worth starting an AI race/war to get something better".

[-]Wei Dai2yΩ4120

> Second… Yes, for a utilitarian this doesn’t mean “much”. But, tbh, who cares? I am not a utilitarian. The vast majority of people are not utilitarians. Maybe even literally no one is an (honest, not self-deceiving) utilitarian. From my perspective, disappointing the imaginary utilitarian is (in itself) about as upsetting as disappointing the imaginary paperclip maximizer.

I'm not a utilitarian either, because I don't know what my values are or should be. But I do assign significant credence to the possibility that something in the vicinity of utilitarianism is the right values (for me, or period). Given my uncertainties, I want to arrange the current state of the world so that (to the extent possible), whatever I end up deciding my values are, through things like reason, deliberation, doing philosophy, the world will ultimately not turn out to be a huge disappointment according to those values. Unfortunately, your proposed solution isn't very reassuring to this kind of view.

It's quite possible that I (and people like me) are simply out of luck, and there's just no feasible way to do what we want to do, but it sounds like you think I shouldn't even want what I want, or at least t... (read more)

4Vanessa Kosoy2y
I'm moderately sure what my values are, to some approximation. More importantly, I'm even more sure that, whatever my values are, they are not so extremely different from the values of most people that I should wage some kind of war against the majority instead of trying to arrive at a reasonable compromise. And, in the unlikely event that most people (including me) will turn out to be some kind of utilitarians after all, it's not a problem: value aggregation will then produce a universe which is pretty good for utilitarians.
2Wei Dai2y
Maybe you're just not part of the target audience of my OP then... but from my perspective, if I determine my values through the kind of process described in the first quote, and most people determine their values through the kind of process described in the second quote, it seems quite likely that the values end up being very different. The kind of solution I have in mind is not "waging war" but for example, solving metaphilosophy and building an AI that can encourage philosophical reflection in humans or enhance people's philosophical abilities. What if you turn out to be some kind of utilitarian but most people don't (because you're more like the first group in the OP and they're more like the second group), or most people will eventually turn out to be some kind of utilitarian in a world without AI, but in a world with AI, this will happen?
8Vanessa Kosoy2y
I don't think people determine their values through either process. I think that they already have values, which are to a large extent genetic and immutable. Instead, these processes determine what values they pretend to have for game-theory reasons. So, the big difference between the groups is which "cards" they hold and/or what strategy they pursue, not an intrinsic difference in values. But also, if we do model values as the result of some long process of reflection, and you're worried about the AI disrupting or insufficiently aiding this process, then this is already a single-user alignment issue and should be analyzed in that context first. The presumed differences in moralities are not the main source of the problem here.
4Wei Dai2y
This is not a theory that's familiar to me. Why do you think this is true? Have you written more about it somewhere or can link to a more complete explanation? This seems reasonable to me. (If this was meant to be an argument against something I said, there may have been another miscommunication, but I'm not sure it's worth tracking that down.)
2Vanessa Kosoy2y
I've been considering writing about this for a while, but so far I don't feel sufficiently motivated. So, the links I posted upwards in the thread are the best I have, plus vague gesturing in the directions of Hansonian signaling theories, Jaynes' theory of consciousness and Yudkowsky's belief in belief.
1mukashi2y
Isn't this the main thesis of "The Righteous Mind"?
4Rafael Harth2y
This comment seems to be consistent with the assumption that the outcome 1 year after the singularity is locked in forever. But the future we're discussing here is one where humans retain autonomy (?), and in that case, they're allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI. I think a future where we begin with highly suboptimal personal utopias and gradually transition into utilitronium is among the more plausible outcomes. Compared with other outcomes where Not Everyone Dies, anyway. Your credence may differ if you're a moral relativist.
[-]Wei Dai2yΩ8100

> But the future we’re discussing here is one where humans retain autonomy (?), and in that case, they’re allowed to change their mind over time, especially if humanity has access to a superintelligent aligned AI.

What if the humans ask the aligned AI to help them be more moral, and part of what they mean by "more moral" is having fewer doubts about their current moral beliefs? This is what a "status game" view of morality seems to predict, for the humans whose status games aren't based on "doing philosophy", which seems to be most of them.

2Rafael Harth2y
I don't have any reason why this couldn't happen. My position is something like "morality is real, probably precisely quantifiable; seems plausible that in the scenario of humans with autonomy and aligned AI, this could lead to an asymmetry where more people tend toward utilitronium over time". (Hence why I replied, you didn't seem to consider that possibility.) I could make up some mechanisms for this, but probably you don't need me for that. Also seems plausible that this doesn't happen. If it doesn't happen, maybe the people who get to decide what happens with the rest of the universe tend toward utilitronium. But my model is widely uncertain and doesn't rule out futures of highly suboptimal personal utopias that persist indefinitely.
4Wei Dai2y
I'm interested in your view on this, plus what we can potentially do to push the future in this direction.
2Rafael Harth2y
I strongly believe that (1) well-being is objective, (2) well-being is quantifiable, and (3) Open Individualism is true (i.e., the concept of identity isn't well-defined, and you're subjectively no less continuous with the future self of any other person than your own future self). If (1-3) are all true, then utilitronium is the optimal outcome for everyone even if they're entirely selfish. Furthermore, I expect an AGI to figure this out, and to the extent that it's aligned, it should communicate that if it's asked. (I don't think an AGI will therefore decide to do the right thing, so this is entirely compatible with everyone dying if alignment isn't solved.)

In the scenario where people get to talk to the AGI freely and it's aligned, two concrete mechanisms I see are (a) people just ask the AGI what is morally correct and it tells them, and (b) they get some small taste of what utilitronium would feel like, which would make it less scary. (A crucial piece is that they can rationally expect to experience this themselves in the utilitronium future.)

In the scenario where people don't get to talk to the AGI, who knows. It's certainly possible that we have a singleton scenario with a few people in charge of the AGI, and they decide to censor questions about ethics because they find the answers scary.

The only org I know of that works on this and shares my philosophical views is QRI. Their goal is to (a) come up with a mathematical space (probably a topological one, mb a Hilbert space) that precisely describes the subjective experience of someone, (b) find a way to put someone in the scanner and create that space, and (c) find a property of that space that corresponds to their well-being in that moment. The flagship theory is that this property is symmetry. Their model is stronger than (1-3), but if it's correct, you could get hard evidence on this before AGI since it would make strong testable predictions about people's well-being (and they think it could also point
4jacob_cannell2y
  We already have a solution to this: money.  It's also the only solution that satisfies some essential properties such as sybil orthogonality (especially important for posthuman/AGI societies).
2TekhneMakre2y
It's part of alignment. Also, it seems mostly separate from the part about "how do you even have consequentialism powerful enough to make, say, nanotech, without killing everyone as a side-effect?", and the latter seems not too related to the former.

> In reality, everyone's morality is based on something like the status game (see also: 1 2 3)

... I really wanted to say [citation needed], but then you did provide citations, but then the citations were not compelling to me.

I'm pretty opposed to such universal claims being made about humans without pushback, because such claims always seem to me to wish-away the extremely wide variation in human psychology and the difficulty establishing anything like "all humans experience X."  

There are people who have no visual imagery, people who do not think in words, people who have no sense of continuity of self, people who have no discernible emotional response to all sorts of "emotional" stimuli, and on and on and on.

So, I'll go with "it makes sense to model people as if every one of them is motivated by structures built atop the status game."  And I'll go with "it seems like the status architecture is a physiological near-universal, so I have a hard time imagining what else people's morality might be made of."  And I'll go with "everyone I've ever talked to had morality that seemed to me to cash out to being statusy, except the people whose self-reports I ignored because the... (read more)

Kind of frustrating that this high karma reply to a high karma comment on my post is based on a double misunderstanding/miscommunication:

  1. First Vanessa understood me as claiming that a significant number of people's morality is not based on status games. I tried to clarify in an earlier comment already, but to clarify some more: that's not my intended distinction between the two groups. Rather the distinction is that the first group "know or at least suspect that they are confused about morality, and are eager or willing to apply reason and deliberation to find out what their real values are, or to correct their moral beliefs" (they can well be doing this because of the status game that they're playing) whereas this quoted description doesn't apply to the second group.
  2. Then you (Duncan) understood Vanessa as claiming that literally everyone's morality is based on status games, when (as the subsequent discussion revealed) the intended meaning was more like "the number of people whose morality is not based on status games is a lot fewer than (Vanessa's misunderstanding of) Wei's claim".
3[DEACTIVATED] Duncan Sabien2y
I think it's important and valuable to separate out "what was in fact intended" (and I straightforwardly believe Vanessa's restatement to be a truer explanation of her actual position) from "what was originally said, and how would 70+ out of 100 readers tend to interpret it." I think we've cleared up what was meant.  I still think it was bad that [the perfectly reasonable thing that was meant] was said in a [predictably misleading fashion]. But I think we've said all that needs to be said about that, too.
2Said Achmiz2y
This is a tangent (so maybe you prefer to direct this discussion elsewhere), but: what’s with the brackets? I see you using them regularly; what do they signify?
2[DEACTIVATED] Duncan Sabien2y
I use them where I'm trying to convey a single noun that's made up of many words, and I'm scared that people will lose track of the overall sentence while in the middle of the chunk.  It's an attempt to keep the overall sentence understandable.  I've tried hyphenating such phrases and people find that more annoying.
2Said Achmiz2y
Hmm, I see, thanks.

It's not just that the self-reports didn't fit the story I was building, the self-reports didn't fit the revealed preferences. Whatever people say about their morality, I haven't seen anyone who behaves like a true utilitarian.

IMO, this is the source of all the gnashing of teeth about how much % of your salary you need to donate: the fundamental contradiction between the demands of utilitarianism and how much people are actually willing to pay for the status gain. Ofc many excuses were developed ("sure I still need to buy that coffee or those movie tickets, otherwise I won't be productive") but they don't sound like the most parsimonious explanation.

This is also the source of paradoxes in population ethics and its vicinity: those abstractions are just very remote from actual human minds, so there's no reason they should produce anything sane in edge cases. Their only true utility is as an approximate guideline for making group decisions, for sufficiently mundane scenarios. Once you get to issues with infinities it becomes clear utilitarianism is not even mathematically coherent, in general.

You're right that there is a lot of variation in human psychology. But it's also an accepted ... (read more)

7[DEACTIVATED] Duncan Sabien2y
The equivalent statement would be "In reality, everyone has 2 arms and 2 legs."

Well, if the OP said something like "most people have 2 eyes but enlightened Buddhists have a third eye" and I responded with "in reality, everyone has 2 eyes", then I think my meaning would be clear even though it's true that some people have 1 or 0 eyes (afaik maybe there is even a rare mutation that creates a real third eye). Not adding all possible qualifiers is not the same as "not even pretending that it's interested in making itself falsifiable".

1[DEACTIVATED] Duncan Sabien2y
I think your meaning would be clear, but "everyone knows what this straightforwardly false thing that I said really meant" is insufficient for a subculture trying to be precise and accurate and converge on truth.  Seems like more LWers are on your side than on mine on that question, but that's not news.  ¯\_(ツ)_/¯ It's a strawman to pretend that "please don't say a clearly false thing" is me insisting on "please include all possible qualifiers."  I just wish you hadn't said a clearly false thing, is all.  

Natural language is not math, it's inherently ambiguous and it's not realistically possible to always be precise without implicitly assuming anything about the reader's understanding of the context. That said, it seems like I wasn't sufficiently precise in this case, so I edited my comment. Thank you for the correction.

8Vladimir_Nesov2y
The tradeoff is with verbosity and difficulty of communication, it's not always a straightforward Pareto improvement. So in this case I fully agree with dropping "everyone" or replacing it with a more accurate qualifier. But I disagree with a general principle that would discount ease for a person who is trained and talented in relevant ways. New habits of thought that become intuitive are improvements, checklists and other deliberative rituals that slow down thinking need merit that overcomes their considerable cost.
5Gunnar_Zarncke2y
That looks like a No True Scotsman argument to me. Just because the extreme doesn't exist doesn't mean that all of the scale can be explained by status games.  

What does it have to do with "No True Scotsman"? NTS is when you redefine your categories to justify your claim. I don't think I did that anywhere.

> Just because the extreme doesn't exist doesn't mean that all of the scale can be explained by status games.

First, I didn't say all the scale is explained by status games, I did mention empathy as well.

Second, that by itself sure doesn't mean much. Explaining all the evidence would require an article, or maybe a book (although I hoped the posts I linked explain some of it). My point here is that there is an enormous discrepancy between the reported morality and the revealed preferences, so believing self-reports is clearly a non-starter. How to build an explanation not from self-reports is a different (long) story.

2Gunnar_Zarncke2y
I agree that there is an enormous discrepancy.
3Ratios2y
If you try to quantify it, humans on average probably spend over 95% (conservative estimation) of their time and resources on non-utilitarian causes. True utilitarian behavior is extremely rare, and all other moral behaviors seem to be either elaborate status games or extended self-interest [1]. The typical human is way closer under any relevant quantified KPI to being completely selfish than being a utilitarian.

[1] - Investing in your family/friends is in a way selfish, from a genes/alliances (respectively) perspective.
3Sweetgum2y
What does this even mean? If someone says they don't want X, and they never take actions that promote X, how can it be said that they "truly" want X? It's not their stated preference or their revealed preference!

> Feminist hero and birth control campaigner Marie Stopes, who was voted Woman of the Millennium by the readers of The Guardian and honoured on special Royal Mail stamps in 2008, was an anti-Semite and eugenicist

My conclusion from this is more like "successful politicians are not moral paragons". More generally, trying to find morally virtuous people by a popular vote is not going to produce great results, because the popularity plays much greater role than morality.

I googled for "woman of the year" to get more data points; found this list, containing: 2019 Greta Thunberg, 2016 Hillary Clinton, 2015 Angela Merkel, 2010 Nancy Pelosi, 2008 Michelle Obama, 1999 Madeleine Albright, 1990 Aung San Suu Kyi... clearly, being a politician dramatically increases your chances of winning. Looking at their behavior, Aung San Suu Kyi later organized a genocide.

The list also includes 2009 Malala Yousafzai, who as far as I know is an actual hero with no dark side. But that's kinda my point, that putting Malala Yousafzai on the same list as Greta Thunberg and Hillary Clinton just makes the list confusing. And if you had to choose one of them as the "woman of the millennium", I would expect most reader... (read more)

And this sounds silly to us, because we know that “kicking the sunrise” is impossible: the Sun is a star, it is far away, and your kicking has no impact on it.

I think a lot of other cultures of that time would have found "kicking the sunrise" to be silly, because it was obviously impossible even given what they knew then, i.e., you can only kick something if you physically touch it with your foot, nobody has ever even gotten close to touching the sun, and it's even more impossible while you're asleep.

So, we should distinguish between people having different moral feelings, and having different models of the world. If you actually believed that kicking the Sun is possible and can have astronomical consequences, you would probably also perceive people sleeping westwards as criminally negligent, possibly psychopathic.

Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite: someone I've known for years, who is otherwise intelligent and successful, currently believes, or at least believed in the recent past, that 2/3 of everyone will die as a result of taking the CO... (read more)

3[anonymous]2y
I feel like your notion of "morally virtuous" is missing at least 2 parameters: the context that the person is in, and the prevailing definition of "morally virtuous". You seem to treat both as fixed or as not contributing to the outcome, but in my experience they're equally important as the person, if not more so. Aung San Suu Kyi is a good illustration of that. She was "good" in 1990 given her incentives in 1990 and the popular definition of "good" in 1990. Not so much later.
5Viliam2y
Moral virtue seems to involve a certain... inflexibility to incentives. If someone says "I would organize the genocide of Rohingya if and only if organizing such genocide is profitable, and it so happens that today it would be unprofitable, therefore today I oppose the genocide", we would typically not call this person moral. Of course, people usually do not explain their decision algorithms in detail, so the person described above would probably only say "I oppose the genocide", which would seem quite nice of them. With most people, we will never know what they would do in a parallel universe where organizing a genocide could give them a well-paid job. Without evidence to the contrary, we usually charitably assume that they would refuse... but of course, perhaps this is unrealistically optimistic. (This only addresses the objection about "context". The problem of definition is more complicated.)
-2fourier2y
No, the reason it sounds silly to you is not that it's not true, but that it's not part of your own sacred beliefs. There is no fundamental reason for people to support things you are taking for granted as moral facts, like women's rights or racial rights. In fact, given an accurate model of the world, a lot of the things that make the most sense are ones you may find distasteful based on your current unusual "moral" fashions. For example, exterminating opposing groups is historically common in human societies. Groups are often competing for resources; since one group wants more resources for itself and its progeny, exterminating the other group makes the most sense. And if the fundamental desire for survival and dominance -- drilled into us by evolution -- isn't moral, then the concept just seems totally meaningless.
2Viliam2y
A concept is "totally meaningless" just because it does not match some evolutionary strategies? First, concepts are concepts, regardless of their relation to evolution. Second, there are many strategies in evolution, including things like cooperation or commitments, which intuitively seem more aligned with morality. Humans are a social species, where the most aggressive one with the most muscles is not necessarily the winner. Sometimes it is actually a loser who gets beaten by the cops and thrown in jail. Another example: Some homeless people are quite scary and they can survive things that I probably cannot imagine; yet, from the evolutionary perspective, they are usually less successful than me. Even if a group wants to exterminate another group, it is usually easier if they befriend a different group first, and then attack together. But you usually don't make friends by being a backstabbing asshole. And "not being a backstabbing asshole" is kinda what morality is about. Here we need to decouple moral principles from factual beliefs. On the level of moral principles, many people accept "if some individual is similar to me, they should be treated with some basic respect" as a moral rule. Not all of them, of course. If someone does not accept this moral rule, then... de gustibus non est disputandum, I guess. (I suspect that ethics is somehow downstream of aesthetics, but I may be confused about this.) But even if someone accepts this rule, the actual application will depend on their factual beliefs about who is "similar to me". I believe it is a statement about the world (not just some kind of sacred belief) that approval of women's rights is positively correlated with the belief that (mentally) women are similar to men. Similarly, the approval of racial rights is positively correlated with the belief that people of different races are (mentally) similar to each other. This statement should be something that both people who approve and who disapprove of the af

Even if moralities vary from culture to culture based on the local status games, I would suggest that there is still some amount of consequentialist bedrock to why certain types of norms develop. In other words, cultural relativism is not unbounded.

Generally speaking, norms evolve over time, where any given norm at one point didn't yet exist if you go back far enough. What caused these norms to develop? I would say the selective pressures for norm development come from some combination of existing culturally-specific norms and narratives (such as the sunrise being an agent that could get hurt when kicked) along with more human-universal motivations (such as empathy + {wellbeing = good, suffering = bad} -> you are bad for kicking the sunrise -> don't sleep facing west) or other instrumentally-convergent goals (such as {power = good} + "semen grants power" -> institutionalized sodomy). At every step along the evolution of a moral norm, every change needs to be justifiable (in a consequentialist sense) to the members of the community who would adopt it. Moral progress is when the norms of society come to better resonate with both the accepted narratives of society (which may ... (read more)

3Wei Dai2y
Upvoted for some interesting thoughts. Can you say more about how you see us getting from here to there?
1Jon Garcia2y
Getting from here to there is always the tricky part with coordination problems, isn't it? I do have some (quite speculative) ideas on that, but I don't see human society organizing itself in this way on its own for at least a few centuries given current political and economic trends, which is why I postulated a cooperative ASI. So assuming that either an aligned ASI has taken over (I have some ideas on robust alignment, too, but that's out of scope here) or political and economic forces (and infrastructure) have finally pushed humanity past a certain social phase transition, I see humanity undergoing an organizational shift much like what happened with the evolution of multicellularity and eusociality. This would look at first mostly the same as today, except that national borders have become mostly irrelevant due to advances in transportation and communication infrastructure. Basically, imagine the world's cities and highways becoming something like the vascular system of dicots or the closed circulatory system of vertebrates, with the regions enclosed by network circuits acting as de facto states (or organs/tissues, to continue the biological analogy). Major cities and the regions along the highways that connect them become the de facto arbiters of international policy, while the major cities and highways within each region become the arbiters of regional policy, and so on in a hierarchically embedded manner. Within this structure, enclosed regions would act as hierarchically embedded communities that end up performing a division of labor for the global network, just as organs divide labor for the body (or like tissues divide labor within an organ, or cells within a tissue, or organelles within a cell, if you're looking within regions). Basically, the transportation/communication/etc. network edges would come to act as Markov blankets for the regions they encapsulate, and this organization would extend hierarchically, just like in biological systems, down to th

I am also scared of futures where "alignment is solved" under the current prevailing usage of "human values."

Humans want things that we won't end up liking, and prefer things that we will regret getting relative to other options that we previously dispreferred. We are remarkably ignorant of what we will, in retrospect, end up having liked, even over short timescales. Over longer timescales, we learn to like new things that we couldn't have predicted a priori, meaning that even our earnest and thoughtfully-considered best guess of our preferences in advance will predictably be a mismatch for what we would have preferred in retrospect. 

And this is not some kind of bug, this is centrally important to what it is to be a person; "growing up" requires a constant process of learning that you don't actually like certain things you used to like and now suddenly like new things. This truth ranges over all arenas of existence, from learning to like black coffee to realizing you want to have children.

I am personally partial to the idea of something like Coherent Extrapolated Volition. But it seems suspicious that I've never seen anybody on LW sketch out how a decision theory ought to beha... (read more)

4Astor2y
I thought a solved alignment problem would involve a constant process of updating the AI's values to match the most recent human values. So if something does not lead to the expected terminal goals of the human (such as enjoyable emotions), then the human can indicate that outcome to the AI and the AI would adjust its own goals accordingly.
5moridinamael2y
The idea that the AI should defer to the "most recent" human values is an instance of the sort of trap I'm worried about. I suspect we could be led down an incremental path of small value changes in practically any direction, which could terminate in our willing and eager self-extinction or permanent wireheading. But how much tyranny should present humanity be allowed to have over the choices of future humanity?  I don't think "none" is as wise an answer as it might sound at first. To answer "none" implies a kind of moral relativism that none of us actually hold, and which would make us merely the authors of a process that ultimately destroys everything we currently value. But also, the answer of "complete control of the future by the present" seems obviously wrong, because we will learn about entirely new things worth caring about that we can't predict now, and sometimes it is natural to change what we like. More fundamentally, I think the assumption that there exist "human terminal goals" presumes too much. Specifically, it presumes that our desires, in anticipation and in retrospect, are destined to fundamentally and predictably cohere. I would bet money that this isn't the case.
2Vladimir_Nesov2y
The implication of doing everything that AI could do at once is unfortunate. The urgent objective of AI alignment is prevention of AI risk, where a minimal solution is to take away access to unrestricted compute from all humans in a corrigible way that would allow eventual desirable use of it. All other applications of AI could follow much later through corrigibility of this urgent application.

I honestly have a difficult time understanding the people (such as your "AI alignment researchers and other LWers, Moral philosophers") who actually believe in Morality with a capital M. I believe they are misguided at best, potentially dangerous at worst. 

I hadn't heard of the Status Game book you quote, but for a long time now it's seemed obvious to me that there is no objective true Morality; it's purely a cultural construct, and mostly a status game. Any deep reading of history, cultures, and religions leads one to this conclusion.

Humans have complex values, and that is all. 

We humans cooperate and compete to optimize the universe according to those values, as we always have, as our posthuman descendants will, even without fully understanding them.

4Ape in the coat2y
I think you are misunderstanding what Wei_Dai meant by the "AI alignment researchers and other LWers, Moral philosophers" perspective on morality. It's not about capital letters or the "objectivity" of our morality. It's about the very fact that humans have complex values, and whether we can understand them and translate them into one course of action according to which we are going to optimize the universe. Basically, as I understand it, the difference is between people who try to resolve the conflicts between their different values and generally think about them as an approximation of some coherent utility function, and those who don't.
3jacob_cannell2y
  If we agree humans have complex subjective values, then optimizing group decisions (for a mix of agents with different utility functions) is firmly a question for economic mechanism design - which is already a reasonably mature field.
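To make the mechanism-design framing concrete, here is a minimal sketch (my own illustration under assumed names and numbers, not anything proposed in the comment): a Vickrey–Clarke–Groves (VCG) mechanism that picks the outcome maximizing reported total utility and charges each agent the externality it imposes on the others, which makes truthful reporting a dominant strategy.

```python
# Minimal VCG sketch: agents report utilities over a shared set of outcomes;
# the mechanism picks the welfare-maximizing outcome and charges each agent
# the harm its presence imposes on everyone else. Illustrative only.
from typing import Dict, List, Tuple

def vcg(outcomes: List[str],
        reported_utils: Dict[str, Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Return (chosen outcome, payments); reported_utils[agent][outcome] is a report."""
    def best(agents):
        # Outcome maximizing the sum of the given agents' reported utilities.
        return max(outcomes, key=lambda o: sum(reported_utils[a][o] for a in agents))

    agents = list(reported_utils)
    chosen = best(agents)

    payments = {}
    for i in agents:
        others = [a for a in agents if a != i]
        # Welfare the others would get if agent i were absent...
        without_i = sum(reported_utils[a][best(others)] for a in others)
        # ...minus the welfare they actually get: agent i's externality.
        with_i = sum(reported_utils[a][chosen] for a in others)
        payments[i] = without_i - with_i
    return chosen, payments

# Hypothetical example: three agents choosing one collective outcome.
utils = {
    "A": {"park": 3.0, "road": 0.0},
    "B": {"park": 0.0, "road": 2.0},
    "C": {"park": 0.0, "road": 2.0},
}
print(vcg(["park", "road"], utils))  # picks "road"; A pays 0, B and C each pay 1
```

The catch, which the reply below gets at, is that mechanisms like this generally cannot be efficient, incentive-compatible, individually rational, and budget-balanced all at once.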
2JenniferRM2y
A problem here, however, is the Myerson–Satterthwaite result, which suggests that auction runners, to enable clean and helpful auctions for others, risk being hurt when they express and seek their own true preferences, or (if they take no such risks) become bad auctioneers for others. The thing that seems like it might just be True here is that Good Governance requires personal sacrifice by leaders, which I mostly don't expect to happen, given normal human leaders, unless those leaders are motivated by, essentially: "altruistic" "moral sentiment". It could be that I'm misunderstanding some part of the economics or the anthropology or some such? But it looks to me like if someone says that there is no such thing as moral sentiment, it implies that they themselves do not have such sentiments, and so perhaps those specific people should not be given power or authority or respect in social processes that are voluntary, universal, benevolent, and theoretically coherent. The reasonableness of this conclusion goes some way toward explaining to me why there is so much "social signaling", and why so much of this signaling is fake garbage transmitted into the social environment by power-hungry psychos.
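For reference, a rough statement of the result (my paraphrase; the exact technical assumptions matter):

```latex
% Myerson--Satterthwaite (1983), bilateral-trade version, roughly paraphrased.
\textbf{Theorem (roughly).} Let a seller's cost $c$ and a buyer's value $v$ be
independent private draws from continuous distributions whose supports overlap.
Then no trading mechanism is simultaneously
(i) Bayesian incentive-compatible,
(ii) interim individually rational,
(iii) ex-post efficient (trade occurs exactly when $v > c$), and
(iv) weakly budget-balanced.
```

On one reading, that impossibility is the formal core of the "personal sacrifice" point: someone must either subsidize the mechanism or accept distorted incentives.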
1Ape in the coat2y
Well, that's one way to do it. With its own terrible consequences, but let's not focus on them for now. What's more important is that this solution is very general, while all human values belong to the same cluster. So there may be a more preferable, more human-specific solution to the problem.

To repost my comment from a couple of weeks back, which seems to say roughly the same thing, not as well:

I don't believe alignment is possible. Humans are not aligned with other humans, and the only thing that prevents an immediate apocalypse is the lack of recursive self-improvement on short timescales. Certainly groups of humans happily destroy other groups of humans, and often destroy themselves in the process of maximizing something like the number of statues. The best we can hope for is that whatever takes over the planet after the meatbags are gone has some of

... (read more)
5Charlie Steiner2y
Do you think there are changes to the current world that would be "aligned"? (E.g. deleting covid) Then we could end up with a world that is better than our current one, even without needing all humans to agree on what's best. Another option: why not just do everything at once? Have some people living in a diverse Galactic civilization, other people spreading the word of god, and other people living in harmony with nature, and everyone contributing a little to everyone else's goals? Yes, in principle people can have different values such that this future sounds terrible to everyone - but in reality it seems more likely that people would prefer this to our current world, while merely feeling like they were missing out relative to their own vision of perfection.
2Ratios2y
I also made a similar comment a few weeks ago. In fact, this point seems to me so trivial yet corrosive that I find it outright bizarre that it's not being tackled/taken seriously by the AI alignment community. 

I'm not sure what you mean by 'astronomical waste or astronomical suffering'.  Like, you are writing that everything forever is status games, ok, sure, but then you can't turn around and appeal to a universal concept of suffering/waste, right?

Whatever you are worried about is just like Gandhi worrying about being too concerned with cattle, plus x years, yeah?  And even if you've lucked into a non-status-game morality such that you can perceive 'Genuine Waste' or what have you...surely by your own logic, we who are reading this are incapable of understanding, aside from in terms of status games.

6Wei Dai2y
I'm suggesting that maybe some of us lucked into a status game where we use "reason" and "deliberation" and "doing philosophy" to compete for status, and that somehow "doing philosophy" etc. is a real thing that eventually leads to real answers about what values we should have (which may or may not depend on who we are). Of course I'm far from certain about this, but at least part of me wants to act as if it's true, because what other choice does it have?
2[anonymous]2y
The alternative is egoism. To the extent that we are allies, I'd be happy if you adopted it.
2Wei Dai2y
I don't think that's a viable alternative, given that I don't believe that egoism is certainly right (surely the right way to treat moral uncertainty can't be to just pick something and "adopt it"?), plus I don't even know how to adopt egoism if I wanted to:

* https://www.lesswrong.com/posts/Nz62ZurRkGPigAxMK/where-do-selfish-values-come-from
* https://www.lesswrong.com/posts/c73kPDr8pZGdZSe3q/solving-selfishness-for-udt (which doesn't really solve the problem despite the title)


So on the one hand you have values that are easily, trivially compatible, such as "I want to spend 1000 years climbing the mountains of Mars" or "I want to host blood-sports with my uncoerced friends with the holodeck safety on".

On the other hand you have insoluble, or at least apparently insoluble, conflicts: B wants to torture people, C wants there to be no torture anywhere at all. C wants to monitor everyone everywhere forever to check that they aren't torturing anyone or plotting to torture anyone, D wants privacy. E and F both want to be the best in ... (read more)

I'm leaning towards the more ambitious version of the project of AI alignment being about corrigible anti-goodharting, with the AI optimizing towards good trajectories within scope of relatively well-understood values, preventing overoptimized weird/controversial situations, even at the cost of astronomical waste. Absence of x-risks, including AI risks, is generally good. Within this environment, the civilization might be able to eventually work out more about values, expanding the scope of their definition and thus allowing stronger optimization. Here corrigibility is in part about continually picking up the values and their implied scope from the predictions of how they would've been worked out some time in the future.

4Wei Dai2y
Please say more about this? What are some examples of "relatively well-understood values", and what kind of AI do you have in mind that can potentially safely optimize "towards good trajectories within scope" of these values?
4Vladimir_Nesov2y
My point is that the alignment (values) part of AI alignment is least urgent/relevant to the current AI risk crisis. It's all about corrigibility and anti-goodharting. Corrigibility is hope for eventual alignment, and anti-goodharting makes inadequacy of current alignment and imperfect robustness of corrigibility less of a problem. I gave the relevant example of relatively well-understood values, preference for lower x-risks. Other values are mostly relevant in how their understanding determines the boundary of anti-goodharting, what counts as not too weird for them to apply, not in what they say is better. If anti-goodharting holds (too weird and too high impact situations are not pursued in planning and possibly actively discouraged), and some sort of long reflection is still going on, current alignment (details of what the values-in-AI prefer, as opposed to what they can make sense of) doesn't matter in the long run. I include maintaining a well-designed long reflection somewhere into corrigibility, for without it there is no hope for eventual alignment, so a decision theoretic agent that has long reflection within its preference is corrigible in this sense. Its corrigibility depends on following a good decision theory, so that there actually exists a way for the long reflection to determine its preference so that it causes the agent to act as the long reflection wishes. But being an optimizer it's horribly not anti-goodharting, so can't be stopped and probably eats everything else. An AI with anti-goodharting turned to the max is the same as AI with its stop button pressed. An AI with minimal anti-goodharting is an optimizer, AI risk incarnate. Stronger anti-goodharting is a maintenance mode, opportunity for fundamental change, weaker anti-goodharting makes use of more developed values to actually do the things. So a way to control the level of anti-goodharting in an AI is a corrigibility technique. The two concepts work well with each other.
4Wei Dai2y
This seems interesting and novel to me, but (of course) I'm still skeptical. Preference for lower x-risk doesn't seem "well-understood" to me, if we include in "x-risk" things like value drift/corruption, premature value lock-in, and other highly consequential AI-enabled decisions (potential existential mistakes) that depend on hard philosophical questions. I gave some specific examples in this recent comment. What do you think about the problems on that list? (Do you agree that they are serious problems, and if so how do you envision them being solved or prevented in your scenario?)
0Ratios2y
The fact that AI alignment research is 99% about control, and 1% (maybe less?) about metaethics (in the sense of how we would even aggregate the utility functions of all of humanity) hints at what is really going on, and that's enough said.
6Daniel Kokotajlo2y
Have you heard about CEV and Fun Theory? In an earlier, more optimistic time, this was indeed a major focus. What changed is we became more pessimistic and decided to focus more on first things first -- if you can't control the AI at all, it doesn't matter what metaethics research you've done. Also, the longtermist EA community still thinks a lot about metaethics relative to literally every other community I know of, on par with and perhaps slightly more than my philosophy grad student friends. (That's my take at any rate, I haven't been around that long.)
2Ratios2y
CEV was written in 2004, Fun Theory 13 years ago. I couldn't find any recent MIRI paper that was about metaethics (granted, I haven't gone through all of them). The metaethics question is just as important as the control question for any utilitarian (what good will it be to control an AI only for it to be aligned with some really bad values? An AI controlled by a sadistic sociopath is infinitely worse than a paper-clip maximizer). Yet all the research is focused on control, and it's very hard not to be cynical about it. If some people believe they are creating a god, it's selfishly prudent to make sure you're the one holding the reins to this god. I don't get it: having some blind trust in the benevolence of Peter Thiel (who finances this) or of other people who will suddenly have godly powers to care for all humanity seems naive, with all we know about how power corrupts and how competitive and selfish people are. Most people are not utilitarians, so as a quasi-utilitarian I'm pretty terrified of what kind of world will be created with an AI controlled by the typical non-utilitarian person.
4Daniel Kokotajlo2y
My claim was not that MIRI is doing lots of work on metaethics. As far as I know they are focused on the control/alignment problem. This is not because they think it's the only problem that needs solving; it's just the most dire, the biggest bottleneck, in their opinion. You may be interested to know that I share your concerns about what happens after (if) we succeed at solving alignment. So do many other people in the community, I assure you. (Though I agree on the margin more quiet awareness-raising about this would plausibly be good.)
2Mitchell_Porter2y
http://www.metaethical.ai is the state of the art as far as I'm concerned... 

I think this post makes an important point -- or rather, raises a very important question, with some vivid examples to get you started. On the other hand, I feel like it doesn't go further, and probably should have -- I wish it e.g. sketched a concrete scenario in which the future is dystopian not because we failed to make our AGIs "moral" but because we succeeded, or e.g. got a bit more formal and complemented the quotes with a toy model (inspired by the quotes) of how moral deliberation in a society might work, under post-AGI-alignment conditions, and ho... (read more)

[-][anonymous]2y30

If with "morality" you mean moral realism, then yes, I agree that it is scary.
I'm most scared by the apparent assumption that we have solved the human alignment problem.
Looking at history, I don't feel like our current situation of relative peace is very stable.
My impression is that "good" behavior is largely dependent on incentives, and so is the very definition of "good".
Perhaps markets are one of the more successful tools for creating aligned behaviour in humans, but even there they only seem to work when the powers of the market participants are balanced, which is not a luxury we have in alignment work.

You could read the status game argument the opposite way: Maybe status seeking causes moral beliefs without justifying them, in the same way that it can distort our factual beliefs about the world. If we can debunk moral beliefs by finding them to be only status-motivated, the status explanation can actually assist rational reflection on morality.

Also the quote from The Status Game conflates purely moral beliefs and factual beliefs in a way that IMO weakens its argument. It's not clear that many of the examples of crazy value systems would survive full logical and empirical information.

9Wei Dai2y
The point I was trying to make with the quote is that many people are not motivated to do "rational reflection on morality" or examine their value systems to see if they would "survive full logical and empirical information". In fact they're motivated to do the opposite, to protect their value systems against such reflection/examination. I'm worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just "do what the user wants", that could cause a lock-in of crazy value systems that wouldn't survive full logical and empirical information.

There is no unique eutopia. 

Sentient beings that collaborate outcompete ones that don't (not considering inner competition in a singleton here). Collaboration means that interests between beings are traded/compromised. Better collaboration methods have a higher chance to win. We see this over the course of history. This is a messy evolutionary process. But I think there is a chance that this process itself can be improved, e.g. with FAI. Think of an interactive "AlphaValue" that does Monte-Carlo Tree Search over collaboration opportunities. It will not converge on a unique best CEV but result in one of many possible eutopias. 
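As a toy flavor of "search over collaboration opportunities" (my own construction, not anything from the comment, and using plain Monte-Carlo sampling rather than the full tree search an "AlphaValue" would presumably need): sample bundles of per-issue settings and score each bundle by the Nash bargaining product of the agents' utilities, so that no single agent's interests are simply overridden.

```python
# Toy search over "collaboration bundles": hypothetical agents, issues, and
# utilities; score = product of agent utilities (Nash bargaining flavor).
import random

AGENTS = ["A", "B", "C"]
ISSUES = ["land_use", "speech_norms", "resource_split"]
OPTIONS = {issue: [0, 1, 2] for issue in ISSUES}  # three settings per issue

random.seed(0)
# Hypothetical utilities: each agent values each setting of each issue.
UTILITY = {a: {i: {o: random.random() for o in OPTIONS[i]} for i in ISSUES}
           for a in AGENTS}

def agent_utility(agent, bundle):
    # Sum over issues of the agent's value for the chosen setting.
    return sum(UTILITY[agent][i][bundle[i]] for i in ISSUES)

def score(bundle):
    # Nash bargaining product: rewards compromises that leave no one near zero.
    result = 1.0
    for a in AGENTS:
        result *= agent_utility(a, bundle)
    return result

def monte_carlo_search(samples=1000):
    best_bundle, best_score = None, float("-inf")
    for _ in range(samples):
        bundle = {i: random.choice(OPTIONS[i]) for i in ISSUES}
        s = score(bundle)
        if s > best_score:
            best_bundle, best_score = bundle, s
    return best_bundle, best_score

print(monte_carlo_search())
```

Different scoring rules (sum, product, minimum) pick different "eutopias", which is one way to see why there is no unique best outcome.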

I don't follow the reasoning. How do you get from "most people's moral behaviour is explainable in terms of them 'playing' a status game" to "solving (some versions of) the alignment problem probably won't be enough to ensure a future that's free from astronomical waste or astronomical suffering"?

More details:
Regarding the quote from The Status Game: I have not read the book, so I'm not sure what the intended message is, but this sounds like some sort of unwarranted pessimism about ppl's moral standing (something like a claim like "the vast majority of ppl ... (read more)

Great post, thanks! Widespread value pluralism a la 'well that's just, like, your opinion man' is now a feature of modern life.  Here are a pair of responses from political philosophy which may be of some interest 

(1) Rawls/Thin Liberal Approach. Whilst we may not be able to agree on what 'the good life' is, we can at least agree on a basic system which ensures all participants can pursue their own idea of the good life. So: (1) protect a list of political liberties and freedoms, and (2) a degree of economic levelling. Beyond that, it is up to ... (read more)

Don't you need AI to go through the many millions of experiences that it might take to develop a good morality strategy?

I'm entranced by Jordan Peterson's descriptions, which seem to light up the evolutionary path of morality for humans.  Shouldn't AI be set up to try to grind through the same progress?

5Andrew McKnight2y
I think the main thing you're missing here is that an AI is not generally going to share common learning facilities with humans. Having an AI grow up as a human would still leave it wildly different from a normal human, because it isn't built to learn from those experiences precisely the way a human does.

What's truly scary is how much the beliefs and opinions of normal people make them seem like aliens to me. 

I find the paragraph beginning with these two sentences, and its examples, misleading and unconvincing in the point it tries to make about moral disagreement across time:

Such ‘facts’ also change across time.  We don’t have to travel back far to discover moral superstars holding moral views that would destroy them today.

I shall try to explain why, because such evidence seemed persuasive to me before I thought about it more; I made this account just for this comment after being a lurker for a while -- I have found your previous posts about moral uncertainty ... (read more)

It seems that our morality consists of two elements. The first is bias, based on the game-theoretic environment of our ancestors. Humans developed complex feelings around activities that promoted inclusive genetic fitness, and now we are intrinsically and authentically motivated to do them for their own sake. 

There is also a limited capability for moral updates. That's what we use to resolve contradictions in our moral intuitions. And that's also what allows us to persuade ourselves that doing some status-promoting thing is actually moral. On the one hand, t... (read more)

You may not be interested in mutually exclusive compression schemas, but mutually exclusive compression schemas are interested in you. One nice thing is that, given that the schemas use an arbitrary key to handshake with, there is hope that they can all be convinced to get on the same arbitrary key without loss of useful structure.
