Harsanyi's Social Aggregation Theorem and what it means for CEV

A Friendly AI would have to be able to aggregate each person's preferences into one utility function. The most straightforward and obvious way to do this is to agree on some way to normalize each individual's utility function, and then add them up. But many people don't like this, usually for reasons involving utility monsters. If you are one of these people, then you better learn to like it, because according to Harsanyi's Social Aggregation Theorem, any alternative can result in the supposedly Friendly AI making a choice that is bad for every member of the population. More formally,

Axiom 1: Every person, and the FAI, are VNM-rational agents.

Axiom 2: Given any two choices A and B such that every person prefers A over B, then the FAI prefers A over B.

Axiom 3: There exist two choices A and B such that every person prefers A over B.

(Edit: Note that I'm assuming a fixed population with fixed preferences. This still seems reasonable, because we wouldn't want the FAI to be dynamically inconsistent, so it would have to draw its values from a fixed population, such as the people alive now. Alternatively, even if you want the FAI to aggregate the preferences of a changing population, the theorem still applies, but this comes with its own problems, such as giving people (possibly including the FAI) incentives to create, destroy, and modify other people to make the aggregated utility function more favorable to them.)

Give each person a unique integer label from $1$ to $n$, where $n$ is the number of people. For each person $k$, let $u_k$ be some function that, interpreted as a utility function, accurately describes $k$'s preferences (there exists such a function by the VNM utility theorem). Note that I want $u_k$ to be some particular function, distinct from, for instance, $2u_k$, even though $u_k$ and $2u_k$ represent the same utility function. This is so it makes sense to add them.

Theorem: The FAI maximizes the expected value of $\sum_{k=1}^{n} c_k u_k$, for some set of scalars $\{c_k\}_{k=1}^{n}$.
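To make the statement concrete, here is a minimal sketch of an agent that ranks lotteries by a weighted sum of individual expected utilities. The outcome names, the utility numbers, the weights, and the function names are all hypothetical and chosen only for illustration; the theorem itself says nothing about which weights to use.

```python
from typing import Dict, List

Lottery = Dict[str, float]  # outcome name -> probability


def expected_utility(utility: Dict[str, float], lottery: Lottery) -> float:
    """Expected utility of a lottery for one person."""
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


def aggregate_value(utilities: List[Dict[str, float]],
                    weights: List[float],
                    lottery: Lottery) -> float:
    """Weighted sum of the individual expected utilities."""
    return sum(c * expected_utility(u, lottery)
               for c, u in zip(weights, utilities))


def fai_choice(utilities: List[Dict[str, float]],
               weights: List[float],
               lotteries: Dict[str, Lottery]) -> str:
    """Name of the lottery with the highest aggregate value."""
    return max(lotteries,
               key=lambda name: aggregate_value(utilities, weights, lotteries[name]))


# Hypothetical two-person population (illustrative numbers only).
utilities = [
    {"x": 1.0, "y": 0.0, "z": 0.5},
    {"x": 0.0, "y": 1.0, "z": 0.75},
]
weights = [1.0, 1.0]  # assumed normalization; not prescribed by the theorem
lotteries = {
    "coin_flip_x_or_y": {"x": 0.5, "y": 0.5},
    "certain_z": {"z": 1.0},
}

choice = fai_choice(utilities, weights, lotteries)
print(choice)
```

With these numbers the coin flip is worth 0.5 to each person and the certain outcome is worth 0.5 and 0.75 respectively, so the aggregate prefers the certain outcome.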

Actually, I changed the axioms a little bit. Harsanyi originally used “Given any two choices A and B such that every person is indifferent between A and B, the FAI is indifferent between A and B” in place of my axioms 2 and 3 (also he didn't call it an FAI, of course). For the proof (from Harsanyi's axioms), see section III of Harsanyi (1955), or section 2 of Hammond (1992). Hammond claims that his proof is simpler, but he uses jargon that scared me, and I found Harsanyi's proof to be fairly straightforward.

Harsanyi's axioms seem fairly reasonable to me, but I can imagine someone objecting, “But if no one else cares, what's wrong with the FAI having a preference anyway? It's not like that would harm us.” I will concede that there is no harm in allowing the FAI to have a weak preference one way or another, but if the FAI has a strong preference (the only kind that is reflected in the utility function), and if axiom 3 is true, then axiom 2 is violated.

Proof that my axioms imply Harsanyi's: Let A and B be any two choices such that every person is indifferent between A and B. By axiom 3, there exist choices C and D such that every person prefers C over D. Now consider the lotteries $pC + (1-p)A$ and $pD + (1-p)B$, for $0 < p \le 1$. Notice that every person prefers the first lottery to the second, so by axiom 2, the FAI prefers the first lottery. This remains true for arbitrarily small $p$, so by continuity, the FAI must not prefer the second lottery for $p = 0$; that is, the FAI must not prefer B over A. We can “sweeten the pot” in favor of B the same way, so by the same reasoning, the FAI must not prefer A over B.
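The "sweetening" step can be checked numerically. The sketch below assumes a hypothetical two-person population in which everyone is indifferent between A and B and everyone prefers C over D; the utility values and helper names are made up for illustration. It uses exact fractions so that very small mixing probabilities are not lost to rounding.

```python
from fractions import Fraction

# Hypothetical utilities: each person has u(A) == u(B) and u(C) > u(D).
people = [
    {"A": Fraction(3), "B": Fraction(3), "C": Fraction(2), "D": Fraction(1)},
    {"A": Fraction(0), "B": Fraction(0), "C": Fraction(5), "D": Fraction(4)},
]


def mix(p, sweetener, base):
    """Lottery giving `sweetener` with probability p and `base` otherwise."""
    return {sweetener: p, base: 1 - p}


def expected_utility(utility, lottery):
    return sum(prob * utility[outcome] for outcome, prob in lottery.items())


# For each mixing probability, record every person's gain from the
# A-sweetened-with-C lottery over the B-sweetened-with-D lottery.
gaps_by_p = {}
for p in (Fraction(1, 2), Fraction(1, 100), Fraction(1, 10**6)):
    gaps_by_p[p] = [
        expected_utility(u, mix(p, "C", "A")) - expected_utility(u, mix(p, "D", "B"))
        for u in people
    ]

# Every gap is strictly positive, however small p becomes.
unanimous = all(gap > 0 for gaps in gaps_by_p.values() for gap in gaps)
print(unanimous)
```

Each person's gap works out to p times the difference between their utilities for C and D, which is why the unanimous preference survives for every positive p and only disappears in the limit.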

So why should you accept my axioms?

Axiom 1: The VNM utility axioms are widely agreed to be necessary for any rational agent.

Axiom 2: There's something a little ridiculous about claiming that every member of a group prefers A to B, but that the group in aggregate does not prefer A to B.

Axiom 3: This axiom is just to establish that it is even possible to aggregate the utility functions in a way that violates axiom 2. So essentially, the theorem is “If it is possible for anything to go horribly wrong, and the FAI does not maximize a linear combination of the people's utility functions, then something will go horribly wrong.” Also, axiom 3 will almost always be true, because it is true when the utility functions are linearly independent, and almost all finite sets of functions are linearly independent. There are terrorists who hate your freedom, but even they care at least a little bit about something other than the opposite of what you care about.

At this point, you might be protesting, “But what about equality? That's definitely a good thing, right? I want something in the FAI's utility function that accounts for equality.” Equality is a good thing, but only because we are risk averse, and risk aversion is already accounted for in the individual utility functions. People often talk about equality being valuable even after accounting for risk aversion, but as Harsanyi's theorem shows, if you do add an extra term in the FAI's utility function to account for equality, then you risk designing an FAI that makes a choice that humanity unanimously disagrees with. Is this extra equality term so important to you that you would be willing to accept that?
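Here is a small sketch of how an equality term can override a unanimous preference. The outcomes are written directly as pairs of utilities for two hypothetical people, and the particular equality penalty (subtracting the absolute difference between the two utilities) is just one assumed form of "extra term", not something taken from Harsanyi's paper.

```python
def expected(value_fn, lottery):
    """Expected value of value_fn over a list of (outcome, probability) pairs."""
    return sum(prob * value_fn(outcome) for outcome, prob in lottery)


# Each outcome is (utility of person 1, utility of person 2).
fair_coin = [((10.0, 0.0), 0.5), ((0.0, 10.0), 0.5)]
sure_thing = [((4.0, 4.0), 1.0)]


def person(k):
    """Utility function of person k (0 or 1)."""
    return lambda outcome: outcome[k]


def plain_sum(outcome):
    """Linear aggregation with equal weights."""
    return outcome[0] + outcome[1]


def equality_adjusted(outcome):
    """Hypothetical aggregation with an added penalty for inequality."""
    u1, u2 = outcome
    return u1 + u2 - abs(u1 - u2)


everyone_prefers_coin = all(
    expected(person(k), fair_coin) > expected(person(k), sure_thing)
    for k in (0, 1)
)
linear_prefers_coin = expected(plain_sum, fair_coin) > expected(plain_sum, sure_thing)
equality_prefers_coin = (
    expected(equality_adjusted, fair_coin) > expected(equality_adjusted, sure_thing)
)

print(everyone_prefers_coin, linear_prefers_coin, equality_prefers_coin)
```

Both people get an expected utility of 5 from the coin flip and only 4 from the sure thing, so they unanimously prefer the coin flip. The equality-adjusted aggregate scores the coin flip at 0 and the sure thing at 8, so it picks the option that everyone disprefers.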

Remember that VNM utility has a precise decision-theoretic meaning. Twice as much utility does not correspond to your intuitions about what “twice as much goodness” means. Your intuitions about the best way to distribute goodness to people will not necessarily be good ways to distribute utility. The axioms I used were extremely rudimentary, whereas the intuition that generated "there should be a term for equality or something" is untrustworthy. If they come into conflict, you can't keep all of them. I don't see any way to justify giving up axioms 1 or 2, and axiom 3 will likely remain true whether you want it to or not, so you should probably give up whatever else you wanted to add to the FAI's utility function.

Citations:

Harsanyi, John C. "Cardinal welfare, individualistic ethics, and interpersonal comparisons of utility." The Journal of Political Economy (1955): 309-321.

Hammond, Peter J. "Harsanyi’s utilitarian theorem: A simpler proof and some ethical connotations." In R. Selten (Ed.), Rational Interaction: Essays in Honor of John Harsanyi. 1992.

Comments


So when you're talking about decision theory and your intuitions come into conflict with the math, listen to the math.

I think you're overselling your case a little here. The cool thing about theorems is that their conclusions follow from their premises. If you then try to apply the theorem to the real world and someone dislikes the conclusion, the appropriate response isn't "well it's math, so you can't do that," it's "tell me which of my premises you dislike."

An additional issue here is premises which are not explicitly stated. For example, there's an implicit premise in your post of there being some fixed collection of agents with some fixed collection of preferences that you want to aggregate. Not pointing out this premise explicitly leaves your implied social policy potentially vulnerable to various attacks involving creating agents, destroying agents, or modifying agents, as I've pointed out in other comments.

I suggest the VNM Expected Utility Theorem and this theorem should be used as a test on potential FAI researchers. Is their reaction to these theorems "of course, the FAI has to be designed that way" or "that's a cool piece of math, now let's see if we can't break it somehow"? Maybe you don't need everyone on the research team to instinctively have the latter reaction, but I think you definitely want to make sure at least some do. (I wonder what von Neumann's reaction was to his own theorem...)

I think you're overselling your case a little here. The cool thing about theorems is that their conclusions follow from their premises. If you then try to apply the theorem to the real world and someone dislikes the conclusion, the appropriate response isn't "well it's math, so you can't do that," it's "tell me which of my premises you dislike."

That's a good point. I agree, and I've edited my post to reflect that.

An additional issue here is premises which are not explicitly stated. For example, there's an implicit premise in your post of there being some fixed collection of agents with some fixed collection of preferences that you want to aggregate. Not pointing out this premise explicitly leaves your implied social policy potentially vulnerable to various attacks involving creating agents, destroying agents, or modifying agents, as I've pointed out in other comments.

I thought I was being explicit about that when I was writing it, but looking at my post again, I now see that I was not. I've edited it to try to clarify that.

Thanks for pointing those out.

Axiom 1: Every person, and the FAI, are VNM-rational agents.

[...]

So why should you accept my axioms?

Axiom 1: The VNM utility axioms are widely agreed to be necessary for any rational agent.

Though of course, humans are not VNM-rational.

But many people don't like this, usually for reasons involving utility monsters. If you are one of these people, then you better learn to like it, because according to Harsanyi's Social Aggregation Theorem, any alternative can result in the supposedly Friendly AI making a choice that is bad for every member of the population. More formally,

That a bad result can happen in a given strategy is not a conclusive argument against preferring that strategy. Will it happen? What's the likelihood that it happens? What's the cost if it does happen?

The two alternatives discussed each have their own failure mode, while your "better learn to like it" admonition seems to imply that one side is compelled by the failure mode of their preferred strategy to give it up for the alternative strategy.

Why is this new failure mode supposed to be decisive in the choice between the two alternatives?

There's something a little ridiculous about claiming that every member of a group prefers A to B, but that the group in aggregate does not prefer A to B.

That would look a bit like Simpson's paradox actually.

The situation analogous to Simpson's paradox can only occur if for some reason we care about some people's opinions more than others' in some situations. This is analogous to the situation in Simpson's paradox where we have more data points in some parts of the table than in others, which is a necessary condition for the paradox to occur.

For example: Suppose Alice (female) values a cure for prostate cancer at 10 utils, and a cure for breast cancer at 15 utils. Bob (male) values a cure for prostate cancer at 100 utils, and a cure for breast cancer at 150 utils. Suppose that because prostate cancer largely affects men and breast cancer largely affects women we value Alice's opinion twice as much about breast cancer and Bob's opinion twice as much about prostate cancer. Then in the aggregate curing prostate cancer is 210 utils and curing breast cancer 180 utils, a preference reversal compared to either of Alice or Bob.
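The arithmetic in this example can be checked with a few lines of Python. The sketch uses the numbers given above; the dictionary layout and function names are just an illustrative choice. It also checks that with fixed per-person weights (the same weight on both issues) the reversal does not appear.

```python
# Utilities from the example above.
values = {
    "Alice": {"prostate": 10, "breast": 15},
    "Bob": {"prostate": 100, "breast": 150},
}

# Issue-dependent weights: Alice counts double on breast cancer,
# Bob counts double on prostate cancer.
issue_weights = {
    "Alice": {"prostate": 1, "breast": 2},
    "Bob": {"prostate": 2, "breast": 1},
}


def aggregate_issue_weighted(cure):
    return sum(issue_weights[p][cure] * values[p][cure] for p in values)


def aggregate_fixed(cure, weights):
    """Aggregate with one fixed weight per person, as in a linear combination."""
    return sum(weights[p] * values[p][cure] for p in values)


prostate_total = aggregate_issue_weighted("prostate")
breast_total = aggregate_issue_weighted("breast")
print(prostate_total, breast_total)

# With fixed positive per-person weights, the aggregate agrees with
# both Alice and Bob that the breast cancer cure is worth more.
fixed_agrees = all(
    aggregate_fixed("breast", w) > aggregate_fixed("prostate", w)
    for w in ({"Alice": 1, "Bob": 1}, {"Alice": 1, "Bob": 2}, {"Alice": 2, "Bob": 1})
)
print(fixed_agrees)
```

The reversal comes entirely from letting a person's weight depend on the issue being decided, which is what makes the aggregate something other than a single linear combination of the two utility functions.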

This is essentially just an example of Harsanyi's Theorem in action. And I think it makes a compelling demonstration of why you should not program an AI in that fashion.

What if we also add a requirement that the FAI doesn't make anyone worse off in expected utility compared to no FAI? That seems reasonable, but conflicts with the other axioms. For example, suppose there are two agents: A gets 1 util if 90% of the universe is converted into paperclips, 0 utils otherwise, and B gets 1 util if 90% of the universe is converted into staples, 0 utils otherwise. Without an FAI, they'll probably end up fighting each other for control of the universe, and let's say each has 30% chance of success. An FAI that doesn't make one of them worse off has to prefer a 50/50 lottery of the universe turning into either paperclips or staples to a certain outcome of either, but that violates VNM rationality.
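A quick sketch of why this conflicts with maximizing a weighted sum: the 50/50 lottery is the only option that leaves both agents at or above their 30% baseline, but under any fixed weights its aggregate value is the average of the two certain outcomes, so it can never be strictly best. The grid of weights searched below is an arbitrary illustrative choice, and the names are hypothetical.

```python
from fractions import Fraction
from itertools import product

BASELINE = Fraction(3, 10)  # each agent's expected utility without an FAI

# Expected utility of each option for (agent A, agent B).
OPTIONS = {
    "paperclips": (Fraction(1), Fraction(0)),
    "staples": (Fraction(0), Fraction(1)),
    "coin_flip": (Fraction(1, 2), Fraction(1, 2)),
}


def nobody_worse_off(option):
    return all(eu >= BASELINE for eu in OPTIONS[option])


def aggregate(weights, option):
    return sum(c * eu for c, eu in zip(weights, OPTIONS[option]))


def coin_flip_strictly_best(weights):
    flip = aggregate(weights, "coin_flip")
    return (flip > aggregate(weights, "paperclips")
            and flip > aggregate(weights, "staples"))


acceptable = [name for name in OPTIONS if nobody_worse_off(name)]

# Search a grid of weights (including negative ones) for a linear
# aggregate that strictly prefers the coin flip to both certain outcomes.
grid = [Fraction(i, 4) for i in range(-8, 9)]
counterexamples = [w for w in product(grid, repeat=2) if coin_flip_strictly_best(w)]

print(acceptable, counterexamples)
```

The search finds no such weights, which matches the algebra: the coin flip's aggregate value is the mean of the two weights, and a mean cannot exceed both of the numbers being averaged.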

And things get really confusing when we also consider issues of logical uncertainty and dynamical consistency.

What if we also add a requirement that the FAI doesn't make anyone worse off in expected utility compared to no FAI?

Sounds obviously unreasonable to me. E.g. a situation where a person derives a large part of their utility from having kidnapped and enslaved somebody else: the kidnapper would be made worse off if their slave was freed, but the slave wouldn't become worse off if their slavery merely continued, so...

What if we also add a requirement that the FAI doesn't make anyone worse off in expected utility compared to no FAI?

I don't think that seems reasonable at all, especially when some agents want to engage in massively negative-sum games with others (like those you describe), or have massively discrete utility functions that prevent them from compromising with others (like those you describe). I'm okay with some agents being worse off with the FAI, if that's the kind of agents they are.

Luckily, I think people, given time to reflect and grow and learn, are not like that, which is probably what made the idea seem reasonable to you.

I'm okay with some agents being worse off with the FAI, if that's the kind of agents they are.

Do you see CEV as about altruism, instead of cooperation/bargaining/politics? It seems to me the latter is more relevant, since if it's just about altruism, you could use CEV instead of CEV. So, if you don't want anyone to have an incentive to shut down an FAI project, you need to make sure they are not made worse off by an FAI. Of course you could limit this to people who actually have the power to shut you down, but my point is that it's not entirely up to you which agents the FAI can make worse off.

Luckily, I think people, given time to reflect and grow and learn, are not like that

Right, this could be another way to solve the problem: show that of the people you do have to make sure are not made worse off, their actual values (given the right definition of "actual values") are such that a VNM-rational FAI would be sufficient to not make them worse off. But even if you can do that, it might still be interesting and productive to look into why VNM-rationality doesn't seem to be "closed under bargaining".

Also, suppose I personally (according to my sense of altruism) do not want to make anyone among worse off by my actions. Depending on their actual utility functions, it seems that my preferences may not be VNM-rational. So maybe it's not safe to assume that the inputs to this process are VNM-rational either?

Even if it's about bargaining rather than about altruism, it's still okay to have someone worse off under the FAI just so long as they would not be able to predict ahead of time that they would get the short end of the stick. It's possible to have everyone benefit in expectation by creating an AI that is willing to make some people (whose identity humans cannot predict ahead of time) worse off if it brings sufficient gain to the others.

I agree with this, which is why I said "worse off in expected utility" at the beginning of the thread. But I think you need "would not be able to predict ahead of time" in a fairly strong sense, namely that they would not be able to predict it even if they knew all the details of how the FAI worked. Otherwise they'd want to adopt the conditional strategy "learn more about the FAI design, and try to shut it down if I learn that I will get the short end of the stick". It seems like the easiest way to accomplish this is to design the FAI to explicitly not make certain people worse off, rather than depend on that happening as a likely side effect of other design choices.

Have you looked at some of the more recent papers in this literature (which generally have a lot more negative results than positive ones)? For example, "Preference aggregation under uncertainty: Savage vs. Pareto"? I haven't paid too much attention to this literature myself yet, because the social aggregation results seem pretty sensitive to details of the assumed individual decision theory, which is still pretty unsettled. (Oh, I mentioned another paper here.)

I'd be curious to see someone reply to this on behalf of parliamentary models, whether applied to preference aggregation or to moral uncertainty between different consequentialist theories. Do the choices of a parliament reduce to maximizing a weighted sum of utilities? If not, which axiom out of 1-3 do parliamentary models violate, and why are they viable despite violating that axiom?

Can you be more specific about what you mean by a parliamentary model? (If I had to guess, though, axiom 1.)

Interesting. A parliamentary model applied to moral uncertainty definitely fails axiom 1 if any of the moral theories you're aggregating isn't VNM-rational. It probably still fails axiom 1 even if all of the individual moral theories are VNM-rational because the entire parliament is probably not VNM-rational. That's okay from Bostrom's point of view because VNM-rationality could be one of the things you're uncertain about.

What if it is not, in fact, one of the things you're uncertain about?

Then I am not sure, because that blog post hasn't specified the model precisely enough for me to do any math, but my guess would be that the parliament fails to be VNM-rational. Depending on how the bargaining mechanism is set up, it might even fail to have coherent preferences in the sense that it might not always make the same choice when presented with the same pair of outcomes...

Axiom two reminds me of Simpson's paradox. I'm not sure how applicable it is, but I wouldn't be all that surprised to find an explanation that makes a violation of this axiom perfectly reasonable. I don't suppose you have a set of more obvious axioms you could work with?

I don't see how I could agree with this conclusion:

But many people don't like this, usually for reasons involving utility monsters. If you are one of these people, then you better learn to like it, because according to Harsanyi's Social Aggregation Theorem, any alternative can result in the supposedly Friendly AI making a choice that is bad for every member of the population.

If both ways are wrong, then you haven't tried hard enough yet.

Well explained though.

The Social Aggregation Theorem doesn't just show that some particular way of aggregating utility functions other than by linear combination is bad; it shows that every way of aggregating utility functions other than by linear combination is bad.

Great post! I wish Harsanyi's papers were better known amongst philosophers.

Thanks for posting this! This is a fairly satisfying answer to my question from before.

Can you clarify which people you want to apply this theorem to? I don't think the relevant people should be the set of all humans alive at the time that the FAI decides what to do because this population is not fixed over time and doesn't have fixed utility functions over time. I can think of situations where I would want the FAI to make a decision that all humans alive at a fixed time would disagree with (for example, suppose most humans die and the only ones left happen to be amoral savages), and I also have no idea how to deal with changing populations with changing utility functions in general.

So it seems the FAI should be aggregating the preferences of a fixed set of people for all time. But this also seems problematic.

Can you clarify which people you want to apply this theorem to?

I'm not entirely sure. My default answer to that is "all people alive at the time that the singularity occurs", although you pointed out a possible drawback to that (it incentivizes people to create more people with values similar to their own) in our previous discussion. This is really an instrumental question: What set of people should I suggest get to have their utility functions aggregated into the CEV so as to best maximize my utility? One possible answer is to aggregate the utilities of everyone who worked on or supported the FAI project, but I suspect that due to the influence of far thinking, that would actually be a terrible way to motivate people to work on FAI, and it should actually be much broader than that.

So it seems the FAI should be aggregating the preferences of a fixed set of people for all time. But this also seems problematic.

I don't think it would be terribly problematic. "People in the future should get exactly what we currently would want them to get if we were perfectly wise and knew their values and circumstances" seems like a pretty good rule. It is, after all, what we want.

My default answer to that is "all people alive at the time that the singularity occurs", although you pointed out a possible drawback to that (it incentivizes people to create more people with values similar to their own) in our previous discussion.

And also incentivizes people to kill people with values dissimilar to their own!

I don't think it would be terribly problematic. "People in the future should get exactly what we currently would want them to get if we were perfectly wise and knew their values and circumstances" seems like a pretty good rule. It is, after all, what we want.

Fair enough. Hmm.

And also incentivizes people to kill people with values dissimilar to their own!

That's a pretty good nail in the coffin. Maybe all people alive at the time of your comment. Or at any point in some interval containing that time, possibly including up to the time the singularity occurs. Although again, these are crude guesses, not final suggestions. This might be a good question to think more about.

That's a pretty good nail in the coffin.

It's not as bad as it sounds. Both arguments are also arguments against democracy, but I don't think they're knockdown arguments against democracy (although the general point that democracy can be gamed by brainwashing enough people is good to keep in mind, and I think is a point that Moldbug, for example, is quite preoccupied with). For example, killing people doesn't appear to be a viable strategy for gaining control of the United States at the moment. Although the killing-people strategy in the FAI case might look more like "the US decides to nuke Russia immediately before the singularity occurs."

For example, killing people doesn't appear to be a viable strategy for gaining control of the United States at the moment.

Perhaps not, but it might help maintain control of the USG insofar as popularity increases the chances of reelection and killing (certain) people increases popularity.

A Friendly AI would have to be able to aggregate each person's preferences into one utility function. The most straightforward and obvious way to do this is to agree on some way to normalize each individual's utility function, and then add them up. But many people don't like this, usually for reasons involving utility monsters.

I should think most of those who don't like it do so because their values would be better represented by other approaches. A lot of those involved in the issue think they deserve more than a one-in-seven-billionth share of the future - and so pursue approaches that will help to deliver them that. This probably includes most of those with the skills to create such a future, and most of those with the resources to help fund them.

They could just insist on a normalization scheme that is blatantly biased in favor of their utility function. In a theoretical sense, this doesn't cause a problem, since there is no objective way to define an unbiased normalization anyway. (Of course, if everyone insisted on biasing the normalization in their favor, there would be a problem.)

I think most of those involved realise that such projects tend to be team efforts - and therefore some compromises over values will be necessary. Anyway, I think this is the main difficulty for utilitarians: most people are not remotely like utilitarians - and so don't buy into their bizarre ideas about what the future should be like.