The True Prisoner's Dilemma

It occurred to me one day that the standard visualization of the Prisoner's Dilemma is fake.

The core of the Prisoner's Dilemma is this symmetric payoff matrix:

         1: C      1: D
2: C    (3, 3)    (5, 0)
2: D    (0, 5)    (2, 2)

Player 1, and Player 2, can each choose C or D.  1 and 2's utility for the final outcome is given by the first and second number in the pair.  For reasons that will become apparent, "C" stands for "cooperate" and D stands for "defect".

Observe that a player in this game (regarding themselves as the first player) has this preference ordering over outcomes:  (D, C) > (C, C) > (D, D) > (C, D).

D, it would seem, dominates C:  If the other player chooses C, you prefer (D, C) to (C, C); and if the other player chooses D, you prefer (D, D) to (C, D).  So you wisely choose D, and as the payoff table is symmetric, the other player likewise chooses D.

If only you'd both been less wise!  You both prefer (C, C) to (D, D).  That is, you both prefer mutual cooperation to mutual defection.
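The dominance argument can be checked mechanically. Here is a minimal sketch in Python, assuming the matrix above is stored as a dictionary keyed by (player 1's move, player 2's move); the names `PAYOFF`, `dominates` and `pareto_better` are illustrative and not taken from any library.

```python
# Payoffs from the matrix above: (player 1's move, player 2's move) -> (u1, u2).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}

def dominates(a, b, payoff=PAYOFF):
    """True if move `a` gives player 1 strictly more than move `b` against every reply."""
    return all(payoff[(a, other)][0] > payoff[(b, other)][0] for other in "CD")

def pareto_better(x, y, payoff=PAYOFF):
    """True if outcome `x` is strictly better than outcome `y` for both players."""
    return all(ux > uy for ux, uy in zip(payoff[x], payoff[y]))

print(dominates("D", "C"))                     # D beats C whatever the other player does
print(pareto_better(("C", "C"), ("D", "D")))   # yet both players prefer (C, C) to (D, D)
```

Both checks come out true for this matrix, which is exactly the tension: the individually dominant move leads to an outcome both players rank below mutual cooperation.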

The Prisoner's Dilemma is one of the great foundational issues in decision theory, and enormous volumes of material have been written about it.  Which makes it an audacious assertion of mine, that the usual way of visualizing the Prisoner's Dilemma has a severe flaw, at least if you happen to be human.

The classic visualization of the Prisoner's Dilemma is as follows: you are a criminal, and you and your confederate in crime have both been captured by the authorities.

Independently, without communicating, and without being able to change your mind afterward, you have to decide whether to give testimony against your confederate (D) or remain silent (C).

Both of you, right now, are facing one-year prison sentences; testifying (D) takes one year off your prison sentence, and adds two years to your confederate's sentence.
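To see how that rule generates the familiar ordering, here is a small sketch that applies it to all four combinations of moves. The function name `sentences` and the constant `BASE_YEARS` are illustrative assumptions; the numbers come from the rule just stated.

```python
BASE_YEARS = 1  # each prisoner currently faces a one-year sentence

def sentences(move1, move2):
    """Return (years for prisoner 1, years for prisoner 2) after both have chosen."""
    years1 = years2 = BASE_YEARS
    if move1 == "D":   # prisoner 1 testifies: one year off their own sentence, two added to the other's
        years1 -= 1
        years2 += 2
    if move2 == "D":   # prisoner 2 testifies: same rule, mirrored
        years2 -= 1
        years1 += 2
    return years1, years2

for m1 in "CD":
    for m2 in "CD":
        print((m1, m2), sentences(m1, m2))
```

Since fewer years are better, prisoner 1 ranks the outcomes 0 years for (D, C), then 1 year for (C, C), then 2 years for (D, D), then 3 years for (C, D), which is the same ordering as in the payoff matrix.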

Or maybe you and some stranger are, only once, and without knowing the other player's history, or finding out who the player was afterward, deciding whether to play C or D, for a payoff in dollars matching the standard chart.

And, oh yes - in the classic visualization you're supposed to pretend that you're entirely selfish, that you don't care about your confederate criminal, or the player in the other room.

It's this last specification that makes the classic visualization, in my view, fake.

You can't avoid hindsight bias by instructing a jury to pretend not to know the real outcome of a set of events.  And without a complicated effort backed up by considerable knowledge, a neurologically intact human being cannot pretend to be genuinely, truly selfish.

We're born with a sense of fairness, honor, empathy, sympathy, and even altruism - the result of our ancestors adapting to play the iterated Prisoner's Dilemma.  We don't really, truly, absolutely and entirely prefer (D, C) to (C, C), though we may entirely prefer (C, C) to (D, D) and (D, D) to (C, D).  The thought of our confederate spending three years in prison, does not entirely fail to move us.

In that locked cell where we play a simple game under the supervision of economic psychologists, we are not entirely and absolutely unsympathetic toward the stranger who might cooperate.  We aren't entirely happy to think that we might defect and the stranger cooperate, getting five dollars while the stranger gets nothing.

We fixate instinctively on the (C, C) outcome and search for ways to argue that it should be the mutual decision:  "How can we ensure mutual cooperation?" is the instinctive thought.  Not "How can I trick the other player into playing C while I play D for the maximum payoff?"

For someone with an impulse toward altruism, or honor, or fairness, the Prisoner's Dilemma doesn't really have the critical payoff matrix - whatever the financial payoff to individuals.  (C, C) > (D, C), and the key question is whether the other player sees it the same way.

And no, you can't instruct people being initially introduced to game theory to pretend they're completely selfish - any more than you can instruct human beings being introduced to anthropomorphism to pretend they're expected paperclip maximizers.

To construct the True Prisoner's Dilemma, the situation has to be something like this:

Player 1:  Human beings, Friendly AI, or other humane intelligence.

Player 2:  UnFriendly AI, or an alien that only cares about sorting pebbles.

Let's suppose that four billion human beings - not the whole human species, but a significant part of it - are currently progressing through a fatal disease that can only be cured by substance S.

However, substance S can only be produced by working with a paperclip maximizer from another dimension - substance S can also be used to produce paperclips.  The paperclip maximizer only cares about the number of paperclips in its own universe, not in ours, so we can't offer to produce or threaten to destroy paperclips here.  We have never interacted with the paperclip maximizer before, and will never interact with it again.

Both humanity and the paperclip maximizer will get a single chance to seize some additional part of substance S for themselves, just before the dimensional nexus collapses; but the seizure process destroys some of substance S.

The payoff matrix is as follows:

        1: C                                                   1: D
2: C    (2 billion human lives saved, 2 paperclips gained)    (+3 billion lives, +0 paperclips)
2: D    (+0 lives, +3 paperclips)                             (+1 billion lives, +1 paperclip)

I've chosen this payoff matrix to produce a sense of indignation at the thought that the paperclip maximizer wants to trade off billions of human lives against a couple of paperclips.  Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips.

In this case, we really do prefer the outcome (D, C) to the outcome (C, C), leaving aside the actions that produced it.  We would vastly rather live in a universe where 3 billion humans were cured of their disease and no paperclips were produced, rather than sacrifice a billion human lives to produce 2 paperclips.  It doesn't seem right to cooperate, in a case like this.  It doesn't even seem fair - so great a sacrifice by us, for so little gain by the paperclip maximizer?  And let us specify that the paperclip-agent experiences no pain or pleasure - it just outputs actions that steer its universe to contain more paperclips.  The paperclip-agent will experience no pleasure at gaining paperclips, no hurt from losing paperclips, and no painful sense of betrayal if we betray it.

What do you do then?  Do you cooperate when you really, definitely, truly and absolutely do want the highest reward you can get, and you don't care a tiny bit by comparison about what happens to the other player?  When it seems right to defect even if the other player cooperates?

That's what the payoff matrix for the true Prisoner's Dilemma looks like - a situation where (D, C) seems righter than (C, C).

But all the rest of the logic - everything about what happens if both agents think that way, and both agents defect - is the same.  For the paperclip maximizer cares as little about human deaths, or human pain, or a human sense of betrayal, as we care about paperclips.  Yet we both prefer (C, C) to (D, D).
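As a rough check that the structure really is the same, the sketch below ranks the four outcomes for each side by its own quantity alone (lives for humanity, paperclips for the paperclip maximizer). Treating each side's utility as depending only on its own quantity is the premise of the scenario; the names `TRUE_PD` and `ranking` are illustrative.

```python
# Outcomes keyed by (humanity's move, paperclip maximizer's move).
# Values: (billions of human lives saved, paperclips gained).
TRUE_PD = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}

def ranking(player):
    """Outcomes sorted best-first by one side's own quantity (0 = humanity, 1 = paperclip maximizer)."""
    return sorted(TRUE_PD, key=lambda outcome: TRUE_PD[outcome][player], reverse=True)

print(ranking(0))  # humanity's ordering over outcomes
print(ranking(1))  # the paperclip maximizer's ordering over outcomes
```

Each side most prefers the outcome where it alone defects, and each side ranks (C, C) above (D, D), just as in the standard matrix.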

So if you've ever prided yourself on cooperating in the Prisoner's Dilemma... or questioned the verdict of classical game theory that the "rational" choice is to defect... then what do you say to the True Prisoner's Dilemma above?

Comments

I would certainly hope you would defect, Eliezer. Can I really trust you with the future of the human race?

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

But wait for tomorrow's post before you accuse me of disloyalty to humanity.

Ha, I was waiting for someone to accuse me of antisocial behavior for hinting that I might cooperate in the Prisoner's Dilemma.

It is fascinating looking at the conversation on this subject back in 2008, back before TDT and UDT had become part of the culture. The objections (and even the mistakes) all feel so fresh!

At this point Yudkowsky sub 2008 has already (awfully) written his TDT manuscript (in 2004) and is silently reasoning from within that theory, which the margins of his post are too small to contain.

On the off chance anyone actually sees this - I don't actually see a "next post" follow-up to this. Can anyone provide me with a link, and instructions as to how you got it?

Robin, the point I'm complaining about is precisely that the standard illustration of the Prisoner's Dilemma, taught to beginning students of game theory, fails to convey those entries in the payoff matrix - as if the entries were merely money instead of utilons, which is not at all what the Prisoner's Dilemma is about.

The point of the True Prisoner's Dilemma is that it gives you a payoff matrix that is very nearly the standard matrix in utilons, not just years in prison or dollars in an encounter.

I.e., you can tell people all day long that the entries are in utilons, but until you give them a visualization where those really are the utilons, it's around as effective as telling juries to ignore hindsight bias.

I agree: Defect!

I didn't say I would defect.

By the way, this was an extremely clever move: instead of announcing your departure from CDT in the post, you waited for the right prompt in the comments and dropped it as a shocking twist. Well crafted!

The entries in a payoff matrix are supposed to sum up everything you care about, including whatever you care about the outcomes for the other player. Most every game theory text and lecture I know gets this right, but even when we say the right thing to students over and over, they mostly still hear it the wrong way you initially heard it. This is just part of the facts of life of teaching game theory.

I suspect that the True Prisoner's Dilemma played itself out in the Portuguese and Spanish conquest of Mesoamerica. Some natives were said to ask, "Do they eat gold?" They couldn't comprehend why someone would want a shiny decorative material so badly that they'd kill for it. The Spanish were Shiny Decorative Material maximizers.

That's a really insightful comment!

But I should correct you, that you are only talking about the Spanish conquest, not the Portuguese, since 1) Mesoamerica was not conquered by the Portuguese; 2) Portuguese possessions in America (AKA Brazil) had very little gold and silver, which was only discovered much later, when it was already in Portuguese domain.

Prase, Chris, I don't understand. Eliezer's example is set up in such a way that, regardless of what the paperclip maximizer does, defecting gains one billion lives and loses two paperclips.

Basically, we're being asked to choose between a billion lives and two paperclips (paperclips in another universe, no less, so we can't even put them to good use).
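The arithmetic behind that claim can be checked directly from the post's matrix. This is only a sketch; `TRUE_PD` and `defection_effect` are illustrative names.

```python
# Outcomes keyed by (humanity's move, paperclip maximizer's move).
# Values: (billions of human lives saved, paperclips gained).
TRUE_PD = {
    ("C", "C"): (2, 2),
    ("D", "C"): (3, 0),
    ("C", "D"): (0, 3),
    ("D", "D"): (1, 1),
}

def defection_effect(other_move):
    """Change in (lives in billions, paperclips) when humanity switches from C to D,
    holding the paperclip maximizer's move fixed."""
    lives_c, clips_c = TRUE_PD[("C", other_move)]
    lives_d, clips_d = TRUE_PD[("D", other_move)]
    return lives_d - lives_c, clips_d - clips_c

print(defection_effect("C"))  # paperclip maximizer cooperates
print(defection_effect("D"))  # paperclip maximizer defects
```

Either way the difference is one billion more lives and two fewer paperclips.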

The only argument for cooperating would be if we had reason to believe that the paperclip maximizer will somehow do whatever we do. But I can't imagine how that could be true. Being a paperclip maximizer, it's bound to defect, unless it had reason to believe that we would somehow do whatever it does. I can't imagine how that could be true either.

Or am I missing something?

By the way:

Human: "What do you care about 3 paperclips? Haven't you made trillions already? That's like a rounding error!" Paperclip Maximizer: "How can you talk about paperclips like that?"


PM: "What do you care about a billion human algorithm continuities? You've got virtually the same one in billions of others! And you'll even be able to embed the algorithm in machines one day!" H: "How can you talk about human lives that way?"

I would certainly hope you would defect, Eliezer. Can I really trust you with the future of the human race?

I don't think Eliezer misunderstood. I think you are missing his point, that economists are defining away empathy in the way they present the problem, including the utilities presented.

Michael: This is not a prisoner's dilemma. The nash equilibrium (C,C) is not dominated by a pareto optimal point in this game.

I don't believe this is correct. Isn't the Nash equilibrium here (D,D)? That's the point at which neither player can gain by unilaterally changing strategy.
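That definition can be applied by brute force to the matrix in the post. The sketch below checks every pair of moves for a profitable unilateral deviation; `PAYOFF` and `is_nash` are illustrative names.

```python
from itertools import product

# Payoffs from the post: (player 1's move, player 2's move) -> (u1, u2).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}

def is_nash(m1, m2):
    """True if neither player can gain by changing only their own move."""
    u1, u2 = PAYOFF[(m1, m2)]
    best1 = all(u1 >= PAYOFF[(alt, m2)][0] for alt in "CD")
    best2 = all(u2 >= PAYOFF[(m1, alt)][1] for alt in "CD")
    return best1 and best2

equilibria = [outcome for outcome in product("CD", repeat=2) if is_nash(*outcome)]
print(equilibria)
```

For this matrix the only pair that survives is (D, D).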

It's clear that in the "true" prisoner's dilemma it is better to defect. The frustrating thing about the other prisoner's dilemma is that some people use it to imply that it is better to defect in real life. The problem is that the prisoner's dilemma is a drastic oversimplification of reality. To make it more realistic you'd have to make it iterated amongst a person's social network, add a memory and a perception of the other player's actions, change the payoff matrix depending on the relationship between the players, and so on.

This version shows cases in which defection has a higher expected value for both players, but it's more contrived, and less likely to come into existence, than the other prisoner's dilemma.

Eliezer,

The other assumption made about Prisoner's Dilemma, that I do not see you allude to, is that the payoffs account for not only a financial reward, time spent in prison, etc., but every other possible motivating factor in the decision making process. A person's utility related to the decision of whether to cooperate or defect will be a function of not only years spent in prison or lives saved but ALSO guilt/empathy. Presenting the numbers within the cells as actual quantities doesn't present the whole picture.

Important point.

Let's assume that your utility function (which is identical to theirs) simply weights and adds your payoff and theirs; that is, if you get X and they get Y, your function is U(X,Y) = aX+bY. In that case, working backwards from the utilities in the table, and subject to the constraint that a+b=1, here are the payoffs:

a/b=2: (you care twice as much about yourself)
(3,3) (-5,10)
(10,-5) (2,2)

a/b=3:
(3,3) (-2.5,7.5)
(7.5,-2.5) (2,2)

a=b:
Impossible. With both people being unselfish utilitarians, the utilities can never differ based on the same outcome.

b=0: (selfish)
The table as given in the post

I think the most important result is the case a=b: the dilemma makes no sense at all if the players weight both payoffs equally, because you can never produce asymmetrical utilities.
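For anyone who wants to reproduce the numbers, here is a sketch of the inversion. It assumes both players use the same weights, so that U_me = aX + bY and U_them = aY + bX, and it solves that pair of equations for the raw payoffs X and Y. The function name `payoffs_from_utilities` is made up for illustration.

```python
from fractions import Fraction

def payoffs_from_utilities(u_me, u_them, ratio):
    """Invert U_me = a*X + b*Y and U_them = a*Y + b*X, with a/b = ratio and a + b = 1.

    Returns the raw payoffs (X, Y) as exact fractions.
    """
    ratio = Fraction(ratio)
    b = 1 / (ratio + 1)
    a = ratio * b
    det = a * a - b * b  # zero exactly when a == b
    if det == 0:
        raise ValueError("a == b: both utilities are the same function of the outcome")
    x = (a * u_me - b * u_them) / det
    y = (a * u_them - b * u_me) / det
    return x, y

print(payoffs_from_utilities(5, 0, 2))  # the cell where I defect and they cooperate, a/b = 2
print(payoffs_from_utilities(5, 0, 3))  # same cell, a/b = 3
print(payoffs_from_utilities(2, 2, 2))  # mutual defection, a/b = 2
```

Under these assumptions the symmetric cells are unaffected by the weighting: utilities of (3, 3) map back to payoffs of (3, 3), and utilities of (2, 2) map back to payoffs of (2, 2). With a = b the determinant is zero, which is the "impossible" case: no raw payoffs can produce different utilities for the two players.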

EDIT: My newbishness is showing. How do I format this better? Is it HTML?

I heard a funny story once (online somewhere, but this was years ago and I can't find it now). Anyway I think it was the psychology department at Stanford. They were having an open house, and they had set up a PD game with M&M's as the reward. People could sit at either end of a table with a cardboard screen before them, and choose 'D' or 'C', and then have the outcome revealed and get their candy.

So this mother and daughter show up, and the grad student explained the game. Mom says to the daughter "Okay, just push 'C', and I'll do the same, and we'll get the most M&M's. You can have some of mine after."

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

So the daughter pushes 'C', Mom pushes 'D', swallows all 5 M&M's, and with a full mouth says "Let that be a lesson! You can't trust anybody!"

I have seen various variations of this story, some told firsthand. In every case I have concluded that they are just bad parents. They aren't clever. They aren't deep. They are incompetent and banal. Even if parents try as hard as they can to be fair, just and reliable, they still fall short of that standard often enough for children to be aware that they can't be completely trusted. Moreover children are exposed to other children and other adults and so are able to learn to distinguish people they trust from people that they don't. Adding the parent to the untrusted list achieves little benefit.

I'd like to hear the follow up to this 'funny' story. Where the daughter updates on the untrustworthiness of the parent and the meaninglessness of her word. She then proceeds to completely ignore the mother's commands, preferences and even her threats. The mother destroyed a valuable resource (the ability to communicate via 'cheap' verbal signals) for the gain of a brief period of feeling smug superiority. The daughter (potentially) realises just how much additional freedom and power she has in practice when she feels no internal motivation to comply with her mother's verbal utterances.

(Bonus follow up has the daughter steal the mother's credit card and order 10kg of M&Ms online. Reply when she objects "Let that be a lesson! You can't trust anybody!")

I suppose the biggest lesson for the daughter to learn is just how significant the social and practical consequences of reckless defection in social relationships can be.

I apologize if this is covered by basic decision theory, but if we additionally assume:

  • the choice in our universe is made by a perfectly rational optimization process instead of a human

  • the paperclip maximizer is also a perfect rationalist, albeit with a very different utility function

  • each optimization process can verify the rationality of the other

then won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Each side's choice necessarily reveals the other's; they're the outputs of equivalent computations.
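A toy sketch of that argument, taking its premise for granted: if both outputs are guaranteed to be equal, the only reachable outcomes are (C, C) and (D, D), so each side only has to compare those two. Whether the premise holds is the point under dispute in this thread; the code merely shows what follows from it. `PAYOFF` and `mirrored_choice` are illustrative names.

```python
# Payoffs from the post: (player 1's move, player 2's move) -> (u1, u2).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}

def mirrored_choice(player):
    """Best move for `player` (0 or 1), ASSUMING the other side outputs the same move."""
    return max("CD", key=lambda move: PAYOFF[(move, move)][player])

print(mirrored_choice(0), mirrored_choice(1))
```

Under the assumption of identical outputs, both sides pick C, because (C, C) beats (D, D) for each of them.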

michael webster,

You seem to have inverted the notation; not Eli.

(D,D) is the Nash equilibrium, not (C,C); and (D,D) is indeed Pareto dominated by (C,C), so this does seem to be a standard Prisoners' Dilemma.

Definitely defect. Cooperation only makes sense in the iterated version of the PD. This isn't the iterated case, and there's no prior communication, hence no chance to negotiate for mutual cooperation (though even if there was, meaningful negotiation may well be impossible depending on specific details of the situation). Superrationality be damned, humanity's choice doesn't have any causal influence on the paperclip maximizer's choice. Defection is the right move.

Allan Crossman: Only if they believe that their decision somehow causes the other to make the same decision.

No line of causality from one to the other is required.

If a computer finds that (2^3021377)-1 is prime, it can also conclude that an identical computer a light year away will do the same. This doesn't mean one computation caused the other.

The decisions of perfectly rational optimization processes are just as deterministic.

Eliezer, I agree that your example makes more clear the point you are trying to make clear, but in an intro to game theory course I'd still start with the standard prisoner's dilemma example first, and only get to your example if I had time to make the finer point clearer. For intro classes for typical students the first priority is to be understood at all in any way, and that requires examples as simple clear and vivid as possible.

It's likely deliberate that prisoners were selected in the visualization to imply a relative lack of unselfish motivations.

Allan: There are benefits and no costs to defecting.

This is the same error as in the Newcomb's problem: there is in fact a cost. In case of prisoner's dilemma, you are penalized by ending up with (D,D) instead of better (C,C) for deciding to defect, and in the case of Newcomb's problem you are penalized by having only $1000 instead of $1,000,000 for deciding to take both boxes.

Interesting. There's a paradox involving a game in which players successively take a single coin from a large pile of coins. At any time a player may choose instead to take two coins, at which point the game ends and all further coins are lost. You can prove by induction that if both players are perfectly selfish, they will take two coins on their first move, no matter how large the pile is. People find this paradox impossible to swallow because they model perfect selfishness on the most selfish person they can imagine, not on a mathematically perfect selfishness machine. It's nice to have an "intuition pump" that illustrates what genuine selfishness looks like.
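The induction can be spelled out as a short backward-induction sketch. It assumes each player's payoff is simply the number of coins that player ends up with, and that with a single coin left the player to move just takes it; `solve` and the action labels are illustrative names.

```python
def solve(pile):
    """Backward induction for the coin game.

    Returns a list `best` where best[n] = (action, mover_total, other_total)
    describes optimal selfish play when n coins remain and it is the mover's turn.
    """
    best = [("stop", 0, 0)]            # empty pile: nothing left to take
    if pile >= 1:
        best.append(("take1", 1, 0))   # one coin left: the only move is to take it
    for n in range(2, pile + 1):
        _, next_mover, next_other = best[n - 1]
        # Taking one coin hands the move to the opponent with n - 1 coins left.
        take_one = ("take1", 1 + next_other, next_mover)
        # Taking two coins ends the game immediately; all further coins are lost.
        take_two = ("take2", 2, 0)
        best.append(max(take_one, take_two, key=lambda option: option[1]))
    return best

best = solve(1000)
print(best[1000])
```

For every pile of two or more coins the computed action is to take two at once, so the first mover walks away with two coins and the rest of the pile is lost, however large it was.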

Hmm. We could also put that one in terms of a human or FAI competing against a paperclip maximizer, right? The two players would successively save one human life or create one paperclip (respectively), up to some finite limit on the sum of both quantities.

If both were TDT agents (and each knows that the other is a TDT agent), then would they successfully cooperate for the most part?

In the original version of this game, is it turn-based or are both players considered to be acting simultaneously in each round? If it is simultaneous, then it seems to me that the paperclip-maximizing TDT and the human[e] TDT would just create one paperclip at a time and save one life at a time until the "pile" is exhausted. Not quite sure about what would happen if the game is turn-based, but if the pile is even, I'd expect about the same thing to happen, and if the pile is odd, they'd probably be able to successfully coordinate (without necessarily communicating), maybe by flipping a coin when two pile-units remain and then acting in such a way to ensure that the expected distribution is equal.

How might we and the paperclip-maximizer credibly bind ourselves to cooperation? Seems like it would be difficult dealing with such an alien mind.

I think Eliezer's "We have never interacted with the paperclip maximizer before, and will never interact with it again" was intended to preclude credible binding.

I agree: Defect!

Clearly the paperclip maximizer should just let us have all of substance S; but a paperclip maximizer doesn't do what it should, it just maximizes paperclips.

I sometimes feel that nitpicking is the only contribution I'm competent to make around here, so... here you endorsed Steven's formulation of what "should" means; a formulation which doesn't allow you to apply the word to paperclip maximizers.

simpleton: won't each side choose to cooperate, after correctly concluding that it will defect iff the other does?

Only if they believe that their decision somehow causes the other to make the same decision.

CarlJ: How about placing a bomb on two piles of substance S and giving the remote for the human pile to the clipmaximizer and the remote for its pile to the humans?

It's kind of standard in philosophy that you aren't allowed solutions like this. The reason is that Eliezer can restate his example to disallow this and force you to confront the real dilemma.

Vladimir: It's preferable to choose (C,C) [...] if we assume that other player also bets on cooperation.

No, it's preferable to choose (D,C) if we assume that the other player bets on cooperation.

decide self.C; if other.D, decide self.D

We're assuming, I think, that you don't get to know what the other guy does until after you've both committed (otherwise it's not the proper Prisoner's Dilemma). So you can't use if-then reasoning.

An excellent way to pose the problem.

Obviously, if you know that the other party cares nothing about your outcome, then you know that they're more likely to defect.

And if you know that the other party knows that you care nothing about their outcome, then it's even more likely that they'll defect.

Since the way you posed the problem precludes an iteration of this dilemma, it follows that we must defect.

Allan: They don't have to believe they have such causal powers over each other, simply that they are in certain ways similar to each other.

i.e., A simply has to believe of B: "The process in B is sufficiently similar to me that it's going to end up producing the same results that I am. I am not causing this; it's simply that both computations are going to compute the same thing here."

Cooperate (unless paperclip decides that Earth is dominated by traditional game theorists...)

The standard argument looks like this (let's forget about the Nash equilibrium endpoint for a moment): (1) Arbiter: let's (C,C)! (2) Player1: I'd rather (D,C). (3) Player2: I'd rather (D,D). (4) Arbiter: sold!

The error is that this incremental process reacts to different hypothetical outcomes, not to actual outcomes. This line of reasoning leads to the outcome (D,D), and yet it progresses as if (C,C) and (D,C) were real options for the final outcome. It's similar to the Unexpected hanging paradox: you can only give one answer, not build a long line of reasoning where each step assumes a different answer.

It's preferable to choose (C,C) and similar non-Nash equilibrium options in other one-off games if we assume that other player also bets on cooperation. And he will do that only if he assumes that first player does the same, and so on. This is a situation of common knowledge. How can Player1 come to the same conclusion as Player2? They search for the best joint policy that is stable under common knowledge.

Let's extract the decision procedures selected by both sides to handle this problem as self-contained policies, P1 and P2. Each of these policies may decide differently depending on what policy another player is assumed to use. The stable set of policies is where there is no thrashing, when P1=P1(P2) and P2=P2(P1). Players don't select outcomes, but policies, where policy may not reflect player's preferences, but joint policy (P1,P2) that players select is a stable policy that is preferable to other stable policies for each player. In our case, both policies for (C,C) are something like "decide self.C; if other.D, decide self.D". Works like iterated prisoner's dilemma, but without actual iteration, iteration happens in the model when it needs to be mutually accepted.
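One possible reading of this, offered as a toy sketch and not as a worked-out theory: treat a policy as a function from the other player's assumed move to one's own move, call a pair of moves stable when each move is what its policy outputs given the other, and then pick the stable pair that both players like best. The encoding and the names `conditional_cooperator` and `stable_outcomes` are assumptions made for illustration.

```python
from itertools import product

# Payoffs from the post: (player 1's move, player 2's move) -> (u1, u2).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("D", "C"): (5, 0),
    ("C", "D"): (0, 5),
    ("D", "D"): (2, 2),
}

def conditional_cooperator(assumed_other_move):
    """The policy 'decide self.C; if other.D, decide self.D'."""
    return "D" if assumed_other_move == "D" else "C"

def stable_outcomes(policy1, policy2):
    """Joint moves where each move is what its policy outputs given the other's move."""
    return [(m1, m2) for m1, m2 in product("CD", repeat=2)
            if policy1(m2) == m1 and policy2(m1) == m2]

stable = stable_outcomes(conditional_cooperator, conditional_cooperator)
# Keep the stable outcomes that are at least as good as every other stable outcome for both players.
best_for_both = [o for o in stable
                 if all(PAYOFF[o][p] >= PAYOFF[alt][p] for alt in stable for p in (0, 1))]
print(stable, best_for_both)
```

In this encoding both (C, C) and (D, D) are stable, and (C, C) is the one each player prefers among the stable options, while (D, C) never appears because it is not stable under this policy.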

(I know it's somewhat inconclusive, couldn't find time to pinpoint it better given a time limit, but I hope one can construct a better argument from the corpse of this one.)