In reply to:

Patrick, congrats on writing this up! It's nice to see MIRI step up its game.

Will two PrudentBots cooperate if they're using theories of different strength?

Yes! We checked that both formally and using the fixed-point evaluator. I should add that to the draft.

EDIT: Whoops, it's a little more complicated than I remember. They find mutual cooperation iff the system PrudentBot1 uses to prove (its opponent defects against DefectBot) is strong enough to prove the consistency of the system PrudentBot2 uses to prove (its opponent cooperates with it), and vice versa.

Robust Cooperation in the Prisoner's Dilemma

I'm proud to announce the preprint of Robust Cooperation in the Prisoner's Dilemma: Program Equilibrium via Provability Logic, a joint paper with Mihaly Barasz, Paul Christiano, Benja Fallenstein, Marcello Herreshoff, Patrick LaVictoire (me), and Eliezer Yudkowsky.

This paper was one of three projects to come out of the 2nd MIRI Workshop on Probability and Reflection in April 2013, and had its genesis in ideas about formalizations of decision theory that have appeared on LessWrong. (At the end of this post, I'll include links for further reading.)

Below, I'll briefly outline the problem we considered, the results we proved, and the (many) open questions that remain. Thanks in advance for your thoughts and suggestions!

Background: Writing programs to play the PD with source code swap

(If you're not familiar with the Prisoner's Dilemma, see here.)

The paper concerns the following setup, which has come up in academic research on game theory: say that you have the chance to write a computer program X, which takes in one input and returns either Cooperate or Defect. This program will face off against some other computer program Y, but with a twist: X will receive the source code of Y as input, and Y will receive the source code of X as input. And you will be given your program's winnings, so you should think carefully about what sort of program you'd write!

Of course, you could simply write a program that defects regardless of its input; we call this program DefectBot, and call the program that cooperates on all inputs CooperateBot. But with the wealth of information afforded by the setup, you might wonder if there's some program that might be able to achieve mutual cooperation in situations where DefectBot achieves mutual defection, without thereby risking a sucker's payoff. (Douglas Hofstadter would call this a perfect opportunity for superrationality...)

Previously known: CliqueBot and FairBot

And indeed, there's a way to do this that's been known since at least the 1980s. You can write a computer program that knows its own source code, compares it to the input, and returns C if and only if the two are identical (and D otherwise). Thus it achieves mutual cooperation in one important case where it intuitively ought to: when playing against itself! We call this program CliqueBot, since it cooperates only with the "clique" of agents identical to itself.

There's one particularly irksome issue with CliqueBot, and that's the fragility of its cooperation. If two people write functionally analogous but syntactically different versions of it, those programs will defect against one another! This problem can be patched somewhat, but not fully fixed. Moreover, mutual cooperation might be the best strategy against some agents that are not even functionally identical, and extending this approach requires you to explicitly delineate the list of programs that you're willing to cooperate with. Is there a more flexible and robust kind of program you could write instead?

As it turns out, there is: in a 2010 post on LessWrong, cousin_it introduced an algorithm that we now call FairBot. Given the source code of Y, FairBot searches for a proof (of less than some large fixed length) that Y returns C when given the source code of FairBot, and then returns C if and only if it discovers such a proof (otherwise it returns D). Clearly, if our proof system is consistent, FairBot only cooperates when that cooperation will be mutual. But the really fascinating thing is what happens when you play two versions of FairBot against each other. Intuitively, it seems that either mutual cooperation or mutual defection would be stable outcomes, but it turns out that if their limits on proof lengths are sufficiently high, they will achieve mutual cooperation!

The proof that they mutually cooperate follows from a bounded version of Löb's Theorem from mathematical logic. (If you're not familiar with this result, you might enjoy Eliezer's Cartoon Guide to Löb's Theorem, which is a correct formal proof written in much more intuitive notation.) Essentially, the asymmetry comes from the fact that both programs are searching for the same outcome, so that a short proof that one of them cooperates leads to a short proof that the other cooperates, and vice versa. (The opposite is not true, because the formal system can't know it won't find a contradiction. This is a subtle but essential feature of mathematical logic!)

Generalization: Modal Agents

Unfortunately, FairBot isn't what I'd consider an ideal program to write: it happily cooperates with CooperateBot, when it could do better by defecting. This is problematic because in real life, the world isn't separated into agents and non-agents, and any natural phenomenon that doesn't predict your actions can be thought of as a CooperateBot (or a DefectBot). You don't want your agent to be making concessions to rocks that happened not to fall on them. (There's an important caveat: some things have utility functions that you care about, but don't have sufficient ability to predicate their actions on yours. In that case, though, it wouldn't be a true Prisoner's Dilemma if your values actually prefer the outcome (C,C) to (D,C).)

However, FairBot belongs to a promising class of algorithms: those that decide on their action by looking for short proofs of logical statements that concern their opponent's actions. In fact, there's a really convenient mathematical structure that's analogous to the class of such algorithms: the modal logic of provability (known as GL, for Gödel-Löb).

So that's the subject of this preprint: what can we achieve in decision theory by considering agents defined by formulas of provability logic?

More formally (skip the next two paragraphs if you're willing to trust me), we inductively define the class of "modal agents" as formulas using propositional variables and logical connectives and the modal operator  (which represents provability in some base-level formal system like Peano Arithmetic), of the form , where  is fully modalized (i.e. all instances of variables are contained in an expression ), and with each  corresponding to a fixed modal agent of lower rank. For example, FairBot is represented by the modal formula .

When two modal agents play against each other, the outcome is given by the unique fixed point of the system of modal statements, where the variables are identified with each other so that  represents the expression  represents , and the  represent the actions of lower-rank modal agents against  and vice-versa. (Modal rank is defined as a natural number, so this always bottoms out in a finite number of modal statements; also, we interpret outcomes as statements of provability in Peano Arithmetic, evaluated in the model where PA is consistent, PA+Con(PA) is consistent, and so on. See the paper for the actual details.)

The nice part about modal agents is that there are simple tools for finding the fixed points without having to search through proofs; in fact, Mihaly and Marcello wrote up a computer program to deduce the outcome of the source-code-swap Prisoner's Dilemma between any two (reasonably simple) modal agents. These tools also made it much easier to prove general theorems about such agents.

PrudentBot: The best of both worlds?

Can we find a modal agent that seems to improve on FairBot? In particular, we should want at least the following properties:

  • It should be un-exploitable: if our axioms are consistent in the first place, then it had better only end up cooperating when it's mutual.
  • It should cooperate with itself, and also mutually cooperate with FairBot (both are, common-sensically, the best actions in those cases).
  • It should defect, however, against CooperateBot and lots of similarly exploitable modal agents.

It's nontrivial that such an agent exists: you may remember the post I wrote about the Masquerade agent, which is a modal agent that does almost all of those things (it doesn't cooperate with the original FairBot, though it does cooperate with some more complicated variants), and indeed we didn't find anything better until after we had Mihaly and Marcello's modal-agent-evaluator to help us.

But as it turns out, there is such an agent, and it's pretty elegant: we call it PrudentBot, and its modal version cooperates with another agent Y if and only if (there's a proof in Peano Arithmetic that Y cooperates with PrudentBot and there's a proof in PA+Con(PA) that Y defects against DefectBot). This agent can be seen to satisfy all of our criteria. But is it optimal among modal agents, by any reasonable criterion?

Results: Obstacles to Optimality

It turns out that, even within the class of modal agents, it's hard to formulate a definition of optimality that's actually true of something, and which meaningfully corresponds to our intuitions about the "right" decisions on decision-theoretic problems. (This intuition is not formally defined, so I'm using scare quotes.)

There are agents that give preferential treatment to DefectBot, FairBot, or even CooperateBot, compared to PrudentBot, though these agents are not ones you'd program in an attempt to win at the Prisoner's Dilemma. (For instance, one agent that rewards CooperateBot over PrudentBot is the agent that cooperates with Y iff PA proves that Y cooperates against DefectBot; we've taken to jokingly calling that agent TrollBot.) One might well suppose that a modal agent could still be optimal in the sense of making the "right" decision in every case, regardless of whether it's being punished for some other decision. However, this is not the only obstacle to a useful concept of optimality.

The second obstacle is that any modal agent only checks proofs at some finite number of levels on the hierarchy of formal systems, and agents that appear indistinguishable at all those levels may have obviously different "right" decisions. And thirdly, an agent might mimic another agent in such a way that the "right" decision is to treat the mimic differently from the agent it imitates, but in some cases one can prove that no modal agent can treat the two differently.

These three strikes appear to indicate that if we're looking to formalize more advanced decision theories, modal agents are too restrictive of a class to work with. We might instead allow things like quantifiers over agents, which would invalidate these specific obstacles, but may well introduce new ones (and certainly would make for more complicated proofs). But for a "good enough" algorithm on the original problem (assuming that the computer will have lots of computational resources), one could definitely do worse than submit a finite version of PrudentBot.

Why is this awesome, and what's next?

In my opinion, the result of Löbian cooperation deserves to be published for its illustration of Hofstadterian superrationality in action, apart from anything else! It's really cool that two agents reasoning about each other can in theory come to mutual cooperation for genuine reasons that don't have to involve being clones of each other (or other anthropic dodges). It's a far cry from a practical approach, of course, but it's a start: mathematicians always begin with a simplified and artificial model to see what happens, then add complications one at a time.

As for what's next: First, we don't actually know that there's no meaningful non-vacuous concept of optimality for modal agents; it would be nice to know that one way or another. Secondly, we'd like to see if some other class of agents contains a simple example with really nice properties (the way that classical game theory doesn't always have a pure Nash equilibrium, but always has a mixed one). Thirdly, we might hope that there's an actual implementation of a decision theory (TDT, UDT, etc) in the context of program equilibrium.

If we succeed in the positive direction on any of those, we'd next want to extend them in several important ways: using probabilistic information rather than certainty, considering more general games than the Prisoner's Dilemma (bargaining games have many further challenges, and games of more than two players could be more convoluted still), etc. I personally hope to work on such topics in future MIRI workshops.

Further Reading on LessWrong

Here are some LessWrong posts that have tackled similar material to the preprint:

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 3:52 PM
Select new highlight date
All comments loaded

Incidentally, at the MIRI workshop, we soon started referring to the source-code-swap Prisoner's Dilemma between two modal agents as "modal combat".

Or "MODAL COMBAT!", if you prefer.

That's extremely un-Googleable, as I realized when reading about modal combat in a different thread and trying to figure out what the hell that was about.

One downside of having a bot that's too complicated is that it makes the other bot less likely to trust you.

An interesting question is to what extent a similar phenomenon is present in human relationships.

Well, not at all for the literal complexity of agents, because we don't estimate the complexity of our peers. Aristotle thought the heart was the seat of intelligence, Shannon thought AGI could be built in a year, everyone and their mother anthropomorphizes inanimate objects like smoke alarms and printers.

I suspect perceived character traits that engender distrust, the Dark Triad traits, make the trait-possessor seem complex not because their brain must be described in more bits absolutely, but conditionally given the brain of the character judge. That is, we require a larger encoding diff to predict the behavior of people who display unfamiliar desires and intents, or to predict them with comparable accuracy as one does for one's warm, honest, emotionally stable peer group. For example, someone who appears paranoid is displaying extreme caution in situations the character judge finds forthright and nonthreatening, an extra piece of situational context to the other person's decision making.

This is a poor explanation overall because we're much less likely to distrust atypically nice humane people than Machiavellian, sub-psychopath people, even if they're both less conditionally compressible. It takes a lot of niceness (Stepford Wives) before the uncanny-differential-encoding-valley reaction trips.

Edit: This might have been uncharitable. People who are more prone to lying may be more absolutely complex, because lying skillfully requires keeping track of ones lies and building further lies to support them, while honest beliefs can simply be verified against reality. People who decide by a few fixed, stable criteria (e.g. always voting for the nominated candidate of their political party) might be called trustworthy in the sense of being reliable (if not reliably pro-social). Fulfilling promises and following contracts also make one more trustworthy, in both the weak sense of predictability and the stronger sense of moral behavior. Yudkowsky makes the argument that moral progress tends to produce simplified values.

When people make purchasing decisions, pricing models that are too complex make them less likely to purchase. If it's too confusing to figure out whether something is a good deal or not, we generally tend to just assume it's a bad deal. See http://ideas.repec.org/p/ags/ualbsp/24093.html (Choice Environment, Market Complexity and Consumer Behavior: A Theoretical and Empirical Approach for Incorporating Decision Complexity into Models of Consumer Choice), for example.

Wow, MIRI is just churning out [1] math papers now.

[1]: Using this expression entirely in a positive sense.

Actually, "MIRI is just churning out writeups of research which happened over a long preceding time", which I'd emphasize because I think there's some sort of systematic "science as press release bias" wherein people think about science as though it occurs in bursts of press releases because that's what they externally observe. Not aimed at you in particular but seems like an important general observation. We couldn't do twice as much research by writing twice as many papers.

I don't have any background in automated proofs; could someone please outline the practical implications of this, considering the algorithmic complexity involved?

How effective would a real PrudentBot be if written today? What amount of complexity in opponent programs could it handle? How good are proof-searching algorithms? How good are computer-assisted humans searching for proofs about source code they are given?

Are there formal languages designed so that if an agent is written in them, following certain rules with the intent to make its behavior obvious, it is significantly easier (or more probable) for a proof-finding algorithm to analyze its behavior? Without, of course, restricting the range of behaviors that can be expressed in that design.

Algorithmic proof search in general is NP-hard. Special cases can be simpler: i.e. if two modal agents are playing against each other, they could employ Kripke semantics (which may also be NP-hard, but has much smaller computational requirements in practice than searching through all proofs).

(Technically, you'd define a class of ModalPrime agents, which begin with code to see if their opponent is syntactically a ModalPrime agent, then use Kripke semantics to see what the outcome of their modal statement versus the other modal statement would be. Or you could have Omega accept modal statements as inputs, and then Omega would only need to use Kripke semantics to resolve the matches in the tournament.)

Long story short, I wouldn't submit a "search through proofs" modal algorithm into the PD tournament currently being run: it would always run out of time before finding Lobian proofs.

If search is hard, perhaps there is another way.

Suppose an agent intends to cooperate (if the other agent cooperates, etc.). Then it will want the other agent to be able to prove that it will cooperate. If it knew of a proof about itself, it would advertise this proof, so the others wouldn't have to search. (In the original tournament setup, it could embed the proof in its source code.)

An agent can't spend much time searching for proofs about every agent it meets. Perhaps it meets new agents very often. Or perhaps there is a huge amount of agents out there, and it can communicate with them all, and the more agents it cooperates with the more it benefits.

But an agent has incentive to spend a lot of time finding a proof about itself - once - and then it can give that proof to counterparties it wants to cooperate with.

(It's been pointed out that this could give an advantage to families of similar agents that can more easily prove things about one another by using existing knowledge and proofs).

Finally, somewhere there may be someone who created this agent - a programming intelligence. If so, that someone with their understanding of the agent's behavior may be able to create such a proof.

Do these considerations help in practice?

We only need to perform proof search when we're given some unknown blob of code. There's no need to do a search when we're the ones writing the code; we know it's correct, otherwise we wouldn't have written it that way.

Admittedly many languages allow us to be very sloppy; we may not have to justify our code to the compiler and the language may not be powerful enough to express the properties we want. However there are some languages which allow this (Idris, Agda, Epigram, ATS, etc.). In such languages we don't actually write a program at all; we write down a constructive proof that our properties are satisfied by some program, then the compiler derives that program from the proof. Such programs are known as "correct by construction".

http://en.wikipedia.org/wiki/Program_derivation http://en.wikipedia.org/wiki/Proof-carrying_code

Patrick, congrats on writing this up! It's nice to see MIRI step up its game.

Will two PrudentBots cooperate if they're using theories of different strength?

Yes! We checked that both formally and using the fixed-point evaluator. I should add that to the draft.

EDIT: Whoops, it's a little more complicated than I remember. They find mutual cooperation iff the system PrudentBot1 uses to prove (its opponent defects against DefectBot) is strong enough to prove the consistency of the system PrudentBot2 uses to prove (its opponent cooperates with it), and vice versa.

OK, I see. So, unlike FairBots, a PrudentBot using PA and PA+1 won't cooperate with a PrudentBot using PA+1 and PA+2. I wonder if that can be fixed?

I found situation where PrudentBot behaves suboptimally and a more intelligent agent can do better. I'm considering an opponent I call pseudo tit-for-tat at level N, or PsTitTatBot[N]. PsTitTatBot[0] is simply CooperateBot. For N>0, the strategy for PsTitTatBot[N] is to simulate its opponent running against PsTitTatBot[N-1], and do the same thing its opponent did in the simulation. PrudentBot would have an outcome of (D,D) against PsTitTatBot[N] for any N. However, it's easy for an agent to ensure that against sufficiently high level tit-for-tat agent the outcome will be (C,C). To do this, there must be a specific N such that this agent plays suboptimally against PsTitTatBot[N]. However, the benefit of mutual cooperation at higher levels should compensate for that.

Stupid question: are there clear examples of any real-life events that would actually correspond to classic, one-shot Prisoner's Dilemma? I thought that I knew of plenty of examples before, but when I was challenged to name some, all the ones I could think of failed some of the conditions, like being tragedies of the commons rather than PD, or being iterated PD rather than one-shot PD, or the players being able to adapt their plans based on the way the opponent reacts instead of it being a single all-or-nothing choice (arms races are sometimes held up as an example of PD, but there both sides have access to intelligence reports on the kind of strategy that the enemy seems to be adopting, and can adjust their strategy accordingly).

Of course, one can always construct hypothetical scenarios in a way that makes them classic one-shot PD by construction, but being unable to name any clear real-life examples would seem to suggest that this particular scenario isn't that interesting to focus on.

The point of using the one-shot Prisoner's Dilemma (and variants like this one) isn't actually that it's a realistic approximation of real-life phenomena, so much as that it's one of the simplest decision-theory problems that standard game theory looks suboptimal for, and so it's a good proving ground for further development. (Think of it like the assumption that objects in kinetic physics are frictionless.)

Similarly, game theory started with assumptions of perfect common knowledge of payoff matrices, which didn't model real-life situations all that well, but the theory developed there was later extended to more realistic setups in politics and business.

That being said, I'm also curious to know if there are real-world examples that model one-shot Prisoner's Dilemmas reasonably well.

Suppose we decide that proving systems are too (NP) hard and we want to restrict agents to polynomial time. If an agent has access to a random coin, can it interactively prove that its opponent is cooperative?

I stumbled across an interesting class of agent while attempting to design a probabilistically proving bot for AlexMennen's contest. The agent below always cooperates with opponents who always cooperate back, (almost) always defects against opponents who always defect against them, and runs in probabilistic polynomial time if its opponent does as well.

def mimic_bot(opponent):
    if random() > ε:
        my_source = quine_my_source()
        return eval(opponent, my_source)      # do as your opponent does to you
    else:
        # unlikely, but necessary to break out of recursive simulations in polynomial time
        return COOPERATE

This particular agent is a bit naïve, but more sophisticated versions (e.g ones that do not cooperate with CooperateBot) are possible. Does anyone have any insight on probabilistically proving opponents will cooperate?

Sounds awesome, great job !

But there is something I didn't get : what does Con() mean in PA+Con(PA) ? Maybe it's stupid question and everyone is supposed to know, but I don't remember ever encountering that symbol, and I can't find it's meaning, neither on Wikipedia nor on mathworld...

It means consistent. PA + Con(PA) means PA together with the axiom that PA is consistent (which can be expressed in PA because PA can express "PA proves X," so "PA is consistent" can be expressed as "PA does not prove a contradiction").

Actually, "not proving a falsehood" is not the same as being consistent; assuming that PA is consistent, the theory PA+~Con(PA) is also consistent, but proves the false statement ~Con(PA). Consistency is the weaker condition of not proving both a formula and its negation.

I should have said "contradiction"; edited. I intended "falsehood" to mean "false in all models," not "false in the standard model."

Stupid question... Can one construct ImpudentBot, specifically designed to defeat the analysis done by PrudentBot, say, by forcing it to never halt or something? Or PrudeBot, which resists all attempts at uncovering its true appearance?

PrudentBot halts on all inputs. More accurately, I should say, the modal form of PrudentBot has an output that's always decidable in PA+2, while the algorithm version only looks for proofs up to a certain pre-specified length, and thus always halts. In either case, PrudentBot defects by default if it cannot prove anything.

So ImpudentBot is not possible, and PrudeBot earns a big fat D from PrudentBot.

Incidentally, this is the big distinction between "programs that try and run other programs' code" and "programs that try to prove things about other programs' output": Löb's Theorem allows you to avoid the danger of infinite regress.

I'm coming to this quite late, and these are the notes I wrote as I read the paper, before reading the comments, followed by my notes in the comments.

Has any of the (roughest) analysis been done to bound proving time for PrudentBot? It should be fairly trivial to get very bad bounds and if those are not so bad that seems like a worthwhile thing to know (although what does "not so bad" really mean I guess is a problem.)

Is there a good (easy) reference for the statement about quining in PA (on page 6 below CliqueBot)? Under modal agents Kleene's recursion algorithm is mentioned; should probably be mentioned at first use rather than second?

The proof of theorem 5.1 is losing me where other proofs didn't; I don't understand how minimality of modal rank gives other results.

Reading the comments to answer some of my own questions: orthonormal seems to believe that PrudentBot couldn't be implemented for the LessWrong PD competition, although he did say with algorithmic proof search, would he change his opinion using Kripke semantics?

Vaniver's comment seems like a promising way of thinking about how to identify TrollBots. Or at least, the most promising thing about it I've seen; has this been pursued any further? Perhaps as a possibly-answerable question: call a bot a troll if it can't survive in a population with only itself or only itself and CB and DB (since these are the simplest bots to describe). I'll also note that PseudoTitForTatBot seems like an interesting basic candidate to play against; or say "TimeLimitTitForTatBot" which plays tit for tat until the time limit is near, so no single N can always fool it. Is there a way to (modally; if that is the right word for not-by-quoting-source-code) defeat TLTFTB?

Overall I still have no understanding of theorem 5.1 though. I'm not terribly familiar with the field in general but the other proofs were still fairly straightforward, whereas this proof loses me in the first sentence, without referencing a result I can look up either inside or outside of the paper. Was this written by a different person/at a different time from the rest of the paper? The second sentence is less bad than the first but still worse than the entire rest of the paper.

Haskell code to run PrudentBot (quickly) and the other bots (also quickly) can be found on github here:

https://github.com/klao/provability/blob/master/modal.hs

The nice part about modal agents is that there are simple tools for finding the fixed points without having to search through proofs; in fact, Mihaly and Marcello wrote up a computer program to deduce the outcome of the source-code-swap Prisoner's Dilemma between any two (reasonably simple) modal agents. These tools also made it much easier to prove general theorems about such agents.

Would it be possible to make this program publicly available? I'm curious about how certain modal agents play against each other, but struggling to caculate it manually.