AI cooperation in practice

You know that automated proof verifiers exist, right? And also that programs can know their own source code? Well, here's a puzzle for you:

Consider a program A that knows its own source code. The algorithm of A is as follows: generate and check all possible proofs up to some huge size (3^^^^3). If A finds a proof that A returns 1, it returns 1. If the search runs to the end and fails, it returns 0. What will this program actually return?

Wait, that was the easy version. Here's a harder puzzle:

Consider programs A and B that both know their own, and each other's, source code. The algorithm of A is as follows: generate and check all proofs up to size 3^^^^3. If A finds a proof that A returns the same value as B, it returns 1. If the search fails, it returns 0. The algorithm of B is similar, but possibly with a different proof system and different length limit. What will A and B return?

This second puzzle is a formalization of a Prisoner's Dilemma strategy proposed by Eliezer: "I cooperate if and only I expect you to cooperate if and only if I cooperate". So far we only knew how to make this strategy work by "reasoning from symmetry", also known as quining. But programs A and B can be very different - a human-created AI versus an extraterrestrial crystalloid AI. Will they cooperate?

I may have a tentative proof that the answer to the first problem is 1, and that in the second problem they will cooperate. But: a) it requires you to understand some logic (the diagonal lemma and Löb's Theorem), b) I'm not sure it's correct because I've only studied the subject for the last four days, c) this margin is too small to contain it. So I put it up here. I submit this post with the hope that, even though the proof is probably wrong or incomplete, the ideas may still be useful to the community, and someone else will discover the correct answers.

Edit: by request from Vladimir Nesov, I reposted the proofs to our wiki under my user page. Many thanks to all those who took the time to parse and check them.

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 3:51 PM
Select new highlight date
All comments loaded

I notice a curious pattern in this post: technically illiterate critical comments get voted up to +1/+2 (even after getting debunked), while technically accurate arguments (replies) stay at 0/+1. Please vote up only things you know you understand, and do vote up comments that debunk incorrect arguments.

ETA: Mostly corrected now.

I wanted to say that yesterday, but chose instead to stay silent on the meta-issue, try as best I can to clarify the technical issues (though having to explain that there are actual technical issues was a nasty surprise for me), and refrain from downvoting illiterate critiques because I could be biased.

So thanks for saying what I wanted to say - it sounds better when someone else says it.

Cousin_it, do you have an argument that n1 ... n6 < nmax? It seems like one way for the proof to fail is if no matter what nmax is, step 11 forces n1 > nmax.

Only a very rough one, I'm afraid. The number n1 is chosen to "contain" the blow-up from the short proof of 10 to the proof of 11, which grows with the computational complexities of n1,...,nmax, not their values. These computational complexities can be made way smaller than n1.

Ok, I was checking if there was an obvious flaw that you might have missed. Now I'm trying to actually understand your proof, and I'm getting stuck on the first step, where you invoke the Diagonal Lemma. Now a typical statement of the Diagonal Lemma is:

Diagonal Lemma: If T is a theory in which diag is representable, then for any formula B(x) with exactly one free variable x there is a formula G such that G <=> B(g(G)) is a theorem in T. (g(G) is the Gödel number of G. This was copied from page 2 of http://www.cs.cornell.edu/courses/cs4860/2009sp/lec-23.pdf)

If you look at page 3 of that PDF, it gives an explicit formula for G, which happens to contain an existential quantifier. Now my question is, how do I translate that G into a code string (the one named "box" in your proof)? Also, is this supposed to be obvious, or are you assuming some background knowledge beyond the standard expositions of the Diagonal Lemma and Löb's Theorem?

You can build "box" by quining, I think. Like this:

var s = "...";
return implies(
           proves(s, n1),
           eval(outputs(1)));

Where s contains the source code of the entire snippet. I'd kinda assumed that quining and the diagonal lemma were the same thing, I guess that was wrong.

Quining is implied by the diagonal lemma, and but you can use the diagonal lemma like you've written. See this webpage on quines to understand the distinction. Wei Dai, it's easy to convert between code strings and Gödel numbers, at least in theory. I don't know about your particular example's numbering, because that link isn't working for me. You may want to find something specific to complexity theory, instead of math, which I suspect you (personally) would find more understandable. The quine page has a brief intro, which includes a proof of the fixed-point theorem.

In this case I think we have B(x) = implies(proves(x, n1), eval(outputs(1))) which is fine, but doesn't bound the size of BOX. However, cousin_it specifies a way to keep BOX small using quining, so this proof seems legit to me. Its conclusion is just as unintuitive as Lob's Theorem, but the steps all look all right.

Edit: For an explicit way to compute fixed points, see the Y combinator.

Actually, someone shot down my proposed my.C = (my.C == your.C) algorithm, because if two prisoners both defect, that is in a mathematical sense "cooperating if and only if you cooperate". So if the other prisoner is Defection Rock, then my.C = (my.C == your.C) has no consistent solution.

In the first version, there are different ways to go about programming algorithm A, and some will output 1, and some will output 0. In other words, "if A finds a proof that A returns 1" is ambiguous, because it depends on the exact value of A, not the generic description given.

In the second version, both programs output 1, since this is the only possible consistent result. If either program might output a 0, then both programs would have to, which implies that both should have output 1.

Edit: The following is not obviously possible, see this comment.

This can be generalized to provide even better algorithms for more general games. Let the following be a simplified outline of the algorithm for cooperation problem, with bounds omitted (players are A and B, our player is A):

function main1() {
    if (proves("A()==B()"))
        return "Cooperate";
    return "Defect";
}

The proof that the outcome is "Cooperate" basically hinges on the provability of Löb statement

proves("A()==B()") => (A()==B())

from which we conclude that A()==B() and so A()=="Cooperate". The statement can be constructed from

proves("A()==B()") => (A()=="Cooperate")
proves("A()==B()") => (B()=="Cooperate")

and so

proves("A()==B()") => (A()==B())

Now, we can get basically the same proof if we use a simpler algorithm for the players:

function main2() {
    if (proves("A()=="Cooperate" && B()=="Cooperate""))
        return "Cooperate";
    return "Defect";
}

This gets rid of the more general relation, but achieves basically the same result. We can improve on this by defecting against cooperating rocks:

function main3() {
    if (proves("A()=="Defect" && B()=="Cooperate""))
        return "Defect";
    if (proves("A()=="Cooperate" && B()=="Cooperate""))
        return "Cooperate";
    return "Defect";
}

This suggests a general scheme for construction of decision-making algorithms: for all possible actions a and b of players A and B, sort the pairs by utility for A, starting from highest, and implement the Löb algorithm for each possibility in turn:

function main4() {
    for(action pairs <a,b> in order of descending utility for A) {
        if (proves("A()==a && B()==b"))
            return a;
    }
    return something;
}

This plays against arbitrary opponents quite well.

function main3() {
     if (proves("A()=="Defect" && B()=="Cooperate""))
         return "Defect";
     if (proves("A()=="Cooperate" && B()=="Cooperate""))
         return "Cooperate";
     return "Defect";
 }

Sorry if I'm being naive, but if we amend the program that way then doesn't it lose the ability to prove the Löb statement implies(proves(P), P) where P = "A()=='Cooperate' && B()=='Cooperate'"? (Due to the unprovability of unprovability.)

According to my experience, the study of game-playing programs has two independent components: fairness/bargaining and enforcement. If you have a good method of enforcement (quining, löbbing, Wei Dai's joint construction), you can implement almost any concept of fairness. Even though Freaky Fairness relied on quining, I feel you can port it to löbbing easily enough. But fairness in games with non-transferable utility is so difficult that it should really be studied separately, without thinking of a specific enforcement mechanism.