A model of UDT with a halting oracle

This post requires some knowledge of mathematical logic and computability theory. The basic idea is due to Vladimir Nesov and me.

Let the universe be a computer program U that can make calls to a halting oracle. Let the agent be a subprogram A within U that can also make calls to the oracle. The source code of both A and U is available to A.

Here's an example U that runs Newcomb's problem and returns the resulting utility value:

  def U():
    # Fill boxes, according to predicted action.
    box1 = 1000
    box2 = 1000000 if (A() == 1) else 0
    # Compute reward, based on actual action.
    return box2 if (A() == 1) else (box1 + box2)

A complete definition of U should also include the definition of A, so let's define it. We will use the halting oracle only as a provability oracle for some formal system S, e.g. Peano arithmetic. Here's the algorithm of A:

  1. Play chicken with the universe: if S proves that A()≠a for some action a, then return a.
  2. For every possible action a, find some utility value u such that S proves that A()=a ⇒ U()=u. If such a proof cannot be found for some a, break down and cry because the universe is unfair.
  3. Return the action that corresponds to the highest utility found on step 2.

Now we want to prove that the agent one-boxes, i.e. A()=1 and U()=1000000. That will follow from two lemmas.

Lemma 1: S proves that A()=1 ⇒ U()=1000000 and A()=2 ⇒ U()=1000. Proof: you can derive that from just the source code of U, without looking at A at all.

Lemma 2: S doesn't prove any other utility values for A()=1 or A()=2. Proof: assume, for example, that S proves that A()=1 ⇒ U()=42. But S also proves that A()=1 ⇒ U()=1000000, therefore S proves that A()≠1. According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent unsound (thx Misha). So if S is sound, that can't happen.

We see that the agent defined above will do the right thing in Newcomb's problem. And the proof transfers easily to many other toy problems, like the symmetric Prisoner's Dilemma.

But why? What's the point of this result?

There's a big problem about formalizing UDT. If the agent chooses a certain action in a deterministic universe, then it's a true fact about the universe that choosing a different action would have caused Santa to appear. Moreover, if the universe is computable, then such silly logical counterfactuals are not just true but provable in any reasonable formal system. When we can't compare actual decisions with counterfactual ones, it's hard to define what it means for a decision to be "optimal".

For example, one previous formalization searched for formal proofs up to a specified length limit. Problem is, that limit is a magic constant in the code that can't be derived from the universe program alone. And if you try searching for proofs without a length limit, you might encounter a proof of a "silly" counterfactual which will make you stop early before finding the "serious" one. Then your decision based on that silly counterfactual can make it true by making its antecedent false... But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.

In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system. The agent will always return whatever action is proved by the formal system to be optimal, if such an action exists. This notion of optimality matches our intuitions even though the universe is still perfectly deterministic and the agent is still embedded in it, because the oracle ensures that determinism is just out of the formal system's reach.

P.S. I became a SingInst research associate on Dec 1. They did not swear me to secrecy, and I hope this post shows that I'm still a fan of working in the open. I might just try to be a little more careful because I wouldn't want to discredit SingInst by making stupid math mistakes in public :-)

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 3:54 PM
Select new highlight date
All comments loaded

The basic idea is due to Vladimir Nesov and me.

(In the setting where there were no oracles, this problem was generally discussed on the decision theory list, with various conditions used to prove when intended moral arguments [A=A1=>U=U1] will in fact be proven by the agent (are agent-provable, not just provable). The "chicken rule" condition used here I figured out in explicit form sometime in April ("To avoid confusion, immediately perform any action that implies absurdity"), though I'm not sure anyone published an actual proof using this condition. A few days ago, cousin_it posted a thought experiment involving a Turing oracle, and I pointed out that using the chicken rule in this setting elicits a much nicer result.)

That's an accurate account, thanks for posting it here! I'm always worried about coming off as though I'm grabbing priority...

Thanks for writing this up. I'm currently trying to repurpose my failed unpublished post draft from August (on acausal dependencies) to this result, to give a more systematic account of the problem it addresses, and describe the no-oracles case in more detail. Hope to get it done in about a week.

Edit: And posted, 2.5 months late.

This is one of the most interesting posts I've seen on Less Wrong. It is a non-trivial result and it goes a long way towards dealing with a lot of the problems with UDT. I'm deeply impressed and wish I had some helpful comment to make other than general praise.

Congratulations! This is a key step forward. And also, congrats to the SIAI for getting both you and Vladimir Nesov as official researchers.

Thanks!

In case anyone's interested: I'm not a paid researcher and don't want to become one, as long as I can support myself by programming.

Congratulations! Do you mind an off-topic question? I see you work at Google. What's Google's attitude toward such extracurricular activities? Obviously, I am not asking about specific negotiations, but about written and unwritten company rules. For example, can you dedicate Google company time to SingInst work? (I am thinking of the famous Google 20% rule here.)

The answer is refreshingly boring: I can do this stuff, but not on company time. 20% time is supposed to be used for Google projects.

Sorry, I just had this image:

cousin_it: I want to use my 20% time to prevent the extermination of humanity.
Google overlord: Okay, and this would help Google ... how, exactly?

(I know, I know, oversimplification.)

But the bigger problem is that we can't say exactly what makes a "silly" counterfactual different from a "serious" one.

Would it be naive to hope for a criterion that roughly says: "A conditional P ⇒ Q is silly iff the 'most economical' way of proving it is to deduce it from ¬P or else from Q." Something like: "there exists a proof of ¬P or of Q which is strictly shorter than the shortest proof of P ⇒ Q"?

A totally different approach starts with the fact that your 'lemma 1' could be proved without knowing anything about A. Perhaps this could be deemed a sufficient condition for a counterfactual to be serious. But I guess it's not a necessary condition?

Both these approaches have been proposed on the workshop list. Good job figuring them out so quickly!

Would it be naive to hope for a criterion that roughly says: "A conditional P ⇒ Q is silly iff the 'most economical' way of proving it is to deduce it from ¬P or else from Q." Something like: "there exists a proof of ¬P or of Q which is strictly shorter than the shortest proof of P ⇒ Q"?

I can make such a criterion fall into nasty Loebian traps by maliciously tweaking the formal system to make some proofs longer than others. That means any proof of correct behavior (like one-boxing) must rely on the intimate details of the proof enumeration order, but we have no idea how to talk formally about such things.

A totally different approach starts with the fact that your 'lemma 1' could be proved without knowing anything about A. Perhaps this could be deemed a sufficient condition for a counterfactual to be serious. But I guess it's not a necessary condition?

A doesn't necessarily get the code of U neatly factored into A and everything else. The agent has to find copies of itself in the universe, it doesn't get told the positions of all copies explicitly. Note that if we replace the U in the post with some other U' that can be proven equivalent to U by S, then A can notice that equivalence, unscramble the code of U' into U, and win.

Is this the first time an advanced decision theory has had a mathematical expression rather than just a verbal-philosophical one?

This totally deserves to be polished a bit and published in a mainstream journal.

Is this the first time an advanced decision theory has had a mathematical expression rather than just a verbal-philosophical one?

That's a question of degree. Some past posts of mine are similar to this one in formality.

Nesov also said in an email on Jan 4 that now we can write this stuff up. I think Wei and Gary should be listed as coauthors too.

In contrast, the new model with oracles has a nice notion of optimality, relative to the agent's formal system.

Specifically, given any formal system S for reasoning about the world and agent's place in it, the chicken rule (step 1) forces S to generate consistent theories of consequences for all possible actions. This seems to crack a long-standing problem in counterfactual reasoning, giving a construction for counterfactual worlds (in form of consistent formal theories) from any formal theory that has actual world as a model.

...and the construction turns out not as interesting as I suspected. Something like this is very easy to carry out by replacing the agent A with another that can't be understood in S, but is equivalent to A (according to a system stronger than S). As a tool for understanding decision problems, this is intended to solve the problem of parsing the world in terms of A, finding how it depends on A, finding where A is located in the world, but if we can find all instances of A in the world to perform such surgery on them, we've already solved the problem!

Perhaps A can decide to make itself incomprehensible to itself (to any given S, rather), thus performing the transformation without surgery, formalization of free will by an act of then-mysterious free will? This could still be done. But it's not clear if this can be done "from the outside", where we don't have the power of making A transform to make the dependence of the world on its actions clear.

According to the first step of the algorithm, A will play chicken with the universe and return 1, making S inconsistent. So if S is consistent, that can't happen.

Is this right? I'm wondering if you're assuming soundness relative to the natural semantics about A, because in step one, it isn't clear that there is a contradiction in S rather than a failure of S to be sound with respect to the semantics it's supposed to model (what actions A can take and their utility). I might be confused, but wouldn't this entail contradiction of the soundness of S rather than entailing that S is inconsistent? S would only be inconsistent if it can prove both A() = a and A()≠a, but unless you have further hidden assumptions about S I don't see why A returning a would entail that S proves A() = a.

This is how I want to interpret this: S is some deduction system capable of talking about all actions A() can make, and proving some range of utility results about them. S is also consistent and sound.

Play chicken with the universe: if S proves that A()≠a for some action a, then return a.

If S proves for all a there is some u such that [A() = a ⇒ U() = u] , output argmax (a) else exit and output nothing.

Since proving A can't take an action a ( that is, A()≠a ) entails that S is not sound (because A will take such an action in step 1), S can't prove any such result. Also, since proving that an action has two distinct utility values leads to A≠a, the soundness and consistency of S entails that this can't happen. Does this seem right?

Also, step two seems too strong. Wouldn't it suffice to have it be:

1) For all actions a, if there is some u such that [A() = a ⇒ U() = u] and u > 0, (), else add [A() = a ⇒ U() = 0] to the axioms of S (only for the duration of this decision)

2) output argmax (a)

My thought is that there could be some possible actions the agent can take that might not have provable utility and it seems like you should assign an expected utility of 0 to them (no value being privileged, it averages out to 0), but if you can prove that at least one action has positive utility, then you maximize expected utility by choosing the one with the highest positive utility.

This is weaker than the current step two but still seems to have the desired effect. Does this hold water or am I missing something?

I'm wondering if you're assuming soundness relative to the natural semantics about A

You're right. Misha already pointed that out and I edited the post.

For all actions a, if there is some u such that [A() = a ⇒ U() = u] and u > 0, else add [A() = a ⇒ U() = 0] to the axioms of S

Yeah, such a variation would work, but I'm uncomfortable calling A the optimal algorithm in a universe where some actions don't have provable utility. Such universes often arise from non-symmetric multiplayer games and I'd rather not insinuate that I have a good solution for those.