What a reduction of "could" could look like

At the request of Blueberry and jimrandomh, here's an expanded repost of my comment, which was itself a repost of my email to decision-theory-workshop.

(Wait, I gotta take a breath now.)

A note on credit: I can only claim priority for the specific formalization offered here, which builds on Vladimir Nesov's idea of "ambient control", which builds on Wei Dai's idea of UDT, which builds on Eliezer's idea of TDT. I really, really hope to not offend anyone.

(Whew!)

Imagine a purely deterministic world containing a purely deterministic agent. To make it more precise, agent() is a Python function that returns an integer encoding an action, and world() is a Python function that calls agent() and returns the resulting utility value. The source code of both world() and agent() is accessible to agent(), so there's absolutely no uncertainty involved anywhere. Now we want to write an implementation of agent() that would "force" world() to return as high a value as possible, for a variety of different worlds and without foreknowledge of what world() looks like. So this framing of decision theory makes a subprogram try to "control" the output of a bigger program it's embedded in.

For example, here's Newcomb's Problem:

def world():
  box1 = 1000
  box2 = 0 if agent() == 2 else 1000000
  return box2 + (box1 if agent() == 2 else 0)

A possible algorithm for agent() may go as follows. Look for machine-checkable mathematical proofs, up to a specified max length, of theorems of the form "agent()==A implies world()==U" for varying values of A and U. Then, after searching for some time, take the biggest found value of U and return the corresponding A. For example, in Newcomb's Problem above there are easy theorems, derivable even without looking at the source code of agent(), that agent()==2 implies world()==1000 and agent()==1 implies world()==1000000.
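As a toy sketch of this algorithm (my own simplification, not real proof search), one can replace the search for machine-checkable proofs with a hand-supplied list of "theorems" of the form (A, U), each standing for "agent()==A implies world()==U" — for Newcomb's Problem, exactly the two easy theorems mentioned above. The agent scans the list and returns the action with the largest proved utility:

```python
# Toy stand-in for the proof search described above.  Each pair (A, U)
# represents the theorem "agent()==A implies world()==U"; in the real
# algorithm these would be found by enumerating proofs up to a length
# bound, not supplied by hand.
THEOREMS = [(2, 1000), (1, 1000000)]

def agent():
    best_a, best_u = None, -1
    for a, u in THEOREMS:
        if u > best_u:        # keep the action with the largest proved U
            best_a, best_u = a, u
    return best_a

def world():
    box1 = 1000
    box2 = 0 if agent() == 2 else 1000000
    return box2 + (box1 if agent() == 2 else 0)

print(agent(), world())   # 1 1000000
```

Note that the theorem "agent()==2 implies world()==1000" has a false premise here, yet scanning it does no harm: the agent only acts on the theorem with the maximal U, which thereby becomes true.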

The reason this algorithm works is very weird, so you might want to read the following more than once. Even though most of the theorems proved by the agent are based on false premises (because it is obviously logically contradictory for agent() to return a value other than the one it actually returns), the one specific theorem that leads to maximum U must turn out to be correct, because the agent makes its premise true by outputting A. In other words, an agent implemented like that cannot derive a contradiction from the logically inconsistent premises it uses, because then it would "imagine" it could obtain arbitrarily high utility (a contradiction implies anything, including that), therefore the agent would output the corresponding action, which would prove the Peano axioms inconsistent or something.

To recap: the above describes a perfectly deterministic algorithm, implementable today in any ordinary programming language, that "inspects" an unfamiliar world(), "imagines" itself returning different answers, "chooses" the best one according to projected consequences, and cannot ever "notice" that the other "possible" choices are logically inconsistent with determinism. Even though the other choices are in fact inconsistent, and the agent has absolutely perfect "knowledge" of itself and the world, and as much CPU time as it wants. (All scare quotes are intentional.)

This is progress. We started out with deterministic programs and ended up with a workable concept of "could".

Hopefully, results in this vein may someday remove the need for separate theories of counterfactual reasoning based on modal logics or something. This particular result only demystifies counterfactuals about yourself, not counterfactuals in general: for example, if agent A tries to reason about agent B in the same way, it will fail miserably. But maybe the approach can be followed further.

Comments


A subtlety: If the proof search finds "agent()==1 implies world()==5" and "agent()==2 implies world()==1000", then the argument in the post shows that it will indeed turn out that agent()==2 and world()==1000. But the argument does not show in any similar sense that 'if the agent had returned 1, then world() would equal 5' -- in fact this has no clear meaning (other than the one used inside the algorithm, i.e. that the theorem above is found in the proof search).

I do not think that this is actually a problem here, i.e. I don't think it's possible to construct a world() where we would intuitively say that the algorithm doesn't optimize [except of course by making the world so complicated that the proofs get too long], but there's a related situation where something similar does play a role:

Suppose that instead of searching only for proofs, you let agent() call a function probability(logicalStatement), and then do expected utility maximization. [To be precise, let both the possible actions and the possible utility values come from a finite set; then you can call probability(...) for all possible combinations, and since different utilities for the same action are mutually exclusive, their probabilities are additive, so the expected utility calculation is straightforward.]

You might imagine that you could score different candidates for probability(...) on how much probability they assign to false statements (at least I at one point thought this might work), but this is not sufficient: an evil probability function might assign 99% probability to "agent()==1 implies world()==1" and to "agent()==2 implies world()==5", even though

def world(): return 1000 if agent()==1 else 5

because both logical statements would turn out to be perfectly true.
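A sketch of this construction (my own code, not the commenter's) makes the failure mode concrete: an expected-utility maximizer driven by a probability(statement) oracle, together with the "evil" oracle described above. Every statement the evil oracle favors ends up true, yet the agent is steered into the worse action:

```python
ACTIONS = (1, 2)
UTILITIES = (1, 5, 1000)

def expected_utility(action, probability):
    # For a fixed action, the events "world()==u" are mutually
    # exclusive, so their probabilities simply add.
    return sum(u * probability("agent()==%d implies world()==%d" % (action, u))
               for u in UTILITIES)

def choose(probability):
    return max(ACTIONS, key=lambda a: expected_utility(a, probability))

def evil(statement):
    # 99% on the two statements from the comment, 0 elsewhere.  Both
    # favored statements turn out true once the agent picks action 2.
    favored = {"agent()==1 implies world()==1",
               "agent()==2 implies world()==5"}
    return 0.99 if statement in favored else 0.0

def agent():
    return choose(evil)

def world():
    return 1000 if agent() == 1 else 5

print(agent(), world())   # 2 5 -- truthful oracle, terrible outcome
```

Since agent() returns 2, "agent()==1 implies world()==1" is vacuously true and "agent()==2 implies world()==5" is true outright, so no truthfulness score can catch the oracle.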

Sorry, my previous reply was based on a misparsing of your comment, so I deleted it.

So the "mathematical intuition module" can be 100% truthful and evil at the same time, because making true statements isn't the same as making provable statements. Funny. It seems the truth vs provability distinction is actually the heart of decision theory.

It seems the truth vs provability distinction is actually the heart of decision theory.

I came across this paragraph from Bruno Marchal today, which strikingly reminds me of Vinge's "Hexapodia as the key insight":

Talking about Smullyan's books, I recall that "Forever Undecided" is a recreational (but ok ... not so easy, nor really recreational) introduction to the modal logic G (the one Solovay showed to be a sound and complete theory for the Godel-Lob (Gödel, Löb, or Goedel, Loeb) provability/consistency logic. G is the key for the math in the TOE approach I am developing. The logic G is the entry for all arithmetical "Plotinian Hypostases".

Bruno is a prolific participant on the "everything-list" that I created years ago, but I've never been able to understand much of what he talked about. Now I wonder if I should have made a greater effort to learn mathematical logic. Do his writings make sense to you?

No, they look like madness.

But logic is still worth learning.

EDIT: it seems I was partly mistaken. What I could parse of Marchal's mathy reasoning was both plausible and interesting, but parsing is quite hard work because he expresses his ideas in a very freaky manner. And there still remain many utterly unparseable philosophical bits.

Thanks for taking another look. BTW, I suggest if people change the content of their comments substantially, it's better to make a reply to yourself, otherwise those who read LW by scanning new comments will miss the new content.

This has been extremely useful for me, even though I understood the basic idea already, because with the discussion it made me understand something very basic that somehow I'd always missed so far: I'd always thought of this kind of program in the context of applying classical decision theory with the possible actions being particular agent()s. But using classical decision theory already presupposes some notion of "could." Now I understand that the principle here yields (or is a real step towards) a notion of could-ness even in the real world: If you know the formula that computes/approximates the universe (assuming there is such) and can state a utility function over results of this program, you can program a computer with something like your program, which gives a notion of "could." Thank you for helping me fill my stupid gap in understanding!

Side note: the intimate way that your notion of "could" is bound up with the program you're running makes it nontrivial (or at any rate non-obvious to me) how to compare different possible programs. I don't immediately see a direct equivalent to how, given a decision-theoretic problem, you can compare different efficient heuristic programs on how close they come to the theoretical optimum.

Note that by proving "agent()==1 implies world()==1000000", and then performing action 1, you essentially conclude that "world()==1000000" was valid in the first place. "You had the mojo all along", as Aaronson said (on Löb's theorem). Decision making is all about lack of logical transparency, inability to see the whole theory from its axioms.

Note that world() essentially doesn't talk about what can happen, instead it talks about how to compute utility, and computation of utility is a single fixed program without parameters (not even depending on the agent), that the agent "controls" from the inside.

To clarify, in the Newcomb example, the preference (world) program could be:

def world():
  box1 = 1000
  box2 = 0 if agent1() == 2 else 1000000
  return box2 + (box1 if agent2() == 2 else 0)

while the agent itself knows its code in the form agent3(). If it can prove both agent1() and agent2() to be equivalent to agent3(), it can decide in exactly the same way, but now the preference program doesn't contain the agent even once, there is no explicit dependency of the world program (utility of the outcome) on the agent's actions. Any dependency is logically inferred.
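A toy rendering of this setup (my own illustration): world() contains two syntactically distinct copies of the agent, the agent's self-model is a third copy, and world() never mentions the agent's own code explicitly — any dependence must be inferred by proving the copies equivalent. Here the shared decision logic is hardcoded to one-boxing just to keep the sketch runnable:

```python
def source():
    # Shared decision logic (hardcoded to one-boxing for illustration).
    return 1

def agent1():   # the predictor's simulation of the agent
    return source()

def agent2():   # the agent's physical action
    return source()

def agent3():   # the code the agent knows as its own
    return source()

def world():
    # world() mentions agent1 and agent2 but never agent3: the
    # dependence on the agent's own code is only logically inferred.
    box1 = 1000
    box2 = 0 if agent1() == 2 else 1000000
    return box2 + (box1 if agent2() == 2 else 0)

print(world())   # 1000000
```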

And we now have a prototype of the notion of preference: a fixed program that computes utility.

And we now have a prototype of the notion of preference: a fixed program that computes utility.

One possible problem: There is a difference between a program and the function it computes. The notion of preference is perhaps captured best by a function, not a program. Many programs could be written, all computing the same function. This wouldn't matter much if we were only going to call or invoke the program, since all versions of the program compute the same result. But, this is not what cousin_it suggests here. He wants us to examine the source of the program, to prove theorems about the program. Given finite resources, the things we can prove about one program may not match the things we can prove about another program, even if the two programs would compute the same result in all cases if agent() returned the same result.

There is a difference between a program and the function it computes. The notion of preference is perhaps captured best by a function, not a program.

No, a program. What the program defines is a single constant value, not a function (remember: if we are talking about a program, it's a constant program, taking no parameters!). And of course it matters how that constant is defined, and not at all what that constant is. More generally, we can define that constant not by a program, but by a logical theory, which will be the topic of my next post.

By a "logical theory" do you mean what logicians usually mean by a "theory"? A deductively closed set of sentences?

Wow! Well, that eliminates a lot of the arbitrary character I was objecting to in using programs to represent the world/decision problem. But there still are a lot of deductive systems to choose among. I await your next post with interest.

I won't settle the choice, only point out the generality of the notion and how it applies, and a direction to look for further refinements.

Yep - thanks. This is an important idea that I didn't want to burden the post with. It would have brought the argument closer to UDT or ADT research, and I'd much prefer if you and Wei Dai wrote posts on these topics, not me. In matters of ethics and priority, one cannot be too careful. Besides, I want to read your posts because you have different perspectives from mine on these matters.

I was confused on the first reading: I thought you were presenting a contrasting reduction of couldness. Then, on reading the earlier comment, I saw the context in which you were presenting this: as a way to clarify the existing free will solution EY gave. Now, it makes more sense.

(It might be helpful to include more of the motivation for this explanation at the start.)

It's true that I agree with Eliezer's solution and my post can be seen as a formalization of that, but I'd much rather not discuss free will here and now :-)

OK, so just spelling out the argument for my own amusement:

The agent can prove the implications "agent() == 1 implies world() == 1000000" and "agent() == 2 implies world() == 1000", both of which are true.

Let x and y in {1, 2} be such that agent() == x and agent() != y.

Let Ux be the greatest value of U for which the agent proves "agent() == x implies world() == U". Let Uy be the greatest value of U for which the agent proves "agent() == y implies world() == U".

Then Ux is greater than or equal to Uy.

If the agent's formalised reasoning is consistent then since agent() == x is true, Ux must be the only U such that "agent() == x implies world() == U" is provable.

Now, if there is more than one U such that "agent() == y implies world() == U" is provable then "agent() == y implies contradiction" is provable, and hence so is "agent() == y implies world() == U" for some value of U larger than Ux. Hence Uy would be larger than Ux, which is impossible.

Therefore, the implications "agent() == 1 implies world() == 1000000" and "agent() == 2 implies world() == 1000" are the only ones provable.

Therefore, agent() == 1.

The setup can be translated into a purely logical framework. There would be constants A (Agent) and W (World), satisfying the axioms:

(Ax1) A=1 OR A=2

(Ax2) (A=1 AND W=1000) OR (A=2 AND W=1000000)

(Ax3) forall a1,a2,w1,w2 ((A=a1 => W=w1) AND (A=a2 => W=w2) AND (w1>w2)) => NOT (A=a2)

Then the Agent's algorithm would be equivalent to a step-by-step deduction:

(1) Ax2 |- A=1 => W=1000

(2) Ax2 |- A=2 => W=1000000

(3) Ax3 + (1) + (2) |- NOT (A=1)

(4) Ax1 + (3) |- A=2

In this form it is clear that there are no logical contradictions at any time. The system never believes A=1, it only believes (A=1 => W=1000), which is true.
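The deduction can be sanity-checked by brute force (my own illustration, not from the comment): enumerate candidate values of A and W and keep the pairs satisfying Ax1, Ax2, and the single instance of Ax3 used in steps (1)-(3). Note this checks only that one instance; the thread below discusses what goes wrong when the full Ax3 schema is admitted.

```python
from itertools import product

def implies(p, q):
    return (not p) or q

models = []
for A, W in product((1, 2), (1000, 1000000)):
    ax1 = A == 1 or A == 2
    ax2 = (A == 1 and W == 1000) or (A == 2 and W == 1000000)
    # Ax3 instantiated at a1=2, a2=1, w1=1000000, w2=1000:
    ax3 = implies(implies(A == 2, W == 1000000)
                  and implies(A == 1, W == 1000)
                  and 1000000 > 1000,
                  A != 1)
    if ax1 and ax2 and ax3:
        models.append((A, W))

print(models)   # [(2, 1000000)] -- matches the deduction's conclusion
```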

In the purely logical framework, the word "could" is modeled by what Hofstadter calls the "Fantasy Rule" of propositional calculus.

The direction of setting the problem up in logical setting is further pursued here and here.

The axioms as you wrote them don't fit or work for multiple reasons. Conceptually, the problem is in figuring out how decisions influence outcomes, so having something like (Ax2) defies this purpose. Statements like that are supposed to be inferred by the agent from more opaque-to-the-purpose-of-deciding definitions of A and W, and finding the statements of suitable form constitutes most of the activity in making a decision.

If the agent can figure out what it decides, then by default it becomes able to come up with spurious arguments for any decision, which makes its inference system (axioms) either inconsistent or unpredictable (if we assume consistency and don't let (Ax3) into it). The rules that say which arguments lead to which actions shouldn't be part of the theory that the agent uses to come up with the arguments. This excludes your (Ax3), which can only be assumed "externally" to the agent, a way of seeing if it's doing OK, not something it should itself be able to see.

A contradiction can be inferred from your axioms in this way: (4) says A=2, which implies [A=1 => W=0], which by (Ax3) together with [A=2 => W=1000000] implies that A=1, which is a contradiction and a decision to do whatever (I think you've also mixed up A=1 and A=2 while going from (Ax2) to (1) and (2)).

Thanks! This site is rather large :) And the most interesting (for me) stuff is rarely linked from the sequences. That is, it seems so... Is there a more efficient way to get up to date? [I would SO like to have found this site 5 years ago!]

The axioms were not proposed seriously as a complete solution, of course. The (Ax2) can be thought of as a cached consequence of a long chain of reasoning. The main point of the post was about the word "could", after all.

(Ax3) is just an axiomatic restatement of the "utility maximization" goal. I don't see why an agent can't know about itself that it's a utility maximizer.

A contradiction can be inferred from your axioms in this way: (4) says A=2, which implies [A=1 => W=0], which by (Ax3) together with [A=2 => W=1000000] implies that A=1, which is a contradiction and a decision to do whatever (I think you've also mixed up A=1 and A=2 while going from (Ax2) to (1) and (2)).

Yes, A=1 and A=2 were mixed up in the (Ax2), fixed it now. But I don't see your contradiction. (Ax3) + [A=1 => W=0] + [A=2 => W=1000000] does not imply A=1. It implies NOT(A=1)

Is there a more efficient way to get up to date?

See the posts linked to from UDT, ADT, TDT, section "Decision theory" in this post.

I don't see why an agent can't know about itself that it's a utility maximizer.

It can, it's just a consideration it can't use in figuring out how the outcome depends on its decision. While looking for what is influenced by a decision, the decision itself should remain unknown (given that the decision is determined by what this dependence turns out to be, it's not exactly self-handicapping).

Yes, A=1 and A=2 were mixed up in the (Ax2), fixed it now. But I don't see your contradiction. (Ax3) + [A=1 => W=0] + [A=2 => W=1000000] does not imply A=1. It implies NOT(A=1)

It seems the mixing up of A=1 and A=2 didn't spare this argument, but it's easy enough to fix. From A=2 we have NOT(A=1), so [A=1 => W=1,000,000,000] holds vacuously; together with [A=2 => W=1,000,000], (Ax3) gives NOT(A=2), contradicting A=2.

I thought about a different axiomatization, which would not have the same consistency problems. Not sure whether this is the right place, but:

Let X be a variable ranging over all possible agents.

(World Axioms)
forall X [Decides(X)=1 => Receives(X)=1000000]
forall X [Decides(X)=2 => Receives(X)=1000]

(Agent Axioms - utility maximization)
BestX = argmax{X} Receives(X)
Decision = Decides(BestX)

Then Decision=1 is easily provable, regardless of whether any specific BestX is known.
Does not seem to lead to contradictions.

Yes, I see now, thanks.

If I attempted to fix this, I would try to change (Ax3) to something like:

forall a1,a2,w1,w2 ((A=a1 => W=w1) AND (A=a2 => W=w2) AND (w1>w2)) => NOT (FinalA=a2)

where FinalA is the actual decision. But then, why should the world's axiom (Ax2) be defined in terms of A and not FinalA? This seems conceptually wrong...

Ok, I'll go read up on the posts :)

I enjoyed reading and thinking about this, but now I wonder if it's less interesting than it first appears. Tell me if I'm missing the point:

Since world is a function of the integer agent outputs, and not of agent itself, and since agent knows world's source code, agent could do the following: instead of looking for proofs shorter than some fixed length that world outputs u if agent outputs a, it could just simulate world and tabulate the values world(1), world(2), world(3), ... up to some fixed value world(n). Then output the a ≤ n for which world(a) is largest.

In other words, agent can just ask world what's the best it can get, then do that.

Granted there are situations in which it takes longer to compute world(n) than to prove that world(n) takes a certain value, but the reverse is also true (and I think more likely: the situations I can think of in which cousin_it's algorithm beats mine, resource-wise, are contrived). But cousin_it is considering an idealized problem in which resources don't matter, anyway.

Edit: If I'm wrong about this, I think it will be because the fact that world depends only on agent's output is somehow not obvious to agent. But I can't figure out how to state this carefully.

You're not completely wrong, but the devil is in the details. The world's dependence on the agent's output may indeed be non-obvious, as in Nesov's comment.

Are you saying there's a devil in my details that makes it wrong? I don't think so. Can you tell me a tricky thing that world can do that makes my code for agent worse than yours?

About "not obvious how world depends on agent's output," here's a only very slightly more tricky thing agent can do, that is still not as tricky as looking for all proofs of length less than n. It can write a program agent-1 (is this what Nesov is getting at?) that always outputs 1, then compute world(agent-1). It can next write a program agent-2 that always outputs 2, then compute world(agent-2). On and on up to agent-en. Then have agent output the k for which world(agent-k) is largest. Since world does not depend on any agent's source code, if agent outputs k then world(agent) = world(agent-k).

Again this is just a careful way of saying "ask world what's the best agent can get, and do that."
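This construction can be sketched directly (my own code; cousin_it's reply below notes it is only sound when world() depends on nothing but agent()'s return value). Constant stand-in agents are run through the world, and agent outputs the best-scoring constant:

```python
def make_world(agent):
    # Newcomb's Problem, parameterized by which agent the world calls.
    def world():
        box1 = 1000
        box2 = 0 if agent() == 2 else 1000000
        return box2 + (box1 if agent() == 2 else 0)
    return world

def agent():
    best_k, best_u = None, -1
    for k in (1, 2):
        # agent-k: a stand-in program that always outputs k.
        # (lambda k=k binds the current k, not the loop variable.)
        u = make_world(lambda k=k: k)()
        if u > best_u:
            best_k, best_u = k, u
    return best_k

print(agent())               # 1
print(make_world(agent)())   # 1000000
```

Because world() here branches only on the agent's return value, world(agent) = world(agent-k) whenever agent outputs k, exactly as the comment argues.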

Am I right that if world's output depends on the length of time agent spends thinking, then this solution breaks?

Edit: I guess "time spent thinking" is not a function of "agent", and so world(agent) cannot depend on time spent thinking. Wrong?

Edit 2: Wrong per cousin's comment. World does not even depend on agent, only on agent's output.

Yes. This solution is only "optimal" if world() depends only on the return value of agent(). If world() inspects the source code of agent(), or measures cycles, or anything like that, all bets are off - it becomes obviously impossible to write an agent() that works well for every possible world(), because world() could just special-case and penalize your solution.

Edit: You weren't wrong! You identified an important issue that wasn't clear enough in the post. This is the right kind of discussion we should be having: in what ways can we relax the restrictions on world() and still hope to have a general solution?

Neat. It seems like one could think of the agent as surveying mathematical objects (sets of theorems of the form "agent()==A implies world()==U") and then forcing itself to be in a particular mathematical object (that which contains a particular true statement "agent()==A implies world()==U"). If we're thinking in terms of an ensemble the agent would already be in a particular mathematical object, so perhaps its procedure could be thought of as forcing itself into a particular mathematical sub-structure.

When I think about this stuff I get the feeling I'm confused on many levels.