Decision Theories: A Semi-Formal Analysis, Part I

Or: The Problem with Naive Decision Theory

Previously: Decision Theories: A Less Wrong Primer

Summary of Sequence: In the context of a tournament for computer programs, I give almost-explicit versions of causal, timeless, ambient, updateless, and several other decision theories. I explain the mathematical considerations that make decision theories tricky in general, and end with a bunch of links to the relevant recent research. This sequence is heavier on the math than the primer was, but is meant to be accessible to a fairly general audience. Understanding the basics of game theory (and Nash equilibria) will be essential. Knowing about things like Gödel numbering, quining and Löb's Theorem will help, but won't be required.

Summary of Post: I introduce a context in which we can avoid most of the usual tricky philosophical problems and formalize the decision theories of interest. Then I show the chief issue with what might be called "naive decision theory": the problem of spurious counterfactual reasoning. In future posts, we'll see how other decision theories get around that problem.

In my Decision Theory Primer, I gave an intuitive explanation of decision theories; now I'd like to give a technical explanation. The main difficulty is that in the real world, there are all sorts of complications that are extraneous to the core of decision theory. (I'll mention more of these in the last post, but an obvious one is that we can't be sure that our perception and memory match reality.)

In order to avoid such difficulties, I'll need to demonstrate decision theory in a completely artificial setting: a tournament among computer programs.

You're a computer programmer entering a tournament for spectacular stakes. But you won't be competing in person: instead, you'll be submitting code for a program to represent you, and that program will be competing one-on-one with other programs in a series of games.

You don't know what the specific games are, but the contest has specified some ground rules:

  • In each game, your program and its opponent will have to choose separately between several options. (There are independent sources of randomness available to each program, so mixed strategies are legal.)
  • The games are pure strategic conflicts: the expected payouts to each programmer depend on the outputs of both programs, and can be calculated simply if you know those outputs. (In particular, you can represent each game as a payoff matrix in terms of the programs' outputs.) For example, your program might play the Prisoner's Dilemma against another program.
  • Both programs have access to the source code of each other and of the game they're playing.
  • The programs don't get to carry any memories from round to round, or modify their source code at any stage. (This makes the analysis much simpler, and makes it even more impressive if we find unexploitable programs which are capable of mutual cooperation.)
  • While many of the programs were written by other programmers trying to win prizes for themselves, there are also some special algorithms included to test your reactions; for example, there could be programs that always cooperate in the Prisoner's Dilemma, or an instance of Newcomb's Predictor.
  • The tournament may also include some exact copies of your own program, but you won't get any of the prizes won by these extra copies, only the prizes won by your actual entry. (There's no way for your program to distinguish whether it's the original or a copy.)
  • There are more than enough prizes to go around, so you don't have to make anyone else lose, you just want your program to win as much as it can.
  • Also, there will be enough games played that most of the variance should wash out, so you should aim for the highest expected value per round rather than worry about the concave utility of more prizes.

So, what kind of program should you write?

In the next few posts, we'll examine several ideas, problems and decision theories, increasing our sophistication as we go. We'll use X to denote your program, and x1, . . . , xn to denote its possible outputs; likewise, your opponent in the current round is Y with possible outputs y1, . . . , ym. We'll let U(xi,yj) denote the resulting payout to you if X outputs xi and Y outputs yj.

Idea 1: Play defense with a Nash equilibrium

In our example, we know what utility function the opponent should have: its own expected payout.0 Any such game has at least one Nash equilibrium, a pair of strategies (which may be mixed) such that if X and Y both adopted them, then neither would be better off unilaterally switching strategies. In that sense, at least, if X plays a Nash equilibrium, it can be sure of not being exploited by Y. (In particular, X will never end up tricked into cooperating in the Prisoner's Dilemma while Y defects.)

Of course, there may be more than one Nash equilibrium in a game, and these may be of unequal value if the game is non-zero-sum. (Coordination problems are tricky in general.) So this is underspecified; still, choosing an arbitrary Nash equilibrium is a decent backup strategy. But we can often do better:

Idea 2: Inference

The most basic intuition a human being has in this situation is to start trying to deduce things about Y's output from its source code, or even deduce things directly about U. This idea is best illustrated by playing Rock, Paper, Scissors against a player who always throws Rock: if you figure this out, then you should of course play Paper rather than the Nash equilibrium of 1/3 Rock, 1/3 Paper, 1/3 Scissors. (And in a coordination game, you'd prefer to settle on the same Nash equilibrium that Y outputs.)

Automating inference is an exceedingly difficult problem in general, though researchers have made substantial progress. All the decision theories we'll talk about will include some sort of "inference module", which can be applied to the source code of Y to deduce its output, applied to the code of the full round (including X, Y, and the payoff matrix) to deduce the value of U, etc.

Problem: You can't deduce everything

Gödel's First Incompleteness Theorem and the Halting Problem both imply that it's impossible to write a program that correctly deduces in general1 the output of arbitrary other programs. So we have to be prepared for our inference module to fail sometimes.

A well-written inference module will either return a correct answer for a question or return "Unknown"; a sloppily-written module can get stuck in an infinite process, and a badly-written one will return an incorrect answer sometimes. It should be clear that we'll want our inference module to be of the first sort.

It seems we have enough already to define our first candidate decision theory:

Naive Decision Theory

Let's first consider the approach that seems most obvious. Since we know the source code of the entire round (including X and Y), we could implement the following program:

  • For each xi, assume the output of X is xi, and try to deduce the expected value of U. (That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui).
  • If this succeeds for each xi, output the xi for which ui is the largest.
  • If this does not succeed for some xi, output a Nash equilibrium strategy.

This "naive decision theory" certainly qualifies for our tournament; it may be a bit trickier to write an inference module that does an open-ended search for the value of U, but it's not impossible (since human mathematicians solve open-ended deduction problems all the time). And it looks like the worst-case scenario is a Nash equilibrium, not total exploitation. What could possibly go wrong?

Problem: Beware self-fulfilling prophecies!

There's a reason that we don't normally ask an automated theorem prover to consider questions about its own mathematical structure: if we ask the question in a certain way, any choice becomes a self-fulfilling prophecy.

If X deduces its own output by a valid process, then it's created a self-fulfilling prophecy for its output, and the problem with that is that a bad self-fulfilling prophecy is just as consistent as a good one. If we want to use statements like "if (output X)=xi then U=ui" to make our final choice, then we have to beware the other half of logical implication, that "not P" implies "if P then Q". This allows for what we might call spurious counterfactuals, which can throw off the actual decision in a perfectly self-consistent way.

Consider, for example, the one-player game where X gets $1 for outputting a, or $10 for outputting b. We want X to do the following:

  • Prove "if (output X)=a then U=1"
  • Prove "if (output X)=b then U=10"
  • Output b.

But it's just as consistent for X to do the following:

  • Prove "(output X)=a"
  • Prove "if (output X)=a then U=1"
  • Prove "if (output X)=b then U=0"
  • Output a.

How could that possibly work? Since (output X)=a, the third line is a true logical statement! It's like the fact that you can prove anything if you assume a falsehood (though rather than unconditionally accepting a false premise, X is using a false premise as the antecedent of a material conditional). In this example, the very order in which X looks for proofs (which is part of the definition of X) affects which counterfactuals X can and cannot prove. (One important thing to note is that X cannot prove a spurious counterfactual about the action that it does output, only about the ones that it doesn't!)

I don't need to tell you that the second chain of proofs is not what we want X to do. Worse, if this is a real bug, then it could also be an exploitable vulnerability: if your source code for X were released in advance of the tournament, then other programmers might write programs that cause X to generate spurious counterfactuals for all but the moves that are most favorable to Y.

Can NDT be salvaged?

Let's consider some quick fixes before we give up on Naive Decision Theory.

Can we simply prohibit X from ever deducing "(output X)=xi" as a step? This doesn't work because of the possibility of indirect self-reference; X could end up deducing some seemingly innocuous statements which happen to correspond to its own Gödel numbering, and the spurious counterfactuals would follow from those- without X ever having noticed that it had done anything of the sort. And it's provably impossible for X to recognize every possible Gödel numbering for its own inference module!

Next, it might seem like an inference module should stumble on the "genuine" counterfactuals before running into spurious ones, since the "genuine" ones seem necessarily simpler. However, it turns out (as proved by cousin_it) that one can write a valid but malicious inference module which returns and acts on a spurious proof, and (as proved by gRR) that a game with malicious code can similarly dupe a NDT agent with a good inference module!

Lastly, it seems safer to deduce counterfactuals "if (output X)=xi then (output Y)=yj" and apply the U(xi,yj) afterwards. And indeed, I can't see how to make Y exploit X in a straight-up Prisoner's Dilemma if that algorithm is used. There are still two problems, though. First, this algorithm now depends on the values U(xi,yj) being given to it by authority- it can't safely deduce them from the source code for the game. And secondly, it could two-box on Newcomb's Problem and defect against itself in the Prisoner's Dilemma if it finds spurious counterfactuals there.

Thus it seems we'll need to do something substantially different.

Well, now what?

There are several ways we could write a decision theory to do inference without risking spurious counterfactuals, and indeed the decision theories we discuss on Less Wrong correspond to different valid approaches. The differences in their decisions come not from better-written inference modules, but from more effective strategies for using their inference module. In the posts to come, I'll show you how they work in detail.

Next: Part II: Causal Decision Theory and Substitution

Notes:

0. Things get wacky if we don't know the utility function of the opponent; fortunately, even the special cases like Newcomb's predictor can be expressed as expected utility maximizers for some payoff matrix (in this case, one where the predictor gets rewarded when it matches your decision exactly).

1. That "in general" is important: it's quite possible to write programs that deduce the outputs of plenty of other programs. It just so happens that there's always some program that your inference module will fail on. The classic way to generate a failure case is to run the inference module on a modified version of itself, such that returning a correct answer would induce a contradiction. This isn't just a hypothetical disability: if X is trying to deduce the output of Y in this round, and Y is trying to deduce the output of X in this round, then we might have exactly this problem!

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 3:54 PM
Select new highlight date
All comments loaded

Markup note: Footnote links should be made relative so they don't force a reload of the page due to its URL having been changed.

if your source code for X were released in advance

But everyone's source code is communicated at each round to everyone. What do you mean, "in advance"? What for?

The problem is that we have an exploitable vulnerability here: if your source code for X were released in advance, then other programmers could write a Gödel-numbering for X's inference module into their own programs in such a way that X has to take a circuitous route to the "genuine" counterfactuals (or so that the genuine ones come back "Unknown"), and in the meantime X will stumble upon exactly the spurious counterfactuals that help Y.

This is very speculative. While an agent can make predictions about itself as long as it wishes, thus making a natural counterfactual inference performed by the other agent longer, that makes the spurious counterfactuals longer as well. Wei asks in another thread:

Having f(L)=L+1 would be sufficient to protect against any "moral arguments" that involve first proving agent()!=a for some a. Do we have any examples of "spurious" proofs that are not of this form?

Unless an agent sabotages itself by deciding arbitrarily, and then observes spurious inferences from that decision, I don't see how to push spurious inferences in front of the natural ones.

I think this part of the post should be rephrased, so that it says that we don't know how to prove that even such bizarre exploitation strategy is impossible, instead of the current assertion (unless you have background knowledge or a new result that suggests that this conjecture is reasonable).

Edit: Turns out this is indeed possible, if X can be tricked into checking a specifically contrived proof out of order. Here's a post describing the setup.

Well, this article puts my confusion from the primer into context, and I think I understand the issue now.

The "problem" (and it's ultimately not a problem, but I'll get to that) I have is that the game these programs are playing is ill-posed, in that it's not a closed system. It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar. Now I'm not enough of an expert to tell you exactly what defines a "well posed" game, but I'm quite confident this is not it. Or maybe I'm abusing the terminology a bit but hopefully my point is clear enough (if not I'll elaborate). The distinction though is that the mechanism by which you make your decision on how to play should be outside the game itself, and therefore only observed indirectly.

One property this metagame has is that there is fundamentally no "correct" answer for what "decision theory" the programs should use. The proof is simple, imagine you have a correct answer D. Imagine then that this program faces a population of the following opponents (I'll use the Prisoners' Dilemna subgame as a simple example):

  • Look at your opponent's source code.
  • If your opponent is using D, defect.
  • Otherwise co-operate.

D is in fact the single worst strategy in this scenario by quite a margin, regardless of what it is.

But that's fine, why would we even expect there to be a single correct strategy for every possible set of opponents? Of course there isn't. This is where we take a step back to a game that is actually well posed and for which Naive Decision Theory works fine - the programmers' game. Applying it there would go something like this:

  • Create a probability distribution over the (potentially infinite) space of all possible programs that yours might be up against.
  • Find the program which maximises the expected utility against that distribution of opponents.

For an infinite space of possible programs the second step might technically be impossible to perform optimally, but you do the best you can. There is of course a devil in the details of the first step in how you create that distribution, and when you look into it ultimately it's a game between programmers, with its own Nash Equilibria and co-operative strategies and so on, not just a simple decision. But until you start having programmers read each others' minds (in which case you've really just pointlessly shifted the whole thing up one meta-level) it's at least a well-posed (if complex) game that the ordinary approaches should work for.

As I said earlier though the fact that I might call these games ill-posed doesn't mean they're not interesting. I'll be fascinated to read about the strategies that people have devised for them and the attempts to be as general as possible. I'm a little concerned by the fact that the arguments so far seem to be along the lines of "You could do X, but then what if Y happens?", purely because as I've shown above there is ultimately no end to that chain of logic. But maybe you'll eventually step outside of it. My guess is that ultimately the "smarter" program wins. Certainly my generic counter-strategy has to generally be more complex (and therefore "smarter"? Maybe) than the "D" strategy it punishes in order to identify "D" in the first place.

In any case I look forward to finding out.

There's a relevant footnote in the next post: while it's possible to write agents that punish a particular decision theory, an agent that's trying (in any meaningful sense) to maximize its stated payoffs won't do this, and it's certainly not a constant strategy (i.e. a one-player game) either. In that sense, we can say that such problems are less likely to be encountered in reality.

It might appear as if they're playing, for example, Prisoners' Dilemna, but with access to each other's source code they're really playing some sort of Prisoners' Meta-Dilemna, a fundamentally different game for all that it might seem superficially similar.

I agree, in the same way that Iterated Prisoners' Dilemma (with no source code access) is a fundamentally different game from One-Shot Prisoners' Dilemma (with no source code access). But there are real-life instances that are closer to the source-code access version than to the classical game theory version- for instance, a Cold War between two powers, each of which has a network of spies in the other's government.

I suspect the difficulty you've hit here is that Eliezer's theory (TDT and variants) involves consistent reasoning about counterpossible statements, not just counterfactual statements ("If my algorithm were to pick a in this situation, then my expected utility would be -10000000, so I won't do that then"). I can see what he's trying to achieve, but unfortunately I don't know any way in standard Mathematics to reason consistently about inconsistencies, or (in Modal Logic) to analyze impossible possible worlds. So this looks like a major problem.

I'm also struck by the similarity to an argument for 2-boxing in Newcomb's problem. Basically, the causal decision theorist knows he is a causal decision theorist, and knows that Omega knows that, so he proves a theorem to the effect that the opaque box won't contain $1 million. And then having proved that, it's blindingly obvious that he should 2-box (it's $1000 or nothing). So he 2 boxes. Nothing inconsistent there, just tragic.

So this looks like a major problem.

But one that seems to have multiple "lines of attacks". Have you seen cousin_it's recent posts in discussion for example?

Basically, the causal decision theorist knows he is a causal decision theorist

What do you mean by "causal decision theorist"? Is it:

  1. A decision theorist (somebody who studies decision theory) who, based on the arguments he has seen so far, thinks CDT is the right decision theory? Or
  2. An AI that has access to its own source code and can see that it's running CDT?

I hadn't seen the cousin_it posts, thanks, though I'm reading them now.

One observation is that I thought myself of the "tweak" whereby if TDT, UDT or similar manages to prove that it will not perform an action a, then it immediately does perform an action a.

This at least prevents a sound decision theory finding two distinct proofs of the form "If my algorithm were to do a, then my utility would be 1000" and "If my algorithm were to do a, then my utility would be -1000". However, a couple of reservations:

  1. The tweak sounds pretty weird (if I'm president, and I prove to myself that I won't push the nuclear button, then suddenly I will. Huh???)

  2. If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A. Indeed it's not obvious that I can infer anything at all; it will depend on how badly my impossible worlds behave. Some models of impossible worlds are complete anarchy, with no logical relationships at all between propositions.

On the question about CDT, I hadn't really thought about whether the agent was human or automated, or whether it has access to its own source-code or just a general self-awareness of its decision-making powers. To be honest, I don't think it makes much difference; human agents who are utterly convinced of the correctness of CDT (and there are some of them) will reach much the same conclusion as the automated agent (I.e. that there just isn't going to be $1 million in the opaque box, so don't worry about it). And when noticing other agents who seem to be getting the $1 million, they'll shrug and say "Oh dear, Omega is awarding irrationality, but so what? Even if I had switched to TDT or UDT instead of CDT, there would still be some possible Omega who'd penalise me for that switch, and I had no way of knowing in advance which Omega I was likely to meet; in fact I was pretty confident I wouldn't meet any Omega ever. Can't win them all..."

If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A.

In our models so far, this isn't a problem, you just use a factory-standard first order inference system. What do you mean by "can't legitimately infer"? The worst that can happen is that you infer something misleading (but still valid), and the diagonal step/chicken rule is one way of ensuring that doesn't happen.

Thanks for the pointers to these... I'll probably move my comments to there. On your other point:

In our models so far, this isn't a problem, you just use a factory-standard first order inference system.

and...

If I'm trying to reason about counterpossibles (or impossible possible words) then the usual laws of logic and proof just fly out of the window. I can't legitimately infer from A -> B and A-> ~B that ~A.

The inference rule A-> B, A->~B |- ~A is the familiar principle of proof by contradiction i.e. "reductio an absurdam" of the alternative. The difficulty I see is that you can't really use reduction ad absurdam in an impossible worlds semantics, because the impossible worlds are - ahem - absurd. That's kinda the point.

I haven't read all the details of what you're trying, but my suspicion is that using standard off-the-shelf logics won't work here: you will at least need to limit the application of the inference rules e.g. so that they can be applied only up to a maximum finite number of times (and so the set of theorems is not closed under logical inference). Maybe you need some sort of belief semantics, where the "worlds" admitted into the belief-system are strictly a mixture of possible and impossible worlds, but the agent hasn't realized yet that the impossible ones actually are impossible or will never realize it, so they "seem" to be possible. So he's chugging along analysing those worlds using inference rules that will eventually hit a contradiction, but not yet. For "practical purposes" the system is consistent and won't blow up. Though be very careful, because an inconsistent system with the rule ("If I prove I won't do a, then do a") could accidentally prove that it won't nuke the planet and then (inconsistently) nuke the planet. Oops.

my suspicion is that using standard off-the-shelf logics won't work here: you will at least need to limit the application of the inference rules e.g. so that they can be applied only up to a maximum finite number of times (and so the set of theorems is not closed under logical inference)

This happens automatically, because (1) only the statements contributing to the decision matter, and there's a finite number of them, and (2) presence of the things like the diagonal step/chicken rule in the decision algorithm implies that inferring of absurdity doesn't happen. So we can prove that it's not the case that an agent can infer absurdity, even though it's free to use any first order inference it wants, and even though absurdity does follow from the axioms (in the setting without provability oracle).

In the setting with provability oracle, agent's algorithm is constructed in such a way that its axioms become sufficiently weak that the impossible counterfactuals (from the point of view of a stronger theory) remain consistent according to the theory used by the agent, and so in that setting the impossible worlds have actual models.

Your self-fulfilling prophecy example works for the iteration of the for loop (described in "For each xi, assume the output of X is xi, and try to deduce the expected value of U.") in which the output is assumed to be a, but for the iteration in which the output is assumed to be b, proving that the output is a would be to prove a contradiction. "if (output X)=b then U=0" is one possible outcome, but U could also equal anything else.

I don't see how the NDT algorithm as given allows "(output X)=a" to be proved outside of the for loop at all. I would think it would take (output X)=whatever for each iteration through the for loop as a given before trying to prove anything, in which case in the run of the for loop in which (output X)=b is the given, proving (output X)=a is a clear contradiction, one which I would think our prover could avoid unless our axiomatic system is contradictory in the first place.

Or to rephrase, I don't think "For each xi, assume the output of X is xi, and try to deduce the expected value of U." and "(That is, try and deduce statements of the form "if (output X)=xi then U=ui" for some ui)." are actually equivalent at all, and I think the self-fulfilling prophecy example follows the second and ignores the first.

  1. Actually, this is an open problem so far as I know: show that if X is a Naive Decision Theory agent as above, with some analyzable inference module like a halting oracle, then there exists an agent Y written so that X cooperates against Y in a Prisoner's Dilemma while Y defects.

Let me just spell out to myself what would have to happen in this instance. For definiteness, let's take the payoffs in prisoner's dilemma to be $0 (CD), $1 (DD), $10 (CC) and $11 (DC).

Now, if X is going to co-operate and Y is going to defect then X is going to prove "If I co-operate then I get $0". Therefore, in order to co-operate, X must also prove the spurious counterfactual "If I defect then I get $x" for some negative value of x.

But suppose I tweak the definition of the NDT agent so that whenever it can prove (1) "if output = a then utility >= u" and (2) "if output != a then utility <= u" it will immediately output a. (And if several statements of the forms (1) and (2) have been proved then the agent searches for them in the order that they were proved) Note that our agent will quickly prove "if output = 'defect' then utility >= $1". So if it ever managed to prove "if output = 'co-operate' then utility = $0" it would defect right away.

Since I have tweaked the definition, this doesn't address your 'open problem' (which I think is a very interesting one) but it does show that if we replace the NDT agent with something only slightly less naive, then the answer is that no such Y exists.

(We could replace Prisoner's Dilemma with an alternative game where each player has a third option called "nuclear holocaust", such that if either player opts for nuclear holocaust then both get (say) -$1, and ask the same question as in your note 2. Then even for the tweaked version of X it's not clear that no such Y exists.)

ETA: I'm afraid my idea doesn't work: The problem is that the agent will also quickly prove "if 'co-operate' then I receive at least $0." So if it can prove the spurious counterfactual "if 'defect' then receive -1" before proving the 'real' counterfactual "if 'co-operate' then receive 0" then it will co-operate.

We could patch this up with a rule that said "if we deduce a contradiction from the assumption 'output = a' then immediately output a" which, if I remember rightly, is Nesov's idea about "playing chicken with the inconsistency". Then on deducing the spurious counterfactual "if 'defect' then receive -1" the agent would immediately defect, which could only happen if the agent itself were inconsistent. So if the agent is consistent, it will never deduce this spurious counterfactual. But of course, this is getting even further away from the original "NDT".

I look forward to the rest of this sequence!

The programs don't get to carry any memories from round to round, or modify their source code at any stage.

! No memory is a big hole to leave in your explanation. It also destroys your ability to do inference, i.e. idea 2.

Overall, it looks like for NDT you don't want to have access to your opponent's source code, and you do want to have memory. You don't need to infer if you can read!

This branch of math is outside my training. I'm stumbling over the self-fulfilling prophecies section.

How can these two statements

if (output X)=b then U=10
if (output X)=b then U=0

both be true?

Because in the second example, it's been deduced that (output X)=a. It's like how you can prove anything from a false premise.

Yes, though I would say "because you can prove anything from a false premise".

Subtle distinction: it's not unconditionally taking a false axiom and deriving a spurious conclusion, it's proving a conditional by proving the antecedent is false.

I'll see if I can improve the wording.

These articles seem beyond my skillset also, but this may help you:

In math, the sentence "if A then B", is equivalent to "(not A) or B"
So, "if A then B" is considered true if "A" is false, regardless of what B is.