The Absent-Minded Driver

This post examines an attempt by professional decision theorists to treat an example of time inconsistency, and asks why they failed to reach the solution (i.e., TDT/UDT) that this community has more or less converged upon. (Another aim is to introduce this example, which some of us may not be familiar with.) Before I begin, I should note that I don't think "people are crazy, the world is mad" (as Eliezer puts it) is a good explanation. Maybe people are crazy, but unless we can understand how and why people are crazy (or to put it more diplomatically, "make mistakes"), how can we know that we're not being crazy in the same way or making the same kind of mistakes?

The problem of the ‘‘absent-minded driver’’ was introduced by Michele Piccione and Ariel Rubinstein in their 1997 paper "On the Interpretation of Decision Problems with Imperfect Recall". But I'm going to use "The Absent-Minded Driver" by Robert J. Aumann, Sergiu Hart, and Motty Perry instead, since it's shorter and more straightforward. (Notice that the authors of this paper worked for a place called Center for the Study of Rationality, and one of them won a Nobel Prize in Economics for his work on game theory. I really don't think we want to call these people "crazy".)

Here's the problem description:

An absent-minded driver starts driving at START in Figure 1. At X he
can either EXIT and get to A (for a payoff of 0) or CONTINUE to Y. At Y he
can either EXIT and get to B (payoff 4), or CONTINUE to C (payoff 1). The
essential assumption is that he cannot distinguish between intersections X
and Y, and cannot remember whether he has already gone through one of
them.

graphic description of the problem

At START, the problem seems very simple. If p is the probability of choosing CONTINUE at each intersection, then the expected payoff is p2+4(1-p)p, which is maximized at p = 2/3. Aumann et al. call this the planning-optimal decision.

The puzzle, as Piccione and Rubinstein saw it, is that once you are at an intersection, you should think that you have some probability α of being at X, and 1-α of being at Y. Your payoff for choosing CONTINUE with probability p becomes α[p2+4(1-p)p] + (1-α)[p+4(1-p)], which doesn't equal p2+4(1-p)p unless α = 1. So, once you get to an intersection, you'd choose a p that's different from the p you thought optimal at START.

Aumann et al. reject this reasoning and instead suggest a notion of action-optimality, which they argue should govern decision making at the intersections. I'm going to skip explaining its definition and how it works (read section 4 of the paper if you want to find out), and go straight to listing some of its relevant properties:

  1. It still involves a notion of "probability of being at X".
  2. It's conceptually more complicated than planning-optimality.
  3. Mathematically, it has the same first-order necessary conditions as planning-optimality, but different sufficient conditions.
  4. If mixed strategies are allowed, any choice that is planning-optimal is also action-optimal.
  5. A choice that is action-optimal isn't necessarily planning-optimal. (In other words, there can be several action-optimal choices, only one of which is planning-optimal.)
  6. If we are restricted to pure strategies (i.e., p has to be either 0 or 1) then the set of action-optimal choices in this example is empty, even though there is still a planning-optimal one (namely p=1).

In problems like this one, UDT is essentially equivalent to planning-optimality. So why did the authors propose and argue for action-optimality despite its downsides (see 2, 5, and 6 above), instead of the alternative solution of simply remembering or recomputing the planning-optimal decision at each intersection and carrying it out?

Well, the authors don't say (they never bothered to argue against it), but I'm going to venture some guesses:

  • That solution is too simple and obvious, and you can't publish a paper arguing for it.
  • It disregards "the probability of being at X", which intuitively ought to play a role.
  • The authors were trying to figure out what is rational for human beings, and that solution seems too alien for us to accept and/or put into practice.
  • The authors were not thinking in terms of an AI, which can modify itself to use whatever decision theory it wants to.
  • Aumann is known for his work in game theory. The action-optimality solution looks particularly game-theory like, and perhaps appeared more natural than it really is because of his specialized knowledge base.
  • The authors were trying to solve one particular case of time inconsistency. They didn't have all known instances of time/dynamic/reflective inconsistencies/paradoxes/puzzles laid out in front of them, to be solved in one fell swoop.

Taken together, these guesses perhaps suffice to explain the behavior of these professional rationalists, without needing to hypothesize that they are "crazy". Indeed, many of us are probably still not fully convinced by UDT for one or more of the above reasons.

EDIT: Here's the solution to this problem in UDT1. We start by representing the scenario using a world program:

def P(i, j):
    if S(i) == "EXIT":
        payoff = 0
    elif S(j) == "EXIT":
        payoff = 4
    else:
        payoff = 1

(Here we assumed that mixed strategies are allowed, so S gets a random string as input. Get rid of i and j if we want to model a situation where only pure strategies are allowed.) Then S computes that payoff at the end of P, averaged over all possible i and j, is maximized by returning "EXIT" for 1/3 of its possible inputs, and does that.

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 6:48 PM
Select new highlight date
All comments loaded

You cannot assume any α, and choose p based on it, for α depends on p. You just introduced a time loop into your example.

Indeed, though it doesn't have to be a time loop, just a logical dependency. Your expected payoff is α[p^2+4(1-p)p] + (1-α)[p+4(1-p)]. Since you will make the same decision both times, the only coherent state is α=1/(p+1). Thus expected payoff is (8p-6p^2)/(p+1), whose maximum is at about p=0.53. What went wrong this time? Well, while this is what you should use to answer bets about your payoff (assuming such bets are offered independently at every intersection), it is not the quantity you should maximize: it double counts the path where you visit both X and Y, which involves two instances of the decision but pays off only once.

Mod parents WAY up! I should've tried to solve this problem on my own, but I wasn't expecting it to be solved in the comments like that!

Awesome. I'm steadily upgrading my expected utilities of handing decision-theory problems to Less Wrong.

EDIT 2016: Wei Dai below is correct, this was my first time encountering this problem and I misunderstood the point Wei Dai was trying to make.

You make it sound as if you expect to expect a higher utility in the future than you currently expect...

The parents that you referred to are now at 17 and 22 points, which seems a bit mad to me. Spotting the errors in P&R's reasoning isn't really the problem. The problem is to come up with a general decision algorithm that both works (in the sense of making the right decisions) and (if possible) makes epistemic sense.

So far, we know that UDT works but it doesn't compute or make use of "probability of being at X" so epistemically it doesn't seem very satisfying. Does TDT give the right answer when applied to this problem? If so, how? (It's not specified formally enough that I can just apply it mechanically.) Does this problem suggest any improvements or alternative algorithms?

Awesome. I'm steadily upgrading my expected utilities of handing decision-theory problems to Less Wrong.

Again, that seems to imply that the problem is solved, and I don't quite see how the parent comments have done that.

There's an old story about a motorist who gets a flat tire next to a mental hospital. He goes over to change the tire, putting the lug nuts into the wheel cap... and sees that one of the patients is staring at him from behind the fence. Rattled, he steps on the wheel cap, and the lug nuts go into the storm sewer.

The motorist is staring, disheartened, at the sewer drain, when the patient speaks up from behind the fence: "Take one lug nut off each of the other wheels, and use those to keep on the spare tire until you get home."

"That's brilliant!" says the motorist. "What are you doing in a mental hospital?"

"I'm here because I'm crazy," says the patient, "not because I'm stupid."

Robert Aumann, Nobel laureate, is an Orthodox Jew. And Isaac Newton wasted a lot of his life on Christian mysticism. I don't think theists, or Nobel laureate physicists who don't accept MWI, are stupid. I think they're crazy. There's a difference.

Remind me to post at some point about how rationalists, in a certain stage in their development, must, to progress further, get over their feeling of nervousness about leaving the pack and just break with the world once and for all.

If crazy people can nevertheless reach brilliant and correct solutions, then in what sense is their craziness an explanation for the fact that they failed to reach some solution? I really don't see what Aumann's religiousness has to do with the question I asked in this post, not to mention that he's just one person who worked on this problem. (Google Scholar lists 171 citations for P&R's paper.)

To put it another way, if we add "Aumann is religious" to the list of possible explanations I gave, that seems to add very little, if any, additional explanatory value.

Because crazy smart people don't consistently reach solutions. It's not surprising when they're right, but it's not surprising when they're wrong, either. There are very few people I know such that I'm surprised when they seem to get something wrong, and the key factor in that judgment is high sanity, more than high intelligence.

I'm also beginning to have a very strange thought that a reddit-derived blog system with comment upvoting and karma is just a vastly more effective way of researching decision-theory problems than publication in peer-reviewed journals.

Chess World Champions are sometimes notoriously superstitious, you can still rely on the consistency of their chess moves.

They really ought to be, what's the rational value in putting the time and effort into chess to become a world champion at it.

I played it semi-seriously when I was young, but gave it up when in order to get to the next level I'd have to study more than play. Most of the people I know who were good at a competitive intellectual game dropped out of school to pursue it, because they couldn't handle studying at that level for both.

I find it rather difficult to believe that pursuing chess over school is the rationally optimal choice, so I wouldn't be remotely surprised to find that those who get to that level are irrational or superstitious when it comes to non-chess problems.

Chess provides very strong objective feedback on what does and doesn't work.

You should be emailing people like Adam Elga and such to invite them to participate then.

Surely Hanson's favorite (a market) is worth a try here. You're more raging (as he does) against the increasingly obvious inefficiency of peer reviewed journals than discovering Reddit + mods as a particularly good solution, no?.

An interesting question: is there any way to turn a karma system into a prediction market?

The obvious way to me is to weight a person's voting influence by how well their votes track the majority, but that just leads to groupthink.

The key to prediction markets, as far as I can tell, is that predictions unambiguously come true or false and so the correctness of a prediction-share can be judged without reference to the share-price (which is determined by everyone else in what could be a bubble even) - but there is no similar outside objective check on LW postings or comments, is there?

I'd love to do a real money prediction market. Unfortunately western governments seek to protect their citizens from the financial consequences of being wrong (except in state sponsored lotteries… those are okay), and the regulatory costs (financial plus the psychic pain of navigating bureaucracy) of setting one up are higher than the payback I expect from the exercise.

The UBC is able to do a non-profit elections prediction market, and it generally does better than the average of the top 5 pollsters.

The popular vote market is you pay $1 for 1 share of CON, LIB, NDP, Green, Other, and you can trade shares like a stockmarket.

The ending payout is $1 * % of popular vote that group gets.

There are other markets such as a seat market, and a majority market.

The majority market pays 50/50 if no majority is reached, and 100/0 otherwise, which makes it pretty awkward in some respects. Generally predicting a minority government the most profitable action is to try and trade for shares of the loser. This is probably the main reason its restricted to the two parties with a chance of winning one if it were the same 5 way system, trading LIB and CON for GREEN, OTHER and NDP to exploit a minority government would probably bias the results. In this case in a minority the payout would be 20/20/20/20/20, but many traders would be willing to practically throw away shares of GREEN, OTHER and NDP because they "know" those parties have a 0% chance of winning a majority. This leads to artificial devaluation and bad prediction information.

By trading 1 share of CON for 5 GREEN and 5 OTHER, you just made 10 times the money in a minority government, and that's the payoff you're looking for instead of saying that you think the combined chances of Green and Other winning a majority is 1/6th that of the conservatives winning.

Of course they still have this problem with Liberals and Conservatives where trading out of a party at a favorable rate might just be betting minority.

I think the problem with a prediction market is you need a payout mechanism, that values the shares at the close of business, for elections there is a reasonable structure.

For situations where there isn't a clear solution or termination that gets much more complicated.

I'm also beginning to have a very strange thought that a reddit-derived blog system with comment upvoting and karma is just a vastly more effective way of researching decision-theory problems than publication in peer-reviewed journals.

How relevant is the voting, as opposed to just the back and forth?

I think the voting does send a strong signal to people to participate (there's a lot more participation here than at OB). If this is working better than mailing lists, it may be the karma, but it may also be that it can support more volume by making it easier to ignore threads.

"one of them won a Nobel Prize in Economics for his work on game theory. I really don't think we want to call these people "crazy".)"

Aumann is a religious Orthodox Jew who has supported Bible Code research. He's brilliant and an expert, but yes, he's crazy according to local mores.

Of course, that doesn't mean we should dismiss his work. Newton spent much of his life on Christian mysticism.

Your payoff for choosing CONTINUE with probability p becomes α[p^2+4(1-p)p] + (1-α)[p+4(1-p)], which doesn't equal p^2+4(1-p)p unless α = 1.

No. This statement of the problem pretends to represent the computation performed by the driver at an intersection - but it really doesn't. The trouble has to do with the semantics of alpha. Alpha is not the actual probability that the driver is at point X; it's the driver's estimate of that probability. The driver knows ahead of time that he's going to make the same calculation again at intersection Y, using the same value of alpha, which will be wrong. Therefore, he can't pretend that the actual payoff is alpha x (payoff if I am at X) + (1-alpha) x (payoff if I am at Y). Half the time, that payoff calculation will be wrong.

Perhaps a clearer way of stating this, is that the driver, being stateless, must believe P(I am at X) to be the same at both intersections. If you allow the driver to use alpha=.7 when at X, and alpha=.3 when at Y, then you've given the driver information, and it isn't the same problem anymore. If you allow the driver to use alpha=.7 when at X, and alpha=.7 again when at Y, then the driver at X is going to make a decision using the information that he's probably at X and should continue ahead, without taking into account that he's going to make a bad decision at Y because he will be computing with faulty information. That's not an alternative logic; it's just wrong.

The "correct" value for alpha is actually 1/(p+1), for those of us outside the problem; the driver is at Y p times for every 1 time he's at X, so he's at X 1 out of p+1 times. But to the driver, who is inside the problem, there is no correct value of alpha to use at both X and Y. This means that if the driver introduces alpha into his calculations, he is knowingly introducing error. The driver will be using bad information at at least one of X and Y. This means that the answer arrived at must be wrong at at least either X or Y. Since the correct answer is the same in both places, the answer arrived at must be wrong in both places. Using alpha simply adds a source of guaranteed error, and prevents finding the optimal solution.

Does the driver at each intersection have the expected payoff alpha x [p^2+4(1-p)p] + (1-alpha)*[p+4(1-p)] at each intersection? No; at X he has the actual expected payoff p^2+4(1-p)p, and at Y he has the expected payoff p+4(1-p). But he only gets to Y p/1 of the time. The other 1-p of the time, he's parked at A. So, if you want him to make a computation at each intersection, he maximizes the expected value averaging together the times he is at X and Y, knowing he is at Y p times for every one time he is at X:

p^2+4(1-p)p + p[p+4(1-p)] = 2p[p+4(1-p)]

And he comes up with p = 2/3.

There's just no semantically meaningful way to inject the alpha into the equation.

then the expected payoff is p^2^+4(1-p)p

For anyone whose eyes glazed over and couldn't see how this was derived:

There are 3 possible outcomes:

  1. you miss both turns
    The probability of missing both turns for a p is p*p (2 turns, the same p each time), and the reward is 1. Expected utility is probability*reward, so 2*p*1. Which is just 2*p or p^2
  2. you make the second turn.
    The probability of making the second turn is the probability of missing the first turn and making the second. Since p is for a binary choice, there's only one other probability, q, of missing the turn; by definition all probabilities add to 1, so p+q=1, or q=1-p. So, we have our q*p (for the missed and taken turn), and our reward of 4. As above, the expected utility is q*p*4, and substituting for q gives us (1-p)*p*4, or rearranging, 4*(1-p)*p.
  3. or, you make the first turn
    The probability of making the first turn is just p-1 as before, and the reward is 0. So the expected utility is (p-1)*0 or just 0.

Our 3 possibilities are exhaustive, so we just add them together:

p^2 + 0 + 4*(1-p)*p

0 drops out, leaving us with the final result given in the article:

p^2 + 4*(1-p)*p

  1. The probability of missing both turns for a p is 2*p […] Which is just 2*p or p^2

  2. The probability of making the first turn is just p […] So the expected utility is p*0 or just 0.

In (1), instead of 2*p, you want p*p. In (2), you want 1 – p instead of p. The final results are correct, however.

I could add some groundless speculation, but my general advice would be: Ask, don't guess!

You probably won't get answers like "I wanted to publish a paper.". But I'd bet, it would be very enlightening regardless. It'd be surprising if all of them were so extremely busy that you can't approach anybody in the area. But don't settle for PhD students, go for senior professors.

Please go ahead and add your groundless speculation. (I just added a couple more of my own to the post.) I'm actually more interested in the set of plausible explanations, than the actual explanations. A possible explanation for an error can perhaps teach us something, even if it wasn't the one responsible for it in this world.

For example, "The authors were trying to solve one particular case of time inconsistency." suggests that maybe we shouldn't try to solve ethical dilemmas one at a time, but accumulate as many of them as we can without proposing any solutions, and then see if there is a single solution to all of them.

"The authors were trying to figure out what is rational for human beings, and that solution seems too alien for us to accept and/or put into practice."

I'm not sure about that. A lot of people intuitively endorse one-boxing on Newcomb, and probably a comparable fraction would endorse the 2/3 strategy for Absent-Minded Driver.

"Well, the authors don't say (they never bothered to argue against it)"

They do mention and dismiss mystical /psychic causation, the idea that in choosing what we will do we also choose for all identical minds/algorithms

"The authors were trying to solve one particular case of time inconsistency. They didn't have all known instances of time/dynamic/reflective inconsistencies/paradoxes/puzzles laid out in front of them, to be solved in one fell swoop."

Decision theorists have a lot of experience with paradoxical-seeming results of standard causal decision theory where 'rational agents' lose in certain ways. Once that conclusion has been endorsed by the field in some cases, it's easy to dismiss further such results: "we already know rational agents lose on all sorts of seemingly easy problems, such that they would precommit/self-modify to avoid making rational decisions, so how is this further instance a reason to change the very definition of rationality?" There could be substantial path-dependence here.

What I'm more interested in is: doesn't the UDT/planning-optimal solution imply that injecting randomness can improve an algorithm, which is a big no-no? Because you're saying (and you're right AFAICT) that the best strategy is to randomly choose whether to continue, with a bias in favor of continuing.

Also, could someone go through the steps of how UDT generates this solution, specifically, how it brings to your attention the possibility of expressing the payoff as a function of p? (Sorry, but I'm a bit of a straggler on these decision theory posts.)

imply that injecting randomness can improve an algorithm

Hm. This would actually appear to be a distinct class of cases in which randomness is "useful", but it's important to note that a simple deterministic algorithm would do better if we were allowed to remember our own past actions - i.e. this is a very special case. I should probably think about it further, but offhand, it looks like the reason for randomized-optimality is that the optimal deterministic algorithm has been prohibited in a way that makes all remaining deterministic algorithms stupid.

More generally, game theory often produces situations where randomization is clearly the rational strategy when you do not know with certainty the other player's action. The example given in this post is straightforwardly convertible into a game involving Nature, as many games do, where Nature plays first and puts you into one of two states; if you are in X and play Continue, Nature plays again and the game loops. The solution to that game is the same as that derived above, I believe.

The point is, for a broad class of games/decision problems where Nature has a substantial role, we might run into similar issues that can be resolved a similar way.

I think the motivation for this problem is that memory in general is a limited resource, and a decision theory should be able to handle cases were recall is imperfect. I don't believe that there was a deliberate attempt to prohibit the optimal deterministic algorithm in order to make randomized algorithms look good.

Jaynes' principle is that injecting randomness shouldn't help you figure out what's true; implementing a mixed strategy of action is a different matter.

Although the distinction is a bit too fuzzy (in my head) for comfort.

In #lesswrong, me, Boxo, and Hyphen-ated each wrote a simple simulation/calculation of the absent-minded driver and checked how the various strategies did. Our results:

  • 1/3 = 1.3337, 1.3338174; Hyphen-ated: 1.33332; Boxo: 1.3333333333333335
  • 1/4 = 1.3127752; Hyphen-ated: 1.31247; Boxo: 1.3125
  • 2/3 = 0.9882
  • 4/9 = 1.2745, 1.2898, 1.299709, 1.297746, 1.297292, 1.2968815; Hyphen-ated: 1.29634; Boxo: 1.2962962962962963

As expected, 4/9 does distinctly worse than 1/3, which is the best strategy of the ones we tested. I'm a little surprised at Piccione & Rubinstein - didn't they run any quick* little simulation to see whether 4/9 was beaten by another probability or did they realize this and had some reason bad performance was justifiable? (I haven't read P&R's paper.)

Boxo used his UDT in Haskell code ([eu (Strategy $ const $ chance p) absentMinded id | p <- [0, 1/2, 1/3, 1/4, 4/9]]); I used some quick and dirty Haskell pasted below; and I don't know what Hyphen-ated used.

import System.Random
import Control.Monad
main = foo >>= print -- test out the 1/4 strategy. obvious how to adapt
foo = let n = 10000000 in fmap (\x -> fromIntegral (sum x) / fromIntegral n) $ replicateM n run
run = do x <- getStdRandom (randomR (1,4))
         y <- getStdRandom (randomR (1,4))
         return $ trial x y
trial :: Int -> Int -> Int
trial x y | x == 1 = 0 -- exit at first turn with reward 0
          | y == 1 = 4 -- second turn, reward 4
          | otherwise = 1 -- missed both, reward 1

* It's not like this is a complex simulation. I'm not familiar with Haskell's System.Random so it took perhaps 20 or 30 minutes to get something working and then another 20 or 30 minutes to run the simulations with 1 or 10 million iterations because it's not optimized code, but Boxo and Hyphen-ated worked even faster than that.

Wei, the solution that makes immediate sense is the one proposed by Uzi Segal in Economic Letters, Vol 67, 2000 titled "Don't Fool Yourself to Believe You Won't Fool Yourself Again".

You find yourself at an intersection, and you have no idea whether it is X or Y. You believe that you are at intersection X with probability α. Denote by q the probability of continuing to go straight rather than taking a left. What lottery do you face? Well, if you are at Y, then EXITING will yield a payoff of 4, and CONTinuing will yield a payoff of 1. If you are at X, then EXITING will yield a payoff of 0, and CONTinuing on straight will take you to Y. However, when you do in fact reach Y you will not realize you have reached Y and not X. In other words, you realize that going straight if you are at X will mean that you will return to the same intersection dilemma. So, suppose we denote the lottery you currently face at the intersection by Z. Then the description of the lottery is Z = ( (Z, q; 0, (1-q)), α; (1, q; 4, (1-q)), 1 - α). This is a recursive lottery. The recursive structure might seem paradoxical since there are a finite number of intersections, but is simply a result of the absent-mindedness. When you are at an intersection, you realize that any action you take at the moment that will put you at an intersection will not be distinguishable from the present moment in time, and that when you contemplate your choice at this subsequent intersection, you will not have the knowledge that you had already visited an intersection, inducing this "future" you to account for the belief that the second intersection is yet to be reached. The next crucial step is to recognize that beliefs about where you are ought to be consistent with your strategy. You only ever have to make a choice when you are at an intersection, and since all choice nodes are indistinguishable from each other (from your absent-minded perspective), your strategy can only be a mixture over EXIT and CONT. Note that the only way to be at intersection Y is by having chosen CONT at intersection X. Thus, if you believe that you are at intersection X with probability α, then the probability of being at Y is the probability that you ascribed to being at X when you previously made the choice to CONT multiplied by the probability on CONTinuing, as described by your optimal strategy. That is, Prob(Y) = Prob(X) Prob(CONT). So, consistency of beliefs with your strategy requires that (1 - α) = α q. Then, the value of the lottery Z, V(Z) as a function of q is described implicitly by: V(Z) = αqV(Z) + α(1-q)(0) + (1-α)q(1) + (1-α)(1-q)4. Solving for V(Z), and using the consistent beliefs condition yields V(Z) = (1 + 3q)(1-q), and so the optimal q is 2/3, which is the same mixed strategy as you chose when you were at START.

Of course, Aumann et. al. obtain that optimality of the planning-stage choice, but the argument they make is distinct from the one presented here.

It occurs to me that perhaps "professional rationalist" is a bit of an oxymoron. In today's society, a professional rationalist is basically someone who is paid to come up with publishable results in the academic fields related to rationality, such as decision theory and game theory.

I've always hated the idea of being paid for my ideas, and now I think I know why. When you're being paid for you ideas, you better have "good" ideas or you're going to starve (or at least suffer career failure). But good ideas can't be produced on a schedule, so the only way to guarantee academic survival is to learn the art of self-promotion. Your papers better make your mediocre ideas look good, and your good ideas look great. And having doubts about your ideas isn't going to help your career, even if they are rationally justified.

Given this reality, how can any professional rationalist truly be rational?