Towards a New Decision Theory

It commonly acknowledged here that current decision theories have deficiencies that show up in the form of various paradoxes. Since there seems to be little hope that Eliezer will publish his Timeless Decision Theory any time soon, I decided to try to synthesize some of the ideas discussed in this forum, along with a few of my own, into a coherent alternative that is hopefully not so paradox-prone.

I'll start with a way of framing the question. Put yourself in the place of an AI, or more specifically, the decision algorithm of an AI. You have access to your own source code S, plus a bit string X representing all of your memories and sensory data. You have to choose an output string Y. That’s the decision. The question is, how? (The answer isn't “Run S,” because what we want to know is what S should be in the first place.)

Let’s proceed by asking the question, “What are the consequences of S, on input X, returning Y as the output, instead of Z?” To begin with, we'll consider just the consequences of that choice in the realm of abstract computations (i.e. computations considered as mathematical objects rather than as implemented in physical systems). The most immediate consequence is that any program that calls S as a subroutine with X as input, will receive Y as output, instead of Z. What happens next is a bit harder to tell, but supposing that you know something about a program P that call S as a subroutine, you can further deduce the effects of choosing Y versus Z by tracing the difference between the two choices in P’s subsequent execution. We could call these the computational consequences of Y. Suppose you have preferences about the execution of a set of programs, some of which call S as a subroutine, then you can satisfy your preferences directly by choosing the output of S so that those programs will run the way you most prefer.

A more general class of consequences might be called logical consequences. Consider a program P’ that doesn’t call S, but a different subroutine S’ that’s logically equivalent to S. In other words, S’ always produces the same output as S when given the same input. Due to the logical relationship between S and S’, your choice of output for S must also affect the subsequent execution of P’. Another example of a logical relationship is an S' which always returns the first bit of the output of S when given the same input, or one that returns the same output as S on some subset of inputs.

In general, you can’t be certain about the consequences of a choice, because you’re not logically omniscient. How to handle logical/mathematical uncertainty is an open problem, so for now we'll just assume that you have access to a "mathematical intuition subroutine" that somehow allows you to form beliefs about the likely consequences of your choices.

At this point, you might ask, “That’s well and good, but what if my preferences extend beyond abstract computations? What about consequences on the physical universe?” The answer is, we can view the physical universe as a program that runs S as a subroutine, or more generally, view it as a mathematical object which has S embedded within it. (From now on I’ll just refer to programs for simplicity, with the understanding that the subsequent discussion can be generalized to non-computable universes.) Your preferences about the physical universe can be translated into preferences about such a program P and programmed into the AI. The AI, upon receiving an input X, will look into P, determine all the instances where it calls S with input X, and choose the output that optimizes its preferences about the execution of P. If the preferences were translated faithfully, the the AI's decision should also optimize your preferences regarding the physical universe. This faithful translation is a second major open problem.

What if you have some uncertainty about which program our universe corresponds to? In that case, we have to specify preferences for the entire set of programs that our universe may correspond to. If your preferences for what happens in one such program is independent of what happens in another, then we can represent them by a probability distribution on the set of programs plus a utility function on the execution of each individual program. More generally, we can always represent your preferences as a utility function on vectors of the form <E1, E2, E3, …> where E1 is an execution history of P1, E2 is an execution history of P2, and so on.

These considerations lead to the following design for the decision algorithm S. S is coded with a vector <P1, P2, P3, ...> of programs that it cares about, and a utility function on vectors of the form <E1, E2, E3, …> that defines its preferences on how those programs should run. When it receives an input X, it looks inside the programs P1, P2, P3, ..., and uses its "mathematical intuition" to form a probability distribution P_Y over the set of vectors <E1, E2, E3, …> for each choice of output string Y. Finally, it outputs a string Y* that maximizes the expected utility Sum P_Y(<E1, E2, E3, …>) U(<E1, E2, E3, …>). (This specifically assumes that expected utility maximization is the right way to deal with mathematical uncertainty. Consider it a temporary placeholder until that problem is solved. Also, I'm describing the algorithm as a brute force search for simplicity. In reality, you'd probably want it to do something cleverer to find the optimal Y* more quickly.)

Example 1: Counterfactual Mugging

Note that Bayesian updating is not done explicitly in this decision theory. When the decision algorithm receives input X, it may determine that a subset of programs it has preferences about never calls it with X and are also logically independent of its output, and therefore it can safely ignore them when computing the consequences of a choice. There is no need to set the probabilities of those programs to 0 and renormalize.

So, with that in mind, we can model Counterfactual Mugging by the following Python program:

def P(coin):
    AI_balance = 100
    if coin == "heads":
        if S("heads") == "give $100":
            AI_balance -= 100
    if coin == "tails":
        if Omega_Predict(S, "heads") == "give $100":
            AI_balance += 10000

The AI’s goal is to maximize expected utility = .5 * U(AI_balance after P("heads")) + .5 * U(AI_balance after P("tails")). Assuming U(AI_balance)=AI_balance, it’s easy to determine U(AI_balance after P("heads")) as a function of S’s output. It equals 0 if S(“heads”) == “give $100”, and 100 otherwise. To compute U(AI_balance after P("tails")), the AI needs to look inside the Omega_Predict function (not shown here), and try to figure out how accurate it is. Assuming the mathematical intuition module says that choosing “give $100” as the output for S(“heads”) makes it more likely (by a sufficiently large margin) for Omega_Predict(S, "heads") to output “give $100”, then that choice maximizes expected utility.

Example 2: Return of Bayes

This example is based on case 1 in Eliezer's post Priors as Mathematical Objects. An urn contains 5 red balls and 5 white balls. The AI is asked to predict the probability of each ball being red as it as drawn from the urn, its goal being to maximize the expected logarithmic score of its predictions. The main point of this example is that this decision theory can reproduce the effect of Bayesian reasoning when the situation calls for it. We can model the scenario using preferences on the following Python program:

def P(n):
    urn = ['red', 'red', 'red', 'red', 'red', 'white', 'white', 'white', 'white', 'white']
    history = []
    score = 0
    while urn:
        i = n%len(urn)
        n = n/len(urn)
        ball = urn[i]
        urn[i:i+1] = []
        prediction = S(history)
        if ball == 'red':
            score += math.log(prediction, 2)
        else:
            score += math.log(1-prediction, 2)
        print (score, ball, prediction)
        history.append(ball)

Here is a printout from a sample run, using n=1222222:

-1.0 red 0.5
-2.16992500144 red 0.444444444444
-2.84799690655 white 0.375
-3.65535182861 white 0.428571428571
-4.65535182861 red 0.5
-5.9772799235 red 0.4
-7.9772799235 red 0.25
-7.9772799235 white 0.0
-7.9772799235 white 0.0
-7.9772799235 white 0.0

S should use deductive reasoning to conclude that returning (number of red balls remaining / total balls remaining) maximizes the average score across the range of possible inputs to P, from n=1 to 10! (representing the possible orders in which the balls are drawn), and do that. Alternatively, S can approximate the correct predictions using brute force: generate a random function from histories to predictions, and compute what the average score would be if it were to implement that function. Repeat this a large number of times and it is likely to find a function that returns values close to the optimum predictions.

Example 3: Level IV Multiverse

In Tegmark's Level 4 Multiverse, all structures that exist mathematically also exist physically. In this case, we'd need to program the AI with preferences over all mathematical structures, perhaps represented by an ordering or utility function over conjunctions of well-formed sentences in a formal set theory. The AI will then proceed to "optimize" all of mathematics, or at least the parts of math that (A) are logically dependent on its decisions and (B) it can reason or form intuitions about.

I suggest that the Level 4 Multiverse should be considered the default setting for a general decision theory, since we cannot rule out the possibility that all mathematical structures do indeed exist physically, or that we have direct preferences on mathematical structures (in which case there is no need for them to exist "physically"). Clearly, application of decision theory to the Level 4 Multiverse requires that the previously mentioned open problems be solved in their most general forms: how to handle logical uncertainty in any mathematical domain, and how to map fuzzy human preferences to well-defined preferences over the structures of mathematical objects.

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 6:50 PM
Select new highlight date
All comments loaded

There's lots of mentions of Timeless Decision Theory (TDT) in this thread - as though it refers to something real. However, AFAICS, the reference is to unpublished material by Eliezer Yudkowsky.

I am not clear about how anyone is supposed to make sense of all these references before that material has been published. To those who use "TDT" as though they know what they are talking about - and who are not Eliezer Yudkowsky - what exactly is it that you think you are talking about?

1) Congratulations: moving to logical uncertainty and considering your decision's consequences to be the consequence of that logical program outputting a particular decision, is what I would call the key insight in moving to (my version of) timeless decision theory. The rest of it (that is, the work I've done already) is showing that this answer is the only reflectively consistent one for a certain class of decision problems, and working through some of the mathematical inelegancies in mainstream decision theory that TDT seems to successfully clear up and render elegant (the original Newcomb's Problem being only one of them).

Steve Rayhawk also figured out that it had to do with impossible possible worlds.

Neither of you have arrived at (published?) some important remaining observations about how to integrate uncertainty about computations into decision-theoretic reasoning; so if you want to completely preempt my would-be PhD thesis you've still got a bit more work to do.

Why didn't you mention earlier that your timeless decision theory mainly had to do with logical uncertainty? It would have saved people a lot of time trying to guess what you were talking about.

Looking at my 2001 post, it seems that I already had the essential idea at that time, but didn't pursue very far. I think it was because (A) I wasn't as interested in AI back then, and (B) I thought an AI ought to be able to come up with these ideas by itself.

I still think (B) is true, BTW. We should devote some time and resources to thinking about how we are solving these problems (and coming up with questions in the first place). Finding that algorithm is perhaps more important than finding a reflectively consistent decision algorithm, if we don't want an AI to be stuck with whatever mistakes we might make.

Why didn't you mention earlier that your timeless decision theory mainly had to do with logical uncertainty?

Because I was thinking in terms of saving it for a PhD thesis or some other publication, and if you get that insight the rest follows pretty fast - did for me at least. Also I was using it as a test for would-be AI researchers: "Here's Newcomblike problems, here's why the classical solution doesn't work for self-modifying AI, can you solve this FAI problem which I know to be solvable?"

I still think (B) is true, BTW. We should devote some time and resources to thinking about how we are solving these problems (and coming up with questions in the first place). Finding that algorithm is perhaps more important than finding a reflectively consistent decision algorithm, if we don't want an AI to be stuck with whatever mistakes we might make.

And yet you found a reflectively consistent decision algorithm long before you found a decision-system-algorithm-finding algorithm. That's not coincidence. The latter problem is much harder. I suspect that even an informal understanding of parts of it would mean that you could find timeless decision theory as easily as falling backward off a tree - you just run the algorithm in your own head. So with vey high probability you are going to start seeing through the object-level problems before you see through the meta ones. Conversely I am EXTREMELY skeptical of people who claim they have an algorithm to solve meta problems but who still seem confused about object problems. Take metaethics, a solved problem: what are the odds that someone who still thought metaethics was a Deep Mystery could write an AI algorithm that could come up with a correct metaethics? I tried that, you know, and in retrospect it didn't work.

The meta algorithms are important but by their very nature, knowing even a little about the meta-problem tends to make the object problem much less confusing, and you will progress on the object problem faster than on the meta problem. Again, that's not saying the meta problem is important. It's just saying that it's really hard to end up in a state where meta has really truly run ahead of object, though it's easy to get illusions of having done so.

Now that I have some idea what Eliezer and Nesov were talking about, I'm still a bit confused about AI cooperation. Consider the following scenario: Omega appears and asks two human players (who are at least as skilled as Eliezer and Nesov) to each design an AI. The AIs will each undergo some single-player challenges like Newcomb's Problem and Counterfactual Mugging, but there will be a one-shot PD between the two AIs at the end, with their source codes hidden from each other. Omega will grant each human player utility equal to the total score of his or her AI. Will the two AIs play cooperate with each other?

I don't think it's irrational for human players to play defect in one-shot PD. So let's assume these two human players would play defect in one-shot PD. Then they should also program their AIs to play defect, even if they have to add an exception to their timeless/updateless decision algorithms. But exceptions are bad, so what's the right solution here?

I'm still quite confused, but I'll report my current thoughts in case someone can help me out. Suppose we take it as an axiom that an AI's decision algorithm shouldn't need to contain any hacks to handle exceptional situations. Then the following "exceptionless" decision algorithm seems to pop out immediately: do what my creator would want me to do. In other words, upon receiving input X, S computes the following: suppose S's creator had enough time and computing power to create a giant lookup table that contains an optimal output for every input S might encounter, what would the entry for X be? Return that as the output.

This algorithm correctly solves Counterfactual Mugging, since S's creator would want it to output "give $100", since "give $100" would have maximized the creator's expected utility at the time of coding S. It also solves the problem posed by Omega in the parent comment. It seems to be reflectively consistent. But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

There are two parts to AGI: consequentialist reasoning and preference.

Humans have feeble consequentialist abilities, but can use computers to implement huge calculations, if the problem statement can be entered in the computer. For example, you can program the material and mechanical laws in an engineering application, enter a building plan, and have the computer predict what's going to happen to it, or what parameters should be used in the construction so that the outcome is as required. That's the power outside human mind, directed by the correct laws, and targeted at the formally specified problem.

When you consider AGI in isolation, it's like an engineering application with a random building plan: it can powerfully produce a solution, but it's not a solution to the problem you need solving. Nonetheless, this part is essential when you do have an ability to specify the problem. And that's the AI's algorithm, one aspect of which is decision-making. It's separate from the problem statement that comes from human nature.

For an engineering program, you can say that the computer is basically doing what a person would do if they had crazy amount of time and machine patience. But that's because a person can know both problem statement and laws of inference formally, which is the way it was programmed in the computer in the first place.

With human preference, the problem statement isn't known explicitly to people. People can use preference, but can't state this whole object explicitly. A moral machine would need to work with preference, but human programmers can't enter it, and neither can they do what a machine would be able to do given a formal problem statement, because humans can't know this problem statement, it's too big. It could exist in a computer explicitly, but it can't be entered there by programmers.

So, here is a dilemma: problem statement (preference) resides in the structure of human mind, but the strong power of inference doesn't, while the strong power of inference (potentially) exists in computers outside human minds, where the problem statement can't be manually transmitted. Creating FAI requires these components to meet in the same system, but it can't be done in a way other kinds of programming are done.

Something to think about.

This is the clearest statement of the problem FAI that I have read to date.

This algorithm . . . seems to be reflectively consistent. But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

Suppose that, before S's creator R started coding, Omega started an open game of counterfactual mugging with R, and that R doesn't know this, but S does. According to S's inputs, Omega's coin came up tails, so Omega is waiting for $100.

Does S output "give $0"? If Omega had started the game of counterfactual mugging after S was coded, then S would output "give $100".

Suppose that S also knows that R would have coded S with the same source code, even if Omega's coin had come up heads. Would S's output change? Should S's output change (should R have coded S so that this would change S's output)? How should S decide, from its inputs, which R is the creator with the expected utility S's outputs should be optimal for? Is it the R in the world where Omega's coin came up heads, or the R in the world where Omega's coin came up tails?

If there is not an inconsistency in S's decision algorithm or S's definition of R, is there an inconsistency in R's decision algorithm or R's own self-definition?

"Do what my creator would want me to do"?

We could call that "pass the buck" decision theory ;-)

But what is the relationship between this "exceptionless" algorithm and the timeless/updateless decision algorithm?

Here's my conjecture: An AI using the Exceptionless Decision Theory (XDT) is equivalent to one using TDT if its creator was running TDT at the time of coding. If the creator was running CDT, then it is not equivalent to TDT, but it is reflectively consistent, one-boxes in Newcomb, and plays defect in one-shot PD.

And in case it wasn't clear, in XDT, the AI computes the giant lookup table its creator would have chosen using the creator's own decision theory.

AI's creator was running BRAINS, not a decision theory. I don't see how "what the AI's creator was running" can be a meaningful consideration in a discussion of what constitutes a good AI design. Beware naturalistic fallacy.

Although I still have not tried to decipher what "Timeless Decision Theory" or "Updateless Decision Theory" is actually about, I would like to observe that it is very unlikely that the "timeless" aspect, in the sense of an ontology which denies the reality of time, change, or process, is in any way essential to how it works.

If you have a Julian-Barbour-style timeless wavefunction of the universe, which associates an amplitude with every point in a configuration space of spacelike states of the universe, you can always construct histories using Bohm's formula of following the configuration-space gradient of the complex phase of the wavefunction.

I don't actually advocate Bohmian mechanics, I'd prefer something more like "quantum causal histories" a la Fotini Markopoulou, but looking even more like a background-independent cellular automaton. However, the ever-present Bohmian option should demonstrate that there is no particular intrinsic necessity to the abandonment of time. And basic subjective experience, phenomenology if you will, shows that change is real and about as basic as existence itself. The denial of time is a classic case of people denying what's right in front of them because of absorption in a theory or belief that there is no alternative. I'd basically attribute it to love of the power of objectifying thought to explain things: I can map the events of reality onto a mathematical structure which is static at least in my mind, that structure has enormous clarifying and predictive power, so therefore reality must be static, i.e. there is no time.

So I think the fashion hereabouts for denying the reality of time is a basic error. But it's a hard one to argue against because it requires some detachment from the intellectual drive to formalize and objectify everything which is just about synonymous with rationality and truthseeking in the local worldview, and requires instead that one spend a bit of time being a phenomenologist, reflecting on the nature of experience without preconceptions, and noticing that, yes, it's there and it flows, and maybe it's a mistake to call the flow an illusion just because your basic intellectual method is about mapping the world onto static ideal forms.

However, my real point here is not to argue against the ontology of timelessness. It is to suggest that the basic features of Timeless Decision Theory, whatever they may be, may actually be logically independent of the assumption of a timeless reality; and that it might be worth someone's time to re-express the theory in a language which does not presuppose timelessness. It would be a shame to see a basic innovation in decision theory unnecessarily bound to a particular wrong ontology.

AFAIK, Timeless Decision Theory doesn't have anything to say about the reality of time, only that decisions shouldn't vary depending on the time when they are considered.

Thanks for twisting my mind in the right direction with the S' stuff. I hereby submit the following ridiculous but rigorous theory of Newcomblike problems:

You submit a program that outputs a row number in a payoff matrix, and a "world program" simultaneously outputs a column number in the same matrix; together they determine your payoff. Your program receives the source code of the world program as an argument. The world program doesn't receive your source code, but it contains some opaque function calls to an "oracle" that's guaranteed to return your future output. For example, in Newcomb's Problem the world decides to put $1M in the big box iff the oracle says you will one-box.

You have no way to simulate the world and cause paradoxes, so any run of this game will be consistent. Your only recourse is "conditional simulation": for each of your possible choices, substitute it in place of the oracle call and simulate the world under this assumption, then pick the best option. When applied to Newcomb's Problem, this general algorithm leads to one-boxing. Note there's no infinite recursion involved on either side: your program doesn't ever attempt to simulate the oracle because it's opaque. And the final touch: with infinite recursion thus banished, the oracle can actually be implemented as an ordinary simulator that obtains your source code by peeking through some backdoor in the tournament setting.

This formalization looks totally obvious in retrospect and captures a lot of my common-sense intuitions about Newcomb's. I wonder why people didn't mention it earlier.