Open Problems Related to Solomonoff Induction

Solomonoff Induction seems clearly "on the right track", but there are a number of problems with it that I've been puzzling over for several years and have not made much progress on. I think I've talked about all of them in various comments in the past, but never collected them in one place.

Apparent Unformalizability of “Actual” Induction

Argument via Tarski’s Indefinability of Truth

Informally, the theorem states that arithmetical truth cannot be defined in arithmetic. The theorem applies more generally to any sufficiently strong formal system, showing that truth in the standard model of the system cannot be defined within the system.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

Argument via Berry’s Paradox

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

Is Solomonoff Induction “good enough”?

Given the above, is Solomonoff Induction nevertheless “good enough” for practical purposes? In other words, would an AI programmed to approximate Solomonoff Induction do as well as any other possible agent we might build, even though it wouldn’t have what we’d consider correct beliefs?

Is complexity objective?

Solomonoff Induction is supposed to be a formalization of Occam’s Razor, and it’s confusing that the formalization has a free parameter in the form of a universal Turing machine that is used to define the notion of complexity. What’s the significance of the fact that we can’t seem to define a parameterless concept of complexity? That complexity is subjective?

Is Solomonoff an ideal or an approximation?

Is it the case that the universal prior (or some suitable generalization of it that somehow overcomes the above "unformalizability problems") is the “true” prior and that Solomonoff Induction represents idealized reasoning, or does Solomonoff just “work well enough” (in some sense) at approximating any rational agent?

How can we apply Solomonoff when our inputs are not symbol strings?

Solomonoff Induction is defined over symbol strings (for example bit strings) but our perceptions are made of “qualia” instead of symbols. How is Solomonoff Induction supposed to work for us?

What does Solomonoff Induction actually say?

What does Solomonoff Induction actually say about, for example, whether we live in a creatorless universe that runs on physics? Or the Simulation Argument?

Comments

sorted by
magical algorithm
Highlighting new comments since Today at 3:15 AM
Select new highlight date
Rendering 50/106 comments  show more

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

Solomonoff Induction would not consider this a description of x as it cannot be used to compute x.

I guess I failed to bridge the inferential gap here. You're right "Solomonoff Induction would not consider this a description of x as it cannot be used to compute x." The point I'm trying to make is that a correct formalization of induction/Occam's Razor ought (in order to satisfy our intuitions) to assign x some probability > 1/3^^^3 since it has a short (even if uncomputable) description, but a formalization based on this P would not, therefore it can't be a correct formalization. But this is the case no matter what P we use (with different x depending on P), hence the "unformalizability" problem.

Could you explain further? Why "ought" it assign it such a probability? As stated, this seems more convincing as an argument that it "ought not" to assign a probability > 1/3^^^3, despite the short "description".

The idea is, however we define P, how can we be that sure that there isn't some kind of uncomputable physics that would allow someone to build a device that can find the lexicographically least object x such that P(x) < 1/3^^^3 and present it us?

Maybe we should just work off of the assumption that there's no relevant uncomputable physics, because if there were, we should probably give up our endeavors anyways, unless we knew how to model an uncomputable reality within a computable AGI-program. As Schmidhuber ever so aptly wrote on his homepage:

The distribution should at least be computable in the limit. That is, there should exist a program that takes as an input any beginning of the universe history as well as a next possible event, and produces an output converging on the conditional probability of the event. If there were no such program we could not even formally specify our universe, leave alone writing reasonable scientific papers about it.

unless we knew how to model an uncomputable reality within a computable AGI-program

You could start out by trying to understand how an AI might invent the concept of uncomputability, and how it might then proceed to the possibility of uncomputable physics. And one way to get started here is by thinking as a cognitive historian, and asking how humans came up with the concept of uncomputability.

That gives you a probability inversely proportional to the integral of e^-(the description length) of each number from 0 to infinity. Complexity grows like log(n)-ish. It's an improper prior.

So I agree with Adele that this may be a modus tollens / modus ponens moment.

Are you essentially saying Solomonoff Induction could be inferior to some other meta-algorithm (or whatever you call it) in uncomputable universes?

Yes. ETA: I'm also making the stronger point that any formalization of induction we can come up with would have a similar problem. (Or at least the kinds of formalizations covered by my arguments.)

Technically not if P is sufficiently long and complex.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability.

It seems to me that the paradox may lie within this problem setup, not within the agent doing the induction.

We first consider that, rather than this device being assigned zero probability, it should actually be inconceivable to the agent - there should not be a finitely describable thingy that the agent assigns zero probability of having a finitely describable property.

Why would an agent using a second-order analogue of Solomonoff induction have such conceptual problems? Well, considering how Tarski's original undefinability theorems worked, perhaps what goes wrong is this: we want to believe that the device outputs the truth of statements about the universe. But we also want to believe this device is in the universe. So what happens if we ask the device, "Does the universe entail the sentence stating that outputs 'No' in response to a question which looks like ?"

Thus, such a device is inconceivable in the first place since it has no consistent model, and we are actually correct to assign zero probability to the alien's assertion that the device produces correct questions to all questions about the universe including questions about the device itself.

we want to believe that the device outputs the truth of statements about the universe

I'm not sure why you say this, because the device is supposed to output the truth of statements about some second-order logic, not about the universe. The device is not describable by the second-order logic via Tarski, so if the device is in the universe, the universe must only be describable by some meta-logic, which implies the device is outputting truth of statements about something strictly simpler than the universe. The agent ought to be able to conceive of this...

Is complexity objective?

This one bother me as well. Turing machine basis for complexity favours certain types of theories over others. For example, how many bits of turing machine does it take to predict, say, maxwells equations, which involve continuous fields?

(warning: original "research" following)

One temptation is to sweep this under the rug by saying that the complexity required to talk about continuous math is just a constant-sized interpreter that you put on the turing machine, but this can't get around the fact that some thoeries won't need that.

(BTW, is continuous math even exactly turing-computable?)

If we try to generalize and say "turing machines suck, we'll use some other language", it becomes clear that all langauges will have some bias. English has a bias for agenty mind-stuff, turing machines have a bias for discrete cellular automata, some imaginable basis language will have a bias for simple continuous mathematics.

I think the basis that is closest to correct would be one based on a hypothetical real-number continuous-memory hypercomputer (for obvious reasons). This of course implies that I have learned this thing, which implies some deeper form of induction over forms of induction, that would make a language that had only one operator that was just "physics" take appropriate penalization. And then thinking about that idea reveals that inducting over langauges and inducting over theories in languages is suspiciously similar.

That tempts me to say that they should not be seperate, but then I still think the general utility of continuous math should not have to be proven for every theory that uses it, which implies some sharing of complexity somehow. But that seems like nonsense, SI already handles that, I think (by each thoery being a completely defactored package deal). Now I think it would be good to find a new SI that can handle factoring (for algorithmic reasons), but maybe that is a step down the "how do we actually do induction" road, which is not what SI is about.

Thinking over this all again, it seems starting with any form of induction that sortof works will eventually come to the right answer. The hypothetical perfect form of induction that would be the best a learning machine could start with is just the actual answer. Again there's that duality (is that the right word) between theories and induction priors.

This looks to me alot like some kind of iterative improvement algorithm, where you have to start somewhere, and where you start is chosen from the same domain as your answer (think newton's method, or gradient descent). It seems like even starting with some human language (like english), you would eventually learn (as we have done) that building a math-interpreter is the thing to do.

Ok, that is the end of my confusion dump.

My unproven, badly specified idea that has something to do with induction:

Ok, background: All turing-complete languages can build interpreters for all others. This implies some traversable languagespace (which is a subset of programspace) where the distance between two points is the length of the interpreter to get from one language to the other. We will also want to be able to go backwards, so that we can go from any point in programspace to any other (by backing up to a good language, and then going forward to new program). Going backwards means building this point from some point in languagespace, rather than building some point in programspace from this point in languagespace) Points in programspace are defined over behavior, so the same program will be reachable from multiple languages.

(the lisp-like area gets a lot of traffic thru it. ("any sufficiently complex program contains ... half of common lisp"))

Ok, so now we have a programspace with distances measured in bits. We could just put a probability distribution over the subset of programspace that makes predictions. Solomonof induction does this, with initial improbability defined by distance from turing-machine-langauge.

I think there might be some way to use this programspace idea for induction other than just being another way to look at SI programs. I'm imagining probability or something flowing around in this space updating our concept of simplicity so that when we see that newtonian mechanics, GR, and quantum mechanics are all closely reachable from some mathy point in languagespace, we then upgrade the probability of math-reachable thoeries, without even talking about what they predict. I'm also imagining if we build an AI to do induction, we say "here are some interesting points in programspace (newton, maxwell, SR, GR, quantum) that have served us well" and letting it figure out a distribution over programspace instead of "start from turing-machine-language".

I have more ideas and I fear I am not communicating this very well.

I think this interpreter-traversable programspace is a good concept to investigate for trying to make SI more objective. Even if we just stick with something like SI, it seems like an interesting concept.

yeah...

The thing that you're describing is a lot like finding the stationary distribution of a Markov chain. Not all Markov chains have stationary distributions and I can't tell whether this one would, though if it does have one it would be unique. It's an interesting idea.

I should also note that it is not necessary to do anything like this to preform induction over multiple theories. Instead, we can just require a program that outputs known theories in addition to the new theory. We can ask them to be output in any form, such as predictions or source code, since converting source code to predictions is O(1) in program size. Then, if the different theories have more in common, the program computing them all will be shorter than one that must seperately specify elements of very different theories. A slightly different implementation of the same general principle is Schmidhuber's Optimal Ordered Problem Solver.

Some continuous math is commutable, some is not. There are uncomputable real numbers. Computable number.

This is the second mention of second-order logical Solomonoff Induction, but I can't imagine what such a thing would look like.

It's Luke and Eliezer's term, but I guess the idea is similar to the one I had. You take some second-order theory, and let each string in the formal language that constitutes a description of X contribute n^-l to the probability of X, where n is the size of the alphabet, and l is the length of the string. Use the resulting distribution in place of the universal prior in Solomonoff Induction.

All universal Turing machines can simulate each other with logarithmic slowdown. Saying that the parameter means that complexity "subjective" is like saying the time complexity of Quicksort is "subjective" because the algorithm doesn't specify which programming language to implement it in.

For aliens with a halting oracle:

Suppose the aliens have this machine that may or may not be a halting oracle. We give them a few Turing machine programs and they decide which ones halt and which ones don't. Then we run the programs. Sure enough, none of the ones they say run forever halt, and some of them they say don't run forever will halt at some point. Suppose we repeat this process a few times with different programs.

Now what method should we use to predict the point at which new programs halt? The best strategy seems to be to ask the aliens which ones halt, give the non-halting ones a 0 probability of halting at every step, and give the other ones some small nonzero probability of halting on every step. Of course this strategy only works if the aliens actually have a halting oracle so it its predictions should be linearly combined with a fallback strategy.

I think that Solomonoff induction will find this strategy, because the hypothesis that the aliens have a true halting oracle is formalizable. Here's how: we learn a function from aliens' answers -> distribution over when the programs halt. We can use the strategy of predicting that the ones that the aliens say don't halt don't halt, and using some fallback mechanism for predicting the ones that do halt. This strategy is computable so Solomonoff induction will find it.

For Berry's paradox:

This is a problem with every formalizable probability distribution. You can always define a sequence that says: see what bit the predictor predicts with a higher probability, then output the opposite. Luckily Solomonoff induction does the best it can by having the estimates converge to 50%. I don't see what a useful solution to this problem would even look like; it seems best to just treat the uncomputable sequence as subjectively random.

If I understand Solomonoff Induction correctly, for all n and p the the sum of the probabilities of all the hypotheses of length n equals the sum of the probabilities of hypotheses of length p. If this is the case, what normalization constant could you possibility use to make all the probabilities sum to one? It seems there are none.

Use a prefix-free encoding for the hypotheses. There's not 2^n hypotheses of length n: Some of the length-n bitstrings are incomplete and you'd need to add more bits in order to get a hypothesis; others are actually a length <n hypothesis plus some gibberish on the end.

Then the sum of the probabilities of all programs of all lengths combined is 1.0. After excluding the programs that don't halt, the normalization constant is Chaitin's Omega.

Unfortunately Chaitin's Omega's incomputable, but even if it wasn't I don't see how it would work as a normalizing constant. Chaitin's Omega is a real number, there is an infinite number of hypotheses, and (IIRC) there is no real number r such that r multiplied by infinite equals one, so I don't see how Chaitin's Omega could possible work as a normalizing constant.

Consider an arbitrary probability distribution P, and the smallest integer (or the lexicographically least object) x such that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short description, a universal distribution shouldn't assign it such a low probability, but P does, so P can't be a universal distribution.

The description of x has to include the description of P, and that has to be computable if a universal distribution is going to assign positive probability to x.

If P has a short computable description, then yes, you can conclude that P is not a universal distribution. Universal distributions are not computable.

If the shortest computable description of P is long, then you can't conclude from this argument that P is not a universal distribution, but I suspect that it still can't be a universal distribution, since P is computable.

If there is no computable description of P, then we don't know that there is a computable description of x, so you have no contradiction to start with.

Suppose we define a generalized version of Solomonoff Induction based on some second-order logic. The truth predicate for this logic can’t be defined within the logic and therefore a device that can decide the truth value of arbitrary statements in this logical has no finite description within this logic. If an alien claimed to have such a device, this generalized Solomonoff induction would assign the hypothesis that they're telling the truth zero probability, whereas we would assign it some small but positive probability

Actually, what Tarski seems to show is that for any language for describing any set of universes, there just is no language representable inside those universes for representing arbitrary statements, with truth values, about "everything" including the language and the statements in it. If you try to invent such a language, it will end up inconsistent - not at the point where it tries to correctly assign truth, but at the point where it can represent truth, due to analogues of "This statement is false." It isn't needful to assign 0 or 1, in particular, to this statement; the moment you represent it, you can prove an inconsistency. But is this really proper to blame on Solomonoff induction?

How can we apply Solomonoff when our inputs are not symbol strings?

Inputs can always be represented by bit sequences. Qualia are an irrelevance. Q.E.D.