Kevin T. Kelly's Ockham Efficiency Theorem

There is a game studied in Philosophy of Science and Probably Approximately Correct (machine) learning. It's a cousin to the Looney Labs game "Zendo", but less fun to play with your friends. http://en.wikipedia.org/wiki/Zendo_(game) (By the way, playing this kind of game is excellent practice at avoiding confirmation bias.) The game has two asymmetric players. One player plays Nature, and the other plays Science. First Nature makes up a law, a specific Grand Unified Theory, and then Science tries to guess it. Each round, Nature provides some information about the law, and then Science may change its guess. Science wins if its guesses converge to the law that Nature made up.

Kevin T. Kelly is a philosopher of science, and studies (among other things) the justification within the philosophy of science for William of Occam's razor: "entities should not be multiplied beyond necessity". The way that he does this is by proving theorems about the Nature/Science game with all of the details elaborated.

Why should you care? Firstly, his justification is different from the overfitting justification that sometimes shows up in Bayesian literature. Roughly speaking, the overfitting justification characterizes our use of Occam's razor as pragmatic - we use Occam's razor to do science because we get good generalization performance from it. If we found something else (e.g. boosting and bagging, or some future technique) that proposed oddly complex hypotheses, but achieved good generalization performance, we would switch away from Occam's razor.

An aside regarding boosting and bagging: These are ensemble machine learning techniques. Suppose you had a technique that created decision tree classifiers, such as C4.5 (http://en.wikipedia.org/wiki/C4.5_algorithm), or even decision stumps (http://en.wikipedia.org/wiki/Decision_stump). Adaboost (http://en.wikipedia.org/wiki/Adaboost) would start by weighting all of the examples identically and invoking your technique to find an initial classifier. Then it would reweight the examples, prioritizing the ones that the first iteration got wrong, and invoke your technique on the reweighted input. Eventually, Adaboost outputs an ensemble of decision trees (or decision stumps), and taking the majority opinion of the ensemble might well be more effective (generalize better beyond training) than the original classifier. Bagging is similar (http://en.wikipedia.org/wiki/Bootstrap_aggregating).
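To make the reweight-and-vote loop concrete, here is a minimal Python sketch of discrete AdaBoost over decision stumps. It is illustrative only - labels are assumed to be -1/+1, the stump search is brute force, and the names are mine - not the exact variant described in the links above.

    import numpy as np

    def fit_stump(X, y, w):
        # Brute-force search for the decision stump (feature, threshold, sign)
        # with the lowest weighted error; labels y are assumed to be -1/+1.
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] <= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        return best

    def stump_predict(X, j, t, sign):
        return sign * np.where(X[:, j] <= t, 1, -1)

    def adaboost(X, y, rounds=20):
        # Discrete AdaBoost: reweight the examples each round so the next stump
        # concentrates on the points the previous stumps got wrong.
        n = len(y)
        w = np.full(n, 1.0 / n)                     # start with uniform weights
        ensemble = []
        for _ in range(rounds):
            err, j, t, sign = fit_stump(X, y, w)
            err = max(err, 1e-12)
            alpha = 0.5 * np.log((1 - err) / err)   # this stump's voting weight
            pred = stump_predict(X, j, t, sign)
            w *= np.exp(-alpha * y * pred)          # upweight the mistakes
            w /= w.sum()
            ensemble.append((alpha, j, t, sign))
        return ensemble

    def predict(ensemble, X):
        # Weighted majority vote of the stumps.
        score = sum(a * stump_predict(X, j, t, s) for a, j, t, s in ensemble)
        return np.sign(score)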

Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

Secondly, his alternative justification bears on ordering hypotheses "by simplicity", providing an alternative to (approximations of) Kolmogorov complexity as a foundation of science.

Let's take a philosopher's view of physics - from a distance, a philosopher listens to the particle physicists and hears: "There are three fundamental particles that make up all matter!", "We've discovered another, higher-energy particle!", "Another, even higher-energy particle!". There is no reason the philosopher can see why this should stop at any particular number of particles. What should the philosopher believe at any given moment?

This is a form of the Nature vs. Science game where Nature's "Grand Unified Theory" is known to be a nonnegative number (of particles), and at each round Nature can reveal a new fundamental particle (by name) or remain silent. What is the philosopher of science's strategy? Occam's razor suggests that we prefer simpler hypotheses, but if there were 254 known particles, how would we decide whether to claim that there are 254, 255, or 256 particles? Note that in many encodings, 255 and 256 are quite round numbers and therefore have short descriptions (low Kolmogorov complexity).

An aside regarding "round numbers": Here are some "shortest expressions" of some numbers, according to one possible grammar of expressions. (Kolmogorov complexity is unique up to an additive constant, but since we never actually use Kolmogorov complexity, that isn't particularly helpful.)

1 == (1)

2 == (1+1)

3 == (1+1+1)

4 == (1+1+1+1)

5 == (1+1+1+1+1)

6 == (1+1+1+1+1+1)

7 == (1+1+1+1+1+1+1)

8 == ((1+1)*(1+1+1+1))

9 == ((1+1+1)*(1+1+1))

10 == ((1+1)*(1+1+1+1+1))

11 == (1+1+1+1+1+1+1+1+1+1+1)

12 == ((1+1+1)*(1+1+1+1))

13 == ((1)+((1+1+1)*(1+1+1+1)))

14 == ((1+1)*(1+1+1+1+1+1+1))

15 == ((1+1+1)*(1+1+1+1+1))

16 == ((1+1+1+1)*(1+1+1+1))

17 == ((1)+((1+1+1+1)*(1+1+1+1)))

18 == ((1+1+1)*(1+1+1+1+1+1))

19 == ((1)+((1+1+1)*(1+1+1+1+1+1)))

20 == ((1+1+1+1)*(1+1+1+1+1))

21 == ((1+1+1)*(1+1+1+1+1+1+1))

22 == ((1)+((1+1+1)*(1+1+1+1+1+1+1)))

24 == ((1+1+1+1)*(1+1+1+1+1+1))

25 == ((1+1+1+1+1)*(1+1+1+1+1))

26 == ((1)+((1+1+1+1+1)*(1+1+1+1+1)))

27 == ((1+1+1)*((1+1+1)*(1+1+1)))

28 == ((1+1+1+1)*(1+1+1+1+1+1+1))

30 == ((1+1+1+1+1)*(1+1+1+1+1+1))

32 == ((1+1)*((1+1+1+1)*(1+1+1+1)))

35 == ((1+1+1+1+1)*(1+1+1+1+1+1+1))

36 == ((1+1+1)*((1+1+1)*(1+1+1+1)))

40 == ((1+1)*((1+1+1+1)*(1+1+1+1+1)))

42 == ((1+1+1+1+1+1)*(1+1+1+1+1+1+1))

45 == ((1+1+1)*((1+1+1)*(1+1+1+1+1)))

As you can see, there are several inversions, where a larger number (in magnitude) is "simpler" in that it has a shorter shortest expression (in this grammar). The first inversion occurs at 12, which is simpler than 11. In this grammar, squares, powers of two, and smooth numbers generally (http://en.wikipedia.org/wiki/Smooth_number) will be considered simple. Even though "the 256th prime" sounds like a short description of a number, this grammar isn't flexible enough to capture that concept, so this grammar does not consider primes to be simple. I believe this illustrates that picking a particular approximate-Kolmogorov-complexity-like measure is a way of justifying Occam's razor by aesthetics. Humans can argue about aesthetics, and be convinced by arguments somewhat like logic, but ultimately it is a matter of taste.
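For the curious, the table above can be reproduced (up to ties between equally short expressions) by a short dynamic program over this grammar; the function name and the limit are just illustrative:

    def shortest_exprs(limit=45):
        # Shortest expression, by character count, for each n from 1 to limit, under
        # the grammar  EXP ::= (EXP+EXP) | (EXP*EXP) | (NUM),  NUM ::= 1 | NUM+1.
        best = {}
        for n in range(1, limit + 1):
            best[n] = "(" + "+".join(["1"] * n) + ")"        # purely additive form
            for a in range(1, n):                            # every sum a + (n - a)
                cand = "(" + best[a] + "+" + best[n - a] + ")"
                if len(cand) < len(best[n]):
                    best[n] = cand
            for a in range(2, n):                            # every factorization
                if n % a == 0:
                    cand = "(" + best[a] + "*" + best[n // a] + ")"
                    if len(cand) < len(best[n]):
                        best[n] = cand
        return best

    # An "inversion" is a number whose shortest expression is shorter than that of
    # some smaller number, e.g. len(best[12]) < len(best[11]).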

In contrast, Kelly's idea of "simplicity" is related to Popper's falsifiability. In this sense of simplicity, a complex theory is one that can (for a while) camouflage itself as another (simple) theory, but a simple theory cannot pretend to be complex. So if Nature really had 256 particles, it could refuse to reveal them for a while (maybe they're hidden in "very high energy regimes"), and the 256-particle universe would exactly match the givens for a 254-particle universe. However, the 254-particle universe cannot reveal new particles; it's already shown everything that it has.

Remember, we're not talking about elaborate data analysis, where there could be "mirages" or "aliasing", patterns in the data that look initially like a new particle, but later reveal themselves to be explicable using known particles. We're talking about a particular form of the Nature vs Science game where each round Nature either reveals (by name) a new particle, or remains silent. This illustrates that Kelly's simplicity is relative to the possible observables. In this scenario, where Nature identifies new particles by name, then the hypothesis that we have seen all of the particles and there will be no new particles is always the simplest, the "most falsifiable". With more realistic observables, the question of what is the simplest consistent hypothesis becomes trickier.

A more realistic example that Kelly uses (following Kuhn) is the Copernican hypothesis that the Earth and other planets circle the sun. In what sense is it simpler than the geocentric hypothesis? From a casual modern perspective, both hypotheses might seem symmetric and of similar complexity. The crucial effect is that Ptolemy's model parameters (velocities, diameters) have to be carefully adjusted to create a "coincidence" - apparent retrograde motion that always coincides with solar conjunction (for Mercury and Venus) and solar opposition (for the other planets). (Note: Retrograde means moving in the unusual direction, east to west against the background stars. Conjunction means two entities appear near each other in the sky. Opposition means they are at opposite points of the celestial sphere.) The Copernican model "predicts" that coincidence - not in the sense that the creation of the model preceded knowledge of the effect, but in the sense that any tiny deviation from exact coincidence discovered in the future would be evidence against the Copernican model. In this sense, the Copernican model is more falsifiable; simpler.
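To see the "coincidence" concretely, here is a toy heliocentric model with circular, coplanar orbits (rough textbook values for Mars's orbit; purely illustrative): the apparent retrograde intervals automatically bracket opposition, with no parameters tuned to produce that effect.

    import numpy as np

    # Circular, coplanar heliocentric orbits for Earth and Mars (approximate values).
    r_earth, r_mars = 1.0, 1.524        # orbital radii in AU
    p_earth, p_mars = 1.0, 1.881        # orbital periods in years

    t = np.linspace(0, 4, 20000)        # four years of fine-grained samples
    earth = r_earth * np.exp(2j * np.pi * t / p_earth)
    mars = r_mars * np.exp(2j * np.pi * t / p_mars)

    lon = np.unwrap(np.angle(mars - earth))              # Mars's apparent geocentric longitude
    opposition = np.abs(np.angle(mars / earth)) < 1e-2   # Earth directly between Sun and Mars

    # At every sampled opposition the apparent longitude is decreasing, i.e. Mars is
    # in retrograde - the heliocentric model gets the coincidence for free.
    print(np.all(np.diff(lon)[opposition[:-1]] < 0))     # True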

The Ockham Efficiency Theorems explain in what sense this version of Occam's razor is strictly better than other strategies for the Science vs. Nature game. If what we care about is the number of public mind changes (saying "I was mistaken, actually there are X particles." would count as a mind change for any X), and the timing of the mind changes, then Occam's razor is the best strategy for the Science vs. Nature game. The Occam strategy for the number-of-particles game will achieve exactly as many mind changes as there are particles. A scientist who deviates from Occam's razor allows Nature to extract a mind change from the (hasty) scientist "for free".

The way this works in the particles game is simple. To extract a mind change from a hasty scientist who jumps to predicting 12 particles when they've only seen 11, or 256 particles when they've only seen 254, Nature can simply continuously refuse to reveal new particles. If the scientist doesn't ever switch back down to the known number of particles, then they're a nonconvergent scientist - they lose the game. If the scientist, confronted with a long run of "no new particles found" does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible number of mind changes (that is, equal to the number of particles), given such an adversarial Nature.
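Here is a toy simulation of that argument (this is just my illustration, not Kelly's formalism; the "hasty" strategy and its patience parameter are made up):

    def play(scientist, reveal_times, horizon=40):
        # Particles game: each round Nature reveals a new particle or stays silent;
        # we count how often the scientist changes its announced particle count.
        seen, silent, changes = 0, 0, 0
        prev = scientist(seen, silent)              # announcement before any data
        for t in range(horizon):
            if t in reveal_times:
                seen, silent = seen + 1, 0
            else:
                silent += 1
            guess = scientist(seen, silent)
            if guess != prev:
                changes, prev = changes + 1, guess
        return changes

    def occam(seen, silent):
        return seen                                 # exactly what has been observed

    def make_hasty(patience=5):
        def hasty(seen, silent):
            # leaps one particle ahead, then retreats after a long silence
            return seen + 1 if silent < patience else seen
        return hasty

    nature = {0, 10, 20}                            # an adversarial reveal schedule
    print(play(occam, nature))                      # -> 3: one mind change per particle
    print(play(make_hasty(), nature))               # -> 6: Nature extracts extra changes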

The "Ockham Efficiency Theorems" refer to the worked-out details of more elaborate Science vs. Nature games - where Nature chooses a polynomial GUT, for example.

This entire scenario does generalize to noisy observations as well (learning Pearlian causal graphs), though I don't understand this aspect fully. If I understand correctly, the scientist guesses a probability distribution over the possible worlds and "mind changes" are counted as changes of that probability distribution, so adjusting a 0.9 probability to 0.8 would count as a fraction of a mind change. Anyway, read the actual papers; they're well-written and convincing.

www.andrew.cmu.edu/user/kk3n/ockham/Ockham.htm

This post benefited greatly from encouragement and critique by cousin_it.

Comments


As Unknown pointed out, using Kolmogorov complexity is a better bet if winning means finding the truth quickly, instead of just eventually.

The rules of this game, plus the assumption that Nature is maximally adversarial, seem specifically designed so that this particular version of Ockham would be optimal. It doesn't really seem to provide much insight into what one should do in more general/typical settings.

The question that Kelly is trying to answer is "Why, or in what sense, does Occam's razor work?". Yes, the answer is "It works in that it is worst-case-efficient." He doesn't assume a simplicity-biased prior (which some might characterize as a circular justification of Occam's razor).

I worry that the particular example that I'm presenting here (where the Occam strategy steps in a vaguely linear fashion) is coloring your thinking about Kelly's work generally, which has little or nothing to do with stepping in a vaguely linear fashion, and everything to do with believing the simplest theory compatible with the evidence. (Which, based on your other writings, I think you would endorse.)

Please investigate further, maybe read one of Kelly's papers, I would hate for my poor writing skills to mislead you.

He doesn't assume a simplicity-biased prior (which some might characterize as a circular justification of Occam's razor).

Is the particles example one of Kelly's own examples, or something you made up to explain his idea? Because in that example at least, we seem to be assuming a game with very specific rules, showing that a particular strategy is optimal for that game, and then calling that strategy "Occam". Compare this with Solomonoff induction, where we only have to assume that the input is computable and can show that Bayesian updating with a prior based on Kolmogorov complexity is optimal in a certain sense.

everything to do with believing the simplest theory compatible with the evidence. (Which, based on your other writings, I think you would endorse.)

I do endorse that, but there are various aspects of the idea that I'm still confused about, and I'm not seeing how Kelly's work helps to dissolve those confusions.

I think I read some of Kelly's own writings the last time cousin_it pointed them out, but again, didn't really "get it" (in the sense of seeing why it's important/interesting/insightful). I'm hoping that I'm just missing something, and you, or someone else, can show me what it is. Or perhaps just point to a specific paper that explains Kelly's insights most clearly?

The particles example is derived from one of Kelly's examples (marbles in a box).

Kolmogorov complexity is unavailable - only approximations of it are available, and they're not unique. There are multiple reasonable grammars that we can use to do MDL, and no clear justification for why we should use one rather than another. For an extreme example, imagine measuring simplicity by "size of the patch to the Catholic Church's dogma."

Kelly's notion that Occam means worst-case-efficient-in-mind-changes allows us to avoid the prior "dropping like manna from heaven".

I recommend this paper:

http://www.hss.cmu.edu/philosophy/kelly/papers/bonn5.pdf

but if that one doesn't do it for you, there are others with more cartoons and fewer words:

http://www.fitelson.org/few/few_05/kelly_2.pdf

Or ones with more words and fewer cartoons:

http://www.hss.cmu.edu/philosophy/kelly/papers/Ch4-Glymour%20&%20Kelly-final.pdf

In the first paper you cite, there is a particles example that is essentially the same example as yours, and Kelly does use it as his main example. The only difference is instead of counting the number of fundamental particles in physics, his example uses a device that we know will emit a finite number of particles.

On page 33, Kelly writes about how his idea would handle a modification of the basic example:

For example, suppose you have to report not only how many particles will appear but when each one will appear [...] The only sensible resolution of this Babel of alternative coarsenings is for Ockham to steer clear of it altogether. And that’s just what the proposed theory says. [...] Ockham is both mute and toothless (perhaps mute because toothless) in this problem. Again, that is the correct answer.

So suppose this device has been emitting one particle every 10 seconds for the last million seconds. According to Kelly's version of Ockham's Razor (perhaps we should call it Kelly's Razor instead?), we can't predict that the next particle will come 10 seconds later. What use is Kelly's idea, if I want to have a notion of complexity that can help us (or an AI) make decisions in general, instead of just playing some specific games for which it happens to apply?

You read the paper! Thanks for pointing out that we know somehow that only a finite number of particles will ever be found.

To explain the "oneicle" problem: It seems like how a scenario is coded into a game matters. For example, if you viewed the timed particles game as having two possible worlds "The device will always emit a particle every 10 seconds." and "The device will sometimes emit a particle every 10 seconds.", then the first world cannot pretend to be the second world, but the second world can camouflage itself as the first world for a time, and so (Kelly's version of) Occam's razor says the first is simpler - we get the intuitively correct answer.

The alternative coding is somewhat analogous to the color "grue" (which is green up until some date, and blue thereafter). You recode the problem to talk about "oneicles", a concept that refers to non-particles up to time 1, and particles thereafter. If you allow this sort of recoding, then you would also allow "twoticles", and the infinite hierarchy of symmetric re-codings causes a problem. I tend to think this is a technical problem that is unlikely to expand into the philosophy part of the theory, but I'm kind of an idiot, and I may be missing something - certainly we would like to avoid coding-dependence.

That's a problem (the first problem Kelly mentioned in that paper), but do you really require a theory to have no problems remaining in order for it to be counted as insightful? No one else addresses the question "Where does the prior come from?".

do you really require a theory to have no problems remaining in order for it to be counted as insightful?

It would be one thing if Kelly said that the theory currently can't predict that another particle will come in 10 seconds, but he hopes to eventually extend it so that it can make predictions like that. But instead he says that Ockham is mute on the question, and that's the right answer.

No one else addresses the question "Where does the prior come from?".

Neither does Kelly. I don't see how we can go from his idea of Ockham to a Bayesian prior, or how to use it directly in decision making. Kelly's position above suggests that he doesn't consider this to be the problem that he's trying to solve. (And I don't see what is so interesting about the problem that he is trying to solve.)

Okay, I think we've reached a point of reflective disagreement.

I agree with you that Kelly was wrong to be enamored of his formalization's output on the timed particles example; it's either a regrettable flaw that must be lived with, or a regrettable flaw that we should try to fix, and I don't understand enough of the topological math to tell which.

However, the unjustified Occam prior in the standard Bayesian account of science is also a regrettable flaw - and Kelly has demonstrated that it's probably fixable. I find that very intriguing, and am willing to put some time into understanding Kelly's approach - even if it dissolves something that I previously cherished (such as MDL-based Occam priors).

Reasonable people can reasonably disagree regarding which research avenues are likely to be valuable.

Nice post! Couple quibbles/questions.

Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

It's not a challenge unless you think that each of the weak classifiers should be powerful enough to do the classification correctly. Occam = "Do not multiply entities beyond need", not "Do not multiply entities".

If the scientist, confronted with a long run of "no new particles found" does switch to the known number of particles, then Nature has extracted a mind change from the scientist without a corresponding particle. The Occam strategy achieves the fewest possible number of mind changes (that is, equal to the number of particles), given such an adversarial Nature.

The Occam strategy isn't to say the number of particles that exist equals the number observed. It's to use the simplest underlying theory. We have theories with fewer moving parts than number of particles predicted. Maybe you mean "fundamental concepts" when you say "particles"; but in that case, the non-Occam strategy is a straw man, since I don't think anyone proposes new concepts that provide no additional explanatory power.

Can you elaborate your sentence that begins "It's not a challenge.."?

My understanding is that if our real justification for "Why do we use Occam's razor?" was "Because that way we avoid overfitting." then if a future statistical technique that outperformed Occam by proposing weirdly complex hypotheses came along, we would embrace it wholeheartedly.

Boosting and bagging are merely illustrative of the idea that there might be a statistical technique that achieves good performance in a statistical sense, though nobody believes that their outputs (large ensembles of rules) are "really out there", the way we might believe Schroedinger's equation is "really out there".

We only use boosting if our set of low-complexity hypotheses does not contain the solution we need. And instead of switching to a larger set of still-low-complexity hypotheses, we do something much cheaper, a second best thing: we try to find a good hypothesis in the convex hull of the original hypothesis space.

In short, boosting "outperforms Occam" only in man-hours saved: boosting requires less thinking and less work than properly applying Occam's razor. That really is a good thing, of course.

Kolmogorov complexity also depends on the description language used to create strings. If the language has a recursive feature, for example, this would assign lower complexity to members of the Fibonacci sequence.

So, part of the task of Science is to induce what Nature's description language is.

Kolmogorov complexity also depends on the description language used to create strings.

The whole point of the KC is that it doesn't depend on the specific programming language (or Turing machine) selected except up to an additive constant. If your language contains a recursive feature, it will assign lower complexity to Fibonacci numbers than my non-recursive language does, but the difference is at most the codelength required for me to simulate your machine on my machine.
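Stated as a formula, the invariance theorem says: for any two universal machines U and V there is a constant c_{U,V}, independent of x, such that |K_U(x) - K_V(x)| <= c_{U,V} for every string x, where c_{U,V} is roughly the length of the programs by which each machine simulates the other.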

I wrote this some years back. It is probably not without its flaws (and got some criticism on Less Wrong), but the general idea may be relevant: http://www.paul-almond.com/WhatIsALowLevelLanguage.htm

AFAICT, the 'proposed solution' from WiaLLL doesn't work. The rest of this post will be devoted to proposing a specific attack on it. As such it will be longish - since WiaLLL makes a strong claim about a fundamental aspect of epistemology, it deserves a considered refutation.

My attack works on the assumption that current sets count programs, not languages - i.e. two different programs that implement the same language will result in two entries in the next current set (regardless of whether they are written in the same language). Paul Almond has told me that this is what he intended in WiaLLL.

Consider a family of Turing-complete languages, whose members are called Xn (n ∈ N). I will use 'X-like' to refer to any language from this family. Let X := X8.

Programs for any language Xn can be divided into three categories:

  1. Programs that begin with n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don't matter for this attack).

  2. Programs that have a length of 0 (i.e. empty strings). All such programs are defined to implement X.

  3. Programs that don't fit either 1 or 2. All such programs are defined to implement X(n+1).

It follows that the set of programs of any bounded length (>0) writable in any Xn (n>0) is dominated by interpreters for X-like languages with a greater n than their own. While all such languages are Turing-complete, for any given program length the proportion of non-X-like-interpreting programs writable in them is small, and inversely exponential in n.

What happens when we run the proposed solution from WiaLLL on this X? Assuming sufficiently high values of N, on each step but the first the current set will consist of a mix of X-likes and other languages. For purposes of calculating a lower bound on the long-term proportion of X-likes, assume that programs for non-X-likes never implement an X-like. The X-likes will then tend to thin out, since some of their programs fall into category 1 and implement non-X-likes; however, this process produces a fraction of non-X-likes no larger than \sum_{i=8}^\infty (1/2^i) = 0.0078125.

The X-likes will also thin themselves out through programs in category 2, since X itself will produce a fixed fraction of non-X-likes. How much they do so depends on the exact growth of N and I in the limit. Given a function I(N) which maps values for program length to the number of iterations to run with this length, one can calculate a rough upper bound on the proportion of non-X-likes produced in this manner as 1-\prod_{N=N0}^\infty (1-I(N)/2^N).

The detailed results of this process therefore depend on the choice of I(N), and of N0. The general relationship is that unless I(N) grows quite quickly in N, the X-likes will only be notably thinned out for small N, and therefore will not be thinned out significantly given a sufficiently big N0. For example, for I(N)=16 and N0=16 (16 iterations per maximum program length, minimum maximum program length of 16 bits) the result is about 0.00049.
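A quick numeric sanity check of those two figures (illustrative only):

    # Fraction of category-1 (non-X-like-producing) programs, summed over the family:
    print(sum(2.0 ** -i for i in range(8, 200)))     # 0.0078125, i.e. 1/128

    def thin_out(I, N0, N_max=200):
        # Upper bound 1 - prod_{N=N0}^{N_max} (1 - I(N) / 2**N) on the fraction of
        # non-X-likes produced through category-2 programs.
        prod = 1.0
        for N in range(N0, N_max + 1):
            prod *= 1.0 - I(N) / 2.0 ** N
        return 1.0 - prod

    print(thin_out(lambda N: 16, 16))                # about 0.00049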

Therefore, with free choice of the starting language and some reasonable assumptions about the limiting process used for I and N, we can ensure that the language mix in the final current set will be dominated by X-likes.

Why does all of this matter? Because any X-like especially likes to implement X (since category 2 programs have the minimum length of 0 bits), and for sufficiently large n really doesn't like to implement any non-X-like, and as such any current set containing a sufficient proportion of X-likes will rate X as lower-level than any other language. This argument might in theory fail because of the possibility that other languages would on average penalize X-likes much more than X-likes penalize them, but if so this can be worked around by setting X to a higher Xn than 8 - the family can be redefined to set the initial n to any integer we need it to be to work around such finite factors.

Therefore, unless I'm missing something critical, the output of the procedure proposed in WiaLLL would be highly sensitive to the starting language as well as I(N) and N0. It therefore fails in its attempt to define a simplicity measure on languages in the absence of additional assumptions not stated in the article.

(Part 2) But hang on - I can see an easy way to improve on Paul's definition:

He looks at all implementations of languages of length less than or equal to N, takes a simple arithmetic mean, and then lets N go to infinity. But how about we take a weighted arithmetic mean where the weight of a given implementation is 1/2^(its length) (and then we normalise if necessary so that the weights sum to 1)? (This does away with the need for N and taking a limit as N goes to infinity.) Then in your example above, the fact that there is always an implementation of X whose length is 0 means that a fixed proportion of the weight previously assigned to X-like languages 'bleeds away' on every iteration.

I think we can use a "pathological" language to make X[0] particularly hard to implement (on average) rather than particularly easy, so it would still be false that "low-level index for Y" measured from X must be the same as measured from Y.

But perhaps we can define the "low-level index" of a language Y as the infimum of all low-level indices as measured from languages X?

(Needless to say, the trick of weighting the average by 1/2^length is very standard.)

Let's see if I understand this correctly. You're redefining current sets to a mapping from the set of all computable and Turing-complete languages to their weights in [0,1]. Except for the first step of the process, you will actually have non-zero weights on each of those languages, so we're working with true countably infinite sets here. This might require a stronger hypercomputer to implement than WiaLLL, but I suppose that's fine for the purely theoretical purposes of these constructions.

Weighting each program by 1/2^i probably doesn't work, since that still results in a total measure of 1 for each set of all programs with a given length n ∈ N (since there are 2^n programs in each of those sets), but that's a minor detail - weighting by e.g. 1/4^i should get you a workable result.

I think this is still attackable in so far as that it's possible to produce a setup that will make X win over any non-X-like. It would lose to certain other X-likes in each round, but this does not hold in the (I → ∞) limit.

It's probably not necessary for me to write up the modified attack in full detail; the basic idea is to redefine the categories of Xn programs as follows:

  1. Programs that begin with 2n 0 bits. This prefix is solely used to encode that these programs are of type 1; it encodes no other data. The rest of the program source encodes a program in some suitable turing-complete language (the details of which don't matter for this attack).

  2. Programs that begin with j ∈ [n,2n) 0 bits. All such programs are defined to implement X.

  3. Programs that don't fit either 1 or 2. All such programs are defined to implement X(n+1).

So the support of Xn for X approaches 0 in the (n → ∞) limit, which allows the X-like family to keep a large weight through arbitrarily many iterations, and even in the limit of (I → ∞). However, it approaches 0 much slower than their support for any non-X-like, which means that X still wins over every non-X-like at each step of the simplicity computation.

Also note that the choice of a prefix length linear in the X-like level is fairly arbitrary. I think this is sufficient to get the desired effect, but if this turns out to be incorrect we can of course use any computable function here, meaning we can adjust this attack to make the fraction of non-X-likes produced fall much faster.

But perhaps we can define the "low-level index" of a language Y as the infimum of all low-level indices as measured from languages X?

I don't understand this suggestion. If I understood your previous modifications correctly, all but the first current sets of your process assign non-zero measure to every computable and Turing-complete language, so taking infima over languages from this set wouldn't get you any useful information. As for X-likes, if you are going to treat those separately from other languages in the simplicity determination process, you'd need to specify some precise way of recognizing them. The specific X families I've talked about are of course only examples; there are countably infinitely many more such families that exhibit similar properties. Please elaborate on what you meant?

So essentially, the problem with Paul Almond's suggestion is that one can find languages X that are 'X[0]-pathological' in the sense that almost all of their children are both (i) good at implementing some fixed language X[0] and (ii) even more X[0]-pathological than they are (this being a recursive definition of 'pathological').

Such languages destroy any hope that Paul's "low-level index for Y" won't depend on which language X we start from.

Kolmogorov complexity isn't an option, and approximations of Kolmogorov complexity don't have that uniqueness property.

Even if we did have that uniqueness property, it's still not all that helpful if you're trying to decide among a finite number of finite-complexity hypotheses. Choice of description language is still sufficient to completely determine your choice of which hypothesis is simplest.

As far as I can tell, a MDL-ish complexity measurement that is "unique up to a constant" acts like a prior - in the limit of infinite evidence, the initial bias will wash out. Kelly is fond of analogizing this "justification" of Occam to a flat tire - it helps you get home because you can eventually fix it.

Kolmogorov complexity isn't an option, and approximations of Kolmogorov complexity don't have that uniqueness property.

Right, but we can find a series of increasingly tight upper bounds to the KC, and the bounds are language-independent in the same sense that the KC itself is.

Then for any sufficiently large dataset, we can say that the probability of the data is given by the model that produces the tightest upper bound for the KC. This is nicely rigorous - if you think my model is unjustified and doesn't correspond to the "real" probabilities, all you have to do is produce a new model that achieves a tighter KC-bound.
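For a sense of what computable upper bounds look like in practice, any lossless compressor gives one, up to the constant cost of its decompressor; this is only a rough stand-in for the model-based bounds meant above.

    import bz2
    import zlib

    x = b"0123456789" * 100
    # Each compressed length is an upper bound on K(x), up to the constant cost of
    # the corresponding decompressor; the minimum over compressors is the tightest.
    bounds = {"raw": len(x), "zlib": len(zlib.compress(x)), "bz2": len(bz2.compress(x))}
    print(bounds, "best upper bound:", min(bounds.values()))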

I second the OP's recommendation of Zendo. It could've been named "Induction: The Game", it's a great medium for practicing (and teaching!) some basic science and rationality skills, and it's also a heck of a lot of fun. Plus, you can use the neat little Zendo pyramids (also called Icehouse pieces) for various other games.

I agree,

I'm not sure if I should go with Treehouse or IceTowers...

If I buy IceTowers I could probably play Treehouse, but not the other way round...

:)

You don't really need the pyramids, though they're definitely neat. You can do it with coins, dice, little glass beads, whatever. With words and phrases, it makes a car game; the game "Green Glass Door" can be seen as an instance.

I call this game "playing Science".

The game of Science vs. Nature is more complicated than that, and it's the interesting structure that allows scientists to make predictions that are better than "what we've seen so far is everything there is." In particular, the interesting thing in both chemistry and particle physics is that scientists were able to find regularities in the data (the Periodic Table is one example) that led them to predict missing particles. Once they knew what properties to look for, they usually were able to find them. When a theory predicts particles that aren't revealed when the right apparatus is constructed, we drop the theory.

But in the meantime, you'd have a more interesting game (and closer to Zendo) if Nature gave you a way of classifying objects. In Zendo, there is only one dimension. Something is a match (I forget Zendo's terminology) or it isn't. [In the real world, one of the possible moves is inventing a new test. "If I hold the object up to the light, what do I see?"] Some new tests turn out not to reveal anything useful, while others give you a whole new way of categorizing all the things you thought you knew about before.

In this context, Occam's razor is a rule about inventing rules to explain complex behavior, not rules about how many things there are. Your objective is to explain a pile of evidence, and you get to make up whatever story you like, but in the end, your story has to guide you in predicting something that hasn't happened yet. If you can make better predictions than other scientists, you can have as complicated a rule as you like. If you and they make the same predictions, then the observers get to break the tie by deciding which rule is simpler.

Ensemble learning methods are a challenge for "prevents overfitting" justifications of Occam's razor since they propose weirdly complex hypotheses, but suffer less from overfitting than the weak classifiers that they are built from.

I'm not sure this is true. Overfitting means that a classifier achieves good empirical performance (on the known data), but bad generalization performance (on new, previously unseen data). The weak classifiers used in Adaboost et al. don't really suffer from overfitting; they just have bad empirical performance (this is why they are called "weak"). Also, ensemble methods do suffer from overfitting.

It is very hard to analyze complexity using naive methods. For example, let's say we have an Adaboost classifier built from 4 weak classifiers with relatively large weights. Then we add another classifier, but with a small weight. Now, did the complexity increase by 25%? Why? If the 5th weight is small, the 5-classifier ensemble should be nearly identical to the 4-classifier ensemble, so it should achieve about the same generalization performance.

According to the MDL / approximate-Kolmogorov-complexity notion of complexity, to measure the complexity of a hypothesis, fix a notation and count bits of the description of that hypothesis in that notation.

It's certainly possible to find a notation where the scenario you describe has the small increment in complexity that you expect. But why pick that notation rather than another? How do you justify your priors?

But why pick that notation rather than another?

Because you need the right notion of complexity in order to prevent overfitting. To prevent overfitting, you need to penalize highly expressive model classes. The MDL complexity measure does not perfectly capture the notion of expressivity, especially if you use a naive encoding method. An ensemble with many subclassifiers may require many bits to specify, but still not be very expressive, if the weights are small.

I think we're not trying to prevent overfitting. We're talking about the problem of induction.

How do we know that a method that prevented overfitting in the past will continue to prevent overfitting in the future? Appealing to "induction has always worked before" is circular.

I think it was Hume who asked: How do we know that bread will continue to nourish humans? It always has before, but in what sense are past observations logical grounds for anything? We would need some magical law of nature that says "Nature definitely won't just change all the rules." But of course, there is no such law, and Nature might change all the rules at any time.

Suppose there are two kinds of worlds, one where bread always nourishes, and one where bread sometimes nourishes. The first variety is strictly simpler, because the second can camouflage itself as the first, matching the givens for an arbitrary amount of time, but the converse is not true. The Occam strategy of believing the simplest hypothesis consistent with the givens has this pleasant property: It is the best strategy as measured in worst-case mind changes against an adversarial Nature. Therefore, believe that bread will continue to nourish.

I worry that I'm just muddying the water here.

What I love about this post is that, at heart, it challenges the very ability to be rational: it hints at the possibility that what is most effective (and how we ultimately function) is as statistical learning machines whose understanding and predictive ability lie in structures very different from the ones we use to communicate. Our language and formal argument structures define a space of understanding that is more a reflection of our communication technology than of the true nature of things. In this way, Occam's razor and Kolmogorov complexity reflect a motivation to communicate efficiently (both with ourselves and with others). This can have practical utility in the sense that a simpler conscious model is easier to analyse, so we have a better chance of finding its flaws (predictive failures). In this sense a simpler theory (in the sense of being more comprehensible) is more likely to be correct if it explains the same evidence, because it literally has more analysis performed on it, since it is easier to think about - or, in an AI sense, it has a larger (mentally synthesised) evaluation set.

Successful analysis of uncertainty without using a prior implies some kind of hidden Bayes-structure.

Consider hypothesis 1 and hypothesis 2. Hypothesis 2 can pretend to be hypothesis 1 but not vice versa. What does this mean?

p(D1|H2)>0

p(D2|H1)=0

Observations about D2 are clearly irrelevant to our present inference problem, instead, what matters is this:

p(D1|H1)=1

p(D1|H2)<1

p(D1|H1)>p(D1|H2)

that is, D1 causes a Bayesian agent to update in favor of hypothesis 1, because hypothesis 1 fits the evidence better.

(this is only exactly true if D1 and D2 are the only possible data, but the principle should still work in more complicated cases)
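To make the update concrete, here is a toy numeric version (the 0.9 likelihood is an arbitrary stand-in for "H2 fits D1 less well"):

    # Two hypotheses, any nonzero prior; observe D1 fifty times in a row.
    p_h1, p_h2 = 0.5, 0.5
    p_d1_h1, p_d1_h2 = 1.0, 0.9
    for _ in range(50):
        j1, j2 = p_h1 * p_d1_h1, p_h2 * p_d1_h2
        p_h1, p_h2 = j1 / (j1 + j2), j2 / (j1 + j2)
    print(p_h1)        # close to 1: repeated D1 drives belief toward hypothesis 1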

So Kelly avoids using a prior by copying the rest of Bayes' theorem. The hypotheses he is advocating are not "simple" hypotheses, they are ones that fit the data. Occam's Razor is still missing.

This is extremely analogous to frequentist statistics, which considers worst-case scenarios rather than using a prior, and therefore achieves some of what you can do with Bayes.

Is your additive example supposed to be the minimum number of 1s needed to represent n as a product or sum of 1s? It looks like that, but if so, 6 should be ((1+1)*(1+1+1)) and 11 should be (1+((1+1)*(1+1+1+1+1))). 12 is still an "inversion."

ETA: A few other numbers seem to be off as a result of the mistake with 6. 30 and 42 also have more compact representations.

No, it's one of the expressions with the minimum number of characters needed to represent the number, using this grammar:

EXP ::= ( EXP + EXP ) | ( EXP * EXP ) | ( NUM )

NUM ::= 1 | NUM + 1

As you can see, reasonable people can reasonably disagree on what is an appropriate grammar to use for minimum description length, even on arbitrary examples like this one.

If there are a very large number of fundamental particles, it is likely that there is some reason for the number, rather than being a merely random number, and if so, then you would likely reach the truth quicker by using some version of Kolmogorov complexity, rather than counting the particles one by one as you discover them.

This is circular. You assume that the universe is likely to have low K-complexity and conclude that a K-complexity razor works well. Kelly's work requires no such assumption, this is why I think it's valuable.