Bayes Slays Goodman's Grue

This is a first stab at solving Goodman's famous grue problem. I haven't seen a post on LW about the grue paradox, which surprised me, since I had figured that if any argument would be raised against Bayesian LW doctrine, it would be the grue problem. I haven't looked at many proposed solutions to this paradox, besides some of the basic ones in "The New Problem of Induction". So I apologize now if my solution is wildly unoriginal. I am willing to put you through this, dear reader, because:

  1. I wanted to see how I would fare against this still largely open, devastating, and classic problem, using only the arsenal provided to me by my minimal Bayesian training, and my regular LW reading.
  2. I wanted the first LW article about the grue problem to attack it from a distinctly Lesswrongian approach, without the benefit of hindsight knowledge of the solutions of non-LW philosophy.
  3. And lastly, because, even if this solution has been found before, if it is the right solution, it is to LW's credit that its students can solve the grue problem with only the use of LW skills and cognitive tools.

I would also like to warn the savvy subjective Bayesian that just because I think that probabilities model frequencies, and that I require frequencies out there in the world, does not mean that I am a frequentist or a realist about probability. I am a formalist with a grain of salt. There are no probabilities anywhere in my view, not even in minds; but the theorems of probability theory, when interpreted, share a fundamental contour with many important tools of the inquiring mind, including both the nature of frequency and the set of rational subjective belief systems. There is nothing more to probability than that system which produces its theorems.

Lastly, I would like to say that even if I have not succeeded here (which I think I have), there is likely something valuable that can be made from the leftovers of my solution after the onslaught of penetrating critiques that I expect from this community. Solving this problem is essential to LW's methods, and our arsenal is fit to handle it. If we are going to be taken seriously in the philosophical community as a new movement, we must solve serious problems from academic philosophy, and we must do it in distinctly Lesswrongian ways.

 


 

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

That is the inference that the grue problem threatens, courtesy of Nelson Goodman.  The grue problem starts by defining "grue":

"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."

So you see that before time T, from the list of premises:

"The first emerald ever observed was green.
 The second emerald ever observed was green.
 The third emerald ever observed was green.
 … etc.
 The nth emerald ever observed was green."
 (we will call these the green premises)

it follows that:

"The first emerald ever observed was grue.
The second emerald ever observed was grue.
The third emerald ever observed was grue.
… etc.
The nth emerald ever observed was grue."
(we will call these the grue premises)

The proposer of the grue problem asks at this point: "So if the green premises are evidence that the next emerald will be green, why aren't the grue premises evidence for the next emerald being grue?" If an emerald is grue after time T, it is not green. Let's say that the green premises bring the probability of "A new unobserved emerald is green." to 99%. On the skeptic's hypothesis, by symmetry they should also bring the probability of "A new unobserved emerald is grue." to 99%. But after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%. Since these sentences have no intersection, i.e., they cannot happen together, we find the probability of their disjunction by adding their individual probabilities. This gives us a number at least as big as 198%, which contradicts the Kolmogorov axioms: we should not be able to form a statement with a probability greater than one.

This threatens the whole of science, because you cannot simply keep this isolated to emeralds and color. We may think of the emeralds as trials, and green as the value of a random variable. Ultimately, every result of a scientific instrument is a random variable, with a very particular and useful distribution over its values. If we can't justify inferring probability distributions over random variables based on their previous results, we cannot justify a single bit of natural science. This, of course, says nothing about how it works in practice. We all know it works in practice. "A philosopher is someone who says, 'I know it works in practice, I'm trying to see if it works in principle.'" - Dan Dennett

We may look at an analogous problem. Let's suppose that there is a table, that balls are being dropped on this table, and that an infinitely thin line is drawn perpendicular to the edge of the table at a position unknown to us. The problem is to figure out the probability of the next ball landing right of the line given the previous results. Our first prediction should be that there is a 50% chance of the ball landing right of the line, by symmetry. If we get the result that one ball landed right of the line, by Laplace's rule of succession we infer that there is a 2/3 chance that the next ball will land right of the line. After n trials, if every trial gives a positive result, the probability we should assign to the next trial being positive as well is (n+1)/(n+2).
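The rule of succession is short enough to write out. The following is a minimal sketch, not part of the original argument; the function name `rule_of_succession` is made up for illustration, and exact fractions are used so the values match the ones quoted in the text.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's rule: P(next trial is a success) = (s + 1) / (n + 2)."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


if __name__ == "__main__":
    # No balls dropped yet, then 1, 2, 3 balls that all landed right.
    for n in range(4):
        print(n, rule_of_succession(n, n))
```

With no data this gives 1/2, and after a single Right it gives 2/3, as stated above.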

If this line was placed 2/3 of the way down the table, we should expect the ratio of rights to lefts to approach 2:1. This gives us a 2/3 chance of the next ball being a right, and the fraction of Rights out of trials approaches 2/3 ever more closely as more trials are performed.
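A quick simulation makes the convergence visible. This is only a sketch under an assumption the text does not spell out: each ball independently lands right of the line with probability 2/3. The name `fraction_right` and the fixed seed are illustrative choices.

```python
import random


def fraction_right(n_trials: int, p_right: float = 2 / 3, seed: int = 0) -> float:
    """Drop n_trials balls; return the observed fraction that landed right."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rights = sum(rng.random() < p_right for _ in range(n_trials))
    return rights / n_trials


if __name__ == "__main__":
    for n in (10, 1_000, 100_000):
        print(n, round(fraction_right(n), 4))
```

The larger the number of trials, the closer the printed fraction should sit to 2/3.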

Now let us suppose a grue skeptic approaches this situation. He might make up two terms, "reft" and "light", defined as you would expect, but just in case:

"A ball is reft of the line iff it is right of it before time T when it lands, or if it is left of it after time T when it lands.
 A ball is light of the line iff it is left of the line before time T when it lands, or if it is right of the line after time T when it first lands."

The skeptic would continue:

"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"

Things for some reason become perfectly clear at this point for the defender of Bayesian inference, because now we have an easy-to-imagine model. Of course, if a ball landing right of the line is evidence for Right, then it cannot possibly be evidence for ~Right; to be evidence for Reft, after time T, is to be evidence for ~Right, because after time T, Reft is logically identical to ~Right; hence it is not evidence for Reft, after time T, for the same reasons it is not evidence for ~Right. Of course, before time T, any evidence for Reft is evidence for Right for analogous reasons.

But now the grue skeptic can say something brilliant that stops much of what the Bayesian has proposed dead in its tracks:

"Why can't I just repeat that paragraph back to you and swap every occurrence of 'right' with 'reft' and 'left' with 'light', and vice versa? They are perfectly symmetrical in terms of their logical relations to one another.
If we take 'reft' and 'light' as primitives, then we have to define 'right' and 'left' in terms of 'reft' and 'light' with the use of time intervals."

What can we possibly reply to this? Can he/she not do this with every argument we propose then? Certainly, the skeptic admits that Bayes, and the contradiction in Right & Reft, after time T, prohibits previous Rights from being evidence of both Right and Reft after time T; where he is challenging us is in choosing Right as the result which it is evidence for, even though "Reft" and "Right" have a completely symmetrical syntactical relationship. There is nothing about the definitions of reft and right which distinguishes them from each other, except their spelling. So is that it? No, this simply means we have to propose an argument that doesn't rely on purely syntactical reasoning. So that if the skeptic performs the swap on our argument, the resulting argument is no longer sound.

What would happen in this scenario if it were actually set up? I know that seems like a strangely concrete question for a philosophy text, but its answer is a helpful hint. What would happen is that after time T, the ratio Rights:Lefts would proceed as expected as more trials were added, and the ratio Refts:Lights would approach the reciprocal of Rights:Lefts. The only way for this not to happen is for us to have been calling the right side of the table "reft", or for the line to have moved. We can only figure out where the line is by knowing where the balls landed relative to it; anything we can figure out about where the line is from knowing which balls landed Reft and which landed Light, we can only figure out because, knowing this and the time, we can know whether the ball landed left or right of the line.

To this I know of no reply which the grue skeptic can make. If he/she says the paragraph back to me with the proper words swapped, it is not true, because in the hypothetical where we have a table, a line, and we are calling one side right and the other side left, the only way for Refts:Lights to behave as expected as more trials are added is to move the line (if even that); otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts.
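The claim about the two ratios can be checked with a small simulation. This is a sketch, not a proof, and it rests on assumptions of my own: time is counted in trial numbers, the line never moves, and each ball lands right with probability 2/3. The names `label_reft_light` and `post_switch_ratios` are hypothetical.

```python
import random


def label_reft_light(is_right: bool, t: int, switch_time: int) -> str:
    """'reft' = right before switch_time or left from switch_time on; else 'light'."""
    before = t < switch_time
    return "reft" if is_right == before else "light"


def post_switch_ratios(n_trials: int, switch_time: int,
                       p_right: float = 2 / 3, seed: int = 0):
    """Return (Rights:Lefts, Refts:Lights), counting only trials from switch_time on."""
    rng = random.Random(seed)
    counts = {"right": 0, "left": 0, "reft": 0, "light": 0}
    for t in range(n_trials):
        is_right = rng.random() < p_right  # the line itself never moves
        if t < switch_time:
            continue  # before T the two vocabularies agree, so skip
        counts["right" if is_right else "left"] += 1
        counts[label_reft_light(is_right, t, switch_time)] += 1
    return counts["right"] / counts["left"], counts["reft"] / counts["light"]


if __name__ == "__main__":
    rights_to_lefts, refts_to_lights = post_switch_ratios(200_000, 100_000)
    print(rights_to_lefts, refts_to_lights)
```

Under these assumptions the first ratio stays near 2 and the second is its reciprocal, because after the switch every Right is relabelled Light and every Left is relabelled Reft.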

This thin line is analogous to the frequency of emeralds that turn out green out of all the emeralds that get made. This is why we can assume that the line will not move, because that frequency has one precise value, which never changes. Its other important feature is reminding us that even if two terms are syntactically symmetrical, they may have semantic conditions for application which are ignored by the syntactical model, e.g., checking to see which side of the line the ball landed on.

 


 

In conclusion:

Every random variable has as a part of it, stored in its definition/code, a frequency distribution over its values. By the fact that some things happen sometimes, and others happen other times, we know that the world contains random variables, even if they are never fundamental in the source code. Note that "frequency" is not used as a state of partial knowledge; it is a fact about a set and one of its subsets.

The reason that:

"The first emerald ever observed was green.
The second emerald ever observed was green.
The third emerald ever observed was green.
… etc.
The nth emerald ever observed was green.
(conclusion):
There is a very high probability that a never before observed emerald will be green."

is a valid inference, but the grue equivalent isn't, is that grue is not a property that the emerald construction sites of our universe deal with. They are blind to the grueness of their emeralds; they only say anything about whether or not the next emerald will be green. It may be that the rule that the emerald construction sites use to get either a green or non-green emerald changes at time T, but the frequency of some particular result out of all trials will never change; the line will not move. As long as we know what symbols we are using for what values, observing many green emeralds is evidence that the next one will be grue as long as it is before time T; every record of an observation of a green emerald is evidence against a grue one after time T. "Grue" changes meaning from green to blue at time T; the meaning of "green" stays the same, since we are using the same physical test to determine green-hood as before, just as we use the same test to tell whether the ball landed right or left. There is no reft in the universe's source code, and there is no grue. Green is not fundamental in the source code, but green can be reduced to some particular range of quanta states. If you had the universe's source code, you couldn't write grue without first writing green; writing green without knowing a thing about grue would be just as hard as writing it while knowing grue. Having a physical test, or primary condition for applicability, is what privileges green over grue after time T; to have a consistent physical test is the same as to reduce to a specifiable range of physical parameters; the existence of such a test is what prevents the skeptic from performing his/her swaps on our arguments.


Take this more as a brainstorm than as a final solution. It wasn't originally, but it should have been. I'll write something more organized and concise after I think about the comments more, and make some graphics I've designed that make my argument much clearer, even to myself. But keep those comments coming, and tell me if you want specific credit for anything you may have added to my grue toolkit in the comments.

Comments

The problem seems trivially easy.

Each observed emerald is evidence for both "the emerald is green" and "the emerald is grue." The first is preferred because it is vastly simpler (and picking any particular T, of course, is hugely privileging the hypothesis!) Evidence that is equally strong for two propositions doesn't change their relative likelihoods - so it starts out more likely that the emeralds are green than grue, and it ends more likely that the emeralds are green than grue, but both are quickly more likely than the proposition that emeralds are uniformly red.

What's weird about this?
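The point about relative likelihoods can be illustrated with a toy update. This is a sketch only: the three priors and the small likelihood under "red" are numbers invented for illustration, not values anyone in the thread proposed, and `posterior` is a hypothetical helper.

```python
def posterior(priors: dict, likelihoods: dict) -> dict:
    """One Bayes update: multiply prior by likelihood, then renormalise."""
    unnormalised = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalised.values())
    return {h: value / total for h, value in unnormalised.items()}


# Illustrative priors (assumed): green is simplest, grue pays for its clause about T.
priors = {"green": 0.90, "grue": 0.01, "red": 0.09}
# Probability of seeing a green emerald before T under each hypothesis (assumed).
likelihoods = {"green": 1.0, "grue": 1.0, "red": 0.01}

beliefs = dict(priors)
for _ in range(3):  # three green emeralds observed before T
    beliefs = posterior(beliefs, likelihoods)

if __name__ == "__main__":
    print(beliefs)
    print(beliefs["green"] / beliefs["grue"])
```

Because the evidence is equally likely under "green" and "grue", their ratio stays at the prior ratio, while "red" falls behind both.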

To clarify what potato said:

If someone was brought up from birth with the words "grue" and "bleen," how would they say something was "green," in their language? Well, they'd have to say that something was grue before, say, 2050, but bleen after. Something that changes from grue to bleen is clearly more complicated to write down than something that just stays grue all the time.

And this is just hiding the complexity, not making it simpler. Complexity isn't a function of how many words you use, cf. "The lady down the street is a witch; she did it." If we are writing a program that emits actual features of reality, rather than socially defined labels, the simplest program for green is simpler than the simplest program for grue or bleen. That you can also produce more complex programs that give the same results (defining green in terms of bleen and grue is only one such example) is both trivially true and irrelevant.

What's weird is that without a premise about what "green" and "blue" stand for semantically, the skeptic can just repeat that paragraph back to you with all the occurrences of "grue" and "green" switched, since "grue" and "green" are logically symmetrical.

They can claim that the grue hypothesis is simpler than the green hypothesis?

If we take "grue" and "bleen" as primitives, then it is the definition of "green" which requires the time interval, not grue.

But if we go down to the level of photons, "green" and "blue" don't require a time interval in their definitions, yet "grue" and "bleen" do.

What do you mean by "primitives"?

It seems to me that the only sensible primitives are photons, which have particular energies. A perception system that has two sets of mappings from energies to names and a clock is necessarily less simple than a perception system that has one mapping from energies to names.
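The "one mapping versus two mappings and a clock" point can be sketched in code. The wavelength bands below are rough, commonly quoted figures used purely as assumptions, and the function names `color_name` and `grue_name` are invented for illustration.

```python
def color_name(wavelength_nm: float) -> str:
    """One mapping from wavelength to a name. Band edges are approximate."""
    if 450 <= wavelength_nm < 495:
        return "blue"
    if 495 <= wavelength_nm < 570:
        return "green"
    return "other"


def grue_name(wavelength_nm: float, first_observed: float, switch_time: float) -> str:
    """Grue/bleen naming: needs the colour mapping AND a clock reading."""
    color = color_name(wavelength_nm)
    before = first_observed < switch_time
    if (color == "green" and before) or (color == "blue" and not before):
        return "grue"
    if (color == "blue" and before) or (color == "green" and not before):
        return "bleen"
    return "other"


if __name__ == "__main__":
    print(color_name(530))
    print(grue_name(530, first_observed=1, switch_time=10))
    print(grue_name(530, first_observed=11, switch_time=10))
```

In this sketch `grue_name` has to call `color_name` and take two extra arguments, which is one concrete sense in which it is the less simple perception system.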

"An object is grue iff it is first observed before time T, and it is green, or it is first observed after time T, and it is blue."

I don't see any reason such an object is likely to eat me when I'm walking around in the dark.

Bayes Slays Goodman's Grue

You don't need Bayes to solve 'grue' problems. Merely reductionism.

"Goodman's Grue" just doesn't seem to be a problem at all. It can only seem like a problem if you forget that 'grue' is a name given to a somewhat complex sequence of events (relative to a thing just being a color) and start making mistakes when manipulating the symbol. There just isn't any reason to suppose there is any 'threat to the whole of science' in the first place.

I agree. You are essentially saying that the grue problem only looks like a threat if you forget that green and blue are not simply syntactic binary predicates from first-order logic; once you remember that they are semantic concepts, it is clear that the grue problem is not at all a threat to science. But this is no trivial result: it means that there is a part of the application of Bayes, i.e., induction, which requires the acquisition of semantic concepts. If you fed evidence statements into a Bayesian program, it would have to have an understanding of the semantic application of terms like green and grue. So you are right: reducing "green" and "grue" to their semantic/physical tests is the key in my proposed solution. Bayes can't be enough, obviously, since Bayes is a syntactic and axiomatic system.

I guess what seemed Bayesian to me about the whole thing was the analogy to Bayes's table problem, which was the main intuition pump I used to solve the problem. I'll edit the article to reflect this. Thanks.

The skeptic would continue:

"Why should we treat the observation of several occurrences of Right, as evidence for 'The next ball will land on the right.' and not as evidence for 'The next ball will land reft of the line.'?"

It's evidence for both.

The solution to the grue problem is a combination of biting the bullet and Occam's razor

Nitpick: Emeralds are a bad example. An "emerald" is just green beryl - a blue instance of the same mineral is just a blue piece of beryl. They exist, but they aren't emeralds.

Philosophy of Science textbooks mention that fact. Goodman chose a bad example and now we must all pay the price.

I recommend editing this post to have shorter paragraphs.

The original problem, as stated, is "valid": a mind with a "grue"-like prior would make the grue prediction, while normal human minds (with a "green"-like prior, mostly as a result of our evolution around colors) would make the "green" prediction. If we want a more neutral prior, we go with "minimum message length", and "what are colors". Grue and green are words in a dictionary, so they do not count for math -- only Turing machines do. It's simpler to write a Turing machine which puts out "light at XXXhz, light at XXXhz" than one that takes time T into account. Therefore, the green prior is more in line with an MML-prior mind. We take MML priors as most compatible with human-like reasoning.

This seems problematic because it implies that humans would be perfectly fine with accepting grue over blue if they didn't know about the nature of light.

Fortunately, the reason this helps is deeper than counting the number of hertz. When you want to determine the complexity of a term, you have to specify what language to use to write the term. The reason grue seems complicated to us evolved animals is because it has higher complexity in the language of our observations - the language of what neurons we feel light up when we look at the rock.

So does that mean that if an entity had a neuronal structure that intuited grue and bleen it would be justified in treating the hypothesis that way? I'd be willing to bite that bullet I think.

It means that that entity's evolved instincts would be out-of-whack with the MML, so if that entity also got to the point where it invented Turing machines, it would see the flaw in its reasoning. This is no different than realizing that Maxwell's equations, though they look more complicated than "anger" to a human, are actually simpler. Sometimes, the intuition is wrong. In the blue/grue case, human intuition happens to not be wrong, but a hypothetical entity is -- and both humans and the entity, after understanding math and computer science, would agree that humans are wrong about anger, and hypothetical entities are wrong about grue. Why is that a problem?

This seems problematic because it implies that humans would be perfectly fine with accepting grue over blue if they didn't know about the nature of light.

Right, they would, if for weird historical reasons they also thought of "grue" and "bleen" as reasonable linguistic primitives. So the human scientists would be surprised when the next emerald turned out to be bleen rather than grue, and they'd be able to observe that the shift happened at time T, and thus observe that green is a natural property. So this isn't really much of a problem.

"To this I know of no reply which the grue skeptic can make, if he/she say's the paragraph back to me with the proper words swapped, it is not true, because In the hypothetical where we have a table, a line, and we are calling one side right and another side left, the only way for Refts:Lefts behave as expected as more trials are added is to move the line (if even that), otherwise the ratio of Refts to Lights will approach the reciprocal of Rights to Lefts. "

He can simply define the term "line" to imply that it flips directions at time t.

This paradox seems to be equivalent to talking about the programming language that the K-complexity of something uses. For example, in any realistic programming language, it would be easier to define MWI than the Copenhagen interpretation of quantum mechanics, since the latter involves all the laws of the former and then some, but what if you use a language that, once MWI is defined, assumes wavefunction collapse and such unless told otherwise? You can construct a language to match any given prior, and while any two such languages and priors will converge in the limit, you can't say which is right for a finite case.

It may be that the rule that the emerald construction sites use to get either a green or non-green emerald change at time T, but there is no reason to believe that the rule will change if there has never been any change demonstrated in the position of the line before

There's your error! You think that the line is in the middle of the table through the entire experiment, but actually it's in the riddle of the table, where "riddle" means "in the middle of the table before time T and on the right side of the table afterward." All of our experience before time T has confirmed this.

So... your Bayesian answer to the grue problem is to become a frequentist? You're doing it wrong.

As has been pointed out to you, "grue" is a description of a perfectly consistent prior on observations. The reason that "green" is preferable is its simplicity (in terms of basic predictions of physical events) and specificity (i.e. if T is unspecified, then the "green" hypothesis makes more specific predictions than "grue", while if it is specified, then the complexity of the number T comes into play).

Actually, it is unsolvable in a Bayesian framework, and the only honest answer would be to admit it.

Bayesianism gives you consistency, but it doesn't anchor you to reality in any way. Assignment of probabilities that prefers green, and assignment of probabilities that prefers grue are both equally consistent.

Many people on lesswrong have been trying to handwave the problem away with Kolmogorov Complexity, but if you check real math, then you'll see that for any finite amount of data it solves exactly nothing - two different computational models have finite difference in probability assignment, but this finite difference is unbounded, and for any computational model you can find another that's arbitrarily far away from it.

No finite amount of data will cause non-negligible amount of convergence between models, since their differences are unbounded times greater than informational contents of that information.

At some point you'll have to admit that green and grue versions are equally consistent with data and all logical a priori reasons, and it's just your personal (or societal or whatever) preference to accept green over grue.

PS. This is all completely unrelated to the second big issue with Bayesianism that you only get consistency over infinite models by breaking Gödel's incompleteness theorems - and every theory where you're not allowed to say "I don't know" without assigning a specific probability number to it shares this problem. Between these two problems I see Bayesianism as a useful tool, not as any deeper theory of reality.

"If you're insane enough, and have unreasonable enough priors, even Bayesianism won't save you," is an argument against insanity and unreasonableness, not against Bayesianism.

Let's say that the green premises brings the probability of "A new unobserved emerald is green." to 99%. In the skeptic's hypothesis, by symmetry it should also bring the probability of "A new unobserved emerald is grue." to 99%. But of course after time T, this would mean that the probability of observing a green emerald is 99%, and the probability of not observing a green emerald is at least 99%, since these sentences have no intersection, i.e., they cannot happen together, to find the probability of their disjunction we just add their individual probabilities. This must give us a number at least as big as 198%...

Let's do actual Bayesian math on this problem. Let Green_n be "the green premises 1 through n", and so on.

Pr( An emerald is grue | Emerald is green, it is before time T ) = ~1.

Pr( An emerald is grue | Emerald is green, it is after time T ) = ~0.

Pr( An emerald is grue | Emerald is blue, it is before time T ) = ~0.

Pr( An emerald is grue | Emerald is blue, it is after time T ) = ~1.

These are our grue axioms - the probabilistic representation of "grue iff green before time T or blue after time T".

Pr( New Emerald is green | Green_n ) = 0.99 (this is our first sentence axiom)

Pr( New Emerald is grue | Emerald is green ) = undefined. We need to know whether we are pre-T or post-T, and we have not been given the prior probability of being pre-T (from which we could derive its complement, post-T, or vice versa).

But that is wussing out; Bayesian agents should always be able to assign some level of credence. Assume maximum ignorance about T: it is equally likely to be pre-T or post-T.

We can find Pr( New Emerald is grue | Emerald is green ) by finding Pr( New Emerald is grue | Emerald is green, pre-T ) and adding it to Pr( New Emerald is grue | Emerald is green, post-T ) then normalising.

The pre-T case: Pr( New Emerald is grue | Green_n, pre-T ) = Pr( emerald is grue | emerald is green, pre-T ) × Pr( emerald is green ). We know that Pr( emerald is green ) is 0.99. We have Pr( emerald is grue | emerald is green, pre-T ) as an axiom of ~1 above. So 0.99 × ~1 = ~0.99.

The post-T case: Pr( New Emerald is grue | Green_n, post-T ) = Pr( emerald is grue | emerald is green, post-T ) × Pr( emerald is green ). We know that Pr( emerald is green ) is 0.99. We have Pr( emerald is grue | emerald is green, post-T ) as an axiom of ~0 above. So 0.99 × ~0 = ~0.

Normalising gives us: Pr( New Emerald is grue | Emerald is green ) = ~.495.

This is for the case where we don't know if pre-T or post-T. When you say

But of course after time T

we can ask a better question than "Pr( New Emerald is grue | Emerald is green ) ?". We can ask Pr( An emerald is grue | Emerald is green, it is after time T ). This is an axiom from before! We now know it's ~0, which resolves the problem: the probabilities sum to 0.99 + epsilon, which is below 1, which conserves the Kolmogorov axioms.

Whence the problem? The error creeps in when you use the pre-T case to get one .99, and then you use the complement of the post-T case to get another .99, and then add them together. If you specify pre-T or post-T, but then swap T in calculating some of the posterior probabilities, of course you can violate probability theory! You're already violating it by varying the state of T within the calculation!

If I am not mistaken, this is my independent formulation of the formal Bayesian resolution of the Goodman Grue paradox.
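The calculation above can be replayed numerically. The sketch below is one possible reading of it, not the commenter's own code: it idealises the "~1" and "~0" axioms to exactly 1 and 0, `p_green` is the 0.99 from the first axiom, `p_blue` is a hypothetical stand-in for the leftover epsilon, and `p_pre` is the assumed prior probability of being before T.

```python
def p_grue(p_green: float, p_blue: float, p_pre: float) -> float:
    """P(new emerald is grue), marginalising over pre-T and post-T.

    Idealised grue axioms: before T, grue means green; after T, grue means blue.
    """
    if p_green + p_blue > 1 + 1e-12:
        raise ValueError("p_green + p_blue must not exceed 1")
    if not 0 <= p_pre <= 1:
        raise ValueError("p_pre must be a probability")
    grue_if_pre = p_green   # Pr(grue | green, pre-T) = 1, Pr(grue | blue, pre-T) = 0
    grue_if_post = p_blue   # Pr(grue | green, post-T) = 0, Pr(grue | blue, post-T) = 1
    return p_pre * grue_if_pre + (1 - p_pre) * grue_if_post


if __name__ == "__main__":
    # Maximum ignorance about T, with blue treated as negligible.
    print(p_grue(0.99, 0.0, 0.5))
    # Known to be after T: green and grue are exclusive and sum to at most 1.
    print(0.99 + p_grue(0.99, 0.005, 0.0))
```

With `p_pre` fixed throughout a single calculation, the probabilities of "green" and "grue" after T can never add up past 1, which is the point of the comment.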

Solomonoff Induction is a formalized answer to problems of inference which also applies to the grue problem. It basically just says to weigh all possible explanations that fit your data by their complexity, but it is specified mathematically. Since grue is more complex than green, it weighs green much higher until reason to believe in grue shows up.

This is slightly off topic though, because the key is reducing the items you're talking about to what they are made up of so that you can properly encode them in order to compare the complexity. As said here, it just takes reductionism.

Note that this question was first put forward in 1955, so that it was a purely hypothetical question until 1 January 2000, when sapphires were discovered to be grue. (Before and after images of the same gem.)

The case makes an interesting parallel to the term "black swan", another famous philosophical thought experiment that received unexpected data.

One would suspect that the emerald-producing locations in our universe do not behave quite as cleanly or as mathematically as you describe them. Instead, fuzziness and messiness creep in. Maybe such sites degrade over time, causing the emeralds to be slightly bluer. Maybe not.

Broad principles like "green earlier implies green now" are approximations that allow us to simplify the complexity of actual, extremely difficult Bayesian inference.