Followup to: Beautiful Math, Expecting Beauty, Is Reality Ugly?

Should we expect rationality to be, on some level, simple?  Should we search and hope for underlying beauty in the arts of belief and choice?

Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):

"Two medical researchers use the same treatment independently, in different hospitals.  Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then.  The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require.  But in fact, both stopped with exactly the same data:  n = 100 [patients], r = 70 [cures].  Should we then draw different conclusions from their experiments?"  (Presumably the two control groups also had equal results.)

According to old-fashioned statistical procedure - which I believe is still being taught today - the two researchers have performed different experiments with different stopping conditions.  The two experiments could have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses.  It's quite possible that the first experiment will be "statistically significant", the second not.

Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.

Non-Bayesian statisticians might shrug, saying, "Well, not all statistical tools have the same strengths and weaknesses, y'know - a hammer isn't like a screwdriver - and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network.  You've got to use the right tool for the occasion.  Life is messy -"

And then there's the Bayesian reply:  "Excuse you?  The evidential impact of a fixed experimental method, producing the same data, depends on the researcher's private thoughts?  And you have the nerve to accuse us of being 'too subjective'?"

If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing.  If Nature is another way, the likelihood of the data coming out that way will be something else.  But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions.  So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments.  At least one of the two Old Style methods must discard relevant information - or simply do the wrong calculation - for the two methods to arrive at different answers.
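A minimal sketch of this point in Python, under loudly stated assumptions: the cure rate is given a uniform prior over a grid, and `HYPOTHETICAL_PATHS` is a made-up stand-in for the number of patient orderings compatible with the second researcher's stopping rule (Jaynes does not specify the rule precisely enough to count them). Whatever that constant is, it does not depend on the cure rate, so it cancels on normalization.

```python
from math import comb

N, R = 100, 70                           # patients treated, cures observed
grid = [i / 100 for i in range(1, 100)]  # candidate cure rates 0.01 .. 0.99


def posterior(likelihood, grid):
    """Posterior over the grid under a uniform prior: normalized likelihood."""
    weights = [likelihood(p) for p in grid]
    total = sum(weights)
    return [w / total for w in weights]


def likelihood_fixed(p):
    """Researcher 1 (fixed N): every ordering of 70 cures in 100 is possible."""
    return comb(N, R) * p**R * (1 - p) ** (N - R)


# Assumption: an arbitrary positive constant, NOT a real count. Under optional
# stopping only some orderings are possible, but their number is free of p.
HYPOTHETICAL_PATHS = 12345


def likelihood_optional(p):
    """Researcher 2 (optional stopping): same p-dependence, different constant."""
    return HYPOTHETICAL_PATHS * p**R * (1 - p) ** (N - R)


post_fixed = posterior(likelihood_fixed, grid)
post_optional = posterior(likelihood_optional, grid)
max_gap = max(abs(a - b) for a, b in zip(post_fixed, post_optional))
print("largest difference between the two posteriors:", max_gap)
```

The two likelihood functions differ only by a multiplicative constant, which is why the posteriors agree up to floating-point rounding.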

The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I'm not going to try to recount that elder history in this blog post.

But one of the central conflicts is that Bayesians expect probability theory to be... what's the word I'm looking for?  "Neat?"  "Clean?"  "Self-consistent?"

As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system.  No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent - every theorem compatible with every other theorem.

If you want to know the sum of 10 + 10, you can redefine it as (2 * 5) + (7 + 3) or as (2 * (4 + 6)) or use whatever other legal tricks you like, but the result always has to come out to be the same, in this case, 20.  If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions.  (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)

If you get the result 19 = 20, look hard for that error you just made, because it's unlikely that you've sent arithmetic itself up in smoke.  If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory - like, say, two different evidential impacts from the same experimental method yielding the same results - then the whole edifice goes up in smoke.  Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory.

Math!  That's the word I was looking for.  Bayesians expect probability theory to be math.  That's why we're interested in Cox's Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory.  Coherent math is great, but unique math is even better.

And yet... should rationality be math?  It is by no means a foregone conclusion that probability should be pretty.  The real world is messy - so shouldn't you need messy reasoning to handle it?  Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox.  It's nice when problems are clean, but they usually aren't, and you have to live with that.

After all, it's a well-known fact that you can't use Bayesian methods on many problems because the Bayesian calculation is computationally intractable.  So why not let many flowers bloom?  Why not have more than one tool in your toolbox?

That's the fundamental difference in mindset.  Old School statisticians thought in terms of tools, tricks to throw at particular problems.  Bayesians - at least this Bayesian, though I don't think I'm speaking only for myself - we think in terms of laws.

Looking for laws isn't the same as looking for especially neat and pretty tools.  The second law of thermodynamics isn't an especially neat and pretty refrigerator.

The Carnot cycle is an ideal engine - in fact, the ideal engine.  No engine powered by two heat reservoirs can be more efficient than a Carnot engine.  As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.

But, of course, you can't use a Carnot engine to power a real car.  A real car's engine bears the same resemblance to a Carnot engine that the car's tires bear to perfect rolling cylinders.

Clearly, then, a Carnot engine is a useless tool for building a real-world car.  The second law of thermodynamics, obviously, is not applicable here.  It's too hard to make an engine that obeys it, in the real world.  Just ignore thermodynamics - use whatever works.

This is the sort of confusion that I think reigns over those who still cling to the Old Ways.

No, you can't always do the exact Bayesian calculation for a problem.  Sometimes you must seek an approximation; often, indeed.  This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms.  Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation - and fails to the extent that it departs.

Bayesianism's coherence and uniqueness proofs cut both ways.  Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests.  This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).
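As a toy illustration of a Dutch book (a sketch with made-up numbers, not anything from the post): suppose an agent assigns credence 0.6 to A and also 0.6 to not-A, and so regards 60 cents as a fair price for a ticket paying 100 cents on each. The function name `net_result` and the prices are hypothetical.

```python
def net_result(price_a, price_not_a, a_happens):
    """Net gain in cents from buying one 100-cent ticket on A and one on not-A."""
    cost = price_a + price_not_a
    payout_a = 100 if a_happens else 0       # ticket on A pays only if A happens
    payout_not_a = 0 if a_happens else 100   # ticket on not-A pays otherwise
    return payout_a + payout_not_a - cost


# Incoherent credences: P(A) = 0.6 and P(not A) = 0.6 sum to more than 1.
incoherent = [net_result(60, 60, outcome) for outcome in (True, False)]

# Coherent credences: P(A) = 0.6 and P(not A) = 0.4.
coherent = [net_result(60, 40, outcome) for outcome in (True, False)]

print("incoherent agent:", incoherent)
print("coherent agent:  ", coherent)
```

The incoherent agent loses 20 cents however A turns out, while the coherent agent breaks even in both cases.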

You may not be able to compute the optimal answer.  But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory.  You may not know the explanation; that does not mean no explanation exists.

So you want to use a linear regression, instead of doing Bayesian updates?  But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.

You want to use a regularized linear regression, because that works better in practice?  Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
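The correspondence in the last two paragraphs can be checked numerically. The following is a sketch under assumptions not in the post: a one-parameter model y = w * x + noise, synthetic data, a known noise scale `SIGMA`, and a zero-mean Gaussian prior of scale `TAU` on the weight. All names and values are illustrative.

```python
import random

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(50)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]  # synthetic data, true w = 2

SIGMA = 0.5                  # assumed noise standard deviation
TAU = 1.0                    # assumed prior standard deviation on w
LAM = SIGMA**2 / TAU**2      # the ridge penalty this prior corresponds to

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

w_ols = sxy / sxx            # ordinary least squares
w_ridge = sxy / (sxx + LAM)  # ridge regression with penalty LAM * w**2


def neg_log_posterior(w, prior_precision):
    """Gaussian likelihood plus N(0, 1/prior_precision) prior, up to constants."""
    sse = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return sse / (2 * SIGMA**2) + prior_precision * w**2 / 2


def argmin_on_grid(f):
    """Brute-force minimizer over w in [0, 4] with step 0.001."""
    grid = [i / 1000 for i in range(4001)]
    return min(grid, key=f)


# Uniform prior = zero prior precision; Gaussian prior = precision 1 / TAU**2.
w_map_flat = argmin_on_grid(lambda w: neg_log_posterior(w, 0.0))
w_map_gauss = argmin_on_grid(lambda w: neg_log_posterior(w, 1 / TAU**2))

print("OLS:", w_ols, " MAP with uniform prior:", w_map_flat)
print("ridge:", w_ridge, " MAP with Gaussian prior:", w_map_gauss)
```

The grid search is deliberately crude; it is there only to show that the posterior mode lands on the closed-form least-squares and ridge estimates to within the grid spacing.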

Sometimes you can't use Bayesian methods literally; often, indeed.  But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done.  You will never find a statistical method that yields a better answer.  You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate.  Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow.  But when you can directly use a calculation that mirrors the Bayesian law, you're done - like managing to put a Carnot heat engine into your car.  It is, as the saying goes, "Bayes-optimal".

It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, ...} and pointing to the first differences {7, 19, 37, 61, ...} and saying "Look, life isn't always so neat - you've got to adapt to circumstances."  And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, ...}.  And the critics are saying, "What the heck are you talking about?  It's 7, 19, 37 not 6, 6, 6.  You are oversimplifying this messy problem; you are too attached to simplicity."
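The arithmetic behind the cubes example is easy to reproduce. A short sketch (the helper name `differences` is just for illustration):

```python
def differences(seq):
    """Differences between consecutive terms of a sequence."""
    return [b - a for a, b in zip(seq, seq[1:])]


cubes = [n**3 for n in range(1, 9)]  # 1, 8, 27, 64, 125, ...
first = differences(cubes)           # the "messy" surface level
second = differences(first)
third = differences(second)          # the stable underlying level

print(first)
print(second)
print(third)
```

The third differences are constant because the cubes come from a degree-3 polynomial; each round of differencing lowers the degree by one.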

It's not necessarily simple on a surface level.  You have to dive deeper than that to find stability.

Think laws, not tools.  Needing to calculate approximations to a law doesn't change the law.  Planes are still atoms, they aren't governed by special exceptions in Nature for aerodynamic calculations.  The approximation exists in the map, not in the territory.  You can know the second law of thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine.  The second law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.

We aren't enchanted by Bayesian methods merely because they're beautiful.  The beauty is a side effect.  Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws.

Addendum: Cyan directs us to chapter 37 of MacKay's excellent statistics book, free online, for a more thorough explanation of the opening problem.


Jaynes, E. T. (1990). Probability Theory as Logic. In: P. F. Fougere (Ed.), Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.

MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.

Comments

To answer your story about data:

One person decides on a conclusion and then tries to write the most persuasive argument for that conclusion.

Another person begins to write an argument by considering evidence, analyzing it, and then comes to a conclusion based on the analysis.

Both of those people type up their arguments and put them in your mailbox. As it happens, both arguments happen to be identical.

Are you telling me the first person's argument carries the exact same weight as the second?

In other words, yes, the researcher's private thoughts do matter, because P(observation|researcher 1) != P(observation|researcher 2) even though the observations are the same.

I think that's the proper Bayesian objection, anyway.

Eliezer, I accept your point about the underlying laws of probability. However, your example is extremely flawed.

Of course what the researcher operates by should affect our interpretation of the evidence; it is, in itself, another piece of evidence! Specifically in this case, publishing your research only when you reach a certain conclusion implies that any similar researches that did not reach this threshold did not get published, and are thus not available to our evidence pool. This is filtered evidence.

So without knowing how many similar researches were conducted, the conclusion from the one research that did get published can't be seen as very strong. Do I need to draw the Bayesian analysis that shows why?

Emil, thanks, fixed.

Doug, your analogy is not valid because a biased reporting method has a different likelihood function to the possible prior states, compared to an unbiased one. In this case the single, fixed dataset that we see, has a different likelihood to the possible prior states, depending on the reporting method.

If a researcher who happens to be thinking biased thoughts carries out a fixed sequence of experimental actions, the resulting dataset we see does not have a different likelihood function to the possible prior states. All that a Bayesian needs to know is the experimental actions that were actually carried out and the data that was actually observed - not what the researcher was thinking at the time, or what other actions the researcher might have performed if things had gone differently, or what other dataset might then have been observed. We need only consider the actual experimental results.

Londenio, see Ron's comment - it's not a strawperson.

Just a note here: the fact that a dataset has the same likelihood function regardless of the procedure that produced it is actually NOT a trivial statement - the way I see it, it is a somewhat deep result which follows from the optional stopping theorem and the fact that the likelihood function is bounded. Not trying to nitpick, just pointing out that there is something to think about here. According to my initial intuitions, this was actually rather surprising - I didn't expect experimental results constructed using biased data (in the sense of non-fixed stopping time) to end up yielding unbiased results, even with full disclosure of all data.

Great point but I worry that people will point to this post and say "See? Publication bias/questionable study design/corporate funding/varying peer review processes don't matter!"

In other words, it's good to strive for a fixed experimental process but reality is rarely that tidy.

Whoops, looks like I may have shot myself in the foot. The same way argument screens off authority, the actual experiment that was run screens off the intentions of the researcher.

Efficacy of the drug -> Results of the experiment <- Bias of the researcher

Efficacy, Bias -> Results of the experiment -> Our analysis of the efficacy of the drug

Doug S., I agree on principle, but disagree on your particular example because it is not statistical in nature. Should we not be hugging the query "Is the argument sound?" If a random monkey typed up a third identical argument and put it in the envelope, it's just as true. The difference between this and a medical trial is that we have an independent means to verify the truth. Argument screens off Methodology...

If evidence is collected in violation of the fourth amendment rights of the accused, it's inadmissible in court, yes, but that doesn't mean that, ceteris paribus, the prosecution KNOWS LESS than if it were obtained legally.

So, when do I start agreeing with you? Here: The problem lies in the fact that the two trial methodologies create different sorts of Everett branches. The fact that the methodologies differed is ITSELF a piece of evidence which the esteemed Mr. Yudkowsky doesn't appear to have room for in this Bayesian analysis. I agree that the relevant post appears to be What Evidence Filtered Evidence?

"This doesn't mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is made out of atoms" should read "... is not made out of atoms."

"Bayesianism's coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox's coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains)."

I've never understood why I should be concerned about dynamic Dutch books (which are the justification for conditionalization, i.e., the Bayesian update). I can understand how static Dutch books are relevant to finding out the truth: I don't want my description of the truth to be inconsistent. But a dynamic Dutch book (in the gambling context) is a way that someone can exploit the combination of my belief at time (t) and my belief at time (t+1) to get something out of me, which doesn't seem like it should carry over to the context of trying to find out the truth. When I want to find the truth, I simply want to have the best possible belief in the present -- at time (t+1) -- so why should "money" I've "lost" at time (t) be relevant?

Perhaps I simply want to avoid getting screwed in life by falling into the equivalents of Dutch books in real, non-gambling-related situations. But if that's the argument, it should depend on how frequently such situations actually crop up -- the mere existence of a Dutch book shouldn't matter if life is never going to make me take it. Why should my entire notion of rationality be based on avoiding one particular -- perhaps rare -- type of misfortune? On the other hand, if the argument is that falling for dynamic Dutch books constitutes "irrationality" in some direct intuitive sense (the same way that falling for static Dutch books does), then I'm not getting it.

If there is a difference, it is not because the experiments went differently, it is because the experiments could have gone differently, and so the likelihoods of them happening the way they did happen is different.

The Monty Hall problem was mentioned above. I pick a door, Monty opens a door to reveal a goat, I can stick or switch (but can't take the goat). Whether Monty is picking a random door or picking a door he knows doesn't have the car, the evidence is the same - Monty opened a door and revealed a goat. But what matters is what might have happened otherwise. If Monty always picks a door with a goat, then I win if I switch 2/3 of the time. If Monty might have picked the door with the car (and just happened not to), I win if I switch only 50% of the time.

Same evidence, different conclusions based solely on what someone might have done otherwise not based on what actually happened; and I am confident of the difference in the Monty Hall problem, as I have not only read about it but also simulated it.

In the situation given, Researcher 2 did stop at 100 experiments, but might have stopped at 49, or 280. Researcher 1 was sure to stop at 100. I am not unwilling to accept that this doesn't change the meaning of the evidence, in this case, but I do not understand at all why it should be "obvious" that it can't, given that it does in the case of the Monty Hall problem.
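The Monty Hall figures quoted in this comment can be checked by exact enumeration rather than random simulation. This is a sketch, assuming the player always picks door 0 and that a Monty with a choice of doors picks uniformly among them; the function name and return shape are illustrative.

```python
from fractions import Fraction


def monty_hall(monty_knows):
    """Return (P(goat revealed), P(switching wins | goat revealed))."""
    pick = 0
    p_goat = Fraction(0)
    p_goat_and_win = Fraction(0)
    others = [d for d in range(3) if d != pick]
    for car in range(3):  # the car is equally likely to be behind each door
        if monty_knows:
            openable = [d for d in others if d != car]  # never reveals the car
        else:
            openable = others                           # may reveal the car
        for opened in openable:
            p = Fraction(1, 3) * Fraction(1, len(openable))
            if opened == car:
                continue  # car revealed: not the event we actually observed
            p_goat += p
            switched_to = next(d for d in others if d != opened)
            if switched_to == car:
                p_goat_and_win += p
    return p_goat, p_goat_and_win / p_goat


print("Monty knows:  ", monty_hall(True))
print("Monty random: ", monty_hall(False))
```

Note the first number in each pair: the probability of seeing a goat at all depends on Monty's rule, so here the rule changes the likelihood of the exact event observed.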

The difference is that depending on Monty's algorithm, there is a different probability of getting the exact result we saw, namely seeing a goat. The exact event we actually saw happens with different probability depending on Monty's rule, so Monty's rule changes the meaning of that result.

The researchers don't get a given exact sequence of 100 results with different probability depending on their state of mind - their state of mind is not part of the state of the world that the result sequence tells us about, the way Monty's state of mind is part of the world that generates the exact goat.

To look at it another way, a spy watching Monty open doors and get goats would determine that Monty was deliberately avoiding the prize. Watching a researcher stop at 100 results doesn't tell you anything about whether the researcher planned to stop at 100 or after getting a certain number of successes. So, just like that result doesn't tell you anything about the researcher's state of mind, knowing about the researcher's state of mind doesn't tell you anything about the result.

Did you read the chapter linked at the end of the post?

A hopefully intuitive explanation: A spy watching the experiments and using Bayesian methods to make his own conclusions about the results, will not see any different evidence in each case and so will end up with the same probability estimate regardless of which experimenter he watched.

While the second experimenter might be contributing to publication bias by using that method in general, he nonetheless should not have come up with a different result.

It seems worth noting the tension between this and bottom-line reasoning. Could the second experimenter have come up with the desired result no matter what, given infinite time? And if so, is there any further entanglement between his hypothesis and reality?

"Two medical researchers use the same treatment independently [...] one had decided beforehand [...] he would stop after treating N=100 patients, [...]. The other [...] decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, [...]. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?"

[...]

If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen, has nothing to do with the researcher's private intentions. [...]

The expectations and the stopping rule make a difference. The reason the Monty Hall Puzzle turns out the way it does is that part of the setup is that Monty Hall always opens a different door than you chose. When I tell the story without mentioning that fact, you should get a different answer.

Conchis and Benquo: Eliezer's response to Doug was that the probability of a favorable argument is greater, given a clever arguer, than the prior probability of a favorable argument. But the probability of a 60% effectiveness given 100 trials, given an experimenter who intended to keep going until he had a 60% effectiveness, is no greater than the prior probability of a 60% effectiveness given 100 trials. This should be obvious, and does distinguish the case of the biased intentions from the case of the clever arguer.

Eliezer,

I'm afraid that I too was seduced by Doug's analogy, and for some reason am a little too slow to follow your response. Any chance you could try again to explain why the analogy doesn't work?

Are you telling me the first person's argument carries the exact same weight as the second?

Yes. It's the arguments that matter.

Now, if we know that one person was trying to support a thesis and the other presenting the data and drawing a conclusion, we can weight them differently, if we only have access to one. The first case might leave out contrary data and alternative hypotheses in an attempt to make the thesis look better. We expect the second case to mention all relevant data and the obvious alternatives, if only briefly, so the absence of contrary data is evidence of its nonexistence in that case.

Since we have both, we can exclude the possibility that the first author left out data to make his case look better. Thus, the two arguments are equally valid.

You know what really helps me accept a counterintuitive conclusion? Doing the math. I spent an hour reading and rereading this post and the arguments without being fully convinced of Eliezer's position, and then I spent 15 minutes doing the math (R code attached at the end). And once the math came out in favor of Eliezer, the conclusion suddenly doesn't seem so counterintuitive :)

Here we go, I'm dividing all the numbers by five to make the code work but it's pretty convincing either way.

  • The setup - Researcher A does 20 trials always, researcher B keeps doing trials until the ratio of cures is at least 70% (1 cure / 1 trial is also acceptable).
  • E - The full evidence, namely that 20 patients were tried and 14 were cured.
  • H0 - The hypothesis that the success rate of the cure is 60%.
  • H1 - The hypothesis that the success rate is 70%.
  • Pa - Researcher A's probabilities.
  • Pb - Researcher B's probabilities.

In this setup, it's clear to see that Pa and Pb aren't equal for everything you want to measure. For example, for any evidence E that doesn't contain 20 observations Pa(E)=0. However, Reverend Bayes reminds us that the strength of our EVIDENCE depends on the odds ratio, and not on all the sub probabilities:

P(H1|E) / P(H0|E) = [P(H1) / P(H0)] * [P(E|H1) / P(E|H0)], aka posterior odds = prior odds * odds ratio of evidence. Assuming that the prior odds are the same, let's calculate the odds ratio for both Pa and Pb and see if they are different.

Pa(E|H0) = 12.4%, as a simple binomial distribution: dbinom(14,20,0.6). Pa(E|H1) = 19.2%. The odds ratio: Pa(E|H1)/Pa(E|H0) = 1.54. That's the only measure of how much our posterior should change. If originally we gave each hypothesis an equal chance (1:1), we now favor H1 at a ratio of 1.54:1. In terms of probability, we changed our credence in H1 from 50% to 60.6%.

What about researcher B? I simulated researcher B a million times in each possible world, the H0 world and the H1 world. In the H0 world, evidence E occurred only 5974 times out of a million, for Pb(E|H0) = 0.597% which is very far from 12.4%. It makes sense: researcher B usually stops after the first trial, and occasionally goes on for zillions! What about the H1 world? Pb(E|H1) = 0.919%. The odds ratio: Pb(E|H1) / Pb(E|H0) = wait for it = 1.537. The same, up to simulation noise!

I think all the other posts explain quite well why this was obviously the case, but if you like to see the numbers back up one side of an argument, you got 'em. I personally am now converted, amen.

R code for simulating a single researcher B:

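resb <- function(p = 0.6) {
  cures <- 0
  tries <- 0
  # Since we only care whether B stops after 20 trials, we don't need to simulate past 21.
  while (tries < 21) {
    tries <- tries + 1
    cures <- cures + rbinom(1, 1, p)  # one Bernoulli trial: 1 = cure, 0 = no cure
    if ((cures / tries) >= 0.7) return(tries)  # stopping rule met
  }
  tries  # 21 means B had not stopped by trial 20
}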

R code for simulating a million researchers B in H1 world:

x <- sapply(1:1000000, function(i) resb(0.7))

sum(x == 20)  # number of runs in which B stopped at exactly trial 20
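The simulation above can be cross-checked without random sampling. The following Python sketch (not the commenter's code; the function names are illustrative) counts exactly how many cure/failure orderings lead researcher B to stop at trial 20 with 14 cures. It assumes the same stopping rule as the R code: stop as soon as cures/trials reaches 70%.

```python
from math import comb


def count_paths_b(n=20, r=14):
    """Number of orderings for which B first reaches cures/trials >= 70%
    at trial n, with r cures in total."""
    alive = {0: 1}  # cures so far -> number of orderings that have not stopped
    for t in range(1, n + 1):
        nxt = {}
        for c, ways in alive.items():
            nxt[c] = nxt.get(c, 0) + ways          # next patient not cured
            nxt[c + 1] = nxt.get(c + 1, 0) + ways  # next patient cured
        if t == n:
            return nxt.get(r, 0)
        # Integer form of cures/trials < 0.7: only unstopped orderings continue.
        alive = {c: w for c, w in nxt.items() if 10 * c < 7 * t}
    return 0


def likelihood_a(p):
    """Researcher A (always 20 trials): plain binomial probability of 14 cures."""
    return comb(20, 14) * p**14 * (1 - p) ** 6


def likelihood_b(p):
    """Researcher B: fewer admissible orderings, same dependence on p."""
    return count_paths_b() * p**14 * (1 - p) ** 6


ratio_a = likelihood_a(0.7) / likelihood_a(0.6)
ratio_b = likelihood_b(0.7) / likelihood_b(0.6)
print("orderings for A:", comb(20, 14), " orderings for B:", count_paths_b())
print("odds ratio for A:", ratio_a, " odds ratio for B:", ratio_b)
```

The count of orderings differs between A and B, which is why Pa(E|H) and Pb(E|H) differ, but it does not depend on the cure rate and so cancels in the odds ratio.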

I believe the example in this post is fundamentally flawed. Some of the other commenters have hinted at the reasons, but I want to add my own thoughts on this.

Before we go into the difference between the frequentist and the Bayesian approach to the problem, we first have to be clear about whether the investigators acknowledge publicly that they use different stopping rules. I am going to cover both cases.

If the stopping rule is not publicly acknowledged, the frequentist data analyst can not take it into account. He will therefore have to use the same tests on the two datasets. Therefore, if experiment one rejects the null hypothesis, so will experiment two.

If the stopping rule is known to the public, a frequentist statistician will appropriately take this into account in his data analysis. As Eliezer says, experiment one may be statistically significant and experiment two may be non-significant. And this is completely appropriate; any rational Bayesian would do essentially the same thing:

Let "A" be the event that somebody publishes a study showing that the drug works in 60% of people. Let "B" be the hypothesis that the drug works in at least 60% of people, and let "C" be the biased stopping rule.

In the case of the first investigator, the appropriate likelihood ratio is Pr(A | B) / Pr(A| not B)

In the case of the second investigator, the appropriate likelihood ratio is Pr(A| B, C) / Pr(A| C, not B)

Pr(A | B) / Pr(A| not B) is strictly larger than Pr(A| B, C) / Pr(A| C, not B)

Any Bayesian agent who uses Pr(A | B) / Pr(A| not B) in the case of the second investigator, is throwing away important evidence, namely that a biased stopping rule was applied.

I was confused by this post for some time, and I feel I have an analogous but clearer example: Suppose scientist A says "I believe in proposition A, and will test it at the 95% confidence level", and scientist B says "I believe in proposition B, and will test it at the 99% confidence level". They go away and do their tests, and each comes back from their experiment with a p-value of 0.03. Do we now believe proposition A more or less than proposition B? The traditional scientific method, with its emphasis on testability, prefers A to B; for a Bayesian it's clear that we have the same amount of evidence for each.

Have I fairly characterised both sides? Does this capture the same paradox as the original example, and is it any clearer?

"If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory [...] then the whole edifice goes up in smoke. Along with set theory, 'cause I'm pretty sure ZF provides a model for probability theory."

If you think of probability theory as a form of logic, as Jaynes advocates, then the laws and theorems of probability theory are the proof theory for this logic, and measure theory is the logic's model theory, with measure-theoretic probability spaces (which can be defined entirely with ZF, as you suggest) being the models.

But it seems to me that stopping at a desired result is implicitly the same as "throwing out" other possible results

You did not speak about throwing out possible results. You spoke of throwing out data that went against the desired conclusion.

These are very, very different actions, with different implications.

Paul Gowder,

You've read Jaynes -- now read MacKay.

"Information Theory, Inference, and Learning Algorithms" (available for download here).

The key portions are sections 37.2 - 37.3 (pp 462-465).

There are some rather baroque kinds of prior information which would require a Bayesian to try to model the researcher's thought processes. They pretty much rely on the researcher having more information about the treatment effectiveness than is available to the Bayesian, and that the stopping rule depends on that extra information. This idea could probably be expressed more elegantly as a presence or absence of an edge in a Bayesian network, twiddling the d-separation of the stop-decision node with the treatment effectiveness node.

Had to actually think about it a bit, and I think it comes down to this:

The thing that determines the strength of evidence in favor of some hypothesis vs another is "what's the likelihood we would have seen E if H were true vs what's the likelihood we would have seen E if H were false"

Now, experimenter B is not at all filtering based on H being true or false, but merely the properties of E.

So the fact of the experimenter presenting the evidence E to us can only (directly) potentially give us additional information on the properties of the total evidence E that was collected, rather than (directly) telling us anything about H.

But... the "filtering" rule the experimenter uses is only when to stop experimenting. In other words, once the experimenter does present data E, we know that E is all the evidence there is that he collected. In other words, this isn't filtered evidence in the sense of the experimenter throwing away data he or she doesn't like because once we are given E, there's nothing more to know.

Let me clarify that: Imagine you didn't know the difference in the second experimenter's protocol, you had thought they were the same. Then later you learn the difference. Have you actually learned anything new? Is there any new info about E that you have that you didn't already believe you had?

In this case, no, because unlike filtered evidence situations, the information about the experimenter's intent has no effect on what other possible evidence there may have been that was hidden from you. The probability of you seeing this specific evidence, this specific chunk of data from experimenter A is the same as that from experimenter B, given either effectiveness or non-effectiveness.

There're other patterns of data that one would expect to be possible to see from A but not from B, and other patterns that one would expect to possibly see from B but not from A, but these specific data sets being published have probability completely independent of which experimenter was doing it, right?
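To make the argument above concrete, here is a minimal sketch of the probability of one specific, complete data sequence under each protocol. The rule used for researcher B (stop once at least 100 patients are treated and more than 60% are cured) is only one reading of Jaynes's description, and the function names are made up for illustration.

```python
def stop_fixed_n(outcomes):
    """Researcher A (assumed rule): stop after exactly 100 patients."""
    return len(outcomes) >= 100


def stop_when_rate_exceeds(outcomes):
    """Researcher B (hypothetical rule): stop once n >= 100 and cures > 60%."""
    n = len(outcomes)
    return n >= 100 and sum(outcomes) > 0.6 * n


def sequence_probability(outcomes, theta, stop_rule):
    """Probability of seeing exactly this complete sequence (1 = cure, 0 = no cure).

    The stopping rule looks only at the data, so it contributes a factor of
    1 (the rule stops exactly here) or 0 (it would have stopped earlier, or
    would not have stopped yet). It never depends on theta.
    """
    for k in range(1, len(outcomes)):
        if stop_rule(outcomes[:k]):
            return 0.0  # the experiment would have ended earlier
    if not stop_rule(outcomes):
        return 0.0  # the experiment would still be running
    cures = sum(outcomes)
    return theta ** cures * (1 - theta) ** (len(outcomes) - cures)


data = [1] * 70 + [0] * 30  # one possible ordering of 70 cures in 100 patients

for theta in (0.5, 0.7):
    p_a = sequence_probability(data, theta, stop_fixed_n)
    p_b = sequence_probability(data, theta, stop_when_rate_exceeds)
    print(theta, p_a, p_b, p_a == p_b)
```

A sequence such as 50 cures in 100 patients gets nonzero probability from A and zero from B, which is the "other patterns" point: the protocols differ in which data sets they can produce, not in how likely this particular one is under each hypothesis.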

I am by no means an expert in statistics, but I do appreciate Eliezer Yudkowsky's essay, and think I get his point that, given only experiment A and experiment B, as reported, there may be no reason to treat them differently IF WE DON'T KNOW of the difference in protocol (if those thoughts are truly private). But it does seem rather obvious that, if there were a number of independent experiments with protocol A and B, and we were attempting to do a meta-analysis to combine the results of all such experiments, there would be quite a number of experiments where n would be greater than 100 (from protocol B). With the protocol as stated, these would all end when cures were greater than but very close to 60%. If we assume that the "real" cure rate in the population is close to 70%, then, unless some Bayesian term is introduced to account for the bias in methodology, the meta-analysis would seem to be biased toward the incorrect conclusion that the lower 60% figure was closer to reality. Presumably, that kind of bias would be noticed in the experiments with n > 100, and could not have been kept as a private thought with a large number of repeat experiments.

I am not sure, but I would think that, if Bayesian analysis is (or can be) as rigorous as it is claimed, then even the analysis of the original pair might be expected to include some terms that would reflect that potential bias due to a difference in protocol IF THAT DIFFERENCE IS KNOWN to the Bayesian statistician doing the analysis. I find it disturbing that experiment A could have come out to have n = 100 and cure rate = 60%, or n = 1000 and cure rate = 60%, but not with cure rate = 59%, no matter how large n might have become.
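The meta-analysis worry above can be explored with a small simulation. This is only a sketch under assumptions that are not in the original scenario: protocol B is taken to mean "treat at least 100 patients, then stop as soon as more than 60% are cured", a cap of 2000 patients is added so every study terminates, and all names are hypothetical. It compares two ways of combining studies: averaging each study's own cure rate, and pooling raw counts (total cures over total patients). Pooling is what a likelihood-based analysis uses, since the combined likelihood depends only on those two totals.

```python
import random


def run_protocol_b(theta, rng, min_n=100, threshold=0.6, max_n=2000):
    """Simulate one study under the assumed protocol B; return (patients, cures)."""
    n = cures = 0
    while n < max_n:
        n += 1
        cures += rng.random() < theta  # True counts as 1
        if n >= min_n and cures > threshold * n:
            break
    return n, cures


def meta_analysis(theta=0.7, studies=2000, seed=0):
    """Return (mean of per-study rates, pooled rate) over many simulated studies."""
    rng = random.Random(seed)
    results = [run_protocol_b(theta, rng) for _ in range(studies)]
    mean_of_rates = sum(r / n for n, r in results) / studies
    pooled_rate = sum(r for _, r in results) / sum(n for n, _ in results)
    return mean_of_rates, pooled_rate


mean_of_rates, pooled_rate = meta_analysis()
print("mean of per-study rates:", mean_of_rates)
print("pooled cures / patients:", pooled_rate)
```

Changing `theta` to something below the threshold (say 0.55) makes the difference between the two summaries easier to see, because many more studies then run long and stop just above 60%.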

Eliezer_Yudkowsky: As you described the scenario at the beginning ... you're right. But realistically? You need to think about P(~2nd researcher tainted the experiment|2nd researcher has enormous stake in the result going a certain way). :-P

Oh, wait: assuming the second researcher stops as soon as (r >= 60) AND (N >= 100) (the latter expression to explain that they kept going until r=70), then the distribution above 60 will actually not be any different (all the probability mass that was in r100, well, only the second experimenter could possibly have generated that result.

Are P(r>70|effective) and P(r>70|~effective) really the same in those two experiments? Trivially, at least, in the second one P(r<60)=0, unlike in the first, so the distribution of r over successive runs must be different. The sequences of experimental outcomes happened to be the same in this case, but not in the counterfactual case where fewer than 60 of the first 100 patients were cured, and it seems that in fact that would affect the likelihood ratio. (I may run a simulation when I have the time.)

Something popped into my mind while I was reading about the example in the very beginning. What about research that goes out to prove one thing, but discovers something else?

A group of scientists wants to see if there's a link between the consumption of Coca-Cola and stomach cancer. They put together a huge questionnaire full of dozens of questions and have 1000 people fill it out. Looking at the data they discover that there is no correlation between Coca-Cola drinking and stomach cancer, but there is a correlation between excessive sneezing and having large ears.

So now we have a group of scientists who set out to test correlation A, but found correlation B in the data instead. Should they publish a paper about correlation B?

I have no idea about what's done in actual statistical practice, but it seems to make sense to do this:

Publish the likelihood ratio for each correlation. The likelihood ratio for the correlation being real and replicable will be very high.

Since they bothered to do the test, you can figure that people in the know have decently sized prior odds for the association being real and replicable. There must have been animal studies or a biochemical argument or something. Consequently, a high likelihood ratio for this hypothesis may have been enough to convince them - that is, when it's multiplied with the prior, the resulting posterior may have been high enough to represent the "I'm convinced" state of knowledge.

But the prior odds for the correlation being real and replicable are the same tiny prior odds you would have for any equally unsupported correlation. When they combine the likelihood ratio with their prior odds they do end up with much higher posterior odds for it than they do for other arbitrary-seeming correlations. But, still insignificant.

The critical thing that distinguishes the two hypotheses is whatever previous evidence led them to attempt the test; that's why the prior for the association is higher. It's subjective only in the sense that it depends on what you've already seen - it doesn't depend on your thoughts. Whereas, in what Kindly says is the standard solution, you apply a different test depending upon what the researcher's intentions were.

(I have no idea how you would calculate the prior odds. I mean, Solomonoff induction with your previous observations is the Carnot engine for doing it, but I have no idea how you would actually do it in practice)
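The arithmetic behind this comment is just posterior odds = prior odds x likelihood ratio. A minimal sketch follows. The numbers are invented purely for illustration (nobody in the thread supplied any), and the function names are hypothetical.

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds in favor (e.g. 10 means 10:1) to a probability."""
    return odds / (1 + odds)


likelihood_ratio = 100.0   # made-up: the same strength of evidence for both correlations
prior_motivated = 1 / 10   # made-up: a hypothesis someone had reasons to test
prior_arbitrary = 1 / 100_000  # made-up: a correlation nobody expected

for label, prior in (("motivated", prior_motivated), ("arbitrary", prior_arbitrary)):
    odds = posterior_odds(prior, likelihood_ratio)
    print(label, odds, odds_to_probability(odds))
```

With these illustrative inputs the same likelihood ratio leaves the motivated hypothesis probable and the arbitrary one still improbable, which is the distinction the comment draws.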

Before they publish anything (other than an article on Coca-Cola not being related to stomach cancer) they should first use a different test group in order to determine that the first result wasn't a sampling fluke or otherwise biased. (Perhaps sneezing wasn't causing large ears after all, or large ears were correlated to something that also caused sneezing.)

What brought the probability to your attention in the first place shouldn't be what proves it.

"If A then B" is a separate experiment from "If C then D" and should require separate additional proof.

That's a useful heuristic to combat our tendency to see patterns that aren't there. It's not strictly necessary.

Another way to solve the same problem is to look at the first 500 questionnaires first. The scientists then notice that there is a correlation between excessive sneezing and large ears. Now the scientists look at the last 500 questionnaires -- an independent experiment. If these questionnaires also show correlation, that is also evidence for the hypothesis, although it's necessarily weaker than if another 1000-person poll were conducted.

So this shows that a second experiment isn't necessary if we think ahead. Now the question is, if we've already foolishly looked at all 1000 results, is there any way to recover?

It turns out that what can save us is math. There's a bunch of standard tests for significance when lots of variables are compared. But the basic idea is the following: we can test if the correlation between sneezing and ears is high, by computing our prior for what sort of correlation the two most closely correlated variables would show.

Note that although our prior for two arbitrary variables might be centered at 0 correlation, our prior for two variables that are selected by choosing the highest correlation should be centered at some positive value. In other words: even if the questions were all about unrelated things, we expect a certain amount of correlation between some things to happen by chance. But we can figure out how much correlation to expect from this phenomenon! And by doing some math, we might be able to show that the correlation between sneezing and having large ears is too high to be explained in this way.
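One rough way to "figure out how much correlation to expect" is to simulate questionnaires whose answers are unrelated by construction and look at the largest correlation that turns up anyway. The sketch below is not one of the standard tests mentioned above. It assumes 30 questions with independent, normally distributed answers, and the function names are made up.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def max_chance_correlation(n_people=1000, n_questions=30, seed=0):
    """Largest |correlation| among all pairs of independent (unrelated) answers."""
    rng = random.Random(seed)
    answers = [[rng.gauss(0, 1) for _ in range(n_people)]
               for _ in range(n_questions)]
    return max(abs(pearson(a, b)) for a, b in combinations(answers, 2))


# Repeat with different seeds to see the spread of the "best" chance correlation.
print([round(max_chance_correlation(seed=s), 3) for s in range(3)])
```

An observed sneezing/ears correlation would then be compared against this spread rather than against zero. Real questionnaire answers are not independent normal variables, so this only illustrates the selection effect.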

Okay, that makes tons more sense, I apparently wasn't thinking too clearly when I wrote the first post. (plus I didn't know about the standard tests)

Thanks for setting me straight.

Incidentally, Eliezer, I don't think you're right about the example at the beginning of the post. The two frequentist tests are asking distinct questions of the data, and there is not necessarily any inconsistency when we ask two different questions of the same data and get two different answers.

Suppose A and B are tossing coins. A and B both get the same string of results -- a whole bunch of heads (let's say 9999) followed by a single tail. But A got this by just deciding to flip a coin 10000 times, while B got it by flipping a coin until the first tail came up. Now suppose they each ask the question "what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?"

In A's case the answer is of course very small; most strings of 10000 flips have many more than one tail. In B's case the answer is of course 1; B's method ensures that exactly one tail is seen, no matter what happens. The data was the same, but the questions were different, because of the "when doing what I did" clause (since A and B did different things). Frequentist tests are often like this -- they involve some sort of reasoning about hypothetical repetitions of the procedure, and if the procedure differs, the question differs.

If we wanted to restate this in Bayesian terms, we'd have to do so by taking into account that the interpreter knows what the method is, not just what the data is, and the distributions used by a Bayesian interpreter should take this into account. For instance, one would be a pretty dumb Bayesian if one's prior for B's method didn't say you'd get one tail with probability one. The observation that's causing us to update isn't "string of data," it's "string of data produced by a given physical process," where the process is different in the two cases.

(I apologize if this has all been mentioned before -- I didn't carefully read all the comments above.)
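The two tail probabilities in the coin example above can be computed exactly. The sketch below assumes the hypothesis being tested is a fair coin, which the comment does not state, and the function names are hypothetical. Exact fractions are used because the value for A is far too small for floating point.

```python
from fractions import Fraction
from math import comb


def tail_prob_fixed_flips(n_flips, tails_seen, p_tail=Fraction(1, 2)):
    """A's question: P(at most `tails_seen` tails in exactly `n_flips` flips)."""
    return sum(
        comb(n_flips, k) * p_tail ** k * (1 - p_tail) ** (n_flips - k)
        for k in range(tails_seen + 1)
    )


def tail_prob_flip_until_first_tail(tails_seen):
    """B's question: P(at most `tails_seen` tails when flipping until the first tail).

    B's procedure always ends with exactly one tail, so the answer is 1
    whenever tails_seen >= 1.
    """
    return Fraction(1) if tails_seen >= 1 else Fraction(0)


p_a = tail_prob_fixed_flips(10000, 1)
p_b = tail_prob_flip_until_first_tail(1)
print("A:", p_a.numerator, "/ 2 **", p_a.denominator.bit_length() - 1)
print("B:", p_b)
```

The same data give a vanishingly small number for A and exactly 1 for B, because the "when doing what I did" clause changes the set of outcomes being summed over.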

Now suppose they each ask the question "what is the probability that, when doing what I did, one will come up with at most the number of tails I actually saw?"

That is throwing away data. The evidence that they each observed is the sequence of coin flip results, and the number of tails in that sequence is a partial summary of the data. The reason they get different answers is because that summary throws away more data for B than A. As you say, B already expected to get exactly one tail, so that summary tells him nothing new and he has no information to update on, while A can recover from this summary the number of heads and only loses information about the order (which cancels out anyways in the likelihood ratios between theories of independent coin flips). But if you calculate the probability that they each see that sequence you get the same answer for both, p(heads)^9999 * (1 - p(heads)).

That is, the data gathering procedure is needed to interpret a partial summary of the data, but not the complete data.
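A minimal sketch of this point, working in log-probabilities because p(heads)^9999 underflows ordinary floats. The two candidate values of p(heads), 0.9999 and 0.5, are arbitrary choices for illustration, and the function names are made up.

```python
from math import log


def log_lik_full_sequence(p_heads, heads=9999):
    """log P(9999 heads then one tail | p_heads); identical for A and B."""
    return heads * log(p_heads) + log(1 - p_heads)


def log_lik_summary_a(p_heads, flips=10000):
    """log P(exactly one tail | p_heads) when A flips a fixed 10000 times."""
    return log(flips) + (flips - 1) * log(p_heads) + log(1 - p_heads)


def log_lik_summary_b(p_heads):
    """log P(exactly one tail | p_heads) when B flips until the first tail: log(1)."""
    return 0.0


def log_likelihood_ratio(log_lik, p1, p2):
    """Log of the likelihood ratio between hypotheses p_heads = p1 and p_heads = p2."""
    return log_lik(p1) - log_lik(p2)


for name, f in (("full sequence", log_lik_full_sequence),
                ("A's tail count", log_lik_summary_a),
                ("B's tail count", log_lik_summary_b)):
    print(name, log_likelihood_ratio(f, 0.9999, 0.5))
```

The full sequence and A's tail count give the same log likelihood ratio (the factor of 10000 orderings cancels), while B's tail count gives zero: that summary carries no information for B, but the complete sequence carries the same information for both.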

Sure, the likelihoods are the same in both cases, since A and B's probability distributions assign the same probability to any sequence that is in both of their supports. But the distributions are still different, and various functionals of them are still different -- e.g., the number of tails, the moments (if we convert heads and tails to numbers), etc.

If you're a Bayesian, you think any hypothesis worth considering can predict a whole probability distribution, so there's no reason to worry about these functionals when you can just look at the probability of your whole data set given the hypothesis. If (as in actual scientific practice, at present) you often predict functionals but not the whole distribution, then the difference in the functionals matters. (I admit that the coin example is too basic here, because in any theory about a real coin, we really would have a whole distribution.)

My point is just that there are differences between the two cases. Bayesians don't think these differences could possibly matter to the sort of hypotheses they are interested in testing, but that doesn't mean that in principle there can be no reason to differentiate between the two.