People kept noticing that blood donors were healthier than non-donors. Could giving blood be good for you, perhaps by removing excess iron? Perhaps medieval doctors practicing blood-letting were onto something? Studies (1998, 2001) suggest this is a real correlation, so you see articles like "Men Who Donate Blood May Reduce Risk Of Heart Disease."
While this sounds good, and it's nice when helpful things turn out to be healthy, the evidence is not very strong. When you notice A and B happen together it may be that A causes B, B causes A, or some hidden C causes A and B. We may have good reasons to believe A might cause B, but it's very hard to rule out a potential C. Instead if you intentionally manipulate A and observe what happens to B then you can actually see how much of an effect A has on B.
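Here is a minimal sketch of the "hidden C" case. All the probabilities are made up for illustration, and the names (`P_C`, `p_b_given_a`, `p_b_do_a`) are mine. C raises the chance of both A and B, while A has no effect on B at all. Observing still shows a gap in B between the A and not-A groups; setting A by hand shows none.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_C = {0: 0.5, 1: 0.5}            # P(C=c)
P_A_GIVEN_C = {0: 0.2, 1: 0.8}    # P(A=1 | C=c)
P_B_GIVEN_C = {0: 0.1, 1: 0.6}    # P(B=1 | C=c); B does not depend on A


def joint(c, a, b):
    """P(C=c, A=a, B=b) under the model above."""
    pa = P_A_GIVEN_C[c] if a == 1 else 1 - P_A_GIVEN_C[c]
    pb = P_B_GIVEN_C[c] if b == 1 else 1 - P_B_GIVEN_C[c]
    return P_C[c] * pa * pb


def p_b_given_a(a):
    """What you see by observing: P(B=1 | A=a)."""
    num = sum(joint(c, a, 1) for c in (0, 1))
    den = sum(joint(c, a, b) for c, b in product((0, 1), repeat=2))
    return num / den


def p_b_do_a(a):
    """What you see by intervening: setting A by hand cuts the C -> A link,
    so B depends only on C and the value of a is irrelevant."""
    return sum(P_C[c] * P_B_GIVEN_C[c] for c in (0, 1))


observed_gap = p_b_given_a(1) - p_b_given_a(0)
intervened_gap = p_b_do_a(1) - p_b_do_a(0)
print(observed_gap, intervened_gap)
```

With these numbers the observed gap is large even though the intervention gap is exactly zero.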
For example, people observed (2003) that infants fed soy-based formula were more likely to develop peanut allergies. So they recommended that "limiting soy milk or formula in the first 2 years of life may reduce sensitization." Here A is soy formula, B is peanut allergy, and we do see a correlation. When intentionally varying A (2008, n=620), however, B stays constant, which kind of sinks the whole theory. A likely candidate for a third cause, C, was a general predisposition to allergies: those infants were more likely to react to cow's-milk formula and so be given soy-based ones, and they were also more likely to react to peanuts.
To take another example, based on studies (2000, 2008, 2010) finding a higher miscarriage rate among coffee drinkers, pregnant women are advised to cut back their caffeine consumption. But a randomized controlled trial (2007, n=1207) found people randomly assigned to drink regular or decaf coffee were equally likely to have miscarriages. [EDIT: jimrandomh points out that I misread the study and it didn't actually show this. Instead it was too small a study to detect an effect on miscarriage rates.] A potential third cause (2012) here is that lack of morning sickness is associated with miscarriage (2010) and when you're nauseated you're less likely to drink a morning coffee. This doesn't tell us the root of the problem (why would feeling less sick go along with miscarriages?) but it does tell us cutting back on caffeine is probably not helpful.
Which brings us back to blood donation. What if instead of blood donation making you healthy, healthier people are more likely to donate blood? There's substantial screening involved in becoming a blood donor, plus all sorts of cultural and economic factors that could lead to people choosing to donate blood or not, and those might also be associated with health outcomes. This was noted as a potential problem in 2011 but it's hard to test this with a full experiment because assigning people to give blood or not is difficult, you have to wait a long time, and the apparent size of the effect is small.
One approach that can work in places like this is to look for a "natural experiment," some way in which people might already be being divided into appropriate groups. A recent study (2013, n=12,357+50,889) took advantage of the situation where screening tests sometimes give false positives that disqualify people. These are nearly random, and give us a pool of people who are very similar to blood donors but don't quite make it to giving blood. When comparing the health of these disqualified donors to actual donors the health benefits vanish, supporting the "healthy donor hypothesis."
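To see why the disqualified volunteers make a better comparison group, here is a toy model with invented numbers (none of them come from the 2013 study). Healthy people are more likely to volunteer, screening false positives are independent of health, and donating has no effect on the outcome. The names `P_VOLUNTEER`, `P_FALSE_POSITIVE`, and `good_outcome_rate` are mine.

```python
from itertools import product

# All numbers are hypothetical, picked only to make the pattern visible.
P_HEALTHY = 0.5
P_VOLUNTEER = {1: 0.6, 0: 0.2}      # P(volunteers | healthy=h)
P_FALSE_POSITIVE = 0.05             # screening error, independent of health
P_GOOD_OUTCOME = {1: 0.9, 0: 0.6}   # P(good outcome | healthy=h); donating does nothing


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


# Exact joint over (healthy, volunteers, false_positive, good_outcome).
joint = {
    (h, v, f, o): bern(P_HEALTHY, h) * bern(P_VOLUNTEER[h], v)
    * bern(P_FALSE_POSITIVE, f) * bern(P_GOOD_OUTCOME[h], o)
    for h, v, f, o in product((0, 1), repeat=4)
}


def good_outcome_rate(group):
    """P(good outcome | group), where group is a predicate on (v, f)."""
    num = sum(p for (h, v, f, o), p in joint.items() if group(v, f) and o == 1)
    den = sum(p for (h, v, f, o), p in joint.items() if group(v, f))
    return num / den


donors = good_outcome_rate(lambda v, f: v == 1 and f == 0)
non_volunteers = good_outcome_rate(lambda v, f: v == 0)
disqualified = good_outcome_rate(lambda v, f: v == 1 and f == 1)
print(donors - non_volunteers, donors - disqualified)
```

Donors look healthier than people who never volunteered, but not healthier than volunteers who were turned away by a random screening error. That is the pattern the "healthy donor hypothesis" predicts.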
This isn't to say you should never pay attention to correlations. If your tongue starts peeling after eating lots of citric acid you should probably have less in the future, and the discovery (1950) that smoking causes lung cancer was based on an observation of correlations. Negative results are also helpful: if we don't find a correlation between hair color and musical ability then it's unlikely that one causes the other. Even in cases where correlational studies only provide weak evidence, however, they're so much easier than randomized controlled trials that we still should do them if only to find problems to look into more deeply with a more reliable method. But if you see a news report that comes down to "we observed people with bad outcome X had feature Y in common," it's probably not worth trying to avoid Y.
I also posted this on my blog.
This is a good thing to read: http://intersci.ss.uci.edu/wiki/pdf/Pearl/22_Greenland.pdf (chapter 22 in Judea Pearl's Festschrift). In particular the contrast between Fig. 1 and Fig. 2 is relevant.
What is going on here is that what we care about is some causal parameter, for instance the "average causal effect (ACE): E[Y | do(A=1)] - E[Y | do(A=0)]."
This parameter is sometimes identified, and sometimes not identified.
If it is NOT identified, it is simply not a function of the observed data. So any sort of number you get by massaging the observed data will not, in general, equal the ACE. Naturally, if we try to randomize (which will get us the ACE directly) we will not reproduce what our observational data massage got us.
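A small sketch of what "not a function of the observed data" means. The two models below are hypothetical and the numbers are invented. They produce exactly the same joint distribution over (A, Y), yet one has an ACE of 0 and the other an ACE of 0.3. No computation on p(a, y) alone can tell them apart.

```python
from itertools import product


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


# Model 1 (hypothetical): hidden U drives both A and Y; A does nothing to Y.
P_U = 0.5
P_A_GIVEN_U = {0: 0.2, 1: 0.8}
P_Y_GIVEN_U = {0: 0.2, 1: 0.7}


def joint_confounded(a, y):
    """Observed p(a, y) in model 1, with U summed out."""
    return sum(bern(P_U, u) * bern(P_A_GIVEN_U[u], a) * bern(P_Y_GIVEN_U[u], y)
               for u in (0, 1))


ace_confounded = 0.0  # Y ignores A here, so do(A=1) and do(A=0) agree

# Model 2 (hypothetical): no hidden cause; A acts on Y directly.
P_A = 0.5
P_Y_GIVEN_A = {0: 0.3, 1: 0.6}


def joint_direct(a, y):
    """Observed p(a, y) in model 2."""
    return bern(P_A, a) * bern(P_Y_GIVEN_A[a], y)


ace_direct = P_Y_GIVEN_A[1] - P_Y_GIVEN_A[0]

# Largest disagreement between the two observed joints over all four cells.
max_gap = max(abs(joint_confounded(a, y) - joint_direct(a, y))
              for a, y in product((0, 1), repeat=2))
print(max_gap, ace_confounded, ace_direct)
```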
If it IS identified, then the question is which functional of the observed data equals the ACE. Maybe if we have treatment A, outcome Y, and a set of baseline confounders C, the correct functional is:
\sum_{c} ( E[Y | A=1,c] - E[Y | A=0,c] ) p(c)
This is what "adjusting for confounders" means.
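As a sketch, here is the adjustment functional evaluated on a made-up model where C -> A, C -> Y, and A -> Y, with a true ACE of 0.2 by construction. The helper names (`prob`, `e_y`) and every number are assumptions for illustration. Everything after the `joint` table uses only the observed joint p(c, a, y).

```python
from itertools import product

# Hypothetical model: C confounds A and Y; A raises P(Y=1) by 0.2.
P_C = {0: 0.6, 1: 0.4}
P_A1_GIVEN_C = {0: 0.2, 1: 0.7}


def p_y1(a, c):
    """P(Y=1 | A=a, C=c) in the made-up model."""
    return 0.1 + 0.2 * a + 0.4 * c


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


# Observed joint over (c, a, y).
joint = {
    (c, a, y): P_C[c] * bern(P_A1_GIVEN_C[c], a) * bern(p_y1(a, c), y)
    for c, a, y in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Probability of the event that fixes some of c, a, y."""
    names = ("c", "a", "y")
    return sum(p for key, p in joint.items()
               if all(key[names.index(k)] == v for k, v in fixed.items()))


def e_y(**given):
    """E[Y | given], computed from the observed joint."""
    return prob(y=1, **given) / prob(**given)


# Unadjusted contrast: E[Y | A=1] - E[Y | A=0].
naive = e_y(a=1) - e_y(a=0)
# Adjusted: sum_c (E[Y | A=1,c] - E[Y | A=0,c]) p(c).
adjusted = sum((e_y(a=1, c=c) - e_y(a=0, c=c)) * prob(c=c) for c in (0, 1))
print(naive, adjusted)
```

Here the unadjusted contrast is twice the true effect, while the adjusted one recovers it. That only works because C really is the full set of confounders in this toy graph.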
However, maybe that's not the right functional at all! Maybe you have a mediating variable M between A and Y, and the right functional is:
\sum_{m} \sum_{a'} E[Y | m,a'] p(a') p(m | A=1) - \sum_{m} \sum_{a'} E[Y | m,a'] p(a') p(m | A=0)
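A sketch of this mediator functional on a hypothetical model where U -> A, U -> Y, A -> M, M -> Y, and U is never observed. The graph, the numbers, and the names (`front_door`, `prob`) are my assumptions for illustration. The functional is computed from the observed joint over (a, m, y) only, then compared with the truth, which we know here because we wrote the model.

```python
from itertools import product


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


# Hypothetical model: U -> A, U -> Y, A -> M, M -> Y, with U unobserved.
P_U = 0.5
P_A_GIVEN_U = {0: 0.2, 1: 0.8}
P_M_GIVEN_A = {0: 0.1, 1: 0.9}


def p_y1(m, u):
    """P(Y=1 | M=m, U=u); A affects Y only through M."""
    return 0.1 + 0.5 * m + 0.3 * u


# Observed joint over (a, m, y): U is summed out, as it would be in real data.
observed = {
    (a, m, y): sum(bern(P_U, u) * bern(P_A_GIVEN_U[u], a)
                   * bern(P_M_GIVEN_A[a], m) * bern(p_y1(m, u), y)
                   for u in (0, 1))
    for a, m, y in product((0, 1), repeat=3)
}


def prob(**fixed):
    """Probability of the event that fixes some of a, m, y."""
    names = ("a", "m", "y")
    return sum(p for key, p in observed.items()
               if all(key[names.index(k)] == v for k, v in fixed.items()))


def front_door(a):
    """sum_m sum_a' E[Y | m,a'] p(a') p(m | A=a), from observed data only."""
    total = 0.0
    for m, a2 in product((0, 1), repeat=2):
        e_y = prob(y=1, m=m, a=a2) / prob(m=m, a=a2)
        total += e_y * prob(a=a2) * prob(m=m, a=a) / prob(a=a)
    return total


front_door_ace = front_door(1) - front_door(0)
naive = prob(y=1, a=1) / prob(a=1) - prob(y=1, a=0) / prob(a=0)
# Ground truth, available only because we wrote the model ourselves.
true_ace = sum(
    bern(P_U, u) * (bern(P_M_GIVEN_A[1], m) - bern(P_M_GIVEN_A[0], m)) * p_y1(m, u)
    for u, m in product((0, 1), repeat=2)
)
print(naive, front_door_ace, true_ace)
```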
How do we tell which functional is right? We have to agree on what the right causal graph is for our problem, and then consult an algorithm that will either give us the right functional for the ACE given the graph, or tell us the ACE is not identifiable given the graph we got. This algorithm was part of what my thesis was about.
There is one important historical example of people ignoring graph structure to their peril. In epidemiology people worry about something called the "healthy worker survivor effect." Say we have workers who work with asbestos, which is a bad chemical. We want to get an idea of how bad it is by running a study. The longer you work with asbestos, the worse your outcome. However, if you are sick, you will probably terminate employment early, which means you will not get more exposure to asbestos. So people who get more asbestos are also healthier. So it might seem based on observational data that even though we suspect asbestos is very bad for you, it seems to have a protective effect on workers. This is the "healthy worker survivor effect."
If we were to draw a simple graph with two time slices for this, we would get:
A1 -> H -> A2 -> D
where A1 and A2 are asbestos exposure, H is health status after A1, and D is death (or not). H and D are confounded by a common cause we do not see: H <- U -> D. A1 determines H. If H is bad enough, it will cause the worker to leave, and thus set A2 to 0. A1 and A2 determine D.
What we want here is E[D | do(a1,a2)]. The point is that blindly adjusting for H is incorrect, because of the particular graph structure where H arises. H is a standard confounder for A2, but is NOT a standard confounder for A1 (H is what is called a "time-varying confounder.") So you need to use a particular form of adjustment called "g-computation":
\sum_{h} E[D | a1,a2,h] p(h | a1)
If you use the standard adjustment
\sum_{h} E[D | a1,a2,h] p(h)
you will get a biased answer. Jamie Robins wrote a giant 120 page paper in 1986 (that no one ever reads) on (among many many other things) this precise issue:
http://www.hsph.harvard.edu/james-robins/files/2013/03/new-approach.pdf
(edit: the reason you get bias with standard adjustment is that A1 -> H <- U is in your graph. If you condition on H, A1 and U become dependent: this is the so-called "Berkson's bias," also known as selection bias, collider stratification bias, or the explaining away phenomenon. So standard adjustment creates a non-causal path A1 -> H <- U -> D between a treatment and the outcome which accounts for part of the magnitude of the effect, and thus creates bias.)
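Here is a sketch comparing the two formulas on a made-up version of the asbestos graph. All numbers and function names are assumptions for illustration, and H = 1 means "sick after A1". The model has A1 -> H <- U, H -> A2, and D depending on A1, A2, and U. Both adjustments are computed from the observed joint over (a1, h, a2, d), with U summed out.

```python
from itertools import product


def bern(p, x):
    """Probability that a Bernoulli(p) variable equals x."""
    return p if x == 1 else 1 - p


# Hypothetical model. U is unobserved frailty; H = 1 means "sick after A1".
def p_h1(a1, u):
    """P(H=1 | a1, u): asbestos and frailty both make sickness more likely."""
    return 0.1 + 0.3 * a1 + 0.4 * u


P_A2_GIVEN_H = {0: 0.9, 1: 0.1}   # sick workers mostly leave, so A2 = 0


def p_d1(a1, a2, u):
    """P(D=1 | a1, a2, u): both exposures and frailty raise the death risk."""
    return 0.1 + 0.15 * a1 + 0.15 * a2 + 0.4 * u


# Observed joint over (a1, h, a2, d); P(A1=1) = P(U=1) = 0.5, U summed out.
observed = {
    (a1, h, a2, d): sum(0.5 * 0.5 * bern(p_h1(a1, u), h)
                        * bern(P_A2_GIVEN_H[h], a2) * bern(p_d1(a1, a2, u), d)
                        for u in (0, 1))
    for a1, h, a2, d in product((0, 1), repeat=4)
}


def prob(**fixed):
    """Probability of the event that fixes some of a1, h, a2, d."""
    names = ("a1", "h", "a2", "d")
    return sum(p for key, p in observed.items()
               if all(key[names.index(k)] == v for k, v in fixed.items()))


def e_d(a1, a2, h):
    """E[D | a1, a2, h] from the observed joint."""
    return prob(d=1, a1=a1, a2=a2, h=h) / prob(a1=a1, a2=a2, h=h)


def g_formula(a1, a2):
    """sum_h E[D | a1,a2,h] p(h | a1)."""
    return sum(e_d(a1, a2, h) * prob(h=h, a1=a1) / prob(a1=a1) for h in (0, 1))


def standard_adjustment(a1, a2):
    """sum_h E[D | a1,a2,h] p(h)."""
    return sum(e_d(a1, a2, h) * prob(h=h) for h in (0, 1))


def naive(a1, a2):
    """E[D | a1, a2] with no adjustment at all."""
    return prob(d=1, a1=a1, a2=a2) / prob(a1=a1, a2=a2)


# True effect of "always exposed" vs "never exposed", averaging over U.
true_contrast = ((p_d1(1, 1, 0) + p_d1(1, 1, 1)) / 2
                 - (p_d1(0, 0, 0) + p_d1(0, 0, 1)) / 2)
g_contrast = g_formula(1, 1) - g_formula(0, 0)
standard_contrast = standard_adjustment(1, 1) - standard_adjustment(0, 0)
naive_contrast = naive(1, 1) - naive(0, 0)
print(true_contrast, g_contrast, standard_contrast, naive_contrast)
```

In this toy model g-computation recovers the true contrast of 0.3, standard adjustment understates it, and the unadjusted comparison understates it further, in the direction the healthy worker survivor effect predicts.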
What happens in practice is if you try to get the ACE from observed data, you will have too much confounding to get identification by any method (adjustment or anything else, really). So you need some sort of extra "trick." Maybe you can find a good instrumental variable. Or maybe you have a natural experiment. Or maybe you had really good data collection that really observed most important confounders. Or maybe the treatment variable only has observed parents (this happens in observational longitudinal studies sometimes). If you just blindly use covariate adjustment without thinking about your causal structure you will generally get garbage.