In this post, I will discuss something elementary yet important: causation versus association.
Although it is well-worn territory, the topic of causation still causes enormous confusion. The media confuse correlation and causation all the time. In fact, it is common to see a reporter discuss a study, warn the listener that the result is only an association and has not been proved to be causal, and then go on to discuss the finding as if it is causal. It usually goes something like this:
“A study reports that those who sleep less are more likely to have health problems.”
So far so good.
“Researchers emphasize that they have not established a causal connection.”
Even better, no claim of causation. But then:
“So make sure you get sufficient sleep to avoid these nasty health problems.”
Ouch, there it is. The leap from association to causation.
What’s worse is that even people trained to know better, namely statisticians and ML people, make the same mistake all the time. They will teach the difference between the two in class and then, a minute after leaving class, fall back into the same fog of confusion as the hypothetical reporter above. (I am guilty of this too.) This just shows how hard-wired our brains are for making the causal leap.
There are (at least) two formal ways to discuss causation rigorously: one is based on counterfactuals and the other is based on causal directed acyclic graphs (DAGs). They are essentially equivalent. Some things are more easily discussed in one language than the other. I will use the language of DAGs here.
Consider a putative cause $X$ and a response $Y$. Let $Z$ represent all variables that could affect $X$ or $Y$. To be concrete, let’s say $X$ is stress and $Y$ is getting a cold. The variable $Z$ is a very high-dimensional vector including genetic variables, environmental variables, etc. The elements of $Z$ are called confounding variables.
The causal DAG looks like this: $Z \rightarrow X$, $Z \rightarrow Y$ and $X \rightarrow Y$ (the confounders affect both the cause and the response, and the cause affects the response).
Suppose we only observe $X$ and $Y$ on a large number of people. $Z$ is unobserved.
The DAG has several implications. First, the distribution factors as
$$p(y,x,z) = p(y\mid x,z)\,p(x\mid z)\,p(z).$$
Well, that’s a pretty vacuous statement but let’s keep going. The association between $Y$ and $X$ is described by the conditional distribution $p(y\mid x)$. This distribution can be consistently estimated from the observed data. No doubt we will see an association between $Y$ and $X$ (that is, $p(y\mid x)$ will indeed be a function of $x$). As usual, this association, by itself, tells us nothing about causation.
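To make the factorization concrete, here is a minimal numerical sketch. The numbers are my own toy example (not from any study): everything is binary, $Z$ affects both $X$ and $Y$, and, for simplicity, $X$ has no direct effect on $Y$. The association $p(y\mid x)$ can then be computed exactly by summing the joint over $z$:

```python
# Toy binary DAG (illustrative numbers, my own assumption):
# Z -> X, Z -> Y, and X has no direct effect on Y.
p_z = {0: 0.5, 1: 0.5}            # p(z)
p_x1_given_z = {0: 0.2, 1: 0.8}   # p(X=1 | z)
p_y1_given_z = {0: 0.1, 1: 0.6}   # p(Y=1 | z); note: no dependence on x

def p_joint(y, x, z):
    """Factorization p(y,x,z) = p(y|x,z) p(x|z) p(z)."""
    px = p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    py = p_y1_given_z[z] if y == 1 else 1 - p_y1_given_z[z]
    return py * px * p_z[z]

def p_y1_given_x(x):
    """Association p(Y=1 | X=x), obtained by conditioning."""
    num = sum(p_joint(1, x, z) for z in (0, 1))
    den = sum(p_joint(y, x, z) for y in (0, 1) for z in (0, 1))
    return num / den

print(p_y1_given_x(1))  # ~0.5: stressed people get more colds...
print(p_y1_given_x(0))  # ~0.2: ...purely through the confounder Z
```

Even though $X$ has no effect on $Y$ here, conditioning on $X$ changes the distribution of $Y$, because $X$ carries information about $Z$.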
The causal distribution is the distribution we get, not by conditioning, but by intervening and changing the graph. Specifically, we break the arrow $Z \rightarrow X$ and we fix $X$ at a value $x$. The new graph has arrows $Z \rightarrow Y$ and $X \rightarrow Y$ only; the arrow into $X$ is gone.
The joint distribution for this graph is
$$p_*(y,z) = p(y\mid x,z)\,p(z)$$
where $p(x\mid z)$ has been replaced with a point mass distribution at $x$. The causal distribution is the marginal distribution of $Y$ in the new graph, which is,
$$p(y\mid \text{set } X=x) = \int p(y\mid x,z)\,p(z)\,dz.$$
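Continuing the same toy binary DAG (my own illustrative numbers, with $X$ having no direct effect on $Y$), here is a sketch of the difference: conditioning reweights $z$ by $p(z\mid x)$, while intervening keeps the original $p(z)$.

```python
# Same toy binary DAG: Z -> X, Z -> Y, X has no direct effect on Y.
p_z = {0: 0.5, 1: 0.5}             # p(z)
p_x1_given_z = {0: 0.2, 1: 0.8}    # p(X=1 | z)
p_y1_given_xz = {0: 0.1, 1: 0.6}   # p(Y=1 | x, z): depends on z only

def p_y1_given_x(x):
    """Association p(Y=1 | X=x): condition, i.e. reweight z by p(z|x)."""
    px = lambda z: p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]
    num = sum(p_y1_given_xz[z] * px(z) * p_z[z] for z in (0, 1))
    den = sum(px(z) * p_z[z] for z in (0, 1))
    return num / den

def p_y1_set_x(x):
    """Causation p(Y=1 | set X=x) = sum_z p(Y=1 | x, z) p(z)."""
    return sum(p_y1_given_xz[z] * p_z[z] for z in (0, 1))

print(p_y1_given_x(1), p_y1_given_x(0))  # association: ~0.5 vs ~0.2
print(p_y1_set_x(1), p_y1_set_x(0))      # causation: ~0.35 vs ~0.35
```

The causal distribution is the same for both values of $x$, correctly reporting that setting $X$ does nothing to $Y$ in this graph, while the conditional distribution shows a strong (purely confounded) association.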
Using the language of Spirtes, Glymour and Scheines, and of Pearl, we can summarize this as:
$$\text{Association} = p(y\mid x), \qquad \text{Causation} = p(y\mid \text{set } X=x).$$
We immediately deduce the following:
1. They are different. This is just the formalization of the fact that causation is not association.
2. If there is no arrow from $X$ to $Y$ in the original graph, we will find that $p(y\mid x)$ depends on $x$ but that $p(y\mid \text{set } X=x)$ does not depend on $x$. This is the common case where there is no causal relationship between $X$ and $Y$ and yet we see a predictive relationship between $X$ and $Y$. This is grist for many bogus stories on CNN and in the NY Times.
3. The causal distribution is not estimable. It depends on $p(y\mid x,z)$ and $p(z)$, but $Z$ is not observed.
4. The reason why epidemiologists collect data on lots of other variables is that they are trying to measure $Z$ or, at least, measure some elements of $Z$. Then they can estimate
$$\int p(y\mid x,z)\,p(z)\,dz.$$
This is called adjusting for confounders. Of course, they will never measure all of $Z$, which is why observational studies, though useful, must be taken with a grain of salt.
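Here is a sketch of what adjustment looks like in practice, using simulated data from the toy binary DAG above (my own assumed numbers, with $X$ having no effect on $Y$, so the causal quantity is 0.35). Pretend $Z$ is measured and plug empirical frequencies into $\sum_z \hat p(y\mid x,z)\,\hat p(z)$:

```python
import random

random.seed(0)
n = 200_000
data = []
for _ in range(n):
    z = random.random() < 0.5                  # p(Z=1) = 0.5
    x = random.random() < (0.8 if z else 0.2)  # p(X=1 | z): confounded
    y = random.random() < (0.6 if z else 0.1)  # p(Y=1 | z): X has no effect
    data.append((y, x, z))

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

# Naive estimate p_hat(Y=1 | X=1): confounded.
naive = mean(y for y, x, z in data if x)

# Adjusted estimate: sum_z p_hat(Y=1 | X=1, z) p_hat(z).
adjusted = sum(
    mean(y for y, x, zz in data if x and zz == z)
    * mean(zz == z for _, _, zz in data)
    for z in (False, True)
)

print(naive)     # ~0.5: biased for the causal quantity
print(adjusted)  # ~0.35: close to p(Y=1 | set X=1)
```

The naive frequency mixes in the effect of $Z$; the adjusted estimate recovers the causal quantity because, in this simulation, $Z$ is the whole of the confounding. In real observational studies we never have that guarantee.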
5. In a randomized study, where we assign the value of $X$ to subjects randomly, we break the arrow from $Z$ to $X$. In this case, it is easy to check that
$$p(y\mid x) = p(y\mid \text{set } X=x).$$
In other words, we force association to equal causation. That’s why we spend millions of dollars doing randomized studies and it’s why randomized studies are the gold standard.
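A quick sketch of why randomization works, again with my own toy numbers: assigning $X$ by coin flip deletes the $Z \rightarrow X$ arrow, so the plain observed frequencies now estimate the causal quantity (0.35 for both values of $x$ here, i.e. no effect).

```python
import random

random.seed(1)
n = 200_000
y_by_x = {0: [], 1: []}
for _ in range(n):
    z = random.random() < 0.5                  # confounder, p(Z=1) = 0.5
    x = random.random() < 0.5                  # randomized! no arrow Z -> X
    y = random.random() < (0.6 if z else 0.1)  # Y depends on Z only
    y_by_x[int(x)].append(y)

for x in (0, 1):
    est = sum(y_by_x[x]) / len(y_by_x[x])
    print(x, est)  # both ~0.35: association now equals causation
```

Compare with the observational version of the same DAG, where $p(\text{X}=1\mid z)$ depended on $z$ and the naive frequencies were badly confounded.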
This raises an interesting question: have there been randomized studies to follow up results from observational studies? In fact, there have. In a recent article, Young and Karr (2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked: of the 52 claims, how many were verified by the randomized studies? The answer is depressing: zero.
That’s right, 0 out of 52 effects turned out to be real.
We should be careful not to over-generalize here because these studies are certainly not a representative sample of all studies. Still, 0/52 is sobering.
In a future post, I will discuss Simpson’s paradox. Here is a preview. Suppose $Z$ is observed. If there is an arrow from $Z$ to $X$, then $p(y\mid x) \neq \int p(y\mid x,z)\,p(z)\,dz$; but when there is no arrow from $Z$ to $X$, then $p(y\mid x) = \int p(y\mid x,z)\,p(z)\,dz$. Nothing complicated. But when we change the math into words, people, including most statisticians and computer scientists, get very confused. I’ll save the details for a future post.