Suppose that two variables show a correlation and you have concluded that the data are numerous enough and the correlation strong enough that the correlation is not an artifact of sampling variation. In short, you see \(r_{xy}\) with an associated \(p < 0.00\cdots001\).
It’s commonplace in introductory statistics to emphasize that the value and sign of \(r_{xy}\) do not entitle you to draw any conclusion about a potential causal relationship between x and y. This is wrong. In fact, you are entitled to conclude that one of the following situations holds:
Knowing just \(r_{xy}\) does not tell you which of the possibilities is behind the correlation, but one of them must be.
We often know something about the world, in particular what causes what.
Example from Judea Pearl: The rooster’s crow is associated with the sunrise. Your choice: which causes which?
This knowledge can help us weed out some possible configurations.
Perhaps “know” should be replaced with “hypothesize,” and different people can have different hypotheses about relationships. Then there is more work to do.
Crazy example: Dr. A believes that chemotherapy agents become more effective if they have been scattered on Mars. Will NIH fund a collaboration with NASA?
We can intervene. Ideally, we take over variable X, destroying all other causal inputs to it, as in a randomized controlled trial.
This changes the topology of the causal network.
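A small simulation can make the change in topology concrete. The sketch below is only an illustration under assumed conditions: the network (Z → X, Z → Y, X → Y), the coefficients, and the names `slope` and `simulate` are all hypothetical, chosen so that the true effect of X on Y is 1.0.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def simulate(n, randomize, seed):
    """Hypothetical network Z -> X, Z -> Y, X -> Y; true effect of X on Y is 1.0.

    With randomize=True the experimenter assigns X at random, which
    deletes the Z -> X arrow from the network.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        if randomize:
            x = rng.gauss(0, 1)            # X set by the experimenter alone
        else:
            x = 2.0 * z + rng.gauss(0, 1)  # X still listens to Z
        y = 1.0 * x + 3.0 * z + rng.gauss(0, 1)
        xs.append(x)
        ys.append(y)
    return slope(xs, ys)


observational = simulate(20000, randomize=False, seed=1)
experimental = simulate(20000, randomize=True, seed=2)
print(f"observational slope: {observational:.2f}")
print(f"experimental slope:  {experimental:.2f}")
```

With these assumed coefficients the observational slope is 11/5 = 2.2 in expectation, because the path through Z is mixed in. The randomized version should land near the true value of 1.0.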
We can hold variables constant, either through physical intervention or by using them as covariates in models.
Physical intervention is more plausible, but stratification is a feasible alternative.
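Holding a variable constant without touching the system can be sketched in two ways: look within strata, or adjust for the variable as a covariate. The example below assumes a hypothetical binary confounder Z and made-up coefficients (true effect of X on Y is 1.0). The covariate adjustment is done by residualizing X and Y on Z, which gives the same coefficient on X as a regression of Y on X and Z. The helper names are illustrative only.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(vs, zs):
    """What is left of vs after a least-squares fit on zs."""
    b = slope(zs, vs)
    mv = sum(vs) / len(vs)
    mz = sum(zs) / len(zs)
    return [v - mv - b * (z - mz) for v, z in zip(vs, zs)]


rng = random.Random(1)
rows = []
for _ in range(20000):
    z = rng.randint(0, 1)                    # hypothetical binary confounder
    x = 2.0 * z + rng.gauss(0, 1)
    y = 1.0 * x + 3.0 * z + rng.gauss(0, 1)  # true effect of X on Y is 1.0
    rows.append((z, x, y))

zs = [z for z, _, _ in rows]
xs = [x for _, x, _ in rows]
ys = [y for _, _, y in rows]

# Ignore Z entirely.
crude = slope(xs, ys)

# Stratify: fit the X-Y slope separately at each level of Z.
by_stratum = {
    level: slope([x for z, x, _ in rows if z == level],
                 [y for z, _, y in rows if z == level])
    for level in (0, 1)
}

# Covariate adjustment: remove Z from both X and Y, then fit.
adjusted = slope(residuals(xs, zs), residuals(ys, zs))

print(f"crude slope:    {crude:.2f}")
print(f"within Z = 0:   {by_stratum[0]:.2f}")
print(f"within Z = 1:   {by_stratum[1]:.2f}")
print(f"adjusted for Z: {adjusted:.2f}")
```

Under these assumptions the crude slope is 7/4 = 1.75 in expectation, while the within-stratum and adjusted slopes should both sit near 1.0.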
Consider these several simple pathways:
“Correct” here means an unbiased presentation of the influence of X on Y.
Simple network | Other factors |
---|---|
To study the possible relationship between age and difficulty we can intervene at support, but not so much at anxiety. Still, we can stratify by anxiety.
A simple and provable way to deal with possible non-compliance or pollution.
Deciding what variables to control for: Child seat prices
Build some hypothetical causal networks. What results do you get when excluding/including appropriate variables?
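One possible starting point for this exercise is sketched below. It assumes two hypothetical networks with made-up coefficients: a collider (X → C ← Y, with no effect of X on Y) and a chain (X → M → Y, with a total effect of 3.0). In both, including the extra variable gives the wrong answer for the total influence of X on Y. The function names are illustrative, and `adjusted_slope` uses residualization to obtain the coefficient on X from a two-predictor regression.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with an intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(vs, cs):
    """What is left of vs after a least-squares fit on cs."""
    b = slope(cs, vs)
    mv = sum(vs) / len(vs)
    mc = sum(cs) / len(cs)
    return [v - mv - b * (c - mc) for v, c in zip(vs, cs)]


def adjusted_slope(xs, ys, cs):
    """Coefficient on x when y is regressed on both x and c."""
    return slope(residuals(xs, cs), residuals(ys, cs))


rng = random.Random(3)
n = 20000

# Network 1 (collider): X -> C <- Y, and no arrow from X to Y.
x1 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [rng.gauss(0, 1) for _ in range(n)]
c1 = [x + y + rng.gauss(0, 1) for x, y in zip(x1, y1)]

# Network 2 (chain): X -> M -> Y, total effect 2.0 * 1.5 = 3.0.
x2 = [rng.gauss(0, 1) for _ in range(n)]
m2 = [2.0 * x + rng.gauss(0, 1) for x in x2]
y2 = [1.5 * m + rng.gauss(0, 1) for m in m2]

results = {
    "collider, C excluded": slope(x1, y1),
    "collider, C included": adjusted_slope(x1, y1, c1),
    "chain, M excluded": slope(x2, y2),
    "chain, M included": adjusted_slope(x2, y2, m2),
}
for label, value in results.items():
    print(f"{label}: {value:.2f}")
```

In expectation, the collider network gives 0 with C excluded and a spurious -0.5 with C included. The chain gives the total effect of 3.0 with M excluded and 0 with M included, since M carries all of the influence of X.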