# 31 Spurious correlation

Google NGram provides a quick way to track word usage in books over the decades. Figure 31.1 shows the NGram for three statistical words: coefficient, correlation, and regression.

The use of “correlation” started in the mid to late 1800s, reached an early peak in the 1930s, then peaked again around 1980. “Correlation” is tracked closely by “coefficient.” This parallel track might seem evident to historians of statistics; the quantitative measure called the “**correlation coefficient**” was introduced by Francis Galton in 1888 and quickly became a staple of statistics textbooks.

In contrast to mainstream statistics textbooks, “correlation” barely appears in these Lessons (until this chapter). There is a good reason for this. Although the correlation coefficient measures the “strength” of the relationship between two variables, it is a special case of a more general and powerful method that appears throughout these Lessons: regression modeling.

Figure 31.1 shows that “regression” got a later start than correlation. That is likely because it took 30-40 years before it was appreciated that correlation could be generalized. Furthermore, regression is more mathematically complicated than correlation, so practical use of regression relied on computing, and computers started to become available only around 1950.

## Correlation

A dictionary is a starting point for understanding the use of a word. Here are four definitions of “correlation” from general-purpose dictionaries.

“A relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone” Source: Merriam-Webster Dictionary

“A connection between two things in which one thing changes as the other does” Source: Oxford Learner’s Dictionary

“A connection or relationship between two or more things that is not caused by chance. A positive correlation means that two things are likely to exist together; a negative correlation means that they are not.” Source: Macmillan Dictionary

“A mutual relationship or connection between two or more things,” “interdependence of variable quantities.” Source: Oxford Languages

All four definitions use “connection” or “relation/relationship.” That is at the core of “correlation.” Indeed, “relation” is part of the word “correlation.” One of the definitions uses “caused” explicitly, and the everyday meanings of “connection” and “relation” tend to point in this direction. The phrase “one thing changes as the other does” is close to the idea of causality, as is “interdependence.”

Three of the definitions use the words “vary,” “variable,” or “changes.” The emphasis on variation also appears directly in a close statistical synonym for correlation: “covariance.”

Two of the definitions refer to “chance,” that correlation “is not caused by chance,” or “not expected on the basis of chance alone.” These phrases suggest to a general reader that correlation, since not based on chance, must be a matter of fate: pre-determination and the action of causal mechanisms.

We can put the above definitions in the context of four major themes of these Lessons:

- Quantitative description of relationships
- Variation
- Sampling variation
- Causality

Correlation is about relationships; the “correlation coefficient” is a way to describe a straight-line relationship quantitatively. The correlation coefficient addresses the tandem variation of quantities, or, more simply stated, how “one thing changes as the other does.”

To a statistical thinker, the concern about “chance” in the definitions is not about fate but reliability. Sampling variation can lead to the appearance of a pattern in some samples of a process that is not seen in other samples of that same process. Reliability means that the pattern will appear in a large majority of samples.
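The point about sampling variation can be illustrated with a small simulation. The sketch below, in base R, draws many samples from a process in which `x` and `y` are unrelated by construction and records the correlation coefficient in each sample. The sample sizes (10 and 1000), the number of repetitions, and the cutoff of 0.5 for a "seemingly strong" pattern are all arbitrary choices made for illustration.

```r
set.seed(103)  # arbitrary seed, for reproducibility only

# Correlation coefficient from one sample of size n in which
# x and y are generated independently (no relationship at all)
one_r <- function(n) cor(rnorm(n), rnorm(n))

small <- replicate(2000, one_r(10))    # many samples of size 10
large <- replicate(2000, one_r(1000))  # many samples of size 1000

# Fraction of samples showing a seemingly strong pattern, |r| > 0.5
frac_small <- mean(abs(small) > 0.5)
frac_large <- mean(abs(large) > 0.5)
c(small = frac_small, large = frac_large)
```

In the small samples, a noticeable fraction show an apparent pattern that the process itself does not contain; in the large samples, such patterns should essentially never appear. That is the sense in which a pattern seen in a small sample is unreliable.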

## Spurious causation

The “Spurious correlations” website http://www.tylervigen.com/spurious-correlations provides entertaining examples of correlations gone wrong. The running gag is that the two correlated variables have no reasonable association, yet the correlation coefficient is very close to its theoretical maximum of 1.0. Typically, one of the variables is morbid, as in Figure 31.2.

According to Aldrich (1995)^[John Aldrich (1995) “Correlations Genuine and Spurious in Pearson and Yule” *Statistical Science* 10(4)], the idea of **spurious correlations** appears first in an 1897 paper by statistical pioneer and philosopher of science Karl Pearson. The correlation coefficient method was published only in 1888, and, understandably, early users encountered pitfalls. One very early user, W.F.R. Weldon, published a study in 1892 on the correlations between the sizes of organs, such as the tergum and telson in shrimp. (See Figure 31.3.)

Pearson noticed a distinctive feature of Weldon’s method. Weldon measured the tergum and telson as a fraction of the overall body length.

Figure 31.4 shows one possible DAG interpretation where `telson` and `tergum` are *not* connected by any causal path. Similarly, `length` is exogenous with no causal path between it and either `telson` or `tergum`.

```
shrimp_dag <- dag_make(
  tergum ~ unif(min=2, max=3),
  telson ~ unif(min=4, max=5),
  length ~ unif(min=40, max=80),
  x ~ tergum/length + exo(.01),
  y ~ telson/length + exo(.01)
)
# dag_draw(shrimp_dag, seed=101, vertex.label.cex=1)
knitr::include_graphics("www/telson-tergum.png")
```

Figure 31.4 shows a hypothesis where there is no causal relationship between telson and tergum. Pearson wondered whether dividing those quantities by `length`, to produce variables `x` and `y`, might induce a correlation. Weldon had found a correlation coefficient between `x` and `y` of about 0.6. Pearson estimated that dividing by `length` would induce a correlation between `x` and `y` of about 0.4-0.5, even if telson and tergum are not causally connected.

We can confirm Pearson’s estimate by sampling from the DAG and modeling `y` by `x`. The confidence interval on `x` shows a relationship between `x` and `y`. In 1892, before the invention of regression, the correlation coefficient would have been used. In retrospect, we know the correlation coefficient is a simple scaling of the `x` coefficient.

```
Sample <- sample(shrimp_dag, size=1000)
lm(y ~ x, data=Sample) %>% conf_interval()
```

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 0.0457665 | 0.0490190 | 0.0522715 |
| x | 0.6147549 | 0.6856831 | 0.7566114 |

`cor(y ~ x, data=Sample)`

`[1] 0.514812`
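The same check can be run without the DAG helpers. The sketch below uses only base R to simulate the structure of Figure 31.4 directly. It assumes that `unif()` corresponds to `runif()` and that `exo(.01)` adds normal noise with standard deviation 0.01; both are assumptions about the helper functions, not something stated in the text. The seed and the name `Sim` are arbitrary. The last step illustrates the "simple scaling": the correlation coefficient equals the `x` coefficient multiplied by `sd(x)/sd(y)`.

```r
set.seed(101)  # arbitrary seed, for reproducibility only
n <- 1000

# Three exogenous quantities with no causal links among them
tergum <- runif(n, min = 2,  max = 3)
telson <- runif(n, min = 4,  max = 5)
length <- runif(n, min = 40, max = 80)

# Ratios to body length; the noise sd of 0.01 is an assumption
Sim <- data.frame(
  x = tergum / length + rnorm(n, sd = 0.01),
  y = telson / length + rnorm(n, sd = 0.01),
  length = length
)

slope <- unname(coef(lm(y ~ x, data = Sim))["x"])
r     <- cor(Sim$x, Sim$y)

# Rescale the regression slope by the two standard deviations
r_from_slope <- slope * sd(Sim$x) / sd(Sim$y)

c(slope = slope, r = r, r_from_slope = r_from_slope)
```

Under these assumptions, `r` and `r_from_slope` agree up to floating-point error, and both should land in the neighborhood of Pearson's estimate even though `tergum` and `telson` were generated independently.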

Pearson’s 1897 work precedes the earliest conception of DAGs by three decades. An entire century would pass before DAGs came into widespread use. However, with the DAG of Figure 31.4 in front of us, we can see that `length` is a common cause of `x` and `y`.

Within 20 years of Pearson’s publication, a mathematical technique called “**partial correlation**” was in use that could deal with this particular problem of spurious correlation. The key is that the model should include `length` as a covariate. The covariate correctly blocks the path from `x` to `y` via `length`.

`lm(y ~ x + length, data=Sample) %>% conf_interval()`

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 0.1507687 | 0.1571398 | 0.1635108 |
| x | -0.0362598 | 0.0235473 | 0.0833543 |
| length | -0.0013975 | -0.0013241 | -0.0012508 |

The confidence interval on the `x` coefficient includes zero once `length` is included in the model. So the data, properly analyzed, show no correlation between telson and tergum.
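The partial-correlation idea can be sketched in base R as the correlation between what is left of `x` and of `y` after each has been modeled by `length`. The simulation repeats the structure of Figure 31.4 under the assumption that the noise is normal with standard deviation 0.01. The residual-based calculation is one standard way to compute a partial correlation and is not necessarily the form used historically. The seed and the name `Sim` are arbitrary.

```r
set.seed(102)  # arbitrary seed, for reproducibility only
n <- 1000

tergum <- runif(n, min = 2,  max = 3)
telson <- runif(n, min = 4,  max = 5)
length <- runif(n, min = 40, max = 80)

Sim <- data.frame(
  x = tergum / length + rnorm(n, sd = 0.01),  # noise sd is an assumption
  y = telson / length + rnorm(n, sd = 0.01),
  length = length
)

# Ordinary correlation, ignoring the common cause
raw_r <- cor(Sim$x, Sim$y)

# Remove the part of each variable accounted for by length,
# then correlate the leftovers
x_resid <- resid(lm(x ~ length, data = Sim))
y_resid <- resid(lm(y ~ length, data = Sim))
partial_r <- cor(x_resid, y_resid)

c(raw = raw_r, partial = partial_r)
```

The partial correlation should come out far smaller than the raw one and close to zero. It need not be exactly zero: `x` and `y` depend on `1/length` rather than on `length` itself, so a straight-line adjustment removes most, but not all, of the induced correlation.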

In this case, “spurious correlation” stems from using an inappropriate method. This situation, identified 130 years ago and addressed a century ago, is still a problem for those who use the correlation coefficient. Although regression allows the incorporation of covariates, the correlation coefficient does not.

## “Correlation implies causation.”

Francis Galton’s 1890 example of the clerks on the bus introduces “correlation” as a causality story. The bus trip causes variation in commute times. Two clerks riding the same bus will have correlated commute times. In the dictionary definitions of “correlation” at the start of the Lesson, the words “connection,” “relationship,” and “interdependence” suggest causal connections.

Insofar as the dictionary definitions of correlation suggest a causal relationship, they are at odds with the statistical mainstream, which famously holds that “correlation does not imply causation.” This view is so entrenched that it appears on tee shirts, one style of which is available for sale by the American Statistical Association.

The statement “A is not B” can be valid only if we know what A and B are. We have a handle on the meaning of “correlation.” So what is the meaning of “causation”?

Dictionaries define “causation” using the word “cause.” So we look there for guidance.

A person or thing that gives rise to an action, phenomenon, or condition. Source: Oxford Languages

An event, thing, or person that makes something happen. Source: Macmillan Dictionary

A person or thing that acts, happens, or exists in such a way that some specific thing happens as a result; the producer of an effect. Source: Dictionary.com

Interpreting these definitions requires making sense of “give rise to,” “makes happen,” or “happens as a result.” All of them are synonyms for “cause.”

This circularity produces a muddle. Centuries of philosophical debate have yet to clarify things much.

Still, we can do something. The point of view of these Lessons is to support decision-making. Causation is a valuable concept for decision-making, particularly in cases where the decision-maker is considering an *intervention*. With this as an anchor, a pragmatic definition of “causation” is available:

Causation describes a class of hypotheses that DAGs can represent. In that representation, a causal relationship between two nodes X and Y is marked by a causal path connecting X to Y. In Lesson 30, we defined “causal path” in terms of the directions of arrows in a DAG.

A definitive demonstration of a causal relationship between X and Y is that intervening to change X results reliably in a change in Y, all other nodes not on the causal path being held constant. (Lesson 32 treats the methodology behind this definitive sign.)

Whether or not a definitive demonstration is feasible is not directly relevant to the decision-maker. A decision-maker acts under the guidance of one or more hypotheses. A good rule of thumb for decision-makers is to be guided only by plausible hypotheses. Whether a hypothesis is plausible is a matter of informed belief. A definitive demonstration should sharpen that belief. If no such definitive demonstration is available, the decision-maker must rely on alternative sources for belief. Austin Bradford Hill (1898-1991), an epidemiologist and eminent statistician, famously published a list of nine criteria that support belief in a causal hypothesis.

Using my definition of causation, and in marked disagreement with many statisticians, I submit that

Correlation implies causation.

“Correlation implies causation” is not the same as saying, “A correlation between A and B implies that A causes B.” That statement is false. For instance, it might be instead that B causes A. Alternatively, there might be a common cause C for both A and B. Or, C might be a collider between A and B.
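Two of these alternatives can be made concrete with a short simulation in base R. The variable names, effect sizes, and selection threshold below are hypothetical choices for illustration. In the first setup, C is a common cause of A and B. In the second, A2 and B2 are generated independently and both feed into a collider C2; there, a correlation between A2 and B2 shows up only when the sample is selected on the collider.

```r
set.seed(104)  # arbitrary seed, for reproducibility only
n <- 10000

# Common cause: C drives both A and B; neither A nor B affects the other
C <- rnorm(n)
A <- C + rnorm(n)
B <- C + rnorm(n)
common_cause_r <- cor(A, B)

# Collider: A2 and B2 are unrelated, and both drive C2
A2 <- rnorm(n)
B2 <- rnorm(n)
C2 <- A2 + B2 + rnorm(n)

collider_all_r <- cor(A2, B2)                   # whole sample
keep <- C2 > 1                                  # select on the collider
collider_selected_r <- cor(A2[keep], B2[keep])  # selected cases only

c(common_cause      = common_cause_r,
  collider_all      = collider_all_r,
  collider_selected = collider_selected_r)
```

In both setups a correlation between the two variables appears without either one causing the other, yet in both the correlation is produced by causal connections elsewhere in the system.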

Other than the sources of spurious correlation described previously, I am aware of no mechanism that produces correlation without involving causation in some way.