Google NGram provides a quick way to track word usage in books over the decades. Figure fig-ngram shows the NGram for three statistical words: coefficient, correlation, and regression.
The use of “correlation” started in the mid to late 1800s, reached an early peak in the 1930s, then peaked again around 1980. “Correlation” is tracked closely by “coefficient.” This parallel track might seem evident to historians of statistics; the quantitative measure called the “correlation coefficient” was introduced by Francis Galton in 1888 and quickly became a staple of statistics textbooks.
In contrast to mainstream statistics textbooks, “correlation” barely appears in these lessons (until this chapter). There is a good reason for this. Although the correlation coefficient measures the “strength” of the relationship between two variables, it is a special case of a more general and powerful method that appears throughout these Lessons: regression modeling.
Figure fig-ngram shows that “regression” got a later start than correlation. That is likely because it took 30-40 years before it was appreciated that correlation could be generalized. Furthermore, regression is more mathematically complicated than correlation, so practical use of regression relied on computing, and computers started to become available only around 1950.
Correlation
A dictionary is a starting point for understanding the use of a word. Here are four definitions of “correlation” from general-purpose dictionaries.
“A relation existing between phenomena or things or between mathematical or statistical variables which tend to vary, be associated, or occur together in a way not expected on the basis of chance alone” Source: Merriam-Webster Dictionary
“A connection between two things in which one thing changes as the other does” Source: Oxford Learner’s Dictionary
“A connection or relationship between two or more things that is not caused by chance. A positive correlation means that two things are likely to exist together; a negative correlation means that they are not.” Source: Macmillan dictionary
“A mutual relationship or connection between two or more things,” “interdependence of variable quantities.” Source: [Oxford Languages]
All four definitions use “connection” or “relation/relationship.” That is at the core of “correlation.” Indeed, “relation” is part of the word “correlation.” One of the definitions uses “causes” explicitly, and the everyday meaning of “connection” and “relation” tend to point in this direction. The phrase “one thing changes as the other does” is close to the idea of causality, as is “interdependence.:
Three of the definitions use the words “vary,” “variable,” or “changes.” The emphasis on variation also appears directly in a close statistical synonym for correlation: “covariance.”
Two of the definitions refer to “chance,” that correlation “is not caused by chance,” or “not expected on the basis of chance alone.” These phrases suggest to a general reader that correlation, since not based on chance, must be a matter of fate: pre-determination and the action of causal mechanisms.
We can put the above definitions in the context of four major themes of these Lessons:
- Quantitative description of relationships
- Variation
- Sampling variation
- Causality
Correlation is about relationships; the “correlation coefficient” is a way to describe a straight-line relationship quantitatively. The correlation coefficient addresses the tandem variation of quantities, or, more simply stated, how “one thing changes as the other does.”
To a statistical thinker, the concern about “chance” in the definitions is not about fate but reliability. Sampling variation can lead to the appearance of a pattern in some samples of a process that is not seen in other samples of that same process. Reliability means that the pattern will appear in a large majority of samples.
The unlikeliness of the correlations on the website is another clue to their origin as methodological. Nobody woke up one morning with the hypothesis that cheese consumption and bedsheet mortality are related. Instead, the correlation is the product of a search among many miscellaneous records. Imagine that data were available on 10,000 annually tabulated variables for the last decade. These 10,000 variables create the opportunity for 50 million pairs of variables. Even if none of these 50 million pairs have a genuine relationship, sampling variation will lead to some of them having a strong correlation coefficient.
In statistics, such a blind search is called the “multiple comparisons problem.” Ways to address the problem have been available since the 1950s. (We will return to this topic under the label “false discovery” in Lesson sec-NHT.) Multiple comparisons can be used as a trick, as with the website. However, multiple comparisons also arise naturally in some fields. For example, in molecular genetics, “micro-arrays” make a hundred thousand simultaneous measurements of gene expression. Correlations in the expression of two genes give a clue to cellular function and disease. With so many pairs available, multiple comparisons will be an issue.
Some of the spurious correlations presented on the eponymous website can be attributed to methodological error: using inapproriate statistical methods.
The methods we describe in this Lesson to summarize the contents of a data frame have a property that is perhaps surprising. The summaries do not change even if you re-order the rows in the data frame, say, reversing them top to bottom or even placing intact rows in a random order. Or, seen in another way, the summaries are based on the assumption that each specimen in a data frame was collected independently of all the other specimens.
There is a common situation where this assumption does not hold true. This is when the different specimens are measurements of the same thing spread out over time, for instance, a day-to-day record of temperature or a stock-market index, or an economic statistic such as the unemployment rate. Such a data frame is called a “time series.”
The realization that time series require special statistical techniques came early in the history of statistics. The paper, “On the influence of the time factor on the correlation between the barometric heights at stations more than 1000 miles apart,” by F.E. Cave-Browne-Cave, was published in 1904 in the Proceedings of the Royal Society. Perhaps one reason for the use of initials by the author relates to an important social problem: the failure to recognize properly the contributions of women to science. “Miss Cave,” as she was referred to in 1917 and 1921, respectively by eminent statisticians William Sealy Gosset (who published under the name “Student”) and George Udny Yule, also offered a solution to the problem. Her solution is a historical precursor of “time-series analysis,” a contemporary specialized area of statistics.