19  Variation

Statistical thinking

These Lessons will introduce you to several habits of mind that have, over the last century, been found useful when collecting and interpreting data. Whenever we encounter something new, questions or ideas for actions come to mind. For instance, a financially-minded person arranging for a loan will presumably ask about the interest rate. An economy-minded consumer, seeing a price for, say, olive oil, will know to check the volume that is being provided for that price.

A statistically-minded thinker knows “how and when we can draw valid inferences from data.” [Source] The word “valid” means several things at once: faithful to the data, consistent with the process used to assemble the data, and informative for the uses to which the inferences are to be directed. Part of statistical thinking is being aware of a variety of useful tools for looking at data and judging from the context of the task which tools are appropriate and which not.

Every person has a natural ability to think. We train our thinking skills by observing and emulating the logic and language of people and sources deemed authoritative. We have resources spanning several millennia to hone our ability to think. However, statistical thinking is a comparatively recent arrival on the intellectual scene, germinating and developing over only the last 150 years. As a result, hardly anything that we hear or read exemplifies statistical thinking.

In general, effective thinking requires us to grasp various intellectual tools, for example, logic. Our mode of logical thinking was promulgated by Aristotle (384–322 BC) and, to quote the Stanford Encyclopedia of Philosophy, “has had an unparalleled influence on the history of Western thought.” In the 2500 years since Aristotle’s time, the use of Aristotelian logic has been so pervasive that we expect any well-educated person to be able to identify logical thinking. For example, the statement “John’s car is red” has implications. Which of these two statements are among those implications? “That red car is necessarily John’s,” or “The blue car is not John’s car.” Not so hard!

The intellectual tools needed for statistical thinking are, by and large, unfamiliar and non-intuitive. These Lessons are intended to provide the tools you will need to engage in effective statistical thinking.

To get started, consider this headline from The Economist, a well-reputed international news magazine: “The pandemic’s indirect effects on small children could last a lifetime.” As support for this claim, the headlined article provides more detail. For instance:

“Stress and distraction made some parents more distant. LENA, a charity in Colorado, has for years used wearable microphones to keep track of how much chatter babies and their care-givers exchange. During the pandemic the number of such ‘conversations’ declined. … [G]etting lots of interaction in the early years of life is essential for healthy development, so these kinds of data ‘are a red flag’.” The article goes on to talk of “children starved of stimulation at home ….”

This short excerpt might raise some questions. Think about it briefly and note what questions come to mind.

For those already along the road toward statistical thinking, the phrase, “the number of such conversations declined” might prompt this question: “By how much?” Similarly, reading the claim that “getting lots of interactions … is essential for healthy development,” your mind might insist on these questions: How much is “lots?” How does the decline in the number compare to “lots?”

Not finding the answer to these questions in the article’s text, it would be sensible to look for the primary source of the information. In our Internet age, that’s comparatively easy to do. The LENA website includes an article, “COVID-era infants vocalize less and experience fewer conversational turns, says LENA research team.” The article contains several graphs, one of which is reproduced in Figure 19.1.

To make any proper sense of Figure 19.1, you need some basic technical knowledge. For example, what do the vertical bars in the graph mean as opposed to the dots? What is the meaning of “Percentile” and what does it signify? What is the purpose behind displaying \(n=494\) and \(n=136\) below the graph? What does the subcaption “t(628) = 3.03, p = 0.003” tell us, if anything? Turning back to the text of The Economist, how does this graph justify raising a “red flag?” More basically, are these graphs the “data,” or is there more data behind the graphs? What would that data show?

Figure 19.1: A statistical graphic from the LENA website captioned, “Children from the COVID-era sample produced significantly fewer vocalizations than their pre-COVID peers.”

The LENA article does not link to supporting data, that is, what lies behind the graphs in Figure 19.1. But the LENA article does point to other publications.

These findings from LENA support a growing body of evidence that babies born during the COVID pandemic are, on average, experiencing developmental delays. For example, researchers from the COMBO (COVID-19 Mother Baby Outcomes) consortium at Columbia University published findings in the January 2022 issue of JAMA Pediatrics showing that children born during the pandemic achieved significantly lower gross motor, fine motor, and personal-social scores at six months of age.

To the statistical thinker, phrases like “red flag,” “growing body of evidence,” and “significantly lower” are weasel words, that is, terms “used in order to evade or retreat from a direct or forthright statement or position.” [Source] In ordinary thinking, such evasiveness or lack of forthrightness would naturally prompt concern about the reliability of the claim. It makes sense to look deeper, for instance, by checking out the JAMA article. Many people would be hesitant to do this, anticipating that the article would be filled with jargon and incomprehensible. An important reason to study statistical thinking is to tear down barriers to substantiating or debunking claims. In fact, the JAMA article contains very little that requires knowledge of pediatrics or the meaning of “gross motor, fine motor, and personal-social scores,” but a lot that depends on understanding statistical notation and convention and the reasoning behind the conventions.

The tools of statistical thinking are the tools for making sense of data. Evaluating data is essential to determine whether to rely on claims supposedly based on those data. In the words of eminent engineer and statistician W. Edwards Deming: “In God we trust. All others must bring data.” Similarly, former President Ronald Reagan famously quoted a Russian proverb: “Trust, but verify.” Unfortunately, until you have the statistical thinking tools needed to interpret data reliably, all you can do is trust, not verify.

Defining statistical thinking

Learning a new way of thinking is genuinely hard. As you learn statistical thinking, it may help to have a concise definition. The following definition captures much of the essence of statistical thinking:

Statistical thinking is the accounting for variation in the context of what remains unaccounted for.

Implicit in this definition is a pathway for learning to think statistically:

  1. Learn how to measure variation;
  2. Learn how to account for variation;
  3. Learn how to measure what remains unaccounted for.

The next three sections briefly touch on each of these three topics.


“Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions.” Stephen Jay Gould (1941–2002), paleontologist and historian of science.

We will use a _vari_ety of words to express differences from case to case, such as the diverse durations of gestation. Variation is about how things vary. Variance has a non-technical meaning, as in a “zoning variance” which gives permission to depart from zoning rules. For us, variance will always be used in a technical sense: a number summarizing variation of the values in a variable. Whenever you see the stem “var”, you should be thinking of case-to-case differences.

To illustrate variation, let’s consider a process fundamental to human life: gestation. We all know that human pregnancy “typically” lasts around nine months but differs unpredictably from one birth to another.

Figure 19.2 shows data from the Gestation data frame. In this data frame, each of the 1200 rows is one pregnancy and birth about which several measurements were made. The gestation variable records the length of the pregnancy (in days).

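# Recode parity: 0 marks a first pregnancy, anything else a previous pregnancy
Gestation <- Gestation %>%
  mutate(parity = ifelse(parity == 0, "first-time", "previous-preg"))
# Jitter plot of gestation (in days) for each parity group
Plot1 <- Gestation %>%
  ggplot(aes(x=parity, y=gestation)) +
  geom_jitter(alpha=0.2, width=0.2, height=0)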

Figure 19.2: Gestational period for first-time mothers and mothers with a previous pregnancy.

Figure 19.2 divides the 1200 births in the Gestation data frame according to the variable parity, which describes whether or not the pregnancy is the mother’s first.

The variation in gestation is evident directly from the dots in the graph. One strategy for describing variation is to specify an interval: the span between a lower and an upper value. For instance,

  • The large majority of pregnancies last between 250 and 310 days. Or,
  • The majority of pregnancies are between 275 and 290 days.

A more subtle description avoids setting hard bounds in favor of saying which durations are common and which not. This common-or-not description is called a “distribution.” The “histogram” is a famous style of presentation of a distribution. Even elementary-school students are introduced to histograms; they are easy to draw.

There are good reasons to avoid the busy display of a histogram. For instance, we want to be able to show relationships between variables and we want, whenever possible, to put the graphical summaries of data as a layer on top of the data themselves. And we have the computer as a tool for making graphics. Consequently, our preferred format for displaying distributions is a smooth shape, oriented along the vertical axis. The width of the shape expresses how common is the corresponding region of the vertical axis. The word “density” is often used when talking about distributions. Where the data points are closely spaced to one another, the density is high. Where data points are sparse, the density is low. You can see the density at any level of the vertical axis, just as you can read by eye the density of tufts of grass sprouting in a newly tilled field.

Figure 19.3 shows the density display layered on top of the pregnancy data. For reasons that may be evident, this sort of display is called a “violin plot.”

# Layer a violin (density) display on top of the jittered data points
Plot1 +
  geom_violin(fill="blue", alpha=0.65, color=NA)

Figure 19.3: A violin plot. The long axis of the violin-like shape is oriented along the response-variable axis (that is, the vertical axis in our standard format). The width of the violin for each possible value of the response variable is proportional to the density of data near that value.
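To make the idea of "density" concrete, here is a minimal sketch using base R's density() function, which computes the kind of smooth curve that a violin plot draws. The vector gest is simulated purely for illustration; it is an assumption standing in for the gestation variable and is not the real Gestation data.

```r
# Hypothetical stand-in for the gestation variable (days); simulated, not the real data.
set.seed(101)
gest <- rnorm(1200, mean = 279, sd = 16)

# Kernel density estimate: a smooth "common-or-not" description of the values.
dens <- density(gest)

# Read the estimated density at two durations by interpolating along the curve.
at <- approx(dens$x, dens$y, xout = c(280, 310))$y
names(at) <- c("280 days", "310 days")
at
```

The density near 280 days comes out higher than the density near 310 days, which is what the wider and narrower parts of a violin express graphically.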

The shapes of the two violins in Figure 19.3 are similar, suggesting that the variation in the duration of pregnancy is about the same for first-time mothers as for mothers in a second or later pregnancy.

There is a strong link between interval descriptions of variation and the density display. Suppose you specify the fraction of cases that you want to include in an interval description, say 50% or 80%. In terms of the violin, that fraction is a proportion of the overall area of the violin. For instance, the 50% interval would include the central 50% of the area of the violin, leaving 25% out at the bottom and another 25% out at the top. The 80% interval would leave out only 10% of the area at the top and 10% at the bottom of the violin. This suggests that the interval style of describing variation really involves three numbers: the top and bottom of the interval as well as the selected percentage (say, 50% or 80%) used to find the location of the top and bottom.
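One way to compute such intervals is with base R's quantile() function. The sketch below again uses a simulated vector, gest, as a hypothetical stand-in for the gestation variable, so the particular numbers it produces say nothing about real pregnancies.

```r
# Hypothetical stand-in for the gestation variable (days); simulated, not the real data.
set.seed(101)
gest <- rnorm(1200, mean = 279, sd = 16)

# Central 50% interval: leave out 25% at the bottom and 25% at the top.
interval_50 <- quantile(gest, probs = c(0.25, 0.75))

# Central 80% interval: leave out 10% at the bottom and 10% at the top.
interval_80 <- quantile(gest, probs = c(0.10, 0.90))

interval_50
interval_80

# Fraction of cases that fall inside the 50% interval.
mean(gest >= interval_50[[1]] & gest <= interval_50[[2]])
```

Each interval is reported as a bottom and a top, and the chosen percentage is the third number needed to interpret them.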

Yet another style for describing variation, one that will take primary place in these Lessons, uses only a single number. Perhaps the simplest way to imagine how a single number can capture variation is to think about the numerical difference between the top and bottom of an interval description. In taking such a distance as the measure of variation, we are throwing out some information. Taken together, the top and bottom of the interval describe two things: the location of the values and how different the values are from one another. These are both important, but it is the difference between values that gives a pure description of variation.

Early pioneers of statistics took some time to agree on a standard way of measuring variation. For instance, should it be the distance between the top and bottom of a 50% interval, or should an 80% interval be used, or something else? In the end, the selected standard is not about an interval but something rather more basic: the distances between pairs of individual values.

To illustrate, suppose the gestation variable had only two entries, say, 267 and 293 days. The difference between these is \(267-293 = -26\) days. Of course, we don’t intend to measure distance with a negative number. One solution is to use the absolute value of the difference. However, for subtle mathematical reasons relating to—of all things!—the Pythagorean theorem, we avoid the possibility of a negative number by using the square of the difference, that is, \((293 - 267)^2 = 676\) days-squared.

To extend this very simple measure of variation to data with \(n > 2\) is simple: look at the squared difference between every possible pair of values, then average. For instance, for \(n=3\) with values 267, 293, 284, look at the squared differences \((267-293)^2\), \((267-284)^2\), and \((293-284)^2\) and average them! This simple way of measuring variation is called the “modulus” and dates from at least 1885. Since then, statisticians have standardized on a closely related measure, the “variance,” which is the modulus divided by 2. Either one would have been fine, but there are advantages to standardizing on one: the variance.

Variance as pairwise-differences

Figure 19.4 is a jitter plot of the gestation duration variable from the Gestation data frame. There is no explanatory variable in the graph because we are focusing on just the one variable: gestation. The range in the values of gestation runs from just over 220 days to just under 360 days.

Each red line in Figure 19.4 connects two randomly selected values from the variable. Some of the lines are short; the values are pretty close (in vertical offset). Some of the lines are long; the values differ substantially.

Figure 19.4: The variance is related to the average square difference between all pairs of values in the variable.

Only a few pairs of points have been connected with the red lines. To connect every possible pair of points would fill the graph with so many lines that it would be impossible to see that each line connects a pair of values.

The average of the square of the length of the lines (in the vertical direction) is called the “modulus.” We won’t need to use this word, since the “variance” is the standard description of variability. Numerically, the variance is half the value of the modulus.
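The relationship between the modulus and the variance can be checked by hand on a tiny example. The sketch below uses the three values 267, 293, and 284 days and base R's combn() to list every pair; the variable names are chosen only for illustration.

```r
# Three example gestation values (days).
values <- c(267, 293, 284)

# Every distinct pair of values, one pair per column.
pairs <- combn(values, 2)

# "Modulus" in the sense used here: the average squared difference over all pairs.
modulus <- mean((pairs[1, ] - pairs[2, ])^2)

modulus       # 348.6667 days-squared
modulus / 2   # 174.3333 days-squared
var(values)   # 174.3333 days-squared
```

Half the average squared pairwise difference matches the result from var(), at least for this small example.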

Calculating the variance is straightforward using the var() function. Remember, var() is similar to the other summary functions such as mean() or median() that reduce multiple values into a single value. As always, such reduction functions are used along with the summarize() wrangling command.

Tip: In the expression summarize(variance = var(gestation)), the name variance is selected for human readability of the results. The name need not have anything to do with the quantity being calculated. So, summarize(f2 = var(gestation)) is perfectly valid from the computer’s point of view, but not so helpful from the human perspective. Perhaps you would prefer a very short name such as vgest or a slightly more descriptive name such as gest_var. It’s up to you!

Gestation %>%
  summarize(variance = var(gestation))

A consequence of the use of squaring in defining the variance is the units of the result. gestation is measured in days, so var(gestation) is measured in days². The advantage of this will only become clear later in these Lessons. For now, you might prefer to think about the square root of the variance, which has been given the name “standard deviation” and which has more natural units: days, in the case of sd(gestation).

Gestation %>%
  summarize(standard_deviation = sd(gestation))
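As a quick check on the relationship between the two measures, the following sketch compares sd() with the square root of var() on a small, made-up vector of durations. The vector x is hypothetical and is used only to keep the example self-contained.

```r
# A small, made-up vector of durations in days (not the real Gestation data).
x <- c(267, 293, 284)

variance <- var(x)   # units: days-squared
std_dev  <- sd(x)    # units: days

# The standard deviation is the square root of the variance.
isTRUE(all.equal(std_dev, sqrt(variance)))
```

Squaring the standard deviation recovers the variance, so either number can be converted to the other.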

Accounting for variation

The word “account” has several related meanings (see note 1).

  • To “account for something” means “to be the explanation or cause of something.” [Oxford Languages]
  • An “account of something” is a story, a description, or an explanation, as in the Biblical account of the creation of the world.
  • To “take account of something” means “to consider particular facts, circumstances, etc. when making a decision about something.”

Synonyms for “account” include “description,” “report,” “version,” “story,” “statement,” “explanation,” “interpretation,” “sketch,” and “portrayal.” “Accountants” and their “account books” keep track of where money comes from and goes to.

These various nuances of meaning, from a simple arithmetical tallying-up to an interpretation or version, serve the purposes of statistical thinking well. When we “account for variation,” we are telling a story that tries to explain where the variation might have come from. An accounting of variation is not necessarily definitive, true, or helpful. Just as witnesses of an event can have different accounts, so there can be many accounts of the variation even of the same variable in the same data frame.

There are many formats for stories, many ways of organizing facts and data, and many ways of accounting for variance. In these Lessons, we will use regression modeling almost exclusively as our method of accounting. Here, for example, are two different accounts of gestation:

lm(gestation ~ 1, data = Gestation) %>% conf_interval()

term           .lwr   .coef    .upr
(Intercept)   278.4   279.3   280.2

lm(gestation ~ parity, data = Gestation) %>% conf_interval()

term                      .lwr     .coef       .upr
(Intercept)            279.500   281.300   283.0000
parityprevious-preg     -4.641    -2.585    -0.5288

In the R language, expressions like gestation ~ 1 and gestation ~ parity are called “tilde expressions.” They are the means by which the modeler specifies the structure of the model that is to be built. Training (or “fitting”) translates the model specification into an arithmetic formula that involves the explanatory variables and numerical coefficients.

The coefficients from a regression model are part of an accounting for variation. Learning how to read them is an important skill in statistical thinking. For instance, the coefficient from a model in the form y ~ 1 is always the average value of variable y. In contrast, in a model like y ~ x, the “intercept” is a baseline value and the x-coefficient describes what part of the variation in y can be credited to x.
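To see how these coefficients relate to simple averages, here is a minimal sketch using base R's lm() and coef(). The data frame Sim is simulated and entirely hypothetical (its group sizes, group averages, and spread are assumptions), so only the relationships among the numbers matter, not the numbers themselves.

```r
# Hypothetical data: two parity groups with slightly different average gestation.
set.seed(202)
Sim <- data.frame(parity = rep(c("first-time", "previous-preg"), times = c(300, 900)))
Sim$gestation <- ifelse(Sim$parity == "first-time", 281, 279) + rnorm(nrow(Sim), sd = 16)

# Model y ~ 1: the single coefficient is the overall mean of the response.
fit_mean <- lm(gestation ~ 1, data = Sim)
coef(fit_mean)
mean(Sim$gestation)

# Model y ~ x with a two-level x: the intercept is the mean of the baseline
# group, and the other coefficient is the difference between the group means.
fit_parity <- lm(gestation ~ parity, data = Sim)
coef(fit_parity)
group_means <- tapply(Sim$gestation, Sim$parity, mean)
group_means
group_means[["previous-preg"]] - group_means[["first-time"]]
```

With a categorical explanatory variable, R takes the alphabetically first level ("first-time" here) as the baseline, which is why the intercept corresponds to first-time mothers.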

Variation unaccounted for

A model typically accounts for only some of the variation in a response variable. The remaining variation is called “residual variation.”

Consider the model gestation ~ parity. In the next lines of code we build this model, training it with the Gestation data. Then we evaluate the model on the training data. This amounts to using the model coefficients to generate a model output for each row in the training data, and can be accomplished with the model_eval() R function.

Model <- lm(gestation ~ parity, data = Gestation)
Evaluated <- model_eval(Model)
.response  parity         .output  .resid  .lwr  .upr
      262  previous-preg    278.7  -16.70   247   310
      280  previous-preg    278.7    1.32   247   310
      286  previous-preg    278.7    7.32   247   310
      290  previous-preg    278.7   11.30   247   310
      277  first-time       281.3   -4.26   250   313

The .response variable

The output from model_eval() repeats some columns from the data used for evaluation. For example, the explanatory variables are listed by name. (Here, the only explanatory variable is parity.) The response variable is also included, but given a generic name, .response to make it easy to distinguish it from the explanatory variables.

To see where the .output comes from, let’s look again at the model coefficients:

Model %>% conf_interval()

term                      .lwr     .coef       .upr
(Intercept)            279.500   281.300   283.0000
parityprevious-preg     -4.641    -2.585    -0.5288

The baseline value is 281.3 days. This applies to first-time mothers. For the other mothers, those with a previous pregnancy, the coefficient indicates that the model value is 2.6 days less than the baseline, or 278.7 days.

The output from model_eval() includes other columns of importance. For us, here, those are the response variable itself (gestation, which has been given a generic name, .response) and the residuals from the model (.resid). There is a simple relationship between .response, .output and .resid:

\[\mathtt{.response} = \mathtt{.output} + \mathtt{.resid}\]
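This relationship can be verified with base R alone. The sketch below is not the model_eval() workflow; it substitutes the base functions fitted() and residuals() and a simulated, hypothetical data frame Sim, so treat it as an illustration of the identity rather than a reproduction of the results above.

```r
# Hypothetical data: two parity groups with slightly different average gestation.
set.seed(202)
Sim <- data.frame(parity = rep(c("first-time", "previous-preg"), times = c(300, 900)))
Sim$gestation <- ifelse(Sim$parity == "first-time", 281, 279) + rnorm(nrow(Sim), sd = 16)

Model_sim <- lm(gestation ~ parity, data = Sim)

model_output <- unname(fitted(Model_sim))     # plays the role of .output
model_resid  <- unname(residuals(Model_sim))  # plays the role of .resid

# The response equals the model output plus the residual, row by row.
isTRUE(all.equal(Sim$gestation, model_output + model_resid))
```

The residual is simply whatever is left of each response value after the model output is subtracted.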

Demonstration: Why the variance?

The subtle mathematical reasoning behind the choice of variance to measure variation is illuminated when we compute the variances of the three quantities in the previous equation.

Evaluated %>%
  summarize(var_response = var(.response),
            var_output = var(.output),
            var_resid  = var(.resid))

var_response   var_output   var_resid
     256.887     1.273587    255.6134

The variances of the output and residuals add up to equal, exactly, the variance of the response variable! This isn’t true for the standard deviations:

Evaluated %>%
  summarize(sd_response = sd(.response),
            sd_output = sd(.output),
            sd_resid  = sd(.resid))

sd_response   sd_output   sd_resid
   16.02769    1.128533   15.98791
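The same pattern can be reproduced with base R on simulated data. As before, the data frame Sim is hypothetical, and fitted() and residuals() are used here in place of model_eval(); the sketch is meant only to show that the additivity holds for variances and fails for standard deviations.

```r
# Hypothetical data: two parity groups with slightly different average gestation.
set.seed(202)
Sim <- data.frame(parity = rep(c("first-time", "previous-preg"), times = c(300, 900)))
Sim$gestation <- ifelse(Sim$parity == "first-time", 281, 279) + rnorm(nrow(Sim), sd = 16)

Model_sim <- lm(gestation ~ parity, data = Sim)

var_response <- var(Sim$gestation)
var_output   <- var(fitted(Model_sim))
var_resid    <- var(residuals(Model_sim))

# Variances add: response = output + residual.
isTRUE(all.equal(var_response, var_output + var_resid))

sd_response <- sqrt(var_response)
sd_output   <- sqrt(var_output)
sd_resid    <- sqrt(var_resid)

# Standard deviations do not add; their sum overshoots the response's.
sd_output + sd_resid > sd_response

# Instead they combine like the sides of a right triangle (Pythagorean theorem).
isTRUE(all.equal(sd_response, sqrt(sd_output^2 + sd_resid^2)))
```

This is the sense in which the Pythagorean theorem lies behind the choice of squared differences: the standard deviations of the output and residuals behave like the two legs of a right triangle whose hypotenuse is the standard deviation of the response.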

  1. These are drawn from the Oxford Languages dictionaries.