These Lessons will introduce you to several habits of mind that have, over the last century, been found useful when collecting and interpreting data. Whenever we encounter something new, questions or ideas for actions come to mind. For instance, a financially-minded person arranging for a loan will presumably ask about the interest rate. An economy-minded consumer, seeing a price for, say, olive oil, will know to check the volume that is being provided for that price.
A statistically-minded thinker knows “how and when we can draw valid inferences from data.” [Source] The word “valid” means several things at once: faithful to the data, consistent with the process used to assemble the data, and informative for the uses to which the inferences are to be directed. Part of statistical thinking is being aware of a variety of useful tools for looking at data and judging from the context of the task which tools are appropriate and which not.
Every person has a natural ability to think. We train our thinking skills by observing and emulating the logic and language of people and sources deemed authoritative. We have resources spanning several millennia to hone our ability to think. However, statistical thinking is a comparatively recent arrival on the intellectual scene, germinating and developing over only the last 150 years. As a result, hardly anything that we hear or read exemplifies statistical thinking.
In general, effective thinking requires us to grasp various intellectual tools, for example, logic. Our mode of logical thinking was promulgated by Aristotle (384–322 BC) and, to quote the Stanford Encyclopedia of Philosophy, “has had an unparalleled influence on the history of Western thought.” In the 2500 years since Aristotle’s time, the use of Aristotelian logic has been so pervasive that we expect any well-educated person to be able to identify logical thinking. For example, the statement “John’s car is red” has implications. Which of these two statements is among those implications? “That red car is necessarily John’s,” or “The blue car is not John’s car.” Not so hard!
The intellectual tools needed for statistical thinking are, by and large, unfamiliar and non-intuitive. These Lessons are intended to provide the tools you will need to engage in effective statistical thinking.
To get started, consider this headline from The Economist, a well-reputed international news magazine: “The pandemic’s indirect effects on small children could last a lifetime.” As support for this claim, the headlined article provides more detail. For instance:
“Stress and distraction made some parents more distant. LENA, a charity in Colorado, has for years used wearable microphones to keep track of how much chatter babies and their care-givers exchange. During the pandemic the number of such ‘conversations’ declined. … [G]etting lots of interaction in the early years of life is essential for healthy development, so these kinds of data ‘are a red flag’.” The article goes on to talk of “children starved of stimulation at home ….”
This short excerpt might raise some questions. Think about it briefly and note what questions come to mind.
For those already along the road toward statistical thinking, the phrase, “the number of such conversations declined” might prompt this question: “By how much?” Similarly, reading the claim that “getting lots of interactions … is essential for healthy development,” your mind might insist on these questions: How much is “lots?” How does the decline in the number compare to “lots?”
Not finding the answer to these questions in the article’s text, it would be sensible to look for the primary source of the information. In our Internet age, that’s comparatively easy to do. The LENA website includes an article, “COVID-era infants vocalize less and experience fewer conversational turns, says LENA research team.” The article contains several graphs, one of which is reproduced in Figure 19.1.
To make any proper sense of Figure 19.1, you need some basic technical knowledge. For example, what do the vertical bars in the graph mean as opposed to the dots? What does “Percentile” signify? What is the purpose behind displaying \(n=494\) and \(n=136\) below the graph? What does the subcaption “t(628) = 3.03, p = 0.003” tell us, if anything? Turning back to the text of The Economist, how does this graph justify raising a “red flag?” More basically, are these graphs the “data,” or is there more data behind the graphs? What would that data show?
The LENA article does not link to supporting data, that is, what lies behind the graphs in Figure 19.1. But the LENA article does point to other publications.
“These findings from LENA support a growing body of evidence that babies born during the COVID pandemic are, on average, experiencing developmental delays. For example, researchers from the COMBO (COVID-19 Mother Baby Outcomes) consortium at Columbia University published findings in the January 2022 issue of JAMA Pediatrics showing that children born during the pandemic achieved significantly lower gross motor, fine motor, and personal-social scores at six months of age.”
To the statistical thinker, phrases like “red flag,” “growing body of evidence,” and “significantly lower” are weasel words, that is, terms “used in order to evade or retreat from a direct or forthright statement or position.” [Source] In ordinary thinking, such evasiveness or lack of forthrightness would naturally prompt concern about the reliability of the claim. It makes sense to look deeper, for instance, by checking out the JAMA article. Many people would be hesitant to do this, anticipating that the article would be filled with jargon and incomprehensible. An important reason to study statistical thinking is to tear down barriers to substantiating or debunking claims. In fact, the JAMA article contains very little that requires knowledge of pediatrics or the meaning of “gross motor, fine motor, and personal-social scores,” but a lot that depends on understanding statistical notation and convention and the reasoning behind the conventions.
The tools of statistical thinking are the tools for making sense of data. Evaluating data is essential to determine whether to rely on claims supposedly based on those data. In the words of eminent engineer and statistician W. Edwards Deming: “In God we trust. All others must bring data.” Similarly, former President Ronald Reagan famously quoted a Russian proverb: “Trust, but verify.” Unfortunately, until you have the statistical thinking tools needed to interpret data reliably, all you can do is trust, not verify.
Learning a new way of thinking is genuinely hard. As you learn statistical thinking, it may help to have a concise definition. The following definition captures much of the essence of statistical thinking:
Statistical thinking is the accounting for variation in the context of what remains unaccounted for.
Implicit in this definition is a pathway for learning to think statistically:

1. Learn how to measure variation.
2. Learn how to account for variation.
3. Learn how to describe what remains unaccounted for.

The next three sections briefly touch on each of these three topics.
Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions. — Stephen Jay Gould (1941–2002), paleontologist and historian of science
To illustrate variation, let’s consider a process fundamental to human life: gestation. We all know that human pregnancy “typically” lasts around nine months but differs unpredictably from one birth to another.
Figure 19.2 shows data from the `Gestation` data frame. In this data frame, each of the 1200 rows is one pregnancy and birth about which several measurements were made. The `gestation` variable records the length of the pregnancy (in days).
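For a concrete sense of the table, here is a quick peek at the two variables used below. This is just a sketch: it assumes the `Gestation` data frame and the usual data-wrangling packages are already loaded.

```r
# A first look at the two variables used in this section
Gestation %>%
  select(parity, gestation) %>%
  head()
```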
```r
Gestation <- Gestation %>%
  mutate(parity = ifelse(parity == 0, "first-time", "previous-preg"))
Plot1 <- Gestation %>%
  ggplot(aes(x = parity, y = gestation)) +
  geom_jitter(alpha = 0.2, width = 0.2, height = 0)
Plot1
```
Figure 19.2 divides the 1200 births in the `Gestation` data frame according to the variable `parity`, which describes whether or not the pregnancy is the mother’s first.

The variation in `gestation` is evident directly from the dots in the graph. One strategy for describing variation is to specify an interval: the span between a lower and an upper value.
A more subtle description avoids setting hard bounds in favor of saying which durations are common and which not. This common-or-not description is called a “distribution.” The “histogram” is a famous style of presentation of a distribution. Even elementary-school students are introduced to histograms; they are easy to draw.
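As a sketch of that conventional approach, a histogram of `gestation` takes only one ggplot2 layer. The bin count here is an arbitrary choice of ours, not something from the original text.

```r
# A conventional histogram of the gestation durations
Gestation %>%
  ggplot(aes(x = gestation)) +
  geom_histogram(bins = 20)
```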
There are good reasons to avoid the busy display of a histogram. For instance, we want to be able to show relationships between variables, and we want, whenever possible, to put the graphical summaries of data as a layer on top of the data themselves. And we have the computer as a tool for making graphics. Consequently, our preferred format for displaying distributions is a smooth shape, oriented along the vertical axis. The width of the shape expresses how common values are in the corresponding region of the vertical axis. The word “density” is often used when talking about distributions. Where the data points are closely spaced to one another, the density is high. Where data points are sparse, the density is low. You can see the density at any level of the vertical axis, just as you can read by eye the density of tufts of grass sprouting in a newly tilled field.
Figure 19.3 shows the density display layered on top of the pregnancy data. For reasons that may be evident, this sort of display is called a “violin plot.”
```r
Plot1 +
  geom_violin(aes(group = parity),
              fill = "blue", alpha = 0.65, color = NA)
```
The shapes of the two violins in Figure 19.3 are similar, suggesting that the variation in the duration of pregnancy is about the same for first-time mothers as for mothers in a second or later pregnancy.
There is a strong link between interval descriptions of variation and the density display. Suppose you specify the fraction of cases that you want to include in an interval description, say 50% or 80%. In terms of the violin, that fraction is a proportion of the overall area of the violin. For instance, the 50% interval would include the central 50% of the area of the violin, leaving 25% out at the bottom and another 25% out at the top. The 80% interval would leave out only 10% of the area at the top and bottom of the violin. This suggests that the interval style of describing variation really involves three numbers: the top and bottom of the interval, as well as the selected percentage (say, 50% or 80%) used to find the locations of the top and bottom.
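In code, such interval descriptions amount to quantiles. The following is a minimal sketch, assuming `gestation` has no missing values; the names `lower_50`, `upper_50`, and so on are ours, chosen for readability.

```r
# Central 50% and 80% intervals for gestation
Gestation %>%
  summarize(lower_50 = quantile(gestation, 0.25),
            upper_50 = quantile(gestation, 0.75),
            lower_80 = quantile(gestation, 0.10),
            upper_80 = quantile(gestation, 0.90))
```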
Yet another style for describing variation—one that will take primary place in these Lessons—uses only a single number. Perhaps the simplest way to imagine how a single number can capture variation is to think about the numerical difference between the top and bottom of an interval description. In taking such a distance as the measure of variation, we are throwing out some information. Taken together, the top and bottom of the interval describe two things: the location of the values and how different the values are from one another. These are both important, but it is the difference between values that gives a pure description of variation.
Early pioneers of statistics took some time to agree on a standard way of measuring variation. For instance, should it be the distance between the top and bottom of a 50% interval, or should an 80% interval be used, or something else? In the end, the selected standard is not about an interval but something rather more basic: the distances between pairs of individual values.
To illustrate, suppose the `gestation` variable had only two entries, say, 267 and 293 days. The difference between these is \(267-293 = -26\) days. Of course, we don’t intend to measure distance with a negative number. One solution is to use the absolute value of the difference. However, for subtle mathematical reasons relating to—of all things!—the Pythagorean theorem, we avoid the possibility of a negative number by using the square of the difference, that is, \((293 - 267)^2 = 676\) days-squared.

To extend this very simple measure of variation to data with \(n > 2\) is simple: look at the squared difference between every possible pair of values, then average. For instance, for \(n=3\) with values 267, 293, and 284, average the three quantities \((267-293)^2\), \((267-284)^2\), and \((293-284)^2\). This simple way of measuring variation is called the “modulus” and dates from at least 1885. Since then, statisticians have standardized on a closely related measure, the “variance,” which is the modulus divided by two. Either one would have been fine, but there are advantages to standardizing on one: the variance.
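To see the modulus–variance relationship numerically, here is a small sketch using base R’s `combn()` to list every pair of the three example values:

```r
vals <- c(267, 293, 284)
pairs <- combn(vals, 2)                       # every possible pair of values
modulus <- mean((pairs[1, ] - pairs[2, ])^2)  # average squared difference
modulus / 2                                   # 174.33...
var(vals)                                     # the same: variance = modulus / 2
```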
Figure 19.4 is a jitter plot of the `gestation` duration variable from the `Gestation` data frame. There is no explanatory variable in the graph because we are focusing on just the one variable: `gestation`. The range in the values of `gestation` runs from just over 220 days to just under 360 days.
Each red line in Figure 19.4 connects two randomly selected values from the variable. Some of the lines are short; the values are pretty close (in vertical offset). Some of the lines are long; the values differ substantially.
Only a few pairs of points have been connected with the red lines. To connect every possible pair of points would fill the graph with so many lines that it would be impossible to see that each line connects a pair of values.
The average of the square of the length of the lines (in the vertical direction) is called the “modulus.” We won’t need to use this word, since the “variance” is the standard description of variability. Numerically, the variance is half the value of the modulus.
Calculating the variance is straightforward using the `var()` function. Remember, `var()` is similar to the other summary functions such as `mean()` or `median()` that reduce multiple values into a single value. As always, such reduction functions are used along with the `summarize()` wrangling command.
In `summarize(variance = var(gestation))`, the name `variance` is selected for human readability of the results. The name need not have anything to do with the quantity being calculated. So, `summarize(f2 = var(gestation))` is perfectly valid from the computer’s point of view, but not so helpful from the human perspective. Perhaps you would prefer a very short name such as `vgest` or a slightly more descriptive name such as `gest_var`. It’s up to you!

```r
Gestation %>%
  summarize(variance = var(gestation))
```

| variance |
|---|
| 256.887 |
A consequence of the use of squaring in defining the variance is the units of the result. `gestation` is measured in days, so `var(gestation)` is measured in days-squared. The advantage of this will become clear only later in these Lessons. For now, you might prefer to think about the square root of the variance, which has been given the name “standard deviation” and which has the more natural units: in the case of `sd(gestation)`, days.
```r
Gestation %>%
  summarize(standard_deviation = sd(gestation))
```

| standard_deviation |
|---|
| 16.02769 |
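As a quick check, taking the square root of the variance reported above reproduces the standard deviation, up to rounding:

```r
sqrt(256.887)   # about 16.0277, matching sd(gestation)
```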
The word “account” has several related meanings.[^1] Synonyms for “account” include “description,” “report,” “version,” “story,” “statement,” “explanation,” “interpretation,” “sketch,” and “portrayal.” “Accountants” and their “account books” keep track of where money comes from and goes to.
These various nuances of meaning, from a simple arithmetical tallying up to an interpretation or version, serve the purposes of statistical thinking well. When we “account for variation,” we are telling a story that tries to explain where the variation might have come from. An accounting of variation is not necessarily definitive, true, or helpful. Just as witnesses of an event can have different accounts, so there can be many accounts of the variation even of the same variable in the same data frame.
There are many formats for stories, many ways of organizing facts and data, and many ways of accounting for variation. In these Lessons, we will use regression modeling almost exclusively as our method of accounting. Here, for example, are two different accounts of `gestation`:
```r
lm(gestation ~ 1, data = Gestation) %>% conf_interval()
```

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 278.4 | 279.3 | 280.2 |
```r
lm(gestation ~ parity, data = Gestation) %>% conf_interval()
```

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 279.500 | 281.300 | 283.0000 |
| parityprevious-preg | -4.641 | -2.585 | -0.5288 |
In the R language, expressions like `gestation ~ 1` and `gestation ~ parity` are called “tilde expressions.” They are the means by which the modeler specifies the structure of the model that is to be built. Training (or “fitting”) translates the model specification into an arithmetic formula that involves the explanatory variables and numerical coefficients.
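A detail worth knowing (a sketch of ours, not from the original text): a tilde expression is an ordinary R object of class `"formula"`, so it can be stored under a name and handed to `lm()` later.

```r
spec <- gestation ~ parity   # a tilde expression, stored for later use
class(spec)                  # "formula"
lm(spec, data = Gestation)   # equivalent to lm(gestation ~ parity, data = Gestation)
```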
The coefficients from a regression model are part of an accounting for variation. Learning how to read them is an important skill in statistical thinking. For instance, the coefficient from a model in the form `y ~ 1` is always the average value of variable `y`. In contrast, in a model like `y ~ x`, the “intercept” is a baseline value and the `x`-coefficient describes what part of the variation in `y` can be credited to `x`.
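You can check the first of these claims directly. This minimal sketch assumes `gestation` has no missing values in the `Gestation` frame; the name `avg` is arbitrary.

```r
lm(gestation ~ 1, data = Gestation) %>% coef()   # the (Intercept) coefficient
Gestation %>%
  summarize(avg = mean(gestation))               # the same value, about 279.3
```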
A model typically accounts for only some of the variation in a response variable. The remaining variation is called “residual variation.”
Consider the model `gestation ~ parity`. In the next lines of code we build this model, training it with the `Gestation` data. Then we evaluate the model on the training data. This amounts to using the model coefficients to generate a model output for each row in the training data, and can be accomplished with the `model_eval()` R function.
```r
Model <- lm(gestation ~ parity, data = Gestation)
Evaluated <- model_eval(Model)
```
| .response | parity | .output | .resid | .lwr | .upr |
|---|---|---|---|---|---|
| 262 | previous-preg | 278.7 | -16.70 | 247 | 310 |
| 280 | previous-preg | 278.7 | 1.32 | 247 | 310 |
| 286 | previous-preg | 278.7 | 7.32 | 247 | 310 |
| 290 | previous-preg | 278.7 | 11.30 | 247 | 310 |
| 277 | first-time | 281.3 | -4.26 | 250 | 313 |
The output from `model_eval()` repeats some columns from the data used for evaluation. For example, the explanatory variables are listed by name. (Here, the only explanatory variable is `parity`.) The response variable is also included, but given a generic name, `.response`, to make it easy to distinguish it from the explanatory variables.
To see where the `.output` comes from, let’s look again at the model coefficients:
```r
Model %>% conf_interval()
```

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | 279.500 | 281.300 | 283.0000 |
| parityprevious-preg | -4.641 | -2.585 | -0.5288 |
The baseline value is 281.3 days. This applies to first-time mothers. For the other mothers, those with a previous pregnancy, the coefficient indicates that the model value is 2.6 days less than the baseline, or 278.7 days.
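In arithmetic terms, \(281.300 - 2.585 = 278.715 \approx 278.7\) days, matching the `.output` value shown for previous-pregnancy mothers in the evaluation table above.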
The output from `model_eval()` includes other columns of importance. For us, here, those are the response variable itself (`gestation`, which has been given a generic name, `.response`) and the residuals from the model (`.resid`). There is a simple relationship between `.response`, `.output`, and `.resid`:
\[\mathtt{.response} = \mathtt{.output} + \mathtt{.resid}\]
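This identity can be verified row by row. Here is a minimal sketch using the `Evaluated` data frame created above; the column names `discrepancy` and `largest` are ours, chosen for readability.

```r
Evaluated %>%
  mutate(discrepancy = .response - (.output + .resid)) %>%
  summarize(largest = max(abs(discrepancy)))   # effectively zero
```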
The subtle mathematical reasoning behind the choice of variance to measure variation is illuminated when we compute the variances of the three quantities in the previous equation.
```r
Evaluated %>%
  summarize(var_response = var(.response),
            var_output = var(.output),
            var_resid = var(.resid))
```

| var_response | var_output | var_resid |
|---|---|---|
| 256.887 | 1.273587 | 255.6134 |
The variances of the output and residuals add up to equal, exactly, the variance of the response variable! This isn’t true for the standard deviations:
```r
Evaluated %>%
  summarize(sd_response = sd(.response),
            sd_output = sd(.output),
            sd_resid = sd(.resid))
```

| sd_response | sd_output | sd_resid |
|---|---|---|
| 16.02769 | 1.128533 | 15.98791 |
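Indeed, \(1.128533 + 15.98791 = 17.11644\), noticeably larger than the response standard deviation of 16.02769. For the variances, by contrast, \(1.273587 + 255.6134 = 256.887\) (to the displayed precision), exactly the variance of the response.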
[^1]: These are drawn from the Oxford Languages dictionaries.