|>
Galton group_by(sex) |>
summarize(mheight = mean(height))
sex mheight
---- --------
F 64.1102
M 69.2288
Lesson 8 listed three basic components of statistical thinking and provided instruction on the first one:
The word “account” has several related meanings. These are drawn from the Oxford Languages dictionaries.
Synonyms for “account” include “description,”report,” “version,” “story,” “statement,” “explanation,” “interpretation,” “sketch,” and “portrayal.” “Accountants” and their “account books” keep track of where money comes from and goes to.
These various nuances of meaning, from a simple arithmetical tallying up to an interpretative story serve the purposes of statistical thinking well. When we “account for variation,” we are telling a story that tries to explain where the variation might have come from. Although the arithmetic used in the accounting is correct, the story behind the accounting is not necessarily definitive, true, or helpful. Just as witnesses of an event can have different accounts, so there can be many accounts of the variation even of the same variable in the same data frame.
In this Lesson, we will take on components two and three:
In business, accounting is the process of keeping track of money and value: assets, receipts, expenditures, etc. Likewise, in statistics, accounting for variation is the process of keeping track of variation. At its most basic, accounting for variation splits a variable into components: those parts which can be attributed to one or more other variables and that part which cannot be so attributed.
The variable being split up is called the “response variable.” The choice of a response variable is made by the human modeler, depending on her purposes for undertaking the work.
To illustrate, we’ll return to a historical statistical landmark, the data collected in the 1880s by Francis Galton on the heights of parents and their full-grown children. In that era, there was much excitement, uncertainty, and controversy about Charles Darwin’s theory of evolution. Today, we understand the role of DNA in encoding genetic information. But in Galton’s day, the concept of “gene” was unknown.
In the Galton
data frame, each specimen is a full-grown child. The variables recorded are the height
of the child in inches, the heights of the mother
and father
, and the sex
of the child according to the conventions of Victorian England, where the data were collected.
Galton’s overall project was to quantify the heritability of traits (such as height) from parents to offspring. Darwin had famously measured the beaks of finches on the Galapagos Islands. Galton, in London, worked with the most readily available local species: humans. Height is easy and socially acceptable to measure, a good candidate for data collection.
Everyone can see that height varies from person to person. Galton sought to divide the height variation in the children into two parts: that which could be attributed to the parents and that which remained unaccounted for, which we call the “residual.” We will start with a mathematically simpler project: dividing the variation in the children’s heights into that part associated with the child’s sex and the residual.
As introduced in Lesson 4, we can calculate the average height for females and males using the group_by()
and summarize()
wrangling verbs.
|>
Galton group_by(sex) |>
summarize(mheight = mean(height))
sex mheight
---- --------
F 64.1102
M 69.2288
This simple report indicates that the sexes differ, on average, by about 5 inches in height. But this report offers no insight into how much of the person-to-person variation in height is attributable to sex
. For that, we need to look at the group averages in the context of the individuals. To that end, we will use mutate()
to assign to each individual a “model value” that depends only on sex
. The mutate()
verb is helpful here.
<- Galton |>
Model1 group_by(sex) |>
mutate(modval = mean(height)) |>
ungroup()
set.seed(101)
|>
Model1 select(height, modval, sex, mother, father) |>
sample(10)
?(caption)
height modval sex mother father orig.id
------- ------- ---- ------- ------- --------
70.7 69.23 M 65.5 65 841
66.0 69.23 M 67.0 65 825
68.5 69.23 M 66.7 69 430
63.0 64.11 F 63.0 72 95
70.0 69.23 M 62.0 71 209
70.5 69.23 M 66.5 69 442
62.5 64.11 F 63.7 70 351
63.0 64.11 F 66.0 65 829
70.0 69.23 M 64.0 70 315
66.0 64.11 F 67.0 70 246
Figure 8.1 shows ?tbl-galton-model1c graphically. Some things to notice:
mother
or father
. The same is true for the males.|> sample(size=250) |>
Model1 gf_point(height ~ sex,
alpha=0.3,
position = position_jitter(height=0, seed=101)) |>
gf_point(modval ~ sex, height=0, color="blue", alpha=0.3,
position = position_jitter(height=0, seed=101))
height
) and the model values from ?tbl-galton-model1c plotted versus sex
. The model values are the same within each sex for all individuals.The numerical difference between each specimen’s height and its model value is called the “residual.” This is easy to calculate:
<- Model1 |>
Model1 mutate(resid = height - modval)
set.seed(101)
|>
Model1 select(height, modval, resid, sex, mother, father) |>
sample(size=10)
?(caption)
height modval resid sex mother father orig.id
------- ------- -------- ---- ------- ------- --------
70.7 69.23 1.4710 M 65.5 65 841
66.0 69.23 -3.2290 M 67.0 65 825
68.5 69.23 -0.7288 M 66.7 69 430
63.0 64.11 -1.1100 F 63.0 72 95
70.0 69.23 0.7712 M 62.0 71 209
70.5 69.23 1.2710 M 66.5 69 442
62.5 64.11 -1.6100 F 63.7 70 351
63.0 64.11 -1.1100 F 66.0 65 829
70.0 69.23 0.7712 M 64.0 70 315
66.0 64.11 1.8900 F 67.0 70 246
<- Model1 |>
Model1 mutate(resid = height - modval)
The residuals for the model of height
versus sex
are shown in ?tbl-galton-model1d. Notice that some residuals are positive, meaning that the actual height
is larger than the model value. Other residuals, when the actual height
is below the model value, are negative.
Instructors should note the similarity of the partitioning of variances to the Pythagorean Theorem. With right triangles, the square lengths of the legs add to the square length of the hypotenuse: \(A^2 + B^2 = C^2\). This is not a coincidence.
Seen as vectors, the response variable is the hypotenuse of a right triangle whose legs are the model values and residuals. You can see that the legs are perpendicular from the dot product of the model values and the residuals:
|>
Model1 summarize(dotproduct=sum(modval * resid))
dotproduct
-----------
0
As always, the variance is our preferred measure of the amount of variation. Looking at the variances of the model values and of the residuals, you can see that they are smaller than the variance of the response variable (height
).
|>
Model1 summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 6.549 6.288
Remarkably, the variances of the model values and of the residuals add up exactly to equal the variance of the response variable. In other words, the variation in the response variable is split into two parts: the variation associated with the explanatory variable and the remaining (“residual”) variation.
The previous section used the categorical variable sex
as the explanatory variable. Galton’s interest, however, was in the relationship between children’s and parents’ height.
It’s trivial to substitute the height of the mother
or father
in place of sex
in the method introduced in the previous section. However, as we shall see, the results are not satisfactory. Galton’s key discovery was the proper method to use when relating two quantitative variables such as height
and mother
.
First, let’s try simply substituting in mother
as the explanatory variable and using mean()
to create the model values.
<- Galton |>
Model2 group_by(mother) |>
mutate(modval = mean(height),
resid = height - modval) |>
ungroup()
The variance of height
is split into two parts in the same way for mother
as when sex
was used as the explanatory variable.
|>
Model2 summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8 1.53 11.3
Unfortunately, the graph of the model values versus the explanatory variable does not show any kind of sensible relationship. (See Figure 8.2.)
|>
Model2 gf_point(height ~ mother, alpha=0.1) |>
gf_point(modval ~ mother, color="blue", alpha=0.3) |>
gf_line(modval ~ mother, color="blue", linewidth=0.5)
height
of the child by the mother
’s height using group_by(mother)
and modval=mean(height)
. It’s hard to see any pattern in the model values. A thin line has been added connecting adjacent model values to highlight how unsatisfactory the model is.Common sense that the model linking mothers’ heights to children’s height should be smooth rather than the jagged pattern seen in Figure 8.2. The source of the jaggedness is the use of group_by()
. When using group_by()
, the calculated mean height for children whose mothers are, say, 62 inches tall is calculated without any reference to any other height of mother. Intuition suggests that the model value for children whose mothers are 62.5 inches in height should be very similar to the model value for the children of the 62-inch mothers. Likewise, one expects that the model value for mothers of middle height should be somewhere inbetween the model values for short and for tall mothers.
The solution to the problem of jagged model values is to avoid the absolute splitting into non-overlapping groups by mothers height. Instead, we want to find a smooth relationship. Galton invented the method for accomplishing this. A modern form of his method is provided by the model_values()
function, which we shall use to construct Model3
.
<- Galton |>
Model3 mutate(modval = model_values(height ~ mother),
resid = height - modval)
Notice that the group_by()
step has been entirely removed. Notice also that model_values()
uses the same kind of tilde expression as we have employed when plotting. The response variable is listed on the left of the , the explanatory variable on the right side. In other words, we are modeling the child’s
height
as a function of mother
’s height.
|>
Model3 gf_point(height ~ mother, alpha=0.1) |>
gf_point(modval ~ mother, color="blue", alpha=0.3) |>
gf_line(modval ~ mother, color="blue", linewidth=0.5)
height
with respect to mother
created with model_values(height ~ sex)
.As always, modeling splits the variance of the response variable into two parts, one associated with the explanatory variable and the other holding what’s left over: the residual. Here’s the split for Model3
which uses mother
as an explanatory variable:
|>
Model3 summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 0.52 12.32
The models we work with in these Lessons always have exactly one response variable. But models can have any number of explanatory variables.
Whatever the number of explanatory variables, and however many levels a categorical explanatory variable has, the model splits the variance of the response into two complementary pieces: the variance accounted for by the explanatory variables and the part not accounted for, that is, the residual variance. To illustrate, here is a sequence of models of height
with different numbers of explanatory variables.
Three explanatory variables
|>
Galton mutate(modval = model_values(height ~ sex + mother + father),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 8.211706 4.625599
Two explanatory variables
|>
Galton mutate(modval = model_values(height ~ sex + mother),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 7.212022 5.625283
One explanatory variable
|>
Galton mutate(modval = model_values(height ~ sex),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 6.549134 6.288171
Zero explanatory variables
|>
Galton mutate(modval = model_values(height ~ 1),
resid = height - modval) |>
summarize(var(height), var(modval), var(resid))
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 0 12.8373
When selecting explanatory variables, it is often helpful to compare two or more different models sharing the same response variable: a simple model and a model which adds one or more explanatory variables to the simple model. The model with no explanatory variables, is always the simplest possible model. For example, in Section 8.2, the model height ~ 1
is the simplest. Compared to the simplest model, the model height ~ sex
has one additional explanatory variable, sex
. Similarly, height ~ sex + mother
has one additional explanatory variable compared to height ~ sex
, and height ~ sex + mother + father
adds in still another explanatory variable.
The simpler model is said to be “nested in” the more extensive model, analogous to a series of Matroshka dolls. A simple measure of how much of the response variance is accounted for by the explanatory variables is the ratio of the variance of the model values divided by the variance of the response variable itself. This ratio is called “R2”, pronounced “R-squared.”
::include_graphics("www/Russian-Matroshka_no_bg.jpeg") knitr
For instance, R^2 for the model height ~ sex + mother
is
\[\text{R}^2 = \frac{7.21}{12.84} = 0.56\] By comparison, R^2 for the simpler model, height ~ sex
, is slightly smaller:
\[\text{R}^2 = \frac{6.55}{12.84} = 0.51\] For all models, \(0 \leq\) R2 \(\leq 1\). It is tempting to believe that the “best” model in a set of nested models is the one with the highest R2, but, statistical thinkers understand that “best” ought to depend on the purpose for which the model is being built. This matter will be a major theme in the remaining Lessons.
DRAFT: Simple calculation of \(R^2\).
DRAFT: Show that the sum of squares and mean square for modeling using the median is larger than for the mean.
DRAFT: Do a nested model approach with Anthro_F
and R2