
Lesson 8 listed three basic components of statistical thinking and provided instruction on the first one:

- How to measure the amount of variation

In this Lesson, we will take on components two and three:

- Accounting for variation
- Measuring what remains unaccounted for

The word “account” has several related meanings. These are drawn from the Oxford Languages dictionaries.

- To “account for something” means “to be the explanation or cause of something.” [Oxford Languages]
- An “account of something” is a story, a description, or an explanation, as in the Biblical account of the creation of the world.
- To “take account of something” means “to consider particular facts, circumstances, etc. when making a decision about something.”

Synonyms for “account” include “description,” “report,” “version,” “story,” “statement,” “explanation,” “interpretation,” “sketch,” and “portrayal.” “Accountants” and their “account books” keep track of where money comes from and goes to.

These various nuances of meaning, from a simple arithmetical tallying up to an interpretative story, serve the purposes of statistical thinking well. When we “account for variation,” we are telling a story that tries to explain where the variation might have come from. Although the arithmetic used in the accounting is correct, the story behind the accounting is not necessarily definitive, true, or helpful. Just as witnesses of an event can have different accounts, so there can be many accounts of the variation even of the same variable in the same data frame.

In business, accounting is the process of keeping track of money and value: assets, receipts, expenditures, etc. Likewise, in statistics, accounting for variation is the process of keeping track of variation. At its most basic, accounting for variation splits a variable into components: those parts which can be attributed to one or more other variables and that part which cannot be so attributed.

The variable being split up is called the “**response variable**.” The choice of a response variable is made by the human modeler, depending on her purposes for undertaking the work.

To illustrate, we’ll return to a historical statistical landmark, the data collected in the 1880s by Francis Galton on the heights of parents and their full-grown children. In that era, there was much excitement, uncertainty, and controversy about Charles Darwin’s *theory of evolution*. Today, we understand the role of DNA in encoding genetic information. But in Galton’s day, the concept of “gene” was unknown.

Remarkably, Galton was Darwin’s cousin.

In the `Galton` data frame, each specimen is a full-grown child. The variables recorded are the `height` of the child in inches, the heights of the `mother` and `father`, and the `sex` of the child according to the conventions of Victorian England, where the data were collected.

Galton’s overall project was to quantify the heritability of traits (such as height) from parents to offspring. Darwin had famously measured the beaks of finches on the Galapagos Islands. Galton, in London, worked with the most readily available local species: humans. Height is easy and socially acceptable to measure, a good candidate for data collection.

Keep in mind that Galton *invented* the methods we are going to use in this example.

Everyone can see that height varies from person to person. Galton sought to divide the height variation in the children into two parts: that which could be attributed to the parents and that which remained unaccounted for, which we call the “**residual**.” We will start with a mathematically simpler project: dividing the variation in the children’s heights into that part associated with the child’s sex and the residual.
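In symbols, this division can be sketched as follows (the notation here is ours, not Galton’s): each child’s height is the sum of two parts,

\[\text{height} = \text{model value} + \text{residual},\]

where the model value is the part attributed to the explanatory variable (here, `sex`) and the residual is the part that remains unaccounted for.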

As introduced in Lesson 4, we can calculate the average height for females and males using the `group_by()` and `summarize()` wrangling verbs.

```
Galton |>
  group_by(sex) |>
  summarize(mheight = mean(height))
```

```
sex mheight
---- --------
F 64.1102
M 69.2288
```

This simple report indicates that the sexes differ, on average, by about 5 inches in height. But this report offers no insight into how much of the person-to-person variation in height is attributable to `sex`. For that, we need to look at the group averages in the context of the individuals. To that end, we will use the `mutate()` verb to assign to each individual a “**model value**” that depends only on `sex`.

```
Model1 <- Galton |>
  group_by(sex) |>
  mutate(modval = mean(height)) |>
  ungroup()
```

```
set.seed(101)
Model1 |>
  select(height, modval, sex, mother, father) |>
  sample(size=10)
```

```
height modval sex mother father orig.id
------- ------- ---- ------- ------- --------
70.7 69.23 M 65.5 65 841
66.0 69.23 M 67.0 65 825
68.5 69.23 M 66.7 69 430
63.0 64.11 F 63.0 72 95
70.0 69.23 M 62.0 71 209
70.5 69.23 M 66.5 69 442
62.5 64.11 F 63.7 70 351
63.0 64.11 F 66.0 65 829
70.0 69.23 M 64.0 70 315
66.0 64.11 F 67.0 70 246
```

Figure 8.1 shows the table above graphically. Some things to notice:

- The model values are different for males and females.
- Among the females, the model values do *not* vary with `mother` or `father`. The same is true for the males.
- The model values are not the same as the individual heights.

```
Model1 |>
  sample(size=250) |>
  gf_point(height ~ sex,
           alpha=0.3,
           position = position_jitter(height=0, seed=101)) |>
  gf_point(modval ~ sex, height=0, color="blue", alpha=0.3,
           position = position_jitter(height=0, seed=101))
```

The numerical difference between each specimen’s height and its model value is called the “**residual**.” This is easy to calculate:

```
Model1 <- Model1 |>
  mutate(resid = height - modval)
set.seed(101)
Model1 |>
  select(height, modval, resid, sex, mother, father) |>
  sample(size=10)
```

```
height modval resid sex mother father orig.id
------- ------- -------- ---- ------- ------- --------
70.7 69.23 1.4710 M 65.5 65 841
66.0 69.23 -3.2290 M 67.0 65 825
68.5 69.23 -0.7288 M 66.7 69 430
63.0 64.11 -1.1100 F 63.0 72 95
70.0 69.23 0.7712 M 62.0 71 209
70.5 69.23 1.2710 M 66.5 69 442
62.5 64.11 -1.6100 F 63.7 70 351
63.0 64.11 -1.1100 F 66.0 65 829
70.0 69.23 0.7712 M 64.0 70 315
66.0 64.11 1.8900 F 67.0 70 246
```


The residuals for the model of `height` versus `sex` are shown in the table above. Notice that some residuals are positive, meaning that the actual `height` is larger than the model value. Other residuals, where the actual `height` is below the model value, are negative.

**Instructors** should note the similarity of the partitioning of variances to the Pythagorean Theorem. With right triangles, the squared lengths of the legs add to the squared length of the hypotenuse: \(A^2 + B^2 = C^2\). This is not a coincidence.

Seen as vectors, the response variable is the hypotenuse of a right triangle whose legs are the model values and residuals. You can see that the legs are perpendicular from the dot product of the model values and the residuals:

```
Model1 |>
  summarize(dotproduct = sum(modval * resid))
```

```
dotproduct
-----------
0
```

As always, the variance is our preferred measure of the amount of variation. Looking at the variances of the model values and of the residuals, you can see that they are *smaller* than the variance of the response variable (`height`).

```
Model1 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 6.549 6.288
```

Remarkably, the variances of the model values and of the residuals add up exactly to equal the variance of the response variable. In other words, the variation in the response variable is split into two parts: the variation associated with the explanatory variable and the remaining (“residual”) variation.
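A sketch of why the split is exact: writing the response as the sum of model values and residuals, the standard rule for the variance of a sum gives

\[\text{var}(\text{height}) = \text{var}(\text{modval}) + \text{var}(\text{resid}) + 2\,\text{cov}(\text{modval}, \text{resid}).\]

The covariance depends on the dot product of the model values with the residuals and on the mean residual. Since the dot product is zero (as computed above) and the residuals average to zero, the covariance term vanishes, leaving exactly the two variances.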

The previous section used the categorical variable `sex` as the explanatory variable. Galton’s interest, however, was in the relationship between children’s and parents’ height.

It’s trivial to substitute the height of the `mother` or `father` in place of `sex` in the method introduced in the previous section. However, as we shall see, the results are not satisfactory. Galton’s key discovery was the proper method to use when relating two quantitative variables such as `height` and `mother`.

First, let’s try simply substituting in `mother` as the explanatory variable and using `mean()` to create the model values.

```
Model2 <- Galton |>
  group_by(mother) |>
  mutate(modval = mean(height),
         resid = height - modval) |>
  ungroup()
```

The variance of `height` is split into two parts in the same way for `mother` as when `sex` was used as the explanatory variable.

```
Model2 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8 1.53 11.3
```

Unfortunately, the graph of the model values versus the explanatory variable does not show any kind of sensible relationship. (See Figure 8.2.)

```
Model2 |>
  gf_point(height ~ mother, alpha=0.1) |>
  gf_point(modval ~ mother, color="blue", alpha=0.3) |>
  gf_line(modval ~ mother, color="blue", linewidth=0.5)
```

Common sense suggests that the model linking mothers’ heights to children’s heights should be **smooth** rather than the jagged pattern seen in Figure 8.2. The source of the jaggedness is the use of `group_by()`. When using `group_by()`, the mean height for children whose mothers are, say, 62 inches tall is calculated without reference to mothers of any other height. Intuition suggests that the model value for children whose mothers are 62.5 inches in height should be very similar to the model value for the children of the 62-inch mothers. Likewise, one expects that the model value for mothers of middle height should be somewhere in between the model values for short and for tall mothers.

The solution to the problem of jagged model values is to avoid the absolute splitting into non-overlapping groups by mother’s height. Instead, we want to find a smooth relationship. Galton invented the method for accomplishing this. A modern form of his method is provided by the `model_values()` function, which we shall use to construct `Model3`.

```
Model3 <- Galton |>
  mutate(modval = model_values(height ~ mother),
         resid = height - modval)
```

Notice that the `group_by()` step has been entirely removed. Notice also that `model_values()` uses the same kind of **tilde expression** as we have employed when plotting. The response variable is listed on the left of the tilde, the explanatory variable on the right side. In other words, we are modeling the child’s `height` as a function of `mother`’s height.

```
Model3 |>
  gf_point(height ~ mother, alpha=0.1) |>
  gf_point(modval ~ mother, color="blue", alpha=0.3) |>
  gf_line(modval ~ mother, color="blue", linewidth=0.5)
```

As always, modeling splits the variance of the response variable into two parts, one associated with the explanatory variable and the other holding what’s left over: the residual. Here’s the split for `Model3`, which uses `mother` as an explanatory variable:

```
Model3 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 0.52 12.32
```

The models we work with in these Lessons *always* have exactly one response variable. But models can have any number of *explanatory* variables.

Note that the ideas of “response” and “explanatory” variables refer to a model and are not at all intrinsic to a bare data frame. A data frame can contain many variables, any of which can be used as explanatory variables. The choice of response variable depends on the needs of the modeler.

Whatever the number of explanatory variables, and however many levels a categorical explanatory variable has, the model splits the variance of the response into two complementary pieces: the variance accounted for by the explanatory variables and the part not accounted for, that is, the *residual* variance. To illustrate, here is a sequence of models of `height` with different numbers of explanatory variables.

Many statistical terms mean something different in statistical use than in everyday use. “Residual” is a pleasant exception: the statistical meaning is closely matched by its everyday dictionary definition.

**Three explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ sex + mother + father),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 8.211706 4.625599
```

**Two explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ sex + mother),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 7.212022 5.625283
```

**One explanatory variable**

```
Galton |>
  mutate(modval = model_values(height ~ sex),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 6.549134 6.288171
```

**Zero explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ 1),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 0 12.8373
```

When selecting explanatory variables, it is often helpful to compare two or more different models sharing the same response variable: a simple model and a model which adds one or more explanatory variables to the simple model. The model with *no explanatory variables* is always the simplest possible model. For example, in Section 8.2, the model `height ~ 1` is the simplest. Compared to the simplest model, the model `height ~ sex` has one additional explanatory variable, `sex`. Similarly, `height ~ sex + mother` has one additional explanatory variable compared to `height ~ sex`, and `height ~ sex + mother + father` adds in still another explanatory variable.

The simpler model is said to be “**nested in**” the more extensive model, analogous to a series of Matroshka dolls. A simple measure of how much of the response variance is accounted for by the explanatory variables is the ratio of the variance of the model values divided by the variance of the response variable itself. This ratio is called “R^{2}”, pronounced “R-squared.”

```
knitr::include_graphics("www/Russian-Matroshka_no_bg.jpeg")
```

For instance, R^2 for the model `height ~ sex + mother` is

\[\text{R}^2 = \frac{7.21}{12.84} = 0.56\]

By comparison, R^2 for the simpler model, `height ~ sex`, is slightly smaller:

\[\text{R}^2 = \frac{6.55}{12.84} = 0.51\]

R^{2} is also known as the “coefficient of determination,” a little-used term that we shall avoid. Still, it’s worth noting the attitude behind the term; it quantifies the extent to which the response variable is “determined” by the explanatory variables.

For all models, \(0 \leq\) R^{2} \(\leq 1\). It is tempting to believe that the “best” model in a set of nested models is the one with the highest R^{2}, but statistical thinkers understand that “best” ought to depend on the purpose for which the model is being built. This matter will be a major theme in the remaining Lessons.
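As a sketch of the R^{2} computation, here is a small self-contained example in base R. The four-row data frame is made up for illustration (it is not Galton’s data), and group means serve as the model values, as in `Model1` above.

```
# Sketch: computing R-squared by hand for a grouped-mean model.
# The tiny data frame below is invented for illustration only.
Small <- data.frame(
  sex    = c("F", "F", "M", "M"),
  height = c(63, 65, 68, 70)
)
Small$modval <- ave(Small$height, Small$sex)   # group means as model values
Small$resid  <- Small$height - Small$modval    # residuals

# The variance of the response splits exactly into the two parts ...
var(Small$height)                              # 9.666667
var(Small$modval) + var(Small$resid)           # 9.666667, the same

# ... and R-squared is the ratio of model-value variance to response variance.
R2 <- var(Small$modval) / var(Small$height)
R2                                             # 0.862069
```

With only four rows the numbers are easy to check by hand: the group means are 64 and 69, so the model values vary much more than the residuals, and R^2 is close to 1.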

Instructor Note: Strictly speaking, “all” should be qualified to mean “linear least-squares models with an intercept term.”

Exercise 8.1

DRAFT: Simple calculation of \(R^2\).

Exercise 8.2

Exercise 8.3

DRAFT: Show that the sum of squares and mean square for modeling using the median is larger than for the mean.

Exercise 8.4

DRAFT: Do a nested model approach with `Anthro_F` and R^{2}.