
Lesson 8 listed three basic components of statistical thinking and provided instruction on the first one:

- How to measure the amount of variation

In this Lesson, we will take on components two and three:

- Accounting for variation
- Measuring what remains unaccounted for

The word “account” has several related meanings. These are drawn from the Oxford Languages dictionaries.

- To “account for something” means “to be the explanation or cause of something.” [Oxford Languages]
- An “account of something” is a story, a description, or an explanation, as in the Biblical account of the creation of the world.
- To “take account of something” means “to consider particular facts, circumstances, etc. when making a decision about something.”

Synonyms for “account” include “description,” “report,” “version,” “story,” “statement,” “explanation,” “interpretation,” “sketch,” and “portrayal.” “Accountants” and their “account books” keep track of where money comes from and goes to.

These various nuances of meaning, from a simple arithmetical tallying up to an interpretative story, serve the purposes of statistical thinking well. When we “account for variation,” we are telling a story that tries to explain where the variation might have come from. Although the arithmetic used in the accounting is correct, the story behind the accounting is not necessarily definitive, true, or helpful. Just as witnesses of an event can have different accounts, so there can be many accounts of the variation even of the same variable in the same data frame.

In business, accounting is the process of keeping track of money and value: assets, receipts, expenditures, etc. Likewise, in statistics, accounting for variation is the process of keeping track of variation. At its most basic, accounting for variation splits a variable into components: those parts which can be attributed to one or more other variables and that part which cannot be so attributed.

The variable being split up is called the “**response variable**.” The choice of a response variable is made by the human modeler, depending on her purposes for undertaking the work.

To illustrate, we’ll return to a historical statistical landmark, the data collected in the 1880s by Francis Galton on the heights of parents and their full-grown children. In that era, there was much excitement, uncertainty, and controversy about Charles Darwin’s *theory of evolution*. Today, we understand the role of DNA in encoding genetic information. But in Galton’s day, the concept of “gene” was unknown.

Remarkably, Galton was Darwin’s cousin.

In the `Galton` data frame, each specimen is a full-grown child. The variables recorded are the `height` of the child in inches, the heights of the `mother` and `father`, and the `sex` of the child according to the conventions of Victorian England, where the data were collected.

Galton’s overall project was to quantify the heritability of traits (such as height) from parents to offspring. Darwin had famously measured the beaks of finches on the Galapagos Islands. Galton, in London, worked with the most readily available local species: humans. Height is easy and socially acceptable to measure, a good candidate for data collection.

Keep in mind that Galton *invented* the methods we are going to use in this example.

Everyone can see that height varies from person to person. Galton sought to divide the height variation in the children into two parts: that which could be attributed to the parents and that which remained unaccounted for, which we call the “**residual**.” We will start with a mathematically simpler project: dividing the variation in the children’s heights into that part associated with the child’s sex and the residual.
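In symbols, this division can be sketched as follows (the notation here is ours, not Galton’s): each child’s height is the sum of two parts,

\[\text{height} = \text{model value} + \text{residual},\]

where the model value is the part attributed to the explanatory variable (here, `sex`) and the residual is the part that remains unaccounted for.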

As introduced in Lesson 4, we can calculate the average height for females and males using the `group_by()` and `summarize()` wrangling verbs.

```
Galton |>
  group_by(sex) |>
  summarize(mheight = mean(height))
```

```
sex mheight
---- --------
F 64.1102
M 69.2288
```

This simple report indicates that the sexes differ, on average, by about 5 inches in height. But this report offers no insight into how much of the person-to-person variation in height is attributable to `sex`. For that, we need to look at the group averages in the context of the individuals. To that end, we will use the `mutate()` verb to assign to each individual a “**model value**” that depends only on `sex`.

```
Model1 <- Galton |>
  group_by(sex) |>
  mutate(modval = mean(height)) |>
  ungroup()
```

```
set.seed(101)
Model1 |>
  select(height, modval, sex, mother, father) |>
  sample(size=10)
```

```
height modval sex mother father orig.id
------- ------- ---- ------- ------- --------
70.7 69.23 M 65.5 65 841
66.0 69.23 M 67.0 65 825
68.5 69.23 M 66.7 69 430
63.0 64.11 F 63.0 72 95
70.0 69.23 M 62.0 71 209
70.5 69.23 M 66.5 69 442
62.5 64.11 F 63.7 70 351
63.0 64.11 F 66.0 65 829
70.0 69.23 M 64.0 70 315
66.0 64.11 F 67.0 70 246
```

Figure 8.1 shows the table above graphically. Some things to notice:

- The model values are different for males and females.
- Among the females, the model values do *not* vary with `mother` or `father`. The same is true for the males.
- The model values are not the same as the individual heights.

```
Model1 |>
  sample(size=250) |>
  gf_point(height ~ sex,
           alpha=0.3,
           position = position_jitter(height=0, seed=101)) |>
  gf_point(modval ~ sex, height=0, color="blue", alpha=0.3,
           position = position_jitter(height=0, seed=101))
```

The numerical difference between each specimen’s height and its model value is called the “**residual**.” This is easy to calculate:

```
Model1 <- Model1 |>
  mutate(resid = height - modval)
set.seed(101)
Model1 |>
  select(height, modval, resid, sex, mother, father) |>
  sample(size=10)
```

```
height modval resid sex mother father orig.id
------- ------- -------- ---- ------- ------- --------
70.7 69.23 1.4710 M 65.5 65 841
66.0 69.23 -3.2290 M 67.0 65 825
68.5 69.23 -0.7288 M 66.7 69 430
63.0 64.11 -1.1100 F 63.0 72 95
70.0 69.23 0.7712 M 62.0 71 209
70.5 69.23 1.2710 M 66.5 69 442
62.5 64.11 -1.6100 F 63.7 70 351
63.0 64.11 -1.1100 F 66.0 65 829
70.0 69.23 0.7712 M 64.0 70 315
66.0 64.11 1.8900 F 67.0 70 246
```


The residuals for the model of `height` versus `sex` are shown in the table above. Notice that some residuals are positive, meaning that the actual `height` is larger than the model value. Other residuals, where the actual `height` is below the model value, are negative.

**Instructors** should note the similarity of the partitioning of variances to the Pythagorean Theorem. With right triangles, the squared lengths of the legs add to the squared length of the hypotenuse: \(A^2 + B^2 = C^2\). This is not a coincidence.

Seen as vectors, the response variable is the hypotenuse of a right triangle whose legs are the model values and residuals. You can see that the legs are perpendicular from the dot product of the model values and the residuals:

```
Model1 |>
  summarize(dotproduct = sum(modval * resid))
```

```
dotproduct
-----------
0
```

As always, the variance is our preferred measure of the amount of variation. Looking at the variances of the model values and of the residuals, you can see that they are *smaller* than the variance of the response variable (`height`).

```
Model1 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 6.549 6.288
```

Remarkably, the variances of the model values and of the residuals add up exactly to equal the variance of the response variable. In other words, the variation in the response variable is split into two parts: the variation associated with the explanatory variable and the remaining (“residual”) variation.
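A sketch of why the split is exact: writing the response as the sum of model values and residuals, the standard rule for the variance of a sum gives

\[\text{var}(\text{height}) = \text{var}(\text{modval}) + \text{var}(\text{resid}) + 2\,\text{cov}(\text{modval}, \text{resid}).\]

The covariance depends on the dot product of the model values with the residuals and on the mean residual. Since the dot product is zero (as computed above) and the residuals average to zero, the covariance term vanishes, leaving exactly the two variances.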

The previous section used the categorical variable `sex` as the explanatory variable. Galton’s interest, however, was in the relationship between children’s and parents’ height.

It’s trivial to substitute the height of the `mother` or `father` in place of `sex` in the method introduced in the previous section. However, as we shall see, the results are not satisfactory. Galton’s key discovery was the proper method to use when relating two quantitative variables such as `height` and `mother`.

First, let’s try simply substituting in `mother` as the explanatory variable and using `mean()` to create the model values.

```
Model2 <- Galton |>
  group_by(mother) |>
  mutate(modval = mean(height),
         resid = height - modval) |>
  ungroup()
```

The variance of `height` is split into two parts in the same way for `mother` as when `sex` was used as the explanatory variable.

```
Model2 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8 1.53 11.3
```

Unfortunately, the graph of the model values versus the explanatory variable does not show any kind of sensible relationship. (See Figure 8.2.)

```
Model2 |>
  gf_point(height ~ mother, alpha=0.1) |>
  gf_point(modval ~ mother, color="blue", alpha=0.3) |>
  gf_line(modval ~ mother, color="blue", linewidth=0.5)
```

Common sense suggests that the model linking mothers’ heights to children’s heights should be **smooth** rather than the jagged pattern seen in Figure 8.2. The source of the jaggedness is the use of `group_by()`. When using `group_by()`, the mean height for children whose mothers are, say, 62 inches tall is calculated without reference to mothers of any other height. Intuition suggests that the model value for children whose mothers are 62.5 inches in height should be very similar to the model value for the children of the 62-inch mothers. Likewise, one expects that the model value for mothers of middle height should be somewhere in between the model values for short and for tall mothers.

The solution to the problem of jagged model values is to avoid the absolute splitting into non-overlapping groups by mother’s height. Instead, we want to find a smooth relationship. Galton invented the method for accomplishing this. A modern form of his method is provided by the `model_values()` function, which we shall use to construct `Model3`.

```
Model3 <- Galton |>
  mutate(modval = model_values(height ~ mother),
         resid = height - modval)
```

Notice that the `group_by()` step has been entirely removed. Notice also that `model_values()` uses the same kind of **tilde expression** as we have employed when plotting. The response variable is listed on the left of the tilde, the explanatory variable on the right side. In other words, we are modeling the child’s `height` as a function of `mother`’s height.

```
Model3 |>
  gf_point(height ~ mother, alpha=0.1) |>
  gf_point(modval ~ mother, color="blue", alpha=0.3) |>
  gf_line(modval ~ mother, color="blue", linewidth=0.5)
```

As always, modeling splits the variance of the response variable into two parts, one associated with the explanatory variable and the other holding what’s left over: the residual. Here’s the split for `Model3`, which uses `mother` as an explanatory variable:

```
Model3 |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.84 0.52 12.32
```

The models we work with in these Lessons *always* have exactly one response variable. But models can have any number of *explanatory* variables.

Note that the ideas of “response” and “explanatory” variables refer to a model and are not at all intrinsic to a bare data frame. A data frame can contain many variables, any of which can be used as explanatory variables. The choice of response variable depends on the needs of the modeler.

Whatever the number of explanatory variables, and however many levels a categorical explanatory variable has, the model splits the variance of the response into two complementary pieces: the variance accounted for by the explanatory variables and the part not accounted for, that is, the *residual* variance. To illustrate, here is a sequence of models of `height` with different numbers of explanatory variables.

Many statistical terms mean something different in statistical use than in everyday use. “Residual” is a pleasant exception: the statistical meaning is closely matched by its everyday dictionary definition.

**Three explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ sex + mother + father),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 8.211706 4.625599
```

**Two explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ sex + mother),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 7.212022 5.625283
```

**One explanatory variable**

```
Galton |>
  mutate(modval = model_values(height ~ sex),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 6.549134 6.288171
```

**Zero explanatory variables**

```
Galton |>
  mutate(modval = model_values(height ~ 1),
         resid = height - modval) |>
  summarize(var(height), var(modval), var(resid))
```

```
var(height) var(modval) var(resid)
------------ ------------ -----------
12.8373 0 12.8373
```

When selecting explanatory variables, it is often helpful to compare two or more different models sharing the same response variable: a simple model and a model which adds one or more explanatory variables to the simple model. The model with *no explanatory variables* is always the simplest possible model. For example, in Section 8.2, the model `height ~ 1` is the simplest. Compared to the simplest model, the model `height ~ sex` has one additional explanatory variable, `sex`. Similarly, `height ~ sex + mother` has one additional explanatory variable compared to `height ~ sex`, and `height ~ sex + mother + father` adds in still another explanatory variable.

The simpler model is said to be “**nested in**” the more extensive model, analogous to a series of Matroshka dolls. A simple measure of how much of the response variance is accounted for by the explanatory variables is the ratio of the variance of the model values divided by the variance of the response variable itself. This ratio is called “R^{2}”, pronounced “R-squared.”

```
knitr::include_graphics("www/Russian-Matroshka_no_bg.jpeg")
```

For instance, R^2 for the model `height ~ sex + mother` is

\[\text{R}^2 = \frac{7.21}{12.84} = 0.56\]

By comparison, R^2 for the simpler model, `height ~ sex`, is slightly smaller:

\[\text{R}^2 = \frac{6.55}{12.84} = 0.51\]

R^{2} is also known as the “coefficient of determination,” a little-used term that we shall avoid. Still, it’s worth noting the attitude behind the term; it quantifies the extent to which the response variable is “determined” by the explanatory variables.

For all models, \(0 \leq\) R^{2} \(\leq 1\). It is tempting to believe that the “best” model in a set of nested models is the one with the highest R^{2}, but statistical thinkers understand that “best” ought to depend on the purpose for which the model is being built. This matter will be a major theme in the remaining Lessons.
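As a sketch of the R^{2} computation, here is a small self-contained example in base R. The four-row data frame is made up for illustration (it is not Galton’s data), and group means serve as the model values, as in `Model1` above.

```
# Sketch: computing R-squared by hand for a grouped-mean model.
# The tiny data frame below is invented for illustration only.
Small <- data.frame(
  sex    = c("F", "F", "M", "M"),
  height = c(63, 65, 68, 70)
)
Small$modval <- ave(Small$height, Small$sex)   # group means as model values
Small$resid  <- Small$height - Small$modval    # residuals

# The variance of the response splits exactly into the two parts ...
var(Small$height)                              # 9.666667
var(Small$modval) + var(Small$resid)           # 9.666667, the same

# ... and R-squared is the ratio of model-value variance to response variance.
R2 <- var(Small$modval) / var(Small$height)
R2                                             # 0.862069
```

With only four rows the numbers are easy to check by hand: the group means are 64 and 69, so the model values vary much more than the residuals, and R^2 is close to 1.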

Instructor Note: Strictly speaking, “all” should be qualified to mean “linear least-squares models with an intercept term.”

Exercise 8.1

DRAFT: Simple calculation of \(R^2\).

Exercise 8.2

Exercise 8.3

DRAFT: Show that the sum of squares and mean square for modeling using the median is larger than for the mean.

Exercise 8.4

DRAFT: Do a nested model approach with `Anthro_F` and R^{2}.