# 29 Covariates eat variance

In Lesson 28, we introduced covariates to set the relationship between an explanatory variable and the response variable in the correct context. In Lesson 30, we will return to this context-setting role to show that the appropriate choice of covariates to include in a model depends on the modeler’s opinion about the relevant structure of a DAG. Here, we will treat covariates as commodity items to show a surprising property of models. This property is a boon to the modeler, helping to enable sound decisions about whether to include any given covariate. However, it is also a pitfall lieing in wait for the wishful thinker.

## How much variation is explained

We start by returning to the definition of statistical thinking introduced at the start of these Lessons:

Statistic thinking is the explanation or description of variationin the context ofwhat remains unexplained or undescribed.

In this Lesson, we will work with a straightforward measure of “what remains unexplained or undescribed.” The fitted model values represent the explained part of the variation. The residuals are what is left over, the difference between the actual values of the response variable and the fitted model values.

As a reminder, we will construct a simple model of the list price of books as a function of the number of pages and whether the book is a paperback or hardcover.^{1}

```
<- lm(list_price ~ num_pages + hard_paper,
Price_model data = amazon_books)
```

The `model_eval()`

function can extract the fitted model values and the residuals from the model. We show just a few rows here, but we will use the entire report from `model_eval()`

. Remember that when `model_eval()`

is not given input values, it uses the model *training* data as input.

`<- model_eval(Price_model) Results `

`Using training data as input to model_eval().`

.response | num_pages | hard_paper | .output | .resid | .lwr | .upr |
---|---|---|---|---|---|---|

12.95 | 304 | P | 16.60 | -3.65 | -10.73 | 43.94 |

15.00 | 273 | P | 15.98 | -0.98 | -11.36 | 43.32 |

1.50 | 96 | P | 12.45 | -10.95 | -14.97 | 39.88 |

15.99 | 672 | P | 23.95 | -7.96 | -3.57 | 51.47 |

30.50 | 720 | P | 24.91 | 5.59 | -2.67 | 52.49 |

28.95 | 460 | H | 24.57 | 4.38 | -2.89 | 52.03 |

The first book in the training data is a 304-page paperback with a list price of $12.95. The fitted model value for that book is $16.60. (Ordinarily, we refer to the output of the model function simply as the “output” or the “model output.” However, the output of the model function, when applied to rows from the *training* data also called the *fitted model value*.)

At $16.60, the fitted model value is $3.65 *higher* than the list price. This difference is the residual for that book, the sign reflecting the definition \[\text{residual} \equiv \text{response value} - \text{fitted model value}\ .\] When the residual is small in magnitude, the fitted model value is close to the response value. Conversely, a large residual means the model was way off target for that book.

The standard measure of the typical size of a residual is the standard deviation or, equivalently, the variance.

`%>% summarize(se_resids = sd(.resid), v_resids=var(.resid)) Results `

se_resids | v_resids |
---|---|

13.81885 | 190.9606 |

As always, the standard deviation is easier to read because it has sensible units, in this case, dollars. On the other hand, the variance has strange units (square dollars) because it is the square of the standard deviation. We will use the variance for measuring the typical size of a residual for the reasons described in Lesson 20; variances add nicely in a manner analogous to the Pythagorean Theorem.

A simple measure of how much of the variation in the response variable remains *unexplained* is the ratio of the variance of the residuals and the variance of the response variable.

`%>% summarize(unexplained_fraction = var(.resid)/var(.response)) Results `

unexplained_fraction |
---|

0.9246907 |

More than 90% of the variation remains unexplained by the `Price_model`

! This high fraction of unexplained variance suggests the model has little to tell us. In the spirit of putting a positive spin on things, statisticians typically work with the complement of the unexplained fraction. Since the unexplained fraction is 92.5%, the complement is 7.5%. This number is written R^{2} and pronounced “R-squared.” (It also has a formal name: the “coefficient of determination.” In Lesson 30, we will meet the inventor of the coefficient of determination, Sewall Wright, who is an early hero of causal reasoning.)

R^{2} is such a widely used summary of how the explanatory variables account for the response variable that a software extractor calculates it and some related values.

`%>% R2() Price_model `

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

317 | 2 | 0.0753093 | 12.78651 | 0.0694196 | 4.6e-06 | 2 | 314 |

Many modelers act as if their goal is to build a model that makes R^{2} as big as possible. Their thinking is that large R^{2} means that the explanatory variables account for much of the response variable’s variance. Unfortunately, it is a naive goal. Instead, always focus on the model’s suitability for the purpose at hand. Often, shooting for a large R^{2} imposes costs that can undermine the purpose for the model. Furthermore, even models with the largest possible R^{2} sometimes have nothing to say about the response variable.

## Getting to 1

R^{2} can range from zero to one. Zero means that the model accounts for *none* of the variation in the response variable. We can construct such a model quickly enough: `list_price ~ 1`

has no explanatory variables and, therefore, no ability to distinguish one book from another.

```
<- lm(list_price ~ 1, data = amazon_books)
Null_model %>% R2() Null_model
```

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 0 | 0 | NaN | 0 | NaN | 0 | 318 |

We are using the word “null” to name this model. “Null” is part of the statistics tradition. The dictionary definition of “null” is “having or associated with the value zero” or “lacking distinctive qualities; having no positive substance or content.”^{2}

In the null model, the fitted model values are all the same; all the variation is in the residuals.

`%>% model_eval() Null_model `

`Using training data as input to model_eval().`

.response | .output | .resid | .lwr | .upr |
---|---|---|---|---|

12.95 | 18.6 | -5.65 | -9.69 | 46.89 |

15.00 | 18.6 | -3.60 | -9.69 | 46.89 |

1.50 | 18.6 | -17.10 | -9.69 | 46.89 |

15.99 | 18.6 | -2.61 | -9.69 | 46.89 |

30.50 | 18.6 | 11.90 | -9.69 | 46.89 |

28.95 | 18.6 | 10.35 | -9.69 | 46.89 |

At the other extreme, where R^{2} = 1, the explanatory variables account for every bit of variation in the response variable. We can try various combinations of explanatory variables to see if we can accomplish this. For example, `publisher`

explains 67% of the variation in list price.

`lm(list_price ~ publisher, data = amazon_books) %>% R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 158 | 0.6749786 | 2.103008 | 0.3540199 | 0 | 158 | 160 |

We can also check whether `author`

has anything to say about the list price.

`lm(list_price ~ author, data = amazon_books) %>% R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 250 | 0.9434046 | 4.534044 | 0.7353333 | 0 | 250 | 68 |

Incredible! How about if we use *both* `publisher`

and `author`

as explanatory variables? We get very close to R^{2} = 1.

```
lm(list_price ~ publisher + author, data = amazon_books) %>%
R2()
```

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 281 | 0.9821609 | 7.249441 | 0.84668 | 0 | 281 | 37 |

The modeler discovering this tremendous explanatory power of `publisher`

and `author`

can be forgiven for thinking he or she has found a meaningful explanation. But, unfortunately, the high R^{2} is an illusion in this case.

To see why, consider another possible explanatory variable, the International Standard Book Number (ISBN). The ISBN is a ten- or thirteen-digit number that marks each book with a unique number.

There is a system behind ISBNs, but despite the “N” standing for “number,” an ISBN is a character string or word (written using only digits). Consequently, the `isbn_10`

variable in `amazon_books`

is categorical.

```
<- lm(list_price ~ isbn_10, data = amazon_books)
ISBN_model %>% R2() ISBN_model
```

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 318 | 1 | NaN | NaN | NaN | 318 | 0 |

The `isbn_10`

explains all variation in the list price!

Given that the ISBN is, as we have said, an arbitrary sequence of characters, why does it do such a good job of accounting for the list price? The answer lies not in the content of the ISBN but in another fact: each book has a unique ISBN. As well, each book has a single price. So the ISBN identifies the price of each book. Cleverness is not involved; the list price could be anything, and the ISBN would still identify it precisely. The model coefficients store the whole set of ISBNs and the corresponding set of list prices.

We can substantiate the claim just made—that the list price could be anything at all—by synthesizing a data frame with random list prices:

```
%>%
amazon_books mutate(random_list_price = rnorm(nrow(.))) %>%
lm(random_list_price ~ isbn_10, data = .) %>%
R2()
```

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 318 | 1 | NaN | NaN | NaN | 318 | 0 |

Similar randomization can be accomplished by *shuffling* the `isbn_10`

column of the data frame so that each ISBN points to a random book. Of course, such shuffling destroys the link between the ISBN and the list price. Even so, the R^{2} remains high.

`lm(list_price ~ shuffle(isbn_10), data=amazon_books) %>% R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 318 | 1 | NaN | NaN | NaN | 318 | 0 |

`lm(shuffle(list_price) ~ isbn_10, data=amazon_books) %>% R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

319 | 318 | 1 | NaN | NaN | NaN | 318 | 0 |

Statistical nomenclature is obscure here. So we will make up a name for such incidental alignment with no true explanatory power: the “**ISBN-effect**.”

Statistical thinkers know to be aware of situations where categorical variables have many levels and check whether the ISBN effect is in play.

## The ISBN effect as a benchmark

Shuffling an explanatory variable (while keeping the response variable in the original order) voids any possible explanatory connection between the two. An R^{2}=0, as we get from any model of the form `y ~ 1`

, signals that the `1`

cannot account for any variation. However, this does not mean shuffling will lead to R^{2} = 0. Instead, there is a systematic relationship between the number of model coefficients associated with the shuffled variable, the sample size \(n\), and R^{2}.

We can demonstrate this relationship by conducting many trials of modeling the `list_price`

with a shuffled explanatory variable: either `publisher`

, `author`

, or `isbn_10`

.

The theory relating R^{2} to the number of coefficients associated is straightforward for shuffled explanatory variables: R^{2} will be random with mean value \(\frac{k}{n-1}\).

We can carry out similar trials for the models `list_price ~ shuffle(author)`

and `list_price ~ shuffle(isbn_10)`

, which have \(k=251\) and \(k=319\) respectively.

The blue diagonal line in Figure 29.3 shows the theoretical average R^{2} as a function of the number of model coefficients when the explanatory variable is randomized. R^{2} will always be 1.0 when \(k=n\), that is, when the number of coefficients is the same as the sample size.

Figure 29.3 suggests a way to distinguish between R^{2} resulting from the ISBN-effect and R^{2} that shows some true explanatory power: Check if R^{2} is substantially above the blue diagonal line, that is, check if R^{2}\(\gg \frac{k}{n-1}\) where \(k\) is the number of model coefficients.

## The F statistic

\(k\) and \(n\) provide the necessary context for proper interpretation of R^{2}; all three numbers are needed to establish whether R^{2} \(\gg \frac{k}{n-1}\) to rule out the ISBN effect. The calculation is not difficult; the modeler always knows the size \(n\) of the training data and can find \(k\) as the number of coefficients in the model (not counting the Intercept term).

Perhaps a little easier than interpreting R^{2} is the interpretation of another statistic, named F, which folds in the \(k\), \(n\), and R^{2} into a single number: \[F \equiv \frac{n-k-1}{k} \frac{\text{R}^2}{1 - \text{R}^2}\] Figure 29.4 is a remake of Figure 29.3 but using F instead of R^{2}. The blue line, which had the formula R^{2}\(= k/(n-1)\) in Figure 29.3, gets translated to the constant value 1.0 in Figure 29.4, regardless of \(k\). To decide when a model points to a connection stronger than the ISBN effect, the threshold F \(> 3\) is a good rule of thumb. (Lesson 37 introduces a more precise calculation for the F threshold, which is built into statistical software and presented as a “**p-value**.”)

## Comparing models

Modelers are often in the position of having a model that they like but are contemplating adding one or more additional explanatory variables. To illustrate, consider the following models:

- Model 1:
`list_price ~ 1`

- Model 2:
`list_price ~ 1 + hard_paper`

- Model 3:
`list_price ~ 1 + hard_paper + num_pages`

- Model 4:
`list_price ~ 1 + hard_paper + num_pages + weight_oz`

All the explanatory variables in the smaller models also apply to the bigger models. Such sets are said to be “**nested**” in much the same way as for Russian dolls.

For a nested set of models, R^{2} can never decrease when moving from a smaller model to a larger one—almost always, there is an increase in R^{2}. To demonstrate:

```
<- amazon_books %>%
amazon_books select(list_price, weight_oz, num_pages, hard_paper) %>%
filter(complete.cases(.))
<- lm(list_price ~ 1, data=amazon_books)
model1 <- lm(list_price ~ 1 + weight_oz, data = amazon_books)
model2 <- lm(list_price ~ 1 + weight_oz + num_pages, data=amazon_books)
model3 <- lm(list_price ~ 1 + weight_oz + num_pages + hard_paper, data=amazon_books) model4
```

`R2(model1)`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

309 | 0 | 0 | NaN | 0 | NaN | 0 | 308 |

`R2(model2)`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

309 | 1 | 0.16 | 57 | 0.15 | 0 | 1 | 307 |

`R2(model3)`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

309 | 2 | 0.17 | 30 | 0.16 | 0 | 2 | 306 |

`R2(model4)`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|

309 | 3 | 0.17 | 21 | 0.16 | 0 | 3 | 305 |

When adding explanatory variables to a model, a good question is whether the new variable(s) add to the ability to account for the variability in the response variable. R^{2} never goes down when moving from a smaller to a larger model, so we cannot rely on the increase in R^{2}. A valuable technique called “**Analysis of Variance**” (ANOVA for short) looks at the incremental change in variance explained from a smaller model to a larger one. The increase can be presented as an F statistic. To illustrate:

`anova_summary(model1, model2, model3, model4)`

term | df.residual | rss | df | sumsq | statistic | p.value |
---|---|---|---|---|---|---|

list_price ~ 1 | 308 | 54531 | NA | NA | NA | NA |

list_price ~ 1 + weight_oz | 307 | 46032 | 1 | 8499 | 57.2 | 0.00 |

list_price ~ 1 + weight_oz + num_pages | 306 | 45466 | 1 | 566 | 3.8 | 0.05 |

list_price ~ 1 + weight_oz + num_pages + hard_paper | 305 | 45277 | 1 | 189 | 1.3 | 0.26 |

Focus on the column named `statistic`

. This records the F statistic. The move from Model 1 to Model 2 produces F=57, well above the threshold described above and clearly indicating that the `weight_oz`

variable accounts for some of the list price. Moving from Model 2 to Model 3 creates a much less impressive F of 3.8. It is as if the added explanatory variable, `num_pages`

, is just barely pulling its own “weight.” Finally, moving from Model 3 to Model 4 produces a below-threshold F of 1.3. In other words, in the context of `weight_oz`

and `num_pages,`

the `hard_paper`

variable does not carry additional information about the list price.

The last column of the report, labeled `Pr(>F)`

, translates F into a universal 0 to 1 scale called a p-value. A large F produces a small p-value. The rule of thumb for reading p-values is that a value \(p < 0.05\) indicates that the added variable brings new information about the response variable. We will return to p-values and the controversy they have entailed in Lessons 36 through 38.

If you seek to duplicate the results presented in this chapter, please note that we have deleted six rows from `amazon_books because the rows are either duplicates or have one of the variables missing. The deleted rows are 62, 103, 205, 211, 242, and 303.↩︎

Source: Oxford Languages↩︎

The data frame in the

`moderndive`

package has six additional rows, which we have deleted as duplicates or because of missing data.↩︎