17  R-squared and covariates

In Lessons sec-accounting-for-variation and sec-signal-and-noise, we introduced a commonly used measure, R2, which measures how much variation in a response variable a model can be accounted for using the explanatory variables.

Do we need R2?

If you read scientific papers that involve analysis of data, you are likely to encounter R2 (or its little cousin, r). R2 and r have a long history. r was introduced in 1888 as a way of quantifying the relationship between two quantitative variables. R2 was introduced in 1921 as linear regression techniques started to get traction in statistical methodology.

Such a long pedagree may explain why R2 is used so much in research literature. If only for this reason, you must learn about R2. R2 also offers a window into a statistical artifact that surprises most newcomers to statistics, which you will read about in this Lesson.

For models with only quantitative explanatory variables or zero-one explanatory variables, model coefficients, and confidence intervals provide more information, but even so, R2 can highlight when explanatory variables overlap. And when there are categorical explanatory variables, particularly ones with many levels, R2 can genuinely contribute to understanding how much each explanatory variable contributes to a model. But some care is needed to make sure that R2 is put in the context of the sample size and the number of coefficients in a model. We’ll get to that.

Fraction of variance explained

R2 has a simple-sounding interpretation in terms of how much of the variation in a response variable is accounted for by the explanatory variables.

Recall that we measure variation using the variance. R2 compares the variance “captured” by the explanatory variable to the amount of variation in the response variable on its own. To illustrate, consider the Hill_races data frame. Taking the winning running times as the response variable, we might wonder how much of the variation in time is accounted for by the race characteristics, for instance by the length of the race course (in km).

Here’s the variance of the time variable:

Hill_racing |> summarize(var(time, na.rm = TRUE), sd(time, na.rm = TRUE))
var(time, na.rm = TRUE) sd(time, na.rm = TRUE)
9754276 3123.184

As always, the units of the variance are the square of the units of the variable. Since time is in seconds, var(time) has units of “seconds-squared.” The standard deviation, which is the square root of the variance, is often easier to understand as an “amount.” That the standard deviation is about 3000 s, about an hour, means that the running times of the various races collected in Hill_racing range over hours: very different races are included in the data frame.

Naturally, the races differ from one another. Among other things, they differ in distance (in km). We can model time versus difference and look at the coefficients:

Hill_racing |> model_train(time ~ distance) |> conf_interval()
term .lwr .coef .upr
(Intercept) -296.1214 -210.9137 -125.7060
distance 374.4936 381.0230 387.5524

The units of the distance coefficient are seconds-per-kilometer (s/km). Three hundred eighty seconds per kilometer is a pace slightly slower than six minutes per km, or about ten miles per hour: a ten-minute mile. These are the winning times in the races. You might be tempted to think that these races are for casual runners.

R2 provides another way to summarize the model.

Hill_racing |> model_train(time ~ distance) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2226 1 0.854827 13095.65 0.8547617 0 1 2224

The R2 for the model is 0.85. A simple explanation is that the race distance explains 85% of the variation from race to race in running time: the large majority. This is no surprise to those familiar with racing: a 440 m race takes much less time than a 10,000-meter race. What might account for the other 15% of the variation in time? There are many possibilities.

An important feature of Scottish hill racing is the … hills. Many races feature substantial climbs. How much of the variation in race time is explained by the height (in m) of the climb? R2 provides a ready answer:

Hill_racing |> model_train(time ~ climb) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2224 1 0.7650186 7234.066 0.7649128 0 1 2222

The height of the climb also explains a lot of the variation in time: about three-quarters of it.

To know how much of the time variance climb and distance together explain, don’t simply add together the individual R2. By trying it, you can see why in this case: the amount of variation explained is 85% + 76% = 161%. That should strike you as strange! No matter how good the explanatory variables, they can never explain more than 100% of the variation in the response variable.

The source of the impossibly large R2 is that, to some extent, both time and climb share in the explanation; the two explanatory variables each explain much the same thing. We avoid such double-counting by including both explanatory variables at the same time:

Hill_racing |> model_train(time ~ distance + climb) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2224 2 0.9223273 13186.68 0.9222574 0 2 2221

Taken together, distance and climb account for 92% of the variation in race time. This leaves at most 8% of the variation yet to be explained: the residual variance.

R2 and categorical explanatory variables

Consider this question: Does the name of the race along with 565 other race names recorded as the race variable) influence the running time for the race?

There are 154 levels in the race variable, which records the name of the race. (name is the name of the runner.) Examples: Glen Rosa Horseshoe, Ben Nevis Race, …

Here’s a simple model:

Race_name_model1 <- Hill_racing |> model_train(time ~ race) 

R2 offers a much easier-to-interpret summary than the model coefficients in this situation. Here are the model coefficients:

Race_name_model1 |> conf_interval() -> Goo
term .lwr .coef .upr
(Intercept) 1300 2010.0 2710
raceAlex Brett Cioch Mhor 1950 2710.0 3480
raceAlva Games Hill Race -1340 -580.0 182
raceAonach Mor UKA event (men) -1600 -23.7 1550
raceAonach Mor UKA event (women) -893 329.0 1550
raceAonach Mor Uphill -1060 -224.0 612
      ... for 154 coefficient altogether

The reference race is the Aberfeldy Games Race. (Why? Because Aberfeldy is first alphabetically.) Each coefficient on another race gives a result by comparison to Aberfeldy.

The question that started this section was not about the Alva Games Hill Race, the Aonach Mor Uphill, or any of the others, but about the whole collection of differently named races. The R2 summary brings all the races together to measure the amount of time variance “explained” by the race names:

Race_name_model1 |> R2()
n k Rsquared F adjR2 p df.num df.denom
2226 153 0.9505387 260.2572 0.9468864 0 153 2072

This is a strikingly large R2. Based on this, some people might be tempted to think that a race’s name plays a big part in the race outcome. Statistical thinkers, however, will wonder whether the race name encodes other information that explains the race outcome. For example, we can look at how well the race name “explains” the race distance and climb:

Hill_racing |> model_train(distance ~ race) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2236 153 1 Inf 1 0 153 2082
Hill_racing |> model_train(climb ~ race) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2234 152 1 Inf 1 0 152 2081

The race name “explains” 100% of the variation for both’ distance’ and’ climb’. That’s because each distinct race has its own distance and climb. So, the race name carries all the information in the distance and climb variables.

By adjusting for distance and climb, we can ask more focused questions about the possible role of the race name in determining. First, recall that just distance and climb account for 92% of the variance in time.

Hill_racing |> model_train(time ~ distance + climb) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2224 2 0.9223273 13186.68 0.9222574 0 2 2221

Adding in race increases the R2 by only three percentage points, to 95%. (See Exercise exr-Q30-201.)

Degrees of freedom

Using categorical variables with a large number of levels are used as explanatory variables, a new phenomenon becomes apparent, a sort of mirage of explanation. To illustrate, consider the model time ~ name. There are five-hundred sixty-seven unique runners in the Hill_racing data.

Hill_racing |> model_train(time ~ name) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2226 565 0.42936 2.210645 0.2351361 0 565 1660

The runner’s identity accounts for about 43% of the variance in running time. Understandably, the R2 is not much higher: runners participate in multiple races with different distances and climbs so it’s natural for an individual runner to have a spread of running time.

Let’s experiment to illustrate the difficulty of interpreting R2 when there are many levels in a categorical explanatory variable. We will create a new variable consisting only of random noise.

Hill_racing <- Hill_racing |> mutate(noise = rnorm(n()))

Naturally, there is no genuine explanation of noise. For instance, distance and climb account for 92% of the actual running times, but a trivial percentage of the noise:

Hill_racing |> model_train(noise ~ distance + climb) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2234 2 0.0017704 1.978339 0.0008755 0.138541 2 2231

In contrast, name, with its 567 different levels, seems to “explain” a lot of noise:

Hill_racing |> model_train(noise ~ name) |> R2()
n k Rsquared F adjR2 p df.num df.denom
2236 566 0.2467746 0.9660854 -0.0086631 0.6926547 566 1669

The 567 names explain about one-quarter of the variance in noise, which ought not to be explainable at all!

How can name explain something that it has no connection with? First, note that the Hill_racing sample size is n=2236. (You can see the sample size in all R2 reports under the name n.) When we fit the model noise ~ name, there will be 567 different coefficients, one of which is the intercept and 566 of which relate to name. This number—labelled k in the R2 report—is called the “degrees of freedom” of the model.

In general, models with more degrees of freedom can explain more of the response variable, even when there is nothing to explain. On average, the R2 in a nothing-to-explain situation will be roughly k/n. For the noise ~ name model, the k-over-n ratio is 566/2236 = 0.25.

Small data

In some situations, a sample may include just a handful of specimens, say \(n=5\). A simple model, such as y ~ x, will have a small number of degrees of freedom. With y ~ x, there are two coefficients: the intercept and the coefficient on x. With only a single non-intercept coefficient, the model degrees of freedom is \(k=1\).

Nonetheless, the typical R2 from such a model, even when y and x are completely unrelated, will be at least \(k/n = 0.20\). It’s tempting to interpret an R2 of 0.20 as the sign of a relationship between y and x. To avoid such misinterpretations, statistical formulas and software carefully track k and n and arrange things to compensate.

One simple compensation for model degrees of freedom is “adjusted R2.” The adjustment is roughly this: take R2 and subtract \(k/n\). Insofar as there is no relationship between the response and explanatory variables, this will bring R2 down to about zero. An adjusted R2 greater than zero indicates a relationship between the response and explanatory variables. Adjusted R2 is useful when the goal is to ascertain whether there is a substantial relationship. This goal is common in fields such as econometrics.

Statistics textbooks favor other styles of adjustment that are, perhaps surprisingly, not oriented to pointing to a substantial relationship. A famous style of adjustment is encapsulated in the t statistic, which applies to models with only a single degree of freedom. A generalization of t to models with more degrees of freedom is the F statistic.


Exercise 17.1  

Comparing models with ANOVA

Modelers are often in the position of having a model that they like but are contemplating adding one or more additional explanatory variables. To illustrate, consider the following models:

  • Model 1: list_price ~ 1
  • Model 2: list_price ~ 1 + hard_paper
  • Model 3: list_price ~ 1 + hard_paper + num_pages
  • Model 4: list_price ~ 1 + hard_paper + num_pages + weight_oz
Figure 17.1: Nesting Russian dolls

All the explanatory variables in the smaller models also apply to the bigger models. Such sets are said to be “nested” in much the same way as for Russian dolls.

For a nested set of models, R2 can never decrease when moving from a smaller model to a larger one—almost always, there is an increase in R2. To demonstrate:

amazon_books <- amazon_books |> 
  select(list_price, weight_oz, num_pages, hard_paper) 
amazon_books <- amazon_books |>
model1 <- amazon_books |> 
  model_train(list_price ~ 1)
model2 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz)
model3 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz + num_pages)
model4 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz + num_pages + hard_paper)
n k Rsquared F adjR2 p df.num df.denom
314 0 0 NaN 0 NaN 0 313
n k Rsquared F adjR2 p df.num df.denom
314 1 0.16 57 0.15 0 1 312
n k Rsquared F adjR2 p df.num df.denom
314 2 0.17 31 0.16 0 2 311
n k Rsquared F adjR2 p df.num df.denom
314 3 0.17 21 0.16 0 3 310

When adding explanatory variables to a model, a good question is whether the new variable(s) add to the ability to account for the variability in the response variable. R2 never goes down when moving from a smaller to a larger model, so we cannot rely on the increase in R2. A valuable technique called “Analysis of Variance” (ANOVA for short) looks at the incremental change in variance explained from a smaller model to a larger one. The increase can be presented as an F statistic. To illustrate:

anova_summary(model1, model2, model3, model4)
term df.residual rss df sumsq statistic p.value
list_price ~ 1 313 54606 NA NA NA NA
list_price ~ 1 + weight_oz 312 46122 1 8484 58.0 0.00
list_price ~ 1 + weight_oz + num_pages 311 45513 1 609 4.2 0.04
list_price ~ 1 + weight_oz + num_pages + hard_paper 310 45338 1 175 1.2 0.27

Focus on the column named statistic. This records the F statistic. The move from Model 1 to Model 2 produces F=57, well above the threshold described above and clearly indicating that the weight_oz variable accounts for some of the list price. Moving from Model 2 to Model 3 creates a much less impressive F of 3.8. It is as if the added explanatory variable, num_pages, is just barely pulling its own “weight.” Finally, moving from Model 3 to Model 4 produces a below-threshold F of 1.3. In other words, in the context of weight_oz and num_pages, the hard_paper variable does not carry additional information about the list price.

The last column of the report, labeled Pr(>F), translates F into a universal 0 to 1 scale called a p-value. A large F produces a small p-value. The rule of thumb for reading p-values is that a value \(p < 0.05\) indicates that the added variable brings new information about the response variable. We will return to p-values and the controversy they have entailed in Lesson sec-NHT.


Exercise 17.2  

In-sample versus out-of-sample R2

The R2 reported on a model is always optimistic; it underestimates the residual noise from the model. The “adjusted R2” attempts to correct for this optimism. Adjusted R2 is based on a theory that’s appropriate for linear regression models.

Increasingly, particularly when it comes to models used as parts of artificial intelligence systems, nonlinear modeling methods are used, for example “neural networks” or “random forests.” In assessing the predictive ability of these models, a computational technique is used to avoid overly optimistic models.

DEMONSTRATE in- and out-of-sample