17 R-squared and covariates

In Lessons 9 and 13, we introduced a commonly used measure, R², which measures how much variation in a response variable a model can be accounted for using the explanatory variables.

Do we need R²?

If you read scientific papers that involve analysis of data, you are likely to encounter R² (or its little cousin, r). R² and r have a long history. r was introduced in 1888 as a way of quantifying the relationship between two quantitative variables. R² was introduced in 1921 as linear regression techniques started to get traction in statistical methodology.

Such a long pedagree may explain why R² is used so much in research literature. If only for this reason, you must learn about R². R² also offers a window into a statistical artifact that surprises most newcomers to statistics, which you will read about in this Lesson.

For models with only quantitative explanatory variables or zero-one explanatory variables, model coefficients, and confidence intervals provide more information, but even so, R² can highlight when explanatory variables overlap. And when there are categorical explanatory variables, particularly ones with many levels, R² can genuinely contribute to understanding how much each explanatory variable contributes to a model. But some care is needed to make sure that R² is put in the context of the sample size and the number of coefficients in a model. We’ll get to that.

Fraction of variance explained

R² has a simple-sounding interpretation in terms of how much of the variation in a response variable is accounted for by the explanatory variables.

Recall that we measure variation using the variance. R² compares the variance “captured” by the explanatory variable to the amount of variation in the response variable on its own. To illustrate, consider the Hill_races data frame. Taking the winning running times as the response variable, we might wonder how much of the variation in time is accounted for by the race characteristics, for instance by the length of the race course (in km).

Here’s the variance of the time variable:

Hill_racing |> summarize(var(time, na.rm = TRUE), sd(time, na.rm = TRUE))

var(time, na.rm = TRUE)
sd(time, na.rm = TRUE)


9754276
3123.184

var(time, na.rm = TRUE)	sd(time, na.rm = TRUE)
9754276	3123.184

As always, the units of the variance are the square of the units of the variable. Since time is in seconds, var(time) has units of “seconds-squared.” The standard deviation, which is the square root of the variance, is often easier to understand as an “amount.” That the standard deviation is about 3000 s, about an hour, means that the running times of the various races collected in Hill_racing range over hours: very different races are included in the data frame.

Naturally, the races differ from one another. Among other things, they differ in distance (in km). We can model time versus difference and look at the coefficients:

Hill_racing |> model_train(time ~ distance) |> conf_interval()

term
.lwr
.coef
.upr


(Intercept)
-296.1214
-210.9137
-125.7060

distance
374.4936
381.0230
387.5524

term	.lwr	.coef	.upr
(Intercept)	-296.1214	-210.9137	-125.7060
distance	374.4936	381.0230	387.5524

The units of the distance coefficient are seconds-per-kilometer (s/km). Three hundred eighty seconds per kilometer is a pace slightly slower than six minutes per km, or about ten miles per hour: a ten-minute mile. These are the winning times in the races. You might be tempted to think that these races are for casual runners.

R² provides another way to summarize the model.

Hill_racing |> model_train(time ~ distance) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2226
1
0.854827
13095.65
0.8547617
0
1
2224

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2226	1	0.854827	13095.65	0.8547617	0	1	2224

The R² for the model is 0.85. A simple explanation is that the race distance explains 85% of the variation from race to race in running time: the large majority. This is no surprise to those familiar with racing: a 440 m race takes much less time than a 10,000-meter race. What might account for the other 15% of the variation in time? There are many possibilities.

An important feature of Scottish hill racing is the … hills. Many races feature substantial climbs. How much of the variation in race time is explained by the height (in m) of the climb? R² provides a ready answer:

Hill_racing |> model_train(time ~ climb) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2224
1
0.7650186
7234.066
0.7649128
0
1
2222

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2224	1	0.7650186	7234.066	0.7649128	0	1	2222

The height of the climb also explains a lot of the variation in time: about three-quarters of it.

To know how much of the time variance climb and distance together explain, don’t simply add together the individual R². By trying it, you can see why in this case: the amount of variation explained is 85% + 76% = 161%. That should strike you as strange! No matter how good the explanatory variables, they can never explain more than 100% of the variation in the response variable.

The source of the impossibly large R² is that, to some extent, both time and climb share in the explanation; the two explanatory variables each explain much the same thing. We avoid such double-counting by including both explanatory variables at the same time:

Hill_racing |> model_train(time ~ distance + climb) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2224
2
0.9223273
13186.68
0.9222574
0
2
2221

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2224	2	0.9223273	13186.68	0.9222574	0	2	2221

Taken together, distance and climb account for 92% of the variation in race time. This leaves at most 8% of the variation yet to be explained: the residual variance.

R² and categorical explanatory variables

Consider this question: Does the name of the race along with 565 other race names recorded as the race variable) influence the running time for the race?

There are 154 levels in the race variable, which records the name of the race. (name is the name of the runner.) Examples: Glen Rosa Horseshoe, Ben Nevis Race, …

Here’s a simple model:

Race_name_model1 <- Hill_racing |> model_train(time ~ race)

R² offers a much easier-to-interpret summary than the model coefficients in this situation. Here are the model coefficients:

Race_name_model1 |> conf_interval() -> Goo

term
.lwr
.coef
.upr


(Intercept)
1300
2010.0
2710

raceAlex Brett Cioch Mhor
1950
2710.0
3480

raceAlva Games Hill Race
-1340
-580.0
182

raceAonach Mor UKA event (men)
-1600
-23.7
1550

raceAonach Mor UKA event (women)
-893
329.0
1550

raceAonach Mor Uphill
-1060
-224.0
612



      ... for 154 coefficient altogether

term	.lwr	.coef	.upr
(Intercept)	1300	2010.0	2710
raceAlex Brett Cioch Mhor	1950	2710.0	3480
raceAlva Games Hill Race	-1340	-580.0	182
raceAonach Mor UKA event (men)	-1600	-23.7	1550
raceAonach Mor UKA event (women)	-893	329.0	1550
raceAonach Mor Uphill	-1060	-224.0	612

The reference race is the Aberfeldy Games Race. (Why? Because Aberfeldy is first alphabetically.) Each coefficient on another race gives a result by comparison to Aberfeldy.

The question that started this section was not about the Alva Games Hill Race, the Aonach Mor Uphill, or any of the others, but about the whole collection of differently named races. The R² summary brings all the races together to measure the amount of time variance “explained” by the race names:

Race_name_model1 |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2226
153
0.9505387
260.2572
0.9468864
0
153
2072

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2226	153	0.9505387	260.2572	0.9468864	0	153	2072

This is a strikingly large R². Based on this, some people might be tempted to think that a race’s name plays a big part in the race outcome. Statistical thinkers, however, will wonder whether the race name encodes other information that explains the race outcome. For example, we can look at how well the race name “explains” the race distance and climb:

Hill_racing |> model_train(distance ~ race) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2236
153
1
Inf
1
0
153
2082

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2236	153	1	Inf	1	0	153	2082

Hill_racing |> model_train(climb ~ race) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2234
152
1
Inf
1
0
152
2081

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2234	152	1	Inf	1	0	152	2081

The race name “explains” 100% of the variation for both’ distance’ and’ climb’. That’s because each distinct race has its own distance and climb. So, the race name carries all the information in the distance and climb variables.

By adjusting for distance and climb, we can ask more focused questions about the possible role of the race name in determining. First, recall that just distance and climb account for 92% of the variance in time.

Hill_racing |> model_train(time ~ distance + climb) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2224
2
0.9223273
13186.68
0.9222574
0
2
2221

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2224	2	0.9223273	13186.68	0.9222574	0	2	2221

Adding in race increases the R² by only three percentage points, to 95%. (See ?exr-Q30-201.)

Degrees of freedom

Using categorical variables with a large number of levels are used as explanatory variables, a new phenomenon becomes apparent, a sort of mirage of explanation. To illustrate, consider the model time ~ name. There are five-hundred sixty-seven unique runners in the Hill_racing data.

Hill_racing |> model_train(time ~ name) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2226
565
0.42936
2.210645
0.2351361
0
565
1660

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2226	565	0.42936	2.210645	0.2351361	0	565	1660

The runner’s identity accounts for about 43% of the variance in running time. Understandably, the R² is not much higher: runners participate in multiple races with different distances and climbs so it’s natural for an individual runner to have a spread of running time.

Let’s experiment to illustrate the difficulty of interpreting R² when there are many levels in a categorical explanatory variable. We will create a new variable consisting only of random noise.

Hill_racing <- Hill_racing |> mutate(noise = rnorm(n()))

Naturally, there is no genuine explanation of noise. For instance, distance and climb account for 92% of the actual running times, but a trivial percentage of the noise:

Hill_racing |> model_train(noise ~ distance + climb) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2234
2
0.0017704
1.978339
0.0008755
0.138541
2
2231

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2234	2	0.0017704	1.978339	0.0008755	0.138541	2	2231

In contrast, name, with its 567 different levels, seems to “explain” a lot of noise:

Hill_racing |> model_train(noise ~ name) |> R2()

n
k
Rsquared
F
adjR2
p
df.num
df.denom


2236
566
0.2467746
0.9660854
-0.0086631
0.6926547
566
1669

n	k	Rsquared	F	adjR2	p	df.num	df.denom
2236	566	0.2467746	0.9660854	-0.0086631	0.6926547	566	1669

The 567 names explain about one-quarter of the variance in noise, which ought not to be explainable at all!

How can name explain something that it has no connection with? First, note that the Hill_racing sample size is n=2236. (You can see the sample size in all R² reports under the name n.) When we fit the model noise ~ name, there will be 567 different coefficients, one of which is the intercept and 566 of which relate to name. This number—labelled k in the R² report—is called the “degrees of freedom” of the model.

In general, models with more degrees of freedom can explain more of the response variable, even when there is nothing to explain. On average, the R² in a nothing-to-explain situation will be roughly k/n. For the noise ~ name model, the k-over-n ratio is 566/2236 = 0.25.

Small data

In some situations, a sample may include just a handful of specimens, say $n = 5$ . A simple model, such as y ~ x, will have a small number of degrees of freedom. With y ~ x, there are two coefficients: the intercept and the coefficient on x. With only a single non-intercept coefficient, the model degrees of freedom is $k = 1$ .

Nonetheless, the typical R² from such a model, even when y and x are completely unrelated, will be at least $k / n = 0.20$ . It’s tempting to interpret an R² of 0.20 as the sign of a relationship between y and x. To avoid such misinterpretations, statistical formulas and software carefully track k and n and arrange things to compensate.

One simple compensation for model degrees of freedom is “adjusted R².” The adjustment is roughly this: take R² and subtract $k / n$ . Insofar as there is no relationship between the response and explanatory variables, this will bring R² down to about zero. An adjusted R² greater than zero indicates a relationship between the response and explanatory variables. Adjusted R² is useful when the goal is to ascertain whether there is a substantial relationship. This goal is common in fields such as econometrics.

Statistics textbooks favor other styles of adjustment that are, perhaps surprisingly, not oriented to pointing to a substantial relationship. A famous style of adjustment is encapsulated in the t statistic, which applies to models with only a single degree of freedom. A generalization of t to models with more degrees of freedom is the F statistic.

Exercises

Exercise 17.1

Comparing models with ANOVA

Modelers are often in the position of having a model that they like but are contemplating adding one or more additional explanatory variables. To illustrate, consider the following models:

Model 1: list_price ~ 1
Model 2: list_price ~ 1 + hard_paper
Model 3: list_price ~ 1 + hard_paper + num_pages
Model 4: list_price ~ 1 + hard_paper + num_pages + weight_oz

All the explanatory variables in the smaller models also apply to the bigger models. Such sets are said to be “nested” in much the same way as for Russian dolls.

For a nested set of models, R² can never decrease when moving from a smaller model to a larger one—almost always, there is an increase in R². To demonstrate:

amazon_books <- amazon_books |> 
  select(list_price, weight_oz, num_pages, hard_paper) 
amazon_books <- amazon_books |>
  filter(complete.cases(amazon_books))
model1 <- amazon_books |> 
  model_train(list_price ~ 1)
model2 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz)
model3 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz + num_pages)
model4 <- amazon_books |> 
  model_train(list_price ~ 1 + weight_oz + num_pages + hard_paper)

R2(model1)

n
k
Rsquared
F
adjR2
p
df.num
df.denom


314
0
0
NaN
0
NaN
0
313

n	k	Rsquared	F	adjR2	p	df.num	df.denom
314	0	0	NaN	0	NaN	0	313

R2(model2)

n
k
Rsquared
F
adjR2
p
df.num
df.denom


314
1
0.16
57
0.15
0
1
312

n	k	Rsquared	F	adjR2	p	df.num	df.denom
314	1	0.16	57	0.15	0	1	312

R2(model3)

n
k
Rsquared
F
adjR2
p
df.num
df.denom


314
2
0.17
31
0.16
0
2
311

n	k	Rsquared	F	adjR2	p	df.num	df.denom
314	2	0.17	31	0.16	0	2	311

R2(model4)

n
k
Rsquared
F
adjR2
p
df.num
df.denom


314
3
0.17
21
0.16
0
3
310

n	k	Rsquared	F	adjR2	p	df.num	df.denom
314	3	0.17	21	0.16	0	3	310

When adding explanatory variables to a model, a good question is whether the new variable(s) add to the ability to account for the variability in the response variable. R² never goes down when moving from a smaller to a larger model, so we cannot rely on the increase in R². A valuable technique called “Analysis of Variance” (ANOVA for short) looks at the incremental change in variance explained from a smaller model to a larger one. The increase can be presented as an F statistic. To illustrate:

anova_summary(model1, model2, model3, model4)

term
df.residual
rss
df
sumsq
statistic
p.value


list_price ~ 1
313
54606
NA
NA
NA
NA

list_price ~ 1 + weight_oz
312
46122
1
8484
58.0
0.00

list_price ~ 1 + weight_oz + num_pages
311
45513
1
609
4.2
0.04

list_price ~ 1 + weight_oz + num_pages + hard_paper
310
45338
1
175
1.2
0.27

term	df.residual	rss	df	sumsq	statistic	p.value
list_price ~ 1	313	54606	NA	NA	NA	NA
list_price ~ 1 + weight_oz	312	46122	1	8484	58.0	0.00
list_price ~ 1 + weight_oz + num_pages	311	45513	1	609	4.2	0.04
list_price ~ 1 + weight_oz + num_pages + hard_paper	310	45338	1	175	1.2	0.27

Focus on the column named statistic. This records the F statistic. The move from Model 1 to Model 2 produces F=57, well above the threshold described above and clearly indicating that the weight_oz variable accounts for some of the list price. Moving from Model 2 to Model 3 creates a much less impressive F of 3.8. It is as if the added explanatory variable, num_pages, is just barely pulling its own “weight.” Finally, moving from Model 3 to Model 4 produces a below-threshold F of 1.3. In other words, in the context of weight_oz and num_pages, the hard_paper variable does not carry additional information about the list price.

The last column of the report, labeled Pr(>F), translates F into a universal 0 to 1 scale called a p-value. A large F produces a small p-value. The rule of thumb for reading p-values is that a value $p < 0.05$ indicates that the added variable brings new information about the response variable. We will return to p-values and the controversy they have entailed in Lesson 29.

id=Q17-301

Exercise 17.2

In-sample versus out-of-sample R²

The R² reported on a model is always optimistic; it underestimates the residual noise from the model. The “adjusted R²” attempts to correct for this optimism. Adjusted R² is based on a theory that’s appropriate for linear regression models.

Increasingly, particularly when it comes to models used as parts of artificial intelligence systems, nonlinear modeling methods are used, for example “neural networks” or “random forests.” In assessing the predictive ability of these models, a computational technique is used to avoid overly optimistic models.

DEMONSTRATE in- and out-of-sample

id=Q17-302

Fraction of variance explained

R2 and categorical explanatory variables

Degrees of freedom

Exercises

R² and categorical explanatory variables