```r
Hill_racing |> summarize(var(time, na.rm = TRUE), sd(time, na.rm = TRUE))
```

| var(time, na.rm = TRUE) | sd(time, na.rm = TRUE) |
|---|---|
| 9754276 | 3123.184 |

As always, the units of the variance are the square of the units of the variable. Since `time` is in seconds, `var(time)` has units of "seconds-squared." The standard deviation, which is the square root of the variance, is often easier to understand as an "amount." That the standard deviation is about 3000 s, roughly an hour, means that the running times of the various races collected in `Hill_racing` range over hours: very different races are included in the data frame.
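The link between the two summaries is a square root, which can be checked with plain arithmetic on the values printed above:

```r
# The standard deviation is the square root of the variance:
sqrt(9754276)   # about 3123, matching sd(time, na.rm = TRUE)

# Expressed in hours, the standard deviation is close to one hour:
3123 / 3600     # about 0.87 hours
```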

Naturally, the races differ from one another. Among other things, they differ in `distance` (in km). We can model `time` versus `distance` and look at the coefficients:

```r
Hill_racing |> model_train(time ~ distance) |> conf_interval()
```

| term | .lwr | .coef | .upr |
|---|---|---|---|
| (Intercept) | -296.1214 | -210.9137 | -125.7060 |
| distance | 374.4936 | 381.0230 | 387.5524 |

The units of the `distance` coefficient are seconds-per-kilometer (s/km). Three hundred eighty seconds per kilometer is a pace slightly slower than six minutes per km, or about six miles per hour: a ten-minute mile. These are the winning times in the races. You might be tempted to think that these races are for casual runners.
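The pace conversion can be checked directly (plain arithmetic, using 1 mile ≈ 1.609 km):

```r
# Convert the distance coefficient, about 381 s/km, into familiar paces:
381 / 60                  # minutes per kilometer: about 6.4
381 * 1.609 / 60          # minutes per mile: about 10.2
60 / (381 * 1.609 / 60)   # miles per hour: about 5.9
```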

R² provides another way to summarize the model.

```r
Hill_racing |> model_train(time ~ distance) |> R2()
```

| n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
|---|---|---|---|---|---|---|---|
| 2226 | 1 | 0.854827 | 13095.65 | 0.8547617 | 0 | 1 | 2224 |

The R² for the model is 0.85. A simple explanation is that the race distance explains 85% of the variation from race to race in running time: the large majority. This is no surprise to those familiar with racing: a 440 m race takes much less time than a 10,000 m race. What might account for the other 15% of the variation in time? There are many possibilities.
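For readers who want to see where the number comes from, here is a sketch using base R's `lm()` rather than the book's helpers; it assumes `Hill_racing` is loaded. For a model with an intercept, R² is the variance of the fitted model values expressed as a fraction of the variance of the response:

```r
mod <- lm(time ~ distance, data = Hill_racing)
# R-squared: variance of the model output relative to the variance of time
# (mod$model is the model frame, with missing rows already dropped)
var(fitted(mod)) / var(mod$model$time)   # about 0.85
```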

An important feature of Scottish hill racing is the ... hills. Many races feature substantial climbs. How much of the variation in race `time` is explained by the height (in m) of the `climb`? R² provides a ready answer:

```r
Hill_racing |> model_train(time ~ climb) |> R2()
```

| n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
|---|---|---|---|---|---|---|---|
| 2224 | 1 | 0.7650186 | 7234.066 | 0.7649128 | 0 | 1 | 2222 |

The height of the `climb` *also* explains a lot of the variation in `time`: about three-quarters of it.

To know how much of the `time` variance `climb` and `distance` together explain, don't simply add the individual R² values. Trying it shows why not, at least in this case: the amount of variation "explained" would be 85% + 76% = 161%. That should strike you as strange! No matter how good the explanatory variables, they can never explain more than 100% of the variation in the response variable.

The source of the impossibly large R² is that, to some extent, `distance` and `climb` share in the explanation; the two explanatory variables each explain much the same thing. We avoid such double-counting by including both explanatory variables in the model at the same time:

```r
Hill_racing |> model_train(time ~ distance + climb) |> R2()
```

| n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
|---|---|---|---|---|---|---|---|
| 2224 | 2 | 0.9223273 | 13186.68 | 0.9222574 | 0 | 2 | 2221 |
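The three R² values are consistent with the double-counting story. As a rough, Venn-diagram-style arithmetic sketch (not an exact decomposition), the share of the variation that the two explanatory variables explain in common is:

```r
# distance alone, plus climb alone, minus the two together:
0.855 + 0.765 - 0.922   # about 0.70: the share counted twice
```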

Taken together, `distance` and `climb` account for 92% of the variation in race `time`. This leaves at most 8% of the variation yet to be explained: the **residual variance**.
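To put the residual variance back on the scale of seconds, a sketch using the summary numbers printed earlier (total variance of `time` about 9754276 s²):

```r
# Unexplained fraction times the total variance of time:
(1 - 0.922) * 9754276         # residual variance, about 760,000 s^2
sqrt((1 - 0.922) * 9754276)   # residual standard deviation, about 870 s
```

On the standard-deviation scale, the unexplained part of `time` amounts to roughly 15 minutes, compared to the original spread of about an hour.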