
Lesson 22 took a simulation approach to observing sampling variation: generate many trials from a source such as a DAG and observe how the same sample statistic varies from trial to trial. We quantified the sampling variation in the same way we usually quantify variation, taking the **variance** of the sample statistic across all the trials. We called this measure of variation the **sampling variance** as a reminder that it comes from repeated trials of sampling.

The variance of a quantity has units that are the *square* of the quantity’s units. For purposes of interpretation, we often present variation using the square root of the variance, that is, the *standard deviation*. Following this practice, Lesson 22 introduced the square root of the sampling variance. Common sense might suggest that this ought to be called the “sampling standard deviation,” but that is long-winded and awkward. Instead, the square root of the sampling variation is called the “**standard error**” of the sample statistic. Unfortunately, this traditional name contains no reminder that it refers to sampling variation. So be careful to remember that “standard error” is always about sampling variation.
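To make the vocabulary concrete, here is a minimal simulation sketch (in Python rather than this book's R, purely so the idea stands alone; the source and sample size are invented) computing a sampling variance and its square root, the standard error:

```python
import random
import statistics

random.seed(1)

# One "trial": draw a sample of n values from a made-up source and
# compute a sample statistic (here, the sample mean).
def one_trial(n=25):
    sample = [random.gauss(100, 15) for _ in range(n)]
    return statistics.mean(sample)

# Repeat the trial many times and watch the statistic vary trial to trial.
trials = [one_trial() for _ in range(2000)]

sampling_variance = statistics.variance(trials)  # units are squared
standard_error = sampling_variance ** 0.5        # back to the original units

print(round(standard_error, 2))  # close to 15 / sqrt(25) = 3
```

The square root at the end is the whole point: the standard error lives in the same units as the statistic itself.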

In everyday language the words “precision” and “accuracy” are used more or less interchangeably to describe how well a measurement has been made. Nevertheless there are two distinct concepts in “how well.” The easier concept has to do with reproducibility and reliability: if the measurement is taken many times, how much will the measurements differ from one another. This is the same issue as *sampling variation*. In the technical lingo of measurement, “**precision**” is used to express the idea of reproducibility or sampling variation. Precision is just about the measurements themselves.

In contrast, in speaking technically we use “**accuracy**” to refer to a different concept than “precision.” Accuracy cannot be computed with just the measurements. Accuracy refers to something outside the measurements, what we might call the “true” value of what we are trying to measure. Disappointingly, the “true” value is an elusive quantity since all we typically have is our measurements. We can easily measure precision from data, but our data have practically nothing to say about accuracy.

An analogy is often made between precision and accuracy and the patterns seen in archery. Figure 23.1 shows five arrows shot during archery practice. The arrows are in an area about the size of a dinner plate 6 inches in radius: that’s the precision.

A dinner-plate’s precision is not bad for a beginner archer. Unfortunately, the dinner plate is not centered on the bullseye but about 10 inches higher. In other words, the arrows are inaccurate by about 10 inches.

Since the “true” target is visible, it is easy to know the accuracy of the shooting. The analogy of archery to the situation in statistics would be better if the target were plain white, that is, if the “true” value were not known directly. In that situation, as with data analysis, the spread in the arrows’ locations could tell us only about the precision.

The standard error is a perfectly reasonable way to measure precision. Nonetheless, the statistical convention for reporting precision is as an **interval** called the “**confidence interval**.” There are two equivalent ways to write the interval, either as [lower, upper] or center\(\pm\)half-width. Both styles are correct. (The preferred style can depend on the field or the journal publishing the report.)
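Converting between the two styles is simple arithmetic. A sketch in Python (the helper names are my own, not from any statistics package), using the `climb` coefficient of the `Hill_racing` model as the example:

```python
# Convert between the two equivalent ways of writing an interval.
# (These helper functions are illustrative, not from any package.)

def to_plus_minus(lower, upper):
    """[lower, upper]  ->  (center, half-width)"""
    return (lower + upper) / 2, (upper - lower) / 2

def to_bounds(center, half_width):
    """center +/- half-width  ->  (lower, upper)"""
    return center - half_width, center + half_width

# The climb interval: [2.493307, 2.726209]
center, half_width = to_plus_minus(2.493307, 2.726209)
print(round(center, 6), round(half_width, 6))  # 2.609758 0.116451
```

Notice that the center of the interval is exactly the coefficient itself (the `.coef` column).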

The overall length of the interval is four times the standard error. Or, equivalently, the half-width is twice the standard error. Why twice? Returning to the archery analogy, we want the interval to include almost all the arrows. It turns out that if the standard error were used directly as the half-width of the confidence interval, only about 66% of the arrows would be inside the interval. Using twice the standard error as the half-width means that about 95% of the arrows will be in the interval.

The traditional name for the half-width of the confidence interval is the “**margin of error**.” The margin of error is twice the standard error.
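The coverage fractions quoted above are easy to check by simulation. This Python sketch (an illustration with made-up numbers, not part of the book's R workflow) scatters “arrows” around a center and counts how many land within one or two standard errors:

```python
import random

random.seed(7)

# Ten thousand "arrows" scattered around a center, with a known spread.
# The spread plays the role of the standard error.
se = 1.0
arrows = [random.gauss(0, se) for _ in range(10_000)]

def coverage(half_width):
    """Fraction of arrows within center +/- half_width."""
    return sum(abs(a) <= half_width for a in arrows) / len(arrows)

print(coverage(1 * se))  # roughly two-thirds of the arrows
print(coverage(2 * se))  # about 95% of the arrows
```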

In practice, confidence intervals are calculated using special-purpose software such as the `conf_interval()` function, for instance:

Note: Experienced R users may have encountered the `confint()` function. It does exactly the same calculation as `conf_interval()`, but `conf_interval()` formats the output into a data frame, making it more suitable for data wrangling the results.

```
Hill_racing %>%
  lm(time ~ distance + climb, data = .) %>%
  conf_interval()
```

term | .lwr | .coef | .upr
---|---|---|---
(Intercept) | -533.432471 | -469.976937 | -406.521402
distance | 246.387096 | 253.808295 | 261.229494
climb | 2.493307 | 2.609758 | 2.726209

Notice that there is a separate confidence interval for each model coefficient. The sampling variation is essentially the same, but that variation appears different when translated to the various coefficients’ units.

The name “confidence interval” is used universally, but it can be a little misleading for those starting out in statistics. The word “confidence” in “confidence interval” has *nothing to do* with self-assuredness, boldness, or confidentiality. A more descriptive name is “**precision interval**.” For example, the mass of the Earth is known quite precisely, \(5.9722\pm 0.0005 \times 10^{24}\text{kg}\).

In Lesson 22, we repeated trials over and over again to gain some feeling for sampling variation. We quantified the repeatability in any of several closely related ways: the sampling variance or its square root (the “standard error”) or a “margin of error” or a “confidence interval.” Our experiments with simulations demonstrated an important property of sampling variation: the amount of sampling variation depends on the sample size \(n\). In particular, the sampling variance gets smaller as \(n\) increases in proportion to \(1/n\). (Consequently, the standard error gets smaller in proportion to \(1/\sqrt{n}\).)
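The \(1/\sqrt{n}\) behavior can be seen in a short simulation. This Python sketch (invented numbers, for illustration only) estimates the standard error of a sample mean at two sample sizes; quadrupling \(n\) should roughly halve the SE:

```python
import random
import statistics

random.seed(3)

def se_of_mean(n, trials=4000, sd=10):
    """Simulate the standard error of a sample mean at sample size n."""
    means = [statistics.mean(random.gauss(0, sd) for _ in range(n))
             for _ in range(trials)]
    return statistics.stdev(means)

se_25 = se_of_mean(n=25)    # theory: 10 / sqrt(25)  = 2.0
se_100 = se_of_mean(n=100)  # theory: 10 / sqrt(100) = 1.0

# Quadrupling n roughly halves the standard error.
print(se_25 / se_100)  # close to 2
```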

It is time to take off the DAG simulation training wheels and measure sampling variation from a *single* data frame. Our first approach will be to turn the single sample into several smaller samples: subsampling. Later, we will turn to another technique, resampling, which draws a sample of full size from the data frame. Sometimes, in particular with regression models, it is possible to calculate the sampling variation from a formula, allowing software to carry out and report the calculations automatically.

The next sections show two approaches to calculating a confidence interval. For the most part, this is background information to show you how it’s possible to measure sampling variation from a single sample. Usually you will use `conf_interval()` or similar software for the calculation.

Although computing a confidence interval is a simple matter in software, it is helpful to have a conceptual idea of what is behind the computation. This section and Section 23.3.2 describe two methods for calculating a confidence interval from a single sample. The `conf_interval()` summary function uses yet another method that is more mathematically intricate, but which we won’t describe here.

To “subsample” means to draw a smaller sample from a large one. “Small” and “large” are relative. For our example, we turn to the `TenMileRace` data frame containing the record of thousands of runners’ times in a race, along with basic information about each runner. There are many ways we could summarize `TenMileRace`. Any summary would do for the example. We will summarize the relationship between the runners’ ages and their start-to-finish times (variable `net`), that is, `net ~ age`. To avoid the complexity of a runner’s improvement with age followed by a decline, we will limit the study to people over 40.

```
TenMileRace %>% filter(age > 40) %>%
  lm(net ~ age, data = .) %>% conf_interval()
```

term | .lwr | .coef | .upr
---|---|---|---
(Intercept) | 4014.7081 | 4278.21279 | 4541.71744
age | 22.8315 | 28.13517 | 33.43884

The units of `net` are seconds, and the units of `age` are years. The model coefficient on `age` tells us how the `net` time changes for each additional year of `age`: seconds per year. Using the entire data frame, we see that the time to run the race gets longer by about 28 seconds per year. So a 45-year-old runner who completed this year’s 10-mile race in 3900 seconds (about 9.2 mph, a pretty good pace!) might expect that, in ten years, when she is 55 years old, her time will be longer by about 280 seconds.

It would be asinine to report the ten-year change as 281.3517 seconds. The runner’s time ten years from now will be influenced by the weather, crowding, the course conditions, whether she finds a good pace runner, the training regime, improvements in shoe technology, injuries, and illnesses, among other factors. There is little or nothing we can say from the `TenMileRace` data about such factors.

There’s also sampling variation. There are 2898 people older than 40 in the `TenMileRace` data frame. The way the data was collected (radio-frequency interrogation of a dongle on the runner’s shoe) suggests that the data is a census of finishers. However, it is also fair to treat it as a sample of the kind of people who run such races. People might have been interested in running but had a schedule conflict, lived too far away, or missed their train to the start line in the city.

We see sampling variation by comparing multiple samples. To create those multiple samples from `TenMileRace`, we will draw, at random, subsamples of, say, one-tenth the size of the whole, that is, \(n=290\):

```
Over40 <- TenMileRace %>% filter(age > 40)
lm(time ~ age, data = Over40 %>% sample(size=290)) %>% conf_interval()
```

term | .lwr | .coef | .upr
---|---|---|---
(Intercept) | 3163.95021 | 4040.56678 | 4917.18336
age | 21.41999 | 39.02011 | 56.62023

```
lm(time ~ age, data = Over40 %>% sample(size=290)) %>% conf_interval()
```

term | .lwr | .coef | .upr
---|---|---|---
(Intercept) | 4751.21767 | 5695.660073 | 6640.10247
age | -16.68618 | 2.420675 | 21.52753

The age coefficients from these two subsampling trials differ considerably from one to the other: about 39 versus 2.4 seconds per year. To get a more systematic view, run more trials:

```
# a sample of summaries
Trials <- do(1000) * {
  lm(time ~ age, data = sample(Over40, size=290)) %>% conf_interval()
}
# a summary of the sample of summaries
Trials %>%
  group_by(term) %>%
  dplyr::summarize(se = sd(.coef))
```

term | se
---|---
(Intercept) | 437.044245
age | 8.842183

We used the name `se` for the summary of the samples of summaries because what we have calculated is the standard error of the age coefficient from samples of size \(n=290\).

In Lesson 22 we saw that the standard error is proportional to \(1/\sqrt{\strut n}\), where \(n\) is the sample size. From the subsamples, we know that the SE for \(n=290\) is about 9.0 seconds. This tells us that the SE for the full \(n=2898\) sample would be about \(9.0 \frac{\sqrt{290}}{\sqrt{2898}} = 2.85\) seconds.
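That rescaling is a one-line calculation. Here it is as a Python sketch (Python only for illustration; the numbers come from the subsampling trials above):

```python
import math

# Scale a standard error measured on subsamples of size n_sub up to the
# full sample size n_full, using SE proportional to 1/sqrt(n).
se_sub = 9.0             # SE of the age coefficient at n = 290
n_sub, n_full = 290, 2898

se_full = se_sub * math.sqrt(n_sub / n_full)
print(round(se_full, 2))  # 2.85
```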

So the interval summary of the `age` coefficient—the *confidence interval*—is \[\underbrace{28.1}_\text{age coef.} \pm 2\times \underbrace{2.85}_\text{standard error} = 28.1 \pm \underbrace{5.7}_\text{margin of error}\ \ \text{or, equivalently, 22.4 to 33.8.}\]

There is a trick, called “**resampling**,” to generate a random subsample of a data frame with the same \(n\) as the data frame: draw the new sample randomly from the original sample **with replacement**. An example will suffice to show what the “with replacement” does:

```
example <- c(1, 2, 3, 4, 5)
# without replacement
sample(example)
```

`[1] 1 4 3 5 2`

```
# now, with replacement
sample(example, replace=TRUE)
```

`[1] 2 4 3 3 5`

```
sample(example, replace=TRUE)
```

`[1] 3 5 4 4 4`

```
sample(example, replace=TRUE)
```

`[1] 1 1 2 2 3`

```
sample(example, replace=TRUE)
```

`[1] 4 3 1 4 5`

The “with replacement” leads to the possibility that some values will be repeated two or more times and other values will be left out entirely.
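The same distinction exists outside R. For comparison, Python's standard library separates the two modes into `random.sample()` (without replacement) and `random.choices()` (with replacement):

```python
import random

random.seed(5)
example = [1, 2, 3, 4, 5]

# Without replacement: a reshuffling; every original value appears once.
without = random.sample(example, k=len(example))

# With replacement: each draw is independent, so some values may repeat
# and others may be left out entirely.
with_repl = random.choices(example, k=len(example))

print(sorted(without))  # [1, 2, 3, 4, 5]
print(with_repl)        # may contain repeats
```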

The calculation of the SE using resampling is called “**bootstrapping**.”
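A bare-bones version of bootstrapping fits in a few lines. This Python sketch (the data and sample statistic are invented for illustration) resamples a single observed sample at full size, with replacement, and takes the standard deviation of the statistic across the resamples:

```python
import random
import statistics

random.seed(11)

# A single observed sample (values invented for illustration).
observed = [random.gauss(50, 8) for _ in range(200)]

def bootstrap_se(data, stat=statistics.mean, trials=1000):
    """Estimate the standard error of `stat` by drawing resamples of
    full size, with replacement, and taking the sd across trials."""
    n = len(data)
    resampled = [stat(random.choices(data, k=n)) for _ in range(trials)]
    return statistics.stdev(resampled)

se = bootstrap_se(observed)
print(se)  # close to 8 / sqrt(200), about 0.57
```

Notice that no new data is needed: the spread across resamples stands in for the trial-to-trial sampling variation that Lesson 22 generated from a DAG.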