23  Confidence intervals

Lesson 22 took a simulation approach to observing sampling variation: generate many trials from a source such as a DAG and observe how the same sample statistic varies from trial to trial. We quantified the sampling variation in the same way we usually quantify variation, taking the variance of the sample statistic across all the trials. We called this measure of variation the sampling variance as a reminder that it comes from repeated trials of sampling.

The variance of a quantity has units that are the square of the quantity’s units. For purposes of interpretation, we often present variation using the square root of the variance, that is, the standard deviation. Following this practice, Lesson 22 introduced the square root of the sampling variance. Common sense might suggest that this ought to be called the “sampling standard deviation,” but that is long-winded and awkward. Instead, the square root of the sampling variance is called the “standard error” of the sample statistic. Unfortunately, this traditional name contains no reminder that it refers to sampling variation. So be careful to remember that “standard error” is always about sampling variation.

Precision in measurement

In everyday language the words “precision” and “accuracy” are used more or less interchangeably to describe how well a measurement has been made. Nevertheless there are two distinct concepts in “how well.” The easier concept has to do with reproducibility and reliability: if the measurement is taken many times, how much the measurements will differ from one another. This is the same issue as sampling variation. In the technical lingo of measurement, “precision” is used to express the idea of reproducibility or sampling variation. Precision is just about the measurements themselves.

In contrast, in speaking technically we use “accuracy” to refer to a different concept than “precision.” Accuracy cannot be computed with just the measurements. Accuracy refers to something outside the measurements, what we might call the “true” value of what we are trying to measure. Disappointingly, the “true” value is an elusive quantity since all we typically have is our measurements. We can easily measure precision from data, but our data have practically nothing to say about accuracy.

A common analogy for precision and accuracy comes from archery. Figure 23.1 shows five arrows shot during archery practice. The arrows land in an area about the size of a dinner plate, 6 inches in radius: that’s the precision.

Figure 23.1: Results from archery practice

A dinner-plate’s precision is not bad for a beginner archer. Unfortunately, the dinner plate is not centered on the bullseye but about 10 inches higher. In other words, the arrows are inaccurate by about 10 inches.

Since the “true” target is visible, it is easy to know the accuracy of the shooting. The analogy of archery to the situation in statistics would be better if the target were plain white, that is, if the “true” value were not known directly. In that situation, as with data analysis, the spread in the arrows’ locations could tell us only about the precision.

Summary

The standard error is a measure of precision: the reproducibility from sample to sample. It tells us nothing about accuracy.

The confidence interval

The standard error is a perfectly reasonable way to measure precision. Nonetheless, the statistical convention for reporting precision is as an interval called the “confidence interval.” There are two equivalent ways to write the interval, either as [lower, upper] or center\(\pm\)half-width. Both styles are correct. (The preferred style can depend on the field or the journal publishing the report.)

The overall length of the interval is four times the standard error. Or, equivalently, the half-width is twice the standard error. Why twice? Returning to the archery analogy, we want the interval to include almost all the arrows. It turns out that if the standard error were used directly as the half-width of the confidence interval, only about 68% of the arrows would be inside the interval. Using twice the standard error as the half-width means that about 95% of the arrows will be in the interval.

The traditional name for the half-width of the confidence interval is the “margin of error.” The margin of error is twice the standard error.
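To make the arithmetic concrete, here is a minimal sketch in plain R. The variable names and numerical values are placeholders for illustration, not output from any particular model; in practice the statistic and its standard error come from your fitted model.

# Sketch only: placeholder values, not from any particular model.
stat <- 17.0                        # a sample statistic, e.g. a model coefficient
se   <- 1.5                         # its standard error
margin_of_error <- 2 * se           # half-width of the (approx. 95%) confidence interval
c(lower = stat - margin_of_error,   # [lower, upper] style
  upper = stat + margin_of_error)
# Equivalent center ± half-width style: 17.0 ± 3.0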

In practice, confidence intervals are calculated using special-purpose software such as the conf_interval() function, for instance:

Hill_racing %>% 
  lm(time ~ distance + climb, data=.) %>% 
  conf_interval()

term             .lwr        .coef         .upr
(Intercept)  -533.432471  -469.976937  -406.521402
distance      246.387096   253.808295   261.229494
climb           2.493307     2.609758     2.726209

Note: Experienced R users may have encountered the confint() function. It does exactly the same calculation as conf_interval(), but conf_interval() formats the output as a data frame, making it more suitable for further data wrangling of the results.

Notice that there is a separate confidence interval for each model coefficient. The underlying sampling variation is the same for all of them, but it shows up differently when translated into each coefficient’s units.

Example: Signal and noise in coefficients

Returning to the example of (adult) children’s height versus the height of the mother, the sampling variation is usually represented by an interval—the confidence interval—on the coefficients.

lm(height ~ mother, data=Galton) |> conf_interval()
term           .lwr   .coef    .upr
(Intercept)  40.300  46.700  53.100
mother        0.213   0.313   0.413

Interpretation: Consider looking at the mean height of many, many (adult-aged) children whose mothers are all 63 inches tall. Compare this to the mean height of many, many (adult-aged) children whose mothers are all 64 inches tall. The mothers differ in height by 1 inch. According to the data in Galton, the means of the two groups will differ by 0.313 inches. But this number is not that precise. It should be written as \(0.3 \pm 0.1\) inches or, in [lower, upper] format, [0.2, 0.4] inches.
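To see where the 0.313-inch difference comes from, here is a minimal sketch (assuming the same Galton data frame used above): evaluate the fitted model at mother heights of 63 and 64 inches and take the difference of the predictions.

mod <- lm(height ~ mother, data = Galton)
# Predicted mean heights of children of 63-inch and 64-inch mothers
preds <- predict(mod, newdata = data.frame(mother = c(63, 64)))
diff(preds)    # about 0.313 inches: exactly the mother coefficient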

The name “confidence interval” is used universally, but it can be a little misleading for those starting out in statistics. The word “confidence” in “confidence interval” has nothing to do with self-assuredness, boldness, or confidentiality. A more descriptive name is “precision interval.” For example, the mass of the Earth is known quite precisely, \((5.9722 \pm 0.0005) \times 10^{24}\ \text{kg}\).

Calculating confidence intervals

In Lesson 22, we repeated trials over and over again to gain some feeling for sampling variation. We quantified the repeatability in any of several closely related ways: the sampling variance or its square root (the “standard error”) or a “margin of error” or a “confidence interval.” Our experiments with simulations demonstrated an important property of sampling variation: the amount of sampling variation depends on the sample size \(n\). In particular, the sampling variance gets smaller as \(n\) increases in proportion to \(1/n\). (Consequently, the standard error gets smaller in proportion to \(1/\sqrt{n}\).)
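A minimal simulation sketch (plain R, not the Lessons’ DAG machinery) shows the \(1/\sqrt{n}\) pattern: each quadrupling of \(n\) roughly halves the standard error of the sample mean.

# For each n, run many trials of the sample mean and take the SD across trials.
set.seed(1)
se_of_mean <- function(n) sd(replicate(2000, mean(rnorm(n))))
sapply(c(25, 100, 400), se_of_mean)   # roughly 0.20, 0.10, 0.05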

It is time to take off the DAG simulation training wheels and measure sampling variation from a single data frame. Our first approach will be to turn the single sample into several smaller samples: subsampling. Later, we will turn to another technique, resampling, which draws a sample of full size from the data frame. Sometimes, in particular with regression models, it is possible to calculate the sampling variation from a formula, allowing software to carry out and report the calculations automatically.

The next sections show two approaches to calculating a confidence interval. For the most part, this is background information to show you how it’s possible to measure sampling variation from a single sample. Usually you will use conf_interval() or similar software for the calculation.

Subsampling

Although computing a confidence interval is a simple matter in software, it is helpful to have a conceptual idea of what is behind the computation. This section and Section 23.3.2 describe two methods for calculating a confidence interval from a single sample. The conf_interval() summary function uses yet another method that is more mathematically intricate, but which we won’t describe here.

To “subsample” means to draw a smaller sample from a large one. “Small” and “large” are relative. For our example, we turn to the TenMileRace data frame containing the record of thousands of runners’ times in a race, along with basic information about each runner. There are many ways we could summarize TenMileRace. Any summary would do for the example. We will summarize the relationship between the runners’ ages and their start-to-finish times (variable net), that is, net ~ age. To avoid the complexity of a runner’s improvement with age followed by a decline, we will limit the study to people over 40.

TenMileRace %>% filter(age > 40) %>%
  lm(net ~ age, data = .) %>% conf_interval()
term              .lwr       .coef         .upr
(Intercept)  4014.7081  4278.21279   4541.71744
age            22.8315    28.13517     33.43884

The units of net are seconds, and the units of age are years. The model coefficient on age tells us how the net time changes for each additional year of age: seconds per year. Using the entire data frame, we see that the time to run the race gets longer by about 28 seconds per year. So a 45-year-old runner who completed this year’s 10-mile race in 3900 seconds (about 9.2 mph, a pretty good pace!) might expect that, in ten years, when she is 55 years old, her time will be longer by 280 seconds.

It would be asinine to report the ten-year change as 281.3517 seconds. The runner’s time ten years from now will be influenced by the weather, crowding, the course conditions, whether she finds a good pace runner, the training regime, improvements in shoe technology, injuries, and illnesses, among other factors. There is little or nothing we can say from the TenMileRace data about such factors.

There’s also sampling variation. There are 2898 people older than 40 in the TenMileRace data frame. The way the data was collected (radio-frequency interrogation of a dongle on the runner’s shoe) suggests that the data is a census of finishers. However, it is also fair to treat it as a sample of the kind of people who run such races. People might have been interested in running but had a schedule conflict, lived too far away, or missed their train to the start line in the city.

We see sampling variation by comparing multiple samples. To create those multiple samples from TenMileRace, we will draw, at random, subsamples of, say, one-tenth the size of the whole, that is, \(n=290\).

Over40 <- TenMileRace %>% filter(age > 40)
lm(time ~ age, data = Over40 %>% sample(size=290)) %>% conf_interval()
term              .lwr       .coef         .upr
(Intercept)  3163.95021  4040.56678   4917.18336
age            21.41999    39.02011     56.62023
lm(time ~ age, data = Over40 %>% sample(size=290)) %>% conf_interval()
term              .lwr        .coef         .upr
(Intercept)  4751.21767  5695.660073   6640.10247
age           -16.68618     2.420675     21.52753

The age coefficients from these two subsampling trials differ from one another by about 37 seconds per year. To get a more systematic view, run more trials:

# a sample of summaries
Trials <- do(1000) * {
  lm(time ~ age, data = sample(Over40, size=290)) %>% conf_interval()
}
# a summary of the sample of summaries
Trials %>% 
  group_by(term) %>% 
  dplyr::summarize(se = sd(.coef))
term                 se
(Intercept)  437.044245
age            8.842183

We used the name se for the summary of samples of summaries because what we have calculated is the standard error of the age coefficient from samples of size \(n=290\).

In Lesson 22 we saw that the standard error is proportional to \(1/\sqrt{\strut n}\), where \(n\) is the sample size. From the subsamples, we know that the SE for \(n=290\) is about 9.0 seconds per year. This tells us that the SE for the full \(n=2898\) sample would be about \(9.0 \frac{\sqrt{290}}{\sqrt{2898}} = 2.85\) seconds per year.
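The scaling arithmetic is a one-line calculation in R (the 9.0 here is the rounded standard error from the subsampling trials above):

se_290  <- 9.0                         # SE estimated from subsamples of size 290
se_full <- se_290 * sqrt(290 / 2898)   # rescale to the full sample size
se_full                                # about 2.85 seconds per year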

So the interval summary of the age coefficient—the confidence interval—is \[\underbrace{28.1}_{\text{age coef.}} \pm\ 2\times \underbrace{2.85}_{\text{standard error}} \ =\ 28.1 \pm \underbrace{5.7}_{\text{margin of error}}\ \ \text{or, equivalently, 22.4 to 33.8 seconds per year}\]

Bootstrapping

There is a trick, called “resampling,” to generate a new random sample with the same \(n\) as the original data frame: draw the new sample randomly from the original sample with replacement. An example will suffice to show what “with replacement” does:

example <- c(1,2,3,4,5)
# without replacement
sample(example)
[1] 1 4 3 5 2
# now, with replacement
sample(example, replace=TRUE)
[1] 2 4 3 3 5
sample(example, replace=TRUE)
[1] 3 5 4 4 4
sample(example, replace=TRUE)
[1] 1 1 2 2 3
sample(example, replace=TRUE)
[1] 4 3 1 4 5

The “with replacement” leads to the possibility that some values will be repeated two or more times and other values will be left out entirely.
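Tabulating one with-replacement draw makes the repeats and omissions easy to count (a quick sketch using the same example vector). If the mosaic package is in use—as the do() notation suggests—it also provides resample(), a shorthand for sample(..., replace = TRUE).

# Count how many times each value appears in a single with-replacement draw.
table(sample(example, replace = TRUE))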

The calculation of the SE using resampling is called “bootstrapping.”

Demonstration: Bootstrapping the standard error

We will apply bootstrapping to find the standard error of the age coefficient from the model time ~ age fit to the Over40 data frame.

There are two steps:

  1. Run many trials, each of which fits the model time ~ age using lm(). From trial to trial, the data used for fitting is a resampling of the Over40 data frame. The result of each trial is the coefficients from the model.

  2. Summarize the trials with the standard deviation of the age coefficients.

# run many trials
Trials <- do(1000) * {
  lm(time ~ age, data = sample(Over40, replace=TRUE)) %>% 
       conf_interval()
}
# summarize the trials to find the SE
Trials %>% 
  group_by(term) %>%
  summarize(se = sd(.coef))
term                 se
(Intercept)  141.634107
age            2.859483
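As a check, combining this bootstrap standard error with the age coefficient from the full-data fit gives roughly the interval that conf_interval() reported at the start of this example. Here is a sketch using the rounded values from the outputs above:

coef_age <- 28.1    # age coefficient from the full-data fit
se_boot  <- 2.86    # bootstrap standard error from the trials above
c(lower = coef_age - 2 * se_boot,
  upper = coef_age + 2 * se_boot)
# roughly [22.4, 33.8], close to conf_interval()'s [22.8, 33.4] for age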