20  Confidence intervals

Lesson 19 took a simulation approach to observing sampling variation: generate many trials from a source such as a DAG and observe how the same sample statistic varies from trial to trial. We quantified the sampling variation in the same way we usually quantify variation, taking the variance of the sample statistic across all the trials. We called this measure of variation the sampling variance as a reminder that it comes from repeated trials of sampling.

In this Lesson, we will examine a more informative format for reporting sampling variation: the confidence interval. We will also consider an important general concept for interpreting confidence intervals: precision of measurement. We will contrast precision with accuracy to help you avoid the common error of mistaking precision for accuracy.

Since precision is a general issue in any kind of quantitative reporting, not just statistical modeling, it might have been better if “precision interval” had been used instead of “confidence interval.” The word “confidence” in “confidence interval” has nothing to do with self-assuredness, boldness, or confidentiality. (When “confidence interval” was introduced in the 1930s, the word was chosen to avoid a once-bitter technical dispute in the philosophy of probability.)

Summary: The confidence interval is a measure of precision: the reproducibility from sample to sample. It tells us nothing about accuracy. Without understanding the difference between “precision” and “accuracy,” it is difficult to interpret confidence intervals appropriately.

Formats for confidence intervals

We have been looking at confidence intervals since Lesson 11, where we introduced the conf_interval() function for displaying model coefficients. To illustrate, consider the running time (in seconds, s) for Scottish hill races as a function of the race distance (in km) and overall height climbed (in meters, m):

Hill_racing |> 
  model_train(time ~ distance + climb) |> 
  conf_interval()
term .lwr .coef .upr
(Intercept) -533.00 -470.00 -407.00
distance 246.00 254.00 261.00
climb 2.49 2.61 2.73

As always, there is a model coefficient for each term mentioned in the model specification, time ~ distance + climb. Here, those terms give an intercept, a coefficient on distance, and a coefficient on climb. Each coefficient comes with two other numbers, called .lwr and .upr in the report, standing for “lower” and “upper.” The confidence interval runs from the lower number to the upper number.

Focus for the moment on the distance coefficient: 253.8 s/km. The confidence interval runs from 246 to 261 s/km. In previous Lessons about model values—the output of the model function when given values for the explanatory variables—we have emphasized the coefficient itself.

Statistical thinkers, knowing that there is sampling variation in any coefficient calculated from a data sample, like to use the word “estimate” to refer to the calculated value. Admittedly, the computer carries out the calculation of the coefficient without mistake and reports it with many digits. But those digits do not incorporate the uncertainty due to sampling variation. That’s the role of the confidence interval.

The meaning of a confidence interval such as the 246-to-261 s/km interval shown above is, “Any other estimate of the coefficient (made with other data) is consistent with ours so long as it falls within the confidence interval.”

An alternative, but entirely equivalent, format for the confidence interval uses \(\pm\) (plus-or-minus) notation. The interval [246, 261] s/km in \(\pm\) format can be written 254 \(\pm\) 8 s/km.
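Converting between the two formats is just arithmetic on the interval endpoints; here is a minimal base-R sketch (the variable names are only for illustration):

# Re-express a [lower, upper] interval as center +/- spread
lwr <- 246; upr <- 261
center <- (lwr + upr) / 2   # 253.5, reported as 254
spread <- (upr - lwr) / 2   # 7.5, reported as 8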

Significant digits?

Another convention for reporting uncertainty—legendarily emphasized by chemistry teachers—involves the number of digits with which to write a number: the “significant digits.” For instance, the distance coefficient reported by the computer is 253.808295 s/km. Were you to put this number in a lab report, you are at risk for a red annotation from your teacher: “Too many digits!”

According to the significant-digits convention, a proper way to write the distance coefficient would be 250 s/km, although some teachers might prefer 254 s/km.

The situation is difficult because the significant-digit convention is attempting to serve three different goals at once. The first goal is to signal the precision of the number. The second goal is to avoid overwhelming human readers with irrelevant digits. The third goal is to allow human readers to redo calculations. These three goals sometimes compete. An example is the [246, 261] s/km confidence interval on the distance coefficient reported earlier. For this coefficient, the width of the confidence interval is about 15 s/km. This suggests that there is no value to the human reader in reporting any digits after the decimal point. But a literal translation of [246, 261] into \(\pm\) format would be 253.5 \(\pm\) 7.5. Now there is a digit being reported after the decimal point, a digit we previously said isn’t worth reporting!

As a general-purpose procedure, I suggest the following principles for model coefficients:

  1. Always report an interval in either the [lower, upper] format or the center \(\pm\) spread format. It doesn’t much matter which one.
  2. As a guide to the number of digits to print, look to the interval width, calculated as upper \(-\) lower or as 2 \(\times\) spread. Print the number using the width as a guide: only the first two digits of the width (neglecting leading zeros) are worth anything. (A small sketch after this list shows one way to automate this.)
  3. When interpreting intervals, don’t put much stock in the last digit. For example, is 245 s/km inside the interval [246, 261] s/km? Not mathematically. But remembering that the last digit in 246 is not to be taken as absolute, 245 is for all practical purposes inside the interval.
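Here is a minimal sketch of principle 2 in R. The function round_to_interval() is purely illustrative—it is not part of any package—and simply rounds a coefficient to the decimal place of the second significant digit of the interval width:

# Keep only the digits justified by the interval width: the last digit kept
# lines up with the second significant digit of the width.
# (round_to_interval() is an illustration, not a function from any package.)
round_to_interval <- function(coef, lwr, upr) {
  width <- upr - lwr
  digits <- 1 - floor(log10(width))   # place of the width's second sig. digit
  round(coef, digits)
}

round_to_interval(253.808295, 246, 261)   # gives 254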

As I write (2024-01-11), a news notice appeared on my computer screen from the New York Times.

The “Inflation Ticks Higher” in the headline is referring to a change from 3.3% reported in November to 3.4% reported in December. Such reports ought to come with a precision interval. To judge from the small wiggles in the 20-year data, this would be about \(\pm 0.2\)%. A numerical change from 3.3% to 3.4% is, taking the precision into account, no change at all!

Precision versus accuracy

In everyday language the words “precision” and “accuracy” are used interchangeably; both describe how well a measurement has been made. Nevertheless, there are two distinct concepts in “how well.” The easier concept has to do with reproducibility and reliability: if the measurement is taken many times, how much will the measurements differ from one another? This is the same issue as sampling variation. In the technical lingo of measurement, reproducibility or sampling variation is called “precision.” Precision is just about the measurements themselves.

In contrast, in speaking technically we use “accuracy” to refer to a different concept than “precision.” Accuracy cannot be computed with just the measurements. Accuracy refers to something outside the measurements, what we might call the “true” value of what we are trying to measure. Disappointingly, the “true” value is an elusive quantity since all we typically have is our measurements. We can easily measure precision from data, but our data have practically nothing to say about accuracy.

An analogy is often made between precision and accuracy and the patterns seen in archery. Figure 20.1 shows five arrows shot during archery practice. The arrows land in an area about the size of a dinner plate, roughly 6 inches in radius: that’s the precision.

Figure 20.1: Results from archery practice

A dinner-plate’s precision is not bad for a beginner archer. Unfortunately, the dinner plate is not centered on the bullseye but about 10 inches higher. In other words, the arrows are inaccurate by about 10 inches.

Since the “true” target is visible, it is easy to know the accuracy of the shooting. The analogy of archery to the situation in statistics would be better if the target were shown in plain white, that is, if the “true” value were not known directly. In that situation, as with data analysis, the spread in the arrows’ locations could tell us only about the precision.

To illustrate the difference between precision and accuracy, let’s look again at the coefficient on distance in the Scottish hill racing model. Our original model was

Hill_racing |> 
  model_train(time ~ distance + climb) |> 
  conf_interval() |>
  filter(term == "distance")
term .lwr .coef .upr
distance 246 254 261

Another possible model uses only distance as an explanatory variable:

Hill_racing |> 
  model_train(time ~ distance) |> 
  conf_interval() |>
  filter(term == "distance")
term .lwr .coef .upr
distance 374 381 388

The second confidence interval, [374, 388] s/km, is utterly inconsistent with the earlier confidence interval [246, 261] s/km. This is a matter of accuracy. The distance coefficient in the first model is aimed at a different target than the distance coefficient in the second model. In exploring hill-racing data, should we look at distance taking into account climb (the first model) or ignoring climb (the second model)? The width of the confidence interval addresses only the issue of precision, not whether the model is accurate for the purpose at hand.

The confidence level

The confidence interval is designed to communicate to a human reader the influence of sampling variation as it plays out in the calculation of a model coefficient (or some other sample statistic such as the median or R\(^2\)). The two equivalent formats we use for the interval—for example, [374, 388] or equivalently 381 \(\pm\) 7—are intended to be easy to read and use for the intended purpose.

A more complete picture of sampling variation is provided by treating it as a noise model, as described in Lesson 15. We can choose an appropriate noise model by looking at the distribution shape for sampling variation. Experience has shown that an excellent, general-purpose noise model for sampling variation is the normal noise model. To support this claim we can use a simulation of the sort reported in Figure 19.2, where the distribution of coefficients across the 500 sampling trials has the characteristic shape of the normal model.

To show how that normal noise model relates to confidence intervals, we can calculate a confidence interval from data and compare that interval to a simulation of sampling variation. We will stick with the distance coefficient in the model time ~ distance + climb trained on the Scottish hill racing data in the Hill_racing data frame. But any model of any other data set would show much the same thing.

Recall that the confidence interval on distance is 246 s/km to 261 s/km. We can construct individual trials of sampling variation through a technique called “resampling,” described later in this Lesson. In essence, resampling draws a sample of the same size from the data frame itself. In the simulation, we will use resampling to generate a “new” sample, train a model on that new sample, then report the distance coefficient and its confidence interval. Each trial will look like this:

resample(Hill_racing) |>
  model_train(time ~ distance + climb) |>
  conf_interval() |>
  filter(term == "distance")
term .lwr .coef .upr
distance 248.0934 255.5706 263.0479

Let’s run 10,000 such trials and store them in a data frame we will call Trials:

Trials <- 
  resample(Hill_racing) |>
  model_train(time ~ distance + climb) |>
  conf_interval() |>
  filter(term == "distance") |>
  trials(10000)

Now, let’s plot the 10,000 coefficients, one from each trial:

Trials |>
  point_plot(.coef ~ 1, annot = "violin", point_ink = 0.1, size = 0.5) |>
  gf_errorbar(246 + 261 ~ 1, color = "red") |>
  add_plot_labels(y = "Coefficient on distance (s/km)")
Figure 20.2: Ten thousand resampling trials of the distance coefficient in the model time ~ distance + climb. The [246, 261] confidence interval from the actual data is drawn in red.

Some things to note from Figure 20.2:

  1. The distribution of the distance coefficient from the resampling trials has the shape of the normal noise model.
  2. The large majority of the trials produced a coefficient that falls inside the confidence interval found from the original data. (The sketch after this list checks this directly.)
  3. Some of the trials fall outside that confidence interval. Occasionally, though rarely, a trial falls far outside it.
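To check point 2 directly, we can count the fraction of trial coefficients that land inside the original [246, 261] interval. The exact fraction will vary a little from one set of trials to the next:

# Proportion of the 10,000 trial coefficients falling inside [246, 261]
Trials |>
  summarize(prop_inside = mean(.coef >= 246 & .coef <= 261))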

A complete description of the possible range in the distance coefficient due to sampling variation would be something like Figure 20.2. For pragmatic purposes, however, rather than report 10,000 (or more!) coefficients we report just two values: the bounds of the confidence interval.

By convention, the bounds of the confidence interval are selected to contain 95% of the coefficients. Thus, the confidence interval should more properly be called the “95% confidence interval” or “the confidence interval at a 95% level.” The confidence interval gives us a solid feel for the amount of sampling variation, but it can never encompass all of it.

To calculate a confidence interval at a level other than 95%, use the level= argument to conf_interval(). For instance, for an 80% level, use conf_interval(level = 0.80).
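For instance, an 80% interval on the distance coefficient (output omitted; it will be somewhat narrower than the 95% interval reported earlier):

# 80% confidence interval on the distance coefficient
Hill_racing |>
  model_train(time ~ distance + climb) |>
  conf_interval(level = 0.80) |>
  filter(term == "distance")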

Calculating confidence intervals (optional)

In Lesson 19, we repeated trials over and over again to gain some feeling for sampling variation. We quantified the repeatability in any of several closely related ways: the sampling variance or its square root (the “standard error”) or a “margin of error” or a “confidence interval.” Our experiments with simulations demonstrated an important property of sampling variation: the amount of sampling variation depends on the sample size \(n\). In particular, the sampling variance gets smaller as \(n\) increases in proportion to \(1/n\). (Consequently, the standard error gets smaller in proportion to \(1/\sqrt{n}\).)
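As a reminder of that \(1/\sqrt{n}\) behavior, here is a minimal base-R sketch using the sample mean as the statistic; the particular numbers will vary from run to run:

# Sampling variation of the mean of n random values: the standard error
# shrinks in proportion to 1/sqrt(n).
one_trial <- function(n) mean(rnorm(n))
sd(replicate(10000, one_trial(25)))    # close to 1/sqrt(25)  = 0.20
sd(replicate(10000, one_trial(100)))   # close to 1/sqrt(100) = 0.10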

It is time to take off the DAG simulation training wheels and measure sampling variation from a single data frame. Our first approach will be to turn the single sample into several smaller samples: subsampling. Later, we will turn to another technique, resampling, which draws a sample of full size from the data frame. Sometimes, in particular with regression models, it is possible to calculate the sampling variation from a formula, allowing software to carry out and report the calculations automatically.

The next sections show two approaches to calculating a confidence interval. For the most part, this is background information to show you how it’s possible to measure sampling variation from a single sample. Usually you will use conf_interval() or similar software for the calculation.

Subsampling

Although computing a confidence interval is a simple matter in software, it is helpful to have a conceptual idea of what is behind the computation. This section and Section 20.4.2 describe two methods for calculating a confidence interval from a single sample. The conf_interval() summary function uses yet another method that is more mathematically intricate, but which we won’t describe here.

To “subsample” means to draw a smaller sample from a large one. “Small” and “large” are relative. For our example, we turn to the TenMileRace data frame containing the record of thousands of runners’ times in a race, along with basic information about each runner. There are many ways we could summarize TenMileRace. Any summary would do for the example. We will summarize the relationship between the runners’ ages and their start-to-finish times (variable net), that is, net ~ age. To avoid the complexity of a runner’s improvement with age followed by a decline, we will limit the study to people over 40.

TenMileRace |> 
  filter(age > 40) |>
  model_train(net ~ age) |> 
  conf_interval()
term .lwr .coef .upr
(Intercept) 4014.7081 4278.21279 4541.71744
age 22.8315 28.13517 33.43884

The units of net are seconds, and the units of age are years. The model coefficient on age tells us how the net time changes for each additional year of age: seconds per year. Using the entire data frame, we see that the time to run the race gets longer by about 28 seconds per year. So a 45-year-old runner who completed this year’s 10-mile race in 3900 seconds (about 9.2 mph, a pretty good pace!) might expect that, in ten years, when she is 55 years old, her time will be longer by 280 seconds.

It would be asinine to report the ten-year change as 281.3517 seconds. The runner’s time ten years from now will be influenced by the weather, crowding, the course conditions, whether she finds a good pace runner, the training regime, improvements in shoe technology, injuries, and illnesses, among other factors. There is little or nothing we can say from the TenMileRace data about such factors.

There’s also sampling variation. There are 2898 people older than 40 in the TenMileRace data frame. The way the data was collected (radio-frequency interrogation of a dongle on the runner’s shoe) suggests that the data is a census of finishers. However, it is also fair to treat it as a sample of the kind of people who run such races. People might have been interested in running but had a schedule conflict, lived too far away, or missed their train to the start line in the city.

We see sampling variation by comparing multiple samples. To create those multiple samples from TenMileRace, we will draw, at random, subsamples of, say, one-tenth the size of the whole, that is, \(n=290\):

Over40 <- TenMileRace |> filter(age > 40)
# Run a trial
Over40 |> sample(n = 290) |>
  model_train(net ~ age) |>
  conf_interval()
term .lwr .coef .upr
(Intercept) 3231.99677 4171.13999 5110.28320
age 15.48389 34.13995 52.79601
# Run another trial
Over40 |> sample(n = 290) |>
  model_train(net ~ age) |>
  conf_interval()
term .lwr .coef .upr
(Intercept) 3696.665595 4509.46904 5322.27250
age 9.834631 26.14115 42.44767

The age coefficients from these two subsampling trials differ from one another by about 8 seconds per year. To get a more systematic view, run more trials:

# a sample of summaries
Trials <- 
  Over40 |> sample(290) |>
  model_train(net ~ age) |>
  conf_interval() |>
  trials(1000)

There is a distribution of coefficients from the various trials. We can quantify the amount of variation with the variance of the coefficients. Here, we will use the standard deviation, which is (as always) simply the square root of the variance.

Trials |> 
  dplyr::summarize(sd(.coef), .by = term)
term sd(.coef)
(Intercept) 445.225065
age 9.068733

The standard deviation of the variation induced by sampling variability is called the “standard error” (SE) of the coefficient. Calculating the standard error is one of the steps in traditional methods for finding confidence intervals. The SE is very closely related to the width of the confidence interval. For instance, here is the mean width of the CI calculated from the 1000 trials:

Trials |>
  mutate(width = .upr - .lwr) |>
  summarize(mean(width), sd(width), .by = term)
term mean(width) sd(width)
(Intercept) 1803.31996 108.789468
age 36.31536 2.315838

The SE is typically about one-quarter the width of the 95% confidence interval. For our example, the SE is 9 while the width of the CI is 36. The approximate formula for the CI is \[\text{CI} = \text{coefficient} \pm 2 \times \text{SE}\ .\]

As described in Lesson 19, both the width of the CI and the SE are proportional to \(1/\sqrt{\strut n}\), where \(n\) is the sample size. From the subsamples, we know that the SE for \(n=290\) is about 9.0 s/year. This tells us that the SE for the full \(n=2898\) sample would be about \(9.0 \frac{\sqrt{290}}{\sqrt{2898}} \approx 2.85\) s/year.

So the interval summary of the age coefficient—the confidence interval—is \[\underbrace{28.1}_\text{age coef.} \pm 2\times\underbrace{2.85}_\text{standard error} = 28.1 \pm \underbrace{5.7}_\text{margin of error}\ \ \text{or, equivalently, 22.4 to 33.8 s/year.}\]
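The scaling arithmetic is easy to mishandle, so here is the same calculation done in R—nothing new, just a check of the numbers above:

# Scale the subsample SE (n = 290) up to the full sample size (n = 2898) ...
se_full <- 9.0 * sqrt(290 / 2898)   # about 2.85 s/year
# ... and turn it into an interval around the full-data age coefficient
28.1 + c(-2, 2) * se_full           # roughly 22.4 to 33.8 s/year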

Bootstrapping

There is a trick, called “resampling,” to generate a random sample of a data frame with the same \(n\) as the original: draw the new sample randomly from the original sample with replacement. An example will suffice to show what the “with replacement” does:

example <- c(1,2,3,4,5)
# without replacement
sample(example)
[1] 1 4 3 5 2
# now, with replacement
sample(example, replace=TRUE)
[1] 2 4 3 3 5
sample(example, replace=TRUE)
[1] 3 5 4 4 4
sample(example, replace=TRUE)
[1] 1 1 2 2 3
sample(example, replace=TRUE)
[1] 4 3 1 4 5

The “with replacement” leads to the possibility that some values will be repeated two or more times and other values will be left out entirely.

The calculation of the SE using resampling is called “bootstrapping.”

Demonstration: Bootstrapping the standard error

We will apply bootstrapping to find the standard error of the age coefficient from the model net ~ age fit to the Over40 data frame.

There are two steps:

  1. Run many trials, each of which fits the model net ~ age using model_train(). From trial to trial, the data used for fitting is a resampling of the Over40 data frame. The result of each trial is the set of coefficients from the model.

  2. Summarize the trials with the standard deviation of the age coefficients.

# run many trials
Trials <- 
  Over40 |> sample(replace=TRUE) |>
  model_train(net ~ age) |>
  conf_interval() |>
  trials(500)

# summarize the trials to find the SE
Trials |> 
  summarize(se = sd(.coef), .by = term)
term se
(Intercept) 140.354106
age 2.815218
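To connect the bootstrapped SE back to a confidence interval, multiply it by two to get a margin of error around the full-data age coefficient; the result should be close to (though not identical to) the conf_interval() report from the full data:

# margin of error = 2 * SE; 28.1 s/year is the full-data age coefficient
28.1 + c(-2, 2) * 2.82
# roughly 22.5 to 33.7 s/year, close to the [22.8, 33.4] interval
# reported by conf_interval() on the full data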

Decision-making with confidence intervals

Consider the situation of testing a new antibiotic “B” intended as a substitute for an antibiotic “A” that is already in use. The clinical trial involves 200 patients, each of whom will be randomly assigned to take “A” or “B” as their treatment. The outcome for each patient will be the time from the beginning of treatment to the disappearance of symptoms. The data collected look like this:

Why random? See Lesson 21.
patient age sex severity treatment duration
ID7832 52 F 4 B 5
ID4981 35 F 2 A 3
ID2019 43 M 3 A 2

… and so on for 200 rows altogether.

The outcome of the study is intended to support one of three clinical decisions:

  • Continue preferring treatment A
  • Switch to treatment B
  • Dither, for instance, recommending that a larger study be done.

In the analysis stage of the study, you start with a simple model: [In later Lessons we will see how to take age, sex, and severity into account as well.]

antibiotic_sim |> 
  datasim_run(n = 200) |>
  model_train(duration ~ treatment) |> 
  conf_interval()
term .lwr .coef .upr
(Intercept) 2.90 3.30 3.60
treatmentB -0.88 -0.36 0.15

Figure 20.3 shows (in red) the confidence interval on treatmentB. The left end of the interval is in the region which would point to using treatment B, but the right end is in the treatment A region. Thus, the confidence interval for \(n=200\) creates an ambiguity about which treatment is to be preferred.

Figure 20.3: Confidence intervals from two differently sized studies.

Which of the three decisions—continue with antibiotic A, switch to B, or dither—would be supported if only the \(n=200\) study results were available? Noting that the vast majority of the \(n=200\) confidence interval is in the “use B” region, common sense suggests that the decision should be to switch to B, perhaps with a caution that this might turn out to be a mistake. A statistical technique called “Bayesian estimation” (touched on in Lesson 28) can translate the data into a subjective probability that B is better than A, quantifying the “caution” in the previous sentence. Traditional statistical reasoning, however, would point to dithering.

With the larger \(n=400\) study, the confidence interval (blue) is narrower. The two studies are consistent with one another in terms of the treatmentB coefficient, but the larger study results place both ends of the confidence interval in the “use B” region, removing the ambiguity.
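The blue \(n=400\) interval in Figure 20.3 comes from re-running the same pipeline with a larger simulated sample. The command is sketched below; the output is omitted since it varies from one simulation run to the next:

# Same model, but simulate a study twice the size
antibiotic_sim |> 
  datasim_run(n = 400) |>
  model_train(duration ~ treatment) |> 
  conf_interval()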

Statistical analysis should support decision-making, but often there are other factors that come into play. For instance, switching to antibiotic B might be expensive, so that the possible benefit isn’t worth the cost. Or the option to carry out a larger study may not be feasible. Decision-makers need to act with the information in hand and the available options. It’s a happy situation when both ends of the confidence interval land in the same decision region, reducing the ambiguity and uncertainty that is an ever-present element of decision-making.