19  Confidence intervals



Lesson 19 took a simulation approach to observing sampling variation: generate many trials from a source such as a DAG and observe how the same sample statistic varies from trial to trial. We quantified the sampling variation in the same way we usually quantify variation, taking the variance of the sample statistic across all the trials. We called this measure of variation the sampling variance as a reminder that it comes from repeated trials of sampling.

The variance of a quantity has units that are the square of the quantity’s units. For purposes of interpretation, we often present variation using the square root of the variance, that is, the standard deviation. Following this practice, Lesson 19 introduced the square root of the sampling variance. Common sense might suggest that this ought to be called the “sampling standard deviation,” but that is long-winded and awkward. Instead, the square root of the sampling variation is called the “standard error” of the sample statistic. Unfortunately, this traditional name contains no reminder that it refers to sampling variation. So be careful to remember that “standard error” is always about sampling variation.

Precision in measurement

In everyday language the words “precision” and “accuracy” are used more or less interchangeably to describe how well a measurement has been made. Nevertheless there are two distinct concepts in “how well.” The easier concept has to do with reproducibility and reliability: if the measurement is taken many times, how much will the measurements differ from one another. This is the same issue as sampling variation. In the technical lingo of measurement, “precision” is used to express the idea of reproducibility or sampling variation. Precision is just about the measurements themselves.

In contrast, in speaking technically we use “accuracy” to refer to a different concept than “precision.” Accuracy cannot be computed with just the measurements. Accuracy refers to something outside the measurements, what we might call the “true” value of what we are trying to measure. Disappointingly, the “true” value is an elusive quantity since all we typically have is our measurements. We can easily measure precision from data, but our data have practically nothing to say about accuracy.

An analogy is often made between precision and accuracy and the patterns seen in archery. Figure 19.1 shows five arrows shot during archery practice. The arrows are in an area about the size of a dinner plate 6 inches in radius: that’s the precision.

Figure 19.1: Results from archery practice

A dinner-plate’s precision is not bad for a beginner archer. Unfortunately, the dinner plate is not centered on the bullseye but about 10 inches higher. In other words, the arrows are inaccurate by about 10 inches.

Since the “true” target is visible, it is easy to know the accuracy of the shooting. The analogy of archery to the situation in statistics would be better if the target was shown in plane white, that is, if the “true” value were not known directly. In that situation, as with data analysis, the spread in the arrows’ locations could tell us only about the precision.


The standard error is a measure of precision: the reproducibility from sample to sample. It tells us nothing about accuracy.

The confidence interval

The standard error is a perfectly reasonable way to measure precision. Nonetheless, the statistical convention for reporting precision is as an interval called the “confidence interval.” There are two equivalent ways to write the interval, either as [lower, upper] or center\(\pm\)half-width. Both styles are correct. (The preferred style can depend on the field or the journal publishing the report.)

The overall length of the interval is four times the standard error. Or, equivalently, the half-width is twice the standard error. Why twice? Returning to the archery analogy, we want the interval to include almost all the arrows. It turns out that if the standard error were used directly as the half-width of the confidence interval, only about 66% of the arrows would be inside the interval. Using twice the standard error as the half-width means that about 95% of the arrows will be in the interval.

The traditional name for the half-width of the confidence interval is the “margin of error.” The margin of error is twice the standard error.

In practice, confidence intervals are calculated using special-purpose software such as the conf_interval() function, for instance:

Note: Experienced R users may have encountered the confint() function. It does exactly the same calculation as conf_interval(), but conf_interval() formats the output into a data frame, making it more suitable for data wrangling the results.
Hill_racing %>% 
  lm(time ~ distance + climb, data=.) %>% 
term                  .lwr         .coef          .upr
------------  ------------  ------------  ------------
(Intercept)    -533.432471   -469.976937   -406.521402
distance        246.387096    253.808295    261.229494
climb             2.493307      2.609758      2.726209

Notice that there is a separate confidence interval for each model coefficient. The sampling variation is essentially the same, but that variation appears different when translated to the various coefficients’ units.

Example: Signal and noise in coefficients

Returning to the example of (adult) children’s height versus the height of the mother, the sampling variation is usually represented by an interval—the confidence interval—on the coefficients.

lm(height ~ mother, data=Galton) |> conf_interval()
term             .lwr    .coef     .upr
------------  -------  -------  -------
(Intercept)    40.300   46.700   53.100
mother          0.213    0.313    0.413

Interpretation: Consider looking at the mean height of many, many (adult-aged) children whose mothers all are 63 inches tall. Compare this to the mean height of many, many (adult-aged) children whose mothers are all 64 inches tall. The mothers differ in height by 1 inch. According to the data in Galton, the means of the two groups will differ by 0.313 inches. But this number is not so precise. It should be written as \(0.3 \pm 0.1\) inches, or, in bottom-to-top format [0.2, 0.4] inches.

The name “confidence interval” is used universally, but it can be a little misleading for those starting out in statistics. The word “confidence” in “confidence interval” has nothing to do with self-assuredness, boldness, or confidentiality. A more descriptive name is “precision interval.” For example, the mass of the Earth is known quite precisely, \(5.9722\pm 0.0005 \times 10^{24}\text{kg}\).

Calculating confidence intervals

In Lesson 19, we repeated trials over and over again to gain some feeling for sampling variation. We quantified the repeatability in any of several closely related ways: the sampling variance or its square root (the “standard error”) or a “margin of error” or a “confidence interval.” Our experiments with simulations demonstrated an important property of sampling variation: the amount of sampling variation depends on the sample size \(n\). In particular, the sampling variance gets smaller as \(n\) increases in proportion to \(1/n\). (Consequently, the standard error gets smaller in proportion to \(1/\sqrt{n}\).)

It is time to take off the DAG simulation training wheels and measure sampling variation from a single data frame. Our first approach will be to turn the single sample into several smaller samples: subsampling. Later, we will turn to another technique, resampling, which draws a sample of full size from the data frame. Sometimes, in particular with regression models, it is possible to calculate the sampling variation from a formula, allowing software to carry out and report the calculations automatically.

The next sections show two approaches to calculating a confidence interval. For the most part, this is background information to show you how it’s possible to measure sampling variation from a single sample. Usually you will use conf_interval() or similar software for the calculation.


Although computing a confidence interval is a simple matter in software, it is helpful to have a conceptual idea of what is behind the computation. This section and Section 19.3.2 describe two methods for calculating a confidence interval from a single sample. The conf_interval() summary function uses yet another method that is more mathematically intricate, but which we won’t describe here.

To “subsample” means to draw a smaller sample from a large one. “Small” and “large” are relative. For our example, we turn to the TenMileRace data frame containing the record of thousands of runners’ times in a race, along with basic information about each runner. There are many ways we could summarize TenMileRace. Any summary would do for the example. We will summarize the relationship between the runners’ ages and their start-to-finish times (variable net), that is, net ~ age. To avoid the complexity of a runner’s improvement with age followed by a decline, we will limit the study to people over 40.

TenMileRace %>% filter(age > 40) %>%
  lm(net ~ age, data = .) %>% conf_interval()
term                .lwr        .coef         .upr
------------  ----------  -----------  -----------
(Intercept)    4014.7081   4278.21279   4541.71744
age              22.8315     28.13517     33.43884

The units of net are seconds, and the units of age are years. The model coefficient on age tells us how the net time changes for each additional year of age: seconds per year. Using the entire data frame, we see that the time to run the race gets longer by about 28 seconds per year. So a 45-year-old runner who completed this year’s 10-mile race in 3900 seconds (about 9.2 mph, a pretty good pace!) might expect that, in ten years, when she is 55 years old, her time will be longer by 280 seconds.

It would be asinine to report the ten-year change as 281.3517 seconds. The runner’s time ten years from now will be influenced by the weather, crowding, the course conditions, whether she finds a good pace runner, the training regime, improvements in shoe technology, injuries, and illnesses, among other factors. There is little or nothing we can say from the TenMileRace data about such factors.

There’s also sampling variation. There are 2898 people older than 40 in the TenMileRace data frame. The way the data was collected (radio-frequency interrogation of a dongle on the runner’s shoe) suggests that the data is a census of finishers. However, it is also fair to treat it as a sample of the kind of people who run such races. People might have been interested in running but had a schedule conflict, lived too far away, or missed their train to the start line in the city.

We see sampling variation by comparing multiple samples. To create those multiple samples from TenMileRace, we will draw, at random, subsamples of, say, one-tenth the size of the whole, that is, \(n=290\)

Over40 <- TenMileRace %>% filter(age > 40)
lm(time ~ age, 
   data = Over40 %>% sample(size=290)) %>% 
term                  .lwr        .coef         .upr
------------  ------------  -----------  -----------
(Intercept)    4028.482722   4976.70853   5924.93434
age              -2.142959     17.37023     36.88342
lm(time ~ age, 
   data = Over40 %>% sample(size=290)) %>% 
term                  .lwr        .coef         .upr
------------  ------------  -----------  -----------
(Intercept)    3633.781136   4550.76902   5467.75691
age               7.303994     25.68956     44.07512

The age coefficients from these two subsampling trials differ one from the other by about 0.5 seconds. To get a more systematic view, run more trials:

# a sample of summaries
Trials <- do(1000) * {
  lm(time ~ age, data = sample(Over40, size=290)) %>% 
# a summary of the sample of summaries
Trials %>% 
  group_by(term) %>% 
  dplyr::summarize(se = sd(.coef))
term                   se
------------  -----------
(Intercept)    442.002071
age              8.927824

We used the name se for the summary of samples of summaries because what we have calculated is the standard error of the age coefficient from samples of size \(n=290\).

In Lesson 19 we saw that the standard error is proportional to \(1/\sqrt{\strut n}\), where \(n\) is the sample size. From the subsamples, know that the SE for \(n=290\) is about 9.0 seconds. This tells us that the SE for the full \(n=2898\) samples would be about \(9.0 \frac{\sqrt{290}}{\sqrt{2898}} = 2.85\).

So the interval summary of the age coefficient—the confidence interval— is \[\underbrace{28.1}_\text{age coef.} \pm 2\times\!\!\!\!\!\!\! \underbrace{2.85}_\text{standard error} =\ \ \ \ 28.1 \pm\!\!\!\!\!\!\!\! \underbrace{5.6}_\text{margin of error}\ \ \text{or, equivalently, 22.6 to 33.6}\]


There is a trick, called “resampling,” to generate a random subsample of a data frame with the same \(n\) as the data frame: draw the new sample randomly from the original sample with replacement. An example will suffice to show what the “with replacement” does:

example <- c(1,2,3,4,5)
# without replacement
[1] 1 4 3 5 2
# now, with replacement
sample(example, replace=TRUE)
[1] 2 4 3 3 5
sample(example, replace=TRUE)
[1] 3 5 4 4 4
sample(example, replace=TRUE)
[1] 1 1 2 2 3
sample(example, replace=TRUE)
[1] 4 3 1 4 5

The “with replacement” leads to the possibility that some values will be repeated two or more times and other values will be left out entirely.

The calculation of the SE using resampling is called “bootstrapping.”

Demonstration: Bootstrapping the standard error

We will apply bootstrapping to find the standard error of the age coefficient from the model time ~ age fit to the Over40 data frame.

There are two steps:

  1. Run many trials, each of which fits the model time ~ age using lm(). From trial to trial, the data used for fitting is a resampling of the Over40 data frame. The result of each trial is the coefficients from the model.

  2. Summarize the trials with the standard deviation of the age coefficients.

# run many trials
Trials <- do(1000) * {
  lm(time ~ age, data = sample(Over40, replace=TRUE)) %>% 
# summarize the trials to find the SE
Trials %>% 
  group_by(term) %>%
  summarize(se = sd(.coef))
term                   se
------------  -----------
(Intercept)    141.634107
age              2.859483

Decision-making with confidence intervals

Consider the situation of testing a new antibiotic “B” intended as a substitute for an antibiotic “A” that is already in use. The clinical trial involves 200 patients each of whom will be randomly assigned to take “A” or “B” as their treatment. The outcome for each patient will be the time from the beginning of treatment to the disappearance of symptoms. The data collected look like this:

Why random? See Lesson 18.
patient age sex severity treatment duration
ID7832 52 F 4 B 5
ID4981 35 F 2 A 3
ID2019 43 M 3 A 2

… and so on for 200 rows altogether.

The outcome of the study is intended to support one of three clinical decisions:

  • Continue preferring treatment A
  • Switch to treatment B
  • Dither, for instance, recommending that a larger study be done.

In the analysis stage of the study, you start with a simple model: [In Lessons 24 through 24 we will see how to take age, sex, and severity into account as well.]

lm(duration ~ treatment, data=Antibiotic_trial) |> conf_interval()
term            .lwr   .coef   .upr
------------  ------  ------  -----
(Intercept)     2.90    3.30   3.60
treatmentB     -0.88   -0.36   0.15

Figure 19.2 shows (in red) the confidence interval on treatmentB. The left end of the interval is in the region which would point to using treatment B, but the right end is in the treatment A region. Thus, the confidence interval for \(n=200\) creates an ambiguity about which treatment is to be preferred.

Figure 19.2: Confidence intervals from two differently sized studies.

Which of the three decisions—continue with antibiotic A, switch to B, or dither—would be supported if only the \(n=200\) study results were availble? Noting that the vast majority of the \(n=200\) confidence interval is in the “use B” region, common sense suggests that the decision should be to switch to B, perhaps with a caution that this might turn out to be a mistake. A statistical technique called “Bayesian estimation” ([[[touched on in]]] Lesson 29) can translate the data into a subjective probability that B is better than A, quantifying the “caution” in the previous sentence. Traditional statistical reasoning, however, would point to dithering.

With the larger \(n=400\) study, the confidence interval (blue) is narrower. The two studies are consistent with one another in terms of the treatmentB coefficient, but the larger study results place both ends of the confidence interval in the “use B” region, removing the ambiguity.

Statistical analysis should support decision-making, but often there are other factors that come into play. For instance, switching to antibiotic B might be expensive so that the possible benefit isn’t worth the cost. Or, the option to carry out a larger study may not be feasible. Decision-makers need to act with the information that is in hand and the available options. It’s a happy situation when both ends of the confidence interval land in the same decision region, reducing the ambiguity and uncertainty that is a ever-present element of decision-making.

Demonstration: Walking the line

This demonstration is motivated by an experience during one of my early-morning walks. Due to recent seasonal flooding, a 100-yard segment of the quiet, riverside road I often take was covered with sand. The concrete curbs remained in place so I stepped up to the curb to keep up my usual pace. I wondered how close to my regular pace I could walk on the curb, which was plenty wide: about 10 inches.

Imagine studying the matter more generally, assembling a group of people and measuring how much time it takes to walk 100 yards, either on the road surface or the relatively narrow curve. Suppose the ostensible purpose of the experiment is to develop a “handicap,” as in golf, for curve walking. But my reason for including the matter in a statistics text is to demonstrate statistical thinking.

In the spirit of demonstration, we will simulate the situation. Each simulated person will complete the 100-yard walk twice, once on the road surface and once on the curb. The people differ one from the other. We will use \(70 \pm 15\) seconds road-walking time and slow down the pace by 15% on average when curb walking. There will also be a random factor affecting each walk, say \(\pm 2\) seconds.

walking_sim <- datasim_make(
  person_id <- paste0("ID-", round(runif(n, 10000,100000))),
  .road <- 70 + rnorm(n, sd=15/2),
  .curb <- .road*(1 + 0.15 + rnorm(n, sd=0.03)),
  road <- .road*(1 + rnorm(n, sd=.02/2)),
  curb <- .curb*(1 + rnorm(n, sd=(.02/2)))

LET’S Look at the confidence interval for two models

Walks <- datasim_run(walking_sim, n=10) |>
  tidyr::pivot_longer(-person_id, names_to = "condition", values_to = "time")
Walks |> model_train(time ~ condition) |> conf_interval()
term                  .lwr       .coef        .upr
--------------  ----------  ----------  ----------
(Intercept)       73.23544    80.02805   86.820664
conditionroad    -20.36343   -10.75723   -1.151029
Walks |> model_train(time ~ condition + person_id) |> conf_interval()
term                       .lwr        .coef        .upr
------------------  -----------  -----------  ----------
(Intercept)           60.277188    65.287898   70.298608
conditionroad        -13.778802   -10.757230   -7.735659
person_idID-14908      5.369794    12.126234   18.882674
person_idID-15370     19.340146    26.096586   32.853026
person_idID-32461      0.767527     7.523967   14.280407
person_idID-37767     11.328145    18.084585   24.841024
person_idID-43388     13.488393    20.244833   27.001273
person_idID-45600     -5.702839     1.053600    7.810040
person_idID-64277      4.201244    10.957684   17.714123
person_idID-96987     21.549754    28.306194   35.062634
person_idID-98263     16.251437    23.007876   29.764316


There are two equivalent ways of of describing an interval numerically that are widely used:

  1. Specify the lower and upper endpoints of the interval, e.g. 7 to 13.
  2. Specify the center and half-width of the interval, e.g. 10 ± 3, which is just the same as 7 to 13.

Complete the following table to show the equivalences between the two notations.

Interval bottom-to-top plus-or-minus
(a) 3 to 11. Answer: 7 ± 4
(b) Answer: 98 to 118 108 ± 10
(c) Answer: 29 to 31 30 ± 1
(d) 97 to 100 Answer: 98.5 ± 1.5
(e) -4 to 16 Answer: 6 ± 10