Lesson 22: Worksheet

Author

Jane Doe

Objectives

22.1 Describe the logical origin of sampling variation as the variation between multiple samples from the same source.

22.2 Recognize the several formats in which we describe sampling variation—sampling variance, standard error, margin of error, confidence interval—and show how they are related.

22.3 Using repeated sampling trials, observe how sampling variance scales with sample size \(n\).

Part 1

Using dag02, obtain a sample of size 25 and show the values of y.

dag02sample = sample(dag02, size=25)

dag02sample%>%
  select(y)
# A tibble: 25 × 1
        y
    <dbl>
 1  2.29 
 2  8.90 
 3  6.56 
 4  1.29 
 5  6.97 
 6  8.76 
 7  8.23 
 8  0.759
 9  5.18 
10 10.2  
# … with 15 more rows

Compute the mean those 25 values of y in two different, but entirely equivalent ways. (1) Use data wrangling. (2) Construct a model y ~ 1 report the intercept coefficient. Show that these give the same answer.

ANSWER
dag02sample%>%
  summarize(mean(y))
# A tibble: 1 × 1
  `mean(y)`
      <dbl>
1      6.37
dag02sample %>% 
  lm(y~1,data=.)%>%
  conf_interval()
# A tibble: 1 × 4
  term         .lwr .coef  .upr
  <chr>       <dbl> <dbl> <dbl>
1 (Intercept)  5.14  6.37  7.60

Part 2

Create a new chunk that repeats the generation of a sample from dag02 the the two methods for calculating the mean of the y values. Run the new chunk and observe that the calculated value of the mean differs somewhat from that you found in Part 1. Repeat running the chunk over and over again; the mean value will differ each time.

Task 2.1. Each time you run the chunk, you are performing a new sampling trial. Run a dozen or so trials, observing the calculated value of the mean of y in order to get a sense for how much it varies from trial to trial. Then summarizing your observations by giving a rough interval for the range of the mean of y across the trials.

ANSWER

We are going to automate the process of performing sampling trials so that we can run hundreds of them.

Using the do operator, calculate the sampling variance for a set of trials from dag02. The following code chunk shows how to run 500 trials, in each of which the mean of y is calculated using the y ~ 1 method and reporting the intercept coefficient. These will be collected into a data frame named dag02trials25.

dag02trials25 <- do(500) * {
  sample(dag02, size=25) |> 
  lm(y ~ 1, data=_) |>
  conf_interval()
}

Task 2.2. Run the chunk above to create dag02trials25. Then use data wrangling commands to compute three summaries of the trials: i. The mean of the coefficient across the trials. ii. The variance of the coefficient across the trials. iii. The standard deviation of the coefficient across the trials.

ANSWER

The coefficient for each of the 500 trials is stored in the .coef column of dag02trials25. Simple data wrangling provides the summary.

dag02trials25%>%
  summarize(mean_of_means = mean(.coef),
            sampling_variance = var(.coef),
            standard_error = sd(.coef))
# A tibble: 1 × 3
  mean_of_means sampling_variance standard_error
          <dbl>             <dbl>          <dbl>
1          4.98             0.551          0.742

Notice that the names—sampling_variance and standard_error—we used for the different summaries correspond to the standard statistical nomenclature for these quantities.

Task 2.3. Repeat (2) with four different sample sizes (try 50, 100, 200, and 400). Fill in the table below. What do you notice about the standard error as sample size increases?

Sample size Sampling variance Standard error
n=25 0.439 0.663
n=50 0.253 0.503
n=100 0.129 0.359
n=200 0.059 0.243
n=400 0.032 0.178
ANSWER

The sampling variance gets smaller as the sample size increases. Specifically, doubling the sample size tends to halve the sampling variance. The standard error—which is just the square-root of the sampling variance—also gets smaller as \(n\) increases. As in the nature of square roots, to halve the standard error, the sample size must be doubled twice.