A “data frame” is a computing term, referring to a particular organization of data. Statisticians often find it useful to treat a data frame as a **sample** from a **source**. In everyday speech, a sample is:

“A small part or quantity intended to show what the whole is like.” (Oxford Languages)

A food market will give you a sample of an item on sale: a tiny cup of a drink or a taste of a piece of fruit or other food item. Laundry-detergent companies sometimes send out a sample of their product in the form of a small foil packet suitable for only a single wash cycle. Paint stores keep small samples on hand to help customers choose from among the possibilities. A fabric sample is a little swatch of cloth cut from a bigger bolt that a customer is considering buying.

In contrast, a **sample** in statistics is always a *collection* of multiple items. The individual items are **specimens**, each one recorded on its own row of a data frame. The entries in that row record the measured attributes of that specimen. Other words for a specimen are “a case,” “a row,” “an individual,” “a datum,” or, even, “a tuple.”

The collection of specimens is the sample. In museums, the curators put related specimens—fossils or stone tools—into a drawer or shelf. Statisticians use data frames to hold their samples. Think of “a sample” as akin to words like “a herd,” “a flock”, “a pack”, or “a school”: a collective. A single fish is not a school and a single wolf is not a pack. Similarly, a single row is not a sample but a specimen.

The dictionary definition of “sample” uses the word “whole” to describe where the sample comes from. Similarly, a statistical sample is a collection of specimens selected from a larger “whole.” Traditionally, statisticians have used the word “**population**” as the name for the “whole.” This is a nice metaphor; it’s easy to imagine the population of a state being the source of a sample in which each individual is a specific person. But the “whole” from which a sample is collected does not need to be a finite, definite set of individuals like the citizens of a state. For example, you have already seen how to collect a sample of any size you want from a DAG.

An example of a sample is the data frame `Galton`, which records the heights of a few hundred people sampled from the population of London in the 1880s.

Our *modus operandi* in these Lessons takes a sample in the form of a data frame and summarizes it in the form of one or more numbers: a “**sample summary**.” Typically, the sample summary is the coefficients of a regression model, but it might be something else such as the mean or variance of a variable.

To illustrate, here is a sample summary of `Galton`.

`Galton %>% summarize(vheight = var(height), mheight = mean(height))`

| vheight | mheight |
|---|---|
| 12.8373 | 66.76069 |

This summary includes two numbers, the variance and the mean of the `height` variable. Each of these numbers is called a “**sample statistic**.” In other words, the summary consists of two sample statistics.

There are many ways to summarize a sample. Here is another summary of `Galton`:

`Galton %>% lm(height ~ mother + father + sex, data = .) %>% coef()`

```
(Intercept) mother father sexM
15.3447600 0.3214951 0.4059780 5.2259513
```

This summary has four numbers, each of which is a regression coefficient. That is, each of the regression coefficients is a “sample statistic.”

Sometimes a data frame is not a sample. This happens when the data frame contains a row for every member of an actual, finite “population.” Such a complete enumeration—the inventory records of a merchant, the records kept of student grades by the school registrar—has a technical name: a “**census**.” Famously, many countries conduct a census of the population in which they try to record every resident of the country. For example, the US, UK, and China carry out a census every ten years.

In a typical setting, it is unfeasible to record every possible unit of observation.^{1} Such incomplete records constitute a “**sample**.” One of the great successes of statistics is the means to draw useful information from a sample, at least when the sample is collected correctly.

Sampling is called for when we want to find out about a large group but lack time, energy, money, or the other resources needed to contact every group member. For instance, France surveys samples of its population at short intervals to keep its data up to date while staying within a budget. The name used for the process, *recensement en continu* or “rolling census,” signals the intent. Over several years, the French rolling census contacts about 70% of the population.

Sometimes, as in quality control in manufacturing, the measurement process is destructive: the measurement process consumes the item. In a destructive measurement situation, it would be pointless to measure every single item. Instead, a sample will have to do.

Collecting a reliable sample is usually considerable work. An ideal is the “simple random sample” (SRS), where all of the items are available, but only some are selected—completely at random—for recording as data. Undertaking an SRS requires assembling a “sampling frame,” essentially a census. Then, with the sampling frame in hand, a computer or throws of the dice can accomplish the random selection for the sample.
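The selection step of an SRS can be sketched in a few lines of base R. This is only an illustration: the sampling frame `frame`, its `id` and `height` columns, and the sizes 1000 and 50 are all made up for the example.

```r
# Hypothetical sampling frame: one row per member of the population.
set.seed(101)
frame <- data.frame(
  id     = 1:1000,
  height = rnorm(1000, mean = 67, sd = 3.5)  # made-up values, for illustration only
)

# Simple random sample: pick 50 distinct rows, each equally likely.
chosen_rows <- sample(nrow(frame), size = 50)  # base R samples without replacement by default
srs <- frame[chosen_rows, ]

nrow(srs)          # number of specimens in the sample
mean(srs$height)   # a sample statistic computed from the SRS
```

The hard part in practice is building `frame`, not the call to `sample()`.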

Understandably, if a census is unfeasible, constructing a perfect sampling frame is hardly less so. In practice, the sample is assembled by randomly dialing phone numbers or taking every 10th visitor to a clinic or similar means. Unlike genuinely random samples, the samples created by these practical methods are not necessarily representative of the larger group. For instance, many people will not answer a phone call from a stranger; such people are underrepresented in the sample. Similarly, the people who can get to the clinic may be healthier than those who cannot. Such unrepresentativeness is called “**sampling bias**.”

Professional work, such as collecting unemployment data, often requires government-level resources. Assembling representative samples uses specialized statistical techniques such as stratification and weighting of the results. We will not cover the specialized techniques in this introductory course, even though they are essential in creating representative samples. The table of contents of a classic text, William Cochran’s *Sampling Techniques*, shows what is involved.

All statistical thinkers, whether expert in sampling techniques or not, should be aware of factors that can bias a sample away from being representative. In political polls, many (most?) people will not respond to the questions. If this non-response stems from, for example, an expectation that the response will be unpopular, then the poll sample will not adequately reflect unpopular opinions. Such **non-response bias** can be significant, even overwhelming, in surveys.
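A small simulation can make non-response bias concrete. Everything here is an assumption chosen for illustration: a population in which 30% hold an unpopular opinion, and response rates of 5% for those who hold it versus 15% for everyone else.

```r
set.seed(202)
n_pop <- 100000
# Hypothetical population: about 30% hold the "unpopular" opinion (coded TRUE).
unpopular <- runif(n_pop) < 0.30
# Assumed response rates: holders of the unpopular opinion answer less often.
p_respond <- ifelse(unpopular, 0.05, 0.15)

# Contact a simple random sample of 5000 people ...
contacted <- sample(n_pop, size = 5000)
# ... but only some of those contacted actually answer.
responded <- contacted[runif(length(contacted)) < p_respond[contacted]]

true_share      <- mean(unpopular)             # share in the whole population
contacted_share <- mean(unpopular[contacted])  # share among everyone contacted
poll_share      <- mean(unpopular[responded])  # share among those who answered
c(true = true_share, contacted = contacted_share, poll = poll_share)
```

Under these assumed rates, the share among the people contacted stays close to the population share, while the share among respondents falls well below it, even though the people contacted were chosen completely at random.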

**Survival bias** plays a role in many settings. The `mosaicData::TenMileRace` data frame provides an example, recording the running times of 8636 participants in a 10-mile road race and including information about each runner’s age. Can such data carry information about changes in running performance as people age? The data frame includes runners aged 10 to 87. Nevertheless, a model of running time as a function of age from this data frame is seriously biased. The reason? As people age, casual runners tend to drop out of such races. So the older runners are skewed toward higher performance. (We can see this by taking a different approach to the sample: collecting data over multiple years and tracking individual runners as they age.)
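The mechanism can be sketched with simulated runners. This is not the `TenMileRace` data: the aging effect (0.5 minutes slower per year), the noise level, and the drop-out rule (anyone needing more than 95 minutes stays home) are assumptions made up for the illustration.

```r
set.seed(303)
n <- 5000
age <- runif(n, min = 20, max = 80)
# Assumed "true" aging effect: 0.5 minutes slower per year of age, plus noise.
time <- 70 + 0.5 * (age - 20) + rnorm(n, sd = 10)
Everyone <- data.frame(age = age, time = time)

# Assumed drop-out rule: anyone who would need more than 95 minutes stays home.
Racers <- subset(Everyone, time <= 95)

slope_everyone <- coef(lm(time ~ age, data = Everyone))["age"]
slope_racers   <- coef(lm(time ~ age, data = Racers))["age"]
c(everyone = unname(slope_everyone), racers = unname(slope_racers))
```

Because the drop-out rule removes mostly older, slower runners, the `age` coefficient fitted to `Racers` understates the aging effect built into the simulation.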

Almost always we work with a single sample, consisting of both signal and noise. As a thought experiment, however, imagine having multiple samples, each collected independently and at random from the same source and stored in its own data frame. Continuing the thought experiment, calculate a sample statistic in the same way for each data frame, say, a particular regression coefficient. In the end, we will have a collection of equivalent sample statistics. We say “equivalent” because each individual sample statistic was computed in the same way. But the sample statistics, although equivalent, will differ from one another to some extent. That is, the sample statistics *vary* one to the other. We call the variation among the summaries “**sampling variation**.”

::: {.callout-tip}
In this Lesson, we will focus on individual sample statistics selected, according to our interest, from a sample summary. For instance, we might be interested in the `mother` coefficient from the model above. Alternatively, we might choose the variance of `height` as the sample statistic of interest.
:::

In this Lesson, we will illustrate sampling variation by a particular technique: generating multiple samples from the same source. To save effort, we will use a DAG as the source. The DAG simulation provides an inexhaustible source of samples. Then we will calculate a sample statistic on each of the many samples. This will enable us to see sampling variation directly.

To quantify the amount of variation in the sample statistic from one sample to another—the “sampling variation”—we will use our standard measure of variation: the variance. And to remind us that the variance we calculate is to measure sampling variation, we will give it a distinct name: the “**sampling variance.**”

The simulation technique will enable us to witness important properties of the sampling variance, in particular how it depends on sample size \(n\).

Usually, we study a sample in order to inform our understanding of the broader process that generated the sample. Or, in the words of the dictionary definition at the start of this Lesson, we use a sample “*to show what the whole is like*.” Because of sampling variation, it would not be correct to say the “whole” is exactly like our sample. By quantifying sampling variation, we give a more complete description of the relationship of our particular sample to the “whole.”

In the spirit of starting simply, we return to `dag01`. This DAG is \(\mathtt{x}\longrightarrow\mathtt{y}\). The causal formula setting the value of `y` is `y ~ 4 + 1.5 * x + exo()`.
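For readers who want to experiment without the course software, here is a rough base-R stand-in for drawing a sample from this DAG. It is a sketch under two assumptions that the text does not state: that `x` is itself pure exogenous noise, and that `exo()` draws standard normal values. The name `sim_dag01` is made up for this example.

```r
# Base-R stand-in for sampling from dag01 (an assumption, not the course's own code).
# Assumes x is pure exogenous noise and exo() draws standard normal values.
sim_dag01 <- function(n) {
  x <- rnorm(n)                  # x has no inputs in the DAG x -> y
  y <- 4 + 1.5 * x + rnorm(n)    # causal formula from the text: y ~ 4 + 1.5 * x + exo()
  data.frame(x = x, y = y)
}

set.seed(404)
Sim <- sim_dag01(25)
head(Sim, 3)
coef(lm(y ~ x, data = Sim))   # estimates should land near 4 and 1.5
```

Each call to `sim_dag01()` produces a fresh sample, which is what makes a simulation an inexhaustible source.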

It is crucial to remember that sampling variation is not about the row-to-row variation in a single sample. Rather, it is about the variation in the calculated sample statistic from one sample to another. So our initial process for exploring sampling variation will be to carry out many trials, each trial resulting in a sample statistic.

A single sampling trial consists of taking a random sample, computing a summary, and from that summary pulling out a sample statistic. To illustrate, here is one trial using a sample size \(n=25\) and a simple model specification, `y ~ 1`. In this case, the sample statistic is the intercept coefficient.

```
# One trial: draw a sample of size 25, fit y ~ 1, keep the intercept
Trial_sample <- sample(dag01, size=25)
Trial_sample %>%
  lm(y ~ 1, data = .) %>%
  conf_interval() %>%
  select(.coef)
```

| .coef |
|---|
| 4.47276 |

We cannot see sampling variation directly in the above result because there is only one trial. The sampling variation becomes evident when we run *many* trials. In each trial, a new sample (of size \(n=25\)) is taken and summarized.

```
# 500 trials: each one draws a new sample and records the intercept
Trials <- do(500) * {
  Sample <- sample(dag01, size=25)
  Sample %>%
    lm(y ~ 1, data = .) %>%
    conf_interval() %>%
    select(.coef)
}
```

Graphics provide a nice way to visualize the sampling variation. Figure 22.2 shows the results from the set of trials.

The sampling variance is:

```
# Variance of the 500 intercepts, and its square root
Trials %>%
  summarize(sampling_variance = var(.coef), se = sqrt(sampling_variance))
```

| sampling_variance | se |
|---|---|
| 0.122632 | 0.3501886 |

Often, statisticians prefer to use the square root of the sampling variance, which has a technical name in statistics: the **standard error**. The standard error is an ordinary standard deviation in a particular context: the standard deviation of a sample of summaries. The words **standard error** should be followed by a description of the summary and the size of the individual samples involved. Here it would be, “The standard error of the Intercept coefficient from a sample of size \(n=25\) is around 0.36.”

We found an SE of 0.36 on the Intercept in a sample of size \(n=25\). We can see how the SE depends on sample size by repeating the trials for several different sizes, say, \(n=25\), 100, 400, 1600, 6400, 25,000, and 100,000.

The following commands estimate the SE for a sample of size 400:

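```
# 1000 trials, each with a sample of size 400
Trials <- do(1000) * {
  Sample <- sample(dag01, size=400)
  Sample %>%
    lm(y ~ 1, data = .) %>%
    conf_interval() %>%
    select(.coef)
}
# Sampling variance and standard error across the trials
Trials %>%
  summarize(svar400 = var(.coef),
            se400 = sqrt(svar400))
```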

| svar400 | se400 |
|---|---|
| 0.0081078 | 0.0900435 |

We repeated this process for each of the other sample sizes. Table 22.1 reports the results.

| n | sampling_variance | standard_error |
|---|---|---|
| 25 | 0.1296000 | 0.3600 |
| 100 | 0.0361000 | 0.1900 |
| 400 | 0.0082810 | 0.0910 |
| 1600 | 0.0018490 | 0.0430 |
| 6400 | 0.0005290 | 0.0230 |
| 25000 | 0.0001210 | 0.0110 |
| 100000 | 0.0000314 | 0.0056 |

There is a pattern in Table 22.1. Every time we quadruple \(n\), the sampling variance goes down by a factor of four. Consequently, the standard error—which is just the square-root of the sampling variance—goes down by a factor of 2, that is, \(\sqrt{4}\). (The pattern is not exact because there is also sampling variation in the trials, which are really just a sample of all possible trials.)

**Conclusion**: The larger the sample size, the smaller the sampling variance. For a sample of size \(n\), the sampling variance will be proportional to \(1/n\). Or, in terms of the standard error: For a sample size of \(n\), the SE will be proportional to \(1/\sqrt{n}\).
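The \(1/\sqrt{n}\) pattern can be checked with a short base-R simulation. This is a sketch, not the code used to produce Table 22.1: the helper names `sim_y` and `se_for_n` are made up, and it assumes that `x` and `exo()` are both standard normal noise.

```r
# Base-R stand-in for the trials (assumption: x and exo() are standard normal noise).
sim_y <- function(n) 4 + 1.5 * rnorm(n) + rnorm(n)

# The intercept of y ~ 1 is just the sample mean, so each trial is mean(y).
se_for_n <- function(n, trials = 500) {
  stats <- replicate(trials, mean(sim_y(n)))
  sd(stats)                      # standard error = sqrt(sampling variance)
}

set.seed(505)
sizes <- c(25, 100, 400, 1600)
Results <- data.frame(n = sizes, se = sapply(sizes, se_for_n))
Results$se_times_sqrt_n <- Results$se * sqrt(Results$n)  # roughly constant if SE ~ 1/sqrt(n)
Results
```

Under the stated assumption, the variance of `y` is \(1.5^2 + 1 = 3.25\), so `se_times_sqrt_n` should hover near \(\sqrt{3.25} \approx 1.8\). That is consistent with the SE of about 0.36 found for \(n=25\).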

1. Even a population “census” inevitably leaves out some individuals.