A “data frame” is a computing term, referring to a particular organization of data. Statisticians often find it useful to treat a data frame as a sample from a source. In everyday speech, a sample is:
“A small part or quantity intended to show what the whole is like.” — Oxford Languages
A food market will give you a sample of an item on sale: a tiny cup of a drink or a taste of a piece of fruit or other food item. Laundry-detergent companies sometimes send out a sample of their product in the form of a small foil packet suitable for only a single wash cycle. Paint stores keep small samples on hand to help customers choose from among the possibilities. A fabric sample is a little swatch of cloth cut from a bigger bolt that a customer is considering buying.
In contrast, a sample in statistics is always a collection of multiple items. The individual items are specimens, each one recorded on its own row of a data frame. The entries in that row record the measured attributes of that specimen. Other words for a specimen are “a case,” “a row,” “an individual,” “a datum,” or, even, “a tuple.”
The collection of specimens is the sample. In museums, the curators put related specimens—fossils or stone tools—into a drawer or shelf. Statisticians use data frames to hold their samples. Think of “a sample” as akin to words like “a herd,” “a flock”, “a pack”, or “a school”: a collective. A single fish is not a school and a single wolf is not a pack. Similarly, a single row is not a sample but a specimen.
The dictionary definition of “sample” uses the word “whole” to describe where the sample comes from. Similarly, a statistical sample is a collection of specimens selected from a larger “whole.” Traditionally, statisticians have used the word “population” as the name for the “whole.” This is a nice metaphor; it’s easy to imagine the population of a state being the source of a sample in which each individual is a specific person. But the “whole” from which a sample is collected does not need to be a finite, definite set of individuals like the citizens of a state. For example, you have already seen how to collect a sample of any size you want from a DAG.
An example of a sample is the data frame Galton, which records the heights of a few hundred people sampled from the population of London in the 1880s.
Our modus operandi in these Lessons takes a sample in the form of a data frame and summarizes it in the form of one or more numbers: a “sample summary.” Typically, the sample summary is the coefficients of a regression model, but it might be something else such as the mean or variance of a variable.
To illustrate, here is a sample summary of Galton.
Galton %>% summarize(vheight = var(height), mheight = mean(height))
vheight mheight
-------- ---------
12.8373 66.76069
This summary includes two numbers, the variance and the mean of the height variable. Each of these numbers is called a “sample statistic.” In other words, the summary consists of two sample statistics.
There are many ways to summarize a sample. Here is another summary of Galton:
Galton %>% lm(height ~ mother + father + sex, data = .) |> coef()
(Intercept) mother father sexM
15.3447600 0.3214951 0.4059780 5.2259513
This summary has four numbers, each of which is a regression coefficient. That is, each of the regression coefficients is a “sample statistic.”
Sometimes a data frame is not a sample. This happens when the data frame contains a row for every member of an actual, finite “population.” Such a complete enumeration—the inventory records of a merchant, the records kept of student grades by the school registrar—has a technical name: a “census.” Famously, many countries conduct a census of the population in which they try to record every resident of the country. For example, the US, UK, and China carry out a census every ten years.
In a typical setting, it is unfeasible to record every possible unit of observation.[1] Such incomplete records constitute a “sample.” One of the great successes of statistics is the means to draw useful information from a sample, at least when the sample is collected correctly.
Sampling is called for when we want to find out about a large group but lack time, energy, money, or the other resources needed to contact every group member. For instance, France surveys samples at short intervals to gather up-to-date data while staying within a budget. The name used for the process, recensement en continu or “rolling census,” signals the intent. Over several years, the French rolling census contacts about 70% of the population.
Sometimes, as in quality control in manufacturing, the measurement process is destructive: the measurement process consumes the item. In a destructive measurement situation, it would be pointless to measure every single item. Instead, a sample will have to do.
Collecting a reliable sample is usually considerable work. An ideal is the “simple random sample” (SRS), where all of the items are available, but only some are selected—completely at random—for recording as data. Undertaking an SRS requires assembling a “sampling frame,” essentially a census. Then, with the sampling frame in hand, a computer or throws of the dice can accomplish the random selection for the sample.
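The selection step of an SRS can be sketched in a few lines of base R. This is only an illustration: the sampling frame here is made up, and the names Frame, id, and value are hypothetical, not taken from these Lessons. The function sample.int() stands in for the throws of the dice.

```r
# Hypothetical sampling frame: one row per unit in the "whole".
# (Frame, id, and value are illustrative names, not from the Lessons.)
set.seed(101)
Frame <- data.frame(id = 1:1000, value = rnorm(1000, mean = 50, sd = 10))

# Simple random sample: every row has the same chance of selection,
# and no row is selected twice.
n <- 25
chosen_rows <- sample.int(nrow(Frame), size = n, replace = FALSE)
SRS <- Frame[chosen_rows, ]

nrow(SRS)        # the sample size
mean(SRS$value)  # a sample statistic computed from the sample
```

The hard part in practice is not this selection step but assembling a trustworthy Frame in the first place.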
Understandably, if a census is unfeasible, constructing a perfect sampling frame is hardly less so. In practice, the sample is assembled by randomly dialing phone numbers or taking every 10th visitor to a clinic or similar means. Unlike genuinely random samples, the samples created by these practical methods are not necessarily representative of the larger group. For instance, many people will not answer a phone call from a stranger; such people are underrepresented in the sample. Similarly, the people who can get to the clinic may be healthier than those who cannot. Such unrepresentativeness is called “sampling bias.”
Professional work, such as collecting unemployment data, often requires government-level resources. Assembling representative samples uses specialized statistical techniques such as stratification and weighting of the results. We will not cover the specialized techniques in this introductory course, even though they are essential in creating representative samples. The table of contents of a classic text, William Cochran’s Sampling Techniques, shows what is involved.
All statistical thinkers, whether expert in sampling techniques or not, should be aware of factors that can bias a sample away from being representative. In political polls, many (most?) people will not respond to the questions. If this non-response stems from, for example, an expectation that the response will be unpopular, then the poll sample will not adequately reflect unpopular opinions. Such non-response bias can be significant, even overwhelming, in surveys.
Survival bias plays a role in many settings. The mosaicData::TenMileRace data frame provides an example, recording the running times of 8636 participants in a 10-mile road race and including information about each runner’s age. Can such data carry information about changes in running performance as people age? The data frame includes runners aged 10 to 87. Nevertheless, a model of running time as a function of age from this data frame is seriously biased. The reason? As people age, casual runners tend to drop out of such races. So the older runners are skewed toward higher performance. (We can see this by taking a different approach to the sample: collecting data over multiple years and tracking individual runners as they age.)
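The mechanism can be sketched with simulated data. Everything in the sketch is an assumption made for illustration: the slowing rate of 0.5 minutes per year of age, the drop-out rule, and all of the variable names. None of it is estimated from TenMileRace.

```r
set.seed(202)
n <- 20000
age <- runif(n, min = 20, max = 80)

# ASSUMED mechanism: everyone slows by 0.5 minutes per year of age,
# on top of an individual "ability" offset (positive = slower runner).
ability <- rnorm(n, mean = 0, sd = 15)
time <- 70 + 0.5 * age + ability

# ASSUMED drop-out rule: slower-than-average people become less likely
# to enter the race as they age; faster people always enter.
p_enter <- ifelse(ability > 0, pmax(0, 1 - (age - 20) / 60), 1)
enters <- runif(n) < p_enter

Everyone <- data.frame(age, time)
Racers <- Everyone[enters, ]

# Slope of time versus age, with and without the drop-outs.
slope_everyone <- unname(coef(lm(time ~ age, data = Everyone))["age"])
slope_racers   <- unname(coef(lm(time ~ age, data = Racers))["age"])
c(everyone = slope_everyone, racers = slope_racers)
```

Under these assumptions, the model fitted to the racers alone should understate how much people slow with age, because the older racers are a selected, faster-than-typical group.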
An inspiring story about dealing with survival bias comes from a World War II study of the damage sustained by bombers due to enemy guns. The sample, by necessity, included only those bombers that survived the mission and returned to base. Shell holes on the surviving planes were clustered in certain areas, as depicted in Figure 19.1. The clustering stems from survival bias: the unfortunate planes hit in the middle of the wings, cockpit, engines, and the back of the fuselage did not return to base, so shell hits in those areas never made it into the record.
For the last 20 years, the conventional wisdom has been that lower socio-economic status families talk to their children less than higher-status families do. The quoted number is a gap of 30 million words per year between the low-status and high-status families.
The 30-million word gap is due to … mainly, bias. This story from National Public Radio explains some of the sources of bias in counting words spoken. More comes from the original data being collected by spending an hour with families in the early evening. That’s the time, later research has found, that families converse the most. More systematic sampling, using what are effectively “word pedometers,” puts the gap at 4 million words per year.
Almost always we work with a single sample, consisting of both signal and noise. As a thought experiment, however, imagine having multiple samples, each collected independently and at random from the same source and stored in its own data frame. Continuing the thought experiment, calculate a sample statistic in the same way for each data frame, say, a particular regression coefficient. In the end, we will have a collection of equivalent sample statistics. We say “equivalent” because each individual sample statistic was computed in the same way. But the sample statistics, although equivalent, will differ from one another to some extent. That is, the sample statistics vary from one sample to another. We call the variation among the summaries “sampling variation.”
::: {.callout-tip}
In this Lesson, we will focus on individual sample statistics selected, according to our interest, from a sample summary. For instance, we might be interested in the mother coefficient from the model above. Alternatively, we might choose the variance of height as the sample statistic of interest.
:::
In this Lesson, we will illustrate sampling variation by a particular technique: generating multiple samples from the same source. To save effort, we will use a DAG as the source. The DAG simulation provides an inexhaustible source of samples. Then we will calculate a sample statistic on each of the many samples. This will enable us to see sampling variation directly.
To quantify the amount of variation in the sample statistic from one sample to another—the “sampling variation”—we will use our standard measure of variation: the variance. And to remind us that the variance we calculate is to measure sampling variation, we will give it a distinct name: the “sampling variance.”
The simulation technique will enable us to witness important properties of the sampling variance, in particular how it depends on sample size \(n\).
Usually, we study a sample in order to inform our understanding of the broader process that generated the sample. Or, in the words of the dictionary definition at the start of this Lesson, we use a sample “to show what the whole is like.” Because of sampling variation, it would not be correct to say the “whole” is exactly like our sample. By quantifying sampling variation, we give a more complete description of the relationship of our particular sample to the “whole.”
Pay careful attention to the “ing” ending in “sampling variation” and “sampling variance.” The phrase “sample statistic” does not have an “ing” ending. When we use the “ing” in “sampling,” it is to emphasize that we are looking at the variation in a sample statistic from one sample to another.
In the spirit of starting simply, we return to dag01. This DAG is \(\mathtt{x}\longrightarrow\mathtt{y}\). The causal formula setting the value of y is y ~ 4 + 1.5 * x + exo().
It is crucial to remember that sampling variation is not about the row-to-row variation in a single sample. Rather, it is about the variation in the calculated sample statistic from one sample to another. So our initial process for exploring sampling variation will be to carry out many trials, each trial resulting in a sample statistic.
A single sampling trial consists of taking a random sample, computing a summary, and from that summary pulling out a sample statistic. To illustrate, here is one trial using a sample size \(n=25\) and a simple model specification, y ~ 1. In this case, the sample statistic is the intercept coefficient.
Trial_sample <- sample(dag01, size = 25)
Trial_sample %>%
  lm(y ~ 1, data = .) %>%
  conf_interval() %>%
  select(.coef)
.coef
--------
4.23474
We cannot see sampling variation directly in the above result because there is only one trial. The sampling variation becomes evident when we run many trials. In each trial, a new sample (of size \(n=25\)) is taken and summarized.
Trials <- do(500) * {
  Sample <- sample(dag01, size = 25)
  Sample %>%
    lm(y ~ 1, data = .) %>%
    conf_interval() %>%
    select(.coef)
}
Graphics provide a nice way to visualize the sampling variation. Figure 19.2 shows the results from the set of trials.
Figure 19.2: Results of the trials in which the model y ~ 1 is fitted to a sample from dag01 of size \(n=25\).
The sampling variance is:
Trials %>%
  summarize(sampling_variance = var(.coef), se = sqrt(sampling_variance))
sampling_variance se
------------------ ----------
0.122632 0.3501886
Often, statisticians prefer to use the square root of the sampling variance, which has a technical name in statistics: the standard error. The standard error is an ordinary standard deviation in a particular context: the standard deviation of a sample of summaries. The words standard error should be followed by a description of the summary and the size of the individual samples involved. Here it would be, “The standard error of the Intercept coefficient from a sample of size \(n=25\) is around 0.36.”
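As a rough cross-check on the simulated value, the standard error can also be worked out by hand. The calculation rests on two assumptions that this Lesson does not state: that exo() generates independent noise with mean 0 and variance 1, and that x in dag01 is itself generated by exo(). Under those assumptions, and because the intercept of y ~ 1 is just the sample mean of y:

\[
\mathrm{Var}(y) = 1.5^2\,\mathrm{Var}(x) + \mathrm{Var}(\text{noise}) = 2.25 + 1 = 3.25,
\qquad
\mathrm{SE} = \sqrt{\frac{\mathrm{Var}(y)}{n}} = \sqrt{\frac{3.25}{25}} \approx 0.36 .
\]

The value estimated from 500 trials will wobble around this figure from one run of the trials to the next.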
It is easy to confuse “standard error” with “standard deviation.” Adding to the potential confusion is another related term, the “margin of error.” To avoid this confusion, we will eventually switch to an interval description of the sampling variation called the “confidence interval.” However, for the present, we will continue with the standard error, sometimes written SE for short.
We found an SE of 0.36 on the Intercept in a sample of size \(n=25\). We can see how the SE depends on sample size by repeating the trials for several different sizes, say, \(n=25\), 100, 400, 1600, 6400, 25,000, and 100,000.
The following command estimates the SE for a sample of size 400:
Trials <- do(1000) * {
  Sample <- sample(dag01, size = 400)
  Sample %>%
    lm(y ~ 1, data = .) %>%
    conf_interval() %>%
    select(.coef)
}
Trials %>%
  summarize(svar400 = var(.coef),
            se400 = sqrt(svar400))
svar400 se400
---------- ----------
0.0081078 0.0900435
We repeated this process for each of the other sample sizes. The following table reports the results.
n sampling_variance standard_error
------- ----------------- ---------------
25 0.1296000 0.3600
100 0.0361000 0.1900
400 0.0082810 0.0910
1600 0.0018490 0.0430
6400 0.0005290 0.0230
25000 0.0001210 0.0110
100000 0.0000314 0.0056
There is a pattern in the table above. Every time we quadruple \(n\), the sampling variance goes down by a factor of four. Consequently, the standard error (which is just the square root of the sampling variance) goes down by a factor of 2, that is, \(\sqrt{4}\). (The pattern is not exact because there is also sampling variation in the trials, which are really just a sample of all possible trials.)
Conclusion: The larger the sample size, the smaller the sampling variance. For a sample of size \(n\), the sampling variance will be proportional to \(1/n\). Or, in terms of the standard error: For a sample size of \(n\), the SE will be proportional to \(1/\sqrt{\strut n}\).
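The \(1/n\) pattern can be checked without the Lessons' software. The sketch below uses only base R and replaces sample(dag01, size = n) with a stand-in function, draw_y(), which assumes that x and exo() are independent standard-normal draws. That assumption, and the names draw_y, one_trial, and Results, are ours rather than the Lesson's.

```r
set.seed(303)

# Stand-in for sample(dag01, size = n), ASSUMING x and exo() are
# independent standard-normal draws (not confirmed by this Lesson).
draw_y <- function(n) {
  x <- rnorm(n)
  4 + 1.5 * x + rnorm(n)
}

# One trial: draw a sample and return the intercept of y ~ 1,
# which is simply the sample mean of y.
one_trial <- function(n) mean(draw_y(n))

# Sampling variance at each sample size, from 1000 trials apiece.
sizes <- c(25, 100, 400, 1600)
sampling_variance <- sapply(sizes, function(n) var(replicate(1000, one_trial(n))))

Results <- data.frame(
  n = sizes,
  sampling_variance = sampling_variance,
  standard_error = sqrt(sampling_variance),
  # Roughly constant across rows if the variance is proportional to 1/n.
  n_times_variance = sizes * sampling_variance
)
Results
```

If the sampling variance really is proportional to \(1/n\), the last column should stay roughly constant from row to row, apart from the trial-to-trial wobble.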
The confidence intervals on the model time ~ distance + climb report the results to many digits. Such a report is appropriate for further calculations that might need doing, but it is usually not appropriate for a human reader.
To know how many digits are worth reporting to humans, look toward the standard error. The standard error is a part of a different kind of summary of a model: the “regression report.” We will only need to look at regression reports in the last few Lessons of the course. Here we want to point out how many digits are worth reporting to humans. That requires looking at the standard error itself.
Previously, we looked at the confidence intervals on coefficients from the Hill_racing model. Now we look at the regression summary, which contains the information on sampling variation in a different format.
Hill_racing %>%
  lm(time ~ distance + climb, data = .) %>%
  regression_summary()
term estimate std.error statistic p.value
------------ ------------ ----------- ---------- --------
(Intercept) -469.976937 32.3582241 -14.52419 0
distance 253.808295 3.7843322 67.06819 0
climb 2.609758 0.0593826 43.94821 0
Each coefficient’s standard error appears in the std.error column of the regression summary.
For the human reader, only the first two significant digits of the standard error are worth reporting. (This is true regardless of the data and model design.) Here, the SE is 32 for the Intercept, 3.8 for the distance coefficient, and 0.059 for the climb coefficient. The confidence interval will be the coefficient (column labeled estimate) plus or minus “twice” the std.error. It is appropriate to round the confidence interval (for a human reader) to the first two significant digits of the standard error.
For example, the confidence interval on the distance coefficient will be \(253.808295 \pm 2 \times 3.7843322\), that is, \(253.808295 \pm 7.5686644\). Round both numbers at the position of the second significant digit of the SE (here, the tenths place), so the reported interval can be \(253.8 \pm 7.6\).
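The rounding rule is mechanical enough to write as a small function. The helper below, report_interval(), is hypothetical: it is not part of the Lessons' software, just one possible base-R sketch of "estimate plus or minus twice the SE, rounded at the second significant digit of the SE."

```r
# Hypothetical helper (not part of the Lessons' software): report
# estimate +/- 2 * SE, rounded at the second significant digit of the SE.
report_interval <- function(estimate, se) {
  # Decimal place of the SE's second significant digit,
  # e.g. se = 3.78 -> 1 decimal place; se = 32.4 -> 0; se = 0.0594 -> 3.
  digits <- 1 - floor(log10(se))
  c(estimate = round(estimate, digits),
    margin   = round(2 * se, digits))
}

report_interval(253.808295, 3.7843322)  # distance coefficient
report_interval(2.609758, 0.0593826)    # climb coefficient
```

The design choice is to let the SE alone decide the rounding position, so the estimate and the margin are always reported to the same decimal place.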
Beginners sometimes think that each row in a data frame is a sample. Better to say that each row is a “specimen.” A “sample” is a collection of specimens, the set of rows in a data frame.
The “sample size” is the number of rows. “Sampling” is the process of collecting the specimens to be put into the data frame.
The following command illustrates computing a summary of a sample from dag08.
sample(dag08, size=10000) %>%
lm(y ~ c + x, data = .) %>%
conf_interval()
term .lwr .coef .upr
------------ ---------- ---------- ---------
(Intercept) 2.9982098 3.0180678 3.037926
c 0.9846780 1.0125614 1.040445
x 0.9675331 0.9874018 1.007271
An essential question in statistics is how the summary depends on the incidental specifics of a particular sample. DAGs provide a convenient way to address this question since we can generate multiple samples from the same DAG, summarize each, and compare those summaries.
To generate a sample of summaries, re-run many trials of the summary. The do() function automates this process, accumulating the results from the trials in a single data frame: a “sample of summaries.” We will use do() mostly in demonstrations.
do()
In this demonstration, we will revisit a model used earlier in this Lesson to see how much the coefficients vary from one sample to another. Each trial consists of drawing a sample from dag08, training a model, and summarizing with the model coefficients. Curly braces ({ and }) surround the commands needed for an individual trial.
Preceding the curly braces, we have placed do(5) *. This instruction causes the trial to be repeated five times.
do(5) * {
sample(dag08, size=50) %>%
lm(y ~ c + x, data = .) %>%
conf_interval()
}
term .lwr .coef .upr .row .index
------------ ---------- ---------- --------- ----- -------
(Intercept) 2.5641735 2.8762415 3.188310 1 1
c 0.2572171 0.7103165 1.163416 2 1
x 0.6534167 1.0699707 1.486525 3 1
(Intercept) 2.8752965 3.1995753 3.523854 1 2
c 0.6580292 1.1131976 1.568366 2 2
x 0.9314241 1.2080143 1.484605 3 2
(Intercept) 2.6494529 2.8993737 3.149295 1 3
c 0.6703915 1.0293747 1.388358 2 3
x 0.8073321 1.0538471 1.300362 3 3
(Intercept) 2.7636055 3.0676817 3.371758 1 4
c 0.5043663 0.9195001 1.334634 2 4
x 0.6204863 0.8916145 1.162743 3 4
(Intercept) 2.8228223 3.1362843 3.449746 1 5
c 0.6267129 1.0365283 1.446344 2 5
x 0.7030331 1.0289089 1.354785 3 5
The five trials are collected together by do() into a single data frame, three rows per trial (one for each coefficient), with the .index column recording which trial each row came from. Such a data frame can be considered a “sample of summaries.”
One of the things we will do with a “sample of summaries” is to … wait for it … summarize it. For instance, in the following code chunk, a sample of 40 summaries is stored under the name Trials. Then we will summarize Trials, in this case, to see how much the value of each coefficient varies from trial to trial.
Trials <- do(40) * {
  sample(dag08, size = 50) %>%
    glm(y ~ c + x, data = .) %>%
    conf_interval()
}
Trials %>%
  group_by(term) %>%
  summarize(mean_c_coef = mean(.coef), variation_a = sd(.coef))
term mean_c_coef variation_a
------------ ------------ ------------
(Intercept) 3.0150059 0.1495756
c 0.9963332 0.2463078
x 1.0311800 0.1617271
The result of summarizing the trials is a “summary of a sample of summaries.” This phrase is admittedly awkward, but we will use this technique often: summarizing trials, where each trial is a “summary of a sample.” Often, the clue will be the use of do(), which repeats trials as many times as you ask.
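For readers working without do(), base R's replicate() can play a similar role. The sketch below is an approximation, not the Lessons' method. In particular, draw_sample() is a stand-in for sample(dag08, size = n), and its mechanism is an assumption chosen so that the coefficients land near the values reported above (intercept 3, c 1, x 1).

```r
set.seed(404)

# Stand-in for sample(dag08, size = n). The mechanism is an ASSUMPTION
# chosen so the coefficients land near the values reported in this
# Lesson (intercept 3, c 1, x 1); it is not taken from dag08 itself.
draw_sample <- function(n) {
  c_var <- rnorm(n)
  x <- c_var + rnorm(n)
  y <- 3 + c_var + x + rnorm(n)
  data.frame(c = c_var, x = x, y = y)
}

# One trial: draw a sample, fit the model, keep the coefficients.
one_trial <- function(n) coef(lm(y ~ c + x, data = draw_sample(n)))

# replicate() returns one column per trial; transpose so each row is a trial.
Trials <- as.data.frame(t(replicate(40, one_trial(50))))

# Summary of a sample of summaries: mean and standard deviation
# of each coefficient across the 40 trials.
coef_means <- sapply(Trials, mean)
coef_sds   <- sapply(Trials, sd)
rbind(mean = coef_means, sd = coef_sds)
```

Here each trial occupies one row with a column per coefficient, whereas do() stacks the coefficients as separate rows; the information is the same.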
[1] Even a population “census” inevitably leaves out some individuals.