4 Precision and confidence
4.1 Objectives
1. Understand the origins of variation due to sampling: sampling uncertainty.
2. Quantify sampling uncertainty in a sample statistic using a confidence interval.
   - Convert between the two forms of a confidence interval: [low, high] and center ± margin of error.
   - Be aware that the margin of error becomes smaller the larger the sample size. Famously, the margin of error is proportional to 1/√n.
3. Use a confidence interval to assess whether a model coefficient cannot be distinguished from zero or whether two model coefficients can be distinguished from each other.
4. Observe the valuable role of iteration and simulation in demonstrating (2) and (3).
   - Be able to run a simulation to produce a data frame.
   - Be able to use trials() to run repeated simulations and collect the results from each trial.
   - Understand "resampling" as a kind of simulation rooted in actual data and "bootstrapping" as a technique that uses resampling and iteration to make valid statistical estimates from data.
4.2 Introduction
Here is a simple arithmetic problem:
- What is the mean of 7, 4, 9, 2, 6?
- Answer: 5.6
In contrast, here is a statistical problem of the type often encountered in textbooks:
- What is the mean of 7, 4, 9, 2, 6?
- Answer: 5.6 ± 3.35
How can exactly the same problem have different answers? The issue is one of context. The arithmetic problem treats the numbers 7, 4, 9, 2, 6 as fixed. Those are the numbers and anyone properly calculating the mean will get an identical, exact, certain result.
In contrast, the statistics problem considers that the five numbers 7, 4, 9, 2, 6 were sampled at random from some imaginary, inexhaustible source of numbers. Theoretical statisticians call such a source a "population." The statistical mean refers to that population, not just the five numbers that happened to be selected from it. Finding the exact mean of the population requires, in principle, a sample of infinite size. For any finite sample (such as our five numbers here), we can only estimate the population mean. The ±3.35 in the statistical answer is the margin of error, a measure of how precise that estimate is.
The arithmetic setting does not involve any uncertainty, so the answer is the single, exact number 5.6. The statistical setting recognizes that uncertainty arises from the random sampling of the population. That's why we need to add the ±3.35 to the statistical answer.
This tutorial is about how to calculate the precision of an estimate based on a random sample of data. For instance, in Tutorial 3 we reported, based on the Galton data, that mothers have children who are (on average) 0.28 inches taller for every additional inch of mother's height. The corresponding value for fathers is 0.38. Does this mean that fathers' influence on height is stronger than mothers'? In assessing this, it helps to know the precision of 0.28 and 0.38. For instance, if the margins of error on 0.28 and 0.38 are larger than the 0.10 difference between them, the data cannot support a claim that the two influences differ.
An important and closely related question is how big a sample needs to be so that the precision of the estimate falls within the bounds of utility. For example, how many people would Galton have had to include in his study to support a claim that the influences of fathers and mothers differ by, say, 0.1?
4.3 Population and simulation
As an example of a statistical project, consider the problem of finding the mean height of adults in the 1880 population of London. Francis Galton has already done the hard work of acquiring a sample, which we have in the Galton data frame.
To illustrate the consequences of drawing a finite sample from a population, we're going to carry out a simulation. We will draw a sample of a given size from the Galton data frame, treating Galton as the population. The basic R commands for pulling a sample out of a data frame are in Active R chunk 4.1.
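The chunk itself is interactive and not reproduced here; the sketch below shows the idea, using dplyr's slice_sample() as a stand-in (the tutorial's own chunk may use an {LSTbook} helper instead) and an arbitrary sample size of 50.

```r
# Draw one random sample of 50 rows from Galton (the data frame that comes
# with the course software). The sample size 50 is just for illustration.
library(dplyr)
Galton |> slice_sample(n = 50)
```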
Almost every time you run the command in Active R chunk 4.1 you will get a different sample. Try it a couple of times.
Now a slight modification to the above: we’ll calculate the mean height for the sample:
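As a sketch under the same assumptions as above, adding a summarize() step reduces the sample to a single number:

```r
# One random sample of 50 rows, reduced to its mean height.
library(dplyr)
Galton |>
  slice_sample(n = 50) |>
  summarize(mean_height = mean(height))
```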
Again, run the command in Active R chunk 4.2 several times to observe that the result varies somewhat from one run to the next. This is “sampling variation.”
Our task here is to define what “somewhat” means in the previous paragraph. You can get a good idea of this by observing the variation you see when repeating the Active R chunk 4.2 command.
We are going to be systematic, and instruct the computer to run the command many times. That will save labor. Active R chunk 4.3 is set up to repeat the calculation of the sample mean five times. Each of these “times” is called a “trial.” Even better, we’ll tell the computer to repeat the trials for each of two different sample sizes.
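The tutorial's chunk uses trials() for this; as a stand-in, here is a base-R sketch with replicate() and two assumed sample sizes (50 and 100):

```r
# Five trials of the sample mean for each of two sample sizes.
library(dplyr)
sample_mean <- function(size) {
  Galton |> slice_sample(n = size) |> summarize(m = mean(height)) |> pull(m)
}
replicate(5, sample_mean(50))    # five trials at n = 50
replicate(5, sample_mean(100))   # five trials at n = 100
```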
We want to get an idea of sampling variation, but we also want to see how the sample size is involved. In Active R chunk 4.4 we will run 500 trials for each of three sample sizes, each one twice the size of the previous. We will also plot the results. The calculation will take a long time, so be patient.
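A stand-in sketch of that calculation, again using replicate() in place of trials(); the three sample sizes (100, 200, 400) are assumptions chosen to double each time:

```r
# 500 trials of the sample mean at each of three sample sizes, then a plot.
library(dplyr)
library(ggplot2)
Results <- bind_rows(lapply(c(100, 200, 400), function(size) {
  data.frame(
    size = size,
    mean_height = replicate(500, mean(slice_sample(Galton, n = size)$height))
  )
}))
ggplot(Results, aes(x = factor(size), y = mean_height)) +
  geom_jitter(width = 0.2, alpha = 0.2) +
  labs(x = "Sample size", y = "Sample mean of height (inches)")
```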
You can see from the above graph that as the sample size gets bigger, the amount of sampling variation gets smaller.
Perhaps numbers will show the pattern more clearly than graphics. Let's quantify the amount of variation in the trials with the variance, our usual measure of variation:
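Continuing the sketch above, the Results data frame can be wrangled to give one variance per sample size:

```r
# Variance of the trial-to-trial sample means, one value for each sample size.
library(dplyr)
Results |>
  group_by(size) |>
  summarize(sampling_variance = var(mean_height))
```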
Notice in the results from Active R chunk 4.5 that when the sample size doubles, the sampling variance gets smaller by a factor of two.
The calculations from the simulation are slow. Even 20 years ago they would have been intolerably slow. And 50 years ago they would have been practically impossible.
When the ideas of sampling variation and sampling variance were invented about one hundred years ago, such a simulation could not have been carried out. So the techniques statisticians invented for quantifying sampling variation relied on algebraic derivation and formulas. There's no need for you to learn the formulas because computation is cheap and all of the formulas have been encapsulated in software. Aside: We can discuss in class what the formulas look like. A basic result is that the sampling variance in the mean of a sample of size n is the variance of the variable itself (height in our example) divided by n.
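A quick check of that formula against the simulation sketched above (the sample sizes 100, 200, 400 were assumptions there):

```r
# var(height) / n should roughly match the simulated sampling variances.
var(Galton$height) / c(100, 200, 400)
```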
4.4 Confidence intervals
The theory of sampling variation in the coefficients from regression was worked out a century ago and is programmed into software. We will use the conf_interval() function, which is applied to a regression model. Active R chunk 4.6 shows an example based on the Galton sample itself.
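A sketch of what such a chunk looks like, assuming the usual {LSTbook} pipeline in which a data frame is piped to model_train() and then to conf_interval():

```r
# Fit the all-cases model height ~ 1 and ask for the confidence interval on
# its one coefficient, which is the mean height.
library(LSTbook)
Galton |>
  model_train(height ~ 1) |>
  conf_interval()
```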
As you saw in Activity C.2, the model coefficient from height ~ 1 is the mean height. The confidence interval, [66.5, 67.0] inches, tells us the precision of this mean. Many people prefer expressing the interval in a different format: mean ± margin of error, here 66.76 ± 0.25 inches.
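Converting between the two forms is simple arithmetic, shown here with the interval reported above:

```r
# From [low, high] to center +/- margin of error.
low <- 66.5; high <- 67.0
center <- (low + high) / 2            # 66.75 inches
margin_of_error <- (high - low) / 2   # 0.25 inches
```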
To illustrate the role of confidence intervals in scientific work, consider the question of whether the mother and father contribute differently to the child’s height. Strictly speaking, the child’s sex is a genetic inheritance from the father, but let’s put this aside by adjusting for the child’s sex. Active R chunk 4.7 builds the relevant model and displays the confidence intervals on the coefficients.
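A sketch of the kind of model being described, again assuming the {LSTbook} model_train()/conf_interval() pipeline (the order of the explanatory terms is my choice):

```r
# Both parents' heights as explanatory variables, adjusting for the child's sex.
library(LSTbook)
Galton |>
  model_train(height ~ mother + father + sex) |>
  conf_interval()
```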
At first glance, the mother effect size is only about 80% of the father effect size. Better, however, to examine the confidence intervals. The two intervals overlap: [0.26, 0.38] and [0.34, 0.46]. Such overlap indicates that the data do not give good support for a claim that the effect sizes are different.
A very common use for confidence intervals is to check whether the data provide good evidence that a coefficient is different from zero. To illustrate, let’s consider whether a grown child’s height is related to the size of his or her family. Some reasons to imagine that it might: a bigger family means more competition for food and perhaps more exposure to contagious illness; or, a bigger family might be a sign of better health generally or better environmental conditions and the avoidance of serious childhood disease. These speculations conflict, but that is often the case when imagining what influences an outcome.
Let’s turn to the data for insight. Here’s a simple model of height as a function of family size:
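A sketch of the family-size model, under the same pipeline assumption; nkids is the number of children in the child's family:

```r
# Height as a function of family size.
library(LSTbook)
Galton |>
  model_train(height ~ nkids) |>
  conf_interval()
```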
The coefficient on nkids from Active R chunk 4.8 is negative. The negative sign suggests that bigger families tend to have shorter kids. We need to look at the confidence interval, however. If both ends of the interval are of the same sign, the precision is adequate to support a claim about the sign. Here, both ends of the nkids interval are negative, so there is good reason to suggest that the data point to a role for nkids.
In the previous paragraph I danced around the matter of whether kids from larger families tend to be shorter. The claim is certainly justified by Active R chunk 4.8, but there are two issues of importance. First, a confidence interval is not just about the data but also about the model used to describe the data. In Section 4.5 you'll see some of the factors that go into the margin of error, and it's reasonable to suspect that height ~ nkids is not a good way to account for height. Second, many people translate "tend to be" into a causal connection between nkids and height. We will need tools from Tutorial 5 in order to make responsible statements about causality. (Responsible doesn't necessarily mean "true." Rather, it has to do with the quality of reasoning.)
Another, more basic use for confidence intervals is to guide the choice of the number of digits to report in written communication. To illustrate, suppose you are reporting a statistic such as the coefficient from Active R chunk 4.6. It's tempting (but misguided!) merely to copy the digits from the computer's report: 66.76069 inches. Instead, you should make a correct choice of the number of significant digits. Here are some possible choices:
- 70, that is, one significant digit
- 67.0, two significant digits
- 66.8, three significant digits
- 66.76, four significant digits
- 66.761, five significant digits
Reporting too few significant digits creates an unfair representation. A population with a mean height of 70 inches would be very unusual. Reporting too many significant digits distracts the reader from what’s important by appending digits that are unjustified by the data at hand.
A correct choice involves looking at the margin of error. Here, that's 0.25 inches. Note the position of the leading non-zero digit of the margin of error: the first place after the decimal point. That's where you should truncate the reporting of digits when reporting the coefficient itself: 66.8 inches, where the last digit is in the same place as the first digit of the margin of error. Even better, from a quantitative reasoning point of view, is to report the full confidence interval. In such reports, use only the leading two digits from the margin of error and report the statistic itself including the corresponding digits, as in 66.76 ± 0.25 inches.
4.5 What determines the margin of error?
Although you can rely on software to calculate confidence intervals, it’s helpful when interpreting them to have a bit more background. There are four factors at work in determining the size of a margin of error.
1. The size of the residuals from the model. The bigger the residuals, the bigger the margin of error. To be more specific, the margin of error is proportional to the square root of the variance of the residuals.
2. The sample size n. The margin of error is proportional to 1/√n.

Putting (1) and (2) together, the size of the margin of error goes as the standard deviation of the residuals divided by √n, that is, √(variance of the residuals / n).
Factors (1) and (2) are things that can be influenced by the design of data collection and analysis. For example, in principle, you can take care to make better measurements of the response variable and use a larger sample size to get a smaller margin of error. Including covariates can reduce the size of the residuals. (A code sketch following this list illustrates factors (2) and (3).)
3. The confidence level. The confidence intervals reported by conf_interval() are set to include the central 95% of the sampling distribution (as in Active R chunk 4.4). In other words, the confidence level is 95%. Use of a 95% confidence level is conventional in statistical reporting. But you can use other confidence levels, and the convention differs from field to field. A confidence level of 90% is often used in psychology research, whereas 99.9% is common in physics reports. Whenever you encounter a confidence level that's different from the convention, you should make yourself aware of the reasons for the choice, which are sometimes dubious. For our purposes in QR2, we will always use 95% as the confidence level. The issue will come up again briefly in Tutorial 6.

4. Another factor is more mathematically subtle and hard to understand even with a background in linear algebra. When an explanatory variable aligns with other explanatory variables, the margin of error gets bigger. In econometrics, this is called the "variance inflation factor." Software takes this into account automatically, but it is sometimes wise to leave out a covariate if it aligns too strongly with an explanatory variable of interest.
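Here is the sketch promised above, illustrating factors (2) and (3) with base R's lm() and confint() standing in for model_train() and conf_interval(); the sub-sample size of 100 and the 90% level are arbitrary choices:

```r
# Factor (2): smaller samples give wider intervals.
# Factor (3): lower confidence levels give narrower intervals.
library(dplyr)
Small <- Galton |> slice_sample(n = 100)
confint(lm(height ~ nkids, data = Small))                  # small n: wide interval
confint(lm(height ~ nkids, data = Galton))                  # full data: narrower
confint(lm(height ~ nkids, data = Galton), level = 0.90)    # 90% level: narrower still
```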
4.6 Simulation and iteration
This section is about two computing techniques that make it easy to confirm claims about statistical methods. One such claim is that a 95% confidence interval will include the quantity it estimates about 95% of the time.
The first method is simulation. You're used to this from video games. The simulations we will construct will be simple by comparison. Here's an example written using the software for this course (R with the {LSTbook} package). The simulation produces data from a system with two variables, x and y, where the coefficients on y ~ x are an intercept of 5 and an x-coefficient of 3.
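A sketch of such a simulation; the range of x and the amount of noise are assumptions, but the built-in coefficients (intercept 5, x-coefficient 3) and the sample size of 5 follow the text:

```r
# Simulate a small data frame in which y is built from x with intercept 5 and
# slope 3, plus random noise.
n <- 5
Sim1 <- data.frame(x = runif(n, min = 0, max = 10))
Sim1$y <- 5 + 3 * Sim1$x + rnorm(n, mean = 0, sd = 10)
Sim1
```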
The functions runif() and rnorm() used in the simulation are called "random number generators." It's not important that you understand how to construct such simulations; we will provide them to you as needed. What we want to use Sim1 for is to demonstrate that model_train() works, by showing how it recovers the coefficients (5 and 3) that are built into the simulation. For instance:
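A sketch of the fitting step, assuming the {LSTbook} pipeline and the Sim1 data frame from the simulation above:

```r
# Fit y ~ x to the simulated data and look at the confidence intervals.
library(LSTbook)
Sim1 |>
  model_train(y ~ x) |>
  conf_interval()
```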
Since the simulation uses random numbers, you'll get a somewhat different result every time you run Active R chunk 4.9. Examine the results that you got. For simplicity, we'll focus on the x coefficient. According to the simulation, the x coefficient is 3. Chances are that the confidence interval on x includes the known coefficient value 3.

What do we mean by "chances are"? It means that if you run the simulation and fit the model many times, the x confidence interval will include 3 about 95% of the time. The confidence level used for the confidence interval is what sets the percentage; in our work we use the standard convention of a 95% confidence interval. (If we had used, say, an 80% confidence level, the preceding statement would be "will include 3 about 80% of the time.")

You may not find this example a compelling demonstration that regression captures the simulation's coefficients. Chances are (again!) that your x coefficient isn't particularly close to 3. (For instance, on the first run I tried I got an x coefficient of -1.64.) But the statistical interpretation of "close to" isn't about the .coef column in the regression report. Instead, "close to" means "within the confidence interval about 95% of the time."
The margin of error when I first ran Active R chunk 4.9 was correspondingly wide.
Why should you believe the claims I just made about the confidence intervals produced by model_train() and conf_interval()? There are several things you can do, alone or in combination:
1. Run the confidence interval simulation hundreds of times and see how often the confidence interval includes 3.
2. Change the sample size from n = 5 to something else and confirm that things still work. If you make n ten-thousand times larger, that is, n = 50000, you'll observe margins of error roughly 100 times smaller, since the margin of error is proportional to 1/√n.
3. Make your own simulation, perhaps including other variables as well.
Human nature suggests that (1) will be unattractive and tedious. The computer can automate things for us. The relevant function is trials(). Each run of the simulation is called a "trial," and trials() generates as many runs of a calculation as you want. To illustrate, here is a simple simulation of rolling a die. The notation 1:6 means 1, 2, 3, 4, 5, 6.
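A minimal die-roll simulation (base R's sample() is the stand-in here):

```r
# One roll of a die: pick a single value from 1:6, each face equally likely.
sample(1:6, size = 1)
```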
According to theory, each of the possible outcomes, 1 through 6, is equally likely. Let's try this out by running 1000 trials and using wrangling on the results. To get you started, Active R chunk 4.11 shows you five trials of the die roll. This will let you see what the output of trials() looks like.
Then, change the argument to trials() from 5 to 10000, and uncomment (that is, remove the # character) the pipe and the two wrangling commands. The proportions likely won't be exactly the same for all six possible outcomes, but they will be close.
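A stand-in sketch of the full version, with replicate() in place of trials() and dplyr wrangling to tally the proportions:

```r
# 10000 die rolls, then the proportion of each outcome (each should be near 1/6).
library(dplyr)
Rolls <- data.frame(outcome = replicate(10000, sample(1:6, size = 1)))
Rolls |>
  count(outcome) |>
  mutate(proportion = n / sum(n))
```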
Now to set you up with an instrument you can use to check for yourself that the confidence intervals are on target.
Change the sample size from n = 100 to whatever you like. (It must be at least 3 to have enough data for the regression and confidence interval.)
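A sketch of such an instrument, assuming the simulation described earlier (y built from x with intercept 5 and slope 3) and using lm()/confint() as stand-ins for model_train()/conf_interval():

```r
# Run 500 trials. In each one: simulate data, fit y ~ x, and record whether
# the 95% confidence interval on the x coefficient covers the true value, 3.
n <- 100   # change this to whatever you like (at least 3)
covered <- replicate(500, {
  Sim <- data.frame(x = runif(n, min = 0, max = 10))
  Sim$y <- 5 + 3 * Sim$x + rnorm(n, mean = 0, sd = 10)
  ci <- confint(lm(y ~ x, data = Sim))["x", ]
  ci[1] <= 3 && 3 <= ci[2]
})
mean(covered)   # should come out near 0.95
```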
4.7 Resampling & bootstrapping
Resampling is a very important kind of simulation designed specifically for the construction of confidence intervals. The idea is to take a real-world data frame and use it as the source for an infinite number of specimens. The idea might seem like a fantasy at first. Indeed, one of the first academic papers describing the method, introducing it to scientists who had trouble understanding confidence intervals because of the algebra involved, was entitled "Computers and the Theory of Statistics: Thinking the Unthinkable," by Bradley Efron.
The resampling simulation is simple in concept and practice. Imagine taking a data frame, cutting it up into slips of paper, each with one row. Then put all the slips in a hat.
To generate a specimen, metaphorically pull a slip from the hat and record what it says in a new data frame. Then (this is the important part) put the slip back in the hat before drawing another specimen. In this way, you can collect as many specimens as you like, with each of them being completely faithful to the original data.
The resample() function does this for you. To illustrate, we will take the numbers 1 through 10 as the rows of our data frame. Each run of resample() gives you a new random sample. Active R chunk 4.12 will do the calculation for you, as many times as you like.
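Doing the same thing by hand shows what resample() produces; sample() with replace = TRUE is the stand-in here:

```r
# Resample the numbers 1 through 10: some values repeat, others are left out.
sample(1:10, size = 10, replace = TRUE)
```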
In practice, instead of the numbers 1 to 10, you would resample from the rows of a data frame. The command would look like Galton |> resample().
"Bootstrapping" is the name of the statistical method in which many trials of resampling are run, each of which gives the coefficients from a model. Active R chunk 4.13 shows an example, fitting the model height ~ sex + mother + father to the Galton data. Since the point of bootstrapping is to calculate confidence intervals, rather than summarizing the model with conf_interval(), we will use coef(), which reports just the .coef column that you are used to from conf_interval().
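A stand-in sketch of the bootstrap, with slice_sample(replace = TRUE) in place of resample() and lm()/coef() in place of model_train()/coef():

```r
# Five bootstrap trials: resample Galton, refit the model, collect the coefficients.
library(dplyr)
one_trial <- function() {
  Resampled <- Galton |> slice_sample(n = nrow(Galton), replace = TRUE)
  coef(lm(height ~ sex + mother + father, data = Resampled))
}
t(replicate(5, one_trial()))   # each row holds the coefficients from one trial
```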
The values under each coefficient name show the sampling variability of that coefficient. Typically, one would run about 500 trials rather than the five here.
Starting about 2010, two statistics books were written to explain confidence intervals via resampling and bootstrapping. You might have a friend who is using one of them, for example Statistical Investigations by Tintle et al. or Statistics: Unlocking the Power of Data by the Lock family.
In professional practice, bootstrapping is used for complicated kinds of models, for instance the algorithmic, machine-learning methods described by Spiegelhalter. For our own work, just use conf_interval(). But be aware, when you hear the odd term "statistical bootstrapping," that it refers to a simulation-based method for computing confidence intervals.