30  Null hypothesis testing

Statistics lies at the crossroads of two major themes: the scientific method and decision-making.


For our purposes, the key difference between the scientific method and decision-making has to do with the goal. The scientific method is about moving along the road to “truth.” Decision-making (at least, rational decision-making) is about securing the best possible result. The meaning of “best” is subjective: it depends on the objectives and values of the decision-makers and on how they evaluate trade-offs.

We encountered examples of trade-offs in Lesson 29, for example in designing a classifier to balance false positives against false negatives. An important part of the design process involves a loss function that encodes our values, together with informed assumptions about the prevalence of the condition for which the classifier is designed.

This and the following two Lessons are about a statistical process that is oriented toward the scientific method, not decision-making. It’s important to keep this in mind when deciding whether or not the process is appropriate in any given situation.

Hypothetical thinking

Hypothetical thinking is the process of working out consequences conditioned on a hypothesis, that is, a particular view of how things might be. Sometimes the word “assumption” is used rather than “hypothesis.” Examining the distinction between the two is enlightening. An assumption is “a thing that is accepted as true or as certain to happen, without proof.” In contrast, a hypothesis is “a proposition made as a basis for reasoning, without any assumption of its truth.”

Both definitions are from Oxford Languages

A reasonable person might wonder what the point is of drawing out the theoretical consequences of a proposition not necessarily believed to be true. This sounds like game playing, and it is. But it is also part of the scientific method: “the formulation, testing, and modification of hypotheses.” The “testing” part of the method involves comparing the theoretical consequences of the formulated hypothesis to what is observed. The “modification” part is about what to do when the observations are inconsistent with the theoretical consequences.

In science, the formulation of worthwhile hypotheses involves insight, expertise, imagination, and creativity. This is what guides scientists to choose the particular experiments they do and the specific observations they make. Statistics comes into play after the scientist’s creative choice has been made. Thus, the hypotheses involved in the process of statistical “hypothesis testing” are for the most part not the result of insight and creativity. (There is one exception, which we get to below.)

Various phrases are used to signal that a claim is being made through hypothetical thinking. These include: “assuming that _____,” “given that _____,” “under the ____ hypothesis”, and “conditioned on _____.” The blanks are where a specific hypothesis is inserted. Perhaps the most famous of these signals of hypothetical thinking is the phrase “under the Null.”

To drive home the centrality of the given hypothesis, I think it helps to describe statistical hypothetical thinking as taking place on different planets. Once you are on one of the statistical hypothetical planets, everything you do takes place in the context of that planet.

Start with the planet that is not hypothetical and is of acute interest to science and decision-making: Planet Earth.

Figure 30.1: Planet Earth

A statistical planet that we have been operating in throughout these lessons consists of the data we have collected from Planet Earth. Let’s call this Planet Sample; it is made of the data in our sample. All of our data analysis takes place on Planet Sample. Ideally, Planet Sample is a close match to Planet Earth, but that depends on the process used to collect the sample and the sample size \(n\).

Figure 30.2: Planet Sample

Planet Sample is where we calculate sample statistics and fit models. A hugely important statistical process that takes place on Planet Sample is our estimation of sampling variation, for example by calculating confidence intervals. That calculation involves only the sample, although it is informed by our mathematical insight, for instance, that the length of confidence intervals is proportional to \(1/\sqrt{n}\).
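The \(1/\sqrt{n}\) scaling can be checked with a small simulation. The following is only a sketch and is not taken from these Lessons: it uses base R and made-up, normally distributed data, and the helper name ci_width() is hypothetical. It compares the length of a 95% confidence interval on a mean at two sample sizes.

```r
# A minimal sketch with simulated data (an assumption, not data from the Lessons).
set.seed(101)

# Length of the 95% confidence interval on the mean of a simulated sample of size n.
ci_width <- function(n) {
  x <- rnorm(n, mean = 50, sd = 10)
  interval <- t.test(x)$conf.int
  interval[2] - interval[1]
}

width_100 <- ci_width(100)
width_400 <- ci_width(400)

# Quadrupling n should roughly halve the length, so this ratio should be near 2.
width_ratio <- width_100 / width_400
print(c(n100 = width_100, n400 = width_400, ratio = width_ratio))
```

The ratio will not be exactly 2, because each simulated sample has its own standard deviation, but it should land close to it.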

Of particular interest in this Lesson is a third planet: Planet Null. On Planet Null, variables are not related to one another and effect sizes are zero. Any pattern observed on Planet Null is purely the product of sampling variation: the accidental alignment of columns of data.

Figure 30.3: Planet Null

On Planet Sample, in contrast, there may well be connections between variables and non-zero effect sizes. We don’t know for sure, but we want to find out. Ultimately, we want to relate what is observed on Planet Sample to the workings of Planet Earth. Confidence intervals are part of that reasoning from Planet Sample back to Planet Earth. But it also pays to make a quick trip from Planet Sample to Planet Null. This is very easy to do, and our work on Planet Null might indicate that any patterns we spotted on Planet Sample could well be the result of accidental alignments from sampling variation.

How do we move from one planet to another so that we can carry out the work of hypothetical thinking? We start out, of course, on Planet Earth. Getting from there to Planet Sample is easy; just collect a sample from Planet Earth. When we work with that sample we are working on Planet Sample. But a reminder: We want Planet Sample to be as much like Earth as we can manage with the resources available. This is why it is important to avoid the kinds of bias described in Lesson 19.

There are ready means to travel from Planet Sample to Planet Null. These may not be as intuitive as the collect-a-sample rocket that takes us from Earth to Sample, and they often draw on mathematical insight that not everyone shares. Most statistics textbooks are full of formulas whose purpose is to travel to Null and do some relevant work there. However, not all students understand where the formulas come from. Adding to students’ confusion, there are different formulas for numerical and categorical data.

In the spirit of helping students understand how one can travel to Planet Null, we will focus here on a simple, intuitive, universal method of transit that applies to both numerical and categorical data. That method is “shuffling.” By this, we do not mean merely reordering the rows of a data frame at random. Instead, the shuffling that gets us to Planet Null puts the entries within a variable in random order.

To illustrate, let’s construct a small, simple data frame.

x      y
---  ---
A      1
B      2
C      3
D      4
E      5

The original data frame used for illustration.

x      y
---  ---
D      4
C      3
B      2
A      1
E      5

Shuffling the rows of the data frame leaves you on Planet Sample.

x      y
---  ---
D      1
C      2
B      3
A      4
E      5

To get to Planet Null, the shuffling is done within one or more variables.

Shuffling within a variable destroys any relationship that variable might have with any other variable in the data frame. More precisely, shuffling makes it so that any relationship that is detected is solely due to the accidents of sampling variation. To summarize:

Take the Space Shuffle to get from Planet Sample to Planet Null.
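The two kinds of shuffling illustrated in the small tables above can be sketched in base R. This is an illustration only: it uses base R's sample() in place of the shuffle() function that appears in the next Lesson, and the object names are hypothetical.

```r
# Base R sketch of the two kinds of shuffling (object names are illustrative only).
set.seed(30)
Original <- data.frame(x = c("A", "B", "C", "D", "E"), y = 1:5)

# Shuffling rows: each x stays paired with its own y, so we remain on Planet Sample.
Row_shuffled <- Original[sample(nrow(Original)), ]

# Shuffling within a variable: x is reordered independently of y (Planet Null).
Null_version <- Original
Null_version$x <- sample(Original$x)

print(Row_shuffled)
print(Null_version)
```

In Row_shuffled every letter still sits next to its original number. In Null_version the same letters and numbers appear, but which letter lands next to which number is left to chance.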

Hypothesis testing

The word “test” is familiar to all who have ever been students, but it will be helpful to have a definition. This one seems reasonable:

“A procedure intended to establish the quality, performance, or reliability of something, especially before it is taken into widespread use.” – Oxford Languages

The above is a sensible definition, and based on the definition one would expect that a “hypothesis test” will be “a procedure intended to establish the correctness or applicability of a hypothesis, especially before relying on that hypothesis to guide action in the world.” That is, a process relating to decision-making. Nevertheless, in statistics, “hypothesis testing” refers to a procedure that does not involve decision-making. A good name for the statistical procedure is “Null Hypothesis testing” or NHT for short. Other names for more-or-less the same thing are “significance testing” and “null hypothesis significance testing.”

Remember that decision-making involves selecting between competing options for action. Lesson 29 introduced a framework for such decisions which involved two competing hypotheses: (1) the patient has a condition, and (2) the patient does not have the condition. “Sensitivity” is the result of a calculation given hypothesis (1). “Specificity” is a calculation given hypothesis (2).

In contrast, NHT involves only a single hypothesis: the Null hypothesis.

It is straightforward to describe NHT in broad terms and to perform the procedure. It’s much harder to interpret properly the results from NHT. Mis-interpretation is common. This is not a trivial problem. Many of the flaws in the research literature can be blamed on mistaken interpretations of NHT. Perhaps worse, it is sometimes presented as a guide to decision-making even though NHT fails to include essential components of responsible decision-making.

NHT is a devil’s advocate procedure. Data are analyzed and a result produced in the form of a sample statistic such as a regression coefficient on an explanatory variable or an \(R^2\) or “partial \(R^2\).” If the coefficient is non-zero, it’s tempting to conclude that there is a relationship between the explanatory variable and the response variable. The devil’s advocate claims otherwise; there is no relationship and any appearance of one is just a matter of chance, the play of sampling variation.

In the simulated world of the Null hypothesis—that is, on Planet Null—many samples are generated, just as we did when exploring sampling variation in Lesson 13. Each such sample is modeled in the same way as the original data to find the regression coefficient for the explanatory variable of interest. Since the simulated data are generated in a world where there is no relationship between that explanatory variable and the response, one expects the coefficients to be near zero, deviating from zero because of sampling variation.

Next, one takes the original regression coefficient from the actual data and compares it to the many coefficients calculated from the null-hypothesis simulation trials. If the original coefficient would go unnoticed among the many coefficients from the simulation, the conclusion of the test is to “fail to reject the Null hypothesis.” On the other hand, if the original coefficient stands out clearly from the crowd of coefficients from the simulation, the conclusion is to “reject the Null hypothesis.”

Simple pictures can illustrate the meaning of “standing out clearly from the crowd.” In each of the diagrams in Figure 30.4, the sample statistic (calculated on Planet Sample, of course!) is shown as a blue line. The dots are many trials of calculating the sample statistic on Planet Null.

(a) Doesn’t stand out (p = 0.23)

(b) Stands out a little (p=0.05)

(c) Obviously standing out (p=0.001)

Figure 30.4: Examples showing the extent to which the actual sample statistic (blue line) stands out from the crowd of Planet Null trials (dots).

In statistical work, the extent to which the actual sample statistic stands out from the crowd of Planet Null trials is quantified by the “p-value,” a number between zero and one. Small p-values indicate that the actual sample statistic stands out. Lesson 31 covers the calculation of p-values.
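The whole procedure, from Planet Sample to Planet Null and back, can be sketched in a few lines of base R. Everything specific here is an assumption made for illustration: the data are simulated with a built-in slope of 0.3, the variable names are invented, and 1000 trials is an arbitrary choice. The last step measures "standing out" as the fraction of Planet Null trials at least as far from zero as the observed coefficient, which is one common simulation-based way to arrive at a p-value.

```r
# Sketch: a shuffling-based Null hypothesis test on simulated data.
# The data, the variable names, and the number of trials are all assumptions.
set.seed(2024)
n <- 200
Sim <- data.frame(x = rnorm(n))
Sim$y <- 0.3 * Sim$x + rnorm(n)   # built-in relationship with slope 0.3

# Planet Sample: the coefficient from the actual (here, simulated) data.
observed <- coef(lm(y ~ x, data = Sim))["x"]

# Planet Null: refit the same model many times with x shuffled.
null_coefs <- replicate(1000, {
  Shuffled <- Sim
  Shuffled$x <- sample(Shuffled$x)
  coef(lm(y ~ x, data = Shuffled))["x"]
})

# Fraction of Planet Null trials at least as far from zero as the observed value.
p_value <- mean(abs(null_coefs) >= abs(observed))
print(c(observed = unname(observed), p_value = p_value))
```

With a finite number of trials the fraction can come out as exactly zero. That should be read as "smaller than roughly one in a thousand," not as a true zero.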

Fail to reject?

The language “fail to reject the Null hypothesis” is stilted, but correct. Why don’t we frame the result as “accept the Null”?

I think for many people “fail to reject” and “accept” have practically the same meaning. But the word “fail” is appropriate because it points to potential flaws in the research protocol. For instance, the sample size might have been too small to reveal that the Null should be rejected.

As an illustration of “failing to reject,” let’s look at the data on heights in Galton, a sample of size \(n=898\). You can confirm for yourself that the confidence interval on the mother coefficient in the model height ~ mother does not include zero when the model is trained on the entire Galton data frame.

But suppose we had been less diligent in collecting data than Francis Galton and we had only \(n=100\). We can easily simulate this:

lm(height ~ mother, data = Galton %>% sample(size=100)) |> conf_interval()
term            .lwr   .coef    .upr
------------  ------  ------  ------
(Intercept)    38.00   60.00   82.00
mother         -0.22    0.12    0.46

The confidence interval on mother does include zero. That’s because we didn’t have enough data: we can lay that failure on not having a big enough sample size.

Those two verbal conclusions—“reject the Null” or “fail to reject the Null”—are often supplemented or replaced entirely with a number called the p-value. The p-value is the way of measuring the extent to which the original coefficient goes unnoticed among or stands out from the coefficients from the simulation. A small p-value corresponds to standing out from the simulated coefficients. Conventionally, \(p < 0.05\) corresponds to “rejecting the null,” although the convention can differ from field to field.

A highly unfortunate alternative practice replaces the phrase “rejecting the Null” with “significant result” or, more honestly, “statistically significant result.” The word “significant” is enshrined in statistical vocabulary, but has very little to do with the everyday meaning of “significant.” In Lesson 32 we advocate replacing “significant” with “discernible.”

Many researchers like to trumpet very small p-values, for instance \(p < 0.001\), to bolster their claims of “significance” (for instance, using the misleading phrase “highly significant”) or to justify a claim like “the evidence is very strong.”

What’s the purpose of NHT?

A Null hypothesis test is usually easy to conduct (see Lesson 31), which is one reason such tests are so commonly done. Many people think that the point of an NHT is to measure, using the p-value, the “strength of the evidence” for a claim. But “strength of evidence” is a dodgy concept. There are many kinds of evidence and the statistical thinker takes care to look at them all. A case in point is the prosecutor’s fallacy, where the Null hypothesis is taken to be that the accused is innocent and rejecting the Null is indicated by some pattern that would be highly unlikely if the accused were indeed innocent.

An example comes from the March 11, 2004 terrorist bombing of trains in Madrid, Spain, which resulted in the deaths of 191 people. A fingerprint found on a plastic bag in the truck that transported the terrorists pointed to an American lawyer, Brandon Mayfield, who had previously defended a terrorist convicted of supporting al-Qaeda. FBI fingerprint experts claimed that the likelihood of a mistake, given the strength of the fingerprint match, was zero: impossible. Zero is a very small p-value! Mayfield was accordingly taken into custody, even though he lived in Portland, Oregon and despite no evidence of his ever travelling to Spain or leaving the US in the previous 10 years. The case was resolved only when the Spanish National Police found another match for the fingerprint: an Algerian terror suspect.

I think it is healthier to regard NHT as something other than a measure of “strength.” Instead, I look at NHT as a kind of screening test. The malady being screened for is that the proposed pattern (for instance, a non-zero regression coefficient) is just an accident of sampling variation. Using NHT properly will weed out 95% of such accidents. We don’t have any good idea of the prevalence of such accidents among research findings, so there’s nothing to be said about the probability of a false positive or false negative.

To switch similes, think of NHT as a screen door on a house. Suppose the screen door prevents 95% of bugs from getting into the house. Does that imply that the house will be bug free?

Still, weeding out 95% of accidents is a start. It’s the least we can do.
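The screen-door point can be made with arithmetic. In the sketch below every number is hypothetical: the three prevalence values are invented, and so is the 80% chance that a genuine pattern passes the screen, since the real prevalence of accidents is unknown. The screen itself is taken to let through 5% of accidents. The function name share_accidents() is also made up for this illustration.

```r
# Hypothetical illustration: what share of results that pass the NHT screen are accidents?
share_accidents <- function(prevalence_accident, pass_rate_accident = 0.05,
                            pass_rate_genuine = 0.80) {
  accidents_passing <- prevalence_accident * pass_rate_accident
  genuine_passing   <- (1 - prevalence_accident) * pass_rate_genuine
  accidents_passing / (accidents_passing + genuine_passing)
}

# Three made-up prevalences of accidental patterns among the hypotheses studied.
prevalence <- c(0.10, 0.50, 0.90)
shares <- share_accidents(prevalence)
print(data.frame(prevalence, share_of_passing_that_are_accidents = round(shares, 3)))
```

The share of bugs inside the house swings widely with the assumed prevalence, which is why passing the screen says little on its own about how many accidents remain among published findings.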

The Alternative hypothesis

In Null hypothesis testing, there is only the one hypothesis under consideration: the Null hypothesis. Since the Null hypothesis can be enforced by shuffling, the computations for NHT can be done pretty easily, even without the formula-based probability theory found in most statistics textbooks.

There has been a controversy since the 1930s about whether hypothesis testing—in the broad sense—should involve two (or more) competing hypotheses. Before the widespread acceptance of the Bayesian approach (described in Lesson 29 and mentioned below), statisticians Jerzy Neyman and Egon Pearson proposed a two-hypothesis framework in 1933. One of their hypotheses is the familiar Null hypothesis. The other is called the “Alternative hypothesis,” a statement of a specific non-null relationship.

Returning to the planet metaphor, this would be a third statistical planet: Planet Alt. Recall that Planet Sample and Planet Null are statistical planets which correspond to the simple mechanics of sampling and shuffling, respectively. In contrast, Planet Alt is the result of scientific insight, expertise, imagination, and creativity.

Figure 30.5: Planet Alt, denoted as \(\ |\!\!|\ H_a)\), might look like this. We draw it as a cartoon planet, since any particular hypothesis is a product of the imagination.

The situation with two hypotheses would be very similar to that presented in Lessons 29 and ?sec-lesson-35. In those lessons, the two hypotheses were C and H. In developing a classifier, one starts by collecting a training sample which is a mixture of cases of C and H. But, in general, with a competition between hypotheses, \(H_0\) and \(H_a\), we don’t have any real-world objects to sample that are known to be examples of the two hypotheses. Instead, we have to create them computationally. Instances of \(H_0\) can be made by data shuffling. But instances of \(H_a\) need to be generated by some other mechanism, perhaps one akin to the DAGs we have used in these lessons.

Comparing hypotheses with Bayes’ Rule

With mechanisms to generate data from both the Null and Alternative hypotheses, we would take the statistical summary \(\mathbb{S}\) of the actual data, and compute the likelihoods for each hypothesis: \({\cal L}_{\mathbb{S}}(H_0)\) and \({\cal L}_{\mathbb{S}}(H_a)\). It should not be too controversial in a practical process to set the prior probability for each hypothesis at the same value: \(p(H_0) = p(H_a) = {\small \frac{1}{2}}\). Then, turn the crank of Bayes’ Rule (Section 29.9) to compute the posterior probabilities. If the posterior of one or the other hypothesis is much greater than \({\small \frac{1}{2}}\), we would have compelling evidence in favor of that hypothesis.

There are specialized methods of Bayesian statistics and whole courses on the topic. An excellent online course is Statistical Rethinking.
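Turning the crank of Bayes’ Rule for two hypotheses takes only a few lines. With equal priors, the posterior for the Alternative reduces to \({\cal L}_{\mathbb{S}}(H_a) / [{\cal L}_{\mathbb{S}}(H_0) + {\cal L}_{\mathbb{S}}(H_a)]\). The sketch below is an illustration under stated assumptions: the function name posterior_probs() is hypothetical and the two likelihood values are made up, since obtaining real ones would require a mechanism for generating data under \(H_a\).

```r
# Sketch of Bayes' Rule for two hypotheses. The likelihood values are made up.
posterior_probs <- function(lik_null, lik_alt, prior_null = 0.5) {
  prior_alt <- 1 - prior_null
  evidence <- lik_null * prior_null + lik_alt * prior_alt
  c(H0 = lik_null * prior_null / evidence,
    Ha = lik_alt * prior_alt / evidence)
}

# Hypothetical likelihoods of the observed summary under each hypothesis.
post <- posterior_probs(lik_null = 0.02, lik_alt = 0.18)
print(post)
```

With these invented inputs the posterior for \(H_a\) is 0.9 and for \(H_0\) is 0.1. Different priors can be tried through the prior_null argument.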

An empty Alternative

If you have studied statistics before, you likely have been exposed to NHT. Many textbook descriptions of NHT appear to make use of an “alternative hypothesis” within NHT. This style is traditional and so common in textbooks that it seems disrespectful to state plainly that it is wrong. Nevertheless, there is only one hypothesis being tested in NHT: the Null.

In the textbook presentation of NHT, the “alternative” hypothesis is not a specific claim—for instance, “the drug reduces blood pressure by 10 mmHg”. Instead, the student is given a pointless choice of three versions of the alternative. These are usually written \(H_a \neq H_0\) or as \(H_a < H_0\) or as \(H_a > H_0\), and amount to saying “the effect size is non-zero,” “the effect size is negative,” or “the effect size is positive.”

Outside of textbooks, only \(H_a \neq H_0\) is properly used. The other two textbook choices provide, at best, variations on exam questions. At worst, they are a way to put a thumb on the scale to disadvantage the Null.

31 p-value summaries

Lesson ?sec-lesson-36 introduced the logic of Null Hypothesis testing (NHT). This Lesson covers how to perform an NHT. As you will see, the everyday workhorses of NHT are regression modeling and summarization. We will use the conf_interval(), R2(), regression_summary(), and anova_summary() functions, the last three of which generate a “p-value,” which is the numerical measure of “standing out from the Planet Null crowd” illustrated in Figure 30.4.

Confidence intervals and regression “tables”

We have been using confidence intervals from early on in these Lessons. As you know, the confidence interval presents the precision with which a model coefficient or effect size can be responsibly claimed to be known.

Using a confidence interval for NHT is dead simple. Check whether the confidence interval includes zero. If so, there’s not enough precision in your measurement of the summary statistic to rule out the Null hypothesis as an account of your observation; the result is “fail to reject the Null.” On the other hand, if the confidence interval doesn’t include zero, you are justified in “rejecting the Null.”

To illustrate the use of confidence intervals for NHT, consider again the Galton data on the heights of adult children and their parents. Let’s consider whether the adult child’s height is related to his or her mother’s height.

lm(height ~ mother, data=Galton) |> conf_interval()
term            .lwr   .coef    .upr
------------  ------  ------  ------
(Intercept)    40.00   47.00   53.00
mother          0.21    0.31    0.41

The confidence interval on mother does not include zero, so the Null hypothesis is rejected.

Whenever your interest is in a particular model coefficient, I strongly encourage you to look at the confidence interval. This carries out a hypothesis test and, at the same time, tells you about an effect size, important information for decision-makers. When you do this (good for you!) you are likely to encounter a more senior researcher or a journal editor who insists that you report a p-value. The p-value corresponding to a confidence interval not including zero is \(p < 0.05\). The senior researcher or editor might insist on knowing the “exact” value of \(p\). Such knowledge is not as important as most people make it out to be, but if you need such an “exact” value, you can turn to another style of summary of regression models called a “regression table.”

Here’s a regression table on the height ~ mother model:

lm(height ~ mother, data=Galton) |> regression_summary()
term           estimate   std.error   statistic   p.value
------------  ---------  ----------  ----------  --------
(Intercept)       47.00       3.300        14.0         0
mother             0.31       0.051         6.2         0

The regression table has one row for each coefficient in the model. The model height ~ mother, for instance, has an intercept coefficient as well as a mother coefficient. The p-value for each coefficient is given in the right-most column of the regression table. Other columns in the regression table repeat information from the confidence interval. The “estimate” is the coefficient found when training the model on the data. The “std.error” is the standard error, which is half the margin of error. (See Lesson 19.) You may remember that the confidence interval amounts to

\[\text{estimate} \pm 2\! \times\! \text{standard error}\]

The “statistic” shown in the regression table is an intermediate step in the calculation of the p-value. The full name for it is “t-statistic” and it is merely the estimate divided by the standard error. (For example, the estimate for the mother coefficient is 0.31 while the standard error is 0.051. Divide one by the other and you get the t-statistic, about 6.2 in this case; the rounded values shown here give 6.1, so the table’s 6.2 presumably reflects unrounded inputs.)
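The arithmetic linking the columns of the regression table can be reproduced by hand. The sketch below uses the rounded values printed in the table above and the Galton sample size of 898 quoted in Lesson 30, so its results match the table only up to rounding. The use of a t distribution with \(n - 2\) degrees of freedom is the standard textbook choice for a model with two coefficients; it is an assumption here about what regression_summary() does internally.

```r
# Values copied (rounded) from the regression table for the mother coefficient.
estimate  <- 0.31
std_error <- 0.051
n <- 898                                # rows in Galton, as stated in Lesson 30

t_stat <- estimate / std_error          # the "statistic" column, up to rounding
ci <- estimate + c(-2, 2) * std_error   # estimate plus or minus 2 standard errors

# Two-sided p-value from the t distribution; n - 2 residual degrees of freedom
# because the model has two coefficients (intercept and mother).
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(c(t = t_stat, lower = ci[1], upper = ci[2]), 2))
print(p_value < 0.001)
```

The interval comes out at about 0.21 to 0.41, matching the confidence interval reported earlier for mother.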

A fanatic for decimal places might report the p-value on mother as \(p=0.0000000011\). That is indeed small, consistent with what we already established (\(p < 0.05\)) from a simple examination of the confidence interval. For good reason it is considered bad style to write out a p-value with so many zeros after the decimal point. Except in very specialized circumstances (e.g. gene expression measurements with hundreds of thousands of plasmids) there is no reason to do so: \(p < 0.001\) is enough to satisfy anyone that it’s appropriate to reject the Null hypothesis.

There are three good reasons to avoid writing anything more detailed than \(p < 0.001\) even when the computer tells you that the p-value is zero. First, and simplest, computer arithmetic involves round-off error, so the displayed zero is not really mathematically zero. Second, there are many assumptions made in the calculation of p-values, the correctness of which is impossible to demonstrate from data. Third, the p-value is itself a sample statistic and subject to sampling variability. A complete presentation of the p-value ought therefore to include a confidence interval. An example of such a thing is given in Figure 32.1, but they are never used in practice.

Weeding in action

Section 30.3 points out that the purpose of NHT is to screen out some of the research results that might otherwise (without NHT) be taken seriously. To illustrate the weeding-out effect of NHT, let’s simulate a situation where the Null hypothesis is true. As earlier, we enforce the Null by randomly shuffling the explanatory variable. We’ll continue the example using Galton and the model specification height ~ mother. Keep in mind: since the Null is being enforced, none of the results reflects a genuine relationship between height and mother. Like this:

lm(height ~ shuffle(mother), data=Galton) %>% conf_interval()
term                 .lwr     .coef    .upr
----------------  -------  --------  ------
(Intercept)        60.000   66.0000   73.00
shuffle(mother)    -0.093    0.0091    0.11

Now, let’s conduct many random trials and show the confidence interval for each of them.

# Run 200 Planet Null trials, keeping the confidence interval on the shuffled variable.
Trials <- do(200) * {
  lm(height ~ shuffle(mother), data=Galton) %>% 
    conf_interval() %>%
    filter(term == "shuffle(mother)")
}

# Plot each interval, coloring in red those that fail to cover zero.
Trials %>%
  mutate(misses = ifelse(.upr < 0 | .lwr > 0, "red", "black")) %>%
  ggplot(aes(x = .index, xend = .index, y = .lwr, yend = .upr)) +
  geom_segment(aes(color=misses)) +
  geom_point(aes(x=.index, y=.coef), size=0.5) +
  geom_hline(yintercept=0, color="blue") +
  scale_color_identity() +
  scale_x_continuous(breaks=c()) +
  ylab("Confidence interval on shuffle(mother).") +
  xlab("200 trials")
Figure 31.1: The confidence intervals from 200 trials of the shuffle(mother) coefficient. The ones that do not cover zero are colored in red. The black dots show the point-estimate of the coefficient.

Think of the simulation shown in Figure 31.1 as 200 researchers, each on a wild-goose chase and eager to make a name for themselves in science. If even 100 published their work, the literature would be crowded with meaningless results. The NHT screen removes 188 of the researchers from further consideration: of the 200 trials displayed in Figure 31.1, 188 include zero within the confidence interval. But the net is not perfect and some get through (shown in red). The meaning of \(p < 0.05\) is that in a situation like this, only about 5% should get through on average. In Figure 31.1, 12 of the trials evaded the net. Owing to sampling variation, that’s entirely consistent with an NHT failure rate of 5% (which would be 10 of the 200 trials).
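The claim that 12 misses is consistent with a 5% failure rate can be checked with a binomial calculation. The sketch below treats the 200 trials as independent, each with a 0.05 chance of evading the net. That binomial model is an assumption made for this illustration, as are the variable names.

```r
# Sketch: is 12 misses out of 200 surprising if each trial misses with probability 0.05?
trials <- 200
miss_rate <- 0.05
observed_misses <- 12

expected_misses <- trials * miss_rate

# Probability of 12 or more misses under a binomial model with independent trials.
prob_12_or_more <- pbinom(observed_misses - 1, size = trials, prob = miss_rate,
                          lower.tail = FALSE)

# Range of miss counts covering roughly the central 95% of such experiments.
typical_range <- qbinom(c(0.025, 0.975), size = trials, prob = miss_rate)

print(expected_misses)
print(prob_12_or_more)
print(typical_range)
```

Twelve misses falls comfortably inside the typical range, so there is no reason to think the screen is letting through more than its nominal 5%.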


We’ve seen that conf_interval() and regression_summary() can be used to conduct NHT. Now let’s go into the reason why there are more model summaries used for NHT: R2() and anova_summary().

There is an important setting where confidence intervals cannot be used to perform NHT. That is when it is not a single coefficient that is of interest, but a group of coefficients. As an example, consider the data in College_grades, which contains the authentic transcript information for all students who graduated in 2006 from a small liberal-arts college in the US.

The data have been anonymized, so that it is not possible to identify individual students, professors, or departments. The data are used with permission of the college’s registrar.

The numerical response variable, gradepoint, corresponds to the letter grade of each student in each course taken. This is the sort of data used to calculate grade-point averages (GPA), a statistic familiar to college students! The premise behind GPA is that students are different in a way that can be measured by their GPA. If that’s so, then the student (encoded in the sid variable) should account for the variation in gradepoint.

We can use gradepoint ~ sid to look at the matter. In the following code block, we fit the model and look at the confidence interval on each student’s GPA. We’ll show just the first 6 out of the 443 graduating students.

GPAs <- lm(gradepoint ~ sid - 1, data=College_grades) |> conf_interval()
GPAs |> head()
term         .lwr   .coef   .upr
----------  -----  ------  -----
sidS31185     2.1     2.4    2.8
sidS31188     2.8     3.0    3.3
sidS31191     2.9     3.2    3.5
sidS31194     3.1     3.4    3.6
sidS31197     3.1     3.3    3.6
sidS31200     1.9     2.2    2.5

Interpreting a report like this is problematic. For instance, consider the first and last students (sidS31185 and sidS31200) in the report. Their gradepoints are different (2.4 versus 2.2) but only if you ignore the substantial overlap in the confidence intervals. (Colleges don’t ever report a confidence interval on a GPA. Perhaps this is because they don’t believe there is any component to a grade that depends on anything but the student.)

A graphical display, as in Figure 31.2, helps put the 443 confidence intervals in context. The graph shows that only roughly 100 of the weaker students and 100 of the stronger students have a GPA that is discernible from the college-wide average. In other words, the middle 60% of the students are indistinguishable from one another based on their GPA.

# Order students by the upper end of their interval and number them for plotting.
orderedGPAs <- GPAs |> arrange(.upr) |> mutate(row = row_number())
orderedGPAs |> ggplot() +
  geom_segment(aes(x=row, xend=row, y=.lwr, yend=.upr), alpha=0.3) +
  # Draw a solid error bar for every 20th student.
  geom_errorbar(aes(x=row, ymin=.lwr, ymax=.upr), data = orderedGPAs |> filter(row %% 20 == 0)) +
  geom_hline(yintercept=3.41, color="blue")

Figure 31.2: Confidence intervals on the GPA of 443 students at a small college. The overall average grade is shown in blue. Solid black error bars are drawn for every 20th student to remind you that the gray region is actually discrete vertical lines, one for each student.

Another way to answer the question of whether sid accounts for gradepoint is to look at the \(R^2\):

lm(gradepoint ~ sid, data=College_grades) |> R2()
    n     k   Rsquared     F   adjR2    p   df.num   df.denom
-----  ----  ---------  ----  ------  ---  -------  ---------
 5700   440       0.32   5.6    0.27    0      440       5200

The \(R^2\) is only about 0.3, so the student-to-student differences account for roughly a third of the variation in the course grades. This is perhaps not as much as might be expected, but there are so many grades in College_grades that the p-value is small, leading us to reject the Null hypothesis that gradepoint is independent of sid.

The process by which \(R^2\) is converted to a p-value involves a clever summary statistic called F. For our purposes, it suffices to say that \(R^2\) can be translated to a p-value, taking into account the sample size \(n\) and the number of coefficients (\(k\)) in the model.
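For readers curious about the translation, here is a sketch of the standard whole-model F calculation, applied to the rounded numbers in the gradepoint ~ sid report above. It assumes the model includes an intercept, so the denominator degrees of freedom are \(n - k - 1\), and that R2() follows this textbook formula. Because the inputs are rounded, the result agrees with the report only approximately.

```r
# Sketch: from R-squared to F to a p-value, using the rounded values in the report above.
n  <- 5700    # rows (rounded in the report)
k  <- 440     # model coefficients, not counting the intercept (rounded)
R2_value <- 0.32

df_num   <- k
df_denom <- n - k - 1

# F compares variance explained per coefficient to
# residual variance per remaining degree of freedom.
F_stat <- (R2_value / df_num) / ((1 - R2_value) / df_denom)

# p-value: probability of an F this large or larger on Planet Null.
p_value <- pf(F_stat, df_num, df_denom, lower.tail = FALSE)

print(round(F_stat, 1))
print(p_value < 0.001)
```

The F value comes out near 5.6, in line with the F column of the report. The same arithmetic shows why a modest \(R^2\) can still give a tiny p-value: with thousands of rows, the denominator degrees of freedom are large.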

Carrying out such tests on a whole model—with many coefficients—is so simple that we can look at another possible explanation for gradepoint: that it depends on the professor. The iid variable encodes the professor. There are 364 different professors who assigned grades to the Class of 2006.

lm(gradepoint ~ iid, data=College_grades) |> R2()
    n     k   Rsquared     F   adjR2    p   df.num   df.denom
-----  ----  ---------  ----  ------  ---  -------  ---------
 5700   360       0.17   3.1    0.12    0      360       5300

We can reject the Null hypothesis that the professor is unrelated to the variation in grades.

But let’s be careful: maybe the professor only appears to account for the variation in grades. It might be that students with low grades cluster around certain professors, and students with high grades cluster around other professors. We can explore this possibility by looking at a model that includes both students and professors as explanatory variables.

lm(gradepoint ~ sid + iid, data=College_grades) |> R2()
    n     k   Rsquared     F   adjR2    p   df.num   df.denom
-----  ----  ---------  ----  ------  ---  -------  ---------
 5700   800       0.48   5.6    0.39    0      800       4900

Alas for the idea that grades are only about the student, the R2 report shows that, even taking into account the variability in grades associated with the students, there is still substantial variation associated with the professors.

This process of handling groups of coefficients (442 for students and 358 for professors) using R2 can be so useful that a special format of report, called the ANOVA report, is available. Here it is for the college grades:

lm(gradepoint ~ sid + iid, data=College_grades) |> anova_summary()
term           df   sumsq   meansq   statistic   p.value
----------  -----  ------  -------  ----------  --------
sid           440     650     1.50         6.8         0
iid           360     310     0.86         4.0         0
Residuals    4900    1100     0.22          NA        NA

This style of report provides a quick way to look at whether an explanatory variable is contributing to a model. The ANOVA report takes each variable in turn and asks whether, given the variables already in the model, the next variable explains enough of the remaining variance to be regarded as something other than an accident of sampling variation. In the report above, sid is associated with a very small p-value. Then, granting sid credit for all the variation it can explain, iid still shows up as a strong enough “eater of variance” that the Null hypothesis is rejected for it as well.
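The "each in turn" logic can be seen with base R's `anova()`, which also credits variables in the order they appear in the model formula. This is a minimal sketch on the built-in `mtcars` data; `wt` and `hp` are stand-ins chosen for illustration and have nothing to do with College_grades, and `anova()` is used here in place of `anova_summary()`.

```r
# Sequential (term-by-term) ANOVA with base R's anova(), on the built-in mtcars data.
# wt and hp are illustrative stand-ins for sid and iid.
first_wt <- anova(lm(mpg ~ wt + hp, data = mtcars))
first_hp <- anova(lm(mpg ~ hp + wt, data = mtcars))

print(first_wt)
print(first_hp)

# Sum of squares credited to hp when it enters first versus after wt
ss_hp_first <- first_hp["hp", "Sum Sq"]
ss_hp_after <- first_wt["hp", "Sum Sq"]
print(c(hp_first = ss_hp_first, hp_after_wt = ss_hp_after))
```

The residual row is the same in both reports; only the way the explained variation is divided between the two variables depends on the order.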

Example: Explaining book discounts

To illustrate how ANOVA can be used to build a model, consider the moderndive::amazon_books data frame. The Amazon company has a reputation for discounting books. Let’s see if we can detect patterns in how much (or how little) discount they give. To start, we will wrangle a new variable, discount, into the amazon_books data. This will give the discount in percent. A negative discount means that the Amazon price is higher than the publisher’s list price.

Books <- moderndive::amazon_books %>% 
  mutate(discount = 100 * (1 - amazon_price / list_price))

Now we can check which variables account for the variation in the discount. For instance, discount might depend on list_price (say, cheap books are hardly discounted), and perhaps on hard-cover vs paperback, the number of pages, and the weight of the book. We can throw in many variables on a first pass through the problem.

lm(discount ~ hard_paper + num_pages + thick + weight_oz + list_price, data = Books) |>
  anova_summary()
term           df   sumsq   meansq   statistic    p.value
-----------  ----  ------  -------  ----------  ---------
hard_paper      1    1050     1050       3.900   4.91e-02
num_pages       1    6430     6430      23.800   1.70e-06
thick           1    1340     1340       4.960   2.66e-02
weight_oz       1     198      198       0.733   3.93e-01
list_price      1     124      124       0.459   4.98e-01
Residuals     308   83100      270          NA         NA

Each variable is given a p-value. Of the ones here, hard_paper, num_pages and thick have \(p < 0.05\). That suggests including only those variables in a model.

tentative_model <- lm(discount ~ hard_paper + num_pages + thick, data = Books)
tentative_model |> R2()
   n    k    Rsquared          F       adjR2         p   df.num   df.denom
----  ---  ----------  ---------  ----------  --------  -------  ---------
 321    3   0.0912905   10.61545   0.0826907   1.1e-06        3        317

The three variables account for just 9% of the variation in discount. Not a very powerful explanation!

STILL IN DRAFT: Traditional tests

Statistics textbooks usually include several different settings for “hypothesis tests.” I’ve just pulled a best-selling book off my shelf and find listed the following tests spread across eight chapters occupying about 250 pages.

  • hypothesis test on a single proportion
  • hypothesis test on the mean of a variable
  • hypothesis test on the difference in mean between two groups (with 3 test varieties in this category)
  • hypothesis test on the paired difference (meaning, for example, measurements made both before and after)
  • hypothesis test on counts of a single categorical variable
  • hypothesis test on independence between two categorical variables
  • hypothesis test on the slope of a regression line
  • hypothesis test on differences among several groups
    • use Births2015 and check whether month or day of the week explain the daily variation.
  • hypothesis test on R2

As statistics developed, early in the 20th century, distinct tests were developed for different kinds of situations. Each such test was given its own name, for example, a “t-test” or a “chi-squared test.” Honoring this history, statistics textbooks present hypothesis testing as if each test were a novel kind of animal.

In these Lessons, we’ve focussed on one method, regression, rather than introducing all sorts of different formulas and calculations which, in the end, are just special cases of regression. Nonetheless, most people who are taught statistics were never told that the different methods fit into a single unified framework. Consequently, they use different names for the different methods.

These traditional names are relevant to you because you will need to communicate in a world where people learned them. You have to be able to recognize those names and know which regression model they refer to. In the table below, we will use different letters to refer to different kinds of explanatory and response variables.

  • x and y: quantitative variables

  • group: a categorical variable with multiple (\(\geq 3\)) levels.

  • yesno: a categorical variable with exactly two levels (which can always be encoded as a zero-one quantitative variable)

Model specification   Traditional name
--------------------  ----------------------------------------------------------------------------------
y ~ 1                 t-test on a single mean
yesno ~ 1             p-test on a single proportion
y ~ yesno             t-test on the difference between two means
yesno1 ~ yesno2       p-test on the difference between two proportions
y ~ x                 t-test on a slope
y ~ group             ANOVA test on the difference among the means of multiple groups
y ~ group1 * group2   Two-way ANOVA
y ~ x * yesno         t-test on the difference between two slopes (note the *, indicating interaction)
y ~ group + x         ANCOVA
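One row of the table can be checked directly in base R. The sketch below compares the classical equal-variance two-sample t-test with the regression y ~ yesno, using the built-in `mtcars` data (`mpg` as y and the zero-one variable `am` as yesno, both chosen only for illustration). Note that `t.test()` defaults to the Welch unequal-variance version, so `var.equal = TRUE` is needed for an exact match.

```r
# Equal-variance two-sample t-test versus the regression mpg ~ am (built-in mtcars data).
# am is a zero-one variable, playing the role of yesno in the table.
t_result   <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
reg_result <- summary(lm(mpg ~ am, data = mtcars))$coefficients

p_ttest      <- t_result$p.value
p_regression <- reg_result["am", "Pr(>|t|)"]

print(c(t_test = p_ttest, regression = p_regression))
```

The two p-values agree because the pooled-variance t-test and the regression on a zero-one variable are the same calculation under different names.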

Another named test, the z-test, is a special kind of t-test where you know the variance of a variable without having to calculate it from data. This situation hardly ever arises in practice, and mostly it is used as a soft introduction to the t-test.

Still another test, named ANCOVA, is considered too advanced for inclusion in traditional textbooks. It amounts to looking at whether the variable group helps to account for the variation in y in the model y ~ group + x.

P-values and covariates

Use cancer/grass-treatment example from Lesson 24 to illustrate how failing to think about covariates before the study analysis can lead to false discovery.

Use age in marriage data.

So, standard operating procedures were based on the tools at hand. We will return to the mismatch between hypothesis testing and the contemporary world in Lesson 32.

\[ \begin{array}{cc|cc} & & \textbf{Test Conclusion} & \\ & & \text{do not reject } H_0 & \text{reject } H_0 \text{ in favor of } H_A \\ \hline \textbf{Truth} & H_0 \text{ true} & \text{Correct Decision} & \text{Type 1 Error} \\ & H_A \text{ true} & \text{Type 2 Error} & \text{Correct Decision} \\ \end{array} \]

A Type 1 error, also called a false positive, is rejecting the null hypothesis when \(H_0\) is actually true. Since we rejected the null hypothesis in the gender discrimination (from the Case Study) and the commercial length studies, it is possible that we made a Type 1 error in one or both of those studies. A Type 2 error, also called a false negative, is failing to reject the null hypothesis when the alternative is actually true. A Type 2 error was not possible in the gender discrimination or commercial length studies because we rejected the null hypothesis.

The chi-squared test

MAKE THIS A MORE NEUTRAL DESCRIPTION SHOWING two settings where Chi-squared is appropriate: Is a die fair? Does the observed frequency of phenotypes follow the theoretical frequency of genotypes, as in Punnett squares.

THEN, in the exercises, show poisson regression.

Most statistics books include two versions of a test invented around 1900 that deals with counts at different levels of a categorical variable. This chi-squared test is genuinely different from regression. And, in theoretical statistics the chi-squared distribution has an important role to play.

The chi-squared test of independence could be written, in regression notation, as group1 ~ group2. But regression does not handle the case of a categorical response variable with multiple levels.

However, in practice the chi-squared test of independence is very hard to interpret except when one or both of the variables have two levels. This is because there is nothing analogous to model coefficients or effect size that comes from the chi-squared test.

The tendency in research, even when group1 has more than two levels, is to combine groups to produce a yesno variable. Chi-squared can be used with the response variable being yesno and almost all textbook examples are of this nature.

But for a yesno response variable, a superior, more flexible and more informative method is logistic regression.
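To make the contrast concrete, here is a minimal sketch in base R using the built-in `mtcars` data, with the zero-one variables `am` (as the response) and `vs` (as the explanatory variable) chosen purely for illustration. The chi-squared test yields only a p-value, while logistic regression also yields an effect size (an odds ratio) with a confidence interval. The interval below is the simple Wald interval from `confint.default()`, one of several possible choices.

```r
# Chi-squared test versus logistic regression for a yesno response (built-in mtcars data).
# am (response) and vs (explanatory) are zero-one variables used only for illustration.
counts <- table(vs = mtcars$vs, am = mtcars$am)
chisq_result <- chisq.test(counts)

logistic <- glm(am ~ vs, data = mtcars, family = binomial)
odds_ratio    <- unname(exp(coef(logistic)["vs"]))
odds_ratio_ci <- exp(confint.default(logistic)["vs", ])

print(chisq_result$p.value)  # the only output of the chi-squared test
print(odds_ratio)            # effect size from the logistic regression
print(odds_ratio_ci)         # Wald 95% interval on the odds ratio
```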

32 Putting p-values in context

Lesson ?sec-lesson-19 presented an example about the effect of COVID on childhood development. We quoted from a news article from The Economist summarizing one study that looked at infant and toddler vocal interactions with parents:

“During the pandemic the number of such "conversations" declined. … [g]etting lots of interaction in the early years of life is essential for healthy development, so these kinds of data "are a red flag".”

Part of statistical thinking involves replacing vague words like “lots” and “essential” with quantitative measures. The next score of Lessons introduced methods for extracting quantitative information from data and ways to present meaningful information to human decision-makers.

This Lesson is about one way of formatting information—statistical significance—that is widely used throughout the sciences and appears widely in the press, but which often obscures much more than it reveals. The “statistically significant” format appears, for instance, in a statement from the research report on which the above quote was based.

“Children from the COVID-era sample produced significantly fewer vocalizations than their pre-COVID peers.”

The link between “fewer vocalizations” and “healthy development” is based on a highly cited research paper which states that “The amount of parenting per hour and the quality of the verbal content associated with that parenting were strongly related to the social and economic status of the family and the subsequent IQ of the child,” and characterizes the correlation between IQ and a toddler’s exposure to verbal content as “highly significant.”

Betty Hart and Todd Risley (1992) “American parenting of language-learning children: Persisting differences in family-child interactions observed in natural home environments” Developmental Psychology 28(6) 1096-1105

Significance does not measure magnitude

The magnitude of the link between vocalizations and IQ is expressed by an effect size which has units of IQ-points per vocalization-point. Many researchers prefer, however, to “standardize” both variables so that they have mean 0 and a standard deviation of 1 with no units. Hart & Risley reported an effect size of 0.63 relating standardized vocalization score to standardized IQ.

Such an effect size, on standardized variables, is called the “correlation” or, in the context of multiple regression, the “partial correlation.”

Imagine that the social importance of a finding such as in Hart and Risley is that it introduces the prospect of a child-development intervention where parents are trained and encouraged to improve the level of verbal interaction with their infants and toddlers. The improvement in verbal interaction would—if the effect size reported is truly a causal connection—create an improvement in IQ.

The impact of educational interventions is often measured in units of “SDs.” For instance, if school reading scores have a mean of 100 and a standard deviation of 15, then an intervention—giving out books for kids to read at home, say—which increases the mean score to 105 would have an effect size of 1/3 SD, since the mean gain (5 points) is one-third the standard deviation of the reading scores. Examination of interventions in education indicates that the median effect size is 0.1 SD and the 90th-percentile is at 0.4 SDs.

Suppose we are mildly optimistic about the impact of an intervention to improve parent-toddler verbal interaction and imagine that the effect size might be around 0.2 SD. This improvement would then be filtered through the Hart & Risley effect size (0.63), to produce a standardized change in IQ of \(0.2 \times 0.63 = 0.13\) SD. Consider the impact this would have on a child at the 10th-percentile of IQ. After the intervention, the child could be expected to move to the 12.5-percentile of IQ. That is not a big change; certainly not enough to justify characterizing child-parent interactions as “essential” to development, and not one to merit calling the link “highly significant” if significant is interpreted in its everyday meaning.
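The percentile arithmetic can be reproduced in a few lines of base R. This sketch assumes that IQ is normally distributed, which the text does not state explicitly; the variable names are illustrative.

```r
# Percentile shift for a child starting at the 10th percentile of IQ,
# assuming IQ is normally distributed (an assumption made for this sketch).
intervention_sd <- 0.2    # supposed effect of the intervention on verbal interaction, in SDs
iq_per_vocal_sd <- 0.63   # Hart & Risley standardized effect size
shift_sd <- intervention_sd * iq_per_vocal_sd

start_z <- qnorm(0.10)                         # z-score of the 10th percentile
new_percentile <- 100 * pnorm(start_z + shift_sd)
print(round(new_percentile, 1))
```

Under these assumptions the result lands between the 12th and 13th percentile, in line with the figure quoted above.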

More recent work suggests the effect size of childhood language exposure on “language outcomes in late childhood” is about 0.40-0.50 and about 0.25 when looking at language outcomes at age 50. (Gilkerson J, Richards JA, Warren SF, et al. Language Experience in the Second Year of Life and Language Outcomes in Late Childhood. Pediatrics. 2018;142(4):e20174276)

For the decision-maker trying to evaluate how to allocate resources, the magnitude of an effect carries important information. The p-value combines the effect size and the sample size. Whenever the effect is not exactly zero, the p-value can be made as small as you like by making the sample size large enough.

::: {.callout-note} ## Example: p-value and sample size

To explore how the p-value depends on sample size, let’s use dag01 which implements a simple linear relationship between two variables, x and y.

x ~ exo()
y ~ 1.5 * x + 4 + exo()

The formula for y shows an effect size of x on y of 1.5.

We can also estimate the effect size from a data sample from dag01. We anticipate getting the same result, but there will be sampling variation. For instance:

lm(y ~ x, data = sample(dag01, size=100)) %>% conf_interval()
term               .lwr      .coef       .upr
------------  ---------  ---------  ---------
(Intercept)    3.855413   4.053851   4.252289
x              1.466651   1.658881   1.851111

Using a larger sample will reduce sampling variation; the confidence interval will be tighter but the effect size will remain in about the same place.

lm(y ~ x, data = sample(dag01, size=400)) %>% conf_interval()
term               .lwr      .coef       .upr
------------  ---------  ---------  ---------
(Intercept)    3.835342   3.936412   4.037483
x              1.440449   1.542529   1.644609

The p-value works differently. It depends systematically on the sample size. For example:

lm(y ~ x, data = sample(dag01, size=10)) %>% regression_summary() %>% filter(term == "x")
term     estimate   std.error   statistic     p.value
-----  ----------  ----------  ----------  ----------
x       0.8699924   0.5572954    1.561098   0.1571238

lm(y ~ x, data = sample(dag01, size=20)) %>% regression_summary() %>% filter(term == "x")
term    estimate   std.error   statistic   p.value
-----  ---------  ----------  ----------  --------
x       1.652977   0.2331389    7.090097   1.3e-06

lm(y ~ x, data = sample(dag01, size=30)) %>% regression_summary() %>% filter(term == "x")
term    estimate   std.error   statistic   p.value
-----  ---------  ----------  ----------  --------
x       1.490128   0.1839867    8.099108         0

To show the relationship more systematically, Figure 32.1 shows many trials involving taking a sample from dag01 and calculating a p-value.

Figure 32.1: Many trials of the p-value from the model y ~ x using data from dag01. As the sample size becomes larger, the p-value becomes very small. There is also a lot of sampling variation.

Each of these trials involves taking a sample from a system whose effect size is 1.5. Yet the p-values become tremendously small even at moderate sample sizes. Notice that at a given sample size, the p-value can differ by a factor of 100 or more between samples.
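A simulation along these lines can be sketched in base R without dag01. The code below is not the code behind Figure 32.1: it assumes that `exo()` produces standard normal noise and uses `rnorm()` in its place, and the helper `one_p_value` is a hypothetical name.

```r
# Spread of p-values from y ~ x at several sample sizes.
# Assumption: exo() is taken to be standard normal noise, so rnorm() stands in for it.
set.seed(101)
one_p_value <- function(n) {
  x <- rnorm(n)
  y <- 1.5 * x + 4 + rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
}

sizes <- c(5, 10, 20, 40)
p_summary <- sapply(sizes, function(n) {
  p <- replicate(200, one_p_value(n))
  c(n = n, lowest = min(p), median = median(p), highest = max(p))
})
print(t(p_summary))
```

Each row summarizes 200 trials at one sample size. The median falls steeply as \(n\) grows, while the gap between the lowest and highest p-value in a row shows the sampling variation.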

The “Power” of a test

Notice in Figure 32.1 that many of the trials fail to meet the standard of \(p < 0.05\) even though all of the trials involve samples from a system where the Null hypothesis is false. For example, with a sample size of \(n=5\), only about half of the trials generated \(p<0.05\). In contrast, when \(n=10\), almost all of the trials produced \(p < 0.05\). The probability that a sample from a system with a given effect size will produce \(p<0.05\) is called the “power” of the test.
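Power can be estimated by simulation: generate many samples from a system with a known effect size and count how often \(p < 0.05\). The sketch below does this in base R, again under the assumption that the noise in dag01 is standard normal; `estimate_power` is a hypothetical helper, not a function from these Lessons.

```r
# Simulation estimate of power: the share of samples giving p < 0.05.
# Assumption: the noise terms in dag01 are treated as standard normal (rnorm).
set.seed(202)
estimate_power <- function(n, effect = 1.5, trials = 2000, alpha = 0.05) {
  rejected <- replicate(trials, {
    x <- rnorm(n)
    y <- effect * x + 4 + rnorm(n)
    summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < alpha
  })
  mean(rejected)
}

power_n5  <- estimate_power(5)
power_n10 <- estimate_power(10)
print(c(n5 = power_n5, n10 = power_n10))
```

Changing the `effect` argument shows that power depends on the effect size as well as on the sample size: a smaller effect needs a larger sample to reach the same power.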

NHT as a screening test?

Null Hypothesis Testing was invented to serve a genuine but very limited need of the research workplace. The workplace need is this: When working with small samples it is plausible to imagine that the results we get are just the outcome of sampling variation, the play of chance. If the methods for working with the small samples frequently cause us to mistake sampling variation for genuine, meaningful findings of scientific import, science workers would waste a lot of time on wild goose chases and the scientific enterprise would progress only slowly.

Lesson ?sec-lesson-35 discussed screening tests: imperfect but low-cost procedures intended to separate low-risk patients from those at a higher risk. Keep in mind that “higher risk” does not mean “high risk.” In the breast-cancer example in Section 29.10, we found that digital mammography separates a population whose risk is 2% into two sub-populations: those with a \(\mathbb{P}\) result have a risk of 11% of cancer, those with a \(\mathbb{N}\) result have a 0.2% risk.

It is useful for understanding the potential problems of NHT to reframe it as a screening test. Every screening test involves at least two hypotheses. For mammography, the hypotheses are “the patient has cancer” and “the patient does not have cancer.” In contrast, NHT involves only one hypothesis: the Null. To create a screening-test-like situation for NHT, we need to introduce another hypothesis. This second hypothesis is that, unlike the Null, something interesting or worthwhile is going on. Generically, the second hypothesis is called the “Alternative hypothesis.” This name on its own doesn’t describe the hypothesis; more on that later.

As a historical note, NHT was introduced by Ronald Fisher in 1925 in his famous book, *Statistical Methods for Research Workers*. He called it “significance testing” and it involves only the Null hypothesis (which is why we call it “Null hypothesis testing”). Already in 1928, Jerzy Neyman and Egon Pearson put forward an improvement involving both a Null and an Alternative hypothesis.

In the following, we will treat the Alternative hypothesis as akin to “no cancer” and the Null hypothesis as akin to “cancer.” Similarly, we will label “reject the Null hypothesis” as the \(\mathbb{N}\) result and “fail to reject” as \(\mathbb{P}\). Putting these together into a 2x2 table lets us label the various possibilities in terms of false positives, false negatives, and so on.

.                  Reject (\(\mathbb{N}\))   Fail to reject (\(\mathbb{P}\))
-----------------  ------------------------  -------------------------------
Null hyp.          false negative            true positive
Alternative hyp.   true negative             false positive

The false-negative result corresponds to the scientific wild-goose chase. This occurs when the Null is true but the test result is to reject the Null. The point of NHT is to reduce the frequency of false negatives by making it hard to get a \(\mathbb{N}\) result unless the Alternative hypothesis is true.

Recall that screening tests are intended to be cheap: easy to apply without imposing pain or cost or effort. Certainly, the NHT is cheap and easy. It doesn’t require any creativity or knowledge of the setting. The Null hypothesis is easy to state (“nothing is going on”) and the raw test result, the p-value, is easy to compute, for example by using a computer to shuffle the order of an explanatory variable or reading the result from a regression summary. The threshold for turning the raw test result into a \(\mathbb{N}\) test output (“reject the Null”) is also stated very simply in NHT: \(p < 0.05\).

In order to calculate the false-negative rate, we need two pieces of information:

  1. The sensitivity of the test. We will frame this as the probability of a \(\mathbb{P}\) result given that the Null is actually true. We know this exactly, since the threshold for rejection is \(p < 0.05\). This puts the sensitivity at 95%.

  2. The prevalence of the condition. This is a tough one. We are interested in how many, out of all the research projects that a research worker might consider worth undertaking, really have no effect. It is tempting to believe that a skilled researcher knows where to focus her effort, so that the prevalence of “no effect” is low. But anyone who has worked in a laboratory knows different. For example, Thomas Edison said, “Ninety-eight per cent. of genius is hard work. As for genius being inspired, inspiration is in most cases another word for perspiration.” In the following, we will use what seems to us a highly optimistic rate, and set the prevalence at 60%.

Notice that we don’t need to know the specificity of the test in order to calculate the false-negative rate. The specificity would come into play only if we wanted to know what a \(\mathbb{N}\) test has to say about the possibility of the Alternative being true.

To show the theoretically beneficial effect of NHT, let’s imagine a set of 1000 research projects that a worker might undertake.

  1. False-negative rate: Since the prevalence is 60%, 600 of these 1000 research projects are wild-goose chases. Of these, 95% will get a correct \(\mathbb{P}\) result (“fail to reject”), leaving only 5% as false negatives. So, of the 1000 research projects, \(600 \times 0.05 = 30\) will be false negatives, a rate of 3% of the original 1000 research projects.

Now, let’s consider the true-negative rate. “Negative” refers to “rejecting the Null,” while “true” means that something interesting is going on in the system under study. To calculate the true-negative rate, we need to know the specificity of the test. Specificity is the probability of a \(\mathbb{N}\) result (that is, “reject the Null”) in a world where the Alternative hypothesis is true. This is the “power” of the test. (See ?sec-power-definition.) Naturally, to calculate this probability we need some statement about what the p-value will look like when the Alternative is true. ?sec-bcm shows p-values when the Alternative is the mechanism of dag01. Setting the power to be at least 80% is an appropriate standard in scientific work. However, since NHT does not involve an Alternative hypothesis, usually the power is (unfortunately) not presented in research reports. We will stipulate here that the power (that is, the specificity) is 80%. Later, we will consider what happens if the power is lower.

  2. True-negative rate: The calculation here is very similar to that of the false-negative rate. Since the (assumed) prevalence of Null is 60%, 400 of the 1000 research projects have the potential to produce genuinely meaningful results. With the power at 80%, 320 of these 400 tests will (correctly) produce a “reject the Null” result.

Putting together the results from (1) and (2), we expect to have \(320 + 30 = 350\) of the research projects producing “reject the Null.” Imagine that these 350 projects get published because of their \(p < 0.05\) result. Of these 350, 320 are genuinely meaningful. Consequently, a researcher who reads the published literature can expect that only \(30/350 = 9\%\) would be wild-goose chases. Seen another way, 91% of the published research findings will correctly point to something of genuine interest.
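The counting argument can also be condensed into a single expression. Writing \(\pi\) for the prevalence of the Null, \(\alpha\) for the rejection threshold, and \(1-\beta\) for the power (symbols introduced here only for convenience), the fraction of “reject the Null” results that are genuinely meaningful is

\[ \frac{(1-\pi)(1-\beta)}{(1-\pi)(1-\beta) + \pi\,\alpha} = \frac{0.4 \times 0.8}{0.4 \times 0.8 + 0.6 \times 0.05} = \frac{0.32}{0.35} \approx 0.91 \]

The numerator corresponds to the 320 true negatives and the second term of the denominator to the 30 false negatives, each divided by the 1000 projects.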

That is a pretty strong recommendation in favor of NHT. A simple, low-cost screening test will stratify the research so that ninety percent of the published research will be of genuine interest. In this light, it is no wonder that NHT is a standard practice in scientific research.

EXERCISE: Calculate the fraction of published research papers that are correct if the prevalence of the null hypothesis is 90%. (It will be 64%.)

Pitfalls of p-values

The idea that a simple calculation of a p-value can produce a substantially correct research literature is attractive. But there are a number of factors that make this a fantasy. We can start with the idea that the prevalence of “no effect” is only roughly 60%. There is, in principle, an easy way to estimate the prevalence. Suppose it were required to register every hypothesis test undertaken in a research environment. The registered studies would include both those that reject the Null and those that fail to reject the Null. The prevalence will be the fraction that fail to reject the Null.

Any such system can be gamed. Researchers have strong professional incentives not to report investigations where they fail to reject the Null. Among other things, it is unusual to publish such failures; journal editors don’t want to fill their pages with reports of “nothing here.”

The workers making use of Fisher’s 1925 Statistical Methods for Research Workers were often conducting laborious bench-top or field experiments. Presumably, they tried to invest their labor on matters where they believed they had a reasonable chance of success.

Today, however, the situation is different. Data with multiple variables are readily available and testing is so easy that there is no real barrier to looking for a “significant” link by trying all of the possibilities of explanatory and response variables. This might mean that the prevalence of “no effect” is very high. In the spirit of Edison’s 98% of genius being “hard work,” consider the consequence if the prevalence of “no effect” is 98%. The calculation works this way: of 1000 hypothetical studies, 20 will be of genuine importance, 980 not. The \(p < 0.05\) criterion means that 49 of the 980 will lead to (falsely) rejecting the Null. A power of 80% implies that of the 20 important studies, 16 will lead to (correctly) rejecting the Null. Thus, the fraction of correct results among the reject-the-Null studies will be \(16/65 = 25\%\). That is, three-quarters of publishable research will be wrong. Imagine working with a research literature which consists mostly of incorrect results!
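The same calculation is easy to repeat for any assumed prevalence. The sketch below uses a hypothetical helper, `fraction_correct`, with the 0.05 threshold and 80% power used in the text as defaults.

```r
# Share of "reject the Null" results that are correct, as a function of the
# prevalence of "no effect". alpha and power default to the values used in the text.
fraction_correct <- function(prevalence, alpha = 0.05, power = 0.80) {
  true_rejections  <- (1 - prevalence) * power   # real effects that are detected
  false_rejections <- prevalence * alpha         # no-effect studies that reject anyway
  true_rejections / (true_rejections + false_rejections)
}

prevalences <- c(0.60, 0.90, 0.98)
results <- data.frame(prevalence = prevalences,
                      correct = round(fraction_correct(prevalences), 2))
print(results)
```

Passing a smaller `power` argument, for example `fraction_correct(0.98, power = 0.5)`, shows how an underpowered study lowers the share of correct results still further.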

The above calculation is based on an assumption of 98% prevalence. That assumption may well be wrong. But we do not know what the prevalence of “no effect” might be. Consequently, we have no way to evaluate the effectiveness of NHT in screening out incorrect results. Put this together with the finding that p-values, no matter how small, are not a measure of the magnitude of an effect (see Section 32.1) and it is difficult to treat p-values as an indication of scientific import. This is not to say that NHT has no use—it is still a screening test that performs a valuable service in avoiding wild-goose chases. That is, however reliable the research literature may be, it would be still less reliable if NHT were not used. Still, attaching phrases like “highly significant” to a mere rejection of the Null is unjustified.

Statistician Jeffrey Witmer, in an editorial in the Journal of Statistics Education, offers a simple solution to the problem of people misinterpreting “statistically significant” as related to the everyday meaning of “significant.” Replace the term “statistically significant” with “statistically discernible.” There is no difference between the everyday sense of “discernible”—able to be perceived—and the statistical implications. In conveying statistical information, “discernible” is more descriptive than “significant.” For example, it would be appropriate to describe the implications of a p-value \(p < 0.03\) as, “the relationship is barely discernible from the sampling variation.”

Try, try, try until you succeed

As an indication of the prevalence of “no effect,” let’s look at a 2008 study which examined the possible relationship between a woman’s diet before conception and the sex of the conceived child.

Women producing male infants consumed more breakfast cereal than those with female infants. The odds ratio for a male infant was 1.87 (95% CI 1.31, 2.65) for women who consumed at least one bowl of breakfast cereal daily compared with those who ate less than or equal to one bowlful per week.

The model here is a classifier of the sex of the baby based on the amount of breakfast cereal eaten. The effect size tells the change in the odds of a male when the explanatory variable changes from one bowlful of cereal per week to one bowl per day (or more). This effect size is sensibly reported as a ratio of the two odds. A ratio bigger than one means that boys are more likely outcomes for the one-bowl-a-day potential mother than the one-bowl-a-week potential mother. The 95% confidence interval is given as 1.31 to 2.65. Under the Null hypothesis, the odds ratio should be 1; the confidence interval doesn’t include the Null.

The confidence interval is the preferred way of conveying both the magnitude of the effect and the precision of the measurement. Since this is a Lesson about p-values, however, let’s translate the confidence interval to a less informative p-value. That the confidence interval does not include the Null hypothesis value of 1 means that \(p < 0.05\). A more detailed calculation indicates that \(p < 0.01\). In a conventional NHT interpretation, this provides compelling evidence that the relationship between cereal consumption and sex is not a false pattern.
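One common way to carry out such a translation is sketched below. It assumes the reported interval is symmetric on the log-odds scale (a normal approximation); this may not be the exact method behind the authors' figures, so the result should be read as approximate.

```r
# Approximate p-value recovered from a reported odds ratio and its 95% confidence interval.
# Assumption: the interval is symmetric on the log-odds scale (normal approximation).
odds_ratio <- 1.87
ci_lower   <- 1.31
ci_upper   <- 2.65

se_log_or <- (log(ci_upper) - log(ci_lower)) / (2 * qnorm(0.975))  # standard error of log odds ratio
z         <- log(odds_ratio) / se_log_or                           # distance from the Null, in SEs
p_value   <- 2 * pnorm(-abs(z))                                    # two-sided p-value
print(signif(p_value, 2))
```

Under this approximation the p-value comes out well below 0.01, consistent with the statement above.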

But the confidence interval is not the complete story. The authors are admirably clear in stating their methodology: “Data of the 133 food items from our food frequency questionnaire were analysed, and we also performed additional analyses using broader food groups.” In other words, the authors had available more than 133 potential explanatory variables. For each of these explanatory variables, the study’s authors constructed a confidence interval on the odds ratio. Most of the confidence intervals included 1, providing no compelling evidence of a relationship between that food item and the sex of the conceived child. As it happens, breakfast cereal produced the confidence interval that was the most distant from an odds ratio of 1.

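Carrying out more than 100 tests in order to get one significant result suggests that the prevalence of “no effect” is about 99%. Repeating the earlier calculation with this prevalence and a power of 80%, out of 1000 tests about 50 of the 990 no-effect tests would falsely reject the Null, while 8 of the 10 genuine effects would be detected. This suggests that the probability that the reported result is wrong is about 85%, not a good basis for deciding what foods to eat.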
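A related back-of-the-envelope calculation shows how easy it is to find something when this many variables are tested. The sketch assumes that the 133 tests are independent and that the Null is true for every one of them; the food items in a real questionnaire are unlikely to be exactly independent, so the numbers are only indicative.

```r
# Chance that at least one of many tests gives p < 0.05 when every Null is true.
# Assumption: the tests are independent, which real food-frequency items may not be.
n_tests <- 133
alpha   <- 0.05

p_at_least_one   <- 1 - (1 - alpha)^n_tests   # probability of one or more false rejections
expected_rejects <- n_tests * alpha           # average number of false rejections
print(c(at_least_one = p_at_least_one, expected = expected_rejects))
```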

Fiona Mathews et al. (2008) “You are what your mother eats: evidence for maternal preconception diet influencing foetal sex in humans” Proceedings of the Royal Society B 275: 1661-1668

The Alternative hypothesis

The p-value examines only one hypothesis: the Null. A p-value is a numerical presentation of the plausibility of a claim that the Null can account for the observed data. If the Null can account for the data, there is no pressing need to examine other hypotheses. But what happens when \(p<0.05\)? Of course, this means that we “reject the Null.”

In the typical textbook presentation of p-values, the Alternative plays no meaningful role. This leads to an over-interpretation of the p-value as something it is not.

The situation is diagrammed in Figure 32.2, which presents two scenarios: a) with large sample size \(n\) and b) with small sample size. In each of these scenarios, a point estimate of the effect size is made: the black dot. Following the appropriate practice, an interval estimate is also made: the confidence interval, drawn as a horizontal line.

Figure 32.2: Estimates of effect size with large and small sample size.

The Null hypothesis is a claim that the effect size is really zero. Clearly the observed effect size is different from the Null. The p-value measures how far the observed effect size is from the Null. The units in which we measure that distance are defined by the confidence interval. In scenario (a), the confidence interval is small, as expected for large sample size. In scenario (b), the confidence interval is large because \(n\) is small.

The question of whether \(p<0.05\) is equivalent to asking whether the confidence interval overlaps the Null hypothesis. In scenario (a), the confidence interval does not overlap the Null. Consequently, \(p<0.05\) and we reject the Null. But in scenario (b), we “fail to reject” the Null since the confidence interval includes the Null.

A bit more can be said about how the confidence interval and the p-value are related. In particular, for confidence intervals that don’t touch the Null, the farther the nearer end of the confidence interval is from the Null, the smaller the p-value. For example, this situation corresponds to \(p < 0.00001\).
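The relationship can be made concrete by measuring the gap between the Null and the nearer end of a 95% confidence interval in units of the interval's own length. This is a sketch under a normal approximation; the function name `p_value_from_gap` is hypothetical.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96: half a 95% CI, in standard errors

def p_value_from_gap(gap_in_ci_lengths):
    """Two-sided p-value when the nearer end of a 95% CI lies
    `gap_in_ci_lengths` full CI lengths away from the Null.

    Assumption: the estimate is normally distributed around the truth.
    """
    # The estimate sits half a CI length beyond the nearer end, so its
    # distance from the Null is Z95 * (1 + 2 * gap) standard errors.
    z = Z95 * (1 + 2 * gap_in_ci_lengths)
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail area

for gap in (0.0, 0.25, 0.5, 1.0):
    print(f"gap = {gap:4.2f} CI lengths -> p = {p_value_from_gap(gap):.2g}")
```

A gap of zero, where the interval just touches the Null, corresponds to \(p = 0.05\); the p-value falls very quickly as the gap grows.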

Many people interpret a very small p-value as meaning that the observed effect size is very far from the Null. This is true in a very limited sense: when one measures distance using the confidence interval as a ruler. Since the length of the confidence interval scales as \(1/\sqrt{n}\), making \(n\) very large also makes the distance from the Null, measured with that ruler, very large.
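To see how sample size alone can drive the p-value down, hold a small effect fixed and let \(n\) grow. The numbers below (an effect of 0.05 standard deviations) are made up for illustration, and the calculation assumes a simple mean with standard error \(\text{sd}/\sqrt{n}\).

```python
import math

def p_value_for_fixed_effect(effect, sd, n):
    """Two-sided p-value for a fixed observed effect at sample size n.

    Assumption: the standard error is sd / sqrt(n), as for a sample mean,
    and the estimate is approximately normal.
    """
    std_err = sd / math.sqrt(n)
    z = abs(effect) / std_err
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail area

# Same (hypothetical) effect size every time; only n changes.
for n in (100, 1_000, 10_000):
    p = p_value_for_fixed_effect(effect=0.05, sd=1.0, n=n)
    print(f"n = {n:6d}: p = {p:.2g}")
```

The effect is equally small in every row, yet the p-value moves from unremarkable to tiny as \(n\) increases.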

Measuring “large distance” with respect to the confidence interval leads to an over-interpretation. This is enhanced by the misleading use of the word “significant” to describe small p-values. From a decision-maker’s point of view, “large distance” ought to be measured in some way that reflects meaningful differences in the real world.

The proper use of an Alternative hypothesis involves staking a claim about real-world relevance. For instance, in considering IQ as a function of conversational turns, we might look at how a plausible intervention to improve conversational turns might translate into a change in IQ. Perhaps a compelling Alternative would be that the relationship between IQ and conversational turns should play out to at least a 5-point increase in IQ.

Another example: Imagine a study of the link between the use of a class of medicinal drugs called “statins” and a reduction in heart disease. The appropriate form for an Alternative hypothesis would be stated in natural units, perhaps like this: The Alternative is that use of statins will lead to a reduction in yearly all-cause mortality by at least 1 percentage point.

Specifying a meaningful Alternative hypothesis requires knowing how the research relates to the real world. This is often difficult and is usually a matter of judgment about which different decision-makers may disagree. Still, stating a definite Alternative at least provides an opportunity for discussion of whether that is the appropriate Alternative.

Early in a research program, as in Hart and Risley’s 1992 research into a possible link between conversational turns and IQ, it is perfectly reasonable to focus on whether a link is statistically discernible. This can be measured by a p-value. But once that initial work is done, later researchers are in a position to know the relationship between sample size and the length of the confidence interval so that they can reliably design studies that will reject the Null. At that point, the researchers should be expected to say something about the minimum size of effect that would be interesting and to describe the importance of their work in terms of whether they have detected that minimum size or bigger.
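The design step described above rests on the \(1/\sqrt{n}\) scaling of the confidence interval. A minimal sketch, assuming that scaling holds exactly and using made-up pilot numbers (the function name and the values are hypothetical):

```python
import math

def required_sample_size(pilot_n, pilot_ci_length, target_ci_length):
    """Sample size needed to shrink a confidence interval to a target length.

    Assumption: CI length is proportional to 1 / sqrt(n), so
    n_new = pilot_n * (pilot_ci_length / target_ci_length) ** 2.
    """
    return math.ceil(pilot_n * (pilot_ci_length / target_ci_length) ** 2)

# Hypothetical pilot: n = 40 gave a CI 12 IQ points long.
# Target: a CI no longer than the 5-point minimum effect of interest.
n_needed = required_sample_size(pilot_n=40, pilot_ci_length=12, target_ci_length=5)
print(f"sample size needed: {n_needed}")
```

Halving the length of the interval requires four times the sample size, which is why the minimum interesting effect size matters so much for study design.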


  1. Replication.

See the paper “a-little-replication-helps.pdf”

  2. Pre-specification of endpoint.

  3. Have a definite Alternative. With that, you can compute a Bayes factor. Commonly, Bayes factors over 20 are interpreted as strong evidence for a hypothesis, those of 3-5 as weak evidence, and intermediate values as “moderate” evidence.

  1. In the same spirit, we might simply look at the likelihood ratio, \({\cal L}_{\mathbb{S}}(H_a) \div {\cal L}_{\mathbb{S}}(H_0)\) and draw a confident conclusion only when the ratio turns out to be much greater than 1, say, 5 or 10.
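A likelihood ratio of this kind is easy to compute once the Alternative is definite. The sketch below assumes a point Alternative and a normal sampling distribution for the estimated effect; the numbers (an estimate of 4 IQ points with standard error 2, against an Alternative of 5 points) are invented for illustration.

```python
from statistics import NormalDist

def likelihood_ratio(estimate, std_err, alt_effect, null_effect=0.0):
    """Likelihood of the Alternative divided by likelihood of the Null.

    Assumptions: each hypothesis names a single effect size, and the
    estimate is normally distributed around the true effect with
    standard deviation std_err.
    """
    like_alt = NormalDist(mu=alt_effect, sigma=std_err).pdf(estimate)
    like_null = NormalDist(mu=null_effect, sigma=std_err).pdf(estimate)
    return like_alt / like_null

# Hypothetical study result compared against a 5-point Alternative.
ratio = likelihood_ratio(estimate=4.0, std_err=2.0, alt_effect=5.0)
print(f"likelihood ratio (Alternative vs Null): {ratio:.1f}")
```

An estimate exactly midway between the two hypothesized effect sizes gives a ratio of 1, meaning the data favor neither hypothesis.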