36  Hypothesis testing

Statistics lies at the crossroads of two major themes: the scientific method and decision-making.


For our purposes, the key difference between the scientific method and decision-making has to do with the goal. The scientific method is about moving along the road to “truth.” Decision-making (at least, rational decision-making) is about securing the best possible result. The meaning of “best” is subjective: it depends on the objectives and values of the decision-makers and on how they evaluate trade-offs.

We encountered examples of trade-offs in Lessons 34 and 35, for example designing a classifier to balance false positives against false negatives. An important part of the design process involves a loss function that encodes our values and our informed assumptions about the prevalence of the condition for which the classifier is designed.

This and the following two Lessons are about a statistical process that is oriented toward the scientific method, not decision-making. It’s important to keep this in mind when deciding whether or not the process is appropriate in any given situation.

Hypothetical thinking

Hypothetical thinking is the process of working out consequences conditioned on a hypothesis, that is, a particular view of how things might be. Sometimes the word “assumption” is used rather than “hypothesis.” Examining the distinction between the two is enlightening. An assumption is “a thing that is accepted as true or as certain to happen, without proof.” In contrast, a hypothesis is “a proposition made as a basis for reasoning, without any assumption of its truth.”

Both definitions are from Oxford Languages

A reasonable person might wonder what the point is of drawing out the theoretical consequences of a proposition not necessarily believed to be true. This sounds like game playing, and it is. But it is also part of the scientific method: “the formulation, testing, and modification of hypotheses.” The “testing” part of the method involves comparing the theoretical consequences of the formulated hypothesis to what is observed. The “modification” part is about what to do when the observations are inconsistent with the theoretical consequences.

In science, the formulation of worthwhile hypotheses involves insight, expertise, imagination, and creativity. This is what guides scientists to choose the particular experiments they do and the specific observations they make. Statistics comes into play after the scientist’s creative choice has been made. Thus, the hypotheses involved in the process of statistical “hypothesis testing” are for the most part not the result of insight and creativity. (There is one exception, which we get to below.)

Various phrases are used to signal that a claim is being made through hypothetical thinking. These include: “assuming that _____,” “given that _____,” “under the ____ hypothesis”, and “conditioned on _____.” The blanks are where a specific hypothesis is inserted. Perhaps the most famous of these signals of hypothetical thinking is the phrase “under the Null.”

To drive home the centrality of the given hypothesis, I think it helps to describe statistical hypothetical thinking as taking place on different planets. Once you are on one of the statistical hypothetical planets, everything you do takes place in the context of that planet.

Start with the planet that is not hypothetical and is of acute interest to science and decision-making: Planet Earth.

Figure 36.1: Planet Earth

A statistical planet that we have been operating in throughout these lessons consists of the data we have collected from Planet Earth. Let’s call this Planet Sample; it is made of the data in our sample. All of our data analysis takes place on Planet Sample. Ideally, Planet Sample is a close match to Planet Earth, but that depends on the process used to collect the sample and the sample size \(n\).

Figure 36.2: Planet Sample

Planet Sample is where we calculate sample statistics and fit models. A hugely important statistical process that takes place on Planet Sample is our estimation of sampling variation, for example by calculating confidence intervals. That calculation involves only the sample, although it is informed by our mathematical insight, for instance, that the length of confidence intervals is proportional to \(1/\sqrt{n}\).

Of particular interest in this Lesson is a third planet: Planet Null. On Planet Null, variables are not related to one another and effect sizes are zero. Any pattern observed on Planet Null is purely the product of sampling variation: the accidental alignment of columns of data.

Figure 36.3: Planet Null

On Planet Sample, in contrast, there may well be connections between variables and non-zero effect sizes. We don’t know for sure, but we want to find out. Ultimately, we want to relate what is observed on Planet Sample to the workings of Planet Earth. Confidence intervals are part of that reasoning from Planet Sample back to Planet Earth. But it also pays to make a quick trip from Planet Sample to Planet Null. This is easy to do, and the work done on Planet Null may indicate that any patterns we spotted on Planet Sample could well be the result of accidental alignments from sampling variation.

How do we move from one planet to another so that we can carry out the work of hypothetical thinking? We start out, of course, on Planet Earth. Getting from there to Planet Sample is easy; just collect a sample from Planet Earth. When we work with that sample we are working on Planet Sample. But a reminder: We want Planet Sample to be as much like Earth as we can manage with the resources available. This is why it is important to avoid the kinds of bias described in Lesson 22.

There are ready means to travel from Planet Sample to Planet Null. These may not be as intuitive as the collect-a-sample rocket that takes us from Earth to Sample, and they often draw on mathematical insight that not everyone shares. Most statistics textbooks are full of formulas whose purpose is to travel to Null and do some relevant work there. However, not all students understand where the formulas come from. Adding to the confusion, there are different formulas for numerical and for categorical data.

In the spirit of helping students understand how one can travel to Planet Null, we will focus here on a simple, intuitive, universal method of transit that applies to both numerical and categorical data. That method is “shuffling.” By this, we do not mean merely moving the rows of a data frame around at random. Instead, the shuffling that gets us to Planet Null puts the entries within a variable in random order.

To illustrate, let’s construct a small, simple data frame.

x y
A 1
B 2
C 3
D 4
E 5

The original data frame used for illustration.

x y
D 4
C 3
B 2
A 1
E 5

Shuffling the rows of the data frame leaves you on Planet Sample.

x y
D 1
C 2
B 3
A 4
E 5

To get to Planet Null, the shuffling is done within one or more variables.

Shuffling within a variable destroys any relationship that variable might have with any other variable in the data frame. More precisely, shuffling makes it so that any relationship that is detected is solely due to the accidents of sampling variation. To summarize:

Take the Space Shuffle to get from Planet Sample to Planet Null.
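For readers who want to see the Space Shuffle in code, here is a minimal sketch in base R. The toy data frame mirrors the illustration above; the names and the use of the base-R sample() function are our choices for illustration, not a required recipe.

# Toy data frame, as in the illustration above.
small <- data.frame(x = c("A", "B", "C", "D", "E"), y = 1:5)

# Shuffling whole rows keeps each x paired with its y: still Planet Sample.
still_on_sample <- small[sample(nrow(small)), ]

# Shuffling the entries *within* y breaks the pairing: welcome to Planet Null.
on_null <- small
on_null$y <- sample(on_null$y)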

Hypothesis testing

The word “test” is familiar to all who have ever been students, but it will be helpful to have a definition. This one seems reasonable:

“A procedure intended to establish the quality, performance, or reliability of something, especially before it is taken into widespread use.” – Oxford Languages

The above is a sensible definition, and based on it one would expect a “hypothesis test” to be “a procedure intended to establish the correctness or applicability of a hypothesis, especially before relying on that hypothesis to guide action in the world.” That is, a process relating to decision-making. Nevertheless, in statistics, “hypothesis testing” refers to a procedure that does not involve decision-making. A good name for the statistical procedure is “Null hypothesis testing,” or NHT for short. Other names for more-or-less the same thing are “significance testing” and “null hypothesis significance testing.”

Remember that decision-making involves selecting between competing options for action. Lessons 34 and 35 introduced a framework for such decisions which involved two competing hypotheses: (1) the patient has a condition, and (2) the patient does not have the condition. “Sensitivity” is the result of a calculation given hypothesis (1). “Specificity” is a calculation given hypothesis (2).

In contrast, NHT involves only a single hypothesis: the Null hypothesis.

It is straightforward to describe NHT in broad terms and to perform the procedure. It is much harder to interpret the results of NHT properly. Misinterpretation is common, and this is not a trivial problem: many of the flaws in the research literature can be blamed on mistaken interpretations of NHT. Perhaps worse, NHT is sometimes presented as a guide to decision-making even though it fails to include essential components of responsible decision-making.

NHT is a devil’s advocate procedure. Data are analyzed and a result produced in the form of a sample statistic such as a regression coefficient on an explanatory variable or an \(R^2\) or “partial \(R^2\).” If the coefficient is non-zero, it’s tempting to conclude that there is a relationship between the explanatory variable and the response variable. The devil’s advocate claims otherwise: there is no relationship, and any appearance of one is just a matter of chance, the play of sampling variation.

In the simulated world of the Null hypothesis—that is, on Planet Null—we generate many samples, just as we did when exploring sampling variation in Lesson 21. Each such sample is modeled in the same way as the original data to find the regression coefficient for the explanatory variable of interest. Since the simulated data are generated in a world where there is no relationship between that explanatory variable and the response, one expects the coefficients to be near zero, deviating from zero only because of sampling variation.

Next, one takes the original regression coefficient from the actual data and compares it to the many coefficients calculated from the null-hypothesis simulation trials. If the original coefficient would go unnoticed among the many coefficients from the simulation, the conclusion of the test is to “fail to reject the Null hypothesis.” On the other hand, if the original coefficient stands out clearly from the crowd of coefficients from the simulation, the conclusion is to “reject the Null hypothesis.”
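To make the procedure concrete, here is a rough sketch of the shuffling version in R. The data frame dat, its numeric variables y and x, and the choice of 1000 trials are hypothetical stand-ins for illustration, not part of any particular study.

# Coefficient on Planet Sample: fit the model to the actual data.
actual_coef <- coef(lm(y ~ x, data = dat))["x"]

# Planet Null trials: shuffle within x, refit, record the coefficient. Repeat.
null_coefs <- replicate(1000, {
  dat$shuffled_x <- sample(dat$x)
  coef(lm(y ~ shuffled_x, data = dat))["shuffled_x"]
})

# Compare: does actual_coef go unnoticed among null_coefs, or stand out?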

Simple pictures can illustrate the meaning of “standing out clearly from the crowd.” In each of the diagrams in Figure 36.4, the sample statistic (calculated on Planet Sample, of course!) is shown as a blue line. The dots are many trials of calculating the sample statistic on Planet Null.

(a) Doesn’t stand out (p = 0.23)

(b) Stands out a little (p = 0.05)

(c) Obviously standing out (p = 0.001)

Figure 36.4: Examples showing the extent to which the actual sample statistic (blue line) stands out from the crowd of Planet Null trials (dots).

In statistical work, the extent to which the actual sample statistic stands out from the crowd of Planet Null trials is quantified by the “p-value,” a number between zero and one. Small p-values indicate that the actual sample statistic stands out. Lesson 37 covers the calculation of p-values.
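In the shuffling sketch above, the p-value can be estimated directly as the fraction of Planet Null trials that are at least as far from zero as the actual sample statistic; Lesson 37 gives the fuller story.

# Two-sided p-value from the simulation sketch above: how often does
# Planet Null produce a coefficient at least as extreme as the actual one?
p_value <- mean(abs(null_coefs) >= abs(actual_coef))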

Fail to reject?

The language “fail to reject the Null hypothesis” is stilted, but correct. Why don’t we frame the result as “accept the Null?”

I think for many people “fail to reject” and “accept” have practically the same meaning. But the word “fail” is appropriate because it points to potential flaws in the research protocol. For instance, the sample size might have been too small to reveal that the Null should be rejected.

As an illustration of “failing to reject,” let’s look at the data on heights in Galton, a sample of size \(n=898\). You can confirm for yourself that the confidence interval on the mother coefficient in the model height ~ mother does not include zero when the model is trained on the entire Galton data frame.
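That confirmation can be done with the same commands used in the next chunk, applied to the full data frame (output not shown here):

lm(height ~ mother, data = Galton) |> conf_interval()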

But suppose we had been less diligent in collecting data than Francis Galton and we had only \(n=100\). We can easily simulate this:

lm(height ~ mother, data = Galton %>% sample(size=100)) |> conf_interval()
term .lwr .coef .upr
(Intercept) 38.00 60.00 82.00
mother -0.22 0.12 0.46

The confidence interval on mother does include zero. That’s because we didn’t have enough data: we can lay that failure on not having a big-enough sample size.

Those two verbal conclusions—“reject the Null” or “fail to reject the Null”—are often supplemented or replaced entirely by a number called the p-value. The p-value measures the extent to which the original coefficient goes unnoticed among, or stands out from, the coefficients from the simulation. A small p-value corresponds to standing out from the simulated coefficients. Conventionally, \(p < 0.05\) corresponds to “rejecting the Null,” although the convention can differ from field to field.

A highly unfortunate alternative practice replaces the phrase “rejecting the Null” with “significant result” or, more honestly, “statistically significant result.” The word “significant” is enshrined in statistical vocabulary, but it has very little to do with the everyday meaning of “significant.” We return to this in Lesson 38, where we advocate replacing “significant” with “discernible.”

Many researchers like to trumpet very small p-values, for instance \(p < 0.001\), to bolster their claims of “significance” (for instance, using the misleading phrase “highly significant”) or to justify a claim like “the evidence is very strong.”

What’s the purpose of NHT?

A Null hypothesis test is usually easy to conduct (see Lesson 37), which is one reason such tests are so commonly done. Many people think that the point of an NHT is to measure, using the p-value, the “strength of the evidence” for a claim. But “strength of evidence” is a dodgy concept. There are many kinds of evidence, and the statistical thinker takes care to look at them all. A case in point is the prosecutor’s fallacy, where the Null hypothesis is taken to be that the accused is innocent and rejecting the Null is indicated by some pattern that would be highly unlikely if the accused were indeed innocent.

A striking example comes from the March 11, 2004 terrorist bombing of trains in Madrid, Spain, resulting in the deaths of 191 people. A fingerprint found on a plastic bag in the truck that transported the terrorists pointed to an American lawyer, Brandon Mayfield, who had previously defended a terrorist convicted of supporting al-Qaeda. FBI fingerprint experts claimed that the likelihood of a mistake, given the strength of the fingerprint match, was zero: impossible. Zero is a very small p-value! Mayfield was accordingly taken into custody, even though he lived in Portland, Oregon, and despite no evidence of his ever travelling to Spain or leaving the US in the previous 10 years. The case was resolved only when the Spanish National Police found another match for the fingerprint: an Algerian terror suspect.

I think it is healthier to regard NHT as something other than a measure of “strength.” Instead, I look at NHT as a kind of screening test. The malady being screened for is that the proposed pattern (for instance, a non-zero regression coefficient) is just an accident of sampling variation. Using NHT properly will weed out 95% of such accidents. We don’t have any good idea of the prevalence of such accidents among research findings, so there’s nothing to be said about the probability of a false positive or false negative.

To switch similes, think of NHT as a screen door on a house. Suppose the screen door prevents 95% of bugs from getting into the house. Does that imply that the house will be bug free?

Still, weeding out 95% of accidents is a start. It’s the least we can do.

The Alternative hypothesis

In Null hypothesis testing, there is only the one hypothesis under consideration: the Null hypothesis. Since the Null hypothesis can be enforced by shuffling, the computations for NHT can be done pretty easily, even without the probability-theory formulas found in traditional textbooks.

There has been a controversy since the 1930s about whether hypothesis testing—in the broad sense—should involve two (or more) competing hypotheses. Before the widespread acceptance of the Bayesian approach (described in Lesson 35 and mentioned below), statisticians Jerzy Neyman and Egon Pearson proposed a two-hypothesis framework in 1933. One of their hypotheses is the familiar Null hypothesis. The other is called the “Alternative hypothesis,” a statement of a specific non-null relationship.

Returning to the planet metaphor, this would be a third statistical planet: Planet Alt. Recall that Planet Sample and Planet Null are statistical planets which correspond to the simple mechanics of sampling and shuffling, respectively. In contrast, Planet Alt is the result of scientific insight, expertise, imagination, and creativity.

Figure 36.5: Planet Alt, denoted \(H_a\), might look like this. We draw it as a cartoon planet, since any particular hypothesis is a product of the imagination.

The situation with two hypotheses would be very similar to that presented in Lessons 34 and 35. In those lessons, the two hypotheses were C and H. In developing a classifier, one starts by collecting a training sample that is a mixture of cases of C and H. But, in general, with competing hypotheses—\(H_0\) and \(H_a\)—we don’t have any real-world objects to sample that are known to be examples of the two hypotheses. Instead, we have to create them computationally. Instances of \(H_0\) can be made by data shuffling, but instances of \(H_a\) need to be generated by some other mechanism, perhaps one akin to the DAGs we have used in these lessons.
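As a rough illustration of what “creating instances computationally” might look like, here is a sketch in R. The sample size, variable names, and the assumed effect size of 0.5 under \(H_a\) are invented for the example.

# Invented example: generate a response under each hypothesis.
n <- 100
x <- rnorm(n)

# Under H_0: the response has no connection to x.
y_null <- rnorm(n)

# Under H_a: a specific, imagined mechanism, here an effect size of 0.5.
y_alt <- 0.5 * x + rnorm(n)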

Comparing hypotheses with Bayes’ Rule

With mechanisms to generate data from both the Null and Alternative hypotheses, we would take the statistical summary \(\mathbb{S}\) of the actual data, and compute the likelihoods for each hypothesis: \({\cal L}_{\mathbb{S}}(H_0)\) and \({\cal L}_{\mathbb{S}}(H_a)\). It should not be too controversial in a practical process to set the prior probability for each hypothesis at the same value: \(p(H_0) = p(H_a) = {\small \frac{1}{2}}\). Then, turn the crank of Bayes’ Rule (Section 35.4) to compute the posterior probabilities. If the posterior of one or the other hypothesis is much greater than \({\small \frac{1}{2}}\), we would have compelling evidence in favor of that hypothesis.1
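In symbols, following the notation above, the posterior probability for the Alternative with equal priors reduces to a simple ratio of the two likelihoods:

\[
p(H_a \mid \mathbb{S}) \;=\; \frac{{\cal L}_{\mathbb{S}}(H_a)\, p(H_a)}{{\cal L}_{\mathbb{S}}(H_0)\, p(H_0) + {\cal L}_{\mathbb{S}}(H_a)\, p(H_a)} \;=\; \frac{{\cal L}_{\mathbb{S}}(H_a)}{{\cal L}_{\mathbb{S}}(H_0) + {\cal L}_{\mathbb{S}}(H_a)},
\]

and correspondingly for \(H_0\).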

There are specialized methods of Bayesian statistics and whole courses on the topic. An excellent online course is Statistical Rethinking.

An empty Alternative

If you have studied statistics before, you likely have been exposed to NHT. Many textbook descriptions of NHT appear to make use of an “alternative hypothesis” within NHT. This style is traditional and so common in textbooks that it seems disrespectful to state plainly that it is wrong. Nevertheless, there is only one hypothesis being tested in NHT: the Null.

In the textbook presentation of NHT, the “alternative” hypothesis is not a specific claim—for instance, “the drug reduces blood pressure by 10 mmHg”. Instead, the student is given a pointless choice of three versions of the alternative. These are usually written \(H_a \neq H_0\) or as \(H_a < H_0\) or as \(H_a > H_0\), and amount to saying “the effect size is non-zero,” “the effect size is negative,” or “the effect size is positive.”

Outside of textbooks, only \(H_a \neq H_0\) is properly used. The other two textbook choices provide, at best, variations on exam questions. At worst, they are a way to put a thumb on the scale to disadvantage the Null.


  1. In the same spirit, we might simply look at the likelihood ratio, \({\cal L}_{\mathbb{S}}(H_a) \div {\cal L}_{\mathbb{S}}(H_0)\) and draw a confident conclusion only when the ratio turns out to be much greater than 1, say, 5 or 10.↩︎