29  Hypothesis testing

“Standard operating procedures” (SOPs) are methods, rules, or actions established for performing routine duties or in designated situations. SOPs are usually associated with organizations and bureaucratic operations such as hiring a new worker or responding to a report of broken equipment or a spilled chemical. SOPs are intended to routinize operations and help coordinate different components of the organization.

There are SOPs important to scientific work. For example, research with human subjects undergoes an “institutional review” SOP that ensures safety and that subjects are thoroughly informed about risk.

This Lesson is about an SOP called “hypothesis testing,” a statistical procedure intended to inform decision-making by researchers, readers, and editors of scientific publications. For example, an individual researcher or research team needs to assess whether the data collected in an investigation is adequate to serve the intended purpose. A reader of the scientific literature needs a quick way to assess the validity of claims made in a report. A journal editor needs a straightforward means to screen submitted manuscripts to check that the claims are supported by data and that the claims are novel to the journal’s field. The hypothesis testing SOP is designed to serve these needs. The words “adequate,” “quick,” and “straightforward” in the previous sentences correctly reflect the tone of hypothesis testing.

Science is often associated with ingenuity, invention, creativity, deep understanding, and the quest for new knowledge. Perhaps understandably, “SOP” rarely appears in reports of new scientific findings. Unfortunately, failing to see “hypothesis testing” as an “SOP” results in widespread misunderstanding and inappropriate use of the results from hypothesis testing.

The Null hypothesis

One of the best descriptions of hypothesis tests comes from the 1930s, when they were just starting to gain acceptance. Ronald Fisher, who can be credited as the inventor, wrote this description, using the name he preferred: “significance test.”

[Significance testing] is a technical term, standing for an idea very prevalent in experimental science, which no one need fail to understand, for it can be made plain in very simple terms. Let us suppose, for example, that we have measurements of the stature of a hundred Englishmen and a hundred Frenchmen. It may be that the first group are, on the average, an inch taller than the second, although the two sets of heights will overlap widely. … [E]ven if our samples are satisfactory in the manner in which they have been obtained, the further question arises as to whether a difference of the magnitude observed might not have occurred by chance, in samples from populations of the same average height. If the probability of this is considerable, that is, if it would have occurred in fifty, or even ten, per cent. of such trials, the difference between our samples is said to be ”insignificant.” If its probability of occurrence is small, such as one in a thousand, or one in a hundred, or even one in twenty trials, it will usually be termed ”significant,” and be regarded as providing substantial evidence of an average difference in stature between the two populations sampled. In the first case the test can never lead us to assert that the two populations are identical, even in stature. We can only say that the evidence provided by the data is insufficient to justify the assertion that they are different. In the second case we may be more positive. We know that either our sampling has been exceptionally unfortunate, or that the populations really do differ in the sense indicated by the available data. The chance of our being deceived in the latter conclusion may be very small and, what is more important, may be calculable with accuracy, and without reliance on personal judgment. Consequently, while we require a more stringent test of significance for some conclusions than for others, no one doubts, in practice, that the probability of being led to an erroneous conclusion by the chances of sampling only, can, by repetition or enlargement of the sample, be made so small that the reality of the difference must be regarded as convincingly demonstrated.” (Emphasis added.)

The possibility that “a difference of the magnitude observed … occurred by chance” came to be called the “Null hypothesis.”

The hypothesis testing SOP centers around the Null hypothesis. To understand what the Null is, it may help to start by pointing out what it is not. Consider this mainstream definition of the scientific method:

“a method of procedure that has characterized natural science since the 17th century, consisting in systematic observation, measurement, and experiment, and the formulation, testing, and modification of hypotheses.” - Oxford Languages

It would be easy to conclude from the definition that “testing … of hypotheses” is central to science. However, the “hypotheses” involved in the scientific method are not the “Null hypothesis” that is implicated in the statistical “hypothesis testing” SOP.

A famous example of a scientific hypothesis is Newton’s law of gravitation from 1666. This hypothesis was tested in various ways: predictions of the orbits of the planets and moons around planets, laboratory detection of minute attractions in experimental apparatus, and so on. In the mid-1800s, it was observed that the movement of Mercury was not entirely in accord with Newton’s law. This is an example of scientific testing of a hypothesis; Newton’s law failed the test. In response, various theoretical modifications were offered, such as the presence of a hidden planet called Vulcan. These were ultimately unsuccessful. However, in 1915, Einstein published a modification of Newton’s gravitation called “the theory of general relativity.” This correctly accounts for the motion of Mercury. Additional evidence (such as the bending of starlight around the sun observed during the 1919 total eclipse) led to the acceptance of general relativity. The theory has continued to be tested, for example looking for the actual existence of “black holes” predicted by general relativity.

The Null hypothesis is different. The same Null hypothesis is used in diverse fields: biology, chemistry, economics, geology, clinical trials of drugs, and so on, more than can be named. This is why the Null is taught in statistics courses rather than as a principle of science.

A common sort of question in widely ranging fields is whether one variable is related to another. To be concise, we will call the two variables Y and X. The statistical method often used to address this question is regression modeling: Y ~ X. The Null hypothesis is that Y and X are unrelated, that is, that the X coefficient is zero. Throughout these Lessons, you have tested the Null using confidence intervals: Does the confidence interval on the X coefficient contain zero? If not, your data provide evidence that X and Y are related.

Relationships, differences, and effects

We have been using the general term “relationship” to name the connection between Y and X. Other words are also used.

For example, when X is categorical, it effectively divides the data frame into groups of specimens. A relationship between Y and X can then be stated in everyday terms: “Are the groups different in terms of their Y values?”

When X is quantitative, a relationship between Y and X can be phrased, “Does X have an effect on Y?”

The Null hypothesis in statistics is the presumption that there are no group differences due to categorical X or, similarly, that there is no effect of X on Y. We can test the Null. To illustrate, consider Y to be the height of the children represented in the Galton data frame and X to be the sex of the child. The Null is that the sexes do not differ by height. Here’s the test:

Galton |> 
  model_train(height ~ sex) |>
  conf_interval()
term .lwr .coef .upr
(Intercept) 63.873518 64.110162 64.346806
sexM 4.789798 5.118656 5.447513

The confidence interval on sexM does not contain zero. The Galton data refute the presumption that the two groups do not differ in height. In the language of statistical hypothesis testing, one “rejects the Null hypothesis.” The hypothesis testing process is identical when X is quantitative. For instance, does the mother’s height have a non-zero effect on the child’s height?

Galton |> 
  model_train(height ~ mother) |>
  conf_interval()
term .lwr .coef .upr
(Intercept) 40.2951224 46.6907659 53.0864094
mother 0.2134437 0.3131795 0.4129153

The confidence interval on mother does not include zero, so one “rejects the Null hypothesis.”

If a confidence interval includes zero, then we can’t rule out (based on our data) that there might be no-difference/no-effect. The language used in hypothesis testing is: “We fail to reject the Null.”

It might have been better if the Null hypothesis were called the “null presumption.” That would properly put more distance between the statistical test of a presumption and the sort of genuine, contentful hypotheses used to define the “scientific method.”

The phrase, “We fail to reject the Null,” however, hits the nail right on the head. When you take a fair test in school, failing the test indicates that your understanding or knowledge is inadequate. A failed test says something about the student, not about the truth or falsity of the contents of the test itself.

Similarly, a hypothesis test is not really about the Null hypothesis. Instead, it is a test of the researcher’s method for experiment, measurement, data collection (e.g. sample size), analysis of the data (e.g. consideration of covariates), and so on. The test determines whether these methods are fit for the purpose of scientific discovery. The passing grade is called “reject the Null.” The failing grade is “fail to reject the Null.”

It’s common sense that your research methods should be fit for the purpose of scientific discovery. If you can’t demonstrate this, then there is no point in continuing down the same road in your research. You might decide to change your methods, for instance increasing the sample size or guarding more carefully against contamination or other experimental pitfalls. Or, you might decide to follow another avenue of research.

Few readers or journal editors are interested in a report that your research methods are not fit for purpose. Consequently, the demonstration of methodological fitness—rejecting the Null—is often a requirement for publication of your work.

There are rare occasions when there is genuine interest in demonstrating no difference between groups or no effect of one variable on another. On these occasions, a hypothesis test is misplaced. Even if your research methods are sound, you would properly fail to reject the Null. Taken literally, the test result would (wrongly) suggest that your methods are unsound. Instead, it’s appropriate to demonstrate the fitness of your methods in a setting where there is an actual difference or effect to detect. (This issue will come up again when we look at Neyman-Pearson hypothesis testing.)

Formats for NHT results

Hypothesis tests were originally (and still are in some cases) called “significance tests.” They are also called “Null hypothesis tests” or even “Null hypothesis significance tests.” We will use the abbreviation NHT.

Only two qualitative statements are allowed for conveying the results of NHT: “reject the Null” or “fail to reject the Null.”

Confidence intervals provide a valid quantitative statement of the NHT result: an interval excluding zero corresponds to “reject the Null,” an interval incorporating zero indicates that the work “fails to reject the Null.” Of course, the primary role of confidence intervals is to indicate the precision of your measurement of the difference/effect-size. It’s a bonus that confidence intervals fit in with the NHT SOP.

However, the use of confidence intervals to quantify NHT has become common only in the last few decades, perhaps because NHT was introduced before confidence intervals were invented.

A widespread way to quantify the result of NHT is a number called a “p-value” that is between zero and one. A small p-value, near zero, signifies rejection of the Null hypothesis. Typically, “small” means less than 0.05, but other values are preferred in some fields. The numerical value of “small” is called the “significance level.” Often, instead of “reject the Null,” reports state that the results are “significant at the 0.05 level” or at whatever significance level is used for the field of research. Even more concisely, in place of “reject the Null,” many researchers like to say that their results are “significant.” Such researchers also tend to replace “fail to reject the Null” with “non-significant.”

There have been persistent calls by statisticians to stop using the word “significant” in NHT because it can easily mislead. The ordinary, everyday meaning of “significance” tricks people into thinking that “statistically significant” results are also “important,” “useful,” or “notable” in practice. NHT is merely an SOP for documenting that research methods are fit for purpose, not a reckoning that the results have practical importance. Understandably, scientists are flattered by the misleading implications of “significance.” For journalists, quoting a scientist’s claim of “significance” is a magic wand to charm the unaware reader into concluding that a news item is worth reading. Consider, for instance, a clinical trial of a drug where the confidence interval points to a reduction in high blood pressure by 0.5 to 1.5 mmHg. This reduction is so trivial that the drug has no medical use. However, since the confidence interval excludes zero, the reduction can be reported as “significant” in the technical sense of NHT. A genuine demonstration of practical significance requires a large effect size, not merely a narrow confidence interval.

This confusing situation could be avoided entirely by switching from “significant” to a word that conveys the correct meaning. For example, statistician Jeffrey Witmer has proposed that the word “discernible” be used in its place, as in, “The difference between groups is statistically discernible,” or, “We found a discernible difference.”

Calculating “significance”

Let’s return to Ronald Fisher’s account of “significance testing” given in Section 29.1. In the paragraph quoted there, he wrote:

The chance of our being deceived [by sampling variation] may be calculable with accuracy, and without reliance on personal judgment.

How is this calculation to be performed? Fisher gives this description, which follows the paragraph quoted in Section 29.1.

The simplest way of understanding quite rigorously, yet without mathematics, what the calculations of the test of significance amount to, is to consider what would happen if our two hundred actual measurements were written on cards, shuffled without regard to nationality, and divided at random into two new groups of a hundred each. This division could be done in an enormous number of ways, but though the number is enormous it is a finite and a calculable number. We may suppose that for each of these ways the difference between the two average statures is calculated. Sometimes it will be less than an inch, sometimes greater. If it is very seldom greater than an inch, in only one hundredth, for example, of the ways in which the sub-division can possibly be made, the statistician will have been right in saying that the samples differed significantly.

Fisher wrote before the availability of general-purpose computers. Consequently, for his technical work he relied on algebraic formulas. Standard statistical textbooks offer half a dozen such formulas, which makes the p-value seem technically difficult and highly precise. In fact, the underlying logic is straightforward, and the precision suggested by the formulas is illusory. Or, as Fisher continued,

Actually, the statistician does not carry out this very simple and very tedious process, but his conclusions have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method.

With software, the “tedious process” can easily be carried out. First, we’ll imagine Fisher’s two hundred actual measurements in the form of a modern data frame, which, lacking Fisher’s actual playing cards, we’ll simulate:

height nationality
68.0 French
68.5 English
71.5 French
68.0 French
68.0 English
69.0 English
      ... for 200 men altogether
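Here is a minimal sketch of one way such a data frame could be generated with ordinary R tools (tibble() and rnorm()). The group means of 69.2 and 68.4 inches and the 2.5-inch standard deviation are illustrative assumptions chosen for this sketch, not Fisher's data.

library(dplyr)
set.seed(101)
Height_data <- tibble(
  nationality = rep(c("English", "French"), each = 100),
  # assumed population means (illustrative only), rounded to the nearest half inch
  height = round(2 * rnorm(200,
                           mean = ifelse(nationality == "English", 69.2, 68.4),
                           sd = 2.5)) / 2
)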

The calculation of the difference in average heights between the nationalities is computed in the way we have used so often in these Lessons:

Height_data |> 
  model_train(height ~ nationality) |>
  conf_interval() |>
  select(term, .coef)
term .coef
(Intercept) 69.2083333
nationalityFrench -0.7844203

In our sample, the Frenchmen are shorter than the Englishmen by about 0.8 inches on average.

Let’s continue with the process described by Fisher, and “shuffle without regard to nationality, and divide at random into two new groups of a hundred each.”

Height_data |> 
  model_train(height ~ shuffle(nationality)) |>   # "shuffling without regard to nationality" and "dividing at random", all in one step!
  conf_interval() |>                              # and calculate the mean difference in heights
  select(term, .coef)
term .coef
(Intercept) 69.000
shuffle(nationality)French -0.072

You can see that the shuffling has produced a much smaller coefficient than we got from the actual data. (Had we not dropped the .lwr and .upr columns with select(), we would also see that the confidence interval on the shuffled coefficient includes zero, as expected when nationality is randomized.)

Fisher instructed us to do this randomization “in an enormous number of ways.” In our language, this means to do a large number of trials in each of which random shuffling is performed and the coefficient calculated. Like this:

Trials <-
  Height_data |> 
  model_train(height ~ shuffle(nationality)) |>
  conf_interval() |>
  trials(500) |>
  filter(term == "shuffle(nationality)French")

What remains is to compare the results from the shuffling trials to the coefficient nationalityFrench that we got with the non-randomized data: -0.8 inches.

Trials |>
  filter(abs(.coef) >= abs(-0.8)) |>
  nrow()
[1] 7

In only 7 of 500 trials did the shuffled data produce a coefficient as large in magnitude as the one observed in the non-randomized data. Again, to quote Fisher, “If it is very seldom greater than an inch [0.8 inches in our data], in only one hundredth of the [trials], the statistician will have been right in saying that the samples differed significantly.” More conventionally nowadays, and at Fisher’s recommendation, a threshold of one in twenty (or, equivalently, 25 in 500 trials) is reckoned adequate to declare “significance.”
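The same comparison can be expressed directly as a proportion rather than a count. A small sketch using the Trials data frame from above:

Trials |>
  # fraction of shuffling trials at least as extreme as the actual coefficient;
  # here, 7 out of 500, or 0.014
  summarize(prop = mean(abs(.coef) >= abs(-0.8)))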

Of course, remember that “[significance] is a technical term.” There is nothing in the calculation to suggest that the “significant” result is important for any practical purpose. For instance, knowing that Frenchmen are on average 0.8 inch shorter than Englishmen would not enable us to predict from a man’s height whether he is French or English.

The p-value

In the previous section, we calculated that in 7 of 500 trials the shuffled coefficient is at least as big in magnitude as the 0.8 difference seen in the actual data. This fraction—7 of 500, or 0.014—is now called the p-value. For regression modeling, there are formulas to find the p-values for coefficients without conducting many random trials. conf_interval() will show these p-values if you request it.

Height_data |> 
  model_train(height ~ nationality) |>
  conf_interval(show_p = TRUE)
term .lwr .coef .upr p.value
(Intercept) 68.775492 69.2083333 69.6411751 0.0000000
nationalityFrench -1.422611 -0.7844203 -0.1462299 0.0162548

The p-value on the intercept is effectively zero. As it should be: no amount of shuffling the height data will produce an average English height of 0 inches! The p-value on nationalityFrench is p = 0.016. That’s a little bigger than 7 out of 500 (= 0.014), but simulation results are always random to some extent.

The p-value calculated from the formula seemingly has no such random component, yet we know that every summary statistic, even a p-value, has sampling variation. To paraphrase Fisher, “[p-values have] no justification beyond the fact that they agree with those [produced by simulation].”

A p-value can be calculated for any sample statistic by the shuffling method. There are also formulas for other statistics; for instance, here is the formula-based p-value for R2:

Height_data |> 
  model_train(height ~ nationality) |>
  R2()
n k Rsquared F adjR2 p df.num df.denom
200 1 0.0288174 5.875146 0.0239124 0.0162457 1 198

It is not an accident that the p-value on R2 reported from the model is essentially identical to the p-value on the model’s only non-intercept coefficient: when there is just one explanatory term, the two tests are equivalent.
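For the curious, here is a sketch of why, using base R’s lm() on the simulated Height_data rather than model_train(): with a single explanatory term, the square of the coefficient’s t-statistic equals the model’s F statistic, so the corresponding p-values must agree.

mod <- lm(height ~ nationality, data = Height_data)
# t-statistic squared for the one non-intercept coefficient ...
summary(mod)$coefficients["nationalityFrench", "t value"]^2
# ... equals the F statistic for the whole model
summary(mod)$fstatistic["value"]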

Hypothesis test with confidence interval

Confidence intervals had not come into widespread use when Fisher wrote the material quoted above. But confidence intervals provide a shortcut to hypothesis testing, at least when it comes to model coefficients. Simply check whether the confidence interval includes zero. If so, the conclusion is “failure to reject the Null hypothesis.” But if zero is outside the range of the confidence interval, “reject the Null.”
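As a minimal sketch, the check can be automated for any conf_interval() output; the only new ingredient is testing whether .lwr and .upr have the same sign, which is exactly the condition for the interval to exclude zero.

Height_data |> 
  model_train(height ~ nationality) |>
  conf_interval() |>
  # same sign on both ends means zero is outside the interval
  mutate(result = ifelse(.lwr * .upr > 0,
                         "reject the Null",
                         "fail to reject the Null"))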

The confidence interval hypothesis test does not always agree exactly with the test as done using a p-value. But the precision of formula-based p-values is illusory. Many statisticians recommend using confidence intervals instead of p-values, particularly because they provide information about the effect size. That’s been our practice throughout these Lessons.

As it happened, Fisher denigrated the idea of confidence intervals. In this, he was utterly out of step with what is now mainstream statistics.

Power and the alternative hypothesis

NHT was introduced early in the 1920s. By the end of the decade an extension was proposed that incorporated into the reasoning a second, scientific hypothesis called the “Alternative hypothesis.” We will call the extended version of hypothesis testing NP, after the two statisticians Jerzy Neyman (1894-1981) and Egon Pearson (1895-1980).

The point of NP was two-fold:

  1. To provide some guidance in interpreting a “fail to reject the Null” result.
  2. To guide scientists in designing studies, for example, deciding on an appropriate sample size.

Recall, in the context of Y ~ X, that the Null hypothesis is the presumption that Y and X are not connected to one another. In modeling terms, the Null is that the coefficient on X is zero.

The alternative hypothesis can also be framed in terms of the coefficient on X. In its simplest form, the alternative is a specific, non-zero numerical value for that coefficient. One purpose of the alternative is to convey what motivates the research. For instance, a study of a drug that reduces blood pressure might have as its alternative that the drug reduces pressure, on average, by 10 mmHg. In many cases, the alternative is set to the smallest effect size or group difference that would be of interest in applying the research. (You’ll see the logic behind “smallest” in a bit.)

Another purpose for the alternative hypothesis is to deal with situations where the Null or something like it might actually be true. In such situations, the result of NHT will be to “fail to reject the Null.” It would be nice to know, however, whether the failure should be ascribed to inadequate methods or to the Null being true.

Stating an alternative hypothesis draws, ideally, on expertise in the subject matter and on the hoped-for implications of the research should it succeed.

The alternative is framed before data are collected; it is part of the SOP of study design. For our purposes here, it suffices to think of the alternative as being implemented as a simulation of the sort discussed in Lesson 14, built to embody the smallest effect of interest and incorporating what is known of subject-to-subject variation.

Setting an appropriate sample size is an important part of the study design phase. For the sake of economy, the sample size should be small. But to have better precision—i.e., tighter confidence intervals—a larger sample size is better. One way to resolve this trade-off is in the spirit of NHT: aim for a precision that is just tight enough to make it likely that the Null will be rejected.

Likelihood, as we saw in Lesson 16, is a probability calculated given a stated hypothesis. The relevant hypothesis here is the alternative hypothesis. To find the likelihood of rejecting the Null for a proposed sample size, run many trials of the simulation and carry out the NHT calculations for each. Then count the proportion of trials in which the Null is rejected. This fraction, a number between zero and one, is called the “power.”

Example: Get out the vote!

Consider the situation of the political scientists who designed the study in which the Go_vote data frame was assembled.

To carry out the study, they needed to decide how many postcards to send out. To inform this decision they looked at existing data from the 2004 primary election to determine the voting rate:

Go_vote |> count(primary2004) |>
  mutate(proportion = n / nrow(Go_vote))
primary2004 n proportion
abstained 183098 0.5986216
voted 122768 0.4013784

A turn-out rate of 40%.

Next, the researchers would speculate about what the effect of the postcards might be. Such speculation can be informed by previous work in the field, which is one reason that research reports often contain a “literature survey.” Here’s an excerpt from the literature survey in the journal article reporting the experiment and its results:

Prior experimental investigation of publicizing vote history to affect turnout is extremely limited. Our work builds on two pilot studies, which appear to be the only prior studies to examine the effect of providing subjects information on their own vote history and that of their neighbors (Gerber et al. 2006). These two recent experiments, which together had treatment groups approximately [2000 voters], found borderline statistically significant evidence that social pressure increases turnout.

Such experiments indicated an increase in turnout of approximately 1-2 percentage points. For demonstration purposes, let’s set the alternative hypothesis to be an increase by 1 percentage point from the baseline voting level of 40% observed in the 2004 primary. This gives us the essential information to build the simulation implementing the alternative hypothesis.

Alternative_sim <- datasim_make(
  postcard <- bernoulli(n, prob = 0.33,                                  # send a postcard to one-third of households in the experiment
                        labels = c("control", "card")),
  vote <- bernoulli(n, prob = ifelse(postcard == "card", 0.41, 0.40),    # postcard recipients vote at a simulated rate of 0.41; the control group at 0.40
                    labels = c("abstained", "voted"))
)

A single trial of the simulation and the follow-up analysis of the data looks like this. We will start with an overall sample size of n=1000.

set.seed(102)
Alternative_sim |> sample(n=1000) |>
  model_train(zero_one(vote, one="voted") ~ postcard) |>
  conf_interval()
term .lwr .coef .upr
(Intercept) -0.4965456 -0.2729759 -0.0516453
postcardcontrol -0.4777669 -0.2075090 0.0636514

We are interested only in the coefficient on postcard, specifically whether the confidence interval excludes zero, corresponding to rejecting the Null hypothesis. Let’s run 500 trials of the simulation:

Sim_results <- Alternative_sim |> sample(n = 1000) |>
  model_train(zero_one(vote, one = "voted") ~ postcard) |>
  conf_interval() |>
  trials(500) |>
  filter(term == "postcardcontrol")
.trial term .lwr .coef .upr
1 postcardcontrol -0.430 -0.160 0.1000
2 postcardcontrol -0.250 0.013 0.2800
3 postcardcontrol -0.084 0.190 0.4600
4 postcardcontrol -0.190 0.077 0.3500
5 postcardcontrol -0.540 -0.280 -0.0098
      ... for 500 trials altogether.

We can count the number of trials in which the confidence interval on postcardcontrol excludes zero. Multiplying .lwr by .upr gives a positive number whenever both are on the same side of zero. The power for the simulated sample size is the fraction of trials that exclude zero.

Sim_results |> 
  mutate(excludes = (.lwr * .upr) > 0) |>
  summarize(power = mean(excludes))
power
0.072

This is a power of about 7%.

Ideally, the power of a study should be close to one. This can be accomplished, for any alternative hypothesis, by making the sample size very large. However, large sample sizes are expensive or impractical, so researchers have to settle for power less than one. A power of 80% is considered adequate in many settings. Why 80%? SOP.

In the voting simulation with n=1000, the power is about 7%. That’s very small compared to the target power of about 80%. In the exercises, you can explore how big a sample size is needed to reach 80%.

Recall that the Go_vote study looked at whether postcards sent to registered voters led to an increase in the rate of voting. A simulation isn’t the only way to calculate the power; in this simple setting it can also be done using algebra.
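For instance, base R’s power.prop.test() handles the two-proportion case. The sketch below assumes, for simplicity, two equal-sized groups (the actual design sent postcards to only about one-third of households), so the numbers are approximate.

# Power at 500 per group (1000 overall) for detecting 0.40 vs 0.41:
power.prop.test(n = 500, p1 = 0.40, p2 = 0.41, sig.level = 0.05)
# ... reports a power of roughly 5%, in the same ballpark as the simulation.

# Per-group sample size needed to reach 80% power:
power.prop.test(p1 = 0.40, p2 = 0.41, power = 0.80, sig.level = 0.05)
# ... roughly 38,000 per group.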


Hypothesis testing interpreted by a Bayesian

Null hypothesis testing (NHT) and Neyman-Pearson (NP) have similarities.

  1. Both are centered on the Null hypothesis, and for both that hypothesis amounts to a claim that the coefficient on X is zero in the model Y ~ X.

  2. Both produce a result that has two possible values: “reject the Null” or “fail to reject the Null.” Often, this result is stated as a p-value.

  3. In both, the test result refers only to the Null hypothesis. Once the alternative and sample size have been selected, NP works the same as NHT: at this point, only the Null is involved in the calculations.

  4. NP involves an alternative hypothesis which is used only to assess the “power” of the null hypothesis test. The concept of power doesn’t apply in NHT since there is no alternative hypothesis in NHT. In NP, the calculation of power does not refer to the data actually collected. Power is calculated in the setup to the study, prior to the collection of data. Power is typically used to guide the selection of sample size.

There are both similarities and differences between NHT/NP and Bayesian reasoning.

  1. All three forms involve a summary of the data. This summary might be a model coefficient, or an R2, or sometimes something analogous to these two.

  2. Bayesian analysis always involves (at least) two hypotheses. It is perfectly reasonable to use the Null as one of the hypotheses and the alternative as the other. For comparison to NHT and NP, we will use those two hypotheses.

  3. The output of the Bayesian analysis is the posterior odds of the Alternative hypothesis. The posterior odds of the Null come for free, since the odds of the Null are simply the reciprocal of the odds of the Alternative. The odds refer to both of the hypotheses.

Consider the Bayes formula for computing the posterior probability in odds form: \[\text{posterior odds for Alternative} = \text{Likelihood ratio}_{A/N}\text{(summary)} \times \text{prior odds for Alternative}\]

Neither NHT nor NP makes any reference to prior odds. NHT doesn’t even involve an Alternative hypothesis. NP does involve an Alternative, but it contributes not at all to the outcome of the test.

Any statement of prior odds necessarily refers to the beliefs of the researchers. NHT and NP are often regarded as more objective, since beliefs don’t enter into the calculation. Or, at least, they don’t enter explicitly: presumably the reason the researchers took on the study in the first place is some level of subjective belief that the Alternative is a better description of the real-world situation than the Null.

Even though the prior odds are subjective, the likelihood ratio is not. The likelihood ratio multiplies the prior odds to produce the posterior odds. In the realm of medical diagnosis, a likelihood ratio of 10 or greater is considered “strong evidence” in favor of the Alternative. The bigger the likelihood ratio, the stronger the claim of the Alternative hypothesis.

Deeks, J. J., & Altman, D. G. (2004). “Statistics notes: Diagnostic tests 4: likelihood ratios.” British Medical Journal, 329(7458), 168-169.
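To make the arithmetic concrete: starting from prior odds of 1 (a fifty-fifty prior), a likelihood ratio of 10 gives posterior odds of 10 for the Alternative, that is, a posterior probability of 10/11 ≈ 0.91.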

Since the likelihood ratio encodes what the data have to say, irrespective of prior beliefs, it’s tempting to look at NHT and NP with an eye to a possible analog of the likelihood ratio. For reference, let’s write the likelihood ratio in terms of the individual likelihoods:

\[\text{Likelihood ratio}_{A/N}\text{(summary)} = \frac{{\cal L}_{Alt}(summary)}{{\cal L}_{Null}(summary)}\]

It turns out that the p-value is in the form of a likelihood. The assumed hypothesis is the Null.

\[\text{p-value} = {\cal L}_{Null}(??)\] The ?? has been put in \({\cal L}_{Null}(??)\) to indicate that the quantity does not exactly refer to the actual data. Instead, for NHT and NP, the ?? should be replaced by “the summary or more extreme.” For simplicity, let’s refer to this as \(\geq summary\), with the p-value being

\[\text{p-value} = {\cal L}_{Null}(\geq summary)\] For NP, we can also refer to another likelihood: \({\cal L}_{Alt}(\geq summary)\). The NHT/NP analog to the likelihood ratio is

\[\frac{{\cal L}_{Alt}(\geq summary)}{{\cal L}_{Null}(\geq summary)} = \frac{{\cal L}_{Alt}(\geq summary)}{\text{p-value}} \approx \frac{0.5}{\text{p-value}}\ .\]

In NP, it would be straightforward to calculate \({\cal L}_{Alt}(\geq summary)\). The calculation would be just like how power is calculated: many trials of generating data from the simulation and summarizing it. For the power, NP summarizes the trial by checking whether the Null is rejected. To find \({\cal L}_{Alt}(\geq summary)\) summarize the trial by comparing it to the value stated for the Alternative. Typically this will be a number near 0.5. (In the hypothetical world where the Alternative is true, a model coefficient is about equally likely to be greater or less than the value set for the Alternative.)

To draw an inference in favor of the Alternative hypothesis, we want the likelihood ratio to be large, say 10 or higher. This can happen if the p-value is small. In both NHT and NP, a standard threshold is \(p < 0.05\). Plugging \(p=0.05\) into the NHT/NP analog to the likelihood ratio gives \(0.5/0.05 = 10\).

Thus, the p-value in NHT and NP is closely related in form to a likelihood ratio, with the standard cutoff of \(p < 0.05\) corresponding to a likelihood ratio of 10.

An NHT is always straightforward to carry out, since the form of the Null doesn’t involve any knowledge of the area of application. NP forces the researcher to state an Alternative hypothesis and have a way to simulate it (either with random number generators or, in some cases, with algebra). The Alternative is a stake in the ground, a statement of the effect size that would be interesting to the researchers. NHT lacks this statement. But in either NHT or NP, the p-value translates to an approximate odds ratio.

In engineering-like disciplines, it’s often possible to make a reasonable statement about the prior odds. In basic research, the situation is sketchier. All three forms of hypothetical reasoning—Bayes, NP, and NHT—produce something very much like a likelihood ratio, with a ratio of 10 corresponding to a p-value of 0.05.

The textbook version of the alternative hypothesis

Almost all introductory textbooks cover hypothesis testing. Formally, they cover the NP style, in that they bring an Alternative hypothesis into the discussion. But there is a serious shortcoming. Stating a meaningful Alternative requires knowledge of the field of study, yet textbooks prefer mathematical discussion to field-specific discussion. Perhaps this reflects the background of many introductory statistics instructors: mathematics.

As a substitute for a genuine, field-specific Alternative, it’s conventional to offer an “anything but the Null” Alternative. For instance, if the Null is that the relevant model coefficient is zero, the Alternative is stated as “the coefficient is non-zero.” This is regrettable in two ways. First, it misleads students into thinking that an Alternative hypothesis in science is something mathematical rather than field-specific. Second, it’s not possible to calculate a power when the Alternative is “anything but the Null.”

Also regrettable is the attempt made by such textbooks to create a role for the Alternative other than the calculation of power. After all, why would one mention an Alternative if it has nothing to do with the calculations? So textbooks have created an alternative to the anything-but-the-Null Alternative: that the Alternative is “anything greater than the Null.” In other words, the made-up, pseudo-useful Alternative amounts to “a model coefficient greater than zero.” This sort of Alternative translates easily into the corresponding p-value calculation: take the p-value from the “anything but the Null” situation and divide it by two.
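To illustrate the divide-by-two relationship, here is a small sketch using base R’s t.test() on made-up data (not part of the Lessons’ model_train() toolkit):

set.seed(1)
y <- rnorm(50, mean = 0.3)                   # simulated data with a positive true mean
t.test(y)$p.value                            # two-sided: "anything but the Null"
t.test(y, alternative = "greater")$p.value   # one-sided: half the two-sided p-value
                                             # (whenever the estimate falls on the "greater" side)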

Giving researchers a license to divide their p-values by two at whim distorts the meaning of a p-value. Better to report a real p-value: no division by two.