We are spending this week on a topic that constitutes about one-quarter of the consensus Stat 101 course (e.g. “AP Statistics”). We are calling the topic “Null hypothesis testing” (NHT), but other names are also used:
- “Significance testing”—the name used by the originator of the method, Ronald Fisher, when he introduced it in 1926.
- “Null hypothesis significance testing” (NHST)—what to say if you can’t decide between calling it NHT or “significance testing.”
- “Hypothesis testing,” the name almost always used in statistics textbooks, but which is misleading in that it suggests something broader than NHT actually is.
Our agenda today is to describe the terminology of NHT and give an example of an NHT calculation (which is easy to do with any statistical software at all).
NHT involves a quantity called the “p-value,” which is a number between zero and one.
We have talked previously about tests that give a result of either \(\mathbb P\) or \(\mathbb N\). In NHT, the test results are stated differently: either “reject the Null” or “fail to reject the Null.”
Once you have the numerical p-value, translation into test results is trivial: if \(p < 0.05\), the conclusion is “reject the Null.” Otherwise, that is, if \(p \geq 0.05\), the conclusion is “fail to reject the Null.” This is admittedly stilted language, and we owe you an explanation for why things are this way. That will come later.
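To make the rule concrete, here is a minimal sketch in R; the p-value plugged in is the one that will appear in the nkids example below.

```r
# The decision rule: compare the p-value to 0.05
p_value <- 0.0001372              # the value from the example below
if (p_value < 0.05) "reject the Null" else "fail to reject the Null"
```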
When using linear regression (our main method in 300Z), the software for summarizing models always provides a p-value: you just have to ask for it. To illustrate, consider the model height ~ nkids with respect to Galton’s height data. (Galton (1822-1911) and Fisher (1890-1962) were near contemporaries.)
Think of height ~ nkids as asking a question: Is the adult height of a child correlated with the number of siblings? (Why might someone offer the hypothesis that the number of siblings has a connection to adult height? Perhaps the growing children had to compete for food. Or perhaps contagious disease is more prevalent in large families, and childhood disease might be correlated with height. But in NHT, there’s no requirement to explain why one is interested in testing a hypothesis.) Here’s the calculation, reported by four different ways of summarizing the model.
```r
model <- lm(height ~ nkids, data = Galton)
model |> conf_interval(show_p = TRUE)
```

| term        |       .lwr |      .coef |      .upr |   p.value |
|-------------|-----------:|-----------:|----------:|----------:|
| (Intercept) | 67.2185704 | 67.7997464 | 68.380922 | 0.0000000 |
| nkids       | -0.2561222 | -0.1693416 | -0.082561 | 0.0001372 |
```r
model |> R2()
```

|   n | k |  Rsquared |        F |     adjR2 |         p | df.num | df.denom |
|----:|--:|----------:|---------:|----------:|----------:|-------:|---------:|
| 898 | 1 | 0.0161062 | 14.66737 | 0.0150081 | 0.0001372 |      1 |      896 |
```r
model |> regression_summary()
```

| term        |   estimate | std.error | statistic |   p.value |
|-------------|-----------:|----------:|----------:|----------:|
| (Intercept) | 67.7997464 | 0.2961233 |  228.9578 | 0.0000000 |
| nkids       | -0.1693416 | 0.0442168 |   -3.8298 | 0.0001372 |
```r
model |> anova_summary()
```

| term      |  df |      sumsq |    meansq | statistic |   p.value |
|-----------|----:|-----------:|----------:|----------:|----------:|
| nkids     |   1 |   185.4636 | 185.46365 |  14.66737 | 0.0001372 |
| Residuals | 896 | 11329.5987 |  12.64464 |        NA |        NA |
- In the regression summary report and the confidence interval report, a p-value is listed for each coefficient. We are interested in the nkids coefficient.
- In the R-squared and ANOVA reports, there is no p-value for the intercept term; only nkids is at issue.
- Note that the p-value for nkids is the same in all four reports.
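That agreement is no accident: for a single coefficient, the F statistic in the R-squared and ANOVA reports is just the square of the t statistic in the regression summary, and converting it to a tail probability gives the same p-value. A quick check, using the numbers printed above:

```r
# The F statistic is the square of the t statistic for nkids...
(-3.8298)^2                                           # about 14.667
# ...and its tail probability is the same p-value reported for the coefficient
pf(14.66737, df1 = 1, df2 = 896, lower.tail = FALSE)  # about 0.000137
```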
An NHT consists of calculating a p-value, comparing it to 0.05, and drawing the corresponding conclusion. Since \(p = 0.000137 < 0.05\), the proper conclusion is to “reject the Null.”
“The Null” is short for “the Null hypothesis.” The dictionary definitions for “null” relevant here include “having or associated with the value zero” or “amounting to nothing.” In the context of linear regression the Null for a given coefficient always means that, with a sufficiently large (“infinite”) sample, the coefficient would be zero.
The p-value calculation is done in a mathematical world where the Null is true. Other expressions often used: “assuming the Null,” “given the Null hypothesis,” “under the Null.”
Naturally, our samples are finite in size. Consequently, because of sampling variation, we cannot expect the coefficient to be exactly zero even under the Null. Instead, we expect the coefficient to be small.
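One way to build intuition for “under the Null” is to simulate it: shuffle nkids so that it genuinely has nothing to do with height, refit the model, and look at the coefficient produced by sampling variation alone. A minimal sketch, assuming the Galton data frame is available (for instance, from the mosaicData package):

```r
# Simulate the Null: shuffling nkids breaks any real connection to height
set.seed(101)                      # arbitrary seed, for reproducibility
null_model <- lm(height ~ sample(nkids), data = Galton)
coef(null_model)[2]                # small, but not exactly zero
```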
We have already discussed one operational definition of “small”: the confidence interval includes zero. If it does, “fail to reject the Null.” Otherwise (as with the nkids example above), “reject the Null.”
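Here is that check as a short R sketch, relying on the conf_interval() report shown earlier:

```r
# Does the nkids confidence interval include zero?
ci <- model |> conf_interval(show_p = TRUE)
nkids_ci <- ci[ci$term == "nkids", ]
nkids_ci$.lwr <= 0 & 0 <= nkids_ci$.upr   # FALSE here, so "reject the Null"
```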
The p-value is just another way of encoding the notion of “small.” Indeed, in linear regression we can calculate the p-value from the same information used to construct the confidence interval.
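For instance, the nkids p-value can be rebuilt from the coefficient and its standard error, the same two ingredients used to construct the confidence interval, with the numbers taken from the regression_summary() report above:

```r
# From coefficient and standard error to p-value
estimate  <- -0.1693416           # nkids coefficient
std_error <-  0.0442168           # its standard error
t_stat <- estimate / std_error    # about -3.83, the "statistic" column
2 * pt(-abs(t_stat), df = 896)    # two-sided p-value, about 0.000137
```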
Many types of statistical tests?
A Stat 101 course will cover many hypothesis tests, among which are the one-sample t-test, the two-sample t-test, the one- and two-sample p-tests (tests of proportions), and ANOVA. All these different tests are in reality just linear regression. See this blog post.
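To give one illustration of the equivalence, the equal-variance two-sample t-test reproduces the p-value that regression reports for a two-level explanatory variable. A sketch using the sex variable in Galton:

```r
# Two-sample t-test...
t.test(height ~ sex, data = Galton, var.equal = TRUE)$p.value
# ...and the equivalent regression: the sex coefficient carries the same p-value
lm(height ~ sex, data = Galton) |> regression_summary()
```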
There is one hypothesis test that is not exactly equivalent to regression: the chi-squared test. However, in the context where chi-squared often appears, the result corresponds to the z ~ g model specification. [Blog post not yet available.]