
Lesson 36 introduced the logic of Null Hypothesis testing (NHT). This Lesson covers how to perform an NHT. As you will see, the everyday workhorses of NHT are regression modeling and summarization. We will use the `conf_interval()`, `R2()`, `regression_summary()`, and `anova_summary()` functions, the last three of which generate a “p-value,” which is the numerical measure of “standing out from the Planet Null crowd” illustrated in Figure 36.4.

We have been using confidence intervals from early on in these Lessons. As you know, the confidence interval presents the **precision** with which a model coefficient or effect size can be responsibly claimed to be known.

Using a confidence interval for NHT is dead simple: check whether the confidence interval includes zero. If it does, your measurement of the summary statistic is not precise enough to rule out the Null hypothesis as an account of your observation. On the other hand, if the confidence interval doesn’t include zero, you are justified in “rejecting the Null.”

To illustrate the use of confidence intervals for NHT, consider again the `Galton` data on the heights of adult children and their parents. Let’s consider whether the adult child’s `height` is related to his or her `mother`’s height.

`lm(height ~ mother, data=Galton) |> conf_interval()`

term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | 40.00 | 47.00 | 53.00 |
mother | 0.21 | 0.31 | 0.41 |

The confidence interval on `mother` does not include zero, so the Null hypothesis is rejected.

Whenever your interest is in a particular model coefficient, I strongly encourage you to look at the confidence interval. This carries out a hypothesis test and, at the same time, tells you about an effect size, important information for decision-makers. When you do this—good for you!—you are likely to encounter a more senior researcher or a journal editor who insists that you report a p-value. The p-value corresponding to a confidence interval not including zero is \(p < 0.05\). The senior researcher or editor might insist on knowing the “exact” value of \(p\). Such knowledge is not as important as most people make it out to be, but if you need such an “exact” value, you can turn to another style of summary of regression models called a “**regression table**.”

Here’s a regression table on the `height ~ mother` model:

`lm(height ~ mother, data=Galton) |> regression_summary()`

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 47.00 | 3.300 | 14.0 | 0 |
mother | 0.31 | 0.051 | 6.2 | 0 |

The regression table has one row for each coefficient in the model. `height ~ mother`, for instance, has an intercept coefficient as well as a `mother` coefficient. The p-value for each coefficient is given in the right-most column of the regression table. Other columns in the regression table repeat information from the confidence interval. The “estimate” is the coefficient found when training the model on the data. The “std.error” is the **standard error**, which is half the margin of error. (See Lesson 22.) You may remember that the confidence interval amounts to

\[\text{estimate} \pm 2\! \times\! \text{standard error}\]

The “statistic” shown in the regression table is an intermediate step in the calculation of the p-value. The full name for it is “**t-statistic**” and it is merely the estimate divided by the standard error. (For example, the estimate for the `mother` coefficient is 0.31 while the standard error is 0.051. Divide one by the other and you get the t-statistic: about 6.1 from these rounded values; the report’s 6.2 comes from the unrounded numbers.)
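As a quick check on this arithmetic, here is a sketch that recomputes the confidence interval and the t-statistic from the rounded table entries above:

```r
# Recompute the confidence interval and t-statistic for the `mother`
# coefficient from the (rounded) regression-table numbers.
estimate  <- 0.31
std_error <- 0.051

# Confidence interval: estimate +/- 2 * standard error
ci <- c(lwr = estimate - 2 * std_error, upr = estimate + 2 * std_error)
ci  # roughly 0.21 to 0.41, matching the conf_interval() output

# t-statistic: estimate divided by standard error
t_statistic <- estimate / std_error
t_statistic  # about 6.1 from these rounded inputs
```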

A fanatic for decimal places might report the p-value on `mother` as \(p=0.0000000011\). That is indeed smaller than 0.05, consistent with what we already established—\(p < 0.05\)—from a simple examination of the confidence interval. For good reason it is considered bad style to write out a p-value with so many zeros after the decimal point. Except in very specialized circumstances (e.g. gene expression measurements with hundreds of thousands of plasmids) there is no reason to do so: \(p < 0.001\) is enough to satisfy anyone that it’s appropriate to reject the Null hypothesis.

There are three good reasons to avoid writing anything more detailed than \(p < 0.001\) even when the computer tells you that the p-value is zero. First, and simplest, computer arithmetic involves round-off error, so the displayed zero is not really mathematically zero. Second, there are many assumptions made in the calculation of p-values, the correctness of which is impossible to demonstrate from data. Third, the p-value is itself a sample statistic and subject to sampling variability. A complete presentation of the p-value ought therefore to include a confidence interval. An example of such a thing is given in Figure 38.1, but they are never used in practice.
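If you follow this convention in R, the base-R helper `format.pval()` will do the formatting for you; here `eps` sets the floor below which a p-value prints as an inequality rather than as a string of zeros:

```r
# Report a tiny p-value as "<0.001" rather than writing out all the zeros.
# format.pval() is part of base R; eps is the smallest value printed exactly.
p <- 0.0000000011
format.pval(p, eps = 0.001)
```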

We’ve seen that `conf_interval()` and `regression_summary()` can be used to conduct NHT. Now let’s turn to the reason there are two more model summaries used for NHT: `R2()` and `anova_summary()`.

There is an important setting where confidence intervals cannot be used to perform NHT: when it is not a single coefficient that is of interest, but a group of coefficients. As an example, consider the data in `College_grades`, which contains the authentic transcript information for all students who graduated in 2006 from a small liberal-arts college in the US.

The data have been anonymized, so that it is not possible to identify individual students, professors, or departments. The data are used with permission of the college’s registrar.

The numerical response variable, `gradepoint`, corresponds to the letter `grade` of each student in each course taken. This is the sort of data used to calculate grade-point averages (GPA), a statistic familiar to college students! The premise behind GPA is that students differ in a way that can be measured by their GPA. If that’s so, then the student (encoded in the `sid` variable) should account for the variation in `gradepoint`.

We can use `gradepoint ~ sid` to look at the matter. In the following code block, we fit the model and look at the confidence interval on each student’s GPA. We’ll show just the first 6 out of the 443 graduating students.

```
GPAs <- lm(gradepoint ~ sid - 1, data=College_grades) |> conf_interval()
head(GPAs)
```

term | .lwr | .coef | .upr |
---|---|---|---|
sidS31185 | 2.1 | 2.4 | 2.8 |
sidS31188 | 2.8 | 3.0 | 3.3 |
sidS31191 | 2.9 | 3.2 | 3.5 |
sidS31194 | 3.1 | 3.4 | 3.6 |
sidS31197 | 3.1 | 3.3 | 3.6 |
sidS31200 | 1.9 | 2.2 | 2.5 |

Interpreting a report like this is problematic. For instance, consider the first and last students (sidS31185 and sidS31200) in the report. Their gradepoints are different (2.4 versus 2.2) but only if you ignore the substantial overlap in the confidence intervals. (Colleges don’t ever report a confidence interval on a GPA. Perhaps this is because they don’t believe there is any component to a grade that depends on anything but the student.)

A graphical display, as in Figure 37.2, helps put the 443 confidence intervals in context. The graph shows that only roughly 100 of the weaker students and 100 of the stronger students have a GPA that is discernible from the college-wide average. In other words, the middle 60% of the students are indistinguishable from one another based on their GPA.

```
orderedGPAs <- GPAs |> arrange(.upr) |> mutate(row = row_number())
orderedGPAs |> ggplot() +
  geom_segment(aes(x=row, xend=row, y=.lwr, yend=.upr), alpha=0.3) +
  geom_errorbar(aes(x=row, ymin=.lwr, ymax=.upr), data= orderedGPAs |> filter(row %% 20 == 0)) +
  geom_hline(yintercept=3.41, color="blue")
```

Another way to answer the question of whether `sid` accounts for `gradepoint` is to look at the R^{2}:

`lm(gradepoint ~ sid, data=College_grades) |> R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
5700 | 440 | 0.32 | 5.6 | 0.27 | 0 | 440 | 5200 |

The R^{2} is only about 0.3, so the student-to-student differences account for only about a third of the variation in the course grades. This is perhaps not as much as might be expected, but there are so many grades in `College_grades` that the p-value is small, leading us to reject the Null hypothesis that `gradepoint` is independent of `sid`.

The process by which R^{2} is converted to a p-value involves a clever summary statistic called F. For our purposes, it suffices to say that R^{2} can be translated to a p-value, taking into account the sample size \(n\) and the number of coefficients (\(k\)) in the model.
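For the curious, here is a sketch of that translation using the rounded numbers from the `R2()` report above (the report itself works from unrounded values, so the figures match only approximately):

```r
# Translate R-squared into an F statistic and then a p-value.
# Numbers are the rounded ones from the R2() report: n = 5700, k = 440.
n   <- 5700
k   <- 440
Rsq <- 0.32
df_num   <- k            # one degree of freedom per model coefficient
df_denom <- n - k - 1    # residual degrees of freedom

F_stat <- (Rsq / (1 - Rsq)) * (df_denom / df_num)
F_stat                   # about 5.6, matching the report

p_value <- pf(F_stat, df_num, df_denom, lower.tail = FALSE)
p_value                  # far below 0.001
```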

Carrying out such tests on a whole model—with many coefficients—is so simple that we can look at another possible explanation for `gradepoint`: that it depends on the *professor*. The `iid` variable encodes the professor. There are 364 different professors who assigned grades to the Class of 2006.

`lm(gradepoint ~ iid, data=College_grades) |> R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
5700 | 360 | 0.17 | 3.1 | 0.12 | 0 | 360 | 5300 |

We can reject the Null hypothesis that the professor is unrelated to the variation in grades.

But let’s be careful: maybe the professor only appears to account for the variation in grades. It might be that students with low grades cluster around certain professors, and students with high grades cluster around other professors. We can explore this possibility by looking at a model that includes both students and professors as explanatory variables.

`lm(gradepoint ~ sid + iid, data=College_grades) |> R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
5700 | 800 | 0.48 | 5.6 | 0.39 | 0 | 800 | 4900 |

Alas for the idea that grades are only about the student, the R^{2} report shows that, even taking into account the variability in grades associated with the students, there is still substantial variation associated with the professors.

This process of handling *groups of coefficients* (442 for students and 358 for professors) using R^{2} can be so useful that a special format of report, called the ANOVA report, is available. Here it is for the college grades:

`lm(gradepoint ~ sid + iid, data=College_grades) |> anova_summary()`

term | df | sumsq | meansq | statistic | p.value |
---|---|---|---|---|---|
sid | 440 | 650 | 1.50 | 6.8 | 0 |
iid | 360 | 310 | 0.86 | 4.0 | 0 |
Residuals | 4900 | 1100 | 0.22 | NA | NA |

This style of report provides a quick way to look at whether an explanatory variable is contributing to a model. The ANOVA report takes each variable in turn and asks whether, *given the variables already in the model*, the next variable explains enough of the residual variance to be regarded as something other than an accident of sampling variation. In the report above, `sid` is associated with a very small p-value. Then, granting `sid` credit for all the variation it can explain, `iid` still shows up as a strong enough “eater of variance” that the Null hypothesis is rejected for it as well.

Statistics textbooks usually include several different settings for “hypothesis tests.” I’ve just pulled a best-selling book off my shelf and found listed the following tests, spread across eight chapters occupying about 250 pages.

- hypothesis test on a single proportion
- hypothesis test on the mean of a variable
- hypothesis test on the difference in mean between two groups (with 3 test varieties in this category)
- hypothesis test on the paired difference (meaning, for example, measurements made both before and after)
- hypothesis test on counts of a single categorical variable
- hypothesis test on independence between two categorical variables
- hypothesis test on the slope of a regression line
- hypothesis test on differences among several groups

- hypothesis test on R^{2}

As statistics developed, early in the 20th century, distinct tests were developed for different kinds of situations. Each such test was given its own name, for example, a “t-test” or a “chi-squared test.” Honoring this history, statistics textbooks present hypothesis testing as if each test were a new and novel kind of animal.

In these Lessons, we’ve focussed on a single method, regression modeling, rather than introducing all sorts of different formulas and calculations which, in the end, are just special cases of regression. Nonetheless, most people who are taught statistics were never told that the different methods fit into a single unified framework. Consequently, they use different names for the different methods.

These traditional names are relevant to you because you will need to communicate in a world where people learned the traditional names; you have to be able to recognize those names and know which regression model they refer to. In the table below, we will use different letters to refer to different kinds of explanatory and response variables.

- `x` and `y`: quantitative variables
- `group`: a categorical variable with multiple (\(\geq 3\)) levels
- `yesno`: a categorical variable with exactly two levels (which can always be encoded as a zero-one quantitative variable)

Model specification | traditional name |
---|---|
`y ~ 1` | t-test on a single mean |
`yesno ~ 1` | p-test on a single proportion |
`y ~ yesno` | t-test on the difference between two means |
`yesno1 ~ yesno2` | p-test on the difference between two proportions |
`y ~ x` | t-test on a slope |
`y ~ group` | ANOVA test on the difference among the means of multiple groups |
`y ~ group1 * group2` | Two-way ANOVA |
`y ~ x * yesno` | t-test on the difference between two slopes. (Note the `*`, indicating interaction) |
`y ~ group + x` | ANCOVA |
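To see that these really are the same calculations, here is a sketch with simulated data (the numbers and group labels are made up for illustration): the classic equal-variance two-sample t-test and the regression model `y ~ group` produce identical p-values.

```r
# Simulated data: the equal-variance two-sample t-test and the
# regression y ~ group report the same t-statistic and p-value.
set.seed(1)
dat <- data.frame(
  y     = c(rnorm(30, mean = 10), rnorm(30, mean = 11)),
  group = rep(c("A", "B"), each = 30)
)

t_out  <- t.test(y ~ group, data = dat, var.equal = TRUE)
lm_out <- summary(lm(y ~ group, data = dat))

t_out$p.value
lm_out$coefficients["groupB", "Pr(>|t|)"]  # identical to the t-test p-value
```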

Another named test, the **z-test**, is a special kind of t-test where you know the variance of a variable without having to calculate it from data. This situation hardly ever arises in practice, and mostly it is used as a soft introduction to the t-test.

Still another test, named ANCOVA, is considered too advanced for inclusion in traditional textbooks. It amounts to looking at whether the variable `group` helps to account for the variation in `y` in the model `y ~ group + x`.


So, standard operating procedures were based on the tools at hand. We will return to the mismatch between hypothesis testing and the contemporary world in Lesson 38.

\[ \begin{array}{cc|cc} & & \textbf{Test Conclusion} & \\ & & \text{do not reject } H_0 & \text{reject } H_0 \text{ in favor of } H_A \\ \hline \textbf{Truth} & H_0 \text{ true} & \text{Correct Decision} & \text{Type 1 Error} \\ & H_A \text{ true} & \text{Type 2 Error} & \text{Correct Decision} \\ \end{array} \]

A **Type 1 error**, also called a false positive, is rejecting the null hypothesis when \(H_0\) is actually true. Since we rejected the null hypothesis in the gender discrimination (from the Case Study) and the commercial length studies, it is possible that we made a Type 1 error in one or both of those studies. A **Type 2 error**, also called a false negative, is failing to reject the null hypothesis when the alternative is actually true. A Type 2 error was not possible in the gender discrimination or commercial length studies because we rejected the null hypothesis.


Most statistics books include two versions of a test invented around 1900 that deals with counts at different levels of a categorical variable. This chi-squared test is genuinely different from regression. And, in theoretical statistics the chi-squared distribution has an important role to play.
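For instance, a classic goodness-of-fit use of chi-squared is the question “Is a die fair?” Here is a sketch with made-up counts from 60 rolls (by default, `chisq.test()` takes the Null hypothesis to be that all levels are equally likely):

```r
# Made-up counts from 60 rolls of a die. The Null hypothesis is that
# all six faces are equally likely, which is chisq.test()'s default.
rolls <- c(10, 12, 8, 11, 9, 10)
chisq.test(rolls)  # large p-value: no evidence the die is unfair
```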

The chi-squared test of independence could be written, in regression notation, as `group1 ~ group2`. But regression does not handle a categorical *response* variable with more than two levels.

However, in practice the chi-squared test of independence is very hard to interpret except when one or both of the variables has two levels. This is because there is nothing analogous to model coefficients or effect size that comes from the chi-squared test.

The tendency in research, even when `group1` has more than two levels, is to combine groups to produce a `yesno` variable. Chi-squared can be used with the response variable being `yesno`, and almost all textbook examples are of this nature.

But for a `yesno` response variable, a superior method, more flexible and more informative, is logistic regression.
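As a sketch of what that looks like (with simulated data; the variable names are made up for illustration), `glm()` with `family = binomial` fits a logistic regression, reporting coefficients with standard errors and p-values—exactly the effect-size information a chi-squared test cannot provide:

```r
# Simulated yesno response across three groups, fit with logistic regression.
set.seed(2)
dat  <- data.frame(group = rep(c("A", "B", "C"), each = 50))
prob <- c(A = 0.2, B = 0.5, C = 0.7)[dat$group]  # true group probabilities
dat$yesno <- rbinom(nrow(dat), size = 1, prob = prob)

model <- glm(yesno ~ group, data = dat, family = binomial)
summary(model)$coefficients  # log-odds coefficients, std. errors, p-values
```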