# 37  p-value summaries

Lesson 36 introduced the logic of Null Hypothesis testing (NHT). This Lesson covers how to perform an NHT. As you will see, the everyday workhorses of NHT are regression modeling and summarization. We will use the conf_interval(), R2(), regression_summary(), and anova_summary() functions, the last three of which generate a “p-value,” which is the numerical measure of “standing out from the Planet Null crowd” illustrated in Figure 36.4.

## Confidence intervals and regression “tables”

We have been using confidence intervals from early on in these Lessons. As you know, the confidence interval presents the precision with which a model coefficient or effect size can be responsibly claimed to be known.

Using a confidence interval for NHT is dead simple. Check whether the confidence interval includes zero. If so, there’s not enough precision in your measurement of the summary statistic to justify a claim that the Null hypothesis might account for your observation. On the other hand, if the confidence interval doesn’t include zero, you are justified in “rejecting the Null.”

To illustrate the use of confidence intervals for NHT, consider again the Galton data on the heights of adult children and their parents. Let’s consider whether the adult child’s height is related to his or her mother’s height.

lm(height ~ mother, data=Galton) |> conf_interval()
term .lwr .coef .upr
(Intercept) 40.00 47.00 53.00
mother 0.21 0.31 0.41

The confidence interval on mother does not include zero so the Null hypothesis is rejected.

Whenever your interest is in a particular model coefficient, I strongly encourage you to look at the confidence interval. This carries out a hypothesis test and, at the same time, tells you about an effect size, important information for decision-makers. When you do this—good for you!—you are likely to encounter a more senior researcher or a journal editor who insists that you report a p-value. The p-value corresponding to a confidence interval not including zero is $$p < 0.05$$. The senior researcher of editor might insist on knowing the “exact” value of $$p$$. Such knowledge is not as important as most people make it out to be, but if you need such an “exact” value, you can turn to another style of summary of regression models called a “regression table.”

Here’s a regression table on the height ~ mother model:

lm(height ~ mother, data=Galton) |> regression_summary()
term estimate std.error statistic p.value
(Intercept) 47.00 3.300 14.0 0
mother 0.31 0.051 6.2 0

The regression table has one row for each coefficient in the model. height ~ mother, for instance, has an intercept coefficient as well as a mother coefficient. The p-value for each coefficient is given in the right-most column of the regression table. Other columns in the regression table repeat information from the confidence interval. The “estimate” is the coefficient found when training the model on the data. the “std.error” is the standard error, which is half the margin of error. (See Lesson 22.) You may remember that the confidence interval amounts to

$\text{estimate} \pm 2\! \times\! \text{standard error}$

The “statistic” shown in the regression table is an intermediate step in the calculation of the p-value. The full name for it is “t-statistic” and it is merely the estimate divided by the standard error. (For example, the estimate for the mother coefficient is 0.31 while the standard error is 0.051. Divide one by the other and you get the t-statistic, about 6.2 in this case.)

A fanatic for decimal places might report the p-value on mother as $$p=0.0000000011$$. That is indeed smaller, consistent with what we already established—$$p < 0.05$$—from a simple examination of the confidence interval. For good reason it is considered bad style to write out a p-value with so many zeros after the decimal point. Except in very specialized circumstances (e.g. gene expression measurements with hundreds of thousands of plasmids) there is no reason to do so: $$p < 0.001$$ is enough to satisfy anyone that it’s appropriate to reject the Null hypothesis.

There are three good reasons to avoid writing anything more detailed than $$p < 0.001$$ even when the computer tells you that the p-value is zero. First, and simplest, computer arithmetic involves round-off error, so the displayed zero is not really mathematically zero. Second, there are many assumptions made in the calculation of p-values, the correctness of which is impossible to demonstrate from data. Third, the p-value is itself a sample statistic and subject to sampling variability. A complete presentation of the p-value ought therefore to include a confidence interval. An example of such a thing is given in Figure 38.1, but they are never used in practice.

Weeding in action

Section 36.3 points out that the purpose of NHT is to screen out some of the research results that might otherwise—without NHT—be taken seriously. To illustrate the weeding-out effect of NHT, let’s simulate a situation where the Null hypothesis is true. As earlier, we enforce the Null by randomly shuffling the explanatory variable. We’ll continue the example using Galton and the model specification height ~ mother. Keep in mind: since the Null is being enforced, none of the results Like this:

lm(height ~ shuffle(mother), data=Galton) %>%
conf_interval()
term .lwr .coef .upr
(Intercept) 60.000 66.0000 73.00
shuffle(mother) -0.093 0.0091 0.11

Now, let’s conduct many random trials and show the confidence interval for each of them.

Code
set.seed(104)
Trials <- do(200) * {
lm(height ~ shuffle(mother), data=Galton) %>%
conf_interval() %>%
filter(term == "shuffle(mother)")
}

Trials %>%
mutate(misses = ifelse(.upr < 0 | .lwr > 0, "red", "black")) %>%
ggplot(aes(x = .index, xend = .index, y = .lwr, yend = .upr)) +
geom_segment(width=.15, aes(color=misses)) +
geom_point(aes(x=.index, y=.coef), size=0.5) +
geom_hline(yintercept=0, color="blue") +
scale_color_identity() +
scale_x_continuous(breaks=c()) +
ylab("Confidence interval on shuffle(mother).") +
xlab("200 trials") +
theme_minimal()

Think of the simulation shown in in Figure 37.1 as 200 researchers, each on a wild-goose hunt and eager to make a name for themselves in science. If even 100 published their work, the literature would be crowded with meaningless results. The NHT screen removes 188 of the researchers from further consideration. Of the 200 trials displayed in Figure 37.1, 188 include zero within the confidence interval. But the net is not perfect and some get through (shown in red). The meaning of $$p < 0.05$$ is that in a situation like this, fewer than 5% should get through. In Figure 37.1 12 of the trials evaded the net. Owing to sampling variation, that’s entirely consistent with an NHT failure rate of 5% (which would be 10 of the 200 trials).

## ANOVA

We’ve seen that conf_interval() and regression_summary() can be used to conduct NHT. Now let’s go into the reason why there are more model summaries used for NHT: R2() and anova_summary.

There is an important setting where confidence intervals cannot be used to perform NHT. That is when it is not a single coefficient that is of interest, but a group of coefficients. As an example, consider the data in College_grades, which contains the authentic transcript information for all students who were graduated in 2006 from a small liberal-arts college in the US.

The data have been anonymized, so that it is not possible to identify individual students, professors, or departments. The data are used with permission of the college’s registrar.

The numerical response variable, gradepoint, corresponds to the letter grade of each student in each course taken. This is the sort of data used to calculate grade-point averages (GPA) a statistic familiar to college students! The premise behind GPA is that students are different in a way that can be measured by their GPA. If that’s so, then the student (encoded in the sid variable) should account for the variation in gradepoint.

We can use gradepoint ~ sid to look at the matter. In the following code block, we fit the model and look at the confidence interval on each student’s GPA. We’ll show just the first 6 out of the 443 graduating students.

GPAs <- lm(gradepoint ~ sid - 1, data=College_grades) |> conf_interval()
head(GPAs)
term .lwr .coef .upr
sidS31185 2.1 2.4 2.8
sidS31188 2.8 3.0 3.3
sidS31191 2.9 3.2 3.5
sidS31194 3.1 3.4 3.6
sidS31197 3.1 3.3 3.6
sidS31200 1.9 2.2 2.5

Interpreting a report like this is problematic. For instance, consider the first and last students (sidS31185 and sidS31200) in the report. Their gradepoints are different (2.4 versus 2.2) but only if you ignore the substantial overlap in the confidence intervals. (Colleges don’t ever report a confidence interval on a GPA. Perhaps this is because they don’t believe there is any component to a grade that depends on anything but the student.)

A graphical display, as in Figure 37.2 helps put the 443 confidence intervals in context. The graph shows that only roughly 100 of the weaker students and 100 of the stronger students have a GPA that is discernible from the college-wide average. In other words, the middle 60% of the students are indistinguishable from one another based on their GPA.

Code
orderedGPAs <- GPAs |> arrange(.upr) |> mutate(row = row_number())
orderedGPAs |>  ggplot() +
geom_segment(aes(x=row, xend=row, y=.lwr, yend=.upr), alpha=0.3) +
geom_errorbar(aes(x=row, ymin=.lwr, ymax=.upr), data= orderedGPAs |> filter(row%%20==0)) +

geom_hline(yintercept=3.41, color="blue")

Another way to answer the question of whether sid accounts for gradepoint is to look at the R2:

lm(gradepoint ~ sid, data=College_grades) |> R2()
n k Rsquared F adjR2 p df.num df.denom
5700 440 0.32 5.6 0.27 0 440 5200

The R2 is only about 0.3, so the student-to-student differences account for less than half of the variation in the course grades. This is perhaps not as much as might be expected, but there are so many grades in College_grades that the p-value is small, leading us to reject the Null hypothesis that gradepoint is independent of sid.

The process by which R2 is converted to a p-value involves a clever summary statistic called F. For our purposes, it suffices to say that R2 can be translated to a p-value, taking into account the sample size $$n$$ and the number of coefficients ($$k$$) in the model.

Carrying out such tests on a whole model—with many coefficients—is so simple that we can look at another possible explanation for gradepoint: that it depends on the professor. The iid variable encodes the professor. There are 364 different professors who assigned grades to the Class of 2006.

lm(gradepoint ~ iid, data=College_grades) |> R2()
n k Rsquared F adjR2 p df.num df.denom
5700 360 0.17 3.1 0.12 0 360 5300

We can reject the Null hypothesis that the professor is unrelated to the variation in grades.

But let’s be careful, maybe the professor only appears to account for the variation in grade. It might be that students with low grades cluster around certain professors, and students with high grades cluster around other professors. We can explore this possibility by looking at a model that includes both students and professors as explanatory variables.

lm(gradepoint ~ sid + iid, data=College_grades) |> R2()
n k Rsquared F adjR2 p df.num df.denom
5700 800 0.48 5.6 0.39 0 800 4900

Alas for the idea that grades are only about the student, the R2 report shows that, even taking into account the variability in grades associated with the students, there is still substantial variation associated with the professors.

This process of handling groups of coefficients (442 for students and 358 for professors) using R2 can be so useful that a special format of report, called the ANOVA report, is available. Here it is for the college grades:

lm(gradepoint ~ sid + iid, data=College_grades) |> anova_summary()
term df sumsq meansq statistic p.value
sid 440 650 1.50 6.8 0
iid 360 310 0.86 4.0 0
Residuals 4900 1100 0.22 NA NA

This style of report provides a quick way to look at whether an explanatory variable is contributing to a model. The ANOVA report looks at each variable, each in turn, to see if given the variables already in the model whether the next variable along explains enough of the residual variance in the model to be regarded as something other than an accident due to sampling variation. In the report above, sid is associated with a very small p-value. Then, granting sid credit for all the variation it can explain, iid still shows up as a strong enough “eater of variance” that the Null hypothesis can is rejected for it as well.

Example: Explaining book discounts

To illustrate how ANOVA can be used to build a model, consider the moderndive::amazon_books data frame. The Amazon company has a reputation for discounting books, let’s see if we can detect patterns in how much (or how little) discount they give. To start, we will wrangle a new variable, discount, into the amazon_books data. This will give the discount in percent. A negative discount means that the Amazon price is higher than the publisher’s list price.

Books <- moderndive::amazon_books %>%
mutate(discount = 100 * (1 - amazon_price / list_price))

Now we can check which variables account for the variation in the discount. For instance, discount might depend on list_price (say, cheap books are hardly discounted), and perhaps on hard-cover vs paperback, the number of pages, and the weight of the book. We can throw in many variables on a first pass through the problem.

lm(discount ~ hard_paper + num_pages + thick + weight_oz + list_price, data = Books) |>
anova_summary()
term df sumsq meansq statistic p.value
hard_paper 1 1050 1050 3.900 4.91e-02
num_pages 1 6430 6430 23.800 1.70e-06
thick 1 1340 1340 4.960 2.66e-02
weight_oz 1 198 198 0.733 3.93e-01
list_price 1 124 124 0.459 4.98e-01
Residuals 308 83100 270 NA NA

Each variable is given a p-value. Of the ones here, hard_paper, num_pages and thick have $$p < 0.05$$. That suggests including only those variables in a model.

tentative_model <- lm(discount ~ hard_paper + num_pages + thick, data = Books)
tentative_model |> R2()
n k Rsquared F adjR2 p df.num df.denom
321 3 0.0912905 10.61545 0.0826907 1.1e-06 3 317

The three variables account for just 9% of the variation in discount. Not a very powerful explanation!

## STILL IN DRAFT: Traditional tests

Statistics textbooks usually include several different settings for “hypothesis tests.” I’ve just pulled a best-selling book off my shelf and find listed the following tests spread across eight chapters occupying about 250 pages.

• hypothesis test on a single proportion
• hypothesis test on the mean of a variable
• hypothesis test on the difference in mean between two groups (with 3 test varieties in this category)
• hypothesis test on the paired difference (meaning, for example, measurements made both before and after)
• hypothesis test on counts of a single categorical variable
• hypothesis test on independence between two categorical variables
• hypothesis test on the slope of a regression line
• hypothesis test on differences among several groups
• use Births2015 and check whether month or day of the week explain the daily variation.
• hypothesis test on R2

As statistics developed, early in the 20th century, distinct tests were developed for different kinds of situations. Each such test was given its own name, for example, a “t-test” or a “chi-squared test.” Honoring this history, statistics textbooks present hypothesis testing as if each test were a new and novel kind of animal.

In these Lessons, we’ve focussed on that one method, rather than introducing all sorts of different formulas and calculations which, in the end, are just special cases of regression. Nonetheless, most people who are taught statistics were never told that the different methods fit into a single unified framework. Consequently, they use different names for the different methods.

These traditional names are relevant to you because you will need to communicate in a world where people learned the traditional names, you have to be able to recognize those names know which regression model they refer to. In the table below, we will use different letters to refer to different kinds of explanatory and response variables.

• x and y: quantitative variables

• group: a categorical variable with multiple ($$\geq 3$$) levels.

• yesno: a categorical variable with exactly two levels (which can always be encoded as a zero-one quantitative variable)

y ~ 1 t-test on a single mean
yesno ~ 1 p-test on a single proportion.
y ~ yesno t-test on the difference between two means
yesno1 ~ yesno2 p-test on the difference between two proportions
y ~ x t-test on a slope
y ~ group ANOVA test on the difference among the means of multiple groups
y ~ group1 * group2 Two-way ANOVA
y ~ x * yesno t-test on the difference between two slopes. (Note the *, indicating interaction)
y ~ group + x ANCOVA

Another named test, the z-test, is a special kind of t-test where you know the variance of a variable without having to calculate it from data. This situation hardly every arises in practice, and mostly it is used as a soft introduction to the t-test.

Still another test, named ANCOVA, is considered too advanced for inclusion in traditional textbooks. It amounts to looking at whether the variable group helps to account for the variation in y in the model y ~ group + x.

## P-values and covariates

Use cancer/grass-treatment example from Lesson 30 to illustrate how failing to think about covariates before the study analysis can lead to false discovery.

Use age in marriage data.

So, standard operating procedures were based on the tools at hand. We will return to the mismatch between hypothesis testing and the contemporary world in Lesson 38.

$\begin{array}{cc|cc} & & \textbf{Test Conclusion} &\\ & & \text{do not reject } H_0 & \text{reject } H_0 \text{ in favor of }H_A \\ \textbf{Truth} & \hline H_0 \text{ true} & \text{Correct Decision} & \text{Type 1 Error} \\ & H_A \text{true} & \text{Type 2 Error} & \text{Correct Decision} \\ \end{array}$

A Type 1 error, also called a false positive, is rejecting the null hypothesis when $$H_0$$ is actually true. Since we rejected the null hypothesis in the gender discrimination (from the Case Study) and the commercial length studies, it is possible that we made a Type 1 error in one or both of those studies. A Type 2 error, also called a false negative, is failing to reject the null hypothesis when the alternative is actually true. A Type 2 error was not possible in the gender discrimination or commercial length studies because we rejected the null hypothesis.

## The chi-squared test

MAKE THIS A MORE NEUTRAL DESCRIPTION SHOWING two settings where Chi-squared is appropriate: Is a die fair? Does the observed frequency of phenotypes follow the theoretical frequency of genotypes, as in Punnett squares.

THEN, in the exercises, show poisson regression.

Most statistics books include two versions of a test invented around 1900 that deals with counts at different levels of a categorical variable. This chi-squared test is genuinely different from regression. And, in theoretical statistics the chi-squared distribution has an important role to play.

The chi-squared test of independence could be written, in regression notation, as group1 ~ group2. But regression does not handle the case of a categorical variable with multiple levels.

However, in practice the chi-squared test of independence is very hard to interpret except when one or both of the variables has two levels. This is because there is nothing analogous to model coefficients or effect size that comes from the chi-squared test.

The tendency in research, even when group1 has more than two levels, is to combine groups to produce a yesno variable. Chi-squared can be used with the response variable being yesno and almost all textbook examples are of this nature.

But for a yesno response variable, a superior, more flexible and more informative method is logistic regression.