:::: {.column-page-right}

Figure 28.1 (reprinted below) in the Lessons shows the mean SAT score across all students in a state versus the public school expenditures (per pupil) in that state.

A. What is the gray band in the graph, a confidence interval or a prediction interval? Explain your answer.

B. Whichever kind of interval is already shown in the graph, sketch out what you think the other kind of interval will be.

Consider dag02 in which y is caused by both x and c.

dag_draw(dag02)

Both nodes a and c are exogenous, that is, they have no inputs.

Suppose we want to quantify the relationship between x and y. There are two models that might be appropriate: y ~ x or y ~ x + a.

Here are the confidence intervals on the coefficients from the two models.

Samp <- sample(dag02, size=100)
lm(y ~ x, data = Samp) |> conf_interval()
term .lwr .coef .upr
(Intercept) 4.650359 5.025492 5.400624
x 2.375741 2.720940 3.066139
lm(y ~ x + a, data = Samp) |> conf_interval()
term .lwr .coef .upr
(Intercept) 4.958423 5.157811 5.357199
x 2.730815 2.915266 3.099717
a -1.693244 -1.505199 -1.317154

A. Does including a as a covariate change the apparent relationship between x and y? In answering, make sure to consider the whole confidence interval and not merely the point estimate .coef.

B. How does including a as a covariate alter the width of the confidence interval on x?

A. Interpret the coefficients in the model a ~ x to say whether there is evidence of a causal flow from x to a.

lm(a ~ x + y, data=sample(dag02, size=1000)) |> conf_interval()
term .lwr .coef .upr
(Intercept) 2.248535 2.3486953 2.4488552
x 1.365374 1.4300450 1.4947156
y -0.489715 -0.4709623 -0.4522096

The math300::Clock_auction data frame records the auctions of 32 antique grandfather clocks. We would like to examine the possibility that the price paid depends on the age of the clock.

A. Calculate the confidence interval on the age coefficient from the model price ~ age.

lm(price ~ age, data=Clock_auction) |> conf_interval()
term .lwr .coef .upr
(Intercept) -589.828952 -67.167662 455.49363
age 5.857338 9.402623 12.94791
i. Does the confidence interval indicate a relationship between price and age?
ii. What is the effect size? In particular, what does the model suggest about the price difference between two clocks that differ in age by a decade?
iii. The intercept is near zero. Explain whether this means that a brand-new grandfather clock would have an auction price of near zero.

B. Conventional wisdom is that auction prices are higher when there are many competing buyers. What do the data have to say about this? Look at the confidence interval for the bidders coefficient in the model price ~ bidders?

lm(price ~ bidders, data=Clock_auction) |> conf_interval()
term .lwr .coef .upr
(Intercept) 486.213201 944.05426 1401.89531
bidders -9.208742 36.88611 82.98096

C. The confidence interval (from part B) on bidders is very wide and even includes zero. Perhaps using age as a covariate will eat some variance and narrow the interval. Does it?

lm(price ~ bidders + age, data = Clock_auction) |> conf_interval()
term .lwr .coef .upr
(Intercept) -1450.57576 -921.50278 -392.42981
bidders 37.45739 64.02686 90.59633
age 8.33264 11.08665 13.84067

D. (Optional) Fit the model price ~ age * bidders, which has an additional, “interaction” term, and look at the confidence intervals. (Notice the * in the model specification.) Does adding the additional term narrow the confidence intervals?

lm(price ~ bidders * age, data = Clock_auction) |> conf_interval()
term .lwr .coef .upr
(Intercept) -1876.7626698 -512.8101698 851.142330
bidders -118.2520921 19.8876621 158.027416
age -1.2262344 8.1650785 17.556391
bidders:age -0.6616203 0.3196439 1.300908

Explanation: The big broadening in the confidence intervals is due to a situation called “multi-collinearity.” The interaction term, bidders:age is strongly aligned with bidders and somewhat aligned with age. The alignment creates an ambiguity; bidders:age can explain almost as much as the two variables bidders and age. It is somewhat arbitrary how to assign coefficients to two terms that are almost the same as a third term: the confidence intervals reflects this.

Lesson 30

dag02: y is a collider dag03: g is a common cause dag04: chain

Samp <- sample(dag02, size=20)
lm(a ~ x, data = Samp) %>% conf_interval()
term .lwr .coef .upr
(Intercept) -1.0667773 -0.4807364 0.1053045
x -0.8336666 -0.2547662 0.3241341
lm(a ~ x, data = Samp) %>% R2()
n k Rsquared F adjR2 p df.num df.denom
20 1 0.0453391 0.8548618 -0.0076977 0.3662025 1 18
lm(a ~ x, data = Samp) %>% regression_summary()
term estimate std.error statistic p.value
(Intercept) -0.4807364 0.2789446 -1.7234115 0.1019481
x -0.2547662 0.2755459 -0.9245874 0.3674106
lm(a ~ x + y, data = Samp) %>% regression_summary()
term estimate std.error statistic p.value
(Intercept) 2.3830564 0.3263467 7.302224 1.2e-06
x 1.3089138 0.2019618 6.480995 5.6e-06
y -0.4981907 0.0530988 -9.382334 0.0e+00

NOT FINISHED.

Consider dag06, shown below:

dag_draw(dag06)

Suppose we want to understand the causal effect of a on c.

Part I. Notice that there are two pathways between a and c.

• Pathway 1: $$\mathtt{a} \longrightarrow \mathtt{b} \longrightarrow \mathtt{c}$$
• Pathway 2: $$\mathtt{a} \longrightarrow \mathtt{d} \longleftarrow \mathtt{c}$$.

A. Based on the diagram, is there causal flow along Pathway 1 from a to c?

B. Based on the diagram, is there causal flow along Pathway 2 from a to c?

Part II. In modeling the relationship between a and c, we have a choice of using no covariates, including b as a covariate, including d as a covariate, or including both b and d as covariates. These four possibilities correspond respectively to these four model specifications.

1. c ~ a
2. c ~ a + b
3. c ~ a + d
4. c ~ a + b + d

Use each of these specifications in turn, like this:

lm(c ~ a, data=sample(dag06, size=1000)) |> conf_interval()
term .lwr .coef .upr
(Intercept) -0.1219713 -0.0332282 0.0555149
a 0.8387843 0.9258735 1.0129628

A. Do specifications (i) through (iv) give similar coefficients on a?

Lesson 31

Consider dag02 in which y is caused by both x and c.

dag_draw(dag02)

A. Which of the nodes are exogenous, that is, have no inputs?

Since exogenous nodes are independent of one another, a model relating just those two nodes should show no relationship between them.

B. Construct an appropriate model specification to test the proposition that the exogenous nodes are independent. You should be able to get the information you need with a statement like this:

lm(____ ~ ____, data = sample(dag02, size=100)) |> conf_interval()

C. What about the results from (B) indicates whether or not there is a relationship between the two exogenous nodes?

D. The modeling situation can change when the third node is included as a covariate.

lm(____ ~ ____ + ____, data = sample(dag02, size=100)) |>
conf_interval()

What about these results (falsely) indicates a relationship between the exogenous nodes?

The results from (D) are an example of a spurious correlation. But there is nothing about the coefficients themselves that shows the correlation is spurious or not. Instead, it is the DAG that tells us which are the exogenous variables. There is never a correlation between exogenous variables.

E. Returning to the nomenclature introduced in Lesson 30, which of these is the case for dag02

1. Node y is a common cause of a and x.
2. Node y is an intermediary on a causal path between a and x.
3. Node y is a collider on the non-causal path between a and x.

Lesson 32

Consider a situation where there is a medical treatment that is thought to improve outcomes. Where might this belief come from? Experience! Over many years, medical workers have observed the treatment to be effective.

Imagine a novice researcher combing through medical records looking for patients with a specific condition, some of whom received the treatment and some not. The novice has in mind a relationship like this:

Accordingly, the researcher records each patient’s outcome together with whether the treatment was given. Like this:

treatment outcome
none 0.3288132
treat -1.0044737
none 1.7272293
treat -0.3115102
none 2.5408557
none -0.3412458
treat -0.0978546
treat 1.9870898

The DAG dag_medical_observations simulates the situation.

A. Draw a sample of size=10 from dag_medical_observations and fit the model outcome ~ treatment. Check the confidence interval on the coefficient treatmenttreat. Is there any clear evidence for a systematic relationship between treatment and outcome? What feature of the confidence interval leads to your conclusion.

B. Repeat (A), but with a much larger sample of size=400. With the larger amount of data, there will be clear evidence for a relationship between treatment and outcome. Which way does the relationship go? Does treatment improve (increase) outcome or worsen (decrease) it?

You can make the confidence interval as small as desired by increasing the sample size. (In the real world, though, you would run out of funding!) Nonetheless, no matter how narrow the confidence interval, there is no reason to believe it is an accurate represention of the real-world link between treatment and outcome. The reason is the possible existence of confounding variables and hidden influences.

Here is dag_medical_observations with the “hidden” influences revealed.

Since a patient’s sex is part of the medical record, the researcher compiling the data could have included that in the data frame. Sex is generally an important covariate in medical matters. Here’s are the confidence intervals from the model outcome ~ treatment + .sex with a sample of the augmented data. It’s tempting to think that more data gives more accurate results, so here we have used a sample of size $$n=4000$$.

term .lwr .coef .upr
(Intercept) 0.8581245 0.9434897 1.0288550
treatmenttreat 0.2391749 0.3267674 0.4143599
.sexM -1.4505452 -1.3647643 -1.2789835

But sample size is only germane to the precision of the estimate, not the accuracy. To know the accuracy of an estimate, we would have to know the actual mechanism generating the data. If that were so, we wouldn’t need to rely on data!

However, there is a way to make an accurate estimate. This involves changing the mechanism to simplify it.

Let’s construct such an experiment. In the real world, this would involve randomly assigning patients to treatment or not, then observing the outcome for each patient. This can be a huge amount of work. For us, using DAGs, the experimental procedure is much more convenient. First, we intervene in the DAG to cut off all other inputs to treatment by setting treatment as directed by a random number generator.

With the DAG simulations, we can do this by changing the treatment node, like this:

experiment_dag <- dag_intervene(dag_medical_observations,
treatment ~ binom(generator, labels=c("none", "treat")),
generator ~ exo())

This new DAG sets treatment based on the value of generator. In turn, generator is an exogenous random variable, that is, not influenced by any of the other nodes.

In experiment_dag, there are not any backdoor pathways between treatment and outcome. Consequently, we can use the mode outcome ~ treatment to measure the direct link between the two.

C. In your own R session, run the command to make experiment_dag. Then generate a sample (say, size=400) and fit the model outcome ~ treatment. Are the confidence intervals on treatment consistent with the ones you found in part (B)?

D. Unlike the real world, in a DAG simulation we can reveal the actual mechanism behind the data. Using the command print(dag_medical_observations), look at the formula for output ~. What is the coefficient on treatment=="treat"? Is this consistent with the result you got in (B) or in (C)?

Lesson 33

The math300::Gilbert data frame records whether a death occurred (variable death) in each of 1384 nursing shifts at a VA hospital. Variable gilbert records with nurse Kristen Gilbert was on duty during that shift.

We want to examine the extent to which nurse Gilbert might have contributed to the risk of death during a shift. To that end, we will fit a logistic regression model death ~ gilbert, like this:

Gilbert <- math300::Gilbert %>%
mutate(death = zero_one(death, one="yes"))
Model <- glm(death ~ gilbert, data = Gilbert, family = binomial)
Model |> conf_interval()
term .lwr .coef .upr
(Intercept) -1.1782925 -1.0566614 -0.9373749
gilberton_duty 0.3254669 0.6055858 0.8823493

The model output is $$-1.057$$ when Gilbert was not on duty and $$-1.057 + 0.606 = -0.451$$ when Gilbert was on duty.

The model output in each of the two situations is the “log odds” of the risk of death during a nursing shift. There’s a formula to translate log odds into probability, for instance

exp(-1.057) / (1 + exp(-1.057))
[1] 0.2578832
exp(-0.451) / (1 + exp(-0.451))
[1] 0.389123

From these two probabilities, Gilbert’s being on duty increased the probability of a death from 26% to 39%—a difference in “absolute increase in risk” of 13 percentage points. Another way to quantify exactly the same thing is a “risk ratio.” The risk ratio when Gilbert was on duty is $$0.39/0.26 = 1.5$$.

Using calculations like these, compute a confidence interval on the absolute increase in risk. Also, compute a confidence interval on the risk ratio.

Note: An “absolute risk” is a probability and an increase in risk is an increase in probability. In the above we see that Gilbert’s being on duty increased the risk by $$0.39 - 0.26 = 0.13$$.

A risk ratio (or, “relative risk”) a expresses the change of risk in relative terms. The risk ratio of 1.5 means that Gilbert’s being on duty increased the risk by a multiplicative factor of 1.5. Since the baseline risk is 26%, the risk with Gilbert on duty is $$1.5 \times 26\% = 39\%$$.

News reports will typically present risk ratios, even though the change in absolute risk is more meaningful for making decisions about risk.

There is a very subtle and easy-to-miss use of words when talking about absolute and relative risks. The change in absolute risk can correctly be reported as “13 percentage point increase.” The change in relative risk is correctly reported as “50 percent increase” (that is, from 1 to 1.5). Absolute change in risk and relative risk are different ways of describing exactly the same thing. For the reader, the only clue of whether an change in absolute or relative risk is being reported are the phrases “percentage point” versus “percent.”

Imagine the situation of a treatment that can reduce the risk of an uncommon disease. There can be multiple ways to describe the effectiveness of the treatment that are perceived very differently by most readers. For instance:

1. The treatment halves the risk of the disease.
2. The treatment increases the chances of being healthy by one percentage point.
3. Neglecting treatment doubles the risk of the disease.

A. Suppose the risk of disease in the untreated group is 2%, and the risk in the treated group is 1%. Is this consistent with statement (i)? With statement (ii)? With statement (iii)?

B. Suppose the chances of staying healthy in the untreated group is 98%, and the chances in the treated group is 99%. Is this consistent with statement (i)? With statement (ii)? With statement (iii)?

A desirable property of a mathematical description of risk is that the numerical value of a change in risk should have the same magnitude whether one is speaking of the risk of disease or the chances of staying healthy.

C. Focus for the moment on the description in terms of risk of disease: 1% in the treated group and 2% in the untreated group. i. What are the odds of getting the disease in the treated group? (Recall that an event with probability $$p$$ has odds $$\frac{p}{1-p}$$.) ii. What are the odds of getting the disease in the untreated group? iii. Calculate the difference in log-odds between the treated group and the untreated group. (In R, log() computes the logarithm.)

D. Now turn to the description in terms of the chances of staying healthy: 99% in the treated group and 98% in the untreated group. i. What are the odds of staying healthy in the treated group? ii. What are the odds of staying healthy in the untreated group? iii. Calculate the difference in log-odds between the treated and untreated groups.

E. There is a difference in log-odds differ between the analysis in (C) and the analysis in (D). Explain what the difference is and give a common-sense explanation for it.

Imagine the situation of a treatment that can reduce the risk of an uncommon disease. There can be multiple ways to describe the effectiveness of the treatment that are perceived very differently by most readers. For instance:

1. The treatment halves the risk of the disease.
2. The treatment increases the chances of being healthy by one percentage point.
3. Neglecting treatment doubles the risk of the disease.

A. Suppose the risk of disease in the untreated group is 2%, and the risk in the treated group is 1%. Is this consistent with statement (i)? With statement (ii)? With statement (iii)?

B. Suppose the chances of staying healthy in the untreated group is 98%, and the chances in the treated group is 99%. Is this consistent with statement (i)? With statement (ii)? With statement (iii)?

A desirable property of a mathematical description of risk is that the numerical value of a change in risk should have the same magnitude whether one is speaking of the risk of disease or the chances of staying healthy.

C. Focus for the moment on the description in terms of risk of disease: 1% in the treated group and 2% in the untreated group. i. What are the odds of getting the disease in the treated group? (Recall that an event with probability $$p$$ has odds $$\frac{p}{1-p}$$.) ii. What are the odds of getting the disease in the untreated group? iii. Calculate the difference in log-odds between the treated group and the untreated group. (In R, log() computes the logarithm.)

D. Now turn to the description in terms of the chances of staying healthy: 99% in the treated group and 98% in the untreated group. i. What are the odds of staying healthy in the treated group? ii. What are the odds of staying healthy in the untreated group? iii. Calculate the difference in log-odds between the treated and untreated groups.

E. There is a difference in log-odds differ between the analysis in (C) and the analysis in (D). Explain what the difference is and give a common-sense explanation for it.

Lesson 34

The data frame math300::Gilbert has a row for each of 1641 shifts at a VA hospital. (For the story behind the data, give the command ?Gilbert.) The variable deaths records whether or not a death occurred on that shift, while gilbert records whether nurse Kristen Gilbert was on duty.

A. Using mutate() transform deaths to a zero-one variable where 1 indicates whether a death occurred on the shift.

B. Using the data from (A), fit the regression model death ~ gilbert and extract the confidence interval on the gilberton_duty coefficient.

i. The coefficient on indicates that Gilbert being on duty is associated with a 13.1 percentage point increase in the risk of a death during the shift. Gilbert was on duty for 257 shifts. Using the coefficient (.coef), how many deaths can reasonably be attributed to Gilbert?
ii. The .lwr and .upr bounds of the confidence interval suggest a range in the number of deaths that these data indicate can be attributed to Gilbert. What is that range?

How would you explain the

Lesson 35

Being a statistician, I am often approached by friends or acquaintances who have recently gotten a “positive” result on a medical screening test, for example cholesterol testing or prostate-specific antigen (PSA). They want to know how likely it is that they have the condition—heart disease or prostate cancer—being screened for. Before I can answer, I have to ask them an important question: How did you come to have the test? I want to know if the test was done as part of a general screening or if the test was done because of some relevant symptoms.

To illustrate why the matters of symptoms is important, consider a real-world test for schizophrenia.

In 1981, President Reagan was among four people shot by John Hinkley, Jr. as they were leaving a speaking engagement at a D.C. hotel. At trial, Hinckley’s lawyer presented an “insanity” defense, there being a longstanding legal principle that only people in control of their actions can be convicted of a crime.

As part of the evidence, Hinkley’s defense team sought to present a CAT scan showing atrophy in Hinkley’s brain. About 30% of schizophrenics had such atrophy, compared to only 2% of the non-schizophrenic population. Both of these are likelihoods, that is, a probability of what’s observed given the state of the subject.

A. Based on the above, do you think the CAT scan would be strong evidence of schizophrenia?

A proper calculation of the probability that a person with atrophy is schizophrenic depends on the prevalence of schizophrenia. This was estimated at about 1.5% of the US population.

Calculating the probability of the subject’s state given the observation of atrophy involves comparing two quantities, both of which have the form of a likelihood times a prevalence.

• Evidence in favor of schizophrenia: $\underbrace{30\%}_\text{likelihood} \times \underbrace{1.5\%}_\text{prevalence} = 0.45\%$
• Evidence against schizophrenia: $\underbrace{2\%}_\text{likelihood} \times \underbrace{98.5\%}_\text{prevalence} = 1.97\%$ The probability of schizophrenia given atrophy compares the evidence for schizophrenia to the total amount of evidence: $\frac{0.45\%}{1.97\% + 0.45\%} = 18.6\%\ .$ Based just on the result of the test for atrophy, Hinkley was not very likely to be a schizophrenic.

This is where the “How did you come to have the test?” question comes in.

For a person without symptoms, the 18.6% calculation is on target. But Hinkley had very definite symptoms: he had attempted an assassination. (Also, Hinkley’s motivation for the attempt was to impress actress Jody Foster, to “win your heart and live out the rest of my life with you.)

The prevalence of of schizophrenia among prisoners convicted of fatal violence is estimated at about 10 times that of the general population. Presumably, it is even higher among those prisoners who have other symptoms of schizophrenia.

B. Repeat the “evidence for” and “against” schizophrenia, but updated for a prevalence of 20% instead of the original 1.5%. Has this substantially change the calculated probability of schizophrenia?

Epilogue: Hinkley was found not guilty by virtue of insanity. He was given convalescent leave from the mental hospital in 2016 and released entirely in 2022.

Note: This question is based on a discussion in the July 1984 “Misapplications Reviews” column of INTERFACES **14(4):48-52.

Lesson 36

The data frame math300::Gilbert has a row for each of 1641 shifts at a VA hospital. (For the story behind the data, give the command ?Gilbert.) The variable deaths records whether or not a death occurred on that shift, while gilbert records whether nurse Kristen Gilbert was on duty.

A. Using mutate() transform deaths to a zero-one variable where 1 indicates whether a death occurred on the shift.

B. Using the data from (A), fit the regression model death ~ gilbert and extract the confidence interval on the gilberton_duty coefficient.

i. The coefficient on indicates that Gilbert being on duty is associated with a 13.1 percentage point increase in the risk of a death during the shift. Gilbert was on duty for 257 shifts. Using the coefficient (.coef), how many deaths can reasonably be attributed to Gilbert?
ii. The .lwr and .upr bounds of the confidence interval suggest a range in the number of deaths that these data indicate can be attributed to Gilbert. What is that range?

How would you explain the

Lesson 37

DRAFT: ROUTINE QUESTIONS ABOUT WHETHER TO reject the Null: coefficients edition

Samp <- sample(dag02, size=20)
lm(a ~ x, data = Samp) %>% conf_interval()
term .lwr .coef .upr
(Intercept) -0.5001754 0.0192005 0.5385764
x -0.1627069 0.3899748 0.9426564
lm(a ~ x, data = Samp) %>% R2()
n k Rsquared F adjR2 p df.num df.denom
20 1 0.1088037 2.197571 0.0592928 0.1538147 1 18
lm(a ~ x, data = Samp) %>% regression_summary()
term estimate std.error statistic p.value
(Intercept) 0.0192005 0.2472133 0.0776677 0.9389492
x 0.3899748 0.2630662 1.4824205 0.1555249
lm(a ~ x + y, data = Samp) %>% regression_summary()
term estimate std.error statistic p.value
(Intercept) 2.1926047 0.3467561 6.323189 7.6e-06
x 1.5242267 0.2185714 6.973586 2.2e-06
y -0.4657688 0.0687006 -6.779696 3.2e-06

DRAFT: ROUTINE QUESTIONS ABOUT WHETHER TO reject the Null: ANOVA edition.

DRAFT: A DEMONSTRATION OF POISSON REGRESSION

MAKE AN EXERCISE OUT OF THIS.

Another way that we failed involves covariates. The covariate sex explains a lot of the variation in height from person to person. Including sex as a covariate will “eat variance” (see Lesson 29), narrowing the confidence interval. To start, notice that the margin of error on the mother coefficient is about 0.3. Then, let’s include sex as a covariate.

lm(height ~ mother + sex, data = Galton %>% sample(size=100)) |>
conf_interval()
term .lwr .coef .upr
(Intercept) 21.3102554 35.2182698 49.1262843
mother 0.2306213 0.4448445 0.6590677
sexM 5.3068031 6.2743771 7.2419511

Notice that the margin of error on the mother coefficient is about 0.2 inches. That’s much smaller than the 0.3 we got when we failed to include sex as a covariate.

DRAFT: Non-parametric tests and the rank transformation.

DRAFT: Robust regression

DRAFT: Chi-squared example : counts.

DRAFT: Chi-squared example : 2-way tables compared to simple regression.

DRAFT: Drill on the names of traditional tests: which model formula do they correspond to.

Lesson 38

3)

Suppose that the difference between the sample groups turns out not to be significant, even though your review of the research suggested that there really is a difference between men and women. Which conclusion is most reasonable?

6% a) Something went wrong with the analysis.

6% b) There must not be a difference after all.

88% c) The sample size might have been too small.

6)

Suppose that two different studies are conducted on this issue. Study A finds that 40 of 100 women sampled dream in color, compared to 20 of 100 men. Study B finds that 35 of 100 women dream in color, compared to 25 of 100 men. Which study provides stronger evidence that there is a difference between men and women on this issue?

78% a) Study A

2% b) Study B

20% c) The strength of evidence would be similar for these two studies.

7)

Suppose that two more studies are conducted on this issue. Both studies find that 30% of women sampled dream in color, compared to 20% of men. But Study C consists of 100 people of each sex, while Study D consists of 40 people of each gender. Which study provides stronger evidence that there is a difference between men and women on this issue?

82% a) Study C

8% b) Study D

10% c) The strength of evidence would be similar for these two studies.

10)

MAYBE THIS SHOULD BE AN EXAMPLE IN EXPERIMENTS, make a simulation where “friendliness” of the people at the table is linked to both tip amount and the probability that Alicia gives her name.

Alicia wants to know if she receives higher tips, on average, when she gives customers her name compared to when she does not. She decides to track her tips for one week keeping track of the amounts and whether she gives the customers her name or not. She finds that giving her name led to a statistically significant higher average tip amount. Of the options below, what additional information is necessary for Alicia to be able to conclude that giving her name causes a higher tip, on average?

1. The number of tables where she does and does not give her name

2. The method Alicia used to decide which tables she would tell or not tell her name

3. No additional information is necessary because the result was statistically significant.

Exercise: Sampling variation in p-value

This is a story about an experiment carefully designed and carried out by university physicists in Munich, Germany, in the late 1980s.

The physicists were evaluating claims made by many people that they are “dowsers,” that is, that they can detect flowing water by the perception of “earthrays.” The physicists recruited 500 self-identified dowsers. Each of these went through roughly ten experimental trials. In each trial, a line 10 meters long was drawn on the second-story floor of a barn. On the ground floor, another line of 10 meters was drawn, exactly parallel and directly underneath the second-story line. Out of the view of the dowswers, a pipe with flowing water was placed perpendicular to the ground-floor line at a random position along the line. The dowser walked along the second-story line and indicated where they thought the pipe was located.

Most of the so-called dowsers were not very good at the task. Out of the 500, forty-three people were identified as having demonstrated skill. These 43 were brought back for a new set of trials. Of these, six were identified as showing actual skill. These dowsers, the researchers wrote in their report, “in particular tasks, showed an extraordinarily high rate of success, which can scarcely if at all be explained as due to chance … a real core of dowser-phenomena can be regarded as empirically proven.”

Of the 43, the data for the best three dowsers was reported and is available in the math300::Dowsing data frame.

A. What does the selection of 43 “skilled” dowsers out of the original panel of 500 indicate about interpreting the study result.

Not very much. It is reasonable for an experimental to check on the experimental material before the experiment. One can regard the testing of the 500 as checking on the experimental material, while the actual experiment involved just the 43.

B. Using the Dowsing data, fit the model pipe ~ guess. Examine a summary of the model to find a p-value. What does this p-value indicate?

In this simple model, any of the usual regression model summaries will point to the same p-value:

mod <- lm(pipe ~ guess, data = Dowsing)
mod |> R2()
n k Rsquared F adjR2 p df.num df.denom
26 1 0.0986092 2.62552 0.0610512 0.1172251 1 24
mod |> conf_interval()
term .lwr .coef .upr
(Intercept) 6.5164745 30.0736931 53.6309116
guess -0.0842742 0.3078625 0.6999993
mod |> regression_summary()
term estimate std.error statistic p.value
(Intercept) 30.0736931 11.4139420 2.634821 0.0145128
guess 0.3078625 0.1899981 1.620346 0.1182247
mod |> anova_summary()
term df sumsq meansq statistic p.value
guess 1 1778.925 1778.9245 2.62552 0.1182247
Residuals 24 16261.229 677.5512 NA NA

The p-value does not indicate that there is a relationship between guess and pipe. If you think of the result as being the best p-value out of 14 randomly selected trios of dowswers, we would expect the result as likely to be below 0.10.

C. It’s appropriate to look at the three dowsers individually. Model pipe ~ guess for each of the three dowswers and report the three p-values. What do these indicate?

lm(pipe ~ guess, data=Dowsing |> filter(subject=="99")) |> R2()
n k Rsquared F adjR2 p df.num df.denom
10 1 0.0729793 0.6297962 -0.0428983 0.4458523 1 8
lm(pipe ~ guess, data=Dowsing |> filter(subject=="18")) |> R2()
n k Rsquared F adjR2 p df.num df.denom
6 1 0.2960147 1.681937 0.1200184 0.2423041 1 4
lm(pipe ~ guess, data=Dowsing |> filter(subject=="108")) |> R2()
n k Rsquared F adjR2 p df.num df.denom
10 1 0.1489826 1.400513 0.0426054 0.2640072 1 8

None of the “best” dowsers show any skill; the p-values are all large.

The p-value found in (B) is the result of combining the data from the best 3 out of 43 dowsers.

Source: This question is drawn from J.T. Enright (1999) “Testing dowsing: the failure of the Munich experiments” Skeptical Inquirer 23(1) link

Section 30.2 tells the story of a claim of a direct causal link between use of night-lights in children’s bedrooms and the development of nearsightedness. The authors pointed to “the statistical strength of the association of night-time light exposure and childhood myopia” as suggesting a causal link. A figure caption in the table presents the statistical strength as $$p<0.00001$$.

A. Explain whether $$p < 0.00001$$ is a good basis for making a causal claim.

Another paper presented a rebuttal to the claim. It points out that there is a possible covariate: the parents’ vision. “The proportion of myopic children with two myopic parents was significantly greater than the proportion of myopic children with zero or one myopic parent ($$p<0.05$$).” This p-value is more than 1000-times larger than the one given in support of the original claim.

B. Explain the extent to which the rebuttal’s much larger p-value invalidates the claim made in the rebuttal.