Lesson 28: Worksheet

Author

Jane Doe

In this Worksheet, you’ll explore how adding a covariate to a model can change the coefficient on an explanatory variable.

We will work with two nested models trained on the Galton data frame:

(i) height ~ mother
(ii) height ~ mother + covar, where you will replace covar with father or with a variable we will construct.

Task A. Replace covar with father and observe how the coefficient on mother changes between (i) and (ii). Describe the change as large or small, and give your definition of “large” and “small” in this context.

ANSWER

The two models are

lm(height ~ mother, data=Galton) |> conf_interval()

# A tibble: 2 × 4
  term          .lwr  .coef   .upr
  <chr>        <dbl>  <dbl>  <dbl>
1 (Intercept) 40.3   46.7   53.1  
2 mother       0.213  0.313  0.413

lm(height ~ mother + father, data=Galton) |> conf_interval()

# A tibble: 3 × 4
  term          .lwr  .coef   .upr
  <chr>        <dbl>  <dbl>  <dbl>
1 (Intercept) 13.9   22.3   30.8  
2 mother       0.187  0.283  0.380
3 father       0.290  0.380  0.470

I describe the change in the mother coefficient as small. The coefficient moves from 0.313 to 0.283, a difference of 0.030, while each confidence interval is about 0.2 wide. The difference between the .coefs is therefore small compared to the width of the confidence intervals, and the two intervals overlap almost entirely.
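One possible way to make “small” concrete is to express the change in the coefficient as a fraction of the confidence interval width. The sketch below uses only the numbers printed above; the variable names are illustrative and not part of the worksheet.

```r
# Values copied from the mother row of the conf_interval() output above
coef_without_father <- 0.313
coef_with_father    <- 0.283
ci_without_father   <- c(lwr = 0.213, upr = 0.413)

# Change in the coefficient, and the width of the interval it is compared to
change <- coef_without_father - coef_with_father
width  <- unname(ci_without_father["upr"] - ci_without_father["lwr"])
ratio  <- change / width

print(c(change = change, width = width, ratio = ratio))
```

The change is roughly 15% of the interval width. Under this definition, a change that is only a small fraction of the interval width counts as small.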

Task B. An important factor in whether a covariate changes a coefficient is the strength of the relationship between the covariate and the other explanatory variable. Measure the strength of the relationship between mother and father by fitting the model mother ~ father and finding R². Describe whether the R² you find in this way is large or small. (You’ll have to give a definition for “large” and “small” in this context. It will be different than for the context in part (A). Also, note that in this task, the response variable is mother, not height.)

ANSWER

To look at the relationship between the covariate father and the other explanatory variable mother, fit the model mother ~ father and find R²:

lm(mother ~ father, data=Galton) |> R2()

    n k    Rsquared       F      adjR2          p df.num df.denom
1 898 1 0.005426475 4.88865 0.00431646 0.02728512      1      896

The R² statistic can range from zero to one. On that scale, the above R² is very close to zero, so there is little if any relationship between the mother’s height and the father’s height. Following social convention, usually there is little or no genetic relationship between the mother and the father. But if you think that married couples tend to be similar in height, the Galton data suggests otherwise.
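To put this R² on a more familiar scale: in a model with a single quantitative explanatory variable, R² equals the square of the correlation coefficient. A quick sketch using the value printed above (the variable names are illustrative only):

```r
# R-squared copied from the printed output of lm(mother ~ father) above
r_squared <- 0.005426475

# With one quantitative explanatory variable, R^2 is the squared correlation,
# so its square root gives the magnitude of the correlation.
abs_correlation <- sqrt(r_squared)
print(round(abs_correlation, 3))
```

A correlation of roughly 0.07 in magnitude is weak, which is consistent with calling this R² small.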

Task C. Now you are going to create a new covariate that is going to be closely related to mother. This is a matter of making the new variable very similar to mother, like this:

Galton <- Galton %>% 
  mutate(new_var = 6*mother + 2*father)

The new_var consists of six parts mother and two parts father.

What is the R² between mother and new_var?
Comparing the mother coefficient from the nested pair of models height ~ mother and height ~ mother + new_var, would you say that new_var changes things substantially?

ANSWER

lm(mother ~ new_var, data = Galton) |> R2()

    n k  Rsquared        F     adjR2 p df.num df.denom
1 898 1 0.8926256 7448.632 0.8925057 0      1      896

R² is about 0.89, which is large: new_var is closely related to mother. And now the change in the mother coefficient:

lm(height ~ mother, data=Galton) |> conf_interval()

# A tibble: 2 × 4
  term          .lwr  .coef   .upr
  <chr>        <dbl>  <dbl>  <dbl>
1 (Intercept) 40.3   46.7   53.1  
2 mother       0.213  0.313  0.413

lm(height ~ mother + new_var, data=Galton) |> conf_interval()

# A tibble: 3 × 4
  term          .lwr  .coef   .upr
  <chr>        <dbl>  <dbl>  <dbl>
1 (Intercept) 13.9   22.3   30.8  
2 mother      -1.15  -0.856 -0.563
3 new_var      0.145  0.190  0.235

Notice that the mother coefficient changed sign between the two models, from 0.313 to -0.856, and the two confidence intervals do not overlap at all. So yes, new_var changes things substantially. This is called “Simpson’s Paradox.” But it’s really only a paradox to people who don’t understand that using a covariate that is closely related to an explanatory variable can substantially change the coefficient on the explanatory variable.
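The sign flip can be traced with a little algebra. Because new_var is an exact linear combination of mother and father, the model height ~ mother + new_var produces the same fitted values as height ~ mother + father, only expressed in different terms. The symbols b_0, b_m and b_f below are introduced here for illustration; they stand for the intercept, mother and father coefficients from Task A.

```latex
\begin{align*}
\text{new\_var} &= 6\,\text{mother} + 2\,\text{father}
  \quad\Longrightarrow\quad
  \text{father} = \tfrac{1}{2}\,\text{new\_var} - 3\,\text{mother} \\
\widehat{\text{height}} &= b_0 + b_m\,\text{mother} + b_f\,\text{father} \\
  &= b_0 + (b_m - 3\,b_f)\,\text{mother} + \tfrac{1}{2}\,b_f\,\text{new\_var} \\
  &\approx 22.3 + (0.283 - 3 \times 0.380)\,\text{mother} + 0.190\,\text{new\_var} \\
  &= 22.3 - 0.857\,\text{mother} + 0.190\,\text{new\_var}
\end{align*}
```

This agrees with the fitted coefficients (-0.856 and 0.190) up to rounding of the displayed values, and the intercept is unchanged at 22.3.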

Task D. Going back to the commands in (C), increase the mixture in new_var to fifty parts mother and one part father. What happens to the width of the confidence interval on mother when this close copy of mother is used as a covariate?

ANSWER

Note that we have already calculated, above, the mother coefficient from height ~ mother.

Galton <- Galton %>% 
  mutate(new_var = 50*mother + 1*father)
lm(height ~ mother + new_var, data=Galton) |> conf_interval()

# A tibble: 3 × 4
  term           .lwr   .coef    .upr
  <chr>         <dbl>   <dbl>   <dbl>
1 (Intercept)  13.9    22.3    30.8  
2 mother      -23.2   -18.7   -14.2  
3 new_var       0.290   0.380   0.470

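The confidence interval on mother becomes extremely wide! It now runs from -23.2 to -14.2, a width of about 9, compared with a width of 0.2 in height ~ mother.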
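The same widening can be reproduced without the Galton data. The sketch below is an illustration under stated assumptions: it uses synthetic heights (not Galton), base R's lm() and confint() rather than conf_interval(), and a hypothetical helper named mother_ci_width().

```r
# Synthetic stand-in for Galton (NOT the real data): heights in inches.
set.seed(28)
n <- 898
mother <- rnorm(n, mean = 64, sd = 2.3)
father <- rnorm(n, mean = 69, sd = 2.5)
height <- 22 + 0.28 * mother + 0.38 * father + rnorm(n, sd = 3.4)

# Width of the 95% confidence interval on mother when the covariate is
# a_parts * mother + b_parts * father (hypothetical helper for this sketch).
mother_ci_width <- function(a_parts, b_parts) {
  new_var <- a_parts * mother + b_parts * father
  fit <- lm(height ~ mother + new_var)
  ci <- confint(fit)["mother", ]
  unname(ci[2] - ci[1])
}

widths <- c(
  father_only  = mother_ci_width(0, 1),   # covariate is just father
  six_to_two   = mother_ci_width(6, 2),   # mixture from Task C
  fifty_to_one = mother_ci_width(50, 1)   # mixture from Task D
)
print(widths)
```

The closer the covariate comes to being a copy of mother, the wider the interval on mother gets, even though all three models produce the same fitted values.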

Task E. Just for interest’s sake … Like Task D, but make new_var 50 parts mother and zero parts father. Something perhaps unexpected happens to both the mother and the new_var coefficients. Describe what this is.

ANSWER

Galton <- Galton %>% 
  mutate(new_var = 50*mother + 0*father)
lm(height ~ mother + new_var, data=Galton) |> conf_interval()

# A tibble: 3 × 4
  term          .lwr  .coef   .upr
  <chr>        <dbl>  <dbl>  <dbl>
1 (Intercept) 40.3   46.7   53.1  
2 mother       0.213  0.313  0.413
3 new_var     NA     NA     NA

Since new_var and mother have R²=1, new_var provides no new information. R is programmed to recognize such cases (which are typically the result of a mistake by the modeler) and disregard the no-new-information variable. (This is indicated with NA.) With new_var no longer in the model, the mother coefficient returns to its value from the smaller of the nested models!
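This behavior does not depend on the Galton data. A minimal sketch with synthetic heights (not Galton) and base R's coef() in place of conf_interval() shows the same pattern; the variable names and the simulated values are assumptions made for this illustration.

```r
# Synthetic heights in inches (NOT the Galton data), for illustration only.
set.seed(28)
n <- 200
mother <- rnorm(n, mean = 64, sd = 2.3)
height <- 46.7 + 0.313 * mother + rnorm(n, sd = 3.4)

# A covariate that is an exact multiple of mother, as in Task E.
new_var <- 50 * mother

small_fit <- lm(height ~ mother)
big_fit   <- lm(height ~ mother + new_var)

print(coef(small_fit))
print(coef(big_fit))   # the new_var entry is reported as NA
```

lm() drops the redundant column and reports NA for it, so the remaining coefficients in the larger model match those of the smaller one.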