Math 300Z: In-class group activity

In this activity you will investigate how adding a covariate to a model can change the coefficient on another explanatory variable. The basic logic is to construct a series of pairs of nested models and, in each pair, examine the confidence interval on the coefficient of an explanatory variable.

You will use the Anthro_F data frame, which records body-shape measurements made on 184 women: knee circumference, ankle circumference, and so on. Since people have similar body shapes regardless of size, these measurements tend to be correlated with one another.

The models you will build will have BFat as the response variable. BFat is the measured proportion of body weight that is fat tissue. As an example of a nested pair of models, consider:

  1. BFat ~ Knee
  2. BFat ~ Knee + Ankle

Model (1) is “nested” in model (2) because model (2) includes all the explanatory variables in model (1). (Nested models always have the same response variable.)

Your analysis of each pair will compare the confidence interval on the explanatory variable in the smaller model to the interval on that same variable in the larger model. For example (here the response happens to be Height rather than BFat, but the workflow is the same):

lm(Height ~ Knee, data=Anthro_F) |> conf_interval()
# A tibble: 2 × 4
  term           .lwr   .coef    .upr
  <chr>         <dbl>   <dbl>   <dbl>
1 (Intercept) 1.31    1.44    1.57   
2 Knee        0.00275 0.00636 0.00996
lm(Height ~ Knee + Ankle, data=Anthro_F) |> conf_interval()
# A tibble: 3 × 4
  term              .lwr   .coef    .upr
  <chr>            <dbl>   <dbl>   <dbl>
1 (Intercept)  1.25      1.40    1.55   
2 Knee         0.0000406 0.00465 0.00925
3 Ankle       -0.00325   0.00479 0.0128 

Working with your group partners, try to find pairs of explanatory variables such that the coefficient on one variable changes greatly when the covariate is added to the model.

To help, here is the correlation between many pairs of variables, presented in the form of an angle in degrees. (See the background section below.) Every variable has an angle of 0 with itself. Small angles mean the two variables are closely aligned; large angles mean they are not.

corrs <- Anthro_F |>
  select(Neck, Chest, Calf, Biceps, Hips, Waist, PThigh, MThigh, DThigh, Forearm, Wrist, Knee, Elbow, Ankle, Age) |>
  cor() 
round(180*acos(corrs)/pi) 
        Neck Chest Calf Biceps Hips Waist PThigh MThigh DThigh Forearm Wrist Knee Elbow Ankle Age
Neck       0    53   63     49   52    47     51     53     58      47    51   60    53    58  95
Chest     53     0   71     55   58    52     55     58     67      55    62   67    56    71  90
Calf      63    71    0     65   63    64     60     59     60      58    63   62    63    58  89
Biceps    49    55   65      0   45    41     42     43     52      34    45   54    43    60  93
Hips      52    58   63     45    0    40     24     38     48      44    51   42    46    50  90
Waist     47    52   64     41   40     0     40     48     57      44    49   52    46    58  95
PThigh    51    55   60     42   24    40      0     27     43      42    49   40    45    50  92
MThigh    53    58   59     43   38    48     27      0     43      43    47   48    47    50  92
DThigh    58    67   60     52   48    57     43     43      0      50    55   49    55    53  92
Forearm   47    55   58     34   44    44     42     43     50       0    34   50    39    49  93
Wrist     51    62   63     45   51    49     49     47     55      34     0   54    45    48  94
Knee      60    67   62     54   42    52     40     48     49      50    54    0    54    51  97
Elbow     53    56   63     43   46    46     45     47     55      39    45   54     0    52  95
Ankle     58    71   58     60   50    58     50     50     53      49    48   51    52     0  96
Age       95    90   89     93   90    95     92     92     92      93    94   97    95    96   0

TASK: Pick several pairs of variables, some with small angles between them and some with large angles. For each pair, compare the first variable’s coefficient between the two nested models.

As a group, what is the biggest change you observed in a coefficient when the covariate was added to the model? (You’ll have to agree on a way to measure change in the coefficient.) Do large or small angles tend to produce bigger changes?
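
As a starting point, here is a minimal sketch of one possible way to quantify the change. The pair Hips and PThigh is used purely as an illustration (their angle in the table is small), and measuring the change as a percent change in the coefficient is only one of several reasonable choices your group might make:

# Fit a nested pair of models with BFat as the response.
small <- lm(BFat ~ Hips,          data=Anthro_F)
large <- lm(BFat ~ Hips + PThigh, data=Anthro_F)

# coef() pulls out the fitted coefficients by name.
b_small <- coef(small)["Hips"]
b_large <- coef(large)["Hips"]

# One possible measure: the percent change in the Hips coefficient.
100 * (b_large - b_small) / b_small

# The confidence intervals for the group comparison, as in the example above.
small |> conf_interval()
large |> conf_interval()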

Background: r, R2, and the “Angle” between variables

Since Francis Galton’s invention/discovery of the correlation coefficient in 1888, it has been the standard introduction to measuring the relationship between two quantitative variables. The word “correlation” has even entered everyday English, meaning “a mutual relationship or connection between two or more things” (Oxford Dictionaries).

Also in 1888, the phenomenon of electromagnetic waves was discovered by physicist Heinrich Hertz. These had been theoretically predicted in 1865 by James Clerk Maxwell. The mathematics of Maxwell’s representation of electromagnetism was very difficult. Consequently, physicists and mathematicians worked to create a simpler formalism. This eventually emerged in the university-level curriculum as two courses: vector calculus (usually called Calc III) and linear algebra. Naturally, in 1888, Galton was unaware of these developments. Nonetheless, vectors and linear algebra provide a great simplification of the concept of correlation.

Any quantitative variable (a series of numbers) is also a “vector,” which you can think of as an arrow pointing in a particular direction. The correlation coefficient between two variables amounts to the cosine of the angle between the two vectors (after each variable has had its mean subtracted). When the angle is very small, the variables are strongly aligned. When the angle is near 90 degrees, the two variables are not at all aligned.

In R, a standard way of calculating the correlation coefficient uses cor() as in this example:

Galton |> summarize(correlation = cor(height, mother))
  correlation
1   0.2016549

The translation of the correlation coefficient into an angle (in degrees) involves some trigonometry (which is not a topic of Math 300):

acos(0.202)*180/pi
[1] 78.34606

78 degrees is pretty close to a right angle, meaning that height and mother are barely aligned.
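
If you are curious where the angle comes from, here is a minimal sketch (not required for Math 300). After subtracting the mean from each variable, the cosine of the angle between the two columns, computed with a dot product, is exactly the correlation coefficient:

# Center the two variables, then compute the cosine of the angle between them.
x <- Galton$height - mean(Galton$height)
y <- Galton$mother - mean(Galton$mother)
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
cosine                    # the same value as cor(height, mother) above
acos(cosine) * 180 / pi   # the angle in degrees, about 78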

Aside: R2 and r

We have not emphasized the correlation coefficient r in Math 300 because r describes only the (linear) relationship between two variables. In Math 300 we often use multiple explanatory variables, and r does not apply. Instead, we use R2, a much more general description of the relationship between a response variable and the explanatory variables.

In the case where a model has only one explanatory variable, e.g., height ~ mother, R2 has a simple relationship to r, namely, r2 = R2. This use of lower-case (r) and upper-case (R) can be confusing, so we are not using r much in Math 300.

To demonstrate the relationship between r and R2, consider:

lm(height ~ mother, data=Galton) |> R2()
    n k  Rsquared        F      adjR2            p df.num df.denom
1 898 1 0.0406647 37.98001 0.03959401 1.078142e-09      1      896

Since r between height and mother was 0.2016, you can confirm that R2 from the model is exactly r2.
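
Here is a minimal sketch of that confirmation, squaring the correlation coefficient computed earlier and comparing it to the Rsquared value in the output above:

r <- cor(Galton$height, Galton$mother)   # 0.2016549, as before
r^2                                      # 0.0406647, matching Rsquared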