nrow(FEV)
[1] 654
Jane Doe
You are going to work with data collected in the 1970s to examine the effects of smoking and exposure to second-hand smoke on pulmonary functions in youths. The data frame is FEV and is included in the {math300} package.
The response variable that we will study is also called FEV, standing for the “forced expiratory volume” measured in the participants in the study. In general, higher forced expiratory volume is considered a sign of better respiratory health.
Build a model using the FEV data frame and the model specification FEV ~ smoker and pipe it through conf_interval(). Then answer these questions:
One of the coefficients is named smokersmoker. This looks like a typo, but it is not. Explain why the variable is smoker but the name of the coefficient is smokersmoker. (Hint: Look at the documentation for FEV.)
The smokersmoker coefficient is 0.71. Based on this, is smoking associated with higher or lower FEV?
What are the units of FEV? What are the units of the smokersmoker coefficient?
… smokersmoker?
Just for pedagogical purposes, we are going to explore how the width of the confidence interval would change if we had more or less data. You have already heard the theoretical relationship between the width of the confidence interval and the sample size \(n\).
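As a reminder of that relationship, the width of a confidence interval shrinks roughly in proportion to \(1/\sqrt{n}\). A minimal base-R sketch of the scaling follows; the helper name scale_ci_width() is made up for this illustration and is not part of {math300}.

```r
# Rule of thumb: CI width is roughly proportional to 1 / sqrt(n).
# scale_ci_width() is a hypothetical helper, not a {math300} function.
scale_ci_width <- function(width_old, n_old, n_new) {
  width_old * sqrt(n_old / n_new)
}

# Quadrupling the sample size should roughly halve the width.
scale_ci_width(width_old = 1, n_old = 100, n_new = 400)  # 0.5
```

The same call, with the observed width and the two sample sizes filled in, gives a theoretical prediction to compare against a simulated interval.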
… FEV?
We can easily simulate working with a sample of \(n=150\). To do this, fit a model (and calculate the confidence interval on smokersmoker), but rather than using the argument data=FEV use this instead: data=sample(FEV, size=150). Compare the width of the confidence interval you get in this way to your theoretical prediction in (2).
This will be surprising, but we can actually simulate what would happen if we had a larger sample size. (This is just a simulation, and just for pedagogical purposes. This is not a way to collect a genuine sample of a larger size.)
To create a (simulated) sample of size, say, \(n=2500\), set the data argument to lm() this way: data=resample(FEV, size=2500). (NOTE: The function being used here is not sample() but the closely related resample(), with an re in front.)
Calculate the width of the (simulated) confidence interval on smokersmoker for the sample size of 2500.
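The subsampling and resampling steps can also be sketched without {math300}. The block below is a base-R analogue, not the worksheet's own method: ci_width() is a hypothetical helper that draws rows with sample.int(), fits lm(), and reads the interval from confint(). The built-in mtcars data stands in for FEV so that the block runs on its own.

```r
# ci_width() is a hypothetical helper (not from {math300}). It draws n rows
# from `data`, refits the model, and returns the width of the 95% confidence
# interval on the coefficient named `term`.
ci_width <- function(data, formula, term, n, replace = FALSE) {
  rows <- sample.int(nrow(data), size = n, replace = replace)
  model <- lm(formula, data = data[rows, , drop = FALSE])
  interval <- confint(model, parm = term, level = 0.95)
  unname(interval[1, 2] - interval[1, 1])
}

set.seed(101)  # arbitrary seed, only so the run is repeatable

# mtcars (32 rows) is a stand-in for FEV here.
# Smaller sample, drawn without replacement (the role sample() plays above).
width_small <- ci_width(mtcars, mpg ~ wt, term = "wt", n = 16)

# Larger simulated sample, drawn with replacement (the role of resample()).
width_large <- ci_width(mtcars, mpg ~ wt, term = "wt", n = 2000, replace = TRUE)

c(small = width_small, large = width_large)
```

With {math300} loaded, a call such as ci_width(FEV, FEV ~ smoker, term = "smokersmoker", n = 150) would play the same role as data=sample(FEV, size=150). Drawing with replacement is what lets the simulated sample be larger than the original data.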
Were you surprised to see in Task 1 that smoking is associated with a higher FEV than non-smoking? Since a higher FEV is considered healthier, does this mean that smoking is healthy? The answer is “no,” but let’s consider it from the perspective of accuracy versus precision.
The confidence interval on smokersmoker in the model FEV ~ smoker was 0.50 to 0.93 liters. This precision is good enough to rightfully claim that the smokersmoker coefficient is not zero or negative.
But precision is different from accuracy. One of the major potential determinants of FEV is age.
Fit the model FEV ~ age and construct the confidence interval on age. Explain whether your model is consistent or not with the idea that FEV depends on age.
Next, consider the model smoker ~ age. Since smoker is a categorical variable, we need to convert it to a zero-one variable before fitting the model. Here’s a chunk to do so, assigning the smokers to have a value of 1:
FEV |>
mutate(smoke = zero_one(smoker, one="smoker")) |>
lm(smoke ~ age, data = _) |>
conf_interval()
# A tibble: 2 × 4
  term           .lwr   .coef    .upr
  <chr>         <dbl>   <dbl>   <dbl>
1 (Intercept) -0.381  -0.308  -0.234
2 age          0.0338  0.0410  0.0481
Interpret the coefficient as a rate of probability: how the probability that a participant smokes changes per year of age.
Is the age coefficient consistent with the idea that older kids are more likely to smoke?
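To make the rate-of-probability reading concrete, the reported coefficients can be plugged into the fitted straight line. This is only a sketch that copies the intercept and age coefficient from the output above; the ages 10 and 15 are arbitrary illustration values.

```r
# Coefficients copied from the conf_interval() output for smoke ~ age.
intercept <- -0.308
age_coef  <- 0.0410   # change in probability of smoking per year of age

# Model value (read as a probability) at two arbitrary ages.
ages <- c(10, 15)
p_smoke <- intercept + age_coef * ages
p_smoke               # about 0.102 and 0.307

# Five years of age corresponds to 5 * age_coef in probability.
diff(p_smoke)         # about 0.205
```

Because the model is a straight line, it can return values below zero at young ages, so the reading as a probability only makes sense over part of the age range.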
Now for the point about accuracy and precision being different things. FEV increases with age and so does the probability of being a smoker. That means that smoker is also related to age. In fact, to some extent smoker is a proxy for age. As an exercise, draw on paper a DAG where age influences FEV, and age influences smoking status, and also smoking status influences FEV.
For the DAG just described, an accurate model to estimate the direct effect of smoking on FEV is FEV ~ age + smoker. Fit this model and use the confidence interval on smoker to make an accurate statement about smoking and FEV.
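The DAG above can also be explored with simulated data. The sketch below does not use the FEV data: every number in it (sample size, age range, effect sizes) is invented for illustration, and smoking is given a negative direct effect on the simulated response. It shows how a model that omits age can still report a positive smoker coefficient.

```r
set.seed(300)  # arbitrary seed for repeatability
n <- 5000      # invented sample size

# age influences smoking status ...
age <- runif(n, min = 3, max = 19)
smoker <- rbinom(n, size = 1, prob = plogis(-6 + 0.4 * age))

# ... and both age and smoking influence the response.
# The direct effect of smoking is set to -0.3 (an invented value).
fev_sim <- 1 + 0.2 * age - 0.3 * smoker + rnorm(n, sd = 0.5)

sim <- data.frame(fev_sim, age, smoker)

# Omitting age: smoker acts partly as a proxy for age.
naive <- lm(fev_sim ~ smoker, data = sim)
# Including age, as the DAG calls for.
adjusted <- lm(fev_sim ~ age + smoker, data = sim)

confint(naive, parm = "smoker")
confint(adjusted, parm = "smoker")
```

In this simulation both intervals are narrow, so both are precise, but only the model that includes age brackets the direct effect that was built into the data.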