Risk of Congestive Heart Disease
More than 6 million people in the US suffer from congestive heart disease (CHD); it is the leading cause of hospitalization in people older than 65. In this activity, you are going to build a screening model for high risk of congestive heart disease. The data come from the Framingham Heart Study and were published as part of a Kaggle machine learning competition.
Work with the members of your group.
- One person should create and edit a new Rmd file named `CHD-risk.Rmd`. The other group members will work with the person editing the file, making suggestions and interpreting results.
- At the end of the activity, the `CHD-risk.Rmd` file should be shared with all members of the group.
To the editor:
- Open a new text file (not Rmd). Then immediately save it as `CHD-risk.Rmd`. We’re doing things this way to avoid having to deal with the “template” that RStudio uses when opening an Rmd file directly.
- Copy the following into your newly saved `CHD-risk.Rmd` file. There are just two R chunks involved, which you should run from within `CHD-risk.Rmd`.
# Class activity: Risk of congestive heart disease
```{r include=FALSE}
library(math300)
```
```{r}
model0 <- lm(TenYearCHD ~ diabetes + totChol,
             data=Framingham, family=binomial)
model_plot(model0, show_data=FALSE)
```
We are working with the `math300::Framingham` data frame. The goal is to predict the `TenYearCHD` variable from the other variables. We’ll start with `lm()`, but in anticipation of changing to something more suitable, the command in the second chunk has a third argument to `lm()`, namely `family=binomial`. Leave the third argument in place; at this point it doesn’t do anything (R may warn that the extra argument is disregarded).
The second chunk creates a graph. Discuss the format of the graph and what it says about `diabetes` and `totChol` as risk factors for CHD.
Try out other explanatory variables. (`model_plot()` won’t handle more than three.) Your goal is to maximize the range of model values, that is, to identify some people at low risk and others at high risk. For our purposes, you don’t need to worry about medical plausibility: just search through the variables.
Switch to logistic regression, which is more suitable for modeling probabilities. All you need to do in the above command is change `lm` to `glm`. (Leave in the `family=binomial` argument, which is needed for logistic regression.)
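
To see why the switch from `lm` to `glm` matters, here is a minimal sketch on simulated data. It does not use the Framingham data: the data frame `sim` and the variables `x` and `y` are made up for illustration. It fits the same formula both ways and compares the range of model values.

```r
# Simulated stand-in data (NOT the Framingham data): a 0/1 outcome whose
# probability rises with a single numeric predictor x.
set.seed(101)
n <- 2000
x <- rnorm(n)
p_true <- plogis(-2 + 1.5 * x)            # true probabilities, between 0 and 1
y <- rbinom(n, size = 1, prob = p_true)   # observed 0/1 outcome
sim <- data.frame(x = x, y = y)

# Linear model: fitted values are not constrained to lie in [0, 1].
# (lm() disregards a family= argument, so it is omitted here.)
linear_fit <- lm(y ~ x, data = sim)

# Logistic regression: same formula, but glm() with family = binomial.
logistic_fit <- glm(y ~ x, data = sim, family = binomial)

# Model values for each row; for a glm, fitted() is on the probability scale.
linear_values <- fitted(linear_fit)
logistic_values <- fitted(logistic_fit)

range(linear_values)     # smallest and largest linear model value
range(logistic_values)   # smallest and largest logistic model value
```

In this simulation the linear model produces some model values below 0, which cannot be probabilities, while the logistic model values stay strictly between 0 and 1.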
Deliverable: The model specification you find that does a nice job creating different model values for different values of the explanatory variables.
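
One way to organize the search is to compute the range of model values for each candidate specification and compare them side by side. The sketch below is only an illustration under stated assumptions: the helper `model_value_range()` is hypothetical (it is not part of `math300`), and the data frame `sim` with variables `x1`, `x2`, `x3`, and `y` is simulated, not taken from `Framingham`.

```r
# Simulated stand-in data (NOT the Framingham data). The names x1, x2, x3,
# and y are hypothetical.
set.seed(202)
n <- 3000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rbinom(n, 1, 0.3))
sim$y <- rbinom(n, 1, plogis(-2 + 1.0 * sim$x1 + 0.2 * sim$x2 + 0.8 * sim$x3))

# Hypothetical helper: fit a logistic regression and report the smallest and
# largest model value (predicted probability) over the rows used in the fit,
# along with the width of that range.
model_value_range <- function(formula, data) {
  fit <- glm(formula, data = data, family = binomial)
  values <- fitted(fit)
  c(low = min(values), high = max(values), width = max(values) - min(values))
}

# Candidate model specifications to compare.
specs <- list(
  x2_only    = y ~ x2,
  x1_only    = y ~ x1,
  x1_plus_x3 = y ~ x1 + x3
)

# One row per specification, with columns low, high, and width.
ranges <- t(sapply(specs, model_value_range, data = sim))
ranges
```

To adapt this to the activity, you would replace `sim` with `Framingham` and the formulas with specifications such as `TenYearCHD ~ diabetes + totChol`. A wider range means the model separates low-risk from high-risk people more sharply.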