# Repeated from the above to get code to
# align properly
Galton |>
  tilde_graph(height ~ sex + mother, alpha=0.2)
Galton |>
  mutate(modval = model_values(height ~ sex + mother)) |>
  tilde_graph(modval ~ sex + mother) |>
  gf_lims(y = c(55, 80))
9 Model patterns
In these Lessons, we will be building many models using many different sources of data. Often, we will build multiple models of the same response variable from the same data frame, in order to compare different ways of explaining the variation in the response variable.
All of the models will have certain features in common. This Lesson points out those commonalities so that you can “read” a new model with understanding.
The model specification
Two basic inputs go into constructing a model:
- A data frame.
- The model specification, which declares which column from the data frame will be the response variable and which other column(s) will be the explanatory variable(s).
For directing the computer, we write the model specification as a tilde expression: the name of the response variable goes to the left of the ~. The name of the explanatory variable goes on the right.
When there is more than one explanatory variable, their names all go on the right side of the ~, separated by the + symbol, which stands for the English word “and” rather than a sum in the arithmetic sense.
Sometimes we will use the * symbol instead of + for reasons which will be pointed out whenever we come to such a situation. We will also sometimes use mathematical functions such as log() or ns() in the model specification.
From time to time, we will refer to models with no explanatory variables. In such models, a simple 1 goes to the right of the ~. The reasons for doing this require some explanation, which will be provided in later Lessons.
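To make the parts of a tilde expression concrete, here is a minimal sketch in base R, where a tilde expression is stored as a "formula" object. The helper names response_name() and term_names() are hypothetical, introduced only for this illustration; they are not part of any package used in these Lessons.

```r
# Tilde expressions are ordinary R objects of class "formula".
spec_two  <- height ~ sex + mother   # two explanatory variables
spec_star <- height ~ sex * mother   # `*` instead of `+`
spec_none <- height ~ 1              # no explanatory variables
spec_log  <- wage ~ log(exper)       # a mathematical function in the specification

# Hypothetical helpers, for illustration only.
response_name <- function(spec) deparse(spec[[2]])              # left of the ~
term_names    <- function(spec) attr(terms(spec), "term.labels") # right of the ~

response_name(spec_two)  # "height"
term_names(spec_two)     # "sex" "mother"
term_names(spec_star)    # "sex" "mother" "sex:mother"
term_names(spec_none)    # character(0): nothing but the 1
term_names(spec_log)     # "log(exper)"
```

Note that a formula can be inspected like this without any data frame at hand; the column names are only looked up when the model is actually built.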
The models we will see in these Lessons are called “regression models.” In a regression model, the response variable must always be quantitative. But remember that a categorical variable with two levels (e.g., yes/no, alive/dead, succeed/fail) can be translated into a zero-one quantitative variable with no loss of information.
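The zero-one translation can be sketched in base R as follows. The values are invented for illustration and do not come from any data frame used in these Lessons; later code in this Lesson uses a function called zero_one() for the same purpose, and the base-R comparison here is only a stand-in for it.

```r
# Invented two-level categorical variable, for illustration only.
outcome <- c("Alive", "Dead", "Alive", "Alive")

# Translate to zero-one: 1 for "Alive", 0 for "Dead".
survived <- as.numeric(outcome == "Alive")
survived   # 1 0 1 1

# Translate back, to confirm that no information was lost.
recovered <- ifelse(survived == 1, "Alive", "Dead")
identical(recovered, outcome)   # TRUE
```

Which level becomes 1 is a choice; it affects the sign of the model's numbers but not the information they carry.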
“Shapes” of models
Although the response variable in a regression model is always quantitative, explanatory variables can be either quantitative or categorical. In professional work, regression models may sometimes involve tens or thousands (or more) of explanatory variables. Almost all the models used in these Lessons will have one or two explanatory variables (and, occasionally, zero explanatory variables). This suffices for introducing the concepts and methods of statistical thinking.
It is convenient to think of the various combinations of explanatory variables in terms of the “shape” of a graph of the model. For models with a single explanatory variable, there are two main shapes, one when the explanatory variable is categorical, another when the explanatory variable is quantitative.
To illustrate, we will use the CPS85 data frame, which records a small survey of workers’ wages (in 1985) and includes both numerical and categorical variables. The unit of observation is an individual worker. The categorical variable sector records the type of work done by the worker and has multiple levels such as clerical, manufacturing, sales, and service. We’ll compare the shapes of several different models, all of which use wage as the response variable.
One explanatory variable
First, we’ll consider models with a single explanatory variable. When that explanatory variable is categorical, the model shape consists of potentially different values for each level of the explanatory variable. Figure 9.1 shows three examples:
Code
# Draw the raw data and the model values as two jittered point layers.
# Note: the `model` argument is the data frame, which must already
# contain the column of model values.
graph_model <- function(model, data_tilde, model_tilde, width=0.3) {
  gf_point(data_tilde, data=model,
           alpha=0.3,
           position = position_jitter(height=0, width=width, seed=101)) |>
    gf_point(model_tilde, data=model, color="blue", alpha=0.3,
             position = position_jitter(height=0, width=width, seed=101))
}
# Keep only the workers with a wage below 35
CPS85 <- CPS85 |> filter(wage < 35)
CPS85 |>
  mutate(modval = model_values(wage ~ union)) |>
  graph_model(wage ~ union, modval ~ union)
CPS85 |>
  mutate(modval = model_values(wage ~ sector)) |>
  graph_model(wage ~ sector, modval ~ sector)
CPS85 |>
  mutate(modval = model_values(wage ~ married)) |>
  graph_model(wage ~ married, modval ~ married)
Figure 9.1 panels: wage ~ union, wage ~ sector, wage ~ married
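To see why a categorical explanatory variable gives one model value per level, consider this minimal sketch. It assumes the model is fit by ordinary least squares with base R's lm(), which may differ in detail from what model_values() does internally. The wages are invented for illustration and are not the CPS85 data.

```r
# Invented wages, for illustration only (not the CPS85 data).
Toy <- data.frame(
  wage   = c(8, 10, 12, 7, 9, 6, 7),
  sector = c("clerical", "clerical", "clerical",
             "sales", "sales", "service", "service")
)

# Assumption: a least-squares fit stands in for the model-building step.
fit <- lm(wage ~ sector, data = Toy)
Toy$modval <- unname(fitted(fit))

# Every worker in a sector gets the same model value:
# the mean wage of that sector.
Toy$group_mean <- ave(Toy$wage, Toy$sector)
Toy
```

In this sketch the modval and group_mean columns agree row by row, and there are only three distinct model values, one for each level of sector.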
When the explanatory variable is quantitative, the model values are arrayed on a smooth curve, as in Figure 9.2.
Code
CPS85 |>
  mutate(modval = model_values(wage ~ exper)) |>
  graph_model(wage ~ exper, modval ~ exper) |>
  gf_line(modval ~ exper, color='blue')
CPS85 |>
  mutate(modval = model_values(wage ~ educ)) |>
  graph_model(wage ~ educ, modval ~ educ, width=0) |>
  gf_line(modval ~ educ, color='blue')
CPS85 |>
  mutate(modval = model_values(wage ~ splines::ns(age, 3))) |>
  graph_model(wage ~ age, modval ~ age, width=0) |>
  gf_line(modval ~ age, color='blue')
Figure 9.2 panels: wage ~ exper, wage ~ educ, wage ~ ns(age, 3)
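The difference between a straight-line shape and a flexible curve made with ns() can be sketched as follows. This assumes a least-squares fit with base R's lm() as a stand-in for the model-building step, and uses simulated data rather than CPS85.

```r
library(splines)  # ships with R; provides ns()

# Simulated data, for illustration only (not the CPS85 data).
set.seed(101)
Toy <- data.frame(age = seq(20, 60, by = 2))
Toy$wage <- 5 + 0.4 * Toy$age - 0.004 * Toy$age^2 + rnorm(nrow(Toy))

line_fit  <- lm(wage ~ age, data = Toy)         # straight-line shape
curve_fit <- lm(wage ~ ns(age, 3), data = Toy)  # flexible curve

# Change in model value per year of age, between neighboring ages.
line_steps  <- unname(diff(fitted(line_fit))  / diff(Toy$age))
curve_steps <- unname(diff(fitted(curve_fit)) / diff(Toy$age))

range(line_steps)    # a single value: the slope of the line
range(curve_steps)   # varies: the curve is steeper in some places than others

length(coef(line_fit))   # 2: intercept and slope
length(coef(curve_fit))  # 4: intercept and three spline terms
```

The second argument to ns() sets how much flexibility the curve is allowed; the straight line is the least flexible case.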
Two explanatory variables
Explanatory variables can be either quantitative or categorical. With two explanatory variables, one will be mapped to the horizontal axis and one mapped to color. There are four possible combinations, each of which has a distinctive graphical format:
Example | Vertical axis | Horizontal axis | Color |
---|---|---|---|
Figure 9.3 | quantitative | categorical | categorical |
Figure 9.4 | quantitative | categorical | quantitative |
Figure 9.5 | quantitative | quantitative | categorical |
Figure 9.6 | quantitative | quantitative | quantitative |
Two categorical explanatory variables
Whickham |>
  tilde_graph(age ~ smoker + outcome, alpha=0.05, annot="model", model_alpha=0.7)
Figure 9.3: age ~ smoker + outcome
This example shows data from a survey of women in Whickham, in the UK. Each participant’s age and smoking status was recorded at an initial interview. The interview was followed up 20 years later, at which point some of the original interviewees were dead and others still living, recorded in the variable outcome. Unsurprisingly, the older interviewees were much more likely to have died during the 20-year follow-up. The model values show the difference in mean ages between the smokers and non-smokers separately for the survivors and non-survivors. With two categorical variables, each with two levels, there are four distinct model values.
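The count of four model values can be checked with a small sketch. It assumes a least-squares fit with base R's lm() as a stand-in for the model-building step, and the ages are invented for illustration; they are not the Whickham data.

```r
# Invented ages, for illustration only (not the Whickham data).
Toy <- data.frame(
  age     = c(30, 40, 28, 36, 70, 74, 60, 68),
  smoker  = rep(c("No", "No", "Yes", "Yes"), times = 2),
  outcome = rep(c("Alive", "Dead"), each = 4)
)

fit <- lm(age ~ smoker + outcome, data = Toy)
Toy$modval <- round(unname(fitted(fit)), 6)

# One model value for each combination of smoker and outcome.
combos <- unique(Toy[, c("smoker", "outcome", "modval")])
combos
nrow(combos)   # 4

# The coefficients give the differences between levels.
coef(fit)
```

With two levels for each variable there are 2 × 2 = 4 combinations, so the model can produce at most four distinct values no matter how many rows the data frame has.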
Categorical & quantitative
This example shows a (full-grown) child’s height as a function of the child’s sex and his or her mother’s height.
Figure 9.4: height ~ sex + mother
Reading such a graph takes patience. We’ve tried to help by separating the data layer from the model-value layer. In the data layer, you can see easily that some males are taller than almost all females, and that some females are shorter than almost all males. Less easy, but still discernable, is that the shorter children of either sex tend to have shorter mothers (yellow) and that taller children of each sex tend to have taller mothers (purple).
The model-value layer shows the extent of the relationship between mother’s and child’s height more clearly. (This is exactly what models are supposed to do!) You can see that the model values differ for children of the shortest mothers and of the tallest mothers. The difference is about 3 inches of child’s height.
The model values are faithful to the data, but leave out the residuals. The raw data include the residuals. The non-zero size of residuals means that children of the shortest mothers differ in height from the model values. Similarly for the children of the tallest mothers. The result is, in the raw data, that some children of the shorter mothers are in fact taller than some children of the taller mothers. The model values, by stripping away the residual child-to-child differences, make the trends easier to see.
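The relationship among data, model values, and residuals can be written out in a short sketch. It assumes a least-squares fit with base R's lm() and, to keep things small, uses only one explanatory variable. The heights are simulated for illustration and are not the Galton data.

```r
# Simulated heights in inches, for illustration only (not the Galton data).
set.seed(7)
n <- 100
Toy <- data.frame(mother = rnorm(n, mean = 64, sd = 2.3))
Toy$height <- 44 + 0.35 * Toy$mother + rnorm(n, sd = 2.5)

fit <- lm(height ~ mother, data = Toy)
modval   <- unname(fitted(fit))
residual <- unname(resid(fit))

# Row by row, the data are the model value plus the residual.
all.equal(Toy$height, modval + residual)

# The model values vary less than the raw data,
# which is why trends are easier to see in the model-value layer.
c(data = var(Toy$height), model = var(modval), residual = var(residual))
```

In a least-squares fit like this one, the variance of the data splits into the variance of the model values plus the variance of the residuals.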
Quantitative & categorical
This example shows exactly the same data and model as the previous example. The only difference is that here the quantitative variable is mapped to the horizontal axis and the categorical variable is mapped to color.
Point for point the model values in Figure 9.5 are exactly the same as in Figure 9.4. But the new arrangement spreads them out differently in space. In Figure 9.5 the model values are organized along two straight lines, one for each sex. The slope of the lines indicates the relationship between mother’s and child’s heights. The vertical offset between the lines is the difference in model values for the two sexes.
That Figure 9.5 is easier for you to read than Figure 9.4 suggests an important graphical rule: When a model has one quantitative and one categorical explanatory variable, map the quantitative variable to the horizontal axis.
Galton |>
  tilde_graph(height ~ mother + sex, alpha=0.1, annot="model", model_alpha=0.7)
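The two-lines shape, with a common slope and a vertical offset, can be reproduced in a minimal sketch. It assumes a least-squares fit with base R's lm() as a stand-in for the model-building step, and the heights are simulated for illustration rather than taken from Galton.

```r
# Simulated heights in inches, for illustration only (not the Galton data).
set.seed(42)
n <- 200
Toy <- data.frame(
  mother = round(rnorm(n, mean = 64, sd = 2.3), 1),
  sex    = sample(c("F", "M"), n, replace = TRUE)
)
Toy$height <- 41 + 0.35 * Toy$mother + 5 * (Toy$sex == "M") + rnorm(n, sd = 2)

fit <- lm(height ~ mother + sex, data = Toy)

# Model values on a small grid of mother's heights, for each sex.
grid <- expand.grid(mother = c(60, 64, 68), sex = c("F", "M"),
                    stringsAsFactors = FALSE)
grid$modval <- unname(predict(fit, newdata = grid))

# Slope of each line: change in model value per inch of mother's height.
slope_F <- diff(grid$modval[grid$sex == "F"]) / 4
slope_M <- diff(grid$modval[grid$sex == "M"]) / 4

# Vertical offset between the lines at each mother's height.
offset <- grid$modval[grid$sex == "M"] - grid$modval[grid$sex == "F"]

coef(fit)   # the "mother" entry is the slope; the "sexM" entry is the offset
```

Because the specification uses + rather than *, the two lines share one slope, so the offset between them is the same at every mother's height.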
Two quantitative explanatory variables
This example draws on the same data as the previous two examples, but the explanatory variables are the mother’s height and the father’s height. Both these explanatory variables are quantitative.
Galton |> tilde_graph(height ~ mother + father)
Galton |>
  mutate(modvals = model_values(height ~ mother + father)) |>
  tilde_graph(modvals ~ mother + father) |>
  gf_lims(y=c(55,80))
Figure 9.6: height ~ mother + father
As in Figure 9.4, mapping a quantitative variable to color makes the graph hard to read. To simplify, we’ve separated the data layer from the model layer.
It’s almost impossible to see the relationship between father’s and child’s height in the data layer. In contrast, the model-value layer, by stripping away the child-to-child residuals, makes things clearer. The father/child relationship is seen from color strata, with shorter fathers (yellow) on the bottom and taller fathers at the top. The mother/child relationship appears in the upward slope of the cloud of model-values, similar to the slope in Figure 9.5.
Exercises
DRAFT: Which variables could be used as response variables?
DRAFT: Which shape of model will correspond to these model specifications?
DRAFT: Plot model but against the wrong variable. Ask them to fix the statement.
Show graphs with slope roses and ask to estimate the slopes of lines
Experiments with tilde_graph()
Use this as an example. Make as two, side-by-side graphs, showing the effect of adding a second explanatory variable. Show the residual variance for each of the models.
Whickham |>
  tilde_graph(age ~ smoker, alpha=0.1, annot="model",
              model_alpha=0.7)
Whickham |>
  tilde_graph(age ~ smoker + outcome, alpha=0.01, annot="model", model_alpha=0.7)
Show three graphs from this one, survived ~ 1, survived ~ smoker, survived ~ smoker + age
Tmp <- Whickham |>
  mutate(survived = zero_one(outcome, one="Alive"))
Tmp |> tilde_graph(survived ~ age, alpha=0.1, annot="model", model_alpha=0.3)
Tmp |> tilde_graph(survived ~ age + smoker, alpha=0.1, annot="model", model_alpha=0.3)
Tmp <- Whickham |>
  mutate(survived = zero_one(outcome, one="Alive"))
Tmp |> tilde_graph(survived ~ smoker, alpha=0.1, size=0.2, annot="model", model_alpha=0.7)
Tmp |> tilde_graph(survived ~ smoker + age, alpha=0.1, size=0.2, annot="model", model_alpha=0.7)
TO DO
ADD TO math300: labels when the color variable is quantitative.
Shiny app for measuring slopes and differences.
Kill messages from tilde_graph()