9  Model patterns

In these Lessons, we will be building many models using many different sources of data. Often, we will build multiple models of the same response variable from the same data frame, in order to compare different ways of explaining the variation in the response variable.

All of the models will have certain features in common. This Lesson points out those commonalities so that you can “read” a new model with understanding.

The model specification

Two basic inputs go into constructing a model:

  1. A data frame.
  2. The model specification, which declares which column from the data frame will be the response variable and which other column(s) will be the explanatory variable(s).

To direct the computer, we write the model specification as a tilde expression: the name of the response variable goes to the left of the tilde and the name of the explanatory variable goes on the right.

When there is more than one explanatory variable, their names all go on the right side of the tilde, separated by the + symbol, which stands for the English word “and” rather than a sum in the arithmetic sense.

Occasionally we will use the * symbol instead of + for reasons which will be pointed out whenever we come to such a situation. We will also sometimes use mathematical functions such as log() or ns() in the model specification.

From time to time, we will refer to models with no explanatory variables. In such models, a simple 1 goes to the right of the tilde. The reasons for doing this require some explanation which will be provided in later Lessons.
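As a sketch, here are tilde expressions of the kinds just described. The variable names wage, educ, and sector come from the CPS85 data frame introduced below; the specific combinations are illustrative, not models we will necessarily fit.

```r
wage ~ educ           # one quantitative explanatory variable
wage ~ sector         # one categorical explanatory variable
wage ~ educ + sector  # two explanatory variables, "+" meaning "and"
wage ~ educ * sector  # the occasional "*" form
wage ~ ns(educ, 3)    # a mathematical function in the specification
wage ~ 1              # no explanatory variables
```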

The type of model we will see in these lessons are called “regression models.” In a regression model, the response variable must always be quantitative. But remember that a categorical variable with two levels (e.g., yes/no, alive/dead, succeed/fail) can be translated into a zero-one quantitative variable with no loss of information.
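To sketch that translation in base R, using the Whickham data frame that appears later in this Lesson (where the variable outcome has levels "Alive" and "Dead"):

```r
# a 0/1 version of the two-level variable `outcome`,
# with 1 standing for "Alive" and 0 for "Dead"
Whickham |>
  mutate(survived = as.numeric(outcome == "Alive"))
```

The zero_one() helper used later in this Lesson does essentially this conversion.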

Models with a categorical response variable with more than two levels are called, in the language of machine learning, “classifiers.” We will not touch on multi-level classifiers in these Lessons.

“Shapes” of models

Although the response variable in a regression model is always quantitative, explanatory variables can be either quantitative or categorical. In professional work, regression models may involve tens, hundreds, or even thousands of explanatory variables. Almost all the models used in these Lessons will have one or two explanatory variables (and, occasionally, zero explanatory variables). This suffices for introducing the concepts and methods of statistical thinking.

It is convenient to think of the various combinations of explanatory variables in terms of the “shape” of a graph of the model. For models with a single explanatory variable, there are two main shapes, one when the explanatory variable is categorical, another when the explanatory variable is quantitative.

To illustrate, we will use the CPS85 data frame, which records a small survey of workers’ wages (in 1985) and includes both numerical and categorical variables. The unit of observation is an individual worker. The categorical variable sector records the type of work done by the worker and has several levels, such as clerical, manufacturing, sales, and service. We’ll compare the shapes of several different models, all of which use wage as the response variable.

One explanatory variable

First, we’ll consider models with a single explanatory variable. When that explanatory variable is categorical, the model shape consists of a potentially different value for each level of the explanatory variable. Figure 9.1 shows three examples:

Code
graph_model <- function(data, data_tilde, model_tilde, width=0.3) {
  # Plot the raw data, jittered horizontally so overlapping points show ...
  gf_point(data_tilde, data=data, 
           alpha=0.3, 
           position = position_jitter(height=0, width=width, seed=101)) |> 
  # ... then overlay the model values in blue, with the same jitter seed
  # so each model value lines up with its data point.
  gf_point(model_tilde, data=data, color="blue", alpha=0.3, 
           position = position_jitter(height=0, width=width, seed=101)) 
}
CPS85 <- CPS85 |> filter(wage < 35)
CPS85 |> 
  mutate(modval = model_values(wage ~ union)) |>
  graph_model(wage ~ union, modval ~ union)
CPS85 |>
  mutate(modval = model_values(wage ~ sector)) |>
  graph_model(wage ~ sector, modval ~ sector)
CPS85 |> 
  mutate(modval = model_values(wage ~ married)) |>
  graph_model(wage ~ married, modval ~ married)

(a) wage ~ union

(b) wage ~ sector

(c) wage ~ married

Figure 9.1: Examples of regression models with a single categorical explanatory variable.

When the explanatory variable is quantitative, the model values are arrayed on a smooth curve, as in Figure 9.2.

Code
CPS85 |> 
  mutate(modval = model_values(wage ~ exper)) |>
  graph_model(wage ~ exper, modval ~ exper) |>
  gf_line(modval ~ exper, color='blue') 
CPS85 |> 
  mutate(modval = model_values(wage ~ educ)) |>
  graph_model(wage ~ educ, modval ~ educ, width=0) |>
  gf_line(modval ~ educ, color='blue') 
CPS85 |> 
  mutate(modval = model_values(wage ~ splines::ns(age, 3))) |>
  graph_model(wage ~ age, modval ~ age, width=0) |>
  gf_line(modval ~ age, color='blue') 

(a) wage ~ exper

(b) wage ~ educ

(c) wage ~ ns(age, 3)

Figure 9.2: Examples of regression models with a single quantitative explanatory variable.

Two explanatory variables

Explanatory variables can be either quantitative or categorical. With two explanatory variables, one will be mapped to the horizontal axis and the other to color. There are four possible combinations, each of which has a distinctive graphical format:

Example      Vertical axis   Horizontal axis   Color
Figure 9.3   quantitative    categorical       categorical
Figure 9.4   quantitative    categorical       quantitative
Figure 9.5   quantitative    quantitative      categorical
Figure 9.6   quantitative    quantitative      quantitative

Two categorical explanatory variables

Whickham |> 
  tilde_graph(age ~ smoker + outcome, alpha=0.05, annot="model", model_alpha=0.7) 

Figure 9.3: age ~ smoker + outcome

This example shows data from a survey of nurses in the UK. Each nurse’s age and smoking status were recorded at an initial interview. The interview was followed up 20 years later, at which point some of the original interviewees were dead and others still living, recorded in the variable outcome. Unsurprisingly, the older interviewees were much more likely to have died during the 20-year follow-up. The model values show the difference in mean ages between the smokers and non-smokers separately for the survivors and non-survivors. With two categorical variables, each with two levels, there are four distinct model values.
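If you want to see those four model values directly, one way is a sketch using base R’s lm(), which we assume behaves like model_values() for this purpose:

```r
mod <- lm(age ~ smoker + outcome, data = Whickham)
Whickham |>
  mutate(modval = fitted(mod)) |>
  distinct(smoker, outcome, modval)
# one row for each of the four smoker-by-outcome combinations
```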

Categorical & quantitative

This example shows (full-grown) child’s height as a function of the child’s sex and his or her mother’s height.

(a) Data layer

(b) Model-value layer
Figure 9.4: height ~ sex + mother
Galton |> 
  tilde_graph(height ~ sex + mother, alpha=0.2)
Galton |> 
  mutate(modval = model_values(height ~ sex + mother)) |>
  tilde_graph(modval ~ sex + mother) |>
  gf_lims(y = c(55, 80))

Reading such a graph takes patience. We’ve tried to help by separating the data layer from the model-value layer. In the data layer, you can easily see that some males are taller than almost all females, and that some females are shorter than almost all males. Less easy, but still discernible, is that the shorter children of either sex tend to have shorter mothers (yellow) and that taller children of each sex tend to have taller mothers (purple).

The model-value layer shows the extent of the relationship between mother’s and child’s height more clearly. (This is exactly what models are supposed to do!) You can see that the model values differ between children of the shortest mothers and children of the tallest mothers. The difference is about 3 inches of child’s height.

The model values are faithful to the data, but leave out the residuals. The raw data include the residuals. The non-zero size of residuals means that children of the shortest mothers differ in height from the model values. Similarly for the children of the tallest mothers. The result is, in the raw data, that some children of the shorter mothers are in fact taller than some children of the taller mothers. The model values, by stripping away the residual child-to-child differences, make the trends easier to see.
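To make “stripping away the residuals” concrete, here is a sketch, again using lm() as a stand-in for model_values():

```r
Galton |>
  mutate(modval = fitted(lm(height ~ sex + mother, data = Galton)),
         resid  = height - modval) |>  # residual = data minus model value
  head()
```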

Quantitative & categorical

This example shows exactly the same data and model as the previous example. The only difference is that here the quantitative variable is mapped to the horizontal axis and the categorical variable is mapped to color.

Point for point the model values in Figure 9.5 are exactly the same as in Figure 9.4. But the new arrangement spreads them out differently in space. In Figure 9.5 the model values are organized along two straight lines, one for each sex. The slope of the lines indicates the relationship between mother’s and child’s heights. The vertical offset between the lines is the difference in model values for the two sexes.

That Figure 9.5 is easier for you to read than Figure 9.4 suggests an important graphical rule: When a model has one quantitative and one categorical explanatory variable, map the quantitative variable to the horizontal axis.

Galton |> 
  tilde_graph(height ~ mother + sex, alpha=0.1, annot="model", model_alpha=0.7)

Figure 9.5: Mapping the quantitative explanatory variable to the horizontal axis.
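The two lines in Figure 9.5 can also be read off from model coefficients. As a sketch with lm() (assuming sex is coded F/M, as in the mosaicData version of Galton):

```r
mod <- lm(height ~ mother + sex, data = Galton)
coef(mod)
# the `mother` coefficient is the shared slope of both lines;
# the `sexM` coefficient is the vertical offset between them
```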

Two quantitative explanatory variables

This example draws on the same data as the previous two examples, but the explanatory variables are the mother’s height and the father’s height. Both these explanatory variables are quantitative.

Galton |> tilde_graph(height ~ mother + father) 
Galton |> 
  mutate(modvals = model_values(height ~ mother + father)) |>
  tilde_graph(modvals ~ mother + father) |> 
  gf_lims(y=c(55,80))

(a) Data layer

(b) Model-value layer
Figure 9.6: height ~ mother + father

As in Figure 9.4, mapping a quantitative variable to color makes the graph hard to read. To simplify, we’ve separated the data layer from the model-value layer.

It’s almost impossible to see the relationship between father’s and child’s height in the data layer. In contrast, the model-value layer, by stripping away the child-to-child residuals, makes things clearer. The father/child relationship shows up as color strata, with shorter fathers (yellow) at the bottom and taller fathers at the top. The mother/child relationship appears in the upward slope of the cloud of model values, similar to the slope in Figure 9.5.

Exercises

DRAFT: Which variables could be used as response variables?

DRAFT: Which shape of model will correspond to these model specifications?

DRAFT: Plot model but against the wrong variable. Ask them to fix the statement.

Show graphs with slope roses and ask to estimate the slopes of lines

Experiments with tilde_graph()

Use this as an example. Make it into two side-by-side graphs showing the effect of adding a second explanatory variable. Show the residual variance for each of the models.

Whickham |> 
  tilde_graph(age ~ smoker, alpha=0.1, annot="model",
              model_alpha=0.7)

Whickham |>
  tilde_graph(age ~ smoker + outcome, alpha=0.01, annot="model", model_alpha=0.7) 

Show three graphs from this one, survived ~ 1, survived ~ smoker, survived ~ smoker + age

Tmp <- Whickham |> 
  mutate(survived = zero_one(outcome, one="Alive")) 
Tmp |>  tilde_graph(survived ~ age, alpha=0.1, annot="model", model_alpha=0.3)

Tmp |>  tilde_graph(survived ~ age + smoker, alpha=0.1, annot="model", model_alpha=0.3) 

Tmp <- Whickham |> 
  mutate(survived = zero_one(outcome, one="Alive")) 
Tmp |>  tilde_graph(survived ~ smoker, alpha=0.1, size=0.2, annot="model", model_alpha=0.7)

Tmp |>  tilde_graph(survived ~ smoker + age, alpha=0.1, size=0.2, annot="model", model_alpha=0.7) 

TO DO

ADD TO math300: labels when the color variable is quantitative.

Shiny app for measuring slopes and differences.

Kill messages from tilde_graph()