10  Linear regression

When all explanatory variables are categorical variables, group-wise means are an effective way of showing the trend patterns present in data. But means are not meaningful when explanatory variables are quantitative. In Lesson 8 we generalized the idea of group-wise means to handle quantitative explanatory variables. The generalization involves the splitting response variable into two parts: model values and residuals.

In Lesson 9, we graphed model values as a way of making clearer the trends in the data. Sometimes this was successful (Figures 9.1, 9.2, 9.3, and 9.5), but when a quantitative variable was mapped to color (Figures 9.4 and 9.6), the graphs could be hard to read.

Since modeling is so important to the interpretation of data, in this Lesson we will overcome some of the limitations of graphical presentation of model values by looking directly at the slopes and differences and by graphing model functions as an alternative to row-by-row model values.

Regression models

“Regression” is a strange name for a statistical/mathematical technique. It comes from a misunderstanding in the early days of statistics which remains remarkably prevalent even today. We’ll be able to explain the misconception in Lesson 23 when we have the conceptual tools required.

Lesson 8 introduced the model_values() function. This looks a for pattern (Lesson 9) across the rows in a data frame. In conjunction with the mutate() wrangling function, model_values() assigns a value to each row of a data frame consistent with the pattern.

The row-by-row model values are not the only way to represent such patterns. For instance, the pattern can be presented as a mathematical function which takes values of the explanatory variables as inputs and produces a corresponding output in terms of the response variable. model_values() is like a bike with training wheels; it’s time now to move to a full-size adult bicycle. To this end, we will switch for the rest of these Lessons to the model_train() function.

“Train” is meant in the sense of “training a pet” or “vocational training.” model_train() has nothing to do with miniature-scale layouts found in model-train hobbyists’ basements.

model_train() requires two basic inputs:

  1. a data frame, which is called the “training data
  2. a model specification in the form of a tilde expression, just as in model_values().

Instructors and others familiar with the R language should note that model_train() makes use of the standard R modeling functions, lm() and glm(), for example. You could use those functions in place of model_train(), but you would have to deal with more complicated piping syntax.

The output of model_train() is a computational object that we will call a “regression model.” Regression model objects enable us to make improved graphics and to perform other statistical operations that we will encounter in later Lessons.

To illustrate the use of model_train() we will remake the models from Section 9.2.2 and graph the models (on top of the training data) using model_plot().

Two categorical explanatory variables as in Figure 9.3.

Mod1 <- Whickham |> model_train(age ~ smoker + outcome) 

Categorical & quantitative as in Figure 9.4.

Mod2 <- Galton |> model_train(height ~ sex + mother)

Quantitative & categorical as in Figure 9.5.

Mod3 <- Galton |> model_train(height ~ mother + sex)

Quantitative & quantitative as in Figure 9.6

Mod4 <- Galton |> model_train(height ~ mother + father)

Note in DRAFT

model_plot() uses a different color pallet than tilde_graph(). The tilde_graph() pallet is “colorblind friendly.” Probably both functions should use the same pallet. In deciding which one to use, a question is whether the so-called “colorblind friendly” pallet is in fact easier for colorblind persons to read. Or, maybe, both functions should have an argument to specify the pallet to be used, allowing people to choose what works best for themselves.


Operations on regression models

The point of constructing a regression model is to use it in some way address the problem at hand.

  • Is an explanatory variable related to the response variable?

  • Given such a relationship, how big is it? For instance, by how much might medicine that reduces blood pressure in hypertensive patients by 10 mmHg reduce the mortality rate?

  • Consider a new observation from the system described by data. When can we be confident that this observation indicates a change in the system?

There are statistical techniques that are appropriate for each of these questions. Each technique is built on a regression model which then is summarized in some way.

To illustrate, consider an experiment about simple measures to increase voter turnout. In the experiment, get out the vote post-cards with messages of possibly different persuasive force were sent at random to registered voters before the 2006 mid-term election. (Some voters— the control group—recieved no post-card.) The voters’ response—whether they voted in the election—was recorded from public records. The data, involving 305,866 voters, is in the Go_vote data frame. Three of the variables are of clear relevance: the type of get-out-the-vote message (in messages), whether the voter voted in the upcoming election (primary2006), and whether the voter had voted in the previous election (primary2004).

See Alan S. Gerber, Donald P. Green, and Christopher W. Larimer (2008) “Social pressure and voter turnout: Evidence from a large-scale field experiment.” American Political Science Review, vol. 102, no. 1, pp. 33–48

For reference, here are the means of the zero-one variable primary2006 for each of eight combinations of explanatory variable levels: four postcard messages times the two values of primary2004. Note that primary2006 is a zero-one variable, so the means will be the proportion of 1s. That is, the mean of primary2006 is the proportion of voters who voted in 2006.

Go_vote |> 
  group_by(messages, primary2004) |>
  summarize(vote_proportion = mean(primary2006))
messages primary2004 vote_proportion
Control 0 0.2371099
Control 1 0.3858050
Civic Duty 0 0.2554026
Civic Duty 1 0.4034456
Hawthorne 0 0.2604062
Hawthorne 1 0.4140863
Neighbors 0 0.3064061
Neighbors 1 0.4823302

Here’s a basic model using messages and primary2004 as explanatory variables:

Vote_mod1 <- Go_vote |> 
  model_train(primary2006 ~ messages + primary2004, type="linear")

We’ve already seen that the R2 summary indicates whether there is a connection between the explanatory and response variables. Here it is for our basic model, calculated by the R2() summary function is :

Vote_mod1 |> R2()
      n    k   Rsquared      F   adjR2    p   df.num   df.denom
-------  ---  ---------  -----  ------  ---  -------  ---------
 310000    4      0.029   2300   0.029    0        4     310000

There’s more information in this report than you know what to do with … yet! But even at this point you can see that R2 = 0.03, not very large.

Remember, $0 $ R^2 \(\leq 1\), with zero indicating that the model output accounts for none of the variance in the response variable.

A different summary—the model coefficients—provides a measure of the effect of the various post-card messages on the voting rate.

Vote_mod1 |> conf_interval() 
term                   .lwr   .coef    .upr
-------------------  ------  ------  ------
(Intercept)           0.230   0.240   0.240
messagesCivic Duty    0.013   0.018   0.023
messagesHawthorne     0.020   0.025   0.030
messagesNeighbors     0.075   0.080   0.085
primary2004           0.150   0.150   0.160

In the next section, you’ll start to learn how to interpret this model summary.

“Evaluation” is yet another operation on regression models. Evaluating a model calculates the model values (.output) for whatever values for the explanatory variable(s) we are interested in.

One use for model_eval() is to show the model output for each of the distinct values of the explanatory variable(s). For example, here are the voting rates (in 2006) for voters who received the “Civic Duty” postcard. [The interval = "none" instructs model_eval() not to report “confidence” or “prediction” intervals, which otherwise would be part of the summary. We’ll save these intervals for later Lessons when we know what to do with them.]

Vote_mod1 |> 
  model_eval(skeleton = TRUE, interval = "none")
messages      primary2004   .output
-----------  ------------  --------
Civic Duty            0.0      0.25
Hawthorne             0.0      0.26
Control               0.0      0.24
Neighbors             0.0      0.32
Civic Duty            0.5      0.33
Hawthorne             0.5      0.34
Control               0.5      0.31
Neighbors             0.5      0.39
Civic Duty            1.0      0.41
Hawthorne             1.0      0.41
Control               1.0      0.39
Neighbors             1.0      0.47

The skeleton = TRUE argument instructs model_eval() to pick a representative set of values for the explanatory variables and report the model output for each combination. Although 0 and 1 are the only possible values of primary2004 for an individual voter, we might want to know the voting rate for a group of voters whose turnout was 50% (0.5) in the 2006 election. model_eval() can handle any values for the explanatory variables.

Regression coefficients

As we saw in Lesson 8, “regression” is the process of finding simple patterns that give a concise summary of data. Earlier, we used model_values(), which produced a output value consistent with the pattern for each row of the training data. As part of the process of finding such simple patterns, regression modeling constructs a mathematical function that produces the pattern output when given values for the explanatory variables.

term                  .coef
-------------------  ------
(Intercept)           0.240
messagesCivic Duty    0.018
messagesHawthorne     0.025
messagesNeighbors     0.080
primary2004           0.150

In linear regression, the mathematical function for the model pattern has a particularly simple form. The output of the function can be calculated from the model coefficients.

  1. Start with the “intercept” coefficient, here 0.236. This gives the baseline value of the output.

  2. Each explanatory variable will contribute to the overall output value depending on the value for that variable. There are just two situations:

    1. A quantitative explanatory variable such as primary2004. The contribution of that variable to the output will be the model coefficient times the value of interest of the explanatory variable.

    2. A categorical explanatory variable contributes differently. First, look up the coefficient for the level of interest for that categorical variable. For instance, level "Hawthorne" has a coefficient of 0.0253. This is the contribution of that level of the categorical explanatory variable.

The overall model output for a 2004 voter who received a “Hawthorne” postcard will be:

\[ 0.236 + 0.153 \times 1 + 0.025 = 0.415\]

Similarly, the model output for a person who did not vote in 2004 but who received a “Civic Duty” postcard will be

\[0.236 + 0.153 \times 0 + 0.018 = 0.254\]

Note that there is no coefficient for one level—“Control”—of the messages explanatory variable. A voter in the “Control” group (who received no postcard) gets zero contribution from the messages explanatory variable. Every categorical variable, no matter what are the levels, will have such a level without a coefficient.


Turn this into a find-the-model-output from the regression coefficients exercise.

Consider the model gestation ~ parity. In the next lines of code we build this model, training it with the Gestation data. Then we evaluate the model on the trained data. This amounts to using the model coefficients to generate a model output for each row in the training data, and can be accomplished with the model_eval() R function.

Model <- Gestation |>
  model_train(gestation ~ parity)
Evaluated <- model_eval(Model)
 .response   parity   .output   .resid   .lwr   .upr
----------  -------  --------  -------  -----  -----
       271        0     281.1   -10.10    250    312
       324        1     280.2    43.80    249    311
       266        0     281.1   -15.10    250    312
       289        0     281.1     7.85    250    312
       303        1     280.2    22.80    249    311
The .response variable

The output from model_eval() repeats some columns from the data used for evaluation. For example, the explanatory variables are listed by name. (Here, the only explanatory variable is parity.) The response variable is also included, but given a generic name, .response to make it easy to distinguish it from the explanatory variables.

To see where the .output comes from, let’s look again at the model coefficients:

Model %>% conf_interval()
term              .lwr      .coef      .upr
------------  --------  ---------  --------
(Intercept)    279.900   281.1000   282.400
parity          -1.411    -0.9421    -0.473

The baseline value is 281.3 days. This applies to first-time mothers. For the other mothers, those with a previous pregnancy, the coefficient indicates that the model value is 2.6 days less than the baseline, or 279.7 days.

The output from model_eval() includes other columns of importance. For us, here, those are. the response variable itself (gestation, which has been given a generic name, .response) and the residuals from the model (.resid). There is a simple relationship between .response, .output and .resid:

\[\mathtt{.response} = \mathtt{.output} + \mathtt{.resid}\]

Another exercise: model coefficients don’t tell us the residuals. Emphasize that the residual refers an individual specimen, the difference between the response value and the model output (which does come from the coefficients.)

A model typically accounts for only some of the variation in a response variable. The remaining variation is called “residual variation.”