14  Graphics for regression models

In Chapter 2 of the ModernDive text introduces what they call the “five named graphs”: scatterplots, linegraphs, histograms, boxplots, barplots. Each of these graphical modalities—except maybe boxplots—are familiar to many quantitatively literate readers. That’s a good reason for using them in presentations to a general audience.

In the rest of this course, we will focus not on such general presentations, but on the contemporary statistical thinking skills needed to extract valid information from data. Our main analytic technique for this will be regression modeling. The graphics we will generate will often be used to display results from regression models. This means we need to use graphic modalities that are a good match for regression models.

The central feature of a regression model is that there is one response variable and one or more explanatory variables. The graphics frame we use will always have the response variable on the vertical axis. The horizontal axis will always be one of the explanatory variables; the choice depends on our interest.

To illustrate, consider the familiar settings of height, age, and sex. The data plotted in Figure 14.1 come from surveys conducted in 2009-2012. (That turns out to be important. See Lesson 30.) Figure 14.1 places the height variable on the vertical axis. The horizontal axis is age, and color specifies sex.

Code
ForMod <- NHANES::NHANES |>
  select(Height, Age, Gender) |>
  filter(Age >=15) |>
  rename(Sex = Gender)
Mod2 <-  lm(Height ~ splines::ns(Age,3)*Sex, data = ForMod) 
model_plot(Mod2, x = Age, color=Sex,
           interval="confidence", show_data = FALSE) |>
  gf_jitter(Height ~ Age, data=ForMod |> filter(Age >= 15), alpha=0.05, inherit=FALSE, color=~ Sex) |>
  gf_labs(title="Height vs age and sex")

Figure 14.1: Placing a statistical annotation layer on top of data.

The point plot is the base layer of Figure 14.1. A model of height as a function of age and sex has been layered on top, so that you can see how the model relates to the data. The layer showing the model is more of a ribbon than a line plot. You will see the significance of this in later Lessons.

At their best, models show patterns in data that are not readily seen by eye. Figure 14.2 gives an example. Only the variables age and sex are plotted, age is the explanatory variable and sex is the response variable.

Code
Small <- NHANES::NHANES %>% 
  select(Height, Age, Gender, Poverty, HealthGen, Depressed) %>%
  filter(Age >= 15) |>
  rename(Sex = Gender) |> 
  mutate(Sex01 = zero_one(Sex, one="female"))
Mod3 <- glm(Sex01 ~ splines::ns(Age,4), data = Small, family="binomial")
model_plot(Mod3, x=Age, show_data=FALSE, interval="confidence") |>
  gf_jitter(Sex01 ~ Age, color=~Sex, data=Small, inherit=FALSE, 
            height = 0.1, width=1, size=0, alpha=0.5) +
  scale_y_continuous(breaks=c(0, .5, 1), labels=c("male (0)", 0.5, "female (1)")) +
  ylab("Proportion of females") + 
  guides(color="none")

Figure 14.2: ?(caption)

The graphics style of Figure 14.2 is important. Often, we will work with categorical response variables that have two levels: yes-or-no, alive-or-dead, success-or-failure, promoted-or-not, and so on. Such “binomial” variables (bi=two, nomial=name) can always be encoded as zero/one quantitative variables. In Figure 14.2, the raw data is shown as a jittered point plot. Each dot in the lower band of the graph corresponds to a male; the points in the upper band are for females. In the middle, the fitted model of sex versus age is plotted.

Because sex has been re-coded as a zero-one numerical value, we can use regression modeling techniques, a powerful tool. The output from the model can be interpreted as a probability (or, what is the same thing: a proportion). The blue-colored model snaking across the ages gives the proportion of females at each age. For 20-year olds, that proportion is about 50%. Starting about age 60, however, the proportion of females starts increasing. At age 80 (and above), females make up about 60% of the data points.

Underlying the use of regression modeling of binomial response variable is the transformation of the variable into a zero-one format. In Figure 14.2, this was accomplished by adding a new column (mutate() does this), like this:

data_frame %>% mutate(sex01 = zero_one(sex, one="female"))

Violins for density

Histograms and box-and-whisker plots are often seen in presentations. They were invented to facilitate computation and pencil-and-ruler drawing, which, with the advent of computer graphics, are no longer important constraints on graphics. The violin plot is a modern alternative to both histograms (which are too visually busy and highlight random fluctuations) and box-and-whisker plots (which are unnecessarily chunky and suitable only for single-peaked distribution).

When the explanatory variable is categorical, violins make it straightforward to compare distributions and are easy to draw with the geom_violin() function.

Code
Small %>% select(HealthGen, Poverty) %>% na.omit() %>%
  ggplot(aes(x=HealthGen, y=Poverty)) +
  geom_jitter(size=.1, width=0.15, alpha=0.3) +
  geom_violin(fill="blue", alpha=0.5, color=NA) +
  xlab("Self-assessed health condition") +
  ylab("Income (multiple of poverty level)")

Figure 14.3: ?(caption)

You can think of the violin as a sideways histogram drawn with no bars and with mirror-image symmetry.

The Null model

Occasionally, we will have reason to build models without any explanatory variables. Or, more precisely, models with only one explanatory variable whose value is the same in all rows, signified by 1. In terms of a tilde expression, such models look like income ~ 1. The left-hand side of the tilde expression is, as always, the response variable.

Graphically, the lack of an explanatory variable means that the horizontal axis is used only to provide space for jittering, violins, and other graphical annotations that extend horizontally. Tick marks on the horizontal axis are meaningless. Figure 14.4 shows the same data as Figure 14.3, but without being broken down by health condition.

Code
Stats <- df_stats(Poverty ~ 1, Small, quantile(c(0.33, 0.66)))
Small %>% select(Poverty) %>% 
  ggplot(aes(x=1, y=Poverty)) +
  geom_jitter(size=.1, width=0.1, alpha=0.3) +
#  geom_boxplot(aes(x=1.5), width=0.1, color="green", fill=NA, alpha=1) +
  geom_violin(fill="blue", alpha=0.3, color=NA) +
  geom_errorbar(aes(ymin=`33%`, ymax=`66%`), y=NA, 
                color="magenta", linewidth=1, width=0.05, 
                data = Stats) +
  xlab("") +
  ylab("Income (multiple of poverty level)")

Figure 14.4: When there is no explanatory variable, the horizontal axis is merely a placeholder.

The violin plot shows a clear bimodal distribution, which we know from Figure 14.3 comes from the different distributions for people in different health conditions. Just for illustration, the \(\color{magenta}{\text{magenta}}\) line segment, drawn as an “error bar,” shows the middle third of the distribution.