25  Mechanics of prediction

An effect size describes the relationship between two variables in an input/output format. Lesson 24 introduced effect size in the context of causal connections, as if turning a knob to change the input will produce a change in the output. Such mechanistic connections make a nice mental image for those considering intervening in the world, but the image can mislead in two ways.

First, the mere calculation of an effect size does not establish a causal connection. The statistical thinker has more work to do to justify a causal claim, as we will see in Lesson 30.

Second, owing to noise, the input/output relationship quantified by an effect size may not be evident in a single intervention, say, increasing a drug dose for any given individual patient. Instead, effect sizes are descriptions of average effects—trends—across a large group of individuals.

This Lesson is about prediction: what a model can properly say about the outcome of an individual case. Often, the setting is that we know values for some aspects of the individual but have yet to learn some other aspect of interest.

The word “prediction” suggests the future but also applies to saying what we can about an unknown current or past state. Synonyms for “prediction” include “classification” (Lessons 34 and 35), “conjecture,” “guess,” and “bet.” The phrase “informed guess” is a good description of prediction: using available information to support decision-making about the unknown.

Example: Differential diagnosis

A patient comes to an urgent-care clinic with symptoms. The healthcare professional tries to diagnose what disease or illness the patient has. A diagnosis is a prediction. The inputs to the prediction are the symptoms—neck stiffness, a tremor, and so on—as well as facts about the person, such as age, sex, occupation, and family history. The prediction output is a set of probabilities, one for each medical condition that could cause the symptoms.

Doctors learn to perform a differential diagnosis, where the current set of probabilities informs the choices of additional tests and treatments. The probabilities are updated based on the information gained from the tests and treatments. This update may suggest new tests or treatments, the results of which may drive a new update. The television drama House provides an example of the evolving predictions of differential diagnosis in every episode.

Differential diagnosis is a cycle of prediction and action. This Lesson, however, is about the mechanics of prediction: taking what we know about an individual and producing an informed guess about what we do not yet know.

The prediction machine

A statistical prediction is the output of a kind of special-purpose machine. The inputs given to the machine are values for what we already know; the output is a value (or interval) for the as-yet-unknown aspects of the system.

There are always two phases involved in making a prediction. The first is building the prediction machine. The second phase is providing the machine with inputs for the individual case, turning the machine crank, and receiving the prediction as output.

These two phases require different sorts of data. Building the machine requires a “historical” data set that includes records from many instances where we already know two things: the values of the inputs and the observed output. The word “historical” emphasizes that the machine-building data must already have known values for each of the inputs and outputs of interest.

The evaluation phase—turning the crank of the machine—is simple. Take the input values for the individual to be predicted, put those inputs into the machine, and receive a predicted value as output. Those input values may come from pure speculation or the measured values from a specific case of interest.
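The two phases can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the data frame Hist is synthetic (simulated here, not the Galton data), the name machine is hypothetical, and base R's predict() stands in for the evaluation step.

```r
# Phase 1: build the machine from "historical" data.
# ASSUMPTION: Hist is simulated for illustration; it is not real data.
set.seed(101)
n <- 200
Hist <- data.frame(mother = rnorm(n, mean = 64, sd = 2.3),
                   father = rnorm(n, mean = 69, sd = 2.5))
# In historical data the output is already known for every record.
Hist$height <- 22 + 0.28 * Hist$mother + 0.38 * Hist$father +
  rnorm(n, sd = 3.4)
machine <- lm(height ~ mother + father, data = Hist)

# Phase 2: supply the inputs for one new case and turn the crank.
new_case <- data.frame(mother = 63, father = 68)
pred <- predict(machine, newdata = new_case, interval = "prediction")
pred  # one row with columns fit, lwr, upr
```

Note that new_case has no height column: the evaluation phase needs only the inputs, whereas the building phase needs inputs and outputs together.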

Building and using the machine

To illustrate building a prediction machine, we turn to a problem first considered quantitatively in the 1880s: the relationship between parents’ heights and their children’s heights at adulthood. The Galton data frame records the heights of about 900 children, along with their parents’ heights. Suppose we want to predict a child’s adult height (variable name: height) from his or her parents’ heights (mother and father). An appropriate model specification is height ~ mother + father. We use the model-training function lm() to transform the model specification and the data into a model.

Mod1 <- lm(height ~ mother + father, data = Galton)

As the output of an R function, Mod1 is a computer object. It incorporates a variety of information organized in a somewhat complex way. There are several often-used ways to extract this information in ways that serve specific purposes.

One of the most common ways to see what is in a computer object like Mod1 is by printing:

print(Mod1)

Call:
lm(formula = height ~ mother + father, data = Galton)

Coefficients:
(Intercept)       mother       father  
    22.3097       0.2832       0.3799  

Newcomers to technical computing tend to confuse the printed form of an object with the object itself. For example, the Mod1 object contains many components, but the printed form displays only two: the model coefficients and the command used to construct the object.
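One way to see the difference between an object and its printed form is to list the object's components with base R's names(). The sketch below uses a small simulated data frame (Toy and toy_mod are hypothetical names, not part of the Galton example) so that it runs on its own.

```r
# ASSUMPTION: Toy is simulated data used only to have a model to inspect.
set.seed(1)
Toy <- data.frame(x = 1:20)
Toy$y <- 3 + 2 * Toy$x + rnorm(20)
toy_mod <- lm(y ~ x, data = Toy)

print(toy_mod)                      # shows only the call and the coefficients
component_names <- names(toy_mod)   # lists everything stored in the object
component_names
n_resid <- length(residuals(toy_mod))  # one residual per row, never printed
n_resid
```

The residuals, fitted values, and other components are all present in the object even though print() displays none of them.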

We have already used some other functions to extract information from a model object. For instance,

Mod1 %>% conf_interval()
term                .lwr       .coef        .upr
(Intercept)   13.8569119  22.3097055  30.7624990
mother         0.1867750   0.2832145   0.3796540
father         0.2898301   0.3798970   0.4699639

Mod1 %>% R2()
  n  k   Rsquared        F      adjR2  p  df.num  df.denom
898  2  0.1088952  54.6856  0.1069039  0       2       895

Mod1 %>% regression_summary()
term            estimate  std.error  statistic  p.value
(Intercept)   22.3097055  4.3068968   5.179995    3e-07
mother         0.2832145  0.0491382   5.763635    0e+00
father         0.3798970  0.0458912   8.278209    0e+00

We have already used another extractor, model_eval(), for calculating effect sizes. But model_eval() is also well suited to prediction: provide the input values for which we want a prediction, and it returns the corresponding predicted response value. To illustrate, here is how to calculate the predicted height of the child of a 63-inch-tall mother and a 68-inch-tall father.

Mod1 %>% model_eval(mother = 63, father=68)
mother father .output .lwr .upr
63 68 65.98521 59.33448 72.63594

The data frame includes the input values along with a point value for the prediction (.output) and a prediction interval (.lwr to .upr).
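The point value in .output can be checked by hand, since Mod1 is a linear model: the prediction is the intercept plus each coefficient times its input. The sketch below copies the coefficients from the printed output for Mod1 above; the names coefs and pred_by_hand are hypothetical, and the interval is not reproduced here because it also depends on the residual variation.

```r
# Coefficients copied from the printed summaries of Mod1.
coefs <- c(intercept = 22.3097055, mother = 0.2832145, father = 0.3798970)

# Inputs for the individual case: mother 63 inches, father 68 inches.
pred_by_hand <- coefs[["intercept"]] +
  coefs[["mother"]] * 63 +
  coefs[["father"]] * 68
pred_by_hand  # about 65.985 inches, agreeing with .output
```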

Naturally, the predictions depend on the explanatory variables used in the model. For example, here is a model that uses only sex to predict the child’s height:

Mod2 <- lm(height ~ sex, data = Galton)
Mod2 %>% model_eval(sex=c("F", "M")) 
sex .output .lwr .upr
F 64.1 59.2 69.0
M 69.2 64.3 74.2

This model includes three explanatory variables:

Mod3 <- lm(height ~ mother + father + sex, data = Galton)
Mod3 %>% model_eval(mother=63, father=68, sex=c("F", "M"))
mother father sex .output .lwr .upr
63 68 F 63.2 59.0 67.4
63 68 M 68.4 64.2 72.7

In Lesson 26, we will look at the components that make up the prediction interval and some ways to use it.

Prediction or confidence interval

We have encountered two different interval summaries: the confidence interval and the prediction interval. It’s important to keep straight the different purposes of the different intervals.

A confidence interval is used to summarize the precision of an estimate of a model coefficient or effect size.

A prediction interval is used to express the uncertainty in the outcome for any given model inputs.
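The contrast can be made concrete with base R, where confint() gives a confidence interval on a coefficient and predict() with interval = "prediction" gives a prediction interval for an individual. This is a sketch on simulated data (Sim and sim_mod are hypothetical names, not the Galton models), so the specific numbers carry no meaning beyond the comparison.

```r
# ASSUMPTION: Sim is simulated data for illustration only.
set.seed(202)
n <- 300
Sim <- data.frame(mother = rnorm(n, mean = 64, sd = 2.3))
Sim$height <- 45 + 0.33 * Sim$mother + rnorm(n, sd = 3.5)
sim_mod <- lm(height ~ mother, data = Sim)

# Confidence interval: precision of the estimated coefficient
# (units: inches of child height per inch of mother's height).
coef_ci <- confint(sim_mod, "mother", level = 0.95)

# Prediction interval: plausible heights (inches) for one individual.
pred_int <- predict(sim_mod, newdata = data.frame(mother = 65),
                    interval = "prediction", level = 0.95)

ci_width <- unname(coef_ci[1, 2] - coef_ci[1, 1])
pi_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
c(ci_width = ci_width, pi_width = pi_width)
```

The two intervals are not even in the same units: one describes a slope, the other an individual's height. Collecting more data narrows the confidence interval, but the prediction interval stays wide because individuals continue to vary around the model value.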

By default, model_eval() gives the prediction interval. The following chunk produces a prediction (and prediction interval) for several values of mother’s height: 57 inches up to 67 inches.

Mod3 %>% 
  model_eval(mother=c(57,62, 67), 
             father=68, sex=c("F", "M"))
mother father sex .output .lwr .upr
57 68 F 61.3 57.0 65.5
62 68 F 62.9 58.6 67.1
67 68 F 64.5 60.3 68.7
57 68 M 66.5 62.2 70.8
62 68 M 68.1 63.9 72.3
67 68 M 69.7 65.5 74.0

The prediction intervals are broad, roughly 8.5 inches from lower to upper end. This is consistent with the real-life observation that kids and their parents can be quite different in height.
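The width of each interval can be read off the printed table by subtracting .lwr from .upr. The short sketch below does that arithmetic with the values copied from the Mod3 output above; the vector names lwr, upr, and widths are hypothetical.

```r
# .lwr and .upr copied from the six rows of the Mod3 output
# (three mother heights for sex F, then the same three for sex M).
lwr <- c(57.0, 58.6, 60.3, 62.2, 63.9, 65.5)
upr <- c(65.5, 67.1, 68.7, 70.8, 72.3, 74.0)

widths <- upr - lwr
widths        # each width is between 8.4 and 8.6 inches
mean(widths)  # close to 8.5 inches
```

The widths barely change from row to row: for these inputs, the breadth of the interval is driven by the variation among individuals rather than by the particular input values.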

Figure 25.1: Prediction intervals for Mod3 for several different values of mother’s height and a father 68 inches tall.

The prediction interval answers a question like this: If I know that a woman’s mother was 65 inches tall (and her father 68 inches and her sex, self-evidently, F), then how tall is the woman likely to be? To judge from Figure 25.1, we can fairly say that she is very likely (95%) to be between 60 and 68 inches tall.