`Mod1 <- lm(height ~ mother + father, data = Galton)`

# 25 Mechanics of prediction

An effect size describes the relationship between two variables in an input/output format. Lesson 24 introduced effect size in the context of causal connections, as if turning a knob to change the input will produce a change in the output. Such mechanistic connections make a nice mental image for those considering intervening in the world, but they can be misleading.

First, the mere calculation of an effect size does not establish a causal connection. The statistical thinker has more work to do to justify a causal claim, as we will see in Lesson 30.

Second, owing to noise, the input/output relationship quantified by an effect size may not be evident in a single intervention, say, increasing a drug dose for any given individual patient. Instead, effect sizes are descriptions of *average* effects—trends—across a large group of individuals.

This Lesson is about *prediction*: what a model can properly say about the outcome of an individual case. Often, the setting is that we know values for some aspects of the individual but have yet to learn some other aspect of interest.

The word “prediction” suggests the future but also applies to saying what we can about an unknown current or past state. Synonyms for “prediction” include “classification” (Lessons 34 and 35), “conjecture,” “guess,” and “bet.” The phrase “informed guess” is a good description of prediction: using available information to support decision-making about the unknown.

Differential diagnosis is a cycle of prediction and action. This Lesson, however, is about the mechanics of prediction: taking what we know about an individual and producing an informed guess about what we do not yet know.

## The prediction machine

A statistical prediction is the output of a kind of special-purpose machine. The inputs given to the machine are values for what we already know; the output is a value (or interval) for the as-yet-unknown aspects of the system.

There are always two phases involved in making a prediction. The first is building the prediction machine. The second phase is providing the machine with inputs for the individual case, turning the machine crank, and receiving the prediction as output.

These two phases require different sorts of data. Building the machine requires a “historical” data set that includes records from many instances where we already know two things: the values of the inputs and the observed output. The word “historical” emphasizes that the machine-building data must already have known values for each of the inputs and outputs of interest.

The evaluation phase—turning the crank of the machine—is simple. Take the input values for the individual to be predicted, put those inputs into the machine, and receive a predicted value as output. Those input values may come from pure speculation or the measured values from a specific case of interest.
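The two phases can be sketched in base R. This is a minimal illustration under stated assumptions, not the workflow used later in this Lesson: the `historical` data frame is simulated (the numbers are assumptions chosen to resemble adult heights in inches), and the names `machine` and `new_case` are hypothetical. It relies only on the base R functions `lm()` and `predict()`.

```r
# Phase 1: build the machine from "historical" data in which both the
# inputs and the output are already known.
# ASSUMPTION: these data are simulated for illustration only.
set.seed(101)
n <- 200
historical <- data.frame(
  mother = rnorm(n, mean = 64, sd = 2.3),
  father = rnorm(n, mean = 69, sd = 2.5)
)
historical$height <- 22 + 0.28 * historical$mother +
  0.38 * historical$father + rnorm(n, sd = 3.4)

machine <- lm(height ~ mother + father, data = historical)

# Phase 2: turn the crank. Supply inputs for one individual whose
# output is not yet known and read off the prediction.
new_case <- data.frame(mother = 63, father = 68)
prediction <- predict(machine, newdata = new_case,
                      interval = "prediction", level = 0.95)
print(prediction)  # columns: fit (point value), lwr, upr
```

Note that the machine is built once but can be cranked as many times as needed: each new individual only requires a new row of inputs in `new_case`.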

## Building and using the machine

To illustrate building a prediction machine, we turn to a problem first considered quantitatively in the 1880s: the relationship between parents’ heights and their children’s heights at adulthood. The `Galton` data frame records the heights of about 900 children, along with their parents’ heights. Suppose we want to predict a child’s adult height (variable name: `height`) from his or her parents’ heights (`mother` and `father`). An appropriate model specification is `height ~ mother + father`. We use the model-training function `lm()` to transform the model specification and the data into a model, which we store under the name `Mod1`.

As the output of an R function, `Mod1` is a computer object. It incorporates a variety of information organized in a somewhat complex way. There are several often-used ways to extract this information, each serving a specific purpose.

One of the most common ways to see what is in a computer object like `Mod1` is by printing:

`print(Mod1)`

```
Call:
lm(formula = height ~ mother + father, data = Galton)

Coefficients:
(Intercept)       mother       father  
    22.3097       0.2832       0.3799  
```

Newcomers to technical computing tend to confuse the printed form of an object with the object itself. For example, the `Mod1` object contains many components, but the printed form displays only two: the model coefficients and the command used to construct the object.
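The gap between an object and its printed form can be seen directly in base R. The sketch below is an illustration under an assumption: so that it runs without any extra packages, it fits a small model to the built-in `cars` data frame rather than to `Galton`, and the name `fit` is hypothetical. The same functions apply to any model made by `lm()`.

```r
# ASSUMPTION: `cars` (built into R) stands in for Galton here.
fit <- lm(dist ~ speed, data = cars)

# Printing shows only the call and the coefficients ...
print(fit)

# ... but the object stores many more named components.
components <- names(fit)
print(components)

# Components hidden from the printed form are still available,
# for example one residual per row of the training data.
n_residuals <- length(residuals(fit))
print(n_residuals)
```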

We have already used some other functions to extract information from a model object. For instance,

`Mod1 %>% conf_interval()`

term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | 13.8569119 | 22.3097055 | 30.7624990 |
mother | 0.1867750 | 0.2832145 | 0.3796540 |
father | 0.2898301 | 0.3798970 | 0.4699639 |

`Mod1 %>% R2()`

n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
898 | 2 | 0.1088952 | 54.6856 | 0.1069039 | 0 | 2 | 895 |

`Mod1 %>% regression_summary()`

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 22.3097055 | 4.3068968 | 5.179995 | 3e-07 |
mother | 0.2832145 | 0.0491382 | 5.763635 | 0e+00 |
father | 0.3798970 | 0.0458912 | 8.278209 | 0e+00 |

We have already used another extractor, `model_eval()`, for calculating effect sizes. But `model_eval()` is also well suited to the task of prediction: provide the input values for which we want to predict the corresponding response value. To illustrate, here is how to calculate the predicted height of the child of a 63-inch-tall mother and a 68-inch-tall father.

`Mod1 %>% model_eval(mother = 63, father = 68)`

mother | father | .output | .lwr | .upr |
---|---|---|---|---|
63 | 68 | 65.98521 | 59.33448 | 72.63594 |

The data frame includes the input values along with a point value for the prediction (`.output`) and a **prediction interval** (`.lwr` to `.upr`).
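The point value in `.output` can be checked by hand. For a model specified as `height ~ mother + father`, the prediction is the intercept plus each coefficient multiplied by its input. The sketch below uses the coefficients reported for `Mod1` above; the variable names are hypothetical, and it reproduces only the point value, not the prediction interval.

```r
# Coefficients copied from the conf_interval() report for Mod1
coefs <- c(intercept = 22.3097055, mother = 0.2832145, father = 0.3798970)

# Inputs for the individual case
mother_height <- 63
father_height <- 68

# Point prediction: intercept + coefficient * input, summed over inputs
point_prediction <- coefs[["intercept"]] +
  coefs[["mother"]] * mother_height +
  coefs[["father"]] * father_height

print(round(point_prediction, 2))  # about 65.99 inches
```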

Naturally, the predictions depend on the explanatory variables used in the model. For example, here is a model that uses only `sex` to predict the child’s height:

```
Mod2 <- lm(height ~ sex, data = Galton)
Mod2 %>% model_eval(sex = c("F", "M"))
```

sex | .output | .lwr | .upr |
---|---|---|---|
F | 64.1 | 59.2 | 69.0 |
M | 69.2 | 64.3 | 74.2 |

This model includes three explanatory variables:

```
Mod3 <- lm(height ~ mother + father + sex, data = Galton)
Mod3 %>% model_eval(mother = 63, father = 68, sex = c("F", "M"))
```

mother | father | sex | .output | .lwr | .upr |
---|---|---|---|---|---|
63 | 68 | F | 63.2 | 59.0 | 67.4 |
63 | 68 | M | 68.4 | 64.2 | 72.7 |

In Lesson 26, we will look at the components that make up the prediction interval and some ways to use it.

## Prediction or confidence interval

We have encountered two different interval summaries: the *confidence interval* and the *prediction interval.* It’s important to keep straight the different purposes of the different intervals.

A *confidence* interval is used to summarize the precision of an estimate of a model coefficient or effect size.

A *prediction* interval is used to express the uncertainty in the outcome for any given model inputs.
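The difference between the two intervals can be seen with base R's `confint()` and `predict()`. This is a sketch under stated assumptions, not the `model_eval()` workflow of this Lesson: the data are simulated, and the names `sim`, `sim_model`, and `new_case` are hypothetical.

```r
# ASSUMPTION: simulated data, for illustration only.
set.seed(202)
n <- 300
sim <- data.frame(mother = rnorm(n, mean = 64, sd = 2.3))
sim$height <- 46 + 0.3 * sim$mother + rnorm(n, sd = 3.5)
sim_model <- lm(height ~ mother, data = sim)

# Confidence interval: precision of an estimated coefficient.
coef_ci <- confint(sim_model, level = 0.95)
print(coef_ci)

# Prediction interval: uncertainty in one individual's outcome.
new_case <- data.frame(mother = 63)
pred_int <- predict(sim_model, newdata = new_case, interval = "prediction")
print(pred_int)

# For comparison, a confidence interval on the model's average output
# at the same input. It is narrower than the prediction interval.
conf_int <- predict(sim_model, newdata = new_case, interval = "confidence")
pred_width <- unname(pred_int[1, "upr"] - pred_int[1, "lwr"])
conf_width <- unname(conf_int[1, "upr"] - conf_int[1, "lwr"])
print(c(prediction = pred_width, confidence = conf_width))
```

The confidence interval shrinks as more training data are collected, because the estimate becomes more precise. The prediction interval does not shrink much, because individual-to-individual variation remains no matter how precisely the coefficients are known.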

By default, `model_eval()` gives the prediction interval. The following chunk produces a prediction (and prediction interval) for several values of mother’s height: 57, 62, and 67 inches.

```
Mod3 %>%
  model_eval(mother = c(57, 62, 67),
             father = 68, sex = c("F", "M"))
```

mother | father | sex | .output | .lwr | .upr |
---|---|---|---|---|---|
57 | 68 | F | 61.3 | 57.0 | 65.5 |
62 | 68 | F | 62.9 | 58.6 | 67.1 |
67 | 68 | F | 64.5 | 60.3 | 68.7 |
57 | 68 | M | 66.5 | 62.2 | 70.8 |
62 | 68 | M | 68.1 | 63.9 | 72.3 |
67 | 68 | M | 69.7 | 65.5 | 74.0 |

The prediction intervals are broad: each is roughly 8.5 inches wide. This is consistent with the real-life observation that kids and their parents can be quite different in height.
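The width of the intervals can be checked from the table. The sketch below copies the `.lwr` and `.upr` columns by hand into hypothetical vectors `lwr` and `upr` and subtracts one from the other.

```r
# Interval endpoints copied from the table above (inches)
lwr <- c(57.0, 58.6, 60.3, 62.2, 63.9, 65.5)
upr <- c(65.5, 67.1, 68.7, 70.8, 72.3, 74.0)

# Width of each prediction interval
widths <- upr - lwr
print(widths)

# Average width across the six rows
print(round(mean(widths), 1))
```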

The prediction interval answers a question like this: If I know that a woman’s mother was 65 inches tall (and her father 68 inches and her sex, self-evidently, F), then how tall is the woman likely to be? To judge from Figure 25.1, we can fairly say that she is very likely (95%) to be between 60 and 68 inches tall.