21  Predictions



Note in draft

More emphasis on the “proper form of prediction” as from the class notes.

An effect size describes the relationship between two variables in an input/output format. Lesson 21 introduced effect size in the context of causal connections as if turning a knob to change the input will produce a change in the output. Such mechanistic connections make for a nice mental image for those considering intervening in the world but can be misleading.

First, the mere calculation of an effect size does not establish a causal connection. The statistical thinker has more work to do to justify a causal claim, as we will see in Lesson 24.

Second, owing to noise, the input/output relationship quantified by an effect size may not be evident in a single intervention, say, increasing a drug dose for any given individual patient. Instead, effect sizes are descriptions of average effects—trends—across a large group of individuals.

This Lesson is about prediction: what a model can properly say about the outcome of an individual case. Often, the setting is that we know values for some aspects of the individual but have yet to learn some other aspect of interest.

The word “prediction” suggests the future but also applies to saying what we can about an unknown current or past state. Synonyms for “prediction” include “classification” (Lesson 29), “conjecture,” “guess,” and “bet.” The phrase “informed guess” is a good description of prediction: using available information to support decision-making about the unknown.

Example: Differential diagnosis

A patient comes to an urgent-care clinic with symptoms. The healthcare professional tries to diagnose what disease or illness the patient has. A diagnosis is a prediction. The inputs to the prediction are the symptoms—neck stiffness, a tremor, and so on—as well as facts about the person, such as age, sex, occupation, and family history. The prediction output is a set of probabilities, one for each medical condition that could cause the symptoms.

Doctors learn to perform a differential diagnosis, where the current set of probabilities informs the choices of additional tests and treatments. The probabilities are updated based on the information gained from the tests and treatments. This update may suggest new tests or treatments, the results of which may drive a new update. The television drama House provides an example of the evolving predictions of differential diagnosis in every episode.

Differential diagnosis is a cycle of prediction and action. This Lesson, however, is about the mechanics of prediction: taking what we know about an individual and producing an informed guess about what we do not yet know.

The prediction machine

A statistical prediction is the output of a kind of special-purpose machine. The inputs given to the machine are values for what we already know; the output is a value (or interval) for the as-yet-unknown aspects of the system.

There are always two phases involved in making a prediction. The first is building the prediction machine. The second phase is providing the machine with inputs for the individual case, turning the machine crank, and receiving the prediction as output.

These two phases require different sorts of data. Building the machine requires a “historical” data set that includes records from many instances where we already know two things: the values of the inputs and the observed output. The word “historical” emphasizes that the machine-building data must already have known values for each of the inputs and outputs of interest.

The evaluation phase—turning the crank of the machine—is simple. Take the input values for the individual to be predicted, put those inputs into the machine, and receive a predicted value as output. Those input values may come from pure speculation or the measured values from a specific case of interest.

Building and using the machine

To illustrate building a prediction machine, we turn to a problem first considered quantitatively in the 1880s: the relationship between parents’ heights and their children’s heights at adulthood. The Galton data frame records the heights of about 900 children, along with their parents’ heights. Suppose we want to predict a child’s adult height (variable name: height) from his or her parents’ heights (mother and father). An appropriate model specification is height ~ mother + father. We use the model-training functionlm() to transform the model specification and the data into a model.

Mod1 <- lm(height ~ mother + father, data = Galton)

As the output of an R function, Mod1 is a computer object. It incorporates a variety of information organized in a somewhat complex way. There are several often-used ways to extract this information in ways that serve specific purposes.

One of the most common ways to see what is in a computer object like Mod1 is by printing:


lm(formula = height ~ mother + father, data = Galton)

(Intercept)       mother       father  
    22.3097       0.2832       0.3799  

Newcomers to technical computing tend to confuse the printed form of an object with the object itself. For example, the Mod1 object contains many components, but the printed form displays only two: the model coefficients and the command used to construct the object.

We have already used some other functions to extract information from a model object. For instance,

Mod1 %>% conf_interval()
term                 .lwr        .coef         .upr
------------  -----------  -----------  -----------
(Intercept)    13.8569119   22.3097055   30.7624990
mother          0.1867750    0.2832145    0.3796540
father          0.2898301    0.3798970    0.4699639
Mod1 %>% R2()
   n    k    Rsquared         F       adjR2    p   df.num   df.denom
----  ---  ----------  --------  ----------  ---  -------  ---------
 898    2   0.1088952   54.6856   0.1069039    0        2        895
Mod1 %>% regression_summary()
term             estimate   std.error   statistic   p.value
------------  -----------  ----------  ----------  --------
(Intercept)    22.3097055   4.3068968    5.179995     3e-07
mother          0.2832145   0.0491382    5.763635     0e+00
father          0.3798970   0.0458912    8.278209     0e+00

We have already used another extractor, model_eval() for calculating effect sizes. But model_eval() is also well suited to the task of prediction. This is accomplished by providing the input values for which we want to make a prediction of the corresponding response value. To illustrate, here is how to calculate the predicted height of the child of a 63-inch-tall mother and a 68-inch father.

Mod1 %>% model_eval(mother = 63, father=68)
 mother   father    .output       .lwr       .upr
-------  -------  ---------  ---------  ---------
     63       68   65.98521   59.33448   72.63594

The data frame includes the input values along with a point value for the prediction (.output) and a prediction interval (.lwr to .upr).

Naturally, the predictions depend on the explanatory variables used in the model. For example, here is a model that uses only sex to predict the child’s height:

Mod2 <- lm(height ~ sex, data = Galton)
Mod2 %>% model_eval(sex=c("F", "M")) 
sex    .output   .lwr   .upr
----  --------  -----  -----
F         64.1   59.2   69.0
M         69.2   64.3   74.2

This model includes three explanatory variables:

Mod3 <- lm(height ~ mother + father + sex, data = Galton)
Mod3 %>% model_eval(mother=63, father=68, sex=c("F", "M"))
 mother   father  sex    .output   .lwr   .upr
-------  -------  ----  --------  -----  -----
     63       68  F         63.2   59.0   67.4
     63       68  M         68.4   64.2   72.7

In Lesson ?sec-lesson-26, we will look at the components that make up the prediction interval and some ways to use it.

Prediction or confidence interval

We have encountered two different interval summaries: the confidence interval and the prediction interval. It’s important to keep straight the different purposes of the different intervals.

A confidence interval is used to summarize the precision of an estimate of a model coefficient or effect size.

A prediction interval is used to express the uncertainty in the outcome for any given model inputs.

By default, model_eval() gives the prediction interval. The following chunk produces a prediction (and prediction interval) for several values of mother’s height: 57 inches up to 72 inches.

Mod3 %>% 
  model_eval(mother=c(57,62, 67), 
             father=68, sex=c("F", "M"))
 mother   father  sex    .output   .lwr   .upr
-------  -------  ----  --------  -----  -----
     57       68  F         61.3   57.0   65.5
     62       68  F         62.9   58.6   67.1
     67       68  F         64.5   60.3   68.7
     57       68  M         66.5   62.2   70.8
     62       68  M         68.1   63.9   72.3
     67       68  M         69.7   65.5   74.0

The prediction intervals are broad, roughly 8 inches. This is consistent with the real-life observation that kids and their parents can be quite different in height.

Figure 21.1: Prediction intervals for Mod3 for several different values of mother’s height and a father 68 inches tall.

The prediction interval answers a question like this: If I know that a woman’s mother was 65 inches tall (and her father 68 inches and her sex, self-evidently, F), then how tall is the woman likely to be? To judge from Figure 21.1, we can fairly say that she is very likely (95%) to be between 60 and 68 inches tall.

Constructing a prediction interval

Note in draft

This has to be revised to reflect the new emphasis on the proper form of a prediction.

Lesson ?sec-lesson-25 introduced predictions in two forms:

  1. a point quantity, the direct output of the model function.
  2. the prediction interval, which indicates a range of likely values for the quantity being predicted.

To clarify this distinction, consider this procedure that trains a model and applies the model function to inputs to generate a prediction in point-estimate form.

Time_mod <- lm(time ~ distance + climb, data = Hill_racing)
model_eval(Time_mod, distance=10, climb=500)

If you have studied calculus, it might seem strange not to use the mathematical syntax where the function name comes before the parentheses and only the inputs to the function are contained inside. Like this: Time_mod(distance=10, climb=500). Instead, we use the more verbose model_eval(Time_mod, distance=10, climb=500).

Every regression model contains the information needed to construct a function in the mathematical style. But model objects contain other data such as the residuals from model training and some more technical ones (such as “degrees of freedom”) that software such as conf_interval() makes use of and are essential guides to interpretation for the statistical thinker.

Notice that the output from model_eval() contains a column .output. This contains the output of the model function for the provided inputs as well as the lower and upper limits of a prediction interval (in .lwr and .upr respectively).

The .output column could be called a “point prediction”—a single number indicating a likely response value for the given explanatory inputs. Many people prefer a point prediction, possibly because the single number suggests a single, correct answer, which is somehow emotionally comforting. But the comfort is unjustified.

The proper form for a prediction is a list of the possible outcomes each of which is assigned a probability. That is, a probability distribution. To write the prediction in a more compact way, it is often presented as a prediction interval: two numbers bounding the lower and upper limits for the likely outcome. (Statistical thinkers learn that the interval is just shorthand for a bell-shaped probability distribution.)

For the hill-racing model, the point prediction is 3372.985 seconds, which is a running time of just under one hour. Nothing about this single number even tells us how many digits are appropriate. The prediction interval tells a different story. The interval, 1700 to 5100 seconds, conveys the appropriate uncertainty in the prediction.

Where does the prediction interval come from?

The prediction interval has two distinct components:

  1. The uncertainty in the model function and hence in the output of the model function.
  2. The size of the residuals found when training the model.

Consider first the model function. For the running-time model, we can construct the model function from the coefficients of the linear model. These are:

Time_mod %>% conf_interval()
term                  .lwr         .coef          .upr
------------  ------------  ------------  ------------
(Intercept)    -533.432471   -469.976937   -406.521402
distance        246.387096    253.808295    261.229494
climb             2.493307      2.609758      2.726209

The algebraic expression for the model function is straightforward: \[t(d, c) \equiv -470 + 254 d + 2.61 c\ .\]

The statistical thinker knows that such coefficients have uncertainty due to sampling variation. That uncertainty is, of course, quantified by the confidence interval.

Time_mod %>% conf_interval()
term                  .lwr         .coef          .upr
------------  ------------  ------------  ------------
(Intercept)    -533.432471   -469.976937   -406.521402
distance        246.387096    253.808295    261.229494
climb             2.493307      2.609758      2.726209

Since we cannot legitimately claim to know the values of the coefficients any better than indicated by these confidence intervals, we ought to temper our claims about the model function so that it reflects the uncertainty in the coefficients. For instance, we might provide an interval for the model output, using in an “upper” function the high ends of the confidence intervals on the coefficients and another “lower” function that uses the low ends of the confidence interval. Like this:

\[t_{upr}(d,c) \equiv -407 + 261 d + 2.72 c\] \[t_{lwr}(d,c) \equiv -533 + 246 d + 2.49 c\]

Evaluate both the lower and upper functions to get an interval on the model output. That would give us \(t_{lwr}(10, 500) = 3172\) and \(t_{upr}(10, 500) = 3569\).

This idea for generating the “lower” and “upper” functions has the right spirit but is not on target mathematically. The reason is that using the low end of the confidence interval for all coefficients is overly pessimistic; usually, the uncertainty in the different coefficients cancels out to some extent.

The mathematics for the correct “lower” and “upper” functions are well understood but too advanced for the general reader. For our purposes, it suffices to know that model_eval() knows how to do the calculations correctly.

The prediction interval produced by model_eval() includes both components (1) and (2) listed above. Insofar as we are interested in component (1) in isolation, the correct sort of interval—a confidence interval—can be requested from model_eval().

Time_mod %>% 
    model_eval(distance=10, climb=500, interval="confidence")
 distance   climb    .output       .lwr       .upr
---------  ------  ---------  ---------  ---------
       10     500   3372.985   3335.264   3410.706

This report shows that the confidence interval on the model output—that is, just component (1) of the prediction interval—is pretty narrow: 3335 seconds to 3411 seconds, or, in plus-or-minus format, \(3373 \pm 38\) seconds.

The prediction interval—that is, the sum of components (1) and (2)—is comparatively huge: 1700 to 5100 seconds or, in plus-or-minus format, \(3400 \pm 1700\) seconds. That is almost 50 times wider than the confidence interval.

Why is the prediction interval so much more comprehensive than the confidence interval? The confidence interval reports on the sampling variation of a model constructed as an average over all the data, the \(n=2236\) participants recorded in the Hill_racing data frame. However, each runner in Hill_racing has their own individual time: not an average but just for the individual. The individual value might be larger or smaller than the average. How much larger or smaller? The residuals for the model provide this information. As always, we can measure the individual-to-individual variation with the standard deviation.

Time_mod %>% model_eval() %>% summarize(se_residuals = sd(.resid))
Using training data as input to model_eval().

Keeping in mind that the overall variation of the residuals is plus-or-minus “twice” the standard deviation of the residuals, we can say that the residuals indicate an additional uncertainty in the prediction for a runner of about \(\pm 1700\) seconds. This \(\pm 1700\) second is our estimate of the noise in the measurements. In contrast, the confidence interval is about the sampling variation in the signal.

In this case, the prediction interval is wholly dominated by noise; the sampling variability contributes only a tiny amount of additional uncertainty.

Example: Graphics for the prediction interval

We shift the running scene from Scotland to Washington, DC. The race now is a single 10-miler with almost 9000 registered participants. We wish to predict the running time of an individual based on his or her age.

Age_mod <- lm(net ~ age, data = TenMileRace)

We can see the prediction interval for an individual runner using model_eval(). For example, here it is for a 23-year-old.

Age_mod %>% model_eval(age=23)
 age    .output       .lwr       .upr
----  ---------  ---------  ---------
  23   5485.587   3592.054   7379.119

We can also calculate the prediction interval for several different ages and graph out the results with the “errorbar” glyph:

Age_mod %>% 
  model_eval(age=c(20,25,30,35,40,45,50,55,60,65,70,75)) %>%
  ggplot(aes(x=age)) +
  geom_errorbar(aes(ymin=.lwr, ymax=.upr))

For convenience, the model_plot() function will do this work for us, plotting the prediction interval along with the training data. We can also direct model_plot() to show the confidence interval.

model_plot(Age_mod, x=age, interval="prediction", data_alpha=0.025)
model_plot(Age_mod, x=age, interval="confidence", data_alpha=0.01)

(a) prediction interval

(b) confidence interval

Figure 21.2: The prediction and model-function confidence intervals for the model net ~ age.

Since we are looking at the intervals as a function of an input variable, what we formerly showed using the “errorbar” glyph is now shown using a ribbon or band.

Notice that the prediction interval covers almost all of the data points. This is intentional: the nature of prediction intervals. There are hundreds of data points outside the interval, but with almost 9000 rows in the TenMileRace data frame, an interval that covers 95% of the data will have about 450 rows outside the interval.

The confidence band on the model function is pleasingly narrow and precise. It covers only a tiny fraction of the raw data. A characteristic for people who have yet to develop statistical thinking skills to believe that the narrow intervals tell more than broader intervals. Prediction and confidence intervals address very different questions and you should always choose the type of interval that’s appropriate for the question you want to answer.

Prediction intervals are the right tool to describe the range of likely outcomes for an individual observation. In the age-and-running example, the prediction interval is indeed very broad. But the breadth correctly indicates that age alone is not informative about running performance. The broad interval tells the modeler that, if a precise prediction for an individual is needed, other information will have to be brought to bear on the problem, for instance a runner’s fitness, data from previous runs, and so on.

The confidence interval is for an entirely different purpose. Confidence intervals take advantage of averaging across all of the factors not included in the model. Such averaging increases precision and so a confidence interval is a good tool for showing relationships between variables general trends in the data. For instance, Figure 21.2 shows a clear upward trend in running time with age. There is no flat or negatively sloping line compatible with the confidence interval.

To summarize:

  1. When making a prediction, report a prediction interval.
  2. The prediction interval is always larger than the confidence interval and is usually much larger.

The confidence interval is not for predictions. Use a confidence interval when looking at an effect size. Graphically, the confidence interval is to indicate whether there is an overall trend in the relationship between the response variable and the explanatory variable.

In draft: The prediction interval

is shorthand for a probability model. There is a strong link between interval descriptions of variation and the density display. Suppose you specify the fraction of cases that you want to include in an interval description, say 50% or 80%. In terms of the violin, that fraction is a proportion of the overall area of the violin. For instance, the 50% interval would include the central 50% of the area of the violin, leaving 25% out at the bottom and another 25% out at the top. The 80% interval would leave out only 10% of the area at the top and bottom of the violin. This suggests that the interval style of describing variation really involves three numbers; the top and bottom of the interval as well as the selected percentage (say, 50% or 80%) used to find the location of the top and bottom.


TITLE GOES HERE: You’ve been told that Jenny is in an elementary school that covers grade K through 6. Predict how old is Jenny.

  1. Put your prediction in the format of assigning a probability to each of the possible outcomes, as listes below. Remember that the sum of your probabilities should be 1. (You don’t have to give too much thought to the details. Anything reasonable will do.)
Age         | 3 or under | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12  | 13 | 14 | 15+
probability |            |   |   |   |   |   |   |    |    |     |    |    |

Perhaps something like the following, where the probabilities are given in percentage points.

Age         | 3 or under  | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12  | 13  | 14  | 15+
probability |      0      | 2.5 | 12  | 12  | 12  | 12  | 12  | 12  | 12  | 11  | 2   | 0.5 | 0

Ages 5 through 12 are equally likely, with a small possibility of 4-year olds or 14 year olds.

  1. Translate your set of probabilities to a 95% prediction interval.

The 95% prediction interval 5 to 12 years old.

A 95% interval should leave out 2.5% of the total probability on either end. Below age 5 there is 2.5% and above age 12 there is 2.5%.

If you wrote your own probabilities so that there’s no cut-off that gives exactly 2.5%, then set the interval to come as close as possible to 2.5%.

At a very large ballroom dance class, you are to be teamed up with a randomly selected partner. There are 200 potential partners. The figure below shows their heights.

From the data plotted, calculate a 95% prediction interval on the height of your eventual partner. (Hint: You can do this by counting.)


59 to 74 inches.

Since there are 200 points, a 95% interval should exclude the top five cases and the bottom five cases. So draw the bottom boundary of the interval just above the bottom five points, and the top boundary just below the top five points.

The town where you live has just gone through a so-called 100-year rain storm, which caused flooding of the town’s sewage treatment plant and consequent general ickiness. The city council is holding a meeting to discuss install flood barriers around the sewage treatment plant. The are trying to decide how urgent it is to undertake this expensive project. When will the next 100-year storm occur.

To address the question, the city council has enlisted you, the town’s most famous data scientist, to do some research to find the soonest that a 100-year flood can re-occcur.

You look at the historical weather records for towns that had a 100-year flood at least 20 years ago. The records start in 1900 and you found 1243 towns with a 100-year flood that happened 20 or more years ago. The plot shows, for all the towns that had a 100-year flood at least 20 years ago, how long it was until the next flood occurred. Those town for which no second flood occurred are shown in a different color.

You explain to the city council what a 95% prediction interval is and that you will put your prediction in the form of a probability of 2.5% that the flood will occur sooner than the date you give. You show them how to count dots on a jitter plot to find the 2.5% level.

Since the town council is thinking of making the wall-building investment in the next 10 years, you also have provided a zoomed-in plot showing just the floods where the interval to the next flood was less than ten years.

  1. You have n = 1243 floods in your database. How many is 2.5% of 1243? Answer: 31
  2. Using the zoomed-in plot, starting at the bottom count the number of floods you calculated in part (a). A line drawn where the counting stops is the location of the bottom of the 95% coverage interval. Where is the bottom of the 95% interval. Answer: About 2.5 years.
  3. A council member proposes that the town act soon enough so that there is a 99% chance that the next 100-year flood will not occur before the work is finished. It will take 1 year to finish the work, once it is started. According to your data, when should the town start work? Answer: Find the bottom limit that excludes 1% of the 1243 floods in your data. This will be between the 12th and 13th flood, counting up from the bottom. This will be at about 1.25 years, that is 15 months. So the town has 3 months before work must begin. That answer will be a big surprise to those who think the next 100-year flood won’t come for about 100 years.
  4. A council member has a question. “Judging from the graph on the left, are you saying that the next 100-year flood must come sometime within the next 120 years?” No, that’s not how the graph shold be read. Explain why. Answer: Since the records only start in 1900, the longest possible interval can be 120 years, that is, from about 2020 to 1900. About half of the dots in the plot reflect towns that haven’t yet had a recurrence 100-year flood. Those could happen at any time, and presumably many of them will happen after an interval of, say, 150 years or even longer.

Calculation of a 95% coverage interval (or any other percent level interval) is straightforward with the right software. To illustrate, consider the efficiency of cars and light trucks in terms of CO_2 emissions per mile driven. We’ll use the CO2city variable in the SDSdata::MPG data frame. The basic calculation using the mosaic package is:

SDSdata::MPG |> df_stats( ~ CO2city,  coverage(0.95))
response      lower     upper
---------  --------  --------
CO2city     276.475   684.525

The following figure shows a violin plot of CO2city which has been annotated with various coverage intervals. Use the calculation above to identify which of the intervals corresponds to which coverage level.

  1. 50% coverage interval Answer: (c)
  2. 75% coverage interval Answer: (e)
  3. 90% coverage interval Answer: (g)
  4. 100% coverage interval Answer: (i). This extends from the min to the max, so you could have figured this out just from the figure.

Maybe beech-run-mug?

TITLE GOES HERE: The graph plots winning time in the Scottish hill races as a function of climb (in meters) and distance (in km). There is a jittered data layer as well as an interval layer showing the 95% prediction interval.

  1. What is the unit of observation for the data layer. Answer: In the hill racing data, the unit of observation is a winner of a race. In the interval layer, the unit of observation is a stratum of distance and climb.
  2. How many different strata for race distance are there? Answer: Five. Within each of the five distance strata are a few sub-strata for climb.
  3. From the legend you can see that there are five strata for climb. Yet not every climb stratum shows up for each distance stratum. State, in everyday terms, why some strata are missing. Answer: The shorter races don’t include any with the largest climbs, while the longer races don’t include any with the shortest climbs.
  4. Your friend, a competitive hill racer, is planning to run a race with a distance of 13 km and a climb of 1200 m. Make a prediction, in the form of a 95% interval, of how long the race will take for the winner. Answer: The inputs correspond to the stratum of 10-15 km and 1000-1500 m. Looking at the interval for this stratum, you can read off the prediction interval as 4700 to 8500 seconds.
  5. The Scottish hill racing data includes both male and female winners. Make a simple sketch of what the interval plot would look like if sex were included as an explanatory variable in the interval plot. (You can be casual about the exact lengths and positions of the individual prediction intervals.)

The HELPrct date frame (in the mosaicData package) is about a clinical trial (that is, an experiment) conducted with adult inpatients recruited from a detoxification unit. The response variable of interest reflects the success or failure of the detox treatment, namely, did the patient continue use of the substance abused after the treatment.

?fig-giraffe-fall-door-1 shows the output of a simple classifier (maybe too simple!) of the response given these inputs: the average number of alcoholic drinks consumed per day in the past 30 day (before treatment); and the patient’s self-perceived level of social support from friends. (The scale for social support is zero to fourteen, with a higher number meaning more support.)

Classifier based on data from a clinical trial
  1. What’s the probability of treatment failure for a patient who has 25 alcoholic drinks per day? Does the probability depend on the level of social support? Answer: Probability of failure is 75%, and doesn’t depend on the level of social support.

  2. For a patient at 0 to 10 alcoholic drinks per day, what’s the probability of treatment failure? Does the probability depend on the level of social support? Answer: The probability of failure ranges from about 72% for those with no social support to 82% for those with high social support?

  3. You are thinking about a friend who has roughly five alcoholic drinks per day. You are concerned that he will go on to substance abuse. Do the data from the clinical trial give good reason for your concern? Explain why or why not.


It’s always a good idea to be concerned for your friend, but the data reported here are not a basis for that concern. These data are from a population consisting of inpatients from a detoxification unit. These are people who have already shown strong substance abuse. The classifier is not generalizable to your friend, unless he is an inpatient from a detox unit.

  1. Explain what’s potentially misleading about the y-axis scale selected for the plot. Answer: The selected scale doesn’t include zero and so tends to over-emphasize what amount to small differences in the probability of failure.