
The *linear models* (`lm()`) we have mostly been using up until now accumulate the model output as a linear combination of model inputs. Consider, for instance, a simple model of fuel economy based on the horsepower and weight of a car:

```
mpg_mod <- lm(mpg ~ hp + wt, data = mtcars)
mpg_mod %>% conf_interval()
```

term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | 33.9573825 | 37.2272701 | 40.4971578 |
hp | -0.0502408 | -0.0317729 | -0.0133051 |
wt | -5.1719160 | -3.8778307 | -2.5837454 |

These coefficients mean that the model output is a **sum**. For instance, a 100 horsepower car weighing 2500 pounds has a predicted fuel economy of `37.2 - 0.032*100 - 3.88*2.5 = 24.3` miles per gallon.^{1} If we’re interested in making a prediction, we often hide the arithmetic behind a computer function, but it is the same arithmetic:

`model_eval(mpg_mod, hp = 100, wt = 2.5)`

hp | wt | .output | .lwr | .upr |
---|---|---|---|---|
100 | 2.5 | 24.3554 | 18.91817 | 29.79263 |
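The same sum can be reproduced by hand from the fitted coefficients. A quick check in base R (using `coef()` and `predict()` rather than the book's `model_eval()`):

```r
# Refit the model and extract its coefficients
mpg_mod <- lm(mpg ~ hp + wt, data = mtcars)
b <- coef(mpg_mod)

# The model output is just a linear combination of the inputs
by_hand <- b[["(Intercept)"]] + b[["hp"]] * 100 + b[["wt"]] * 2.5
by_hand   # about 24.36, matching the .output column above

# predict() performs exactly the same arithmetic
predict(mpg_mod, newdata = data.frame(hp = 100, wt = 2.5))
```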

The arithmetic, in principle, lets us evaluate the model for any inputs, even ridiculous ones like a 10,000 hp car weighing 50,000 lbs. There is no such car, but there is a model output.^{2}

`model_eval(mpg_mod, hp = 10000, wt = 50)`

hp | wt | .output | .lwr | .upr |
---|---|---|---|---|
10000 | 50 | -474.3937 | -623.7013 | -325.0862 |

The prediction reported here means that such a car goes *negative* 474 miles on a gallon of gas. That’s silly. Fuel economy needs to be non-negative; the output \(-474\) mpg is *out of bounds*.

A good way to avoid out-of-bounds behavior is to model a *transformation* of the response variable instead of the variable itself. For example, to avoid negative outputs from a model of `mpg`, change the model so that the output is in terms of the logarithm of `mpg`, like this:

```
logmpg_mod <- lm(log(mpg) ~ hp + wt, data = mtcars)
model_eval(logmpg_mod, hp = 100, wt = 2.5)
```

hp | wt | .output | .lwr | .upr |
---|---|---|---|---|
100 | 2.5 | 3.173411 | 2.939839 | 3.406984 |

The reported output, 3.17, should **not** be interpreted as `mpg`. Instead, interpret it as `log(mpg)`. If we want output in terms of `mpg`, then we have to undo the logarithm. That’s the role of the exponential function, which is the *inverse* of the logarithm.
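To see concretely that the exponential undoes the logarithm, and what that means for the 3.17 output above:

```r
# exp() is the inverse of log(): applying both returns the original value
exp(log(24))    # 24, as expected

# So the log-scale output 3.173411 translates back to miles per gallon:
exp(3.173411)   # about 23.9 mpg
```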

```
model_eval(logmpg_mod, hp = 100, wt = 2.5) %>%
  mutate(mpg = exp(.output))
```

hp | wt | .output | .lwr | .upr | mpg |
---|---|---|---|---|---|
100 | 2.5 | 3.173411 | 2.939839 | 3.406984 | 23.88884 |

The logarithmic transform at the model-training stage does not prevent the raw model output from being negative. We can see this by revisiting the extreme inputs from before, a 10,000 hp, 50,000 lb vehicle that is more tank than car:

```
mod_logmpg <- lm(log(mpg) ~ hp + wt, data = mtcars)
model_eval(mod_logmpg, hp = 10000, wt = 50) %>%
  mutate(mpg = exp(.output))
```

hp | wt | .output | .lwr | .upr | mpg |
---|---|---|---|---|---|
10000 | 50 | -21.6327 | -28.04665 | -15.21874 | 0 |

The model output is negative for the tank, but the model output corresponds to `log(mpg)`. What keeps the model from producing negative `mpg` is the exponential transformation applied to the model output. A graph of the exponential function shows how this works.

The log transform does not fix the absurdity of modeling tanks based on the fuel economy of cars. The model’s prediction of `mpg` for the tank is 0.0000000004 miles/gallon, but real-world tanks do much better than that. For instance, the M1 Abrams tank is reported to get approximately 0.6 miles per gallon.
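The arithmetic behind that tiny figure: exponentiating the large negative log-scale output gives a number that is minuscule but still positive.

```r
log_output <- -21.6327   # the model's log-scale output for the tank

# Exponentiation always produces a positive number, however small
exp(log_output)          # about 4e-10 mpg: tiny, but greater than zero
```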

In everyday language, “risk” refers to a dangerous or unwelcome outcome. We talk about the “risk of heart disease” or the “risk of bankruptcy” or other financial loss. To apply risk to a positive outcome is non-idiomatic. For instance, for a person wanting to have a baby, we don’t talk about the “risk of pregnancy,” but about the “chances of becoming pregnant.”

In statistics, the word “risk” is similarly used for an unwelcome outcome. However, an additional component of meaning is added. “Risk” refers to the **probability** of the unwelcome outcome. In principle, though, it would be entirely equivalent to speak of the “probability of heart disease,” leaving the phrase “heart disease” to signal that the outcome is unwanted. We talk about the “risk of death” but never the “risk of survival.” Instead, we would say something like the “chances of survival.”

The outcomes described by “risk” are categorical. Generically, the levels of the categorical variable might be “unwelcome” and “not unwelcome,” but they might be more specifically named, say, “death” and “survival,” or “lung disease” and “not.”

We have been building models of such categorical output variables from the start of these Lessons. For the zero-one categorical variables we have emphasized, the model output is in the form of a probability: the probability of the outcome of the event being “one” (or whatever actual level “one” corresponds to). If we assign one for “death” and zero for “survival,” the probability output by the model is a risk; other than the choice of zero-one assignment, there is no mathematical difference (in statistics) between a risk and a probability.

It often happens that risk depends on other factors, often called “risk factors.” In our modeling framework, such risk factors are merely explanatory variables. For instance, a study of the impact of smoking on health might use an `outcome` represented by a categorical response variable with levels “death” or “survival.”

To summarize, for statistical thinkers a model of risk takes the usual form that we have used for models with zero-one categorical response variables. All the same issues apply: covariates, DAGs, confidence intervals, and so on. There is, however, a slightly different style for presenting effect sizes.

Up until now, we have presented effects in terms of an arithmetic difference. As an example, we turn to the fuel-economy model introduced at the beginning of this lesson. Effect sizes are about *changes*. To look at the effect size of, say, weight (`wt`), we would calculate the model output for two cars that differ in weight (but are the same for the other explanatory variables). For instance, to know the change in fuel economy due to a 1000 pound change in weight, we can do this calculation:

```
model_eval(logmpg_mod, hp = 100, wt = c(2.5, 3.5)) %>%
  mutate(mpg = exp(.output))
```

hp | wt | .output | .lwr | .upr | mpg |
---|---|---|---|---|---|
100 | 2.5 | 3.173411 | 2.939839 | 3.406984 | 23.88884 |
100 | 3.5 | 2.972875 | 2.736388 | 3.209362 | 19.54803 |

The lighter car is predicted to get 24 mpg, the heavier car to get 19.5 mpg. The arithmetic difference in output \(19.5 - 24 = -4.5\) mpg is the effect of the 1000 pound increase in weight.

There is another way to present the effect, as a **ratio** or proportion. In this style, the effect of an additional 1000 pounds is \(19.5 / 24 = 81\%\), that is, the heavier car can go only 81% of the distance that the lighter car will travel on the same amount of gasoline. (Stating an effect as a ratio is common in some fields. For example, economists use ratios when describing prices or investment returns.)
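For a model of `log(mpg)`, the ratio form is particularly natural: the difference of two log-scale outputs, exponentiated, is exactly the ratio of the two predictions. A sketch using the outputs tabulated above:

```r
light <- 3.173411   # log-scale output for the 2500 lb car
heavy <- 2.972875   # log-scale output for the 3500 lb car

# The ratio of the two predictions ...
exp(heavy) / exp(light)   # about 0.818, the 81% figure (up to rounding)

# ... equals the exponentiated difference of the log-scale outputs
exp(heavy - light)        # also about 0.818
```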

In presenting a change in risk—that is, the change in probability resulting from a change in some explanatory variable—both the arithmetic difference and arithmetic ratio forms are used. But there is a special terminology that is used to identify the two forms. A change in the form of an arithmetic difference is called an “**absolute** change in risk.” A change in the ratio form is called a “**relative** risk.”

The different forms—absolute change in risk versus relative risk—both describe the same change in risk. For most decision-makers, the absolute form is most useful. To illustrate, suppose exposure to a toxin increases the risk of a disease by 50%. This would be a risk ratio of 1.5. But that risk ratio might be based on an absolute change in risk from 0.00004 to 0.00006, or it might be based on an absolute change in risk from 40% to 60%. The latter is a much more substantial change in risk and ought to warrant more attention from decision-makers.
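The arithmetic behind that example: both scenarios share the same relative risk, yet their absolute changes in risk differ by four orders of magnitude.

```r
# Scenario 1: a rare disease
rr1  <- 0.00006 / 0.00004   # relative risk: 1.5
abs1 <- 0.00006 - 0.00004   # absolute change in risk: 0.00002

# Scenario 2: a common disease
rr2  <- 0.60 / 0.40         # relative risk: also 1.5
abs2 <- 0.60 - 0.40         # absolute change: 0.20, ten thousand times larger
```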

There can be multiple factors that contribute to risk. As a simple example of how different factors can contribute simultaneously, consider the following DAG:

`dag_draw(dag09)`

`print(dag09)`

```
a ~ exo()
b ~ exo()
c ~ binom(2 * a + 3 * b)
```

The formulas for `dag09` show that the nodes `a` and `b` are exogenous, their values set randomly and independently of one another by the `exo()` function. Both `a` and `b` contribute to the value of `c`. Ordinary stuff. But there is something new in this simulation, the `binom()` function. The output of `binom()` will always be either 0 or 1. If the input to `binom()` is zero, it’s completely random whether a 0 or 1 will result. The more positive the input to `binom()`, the larger the chance of a 1. Similarly, the more negative the input, the larger the chance of a 0. Like this:

[Figure: the chance that `binom()` produces a 1, plotted against the value of its input.]

In `dag09`, the input to `binom()` is `2*a + 3*b`; both `a` and `b` are risk factors for `c` being 1. The larger that `a` is, the larger the risk. Similarly, the larger `b`, the larger the risk. The formula `2*a + 3*b` accumulates the risk stemming from `a` and `b`.
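To make the behavior of `binom()` concrete, here is a sketch of the idea in base R. The logistic curve used here is an assumption for illustration; the actual `binom()` in the book's package may map inputs to probabilities with a different curve, but the qualitative behavior is the same: an input of 0 gives a 50/50 draw, and large positive inputs almost always give 1.

```r
# Hypothetical stand-in for binom(): squash the input into a probability
# with a logistic curve, then draw a 0 or 1 with that probability.
binom_sketch <- function(x) {
  p <- 1 / (1 + exp(-x))   # input 0 -> p = 0.5; input 5 -> p = 0.993
  rbinom(length(x), size = 1, prob = p)
}

set.seed(33)
mean(binom_sketch(rep(0, 1e5)))    # close to 0.5: a coin flip
mean(binom_sketch(rep(5, 1e5)))    # close to 0.99: almost always 1
mean(binom_sketch(rep(-5, 1e5)))   # close to 0.01: almost always 0
```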

Let’s generate a big sample—\(n=10,000\)—from `dag09` and see if a simple model is able to capture the way `a` and `b` contribute to `c`:

```
sample(dag09, size = 10000) %>%
  lm(c ~ a + b, data = .) %>%
  conf_interval()
```

term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | 0.4925271 | 0.4993648 | 0.5062024 |
a | 0.1891660 | 0.1959941 | 0.2028223 |
b | 0.2918993 | 0.2987193 | 0.3055393 |

The coefficients on `a` and `b` are inconsistent with the `dag09` formulas. What’s wrong? The problem can be seen in Figure 33.2. The output of `binom()` is a probability. Probabilities have to stay within the bounds 0 to 1. But there is nothing in the formula `c ~ a + b` that enforces such bounds. If we are to model zero-one response variables like `c`, we will need a means to enforce the bounds.