In everyday language, “risk” refers to a dangerous or unwelcome outcome. We talk about the “risk of heart disease” or the “risk of bankruptcy” or other financial loss. To apply risk to a positive outcome is non-idiomatic. For instance, for a person wanting to have a baby, we don’t talk about the “risk of pregnancy,” but about the “chances of becoming pregnant.”

In statistics, the word “risk” is similarly used for an unwelcome outcome. However, an additional component of meaning is added. “Risk” refers to the probability of the unwelcome outcome. In principle, though, it would be entirely equivalent to speak of the “probability of heart disease,” leaving the phrase “heart disease” to signal that the outcome is unwanted. We talk about the “risk of death” but never the “risk of survival.” Instead, we would say something like the “chances of survival.”

The outcomes described by “risk” are categorical. Generically, the levels of the categorical variable might be “unwelcome” and “not unwelcome,” but they might be more specifically named, say, “death” and “survival,” or “lung disease” and “not.”

We have been building models of such categorical output variables from the start of these Lessons. For the zero-one categorical variables we have emphasized, the model output is in the form of a probability: the probability of the outcome of the event being “one” (or whatever actual level “one” corresponds to.) If we assign one for “death” and zero for “survival,” the probability which is the output of a model is a risk, but other than the choice of zero-one assignment, there is no mathematical difference (in statistics) between a risk and a probability.

It often happens that risk depends on other factors, often called “risk factors.” In our modeling framework, such risk factors are merely explanatory variables. For instance, a study of the impact of smoking on health might use outcome represented by a categorical response variable with levels “death” or “survival.”

Risk vocabulary

In statistical terms, a risk is a probability associated with an outcome.

A full description of risk looks much like a prediction: a complete list of possible outcomes, each associated with a probability, which we’ll call a risk level.

A risk level is properly measured as a pure number, e.g. 30 percent.

Being a probability, such numbers must always be between 0 and 1, or, equivalently, between 0 and 100 percent.

There are two ways of referring to percentages, e.g. 30 percent vs 30 percentage points. When talking about a single risk, these two are equivalent. However, “percentage points” should be reserved for a particular situation: Describing a change in absolute risk.

For simplicity, we will focus on situations where there are only two outcomes, e.g. alive/dead, success/failure, cancer/not, diabetes/not.

Since there are only two outcomes, knowing the probability p of one outcome automatically sets the probability of the other outcome.

One of the outcomes is worse than the other, so we usually take the risk to be the worse outcome and its probability.

A risk factor is a condition, behavior, or such that changes the probability of the (worse) outcome. Just to have concise names, we will use this terminology:

baseline risk (level): the risk (level) without the risk factor applying.

augmented risk (level) : the risk (level) when the risk factor applies.

A risk ratio is exactly what the name implies: the ratio of the augmented risk to the baseline risk.

For instance, suppose the baseline risk is 30 percent and the augmented risk is 45%. Then the risk ratio is 45/30 = 1.5 = 150 percent. Risk ratios are often greater than 1, which should remind us that a risk ratio is a different kind of beast from a risk, which can never be larger than 1.

There are two distinct uses for risk factors:

Draw attention to a factor under our control (e.g. skiing, biking, using a motorcycle, smoking) so that we can decide whether the augmentation in risk is worth avoiding.

Establish the baseline risk in a relevant way (e.g. our age, sex, and so on).

For decision-making regarding a risk factor, it is most meaningful to focus on the change in absolute risk, that is, the difference between the augmented risk and the baseline risk.

Example: The risk ratio for the smoking risk factor is about 2.5/1 for ten-year, all-cause mortality. If the baseline risk is 3 percentage points, the augmented risk is 7.5%. Consequently, the augmentation in risk for smoking is (2.5-1) x 3% = 4.5 percentage points. On the other hand, if the baseline risk were 30 percentage points, the 2.5 risk ratio increases the the risk by 45 percentage points.

Notice that we are describing the augmentation in risk as “percentage points.” Always use “percentage points” to avoid ambiguity. If we had said, “45 percent,” people might mistake the augmentation in risk as a risk ratio of 1.45.

Question

Why bother to present risk factors in terms of risk ratios when for decision-making it’s better to use the augmentation in risk in percentage points?

Answer: Because the same risk factor can lead to different amounts of augmentation depending on the baseline risk. If there are multiple risk factors, then adding up such augmentations can potentially lead to the risk level exceeding 100%.

One reason there are multiple risk factors is that we often need to adjust for confounders, etc., which means putting more factors into consideration.

Modeling risk

The linear models (lm()) we have mostly been using up until now accumulate the model output as a linear combination of model inputs. Consider, for instance, a simple model of fuel economy based on the horsepower and weight of a car:

mpg_mod <-lm(mpg ~ hp + wt, data = mtcars) mpg_mod %>%conf_interval()

term .lwr .coef .upr
------------ ----------- ----------- -----------
(Intercept) 33.9573825 37.2272701 40.4971578
hp -0.0502408 -0.0317729 -0.0133051
wt -5.1719160 -3.8778307 -2.5837454

These coefficients mean that the model output is a sum. For instance, a 100 horsepower car weighting 2500 pounds has a predicted fuel economy of 37.2 - 0.032*100 - 3.88*2.5=24.3 miles per gallon.^{1} If we’re interested in making a prediction, we often hide the arithmetic behind a computer function, but it is the same arithmetic:

The arithmetic, in principle, lets us evaluate the model for any inputs, even ridiculous ones like a 10,000 hp car weighing 50,000 lbs. There is no such car, but there is a model output.^{2}

The prediction reported here means that such a car goes negative 474 miles on a gallon of gas. That’s silly. Fuel economy needs to be non-negative; the output \(-474\) mpg is out of bounds.

A good way to avoid out-of-bounds behavior is to model a transformation of the response variable instead of the variable itself. For example, to avoid negative outputs from a model of mpg, change the model so that the output is in terms of the logarithm of mpg, like this:

logmpg_mod <-lm(log(mpg) ~ hp + wt, data = mtcars) model_eval(logmpg_mod, hp =100, wt =2.5)

The reported output, 3.17, should not be interpreted as mpg. Instead, interpret it as log(mpg). If we want output in terms of mpg, then we have to undo the logarithm. That’s the role of the exponential function, which is the inverse of the logarithm.

model_eval(logmpg_mod, hp =100, wt =2.5) %>%mutate(mpg =exp(.output))

The logarithmic transform at the model-training stage does not not prevent the model output from being negative. We can see this by looking at the tank example:

mod_logmpg <-lm(log(mpg) ~ hp + wt, data = mtcars)model_eval(mod_logmpg, hp=10000, wt=50) %>%mutate(mpg =exp(.output))

The model output is negative for the tank, but the model output corresponds to log(mpg). What will keep the model from producing negative mpg will be the exponential transformation applied to the model output. A graph of the exponential function shows how this works.

The log transform does not fix the absurdity of modeling tanks based on the fuel economy of cars. The model’s prediction of mpg for the tank is 0.0000000004 miles/gallon, but real-world tanks do much better than that. For instance, the M1 Abrams tank is reported to get approximately 0.6 miles per gallon.

Risk

To summarize, for statistical thinkers a model of risk takes the usual form that we have used for models of zero-one categorical models. All the same issues apply: covariates, DAGs, confidence intervals, and so on. There is, however, a slightly different style for presenting effect sizes.

Up until now, we have presented effect in terms of an arithmetic difference. As an example, we turn to the fuel-economy model introduced at the beginning of this lesson. Effect sizes are about changes. To look at the effect size of, say, weight (wt), we would calculate the model output for two cars that differ in weight (but are the same for the other explanatory variables). For instance, to know the change in fuel economy due to a 1000 pound change in weight, we can do this calculation:

model_eval(logmpg_mod, hp =100, wt =c(2.5, 3.5)) %>%mutate(mpg =exp(.output))

The lighter car is predicted to get 24 mpg, the heavier car to get 19.5 mpg. The arithmetic difference in output \(19.5 - 24 = -4.5\) mpg is the effect of the 1000 pound increase in weight.

There is another way to present the effect, as a ratio or proportion. In this style, the effect of an addition 1000 pounds is \(19.5 / 24 = 81\%\), that is, the heavier car can go only 81% of the distance that the lighter car will travel on the same amount of gasoline. (Stating an effect as a ratio is common in some fields. For example, economists use ratios when describing prices or investment returns.)

In presenting a change in risk—that is, the change in probability resulting from a change in some explanatory variable—both the arithmetic difference and arithmetic ratio forms are used. But there is a special terminology that is used to identify the two forms. A change in the form of an arithmetic difference is called an “absolute change in risk.” A change in the ratio form is called a “relative risk.”

The different forms—absolute change in risk versus relative risk—both describe the same change in risk. For most decision-makers, the absolute form is most useful. To illustate, suppose exposure to a toxin increases the risk of a disease by 50%. This would be a risk ratio of 1.5. But that risk ratio might be based on an absolute change in risk from 0.00004 to 0.00006, or it might be based on an absolute change in risk from 40% to 60%. The latter is a much more substantial change in risk and ought to warrant more attention from decision makers interested.

Other ways to measure change in risk

It is important for measures of change in risk to be mathematically valid. But from among the mathematically valid measures, one wants to choose a form that will be the best for communicating with decision-makers. Those decision-makers might be the people in charge of establishing screening for diseases like breast cancer, or a judge and jury deciding the extent to which blame for an illness ought to be assigned to second-hand smoke.

Two useful ways to present a change in risk are the “number needed to treat” (NNT) and the “attributable fraction.” The NNT is useful for presenting the possible benefits of a treatment or screening test. Consider these data from the US Preventive Services Task Force which take the form of the number of breast-cancer deaths in a 10-year period avoided by mammography. The confidence interval on the estimated number is also given.

Age

Deaths avoided

Conf. interval

40-49

3

0-9

50-59

8

2-17

60-69

21

11-32

70-74

13

0-32

The table does not give the risk of death, but rather the absolute risk reduction. For the 70-74 age group this risk reduction is 13/100000 with a confidence interval of 0 to 32/100000.

The NNT is well named. It gives the number of people who must receive the treatment in order to avoid one death. Arithmetically, the NNT is simply the reciprocal of the absolute risk reduction. So, for the 70-74 age group the NNT is 100000/13 or 7700 or, stated as a confidence interval, [3125 to \(\infty\)].

For a decision-maker, NNT presents the effect size in a readily understood way. For example, the 40-49 year-old group has an NTT of 33,000. The cost of the treatment could be presented in terms of anxiety prevented (mammography produces a lot of false positives) or monetary cost. The US Affordable Care Act requires health plans to fully cover the cost of a screening mammogram every one or two years for women over 40. Those mammograms each cost about $100-200. Consequently, the cost of mammography over the ten-year period (during which 5 mammograms might be performed) is roughly \(5\times \$100 \times 33000\) or about $16 million per life saved.

The attributable fraction is a way of presenting a risk ratio—in other words, a relative risk—in a way that is more concrete than the ratio itself. Consider the effect of smoking on the risk of getting lung cancer. According to the US Centers for Disease Control, “People who smoke cigarettes are 15 to 30 times more likely to get lung cancer.” This statement directly gives the confidence interval on the relative risk: [15 to 30].

The attributable fraction refers to the proportion of disease in the exposed group—that is, smokers—to be attributed to expose. The general formula for attributable fraction is simple. If the risk ratio is denoted \(RR\), the attributable fraction is \[\text{attributable fraction} \equiv \frac{RR-1}{RR}\] For a smoker who gets lung cancer, the confidence interval on the attributable fraction is [93% to 97%].

For second-hand smoke, the CDC estimates the risk ratio for cancer at [1.2 to 1.3]. For a person exposed to second-hand smoke who gets cancer, the attributable fraction is [17% to 23%]. Such attributions are useful for those, such as judges and juries, who need to assign a level of blame for a bad outcome.

Accumulating risk

There can be multiple factors that contribute to risk. As a simple example of how different factors can contribute simultaneously, consider the following DAG:

dag_draw(dag09)

print(dag09)

a ~ exo()
b ~ exo()
c ~ binom(2 * a + 3 * b)

The formulas for dag09 show that the nodes a and b are exogenous, their values set randomly and independently of one another by the rnorm() function. Both a and b contribute to the value of c. Ordinary stuff. But there is something new in this simulation, the bernoulli() function. The output of bernoulli() will always be either 0 or 1. If the input to bernoulli() is zero, it’s completely random whether a 0 or 1 will result. The more positive the input to bernoulli(), the larger the chance of a 1. Similarly, the more negative the input, the larger the chance of a 0. Like this:

Alternatively, you can use the labels= argument to assign categorical labels instead of zero or one.

In dag09, the input to bernoulli() is 2*a + 3*b; both a and b are risk factors for c being 1. The larger that a is, the larger the risk. Similarly, the larger b, the larger the risk. The formula 2*a + 3*b accumulates the risk steming from a and b.

Let’s generate a big sample—\(n=10,000\)—from dag09 and see if a simple model is able to capture the way a and b contribute to c:

sample(dag09, size=10000) %>%lm(c ~ a + b, data = .) %>%conf_interval()

term .lwr .coef .upr
------------ ---------- ---------- ----------
(Intercept) 0.4983874 0.5052478 0.5121081
a 0.1901735 0.1971530 0.2041324
b 0.2932693 0.3001183 0.3069672

The coefficients on a and b are inconsistent with the dag09 formulas. What’s wrong? The problem can be seen in Figure 17.2. The output of binom() is a probability. Probabilities have to stay within the bounds 0 to 1. But there is nothing in the formula c ~ a + b that enforces such bounds. If we are to model zero-one response variables like c, we will need a means to enforce the bounds.

Probability, odds, and log odds

Under construction

A probability—a number between 0 and 1—is the most used measure of the chances that something will happen, but it is not the only way nor the best for all purposes.

Also part of everyday language is the word “odds,” as in, “What are the odds?” to express surprise at an unexpected event.

Odds are usually expressed in terms of two numbers, as in “3 to 2” or “100 to 1”, written more compactly as 3:2 and 100:1 or even 1.5 and 100, respectively. The setting for odds is an even that might happen or not: the horse Fortune’s Chance might win the race, otherwise not; it might rain today, otherwise not; the Red Sox might win the World Series, otherwise not.

The format of a probability assigns a number between 0 and 1 to the chances that Fortune’s Chance will win, or that it will rain, or that the Red Sox will come out on top. If that number is called \(p\), then the chances of the “otherwise outcome” must be \(1-p\). The event with probability \(p\) would be reformatted into odds as \(p:(1-p)\). No information is lost if we treat the odds as a single number, the result of the division \(p/(1-p)\). Thus, when \(p=0.25\) the corresponding odds will be \(0.25/0.75\), in other words, 1/3.

A big mathematical advantage to using odds is that the odds number can be anything from zero to infinity; it’s not bounded within 0 to 1. Even more advantageous for the purposes of accumulating risk is the logarithm of the odds, called “log odds.” We will come back to this later.

The printed version of dag09 shows that the value of node c is a linear combination of a and b converted into a zero-one, binomial value. Unfortunately, the linear modeling trainer, lm(), is not well-tuned to work with binomial data. Another modeling technique, “logistic regression,” does a better job. The glm() function trains logistic regression models on data.

sample(dag09, size=10000) %>%glm(c ~ a + b, data = ., family="binomial") %>%conf_interval()

term .lwr .coef .upr
------------ ----------- ---------- ----------
(Intercept) -0.0805027 -0.018025 0.0444399
a 1.8634245 1.956253 2.0515897
b 2.9106033 3.034490 3.1621667

When we use the appropriate modeling technique, we can, in this case, recover the coefficients in the DAG formula: 2 for a and 3 for b.

It has long been touted that 10,000 steps a day is the magic number you need to stay fit and healthy—but a new study shows fewer than 5,000 may be enough to see a benefit.

The analysis of more than 226,000 people around the world showed 4,000 was enough to start reducing the risk of dying [from] any cause.

The more you do, the more health benefits are seen, researchers said. Every extra 1,000 steps beyond the 4,000 reduced the risk of dying … by 15% up to 20,000 steps.

The researchers did not—and could not—measure the mortality rate for an individual. (That rate would be either zero or, at the moment of death, infinite!) Instead, they looked at the mortality rate of groups of similarly situated individuals: age, sex, and steps-per-day.

Consider an individual, Alice, who walks 4000 steps per day and is in a group with a mortality rate of 2% per year. Suppose Betty is just like Alice, but walks an extra 1000 steps each day. What is the mortality rate in Betty’s group?

Now consider Carol, who walks 14,000 steps per day but is otherwise just like Alice and Betty. What is the mortality rate in Carol’s group?

What is the mortality risk ratio for Carol’s group compared to Alice’s?

The mortality rate tends to increase with age. But let’s stipulate, contrary to fact, that it was constant at all ages. Which of the families of probability distributions would provide an appropriate model of how many years of remaining life a person has? Explain your reasoning in choosing the appropriate probability distribution.

According to the stipulation in (4), would the years of remaining life depend on how old the person is?

Exercise 17.1 ash-chew-kitchen

AN EXAMPLE OF A CLASSIFIER WITH MULTIPLE LEVELS. OR MAYBE PUT THIS IN AN EXAMPLE IN THE TEXT AND POINT TO THE INTERESTING DATA for 70+ males.

The graph below shows a function of age. The output of the function is categorical: the most likely marital status of a person at a given age. (The model was trained on the National Health and Nutrition Evaluation Survey data.)

Attaching package: 'kernlab'

The following object is masked from 'package:mosaic':
cross

The following object is masked from 'package:ggplot2':
alpha

maximum number of iterations reached -0.002102341 -0.00214156

A classifier output should be a probability, not a categorical level. On the blank graph below, sketch out a plausible form for probability vs age for each of three categorical levels shown in the above plot. (Hint: At an age where, say, “NeverMarried” is the categorical output, the probability for “NeverMarried” will be higher than the other categories.)

Answer:

Presumably the probability output for each category varies somewhat smoothly. There are two constraints:

At any age/sex, one probability will be the highest of the three. That one should correspond to the category shown in the first graph.

The probabilities should add up to 1.

Here’s one possibility. Note that for females, the highest probability around age 80 is “widowed”.

Exercise 17.2 ant-take-room

To illustrate how stratification is used to build a classifier, consider this very simple, unrealistically small, made-up data frame listing observations of animals:

species size color
-------- ------ ---------
A large reddish
B large brownish
B small brownish
A large brownish

You are going to build classifiers using the data. The output of the classifier will be the probability that the species is A.

Use just size as an explanatory variable. Since there are two levels for size, the classifier can take the form of a simple table, giving the proportion of rows for each of the two sizes. Fill in the table to reflect the data.

size prop_of_A
------ ----------
large
small

Answer:

There are three rows where the size is large, of which one is species A. The classifier output is thus 2/3 for large.

Similarly, there is only one row where the size is small, none of which are species A. The classifier output is 0/1 for small.

Repeat (1), but instead of “size”, use just “color” as an explanatory variable. ::: {.cell} ::: {.cell-output .cell-output-stdout}

color prop_of_A
--------- ----------
reddish
brownish

:::

Answer:

There are three rows where the color is brownish, of which two are species A. The classifier output is thus 1/3 for brownish.

There is only one row where the color is reddish, and it is species A. The classifier output is 1/1 for reddish.

Again build a classifier, but use both color and size as explanatory variables.

color size prop_of_A
--------- ------ ----------
reddish large
reddish small
brownish large
brownish small

Answer:

There is just one row in which color is reddish and size is large, and it is species A. The classifier output is thus 1/1.

There are two rows in which color is brownish and size is large, one of which is species A. The classifier output is thus 1/2.

There is one row in which color is brownish and size is small. It is species B. The classifier output is 1/1.

There are no rows in which color is reddish and size is small. A classifier output of 0/0 is meaningless. So our classifier has nothing to say for these inputs.

Finally, build the “null model”, a no-input classifier. This means there is just one group, which has all four rows. Answer: Of the four rows, two are species A, so the classifier output is 2/4.

:::

The wt variable is measured in units of 1000 lbs, so a 2500 pound vehicle has a wt value of 2.5.↩︎

A 10,000 hp, 50,000 lbs ground vehicle does have a name: a “tank.” Common sense dictates that one not put too much stake in a calculation of a tank’s fuel economy based on data from cars!↩︎