21 Effect size
Regression modeling and confidence intervals provide a substantial toolbox to support statistical thinking. This Lesson starts to develop methods that use modeling to inform decision-making. Decision-making takes many guises: whether to administer medicine, change a budget, raise or lower a price, respond to an evolving situation, and so on.
A useful simplification splits support for decision-making into two broad categories.
Making a prediction for an individual choice. The need for predictions arises in both mundane and critical settings. For instance, an airline needs to set prices to maximize revenue. Higher prices will bring in more money per seat but may reduce the number of people flying. To make the pricing decision, the airline needs a prediction of the demand for those seats, which may vary with price, day of the week, time of day, time of year, origin and destination of the flight, and so on. Another example: merchants and social media sites must choose which products or posts to display to a viewer. Merchants have many products, and social media has many news feeds, tweets, and competing blog entries. The people who manage these websites want to promote the products or postings most likely to cause a viewer to respond. To identify viable products or postings, the site managers construct predictive models based on earlier viewers’ choices. We will study prediction models in Lessons 25 and 26.
Intervening in a system. Such interventions occur on both grand scales and small: changes in government policies such as funding for preschool education or subsidies for renewable energy, closing a road to redirect traffic or opening a new highway or bus line, changing the minimum wage, etc. Before making such interventions, it is wise to know what the consequences are likely to be. Figuring this out is often a matter of understanding how the system works: what causes what. Since interventions often affect many individuals at once, the goal may be to shift the overall pattern across individuals rather than to predict how each individual will be affected.
This Lesson focuses on “effect size,” a measure of how changing an explanatory variable will play out in the response variable. Built into the previous sentence is an assumption that the explanatory variable causes the response variable. In Lessons 28 through 31, we will look into ways to make responsible claims about whether a connection between variables is causal. Here, we will focus on the calculation and interpretation of effect size.
Effect size: Input to output
An intervention changes something in the world. Some examples are the budget for a program, the dose of a medicine, or the fuel flow into an engine. The thing being changed is the input. In response, something else in the world changes, for instance, the reading ability of students, the patient’s levels of serotonin (a neurotransmitter), or the power output from the engine. The thing that changes in response to the change in input is called the “output.”
“Effect size” describes the change in the output with respect to the change in the input. The simplest case is when the output is a quantitative variable. In this case, the change in the output is a difference between two numbers. The form of the effect size depends on the input type. For example, for a quantitative input, the effect size will be a ratio, that is, a rate. (For calculus students: the effect size is a derivative of the output with respect to the input.)
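In symbols, for a quantitative input the effect size is

\(\text{effect size} = \dfrac{\Delta\, \text{output}}{\Delta\, \text{input}}\,\),

the change in output divided by the change in input that produced it.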
To measure an effect size from data, construct a model with the output as the response variable and the input as an explanatory variable.
A person buying a car typically has multiple objectives in mind. Perhaps the buyer is deciding whether to order a more powerful engine. This decision has consequences, including a reduction in fuel economy. The decision variable—the engine size—is the input; the fuel economy is the output.
Since both input and output are quantitative, the effect size will be a rate: change in fuel economy per change in engine size. To inform a decision, use data such as the math300::MPG data frame, which compares various car models. MPG records the engine size in terms of displacement, in liters. Fuel economy is listed in miles per gallon, separately for city and highway driving.
The buyer is debating between a 2-liter and a 3-liter engine. Most driving will be in the city. To calculate the effect size, first build a model with the output (mpg_city) as the response variable and the input (displacement) as an explanatory variable.
Second, evaluate that model for the range of inputs under consideration.
Mod <- lm(mpg_city ~ displacement, data=MPG)
model_eval(Mod, displacement=c(2, 3))
displacement .output .lwr .upr
------------- --------- --------- ---------
2 24.01437 15.91915 32.10959
3 20.86976 12.77698 28.96254
The change in the input from 2 liters displacement to 3 liters leads to a change in fuel economy of \(20.9 - 24.0 = -3.1\) miles per gallon. The change in displacement is \(3 - 2 = 1\) liter. The effect size is the ratio between the output change and the input change. Here, that is -3.1 miles per gallon per liter.
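The same rise-over-run arithmetic can be carried out directly on the model_eval() report. Here is a minimal sketch; the name vals is introduced just for illustration:

vals <- model_eval(Mod, displacement=c(2, 3))
# Effect size: change in output divided by change in input
diff(vals$.output) / diff(vals$displacement)   # about -3.1 mpg per liter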
The decision-maker may be more concerned about the cost of driving than with the miles per gallon. Then the appropriate response variable might be EPA_fuel_cost, denominated in dollars per year.
Mod2 <- lm(EPA_fuel_cost ~ displacement, data=MPG)
model_eval(Mod2, displacement=c(2, 3))
displacement .output .lwr .upr
------------- --------- --------- ---------
2 1585.887 1000.649 2171.125
3 1882.534 1297.473 2467.596
The change in output is about $300 per year, while the change in input is still 1 liter. The effect size is, therefore, about $300 per year per liter.
Some decision variables are categorical. For instance, the buyer might like the idea of an engine that automatically turns off when the car is stopped at a light or in traffic. The start_stop variable, which has categorical levels “Yes” and “No,” records whether the car has this feature. Effect size estimation is slightly different when the input is categorical rather than quantitative. Still, build a model and compare the change in output to the change in input:
Mod3 <- lm(EPA_fuel_cost ~ start_stop, data=MPG)
model_eval(Mod3, start_stop=c("No", "Yes"))
start_stop .output .lwr .upr
----------- --------- --------- ---------
No 1872.193 916.0164 2828.369
Yes 1945.194 989.0637 2901.324
In this case, the change in output is $73 per year; the change in input is “Yes” - “No.” But, of course, it is meaningless to subtract one categorical level from another. Consequently, the effect size of start_stop on fuel cost cannot be quantified as a ratio. So, instead, the effect size is simply the difference in the output: a $73 per year increase with the Start/Stop feature.
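With a categorical input there is nothing to divide by, so the calculation stops at the difference in outputs. A minimal sketch, again using vals purely for illustration:

vals <- model_eval(Mod3, start_stop=c("No", "Yes"))
# Effect size: difference in output, "No" to "Yes"
diff(vals$.output)   # about +73 dollars per year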
The statistical thinker knows to pay attention to whether a calculated result makes sense. It seems unlikely that the Start/Stop feature causes more fuel to be consumed. Was there an error? Perhaps we did the subtraction backward? Check the report from model_eval() to make sure.
Here, the problem is not arithmetic. However, there is another possibility. It might be that manufacturers include the Start/Stop feature with big cars but not little ones. Then, even if Start/Stop might save gas when everything else is held constant, because the big cars use more fuel than little cars, it only appears that Start/Stop hurts fuel economy. This theory is, at this point, speculation: a hypothesis. Such a mixture of effects—big versus small car mixed with availability of Start/Stop—is called “confounding.” In Lessons 28 through 30, we discuss identifying and dealing with possible confounding.
The surprising positive effect size of the Start/Stop feature caused a double take and led us to think of ways to make sense of the result. Right now, we simply have a hypothesis that Start/Stop is associated with bigger cars. (We will check that out in a little bit.)
The effect size of annual fuel cost with respect to engine displacement, $300 per year per liter, did not surprise us. Perhaps it should have. After all, larger vehicles tend to have larger engines. This relationship might lead to confounding between vehicle size and engine displacement. We think we are looking at engine displacement, but instead, the effect might be due to vehicle size. Again, just a hypothesis at this point. The statistical thinker knows to consider possible confounding from the start.
Categorical outputs
Sometimes the relevant effect size involves a categorical output variable. A case in point is the possible confounding of the Start/Stop feature with vehicle size. To investigate this, we should build a model with Start/Stop as the output and vehicle size as the input.
In this case, the issue of whether vehicle size causes Start/Stop is not essential. We are not concerned with the decisions made by automobile designers so much as with the possible confounding.
When the output variable is categorical, it is not reasonable to calculate the change in output as the difference in categories. As before, “Yes” - “No” is not a number. Still, there is a meaningful and helpful way to quantify a change in a categorical output.
The essential insight is quantifying the change in output in terms of probabilities. For instance, a small effect size would reflect a slight chance of the output changing from one level to another.
The appropriate approach for a categorical output is to transform the output into a zero-one variable, as introduced in Lesson 1. We will present this in a demonstration here and return to the topic more fully in Lesson 34.
As described earlier, we are interested in the possibility that Start/Stop is available mainly on large, higher-fuel-consumption cars. If so, that would explain why the effect size we calculated of fuel cost with respect to Start/Stop was positive.
The model we build will have a zero-one encoding of Start/Stop as the response and the vehicle’s fuel cost as the explanatory variable.
MPG <- MPG %>%
  mutate(has_start_stop = zero_one(start_stop, one="Yes"))
Mod4 <- lm(has_start_stop ~ EPA_fuel_cost, data = MPG)
model_eval(Mod4, EPA_fuel_cost=c(1600, 2000))
EPA_fuel_cost .output .lwr .upr
-------------- ---------- ----------- ---------
1600 0.4901341 -0.4891981 1.469466
2000 0.5207835 -0.4583924 1.499959
The .output here is interpreted as a probability of start_stop having the value “Yes.” (That is because we set one="Yes" in the zero_one() conversion.) The model_eval() report indicates that a $400 per year increase in fuel cost is associated with a three percentage point increase in the probability of a vehicle having a Start/Stop feature. That is a small effect, so we see little support for our hypothesis that Start/Stop tends to be installed on larger, more fuel-hungry vehicles.
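If a per-dollar version of this effect size is wanted, the same rise-over-run arithmetic applies, now on the probability scale. A minimal sketch:

vals <- model_eval(Mod4, EPA_fuel_cost=c(1600, 2000))
# Change in probability per dollar of annual fuel cost
diff(vals$.output) / diff(vals$EPA_fuel_cost)   # roughly 0.00008 per dollar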
Multiple explanatory variables
When a model has more than one explanatory variable, each has a different effect size.
As an example, consider the price of books. We have some data that might be informative, moderndive::amazon_books. What is the effect size of page count on price? The appropriate model here is list_price ~ num_pages. The effect size is easy to compute:
Mod1 <- lm(list_price ~ num_pages, data = moderndive::amazon_books)
model_eval(Mod1, num_pages = c(200, 400))
num_pages .output .lwr .upr
---------- --------- ----------- ---------
200 15.82014 -11.636987 43.27726
400 19.79643 -7.637503 47.23037
We elected to compare 200-page books with 400-page books simply because those seem like reasonable book lengths. According to the model, the longer book costs about 4 dollars more. So the effect size, to judge from this model, is $4 divided by 200 additional pages, which comes to 2 cents per page.
Another effect size is needed to address the question: Are hardcovers more expensive than paperbacks? The output is still price. But now, the input is categorical. In the moderndive::amazon_books data frame, the variable hard_paper has levels “P” and “H.” A possible model: list_price ~ hard_paper.
Mod2 <- lm(list_price ~ hard_paper, data = moderndive::amazon_books)
model_eval(Mod2, hard_paper = c("P", "H"))
hard_paper .output .lwr .upr
----------- --------- ---------- ---------
P 17.13523 -10.62291 44.89338
H 22.39393 -5.46052 50.24839
A hardcover book costs about $5.25 more than a paperback book. Since the input is categorical, there is no change of input to divide by, so the effect size is $5.25 when going from a paperback to a hardcover.
We could look at the effects of page length and cover type separately. Instead, we can include both as explanatory variables in a single model.
Mod3 <- lm(list_price ~ hard_paper + num_pages, data = moderndive::amazon_books)
model_eval(Mod3, hard_paper = c("P", "H"), num_pages=c(200, 400))
hard_paper num_pages .output .lwr .upr
----------- ---------- --------- ----------- ---------
P 200 14.52494 -12.641928 41.69182
H 200 19.48253 -7.785720 46.75077
P 400 18.43605 -8.709404 45.58151
H 400 23.39363 -3.847698 50.63497
This output requires some interpretation. We have short and long paperback books and short and long hardcover books. What should we compare to what?
The convention is to consider each of the two inputs separately and hold the other input constant when we compare.
Effect size of num_pages on list_price. To hold hard_paper constant, we will compare the two rows of the model_eval() report that have a “P” value for hard_paper. The difference in output for these two rows is $3.91. The effect size divides by the change in input (200 pages), so the effect size is just under 2 cents per page.

Effect size of hard_paper on list_price. This time we will hold num_pages constant, say at 200 pages. Comparing the corresponding rows in the model_eval() output shows a change in list price of $4.96 when going from paperback to hardcover.

There is no special reason we decided to hold hard_paper constant at “P” rather than “H” or to hold num_pages constant at 200 rather than 400. In general, the effect size will depend on the value being held constant. Choose a value that’s relevant to the purpose at hand.
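Both hold-one-constant calculations can also be read off the model_eval() report programmatically. The following sketch assumes the row order shown in the printed report; the names vals, paper, and short are introduced only for illustration:

vals <- model_eval(Mod3, hard_paper = c("P", "H"), num_pages = c(200, 400))
# Effect size of num_pages, holding hard_paper constant at "P"
paper <- vals[vals$hard_paper == "P", ]
diff(paper$.output) / diff(paper$num_pages)   # just under 2 cents per page
# Effect size of hard_paper, holding num_pages constant at 200 pages
short <- vals[vals$num_pages == 200, ]
diff(short$.output)   # about $4.96, paperback to hardcover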
In these Lessons we are building models with additive effects. That is what the + means in, say, list_price ~ hard_paper + num_pages. We do this to keep the effect-size story as simple as possible. (Occasionally, you will see examples with multiplicative effects, called “interactions.” The tilde expressions for such models involve * rather than +, as in list_price ~ hard_paper * num_pages.)
When there are multiple explanatory variables, a few more “shapes” are available. To illustrate, suppose g is a categorical variable. Then additional available shapes are:

y ~ x + g is a set of parallel sloping-line functions, one for each level of g.

y ~ x * g is a set of not-necessarily-parallel sloping-line functions.
There are many other possibilities if the shape we have in mind is curved or discontinuous rather than a sloping line. For the sake of simplicity, we will not deal with the curved shapes in these Lessons. Besides, the sloping-line shapes are by far the most widely used in practice.
Confidence intervals
Statistical thinkers know that any estimate they make, including estimates of effect sizes, involves sampling variation. Consequently, in reporting an effect size, always give an interval estimate: the confidence interval.
These Lessons usually involve model specifications that are linear, for example y ~ x + a. For such models, the effect size with respect to each variable is identical to the regression coefficient for that variable. Consequently, the confidence interval on the coefficient is also the confidence interval on the effect size.
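Since the effect size matches the coefficient for such specifications, base R’s confint() function applied to a fitted model reports the confidence interval on the effect size directly. For instance, revisiting the book-price model from earlier:

Mod1 <- lm(list_price ~ num_pages, data = moderndive::amazon_books)
confint(Mod1)   # the num_pages row is the confidence interval on the effect size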
The confidence interval communicates to the decision-maker the uncertainty in the estimated quantity. Sophisticated decision-makers keep this uncertainty in mind, considering the range of outcomes likely from whatever use they make of the effect size. For example, suppose you have an effect size such as \(32 \pm 15\). In considering possible decisions, keep in mind the entire interval, not just the point estimate 32. Suppose the effect size 32 would lead you to decision \(\mathbb{A}\). If you would make a different decision \(\mathbb{B}\) for an effect size of, say, 20, then the precision of the effect size doesn’t enable you to distinguish meaningfully between decisions \(\mathbb{A}\) and \(\mathbb{B}\).
A particularly common and important situation involves deciding whether there is evidence to support any relationship at all between an explanatory variable and the response. You can think of this as deciding whether you have detected from your data whether a relationship exists. Whenever the confidence interval on the effect size with respect to that variable includes zero, a plausible conclusion is that there is no relationship between that variable and the response variable.
Statistically naive decision-makers—even highly educated decision-makers can be statistically naive—look at the interval and sometimes ask the modeler, “Just give me a number. I don’t know what to do with two numbers.” Such a request might elicit a frank response: “If you don’t know what to do with two numbers, you also won’t know what to do with one number.” Unfortunately, that kind of frankness is not often well received; a reasonable alternative is: “The interval indicates the amount of uncertainty in the result. We’ll need to collect more data if you want to reduce the uncertainty.”
You can even estimate how much more data would be needed. Suppose the confidence interval were \(12 \pm 20\) estimated from a sample of size \(n=25\). Since this interval includes zero, it does not point definitively to the existence of a relationship. But the margin of error, 20 in this example, scales as \(1/\sqrt{n}\). If you make \(n\) bigger, you can expect the margin of error to become smaller: more data means better precision! How much better? If you quadruple \(n\), the margin of error will be about half as big. So \(n=100\) will give a margin of error about half the size of \(n=25\). In other words, with \(n=100\) the margin of error would be about 10.
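The \(1/\sqrt{n}\) scaling is simple enough to verify with a line or two of arithmetic. A minimal sketch, assuming the margin of error scales exactly as \(1/\sqrt{n}\):

# Margin of error scales as 1/sqrt(n)
moe_25 <- 20
moe_100 <- moe_25 * sqrt(25 / 100)   # gives 10, half the original margin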
Keep in mind, however, this paradox. Although we know that a sample size of \(n=100\) will produce a margin of error half the size of that from a sample size of \(n=25\), we cannot expect the point estimate (12, in this example) to remain at that value in the larger sample. All we know is that in the larger sample the point estimate will plausibly be somewhere in the interval \(12 \pm 20\). So plausible results from the larger sample might be \(-8 \pm 10\) or \(32 \pm 10\) or anywhere in between. Even though we can estimate the size of the margin of error for the larger sample, to know the overall result, you have to collect that sample!