 1

summarize()
reduces many rows to a single row.  2

Calculate the mean of
height
and name the corresponding variable in the outputmodval
.  3

Break down the calculation in (2) group by group.
sex  modval 

M  69.23 
F  64.11 
Lesson secvariation listed three foundations for statistical thinking and provided instruction on the first one:
In this Lesson, we will take on two and three:
In business, accounting keeps track of money and value: assets, receipts, expenditures, etc. Likewise, in statistics, accounting for variation is the process of keeping track of variation. At its most basic, accounting for variation splits a variable into components: those parts which can be attributed to one or more other variables and the remaining (or residual^{} ) which cannot be so attributed.
The variable being split up is called the â€śresponse variable.â€ť The human modeler chooses a response variable, depending on her purpose for undertaking the work. The modeler also chooses explanatory variables that may account for some of the variability in the response variable. There is almost always something left over, variation that cannot be attributed to the explanatory variables. This leftover part is called the â€śresidual.â€ť
To illustrate, we return to a historical statistical landmark, the data collected in the 1880s by Francis Galton on the heights of parents and their fullgrown children. In that era, there was much excitement, uncertainty, and controversy about Charles Darwinâ€™s theory of evolution.
Everyone knows that height varies from person to person. How much variation in height is there? We quantify variation using the variance. var()
will do the calculation. The units of the output will be in squareinches, since height
itself is measured in inches.
var(height)
has units of â€śsquareinchesâ€ť when the variable height
is measured in inches.price
is 144 square inches. What is the standard deviation of price
? (Make sure to include the units!)height
includes the two values 57 inches and 54 inches. The difference between them is 3 inches. The square of this difference is 9 squareinches. The variance is the average of all the different pairs of values.Galtonâ€™s overall project was to quantify the heritability of traits (such as height) from parents to offspring. Darwin had famously measured the beaks of finches on the Galapagos Islands. Galton, in London, worked with the most readily available local species: humans. Height is easy and socially acceptable to measure, a good candidate for data collection.
Galton sought to divide the height variation in the children into two parts: that which could be attributed to the parents and that which remained unaccounted for, which we call the â€śresidual.â€ť We will start with a mathematically more straightforward project: dividing the variation in the childrenâ€™s heights into that part associated with the childâ€™s sex and the residual.
Listing lstgaltonmean shows how to calculate the average height
for each sex
using the techniques from Lesson secwrangling. The line numbers in the code listing explain each of the components of the command.
summarize()
reduces many rows to a single row.
height
and name the corresponding variable in the output modval
.
sex  modval 

M  69.23 
F  64.11 
To see the results of the command, run it!
Note that the output has one row for each sex
. As always, the output from summarize is set by the arguments to the function. Here, there will be two variables in the output: sex
and the mean height for each sex, which weâ€™re calling modval
, short for model value.
This simple report indicates that the sexes differ, on average, by about 5 inches in height. But this report offers no insight into how much of the persontoperson variation in height is attributable to sex
. For that, we need to look at the group averages in the context of the variation of individuals. That is, instead of computing the model value for each sex
, we will compute it for each specimen using mutate()
. This is simple, since all F
specimens will have a model value of 64.1 inches, while the M
specimens all have a model value of 69.2 inches. However, anticipating what will come later in this Lesson, we will do the calculation using mutate()
.
Note: To save space on the screen, Listing lstgaltonmutate is written to show only a handful of the 898 rows that would otherwise be in the output.
Listing lstgaltonmutate2 repeats the above calculation, but summarizes to show the variance of height
and the variance of the model values.
The output shows that the variability in the model values of height, is about half the variability in the height itself. This is the the answer to (2) from the introduction to this Lesson; sex
accounts for about half of the variation in height.
Item (3) in the introduction was about measuring what remains unexplained. This â€śunexplainedâ€ť part of height is called the residual from the model values. The residual is easy to calculate. Listing lstgaltonresidvar calculates the residuals from the simple model that accounts for height
by sex
.
Itâ€™s essential to note from the output of Listing lstgaltonresidvar that the variance in the response variable (height
) is exactly the sum of the the variance of the model values and the variance of the residuals. In other words, the variance in the response variable is perfectly split into two parts:
Listing lstgaltonresidvar shows how to calculate the residual for each specimen: subtract the model value from the value of the response variable. Letâ€™s look at the values of the residuals from the model of height versus sex.
Which line of Listing lstresidpointplot finds the residual for each of the specimens?
Some residuals are positive and some negative. Explain how a residual can be negative.
If the model value is larger than the response variable, the residual will be negative. Similarly, if the model value is less than the response variable, the residual is positive.
Since Lesson [@secvariationanddistribution] we have used the variance as the prefered way to measure the amount of variation. The advantage of the variance is that it perfectly partitions variation in the response variable into two parts: (i) variation in the model values and (ii) variation in the residuals.
The following chunk demonstrates the perfect partitioning: Run it!
mage
) from the Births2022
data frame by the motherâ€™s education level (meduc
). Is the variation in mage
perfectly partitioned between the model values and the residuals?In addition to replacing Galton
with Births2022
, you will need to replace the explanatory variable sex
with meduc
. Similarly, replace height
with mage
.
sd()
) to measure variation. Modify the above chunk, replacing var()
with sd()
. Looking at the results, does the standard deviation perfectly partition the response variable into model values and residuals?The previous section used the categorical variable sex
as the explanatory variable. Galtonâ€™s interest, however, was in the relationship between childrenâ€™s and parentsâ€™ height.
sex
is a categorical explanatory variable. Using mutate(..., .by = sex)
is a good method of handling categorical explanatory variables. However, this does not work for quantitative explanatory variables. The reason is that .by
translates a quantitative variable into a categorical one, with a result for each unique quantitative value. Itâ€™s trivial to substitute the height of the mother
or father
in place of sex
in the method introduced in the previous section. However, as we shall see, the results are not satisfactory. Galtonâ€™s key discovery was the proper method for relating two quantitative variables such as height
and mother
.
First, letâ€™s try simply substituting in mother
as the explanatory variable and using mean()
to create the model values. Then, use point_plot()
to show the model values as a function of mother
.
The last line of Listing lstgaltonbymother connects adjacent points with lines. Itâ€™s hard to see any pattern in the model values.
It is common sense that the model linking mothersâ€™ heights to childrenâ€™s height should be smooth, not the jagged pattern seen in Listing lstgaltonbymother. The source of the jaggedness is the use of .by = mother
in mutate(modval = mean(height), .by = mother)
. There is nothing in .by = mother
to enforce the idea that the model value for midheight mothers should be inbetween the model values for short and for tall mothers.
The solution to the problem of jagged model values is to avoid the absolute splitting into nonoverlapping groups by motherâ€™s height. Instead, we want to find a smooth relationship. Galton invented the method for accomplishing this. A modern form of his method is provided by the model_values()
function., which we shall use to construct Model3
.
Notice that the .by = mother
step has been entirely removed. Notice also that model_values()
uses the same kind of tilde expression as we have employed when plotting. The response variable is listed on the left of the , the explanatory variable on the right side. In other words, we are modeling the childâ€™s height
as a function of mother
â€™s height.
As always, modeling splits the variance of the response variable into two parts, one associated with the explanatory variable and the other holding whatâ€™s left over: the residual. Hereâ€™s the split for Model3
which uses mother
as an explanatory variable:
The motherâ€™s height doesnâ€™t account for much of the variation in the childrenâ€™s heights.
The model_values()
approach to modeling works for both quantitative and categorical explanatory variables. For convenience, you can annotate point_plot()
to show the model along with the raw data. Try it:
The model value for females is about 64 inches. The tallest female is 71 inches. This makes the largest residual 71  64 = 5 inches.
height ~ mother
. Find a point that has a large negative residual from this model.The models we work with in these Lessons always have exactly one response variable. But models can have any number of explanatory variables.
Whatever the number of explanatory variables and however many levels a categorical explanatory variable has the model splits the variance of the response into two complementary pieces: the variance accounted for by the explanatory variables and the part not accounted for, that is, the residual variance.
To illustrate, here is a sequence of models of height
with different numbers of explanatory variables.
Number of explanatory variables  tilde expression  var(resid) 

0  height ~ 1 

1  height ~ sex 

2  height ~ sex + mother 

3  height ~ sex + mother + father 
Plugging each of these number
>
Galton mutate(modval =
model_values(
..tilde.expression.here.
),resid = height  modval) >
summarize(var(height), var(modval), var(resid))
Three explanatory variables
YOU WERE HERE: Perhaps convert this to an interactive
>
Galton mutate(modval =
model_values(height ~ sex + mother + father),
resid = height  modval) >
summarize(var(height), var(modval), var(resid))
var(height)  var(modval)  var(resid) 

12.8373  8.211706  4.625599 
Two explanatory variables
>
Galton mutate(modval =
model_values(height ~ sex + mother),
resid = height  modval) >
summarize(var(height), var(modval), var(resid))
var(height)  var(modval)  var(resid) 

12.8373  7.212022  5.625283 
One explanatory variable
>
Galton mutate(modval = model_values(height ~ sex),
resid = height  modval) >
summarize(var(height), var(modval), var(resid))
var(height)  var(modval)  var(resid) 

12.8373  6.549134  6.288171 
Zero explanatory variables
>
Galton mutate(modval = model_values(height ~ 1),
resid = height  modval) >
summarize(var(height), var(modval), var(resid))
var(height)  var(modval)  var(resid) 

12.8373  0  12.8373 
When selecting explanatory variables, comparing two or more different models sharing the same response variable is often helpful: a simple model and a model that adds one or more explanatory variables to the simple model. The model with no explanatory variables, is always the simplest possible model. For example, in secmultiplevars1, the model height ~ 1
is the simplest. Compared to the simplest model, the model height ~ sex
has one additional explanatory variable, sex
. Similarly, height ~ sex + mother
has one additional explanatory variable compared to height ~ sex
, and height ~ sex + mother + father
adds in still another explanatory variable.
The simpler model is said to be â€śnested inâ€ť the more extensive model, analogous to a series of Matroshka dolls. A simple measure of how much of the response variance is accounted for by the explanatory variables is the ratio of the variance of the model values divided by the variance of the response variable itself. This ratio is called â€śR^{2}â€ť, pronounced â€śRsquared.â€ť
For instance, R^2 for the model height ~ sex + mother
is \[\text{R}^2 = \frac{7.21}{12.84} = 0.56\] By comparison, R^2 for the simpler model, height ~ sex,
is slightly smaller:
\[\text{R}^2 = \frac{6.55}{12.84} = 0.51\] For all models, \(0 \leq\) R^{2} \(\leq 1\). It is tempting to believe that the â€śbestâ€ť model in a set of nested models is the one with the highest R^{2}, but statistical thinkers understand that â€śbestâ€ť ought to depend on the purpose for which the model is being built. This matter will be a major theme in the remaining Lessons.
Exercise 9.1 The data frame LSTbook::Gilbert
records 1641 shifts at a VA hospital. (For the story behind the data, give the command ?Gilbert
.) Whether or not a patient died during a shift is indicated by the variable death
. The variable gilbert
marks whether or not nurse Kristen Gilbert was on duty during the shift.
A. Are the variables numerical or categorical?
B. The data were used as evidence during Gilbertâ€™s trial for murder. The data can be reduced to a compact form by data wrangling:
::Gilbert >
LSTbooksummarize(count = n(), .by = c(death, gilbert))
death  gilbert  count 

no  not  1027 
yes  not  357 
no  on_duty  157 
yes  on_duty  100 
Calculate the proportion of times a death occurred when Gilbert was on duty, and the proportion of time when Gilbert was not on duty. Does the shift evidence support the claim that Gilbert was a contributing factor to the deaths.
C. In the Lessons, we will mainly use regression modeling rather than wrangling as a datareduction technique. In the trial, the relevant question was whether Gilbert was a factor in the deaths. Express this question as a tilde expression involving the variables death
and gilbert
.
D. The response variable in your tilde expression from (C) is categorical. Regression modeling requires a numerical response variable. Use mutate()
and zero_one()
to transform the variable death
from categorical form to a numerical form that has a value 1 when Gilbert was on duty and 0 otherwise. This will become the response variable for the regression: death ~ gilbert
.
E. Carry out the regression using model_train()
. Look at the coefficients from the fitted model with conf_interval()
; these are in .coef
column. Explain how the coefficients correspond to the proportions you calculated in (B).
Exercise 9.4
DRAFT: Show that the sum of squares and mean square for modeling using the median is larger than for the mean.
Exercise 9.6