Anthro_F |> pointplot(BFat ~ ntiles(Height, 5, format="interval"), annot="model")
4 Annotating point plots with a model
Lesson sec-variation-and-distribution introduced the violin-plot annotation to display graphically the “shape” of variation: which values are more common, which values rare, and which values never seen at all. In this Lesson, we turn to a completely different sort of annotation showing a “statistical model.” Models provide a way to summarize quantitatively the relationships among variables.
Simple models
We define a “simple model” as a model with a single explanatory variable, as opposed to multiple explanatory variables. (All the models we consider in these Lessons have a single response variable.) To illustrate, let’s return to the anthropometric measurements displayed in Figure fig-wrist-ankle where the explanatory variable is ankle circumference. Adding a statistical model annotation is accomplished by using the argument `annot = "model"`:
Anthro_F |> pointplot(Wrist ~ Ankle, annot = "model")
Figure: Wrist ~ Ankle point plot with a statistical model.
The model annotation is drawn as a more-or-less straight band, shaded blue to help distinguish it from individual data points. By default, pointplot() looks for a “linear” pattern in the data; the band is straight because we asked for it to be straight. The particular band presented by pointplot() is the one that comes as close as possible to the data points. “As close as possible” is defined in a very specific way, which we’ll investigate later; for now it suffices to notice that the band goes nicely through the cloud of data points.
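As a rough sketch of what a straight-line band involves, the following uses base R's lm() and predict() on made-up numbers. It assumes the band is a least-squares line with a 95% confidence interval around it, which may not be exactly what pointplot() draws. The values and the names ankle and wrist are invented for illustration and are not taken from Anthro_F.

```r
# Made-up measurements (cm); NOT taken from Anthro_F.
ankle <- c(20, 21, 22, 23, 24, 25)
wrist <- c(14.8, 15.3, 15.4, 16.1, 16.2, 16.9)

# Assumption: "as close as possible" is taken here to mean least squares.
fit <- lm(wrist ~ ankle)

# Lower ("lwr") and upper ("upr") edges of a 95% confidence band,
# evaluated at each ankle value; "fit" is the line itself.
band <- predict(fit, newdata = data.frame(ankle = ankle),
                interval = "confidence", level = 0.95)
print(round(band, 2))
```

Each row of the printed matrix gives the height of the fitted line and the two edges of the band at one ankle value, so the band is really a pair of curves bracketing the line.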
The explanatory variable in Figure fig-wrist-ankle-annot is quantitative. Model annotations can also be drawn for categorical explanatory variables. To illustrate, consider the data in Birdkeepers, which was used in a study of smoking, bird-keeping, and lung cancer. The unit of observation is an individual person. The variable YR records the number of years that person smoked, while LC is a categorical variable indicating whether the person had been diagnosed with lung cancer. The data and a model annotation are shown in Figure fig-birdkeepers-A.
Birdkeepers |> pointplot(YR ~ LC, annot="model")
For a categorical explanatory variable, the model annotation is a vertical band (or “interval”) for each of the categorical levels. As with Figure fig-wrist-ankle-annot, the model annotation in Figure fig-birdkeepers-A is vertically centered among the data points.
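The one-interval-per-level idea can be sketched in base R as well. This is only an illustration under assumptions: the numbers are invented (not from Birdkeepers), the level names are hypothetical, and the intervals are ordinary 95% confidence intervals from lm(), which may differ from what pointplot() computes.

```r
# Made-up data: years smoked for two hypothetical groups of five people.
years  <- c(20, 25, 30, 35, 40, 10, 15, 20, 25, 30)
cancer <- factor(rep(c("LungCancer", "NoCancer"), each = 5))

# With a categorical explanatory variable, the fitted value for each
# level is that group's mean.
fit <- lm(years ~ cancer)

# One interval per level of the explanatory variable.
levels_df <- data.frame(cancer = factor(levels(cancer), levels = levels(cancer)))
intervals <- predict(fit, newdata = levels_df, interval = "confidence")
rownames(intervals) <- levels(cancer)
print(round(intervals, 1))
```

The "fit" column holds the group means, and each interval extends equally above and below its mean, which is one way to read "vertically centered among the data points."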
In later Lessons, we will discuss how pointplot() chooses the specific model annotation shown in any given case. But consider these closely related questions:
- Why are the model annotations shown as a band or interval, rather than as a single, simple line or single numerical value for each category?
- What do the model annotations tell us?
You might have encountered statistical graphics such as those shown in Figure fig-point-estimate. With a numerical explanatory variable, the model annotation is a simple, straight line. For a categorical explanatory variable, there is a single value on the vertical axis for each level of the explanatory variable. In reality, there is a range of different lines that are plausible models, not just a single one. Statistical thinking makes extensive use of this idea of a range of plausible models, rather than the single models depicted in Figure fig-point-estimate. Any straight line that falls into the band in Figure fig-wrist-ankle-annot is a plausible model of the data. In Figure fig-birdkeepers-A, any pair of values, one falling into each vertical interval, is a plausible model of the data.
Each of the plausible models, for instance those in Figure fig-point-estimate, describes a specific relationship between the response and explanatory variables. For the wrist/ankle relationship, the models all show a “trend” between ankle size and wrist size. For the smoking-years/lung-cancer relationship, the people with lung cancer “tend” to have smoked for more years than the no-cancer people.
The words “trend” or “tend” are very weak. Often, statistical thinkers are interested in stronger statements, like these:
- Larger ankles cause larger wrists.
- Smoking for more years increases the chances of lung cancer.
We can call these opinionated statements because they make use of some hypothesis about how the world works held by the modeler rather than being forced solely by the data. Many people would have the sensible opinion that “larger ankles cause larger wrists” is silly. It seems much more likely that “larger people have larger wrists and also larger ankles.” On the other hand, many people will be sympathetic to the statement “increases the chances of lung cancer.” They have heard such things from other respected sources.
Some of the techniques covered in these Lessons are designed to substantiate or undermine opinionated statements like these. Until we understand and use these techniques, it is dicey to quantitatively support an opinionated statement from data.
Many statisticians prefer to avoid the whole matter of opinionated statements. Weak, unopinionated language like “trend” or “tend” is used instead. Those preferring more technical-sounding language might use “associated with” or “correlated with.”
Independence
We use model annotations to display whether variables are related. It’s good to consider as well a particular type of relationship: independence. When the explanatory variable is categorical, the model annotations will be a vertical interval for each level. When the response is independent of the explanatory variable, those intervals will overlap. For instance, in Figure fig-independence(a) values of YR near 30 are in both vertical intervals.
For a quantitative explanatory variable, as in Figure fig-independence(b), independent variables will have a model band that is more-or-less horizontal. More precisely, at least one horizontal line will fall within the band.
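The "at least one horizontal line falls within the band" criterion can be turned into a small check. The sketch below is an illustration only: horizontal_line_fits() is a hypothetical helper, not part of any package, the data are invented, and the band is assumed to be a 95% confidence band from lm(), which may not match the band pointplot() draws.

```r
# Hypothetical helper (not part of any package): TRUE when at least one
# horizontal line lies inside the confidence band over the observed x range.
horizontal_line_fits <- function(x, y, level = 0.95) {
  fit  <- lm(y ~ x)
  grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
  band <- predict(fit, newdata = grid, interval = "confidence", level = level)
  # A horizontal line fits if it can sit above every lower edge
  # and below every upper edge of the band.
  max(band[, "lwr"]) <= min(band[, "upr"])
}

x <- 1:8
wiggle <- c(1, -1, -1, 1, 1, -1, -1, 1)   # made-up pattern with no trend in x

unrelated <- horizontal_line_fits(x, wiggle)                # y does not depend on x
related   <- horizontal_line_fits(x, 2 * x + 0.1 * wiggle)  # y rises with x
print(c(unrelated = unrelated, related = related))
```

With these invented numbers, the first call finds a horizontal line inside the band and the second does not, mirroring the difference between a flat band and a clearly sloping one.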
Birdkeepers |> pointplot(YR ~ BK, annot="model", model_ink=0.7) +
  geom_hline(yintercept=30, color="green") + ylab("Age (years)") + xlab("Is a birdkeeper?")
Anthro_F |> pointplot(BFat ~ Height, annot="model") + geom_abline(intercept=22, slope=0, color="green") +
  ylab("Body fat (%)") + xlab("Height (meters)")
[…] Birdkeepers is independent of whether the person keeps a bird. Panel (b), based on Anthro_F, is about the possible relationship between a person’s height and body fat as a percent of overall mass.
Multiple explanatory variables
In Lesson sec-pointplots we used color and faceting to look at the response variable in terms of up to three explanatory variables. Statistical models can also handle multiple explanatory variables.
We’ll illustrate with a commentary from a political pundit about education spending in US schools:
[T]he 10 states with the lowest per pupil spending included four — North Dakota, South Dakota, Tennessee, Utah — among the 10 states with the top SAT scores. Only one of the 10 states with the highest per pupil expenditures — Wisconsin — was among the 10 states with the highest SAT scores. New Jersey has the highest per pupil expenditures, an astonishing $10,561, which teachers’ unions elsewhere try to use as a negotiating benchmark. New Jersey’s rank regarding SAT scores? Thirty-ninth… The fact that the quality of schools… [fails to correlate] with education appropriations will have no effect on the teacher unions’ insistence that money is the crucial variable.—–George F. Will, (September 12, 1993), “Meaningless Money Factor,” The Washington Post, C7.
The opinionated claim here is that “money is the crucial variable” in educational outcomes. George Will seeks to rebut this claim with data. Fortunately for us, actual data on SAT scores and per pupil expenditures in the mid-1990s is available in the mosaicData::SAT data frame. The unit of observation in SAT is a US state. Figure fig-SAT-one(a) shows an annotated point plot of state-by-state expenditures and test scores. The trend signaled by the model annotation is that SAT scores are slightly lower in high-expenditure states, consistent with Will’s observations.
Education is a complicated matter and there are factors other than expenditures that may be playing a role. One of these, shown in Figure fig-SAT-one(b), is that participation in the SAT varies very substantially from state to state. In some states, almost all students take the test. In others, fewer than 10% of students take the test. The data show a relationship between participation and scores: scores are consistently higher in low-participation states.
SAT |> pointplot(expend ~ frac, annot="model") +
  xlab("Participation (%)") +
  ylab("Per pupil expenditures ($1000s)")
Statistical modeling techniques enable us to use both expenditures and participation as explanatory variables. Figure fig-SAT-one does this with one variable at a time. It is often more informative to use both explanatory variables simultaneously, especially when there is a relationship between the explanatory variables, as seen in the graph of expenditures versus participation (Figure fig-expend-partic).
SAT |> filter(expend < 8) |> pointplot(sat ~ expend + frac + frac, annot="model") + xlab("Expenditures ($1000)")
SAT |> pointplot(sat ~ expend + frac + frac, annot="model") + xlab("Expenditures ($1000)")
SAT |> pointplot(sat ~ frac + expend + expend, annot="model", palette = "D") + xlab("Participation (%)")
For the person starting out in statistical thinking, it is potentially confusing that the same model has different “shapes” depending on which explanatory variable is placed on the horizontal axis. In later Lessons, we will turn to tools for displaying models that don’t introduce such confusion.
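To see why using two related explanatory variables together can tell a different story than using one alone, here is a minimal sketch with invented numbers. Nothing in it is taken from the SAT data frame or is a claim about real states: the variable names mimic the text, the scores are constructed by formula, and ordinary lm() stands in for whatever modeling pointplot() does.

```r
# Made-up numbers for nine hypothetical states; NOT the SAT data frame.
frac   <- c(10, 10, 10, 50, 50, 50, 80, 80, 80)   # participation (%)
expend <- c(4, 5, 6, 5, 6, 7, 6, 7, 8)            # spending ($1000s)

# Scores are built so that, at any fixed participation level,
# more spending goes with higher scores.
score <- 1000 + 10 * expend - 3 * frac

one_var <- lm(score ~ expend)          # expenditure alone
two_var <- lm(score ~ expend + frac)   # expenditure and participation together

print(coef(one_var)["expend"])   # trend in score vs. spending, ignoring participation
print(coef(two_var)["expend"])   # trend in score vs. spending at fixed participation
```

Because high-spending states in this toy example also have high participation, the one-variable model shows scores falling with spending, while the two-variable model recovers the rising trend that was built into the numbers.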
Exercises
Ask whether the model annotation suggests that the response variable is independent of the variable plotted on the horizontal axis.
Anthro_F |> pointplot(Weight ~ Wrist, annot = "model")
Anthro_F |> pointplot(BFat ~ Wrist, annot = "model")
Anthro_F |> pointplot(BFat ~ Height, annot = "model")
Anthro_F |> pointplot(Weight ~ Height, annot = "model")
Anthro_F |> pointplot(Weight ~ MThigh, annot = "model")
Anthro_F |> pointplot(BFat ~ MThigh, annot = "model")
Anthro_F |> pointplot(BFat ~ Weight, annot = "model")
Anthro_F |> pointplot(BFat ~ Height + Weight + Weight, annot = "model")
Anthro_F |> pointplot(BFat ~ Weight + Height + Height, annot = "model")
Maybe make an exercise of this one.
The annotated point plot shows the heights of fully-grown children as a function of their mother’s and father’s heights and of the child’s sex. That is, there are three explanatory variables: mother, sex, and father.
Galton |> pointplot(height ~ mother * sex * father, annot="model",
  point_ink = 0.5, size=0.5, model_ink=0.5)
According to the model, the child’s height increases with both mother’s height and father’s height, and is different between the sexes.
A. What aspect of the model annotation indicates that child’s height is not independent of mother’s height? Answer: The horizontal axis is mother’s height. The annotations slope upward, indicating that child’s height increases with mother’s height. If child’s and mother’s heights were independent, the annotations would be more-or-less horizontal.
B. What aspect of the model annotation indicates that child’s height is not independent of child’s sex? Answer: The annotation bands for the different sexes do not overlap: the M annotation is higher than the F annotation. If child’s height and sex were independent, the annotation bands would overlap.