3 Variation and density
Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions. – Stephen Jay Gould (1941–2002), paleontologist and historian of science.
The point plots introduced in Lesson sec-pointplots are designed to show relationships between variables. Much of statistical thinking has to do with discovering, quantifying, and verifying such relationships. We have already introduced a concept structure for talking about relationships between variables: one variable is identified as the response variable and others as the explanatory variables. In point plots, we use the vertical axis for the response variable and the horizontal axis, color, and faceting for the explanatory variables.
Our focus in this Lesson is on describing the response variable. Remember that the origin of the word “variable” is in the specimen-to-specimen variation in values. We will look at variation in two distinct ways: 1) the shape of the variation, and 2) the amount of the variation.
The “shape” of variation
We turn to a familiar situation to illustrate variation: pregnancy and the duration of gestation, that is, the time from conception to birth. It’s well known that typical human gestation is about 9 months. But it varies from one birth to another. We can describe this variation using the Births2022 data frame, which is a random sample of 20,000 births from the Centers for Disease Control’s census of 3,699,040 US births in 2022. The duration variable records the (estimated) period of gestation in weeks.
Figure fig-gestation-duration1(a) shows just the duration variable. It’s easy to see that durations longer than 45 weeks are pretty rare. “Extremely preterm” births, defined as births before the 28th week of gestation, are also uncommon. Most common are births at about 39 weeks, that is, about 9 months. The (vertical) spread of the dots shows the extent of variation in duration. The most common outcomes are at the value of duration where the dots have the most “density.”
The plots use duration ~ 1, where the 1 is a placeholder for the explanatory variables.
Births2022 |> pointplot(duration ~ 1,
  point_ink = 0.1, size = 0.2, jitter="y") # details
Births2022 |> pointplot(duration ~ 1, annot="violin",
  point_ink = 0.1, size = 0.2, bw=0.5, jitter="y") # details
Figure fig-gestation-duration1: The duration (in weeks) of gestation for each of 20,000 randomly selected 2022 births in the US.

For many people, the dots drawn in a point plot are reminiscent of seeds or pebbles scattered across an area. Density can be high in some areas, lower in others, and negligible or nil in still others. The spatial pattern of density is called the “distribution” of the variable.
Evidently, many people can perceive density in a point plot without any need to count or calculate; it is an intuitive mode of perception. To illustrate, Figure fig-density-explain is a made-up point plot with five patches of different densities. The densities are 25, 50, 100, 200, and 400 points per unit area. Many people would find it easy and immediate to point out the least and most dense patches and even to put the patches in order by density. However, people are hard put to quantify even the relative densities. For instance, the largest patch has a smaller density than the next largest patch, but quantifying this by eye (without being told the densities) is not really possible.
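The idea of “points per unit area” can be written out as a small calculation. The following is only a sketch: the patch sizes are invented for illustration (they are not the dimensions used to draw the figure), and only the five densities come from the text.

```r
# Made-up patches: widths and heights are illustrative assumptions,
# chosen so the densities match the five values named in the text.
patches <- data.frame(
  patch  = c("A", "B", "C", "D", "E"),
  width  = c(1, 1, 2, 0.5, 1),
  height = c(1, 2, 1, 0.5, 0.5),
  target_density = c(25, 50, 100, 200, 400)
)

# Area of each patch and the number of points needed to reach its density
patches$area     <- patches$width * patches$height
patches$n_points <- round(patches$target_density * patches$area)

# Density is a count divided by an area, not the count itself
patches$density <- patches$n_points / patches$area

patches[, c("patch", "area", "n_points", "density")]
```

In this made-up setup, patches C and E contain the same number of points but differ fourfold in density. Counting dots does not reveal density; the area matters too.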
Our eye gives a qualitative estimate of relative density, not a precise quantitative one. Our graphical perception is much more precise when it comes to length or width. Ingeniously, designers of statistical graphics have created an annotation—called a “violin”—that shows the density in terms of width. Figure fig-gestation-duration1(b) adds a violin annotation to the point plot.
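To see how density can be turned into a width, here is a minimal sketch using base R’s density() function. The data are simulated stand-ins for gestation durations, not the Births2022 values, and the bandwidth of 0.5 is chosen only to echo the bw argument in the plotting commands, which presumably plays a similar role there.

```r
set.seed(2022)

# Simulated durations in weeks (an assumption for illustration, NOT Births2022):
# most births near 39 weeks, plus a smaller group of earlier births.
duration <- c(rnorm(1800, mean = 39, sd = 1.2),
              rnorm(200,  mean = 34, sd = 3))

# Kernel density estimate: a smooth curve giving the density at each value
dens <- density(duration, bw = 0.5)

# A violin draws this curve as a width. Here the width is rescaled
# so that the widest part of the violin has half-width 1.
half_width <- dens$y / max(dens$y)

# The value of duration where the violin is widest, i.e. where density is highest
peak <- dens$x[which.max(dens$y)]
peak
```

The violin is fat where the dots are dense and thin where they are sparse, so comparing densities becomes a matter of comparing widths.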
Violins can be informative when comparing two or more levels of an explanatory variable. To illustrate, consider the duration of gestation for twins versus singletons. Let’s see if the distribution of durations is different for the different kinds of birth.
Births2022 |>
  filter(plurality < 3) |>
  mutate(plurality = factor(plurality, labels = c("singleton", "twin"))) |>
  pointplot(duration ~ plurality, annot="violin",
            point_ink = 0.1, size = 0.2, bw=0.5, jitter="y", model_ink=0.5) # details
Figure fig-duration-plurality: duration shown separately for singletons and twins.

Evidently, the density of points is vastly different for different levels of plurality. The jitter-column of singletons is much denser than that of twins: singletons are much more common than twins.
Even though there are many more singletons than twins, the violins are roughly the same width. This is by design. The violins in Figure fig-duration-plurality tell the story of birth-to-birth variation of duration within each group. For twins, durations near 36 weeks are much more common than durations near 39 weeks. Similarly, comparing the two violins shows that premature births are much more likely for twins than for singletons. We can see this from the violins despite the fact that the large majority of premature births are of singletons.
It’s worth emphasizing the previous point, since it will be fundamental to many aspects of statistical thinking.
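A small calculation can make the distinction concrete. The counts below are made up for illustration; they are not taken from Births2022. The sketch contrasts the premature rate within each group (what a violin reflects) with each group’s share of all premature births.

```r
# Made-up counts (assumptions for illustration only)
births <- data.frame(
  plurality   = c("singleton", "twin"),
  n_births    = c(19400, 600),
  n_premature = c(1552, 360)
)

# Within each group: what fraction of that group's births are premature?
births$rate_within_group <- births$n_premature / births$n_births

# Across groups: what fraction of all premature births come from each group?
births$share_of_all_premature <- births$n_premature / sum(births$n_premature)

births
```

With these invented numbers, 60% of twin births are premature compared to 8% of singleton births, yet about four out of five premature births are singletons. Both statements hold at once because singletons vastly outnumber twins.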
Some simple shapes
There are infinitely many different shapes of distributions. Even so, a few simple shapes are common. These are shown in panels (a)-(d) of Figure fig-violin-shapes. Panel (e), of course, is a more complicated shape, one not so often seen in practice. (Unless you are practicing music rather than statistics!)
Figure fig-violin-shapes(a) is a “uniform” distribution, where each of the possible values is more or less equally likely. It’s not so common to see this in real-world data. When you do, it’s a good sign that there is something artificial or mathematical behind the data-generating process.
Much more common is the so-called “normal” distribution of Figure fig-violin-shapes(b). The name given to it, “normal,” is an indication of how commonly it is seen. There is a region of highest density at middle values, with the density falling off in a “bell-shaped” fashion symmetrically toward higher and lower values.
Other common patterns in distribution have a peak (like the normal distribution), but have “tails” that extend much further than for the normal. These are sometimes called “long-tailed” distributions. In Figure fig-violin-shapes(c) the long tails are symmetrical around the peak, while in Figure fig-violin-shapes(d) there is only one long tail. Such one-sided, long-tailed distributions are called “skew” distributions. Skew distributions are particularly common in economic data such as personal or national income.
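The four simple shapes can also be told apart numerically. The sketch below simulates one stand-in for each shape; the particular generators (a t distribution with 3 degrees of freedom for the symmetric long-tailed shape and a log-normal for the skew shape) are assumptions chosen for illustration, as are the two summary measures.

```r
set.seed(3)
n <- 10000

# Simulated stand-ins for the four shapes (assumed generators)
shapes <- list(
  uniform     = runif(n),        # flat between 0 and 1
  normal      = rnorm(n),        # bell-shaped
  long_tailed = rt(n, df = 3),   # symmetric, long tails on both sides
  skew        = rlnorm(n)        # one long tail toward high values
)

summarize_shape <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.50, 0.75)))
  # Distance from the median, in units of a robust measure of spread
  z <- (x - median(x)) / mad(x)
  c(
    # Near 0 when the middle half of the data is symmetric around the median
    quartile_skew  = ((q[3] - q[2]) - (q[2] - q[1])) / (q[3] - q[1]),
    # Fraction of points lying far out in each tail
    frac_far_below = mean(z < -4),
    frac_far_above = mean(z > 4)
  )
}

shape_summary <- sapply(shapes, summarize_shape)
round(shape_summary, 3)
```

In such a table, the uniform and normal columns have essentially nothing far out in either tail, the long-tailed column has points far out on both sides, and the skew column has them on one side only.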
There have been important consequences to ignoring skewness in favor of “well behaved,” short-tailed distributions such as the so-called normal distribution. For instance, the 2008 “Great Recession” was due in part to mistakenly high values put on mortgage-backed and other financial securities. Financial analysts used valuation techniques that would be appropriate for normal distributions of risky events, but were utterly inadequate in the face of skew distributions.
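To get a feel for how badly a normal-distribution calculation can understate extreme events, consider the following sketch. It is a toy simulation, not a reconstruction of any actual valuation technique: the “losses” are simulated from a log-normal distribution, and the choice of a threshold 4 standard deviations above the mean is an arbitrary assumption.

```r
set.seed(2008)

# Simulated skew "losses" (log-normal stand-in; NOT real financial data)
loss <- rlnorm(100000, meanlog = 0, sdlog = 1)

# Summarize the losses as if they were normally distributed
m <- mean(loss)
s <- sd(loss)
extreme <- m + 4 * s   # an "extreme" loss: 4 standard deviations above the mean

# Chance of exceeding that level according to the normal model ...
p_normal_model <- pnorm(extreme, mean = m, sd = s, lower.tail = FALSE)

# ... compared with how often it actually happens in the simulated data
p_observed <- mean(loss > extreme)

c(normal_model = p_normal_model,
  observed     = p_observed,
  ratio        = p_observed / p_normal_model)
```

The normal model rates such an event as roughly a 3-in-100,000 occurrence. In the skew simulated data it happens far more often than that.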
An important setting for skew distributions concerns extreme events, such as large storms and fires.
Monocacy_river |> pointplot(precip ~ 1, annot="violin")
US_wildfires |> pointplot(area ~ 1, annot="violin")
Exercises
Students shopping for textbooks are often surprised by extremely high prices for some books, while prices for others are moderate. In this exercise, we’ll look at one possible factor influencing book price: whether the book is hardcover or paperback.
The graph shows the list price of books (according to the moderndive::amazon_books data frame) broken down by the cover format.
A. Does the observed distribution of prices support a claim that paperbacks tend to be less expensive than hardcovers? Answer: There are both very expensive and very cheap books in each of the two cover formats. For paperbacks, however, a very large fraction are priced close to $20, while hardcovers are predominantly in the $20-30 range.
Perhaps the expensive paperback books are that way because they have a lot of pages. To investigate this possibility, we can look at the number of pages in the two cover formats, as in the following graph:
B. Briefly summarize what the graph shows about the relationship between cover format and page count. Answer: The distributions are very similar.
C. Are there more paperbacks represented in the moderndive::amazon_books data frame or more hardcovers? Answer: There are many more dots in the paperback column. Since there is one dot for each row of the data frame, there are more paperbacks than hardcovers.
DRAFT: Do we want to make an exercise around this? What would be the questions asked?
Consider this violin plot
For each group, judge by eye what fraction of the data points have a value of 2 or below.
Which of the three groups has multiple peaks in its density?
Which group has the lowest median?