2  Data graphics

Table 2.1: Annual exports and imports in the trade between England and the East Indies
Year Exports Imports
1700 180 460
1701 170 480
1702 160 490
1703 150 500
1704 145 510
1705 140 525
1706 135 550
1707 125 565
1708 120 580
1709 110 590
1710 105 625
1711 105 650
1712 100 680
1713 100 710
1714 100 725
1715 100 755

A data frame is a specific way of organizing and storing data. To see the “big picture,” however, it can help to organize the data in other ways: drawing a literal picture of the data. We call such pictorial presentations data graphics.

Making pictures of data is a relatively modern idea. William Playfair (1759-1823) is credited as the inventor of novel graphical forms in which data values are presented graphically, rather than as numbers or text. To illustrate, consider the data from the 1700s (Table 2.1) that Playfair turned into a picture.

Playfair’s innovation, as in Figure 2.1, was successful because it was powerful. The pattern that is latent in the data frame becomes visually obvious to the human viewer. The picture shows not only the trade values each year but also the trends across the decades.
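As a rough sketch of how a table like Table 2.1 becomes a picture, the following base-R code types in the sixteen rows of the table and draws both series against year. The names `trade` and `deficit` are made up for this illustration, and the result is only a plain line chart, not a reproduction of Playfair's engraving.

```r
# Values copied from Table 2.1 (units as given in the table).
trade <- data.frame(
  year    = 1700:1715,
  exports = c(180, 170, 160, 150, 145, 140, 135, 125,
              120, 110, 105, 105, 100, 100, 100, 100),
  imports = c(460, 480, 490, 500, 510, 525, 550, 565,
              580, 590, 625, 650, 680, 710, 725, 755)
)

# The gap between the two series is the trade deficit for each year.
trade$deficit <- trade$imports - trade$exports

# Draw both series on one set of axes: year horizontal, value vertical.
matplot(trade$year, as.matrix(trade[, c("exports", "imports")]),
        type = "l", lty = 1, col = c("darkred", "darkblue"),
        xlab = "Year", ylab = "Value of trade")
legend("topleft", legend = c("Exports", "Imports"),
       lty = 1, col = c("darkred", "darkblue"))
```

In the table the widening gap has to be worked out by subtraction, row by row. In the line chart it is visible at a glance as the growing space between the two lines.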

Figure 2.1: William Playfair’s 1801 presentation of year-by-year data on trade between England and the East Indies. Source: University of Pennsylvania Libraries

The American revolution is marked out by the graph; you can see the steady fall in English exports from 1775-1780, corresponding to the American boycott during the revolution. Exports pick up again after the revolution, but English imports increase even more rapidly, leading to a steadily expanding trade deficit by 1800. The historical consequences of this deficit are profound, with continuing implications. (See blog post.)

Figure 2.2: A data graphic showing the change in the age distribution of the population from 1990 to 2050 (projected). For the full graphic, follow this link

Data graphics are becoming an important way for ordinary citizens to find out what’s happening in the world. It’s worthwhile to study collections of data graphics to see the creativity and range of approaches of graph-makers. Some examples: how people spend their day, life expectancy, wind patterns (right now!), historical sources of death.

The graphics found in statistics textbooks (Figure 2.3) are often highly stylized and don’t show data directly. Some forms (pie charts and bar charts) were introduced by Playfair more than 200 years ago. One of the primary motivations for the graphical forms in Figure 2.3 is that they can be drawn easily by hand or typewritten.


Figure 2.3: Some graphics styles often featured in statistics textbooks. Panels: dot plot, bar chart, pie chart, Playfair’s pie chart, Playfair’s bar chart, stem-and-leaf plot.

The types of graphics in Figure 2.3 can be effective pedagogical tools for teaching pupils about numbers and their representation. But our purpose in these Lessons is different: to display data directly along with guides to interpreting the possible patterns in the data.

Annotated point plots

This Lesson introduces a powerful form of data graphics that is particularly well suited to support statistical thinking: the annotated point plot. A point plot provides a visual display of a data frame. The annotations summarize specific patterns in the data. We will start with point plots without annotations.

A point plot contains a simple mark for each row of a data frame. Two selected columns of the data frame are depicted as the vertical and horizontal axes of the graphics frame. (A “point plot” is also known as a “scatter plot.”) To illustrate, we construct point plots of latitude vs longitude for a random sample of the specimens in the maps::world.cities data frame. The largest cities in that data frame are:

head(maps::world.cities |> arrange(desc(pop)))


name           country.etc         pop      lat     long   capital
-------------  ------------  ---------  -------  -------  --------
Shanghai       China          15017783    31.23   121.47         2
Bombay         India          12883645    18.96    72.82         0
Karachi        Pakistan       11969284    24.86    67.01         0
Buenos Aires   Argentina      11595183   -34.61   -58.37         1
Delhi          India          11215130    28.67    77.21         0
Manila         Philippines    10546511    14.62   120.97         1
... for the 10,000 biggest cities

The location of each city, in terms of latitude and longitude (variable names: lat and long), is plotted with a simple dot. In the panels below, we select random samples of the 10,000 biggest cities. The panel labelled n=100 has just one hundred cities, while n=500 has five hundred, and so on. IMPORTANT: Take your time, starting with the n=100 panel. See how much detail you can make out, then switch to the next panel and see if you can discern additional detail.

One principle of statistics is that when displaying a pattern in data, a larger sample size lets you see more detail. Here, the pattern is the one you learned in elementary school. As you’ll see, the patterns we look at in these Lessons are much simpler. But we want to illustrate the point with a very familiar, non-mathematical pattern.
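The effect of sample size can be tried out with a few lines of base R. This is only a sketch: it uses the `quakes` data frame that ships with R as a stand-in, because it also has `lat` and `long` columns, and the helper name `take_sample` is made up for this example. It does not reproduce the world-cities panels.

```r
# Stand-in data: `quakes` ships with R and has one row per earthquake,
# with `lat` and `long` columns. It is NOT the world-cities data.
set.seed(101)  # arbitrary seed so the random samples are repeatable

# Draw a random sample of n rows from a data frame.
take_sample <- function(data, n) {
  data[sample(nrow(data), n), ]
}

small  <- take_sample(quakes, 100)
larger <- take_sample(quakes, 500)

# Two panels side by side; both use lat ~ long (vertical ~ horizontal).
old_par <- par(mfrow = c(1, 2))
plot(lat ~ long, data = small,  pch = 16, cex = 0.5, main = "n = 100")
plot(lat ~ long, data = larger, pch = 16, cex = 0.5, main = "n = 500")
par(old_par)
```

Comparing the two panels should show the same thing as the city maps: the larger sample fills in detail that the smaller one only hints at.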

For the sake of simplicity, we will use the pointplot() function (from the {LST} package) to start making data graphics. pointplot() requires two inputs:

  1. a data frame that is piped into pointplot().
  2. a “tilde expression” specifying which variables from the data frame are to be rendered in graphical form. For the world-cities graphs above, the tilde expression is lat ~ long.

As an example you can follow along, we use a short data frame, Anthro_F, which records, among other variables, the wrist, ankle, and knee circumference of 184 college-aged women. Figure 2.4 shows a point plot of wrist versus ankle circumference.

Table 2.2: Some selected rows from the Anthro_F data frame.
Wrist Ankle Knee
18.4 23.5 37.5
13.5 18.0 32.3
18.0 22.5 38.5
19.0 24.5 41.5
Anthro_F |> pointplot(Wrist ~ Ankle)

Figure 2.4: A point plot of wrist versus ankle circumference.

Each dot in Figure 2.4 reflects one row of the Anthro_F data frame and is placed at the coordinates (Ankle, Wrist). Table 2.2 shows four rows from Anthro_F; you should be able to locate the four corresponding dots in the Figure 2.4 point plot.

The computer command used to create Figure 2.4 is typical of the commands you will use throughout these Lessons. Let’s highlight the components of the command.

\(\underbrace{\texttt{Anthro_F}}_\text{data frame}\ \ \color{orange}{\underbrace{\Large\texttt{|}\!\texttt{>}}_{\text{pipe}}} \ \ \ \color{green}{\underbrace{\texttt{pointplot(}}_\text{function}}\ \color{blue}{\underbrace{\texttt{Wrist}\ _{\LARGE{\texttt{~}}}\ \texttt{Ankle}}_\text{argument}}\ \color{green}{\texttt{)}}\)

This is a general pattern for computing on data frames, that is, providing a data frame as an input to a function. Every function has an identifying name, here pointplot(). The data frame Anthro_F is being piped into the function. The purpose of the function named pointplot() is to generate a data graphic.

It’s often the case that additional instructions are needed to describe exactly what the function is to do. You place such instructions—called “arguments”—inside the pair of parentheses that follow the function name.

The tilde expression specifies which variable goes on the vertical axis and which one on the horizontal. The word “tilde” is the name of the wavy character ~. The variable name to the left of the tilde goes on the vertical axis; the variable name to the right of the tilde goes on the horizontal axis. The variable names used must correspond to the names in the data frame being piped into the function.

If you want to display only one variable, put it on the left-hand side of the expression and use a 1 on the right-hand side, as in Wrist ~ 1.
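To see the left-is-vertical, right-is-horizontal rule in action without the {LST} package, here is a minimal base-R sketch. It swaps in base R's `plot()`, which also accepts a tilde expression, in place of `pointplot()`. The name `Anthro_small` is made up, and the data frame holds only the four rows printed in Table 2.2.

```r
# The four rows shown in Table 2.2, typed in by hand.
Anthro_small <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

f <- Wrist ~ Ankle            # a tilde expression (R calls it a "formula")
vertical   <- all.vars(f)[1]  # name on the left of the tilde
horizontal <- all.vars(f)[2]  # name on the right of the tilde

# Base R's plot() follows the same rule: left = vertical, right = horizontal.
plot(f, data = Anthro_small, xlab = horizontal, ylab = vertical)
```

Writing `Ankle ~ Wrist` instead would swap the two axes while showing exactly the same four rows.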

Categorical variables and jittering

Each of the horizontal and vertical axes in Figure 2.4 represents a numerical variable, with the axis tick-mark labels (e.g. “18”) marking the link between position and numerical value.

Graphical axes can also be used with categorical variables, as in Figure 2.5, where the horizontal axis represents sex. To accomplish this, the axis tick marks show the levels of the categorical variable, for instance F and M. If we were to follow the mathematical conventions for numerical variables, then each point would be located exactly at its respective sex value, as in Figure 2.5(a). The space between the labelled tick marks is empty.

With categorical variables, there is a benefit to suspending the mathematical convention and spreading the points slightly and randomly around the labelled position, as in Figure 2.5(b). This spreading is called “jittering” and makes it easier to see the individual points.

Galton |> pointplot(height ~ sex, jitter="none")
Galton |> pointplot(height ~ sex, seed = 201)

(a) A plain point plot

(b) A jittered point plot

Figure 2.5: Sex and height seen in the Galton data frame.
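Jittering is easy to carry out by hand, which may help make clear what it does and does not change. The sketch below is an illustration in base R and makes no claim about how `pointplot()` does it internally. It uses the `ToothGrowth` data frame that ships with R as a stand-in for Galton, and the jitter half-width of 0.2 is an arbitrary choice.

```r
# Stand-in data: `ToothGrowth` ships with R. `supp` is categorical
# (levels OJ and VC) and `len` is numerical. It is NOT the Galton data.
set.seed(201)  # arbitrary seed so the jitter is repeatable

# Each categorical level sits at an integer position: OJ at 1, VC at 2.
position <- as.integer(ToothGrowth$supp)

# Jitter: shift each point sideways by a small random amount.
# The half-width 0.2 is an arbitrary choice for this sketch.
jittered <- position + runif(length(position), min = -0.2, max = 0.2)

plot(jittered, ToothGrowth$len, xaxt = "n", xlim = c(0.5, 2.5),
     xlab = "supp", ylab = "len")
axis(1, at = 1:2, labels = levels(ToothGrowth$supp))

# Rounding recovers the original level, so no information is lost.
recovered <- round(jittered)
```

Because the shift is smaller than half the distance between neighboring levels, every point stays in the band belonging to its own level. Only the position within the band is random.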

Color and faceting

Often, there will be more than one explanatory variable of interest. For instance, if there are two explanatory variables, the tilde expression will have two variable names on the right-hand side, for instance Wrist ~ Ankle + Knee. Graphically a third “axis” is needed for the additional explanatory variable.

Figure 2.6: A 3-dimensional point plot of Wrist ~ Ankle + Knee. Such plots are hard to make sense of.

Mathematicians will point out that, in theory, each of the three variables can be assigned to its own Cartesian axis in 3-dimensional space. Figure 2.6 shows what this would look like in an interactive 3-D plot. The result is very difficult to make sense of.

Experience has shown that graphics with three variables can be more effective if the third “axis” is represented by color.
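One way to picture "color as a third axis" is to translate each value of the third variable into a shade. The following is a minimal base-R sketch under stated assumptions: it uses the built-in `iris` data frame as a stand-in for Anthro_F, and both the five-shade blue-to-red ramp and the equal-width bins are arbitrary choices, not the color scale that `pointplot()` uses.

```r
# Stand-in data: `iris` ships with R; it is NOT the Anthro_F data.
# Roles: Sepal.Length is the response, Sepal.Width goes on the
# horizontal axis, and Petal.Length is shown by color.
n_shades <- 5                                   # arbitrary number of shades
shades   <- colorRampPalette(c("blue", "red"))(n_shades)

# Assign each row to one of n_shades equal-width bins of the third variable.
bin <- cut(iris$Petal.Length, breaks = n_shades, labels = FALSE)

plot(Sepal.Length ~ Sepal.Width, data = iris,
     pch = 16, col = shades[bin])
```

Every dot keeps its position from the two-variable plot. The third variable changes only the ink, so the graphics frame stays flat and readable.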


Anthro_F |> pointplot(Wrist ~ Ankle + Knee)
Anthro_F |> pointplot(Wrist ~ Ankle + Knee + Knee)

(a) Color alone

(b) Color and faceting together.

Figure 2.7: A point plot of wrist versus ankle and knee circumference using color to represent the knee circumference.

The right-hand panel in Figure 2.7 illustrates the technique of “faceting.” A facet of a graph is a sub-panel that represents a subset of the data. For instance, the middle facet in Figure 2.7(b) includes just those specimens with knee circumferences in the range 34 cm to 36 cm. Faceting is specified by the third variable (if any) on the right-hand side of the tilde expression. In the expression Wrist ~ Ankle + Knee + Knee, we are using Knee in two roles: color and faceting. The consequence is that only one color appears in each facet.
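Faceting on a numerical variable amounts to cutting that variable into ranges and drawing one sub-panel per range. Here is a hedged base-R sketch of the idea, again with the built-in `iris` data frame standing in for Anthro_F. Splitting at terciles is an assumption made for the illustration. The source does not say how `pointplot()` chooses its ranges.

```r
# Stand-in data: `iris` ships with R; it is NOT the Anthro_F data.
# Cut the third variable into three groups with roughly equal counts.
# Using tercile cut points is an assumption made for this sketch.
cuts  <- quantile(iris$Petal.Length, probs = c(0, 1/3, 2/3, 1))
facet <- cut(iris$Petal.Length, breaks = cuts, include.lowest = TRUE)

# A facet is just the subset of rows falling in one group.
panels <- split(iris, facet)

# Draw one panel per facet, all sharing the same axis limits.
old_par <- par(mfrow = c(1, length(panels)))
for (label in names(panels)) {
  plot(Sepal.Length ~ Sepal.Width, data = panels[[label]],
       xlim = range(iris$Sepal.Width), ylim = range(iris$Sepal.Length),
       main = label)
}
par(old_par)

# Range of the faceting variable inside each facet.
ranges <- sapply(panels, function(p) range(p$Petal.Length))
```

The `ranges` matrix shows that each facet holds only a narrow, non-overlapping slice of the faceting variable. If the same variable also sets the color, every facet ends up with a single shade.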


Explain why there is only one color in each facet in the right-hand panel of Figure 2.7.

Write out the R commands to make these graphics, based on the Whickham data frame.

Graph A


Answer: Whickham |> pointplot(outcome ~ smoker, ink = 0.5)

Graph B

Answer: Whickham |> pointplot(outcome ~ age, ink = 0.5)

Graph C

Answer: Whickham |> pointplot(age ~ smoker, ink = 0.5)

The graphic below contains a single data layer. Four of the data points are annotated with letters in order to identify them specifically.

Part 1

  1. Is the income level of “a” greater than “b”? Answer: no
  2. Is the income level of “d” greater than “a”? Answer: no
  3. Is the number of rooms greater for “b” than for “a”? Answer: no. Even though the vertical position of “b” is higher than for “a,” they are in the same jittering band. All points within a jittering band have the same value of the variable that is being jittered.
  4. Is the number of rooms greater for “c” than for “a”? Answer: yes. They are in different jittering bands.

Part 2

Here is the data plotted in the figure.

 row   income   number_of_rooms
----  -------  ----------------
   1     0.90                 1
   2     1.00                 3
   3     0.31                 3
   4     0.85                 1
   5     1.09                 3
   6     1.19                 2
   7     1.01                 1
   8     1.09                 3
   9     1.16                 2
  10     2.86                 2

The points a, b, c, and d, are shown in the table. For each of a, b, c, d, say which row corresponds to the point. Answer: a is row 8, b is row 7, c is row 2, d is row 1

With reference to the graphics frame shown below, indicate whether the variable on each axis is quantitative or categorical.

  1. Horizontal axis: quantitative or categorical Answer: categorical
  2. Vertical axis: quantitative or categorical Answer: quantitative

Based on the graphic above—which violates our convention of putting statistical annotations on top of the raw data—which group, A or B, has the larger number of instances in the data? Select one

  1. Group A has more instances.
  2. Group B has more instances.
  3. The two groups have about the same number of instances.
  4. Violin plots don’t show this information. Answer: Right


All of the violins shown in a given plot will have the same area regardless of the number of points for the group being represented. If the values are spread out (e.g. low density) the violin will be narrow, if they are clumped together (e.g. high density) the violin will be relatively wide. But in comparing two violins, there’s no way to say how many data points fall into each of them.

This is one of the reasons why it’s good to show the raw data along with the statistical annotations.
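The claim that a violin's area does not depend on the number of points can be checked numerically, because a violin is drawn from a density curve and a density curve always encloses a total area of 1. The sketch below uses simulated values, not real data, and the helper name `area_under` is made up for this example.

```r
set.seed(42)  # arbitrary seed; the data below are simulated, not real

# Two hypothetical groups from the same distribution, very different sizes.
group_small <- rnorm(50)
group_large <- rnorm(5000)

# Area under a density curve, approximated by a sum of thin rectangles.
area_under <- function(x) {
  d <- density(x)            # smooth density estimate from base R
  sum(d$y) * diff(d$x[1:2])  # curve height times grid spacing
}

area_small <- area_under(group_small)
area_large <- area_under(group_large)
```

Both areas should come out very close to 1 even though one group has a hundred times as many rows as the other. This is why the count has to be read from the data layer rather than from the violin.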

Consider this data frame:

HealthGen    Age   SleepHrsNight
----------  ----  --------------
Vgood         28               9
Vgood         27               8
Vgood         17               6
Good          43               7
Good          27               6
Excellent     36               8
Good          29               6
Good          80               6
Excellent     22               8
Good          54               7
  ... and so on for 569 rows in total.

Here is a plot of the data. The identifying labels have been stripped off for the purpose of this exercise.

  1. What is the variable used for faceting? Answer: General health
  2. What is the variable on the horizontal axis? Answer: Age
  3. Is this plot jittered? Answer: No. Notice that the values of SleepHrsNight are discrete integers: 7, 8, 9 and so on. The data rows with each value of SleepHrsNight are all plotted at the same vertical positioning. If there were jittering, points with the same value of SleepHrsNight would be spread somewhat in the vertical direction.

The LST::Butterfly data frame records world records in the 100- and 200-meter butterfly swimming competition.

  1. Using tilde_plot(), make a graphic that tells an informative story about what world records depend on. When you have a graphic that you like, write a short narrative that guides a human reader through what is revealed by the graphic.

  2. The races cover different total distances (100 and 200 meters) but a given distance might be divided into multiple “lengths” according to the size of the pool. Make a graphic that shows clearly what is the effect of having to turn around at the end of each length in order to complete the total distance.

The following one connects the dots with line segments. It needs to be updated for pointplot()

The graph below is a violin plot. Using a pencil and your intuition, add a few dozen dots to the graphic as they would appear in a data layer superimposed on the violin layer. The dots should be jittered and be consistent with the shape of the violins.


Where the violin is wider, there is a greater concentration of dots. In a jittered plot, the exact horizontal position of the dots has no significance.

The SDSdata::FARS data table contains statistics on motor-vehicle related fatalities each year in the US. The following command produces a data layer of the number of crashes.


  1. For data where there is a time sequence to the points, it can be helpful to guide the eye by connecting the points with a line. You can do this by piping the output of gf_point() into the gf_line() function. Produce the plot with the points connected by lines.


gf_point(crashes ~ year, data = FARS) |>
  gf_line()

  1. Reading the graphic What is the numerical size of the drop from the year with the highest number of crashes to the year with the lowest number of crashes? Answer: about 10000 crashes.

  2. There is a dramatic fall in the number of crashes between 2005 and 2010. But how dramatic? For variables where zero is a meaningful value, as with crashes, it can be helpful to include zero on the y-axis. This helps the eye to see not just the change in numbers but the size of that change in proportion to the whole. You can set the scale of the y-axis by adding another function call to the graphing sequence: gf_lims(y = c(0, 40000)). Make such a graph.


gf_point(crashes ~ year, data = FARS) |>
  gf_line() |>
  gf_lims(y = c(0, 40000))

Another, more convenient way to create a similar graph is to use gf_lims(y = c(0, NA)). Here, the NA is an instruction to the computer to figure out what the top limit should be automatically.

  1. From the graph with a y-axis starting at zero, estimate the proportional change in the number of crashes from the highest value to the lowest value. Answer: a reduction of about 25%
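The arithmetic behind a proportional change is short enough to write out. The numbers below are round values implied by the two answers above (a drop of about 10,000 that amounts to about 25%). They are assumptions for the illustration, not values read from the FARS data frame.

```r
# Round numbers implied by the answers above (a drop of about 10,000
# that amounts to about 25%); they are NOT read from the FARS data.
highest <- 40000
lowest  <- 30000

drop         <- highest - lowest   # absolute change
proportional <- drop / highest     # change as a fraction of the starting level
percent      <- 100 * proportional
```

Dividing by the starting level is what turns "about 10,000 fewer crashes" into "about one quarter fewer." A y-axis that starts at zero lets the eye make the same comparison.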

The next exercise needs to be updated to pointplot()

The figure in Exercise 2.9 shows the number of fatal motor-vehicle related crashes in the US over the years. There is a substantial drop in number from 2005 to 2010. What might account for this?

There are many possible hypotheses. For instance:

  1. Cars became safer in this period.
  2. Drunk-driving laws and education programs became more effective.
  3. Roads were improved.
  4. The number of miles driven fell, reducing the number of accidents.

In this exercise, you’ll make some graphics to explore hypothesis (4).

  1. “Adjust” the number of crashes by the number of miles driven, for instance by dividing one by the other.


    FARS <- FARS |> mutate(crash_rate = crashes / vehicle_miles)


    Plot out the crash rate over the years. Does it show a drop from 2005 to 2010 similar to that seen in the plot of the number of crashes?


  1. Check whether crashes and vehicle_miles are related by plotting one versus the other.
  1. Add a statistics layer showing a straight-line model of crashes as a function of vehicle_miles. You can do this by piping the data layer into the function gf_lm().

  2. Add an interval layer by giving gf_lm() an additional argument: interval = "confidence".

The statistical annotations created by pointplot() always extend over an interval (or “band”). Traditionally, statisticians have distinguished between two types of statistics:

  • point statistics, each of which is a single number;
  • interval statistics, each of which is a range of values, such as those produced by pointplot().

Often, interval statistics are drawn using an I-beam shape called an “error bar” while point statistics are drawn with a point or a horizontal line.
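The distinction can be made concrete with one point statistic and one interval statistic computed from the same variable. This sketch uses the `faithful` data frame that ships with R as a stand-in. Using the mean as the point statistic and a 95% confidence interval from `t.test()` as the interval statistic are choices made for the illustration, not a description of what `pointplot()` computes.

```r
# Stand-in data: `faithful` ships with R (waiting times between eruptions).
x <- faithful$waiting

point_stat    <- mean(x)                         # a single number
interval_stat <- as.numeric(t.test(x)$conf.int)  # lower and upper ends (95% by default)

# Draw the interval as an I-beam ("error bar") and the point statistic as a dot.
plot(1, point_stat, pch = 16, xaxt = "n", xlab = "",
     ylab = "waiting", ylim = range(interval_stat) + c(-1, 1))
arrows(1, interval_stat[1], 1, interval_stat[2],
       angle = 90, code = 3, length = 0.1)
```

The dot says "here"; the I-beam says "somewhere in this range."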


For each graph, state which types of graphical layers appear.

Answer: (a) point statistic layer; Answer: (b) interval layer; Answer: (c) data layer; Answer: (d) data and interval layers; Answer: (e) point statistic and interval layers; Answer: (f) three layers: data, point statistic, and interval;

Figure fig-cat-cat involves two categorical explanatory variables.

  1. Which variable is mapped to the horizontal axis? Which to color?

  2. What is the model value of age for non-smoking survivors?

  3. What are the levels of domhand?

Horizontal: smoking status. Color: whether the nurse survived for at least 20 years after the initial interview. About 41 years old.

Each of the following plots has been made by pointplot(). The name of the data frame is given. Your job is to write the entire command that will reproduce the plot.


  1. Some plot

  2. Another plot

  3. And so on.

In Figure fig-wrist-ankle-model(b) each of the three facets has points of only one color. Explain why.

Decide which assignment of variables to graphical qualities you think is the most important. Note that we are going to use a convention: response, explanatory1, explanatory 2

Galton |> pointplot(height ~ mother + sex)

Galton |> pointplot(height ~ sex + mother)

a. In the plot that maps `mother` to the horizontal axis, explain how you would identify a child who is relatively short for their sex but who has a tall mother.

b. In the plot that maps `mother` to color, explain how you would identify a child who, as in (a), is relatively short for their sex but has a tall mother.

There’s an interesting pattern shown in this plot:

Won’t compile to HTML

Births |> filter(year %in% c(1980)) |> 
  pointplot(births ~ date + wday, size=0.4) |>
a. The points split into two main groups based on the number of births each day. Explain in everyday terms what's going on.
b. There are some low-birth dates that are not weekends. Look at the specific date by hovering the cursor over the points. What's going on?

Outlier in Knee in Figure fig-wrist-angle. Find the specimen and filter it out.

The “body mass index” (BMI) is a familiar way of defining overweight. (Whether it is useful medically is controversial, but it is widely used.) BMI is an arithmetic combination of height and weight. Using the data in Anthro_F, make plots showing the relationship between BMI, Height, and Weight. There are six different ways of defining the graphics frame from three variables, e.g., Height ~ BMI + Weight or Weight ~ Height + BMI.

a. Three of the six possible frames just swap the x- and y-axes from the other three. Make a list of the three pairs of swapped axis graphics frames.

b. Select one frame from each of the three pairs in (a) and graph it, producing three graphs. 

c. Pick one of the three graphs from (b)---whichever you like best---and use it to explain in graphical, everyday terms, how BMI is related to height and weight. 

As you know, the .by= argument to the wrangling verbs causes the operation to be done separately for each group defined by .by=.

There is a similar .by= argument for pointplot(). For instance,

Big |> pointplot(flipper ~ mass + species, 
                  .by =  ~ species)

This groupwise splitting up of a graph is called “faceting.”

Notice that, unlike the wrangling functions, .by= uses a tilde expression. This is because you might sometimes want to facet using two variables, one along the horizontal spread of facets, one along the vertical spread. The tilde-expression format lets you specify which facet is horizontal and which vertical.

Faceting is more sophisticated than merely making a new graph for each group. To illustrate, here is a single data graph just for the Chinstrap species of penguin:

Penguins |> filter(species == "Chinstrap") |>
  pointplot(flipper ~ mass + species)

  1. Compare the x-y frame for the Chinstrap facet in the top graph to the x-y frame for the Chinstrap-only second graph. What’s different about the x- and y- axes?

  2. Explain what’s nice about the faceting way of setting the bounds of the x- and y-axis.

DRAFT: Using jittering and transparency for quantitative variables. Point out that numerical values are sometimes discrete, as in the number of hours of sleep each night.

NHANES::NHANES |> gf_point(SleepHrsNight ~ Depressed, point_ink = 0.3)

NHANES::NHANES |> gf_jitter(SleepHrsNight ~ Depressed, point_ink = 0.3)

You’ll need to explain what the NA refers to.

DRAFT: a graph of newborn babies weights versus the age of the mother. Use the model annotation to describe the relationship, if any.

Gestation |> pointplot(wt ~ age, point_ink = 0.1, annot="model")

DRAFT: Re-create the East-India graphic.

Consider this annotated point plot.

Whickham |> pointplot(age ~ smoker, point_ink = 0.3, annot="violin")

  1. What tilde expression was used?
  2. Which group, smokers or non-smokers, has a greater density of people over age 60?

The Births2022 data frame records a random sample of 20,000 births in the US in 2022. Two of the variables, meduc and feduc, give the educational level of the mother and father respectively. The levels of these categorical variables correspond to “eighth grade or less,” “twelfth grade or less,” “high-school graduate,” “high-school graduate plus some college (but no degree),” “associate’s degree,” “bachelor’s degree,” “master’s degree,” and “professional degree” (such as a PhD, EdD, MD, LLB, DDS, JD). Educational data is missing (“NA”) for about 5% of mothers and 15% of fathers.

The graph is a point plot of the mother’s education level versus the father’s.

library(ggplot2)
# alpha sets the point transparency; geom_jitter() has no point_ink argument
ggplot(Births2022, aes(y=meduc, x=feduc)) + 
  geom_jitter(alpha = 0.05, size=0.02, height=0.4, width=0.4) +
  theme_bw() +
  theme(aspect.ratio=1, axis.text.x = 
          element_text(angle = 45, vjust=0.9, hjust=1)) +
  labs(x="Father's education", y="Mother's education") 

  1. Is this a jittered point plot? Explain briefly how you can tell. Answer: Yes, it’s jittered both horizontally and vertically. The axis tick marks correspond to discrete categorical levels, but the points themselves are spread out a little bit around the discrete levels.
  2. Is transparency used? Explain briefly how you can tell. Answer: Yes. In the blocks with a low number of points, each dot is not a solid color.
  3. In principle, there are 9 \(\times\) 9 = 81 possible combinations of the mother’s and father’s education. Which combination is the most common? What’s the second most common combination? Answer: Most common: HS for both mother and father. Second most common: Bachelors for both mother and father.
  4. Is it more common for a woman with a Bachelor’s degree to marry a man with a high-school degree or vice versa? Answer: It is more common for the woman to be the one with the Bachelor’s degree. The square at mother=bachelors, father=HS is much darker than the similar square on the other side of the diagonal, that is, at father=bachelors, mother=HS.
  5. What would the graphic look like if jittering had not been used? Answer: There would be a single dot at each of the populated intersections, rather than the square cloud of dots seen in the actual graph.
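The darkness of each square in a jittered plot of two categorical variables reflects a count, and those counts can also be tabulated directly. Since Births2022 may not be at hand, this sketch uses the `mtcars` data frame that ships with R as a stand-in, with its `cyl` and `gear` columns playing the roles of the two categorical variables. The name `most_common` is made up for this example.

```r
# Stand-in data: `mtcars` ships with R. `cyl` and `gear` each take
# three values, so there are 3 x 3 = 9 possible combinations.
counts <- table(cyl = mtcars$cyl, gear = mtcars$gear)
counts

# Locate the most common combination (first one, if there is a tie).
top <- which(counts == max(counts), arr.ind = TRUE)
most_common <- c(cyl  = rownames(counts)[top[1, 1]],
                 gear = colnames(counts)[top[1, 2]])
most_common
```

Each cell of `counts` corresponds to one square cloud of dots in the jittered plot. The largest count is the darkest square.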

Guides, scales, palettes

Identifying points

[Still in draft]

This works but won’t compile to HTML

AAUP |> pointplot(acsal ~ nonacsal + licensed) |>

Learning a new way of thinking is genuinely hard. As you learn statistical thinking, it may help to have a concise definition. The following definition captures much of the essence of statistical thinking:

Statistical thinking is the accounting for variation in the context of what remains unaccounted for.

Implicit in this definition is a pathway for learning to think statistically:

  1. Learn how to measure variation;
  2. Learn how to account for variation;
  3. Learn how to measure what remains unaccounted for.

In this Lesson, we will consider graphical ways to display variation.


Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions. (Stephen Jay Gould, 1941-2002, paleontologist and historian of science)

We will use a _vari_ety of words to express differences from specimen-to-specimen, such as the diverse durations of gestation. Variation is about how things vary. Variance has a non-technical meaning, as in a “zoning variance” which gives permission to depart from zoning rules. For us, variance will always be used in a technical sense: a number summarizing variation of the values in a variable. Whenever you see the stem “var”, you should be thinking of specimen-to-specimen differences.

To illustrate variation, let’s consider a process fundamental to human life: gestation. We all know that human pregnancy “typically” lasts around nine months but differs unpredictably from one birth to another.

Figure 2.8 shows data from the Gestation data frame. In this data frame, each of the 1200 rows is one pregnancy and birth about which several measurements were made. The gestation variable records the length of the pregnancy (in days).

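library(dplyr)    # mutate()
library(ggplot2)  # ggplot(), geom_jitter()

# Recode parity: 0 means a first pregnancy
Gestation <- Gestation |> 
  mutate(parity = ifelse(parity == 0, "first-time", "previous-preg")) 
# Jitter horizontally only; alpha sets the point transparency
Plot1 <- Gestation |>
  ggplot(aes(x=parity, y=gestation)) + 
  geom_jitter(alpha = 0.2, width=0.2, height=0) 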

Figure 2.8: Gestational period for first-time mothers and mothers with a previous pregnancy.

Figure 2.8 divides the 1200 births in the Gestation data frame according to the variable parity, which describes whether or not the pregnancy is the mother’s first.

The variation in gestation is evident directly from the dots in the graph. One strategy for describing variation is to specify an interval: the span between a lower and an upper value. For instance,

  • The large majority of pregnancies last between 250 and 310 days. Or,
  • The majority of pregnancies are between 275 and 290 days.
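Intervals like these can be computed as well as read off a graph. The sketch below shows one way to do it in base R, using `faithful$waiting` (shipped with R) as a stand-in for the gestation variable. Reading "the large majority" as the central 95% and "the majority" as the central 50% is an assumption made for the illustration, and `coverage` is a made-up helper name.

```r
# Stand-in data: `faithful` ships with R; it is NOT the Gestation data.
x <- faithful$waiting

# An interval covering about the central 95% of values ("the large majority")...
wide <- quantile(x, probs = c(0.025, 0.975))
# ...and one covering about the central 50% ("the majority").
narrow <- quantile(x, probs = c(0.25, 0.75))

# Fraction of values that fall inside a given interval.
coverage <- function(x, interval) {
  mean(x >= interval[1] & x <= interval[2])
}

coverage(x, wide)
coverage(x, narrow)
```

The trade-off is visible in the output: the wider interval covers more of the rows but says less about where a typical value lies.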

A more subtle description avoids setting hard bounds in favor of saying which durations are common and which not. This common-or-not description is called a “distribution.” The “histogram” is a famous style of presentation of a distribution. Even elementary-school students are introduced to histograms; they are easy to draw.

There are good reasons to avoid the busy display of a histogram. For instance, we want to be able to show relationships between variables and we want, whenever possible, to put the graphical summaries of data as a layer on top of the data themselves. And we have the computer as a tool for making graphics. Consequently, our preferred format for displaying distributions is a smooth shape, oriented along the vertical axis. The width of the shape expresses how common is the corresponding region of the vertical axis. The word “density” is often used when talking about distributions. Where the data points are closely spaced to one another, the density is high. Where data points are sparse, the density is low. You can see the density at any level of the vertical axis, just as you can read by eye the density of tufts of grass sprouting in a newly tilled field.

Figure 2.9 shows the density display layered on top of the pregnancy data. For reasons that may be evident, this sort of display is called a “violin plot.”

Plot1 +
  geom_violin(fill="blue", alpha = 0.65, color=NA)

Figure 2.9: A violin plot. The long axis of the violin-like shape is oriented along the response-variable axis (that is, the vertical axis in our standard format). The width of the violin for each possible value of the response variable is proportional to the density of data near that value.
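To make the link between density and the violin shape concrete, here is a sketch that builds a violin by hand in base R: estimate the density along the response variable, then mirror it left and right of a center line. It uses `faithful$waiting` (shipped with R) as a stand-in for gestation. The maximum half-width of 0.4, the jitter width, and the colors are arbitrary choices. This is not the code behind Figure 2.9.

```r
# Stand-in data: `faithful` ships with R; it is NOT the Gestation data.
y <- faithful$waiting

d <- density(y)                       # density at a grid of values of y
half_width <- 0.4 * d$y / max(d$y)    # scale so the widest point is 0.4

# Data layer first: jittered points centered on horizontal position 1.
set.seed(7)  # arbitrary seed for repeatable jitter
plot(1 + runif(length(y), -0.2, 0.2), y,
     xlim = c(0.4, 1.6), xaxt = "n", xlab = "", ylab = "waiting",
     pch = 16, col = "gray40")

# Violin layer on top: the density curve mirrored about position 1.
polygon(c(1 - half_width, rev(1 + half_width)),
        c(d$x, rev(d$x)),
        col = adjustcolor("blue", alpha.f = 0.35), border = NA)

widest_at <- d$x[which.max(half_width)]  # value of y where the violin is widest
```

The violin bulges wherever the dots crowd together and pinches in where they thin out. The data layer underneath lets the viewer confirm this by eye.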

The shapes of the two violins in Figure 2.9 are similar, suggesting that the variation in the duration of pregnancy is about the same for first-time mothers as for mothers in a second or later pregnancy.