# 2 Data graphics

Year | Exports | Imports |
---|---|---|
1700 | 180 | 460 |
1701 | 170 | 480 |
1702 | 160 | 490 |
1703 | 150 | 500 |
1704 | 145 | 510 |
1705 | 140 | 525 |
1706 | 135 | 550 |
1707 | 125 | 565 |
1708 | 120 | 580 |
1709 | 110 | 590 |
1710 | 105 | 625 |
1711 | 105 | 650 |
1712 | 100 | 680 |
1713 | 100 | 710 |
1714 | 100 | 725 |
1715 | 100 | 755 |

A data frame is a specific way of organizing and storing data. To see the “big picture,” however, it can help to organize the data in other ways: drawing a literal picture of the data. We call such pictorial presentations **data graphics**.

Making pictures of data is a relatively modern idea. William Playfair (1759-1823) is credited as the inventor of novel graphical forms in which data values are presented graphically, rather than as numbers or text. To illustrate, consider the data from the 1700s (Table tbl-playfair-trade) that Playfair turned into a picture.

Playfair’s innovation, as in Figure fig-playfair, was successful because it was powerful. The pattern that is latent in the data frame becomes visually obvious to the human viewer. The picture shows not only the trade values each year but also the *trends* across the decades.
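As a rough sketch of the idea, the trade table can be entered as a data frame and drawn with base R graphics. The names `Trade` and `Gap` are introduced here only for illustration; they come neither from Playfair nor from the `{LST}` package.

```r
# Trade values copied from the table (1700 to 1715)
Trade <- data.frame(
  Year    = 1700:1715,
  Exports = c(180, 170, 160, 150, 145, 140, 135, 125,
              120, 110, 105, 105, 100, 100, 100, 100),
  Imports = c(460, 480, 490, 500, 510, 525, 550, 565,
              580, 590, 625, 650, 680, 710, 725, 755)
)

# Gap between the two series (an illustrative column, not in the table)
Trade$Gap <- Trade$Imports - Trade$Exports

# One line per series; the widening space between the lines is the trend
plot(Imports ~ Year, data = Trade, type = "l",
     ylim = c(0, max(Trade$Imports)), ylab = "Trade value")
lines(Exports ~ Year, data = Trade, lty = 2)
legend("topleft", legend = c("Imports", "Exports"), lty = c(1, 2))
```

In the table, the gap is just a column of numbers; in the picture, it is the visibly widening space between the two lines.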

The American Revolution is marked out by the graph; you can see the steady fall in English exports from 1775 to 1780, corresponding to the American boycott during the revolution. Exports pick up again after the revolution, but English imports increase even more rapidly, leading to a steadily expanding trade deficit by 1800. The historical consequences of this deficit are profound, with continuing implications. (See blog post.)

Data graphics are becoming an important way for ordinary citizens to find out what’s happening in the world. It’s worthwhile to study collections of data graphics to see the creativity and range of approaches of graph-makers. Some examples: how people spend their day, life expectancy, wind patterns (right now!), historical sources of death.

The graphics found in statistics textbooks (Figure fig-textbook-graphs) are often highly stylized and don’t show data directly. Some forms—pie charts and bar charts—were introduced by Playfair more than 200 years ago. One of the primary motivations for the graphical forms in Figure fig-textbook-graphs is that they *can be drawn easily by hand* or typewritten.

The types of graphics in Figure fig-textbook-graphs can be effective pedagogical tools for teaching pupils about numbers and their representation. But our purpose in these *Lessons* is different: to display data directly along with guides to interpreting the possible patterns in the data.

## Annotated point plots

This *Lesson* introduces a powerful form of data graphics that is particularly well suited to support statistical thinking: the **annotated point plot**. A point plot provides a visual display of a data frame. The annotations summarize specific patterns in the data. We will start with point plots without annotations.

A point plot contains a simple mark for each row of a data frame. Two selected columns of the data frame are depicted as the vertical and horizontal axes of the **graphics frame**. To illustrate, we construct point plots of latitude vs longitude for a random sample of the specimens in the `maps::world.cities` data frame,

`head(maps::world.cities |> arrange(desc(pop)))`

```
name country.etc pop lat long capital
------------- ------------ --------- ------- ------- --------
Shanghai China 15017783 31.23 121.47 2
Bombay India 12883645 18.96 72.82 0
Karachi Pakistan 11969284 24.86 67.01 0
Buenos Aires Argentina 11595183 -34.61 -58.37 1
Delhi India 11215130 28.67 77.21 0
Manila Philippines 10546511 14.62 120.97 1
... for the 10,000 biggest cities
```

The location of each city, in terms of latitude and longitude (variable names: `lat` and `long`), is plotted with a simple dot. In the panels below, we select random samples of the 10,000 biggest cities. The panel labelled n=100 has just one hundred cities, while n=500 has five hundred, and so on. **IMPORTANT** Take your time, starting with the n=100 panel. See how much detail you can make out, then switch to the next panel and see if you can discern additional detail.

One principle of statistics is that when displaying a pattern in data, a larger sample size lets you see more detail. Here, the pattern is the one you learned in elementary school. As you’ll see, the patterns we look at in these *Lessons* are much simpler. But we want to illustrate the point with a very familiar, non-mathematical pattern.
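The sampling behind such panels can be sketched in a few lines of base R. This is an illustration only, not the code used to draw the panels: `sample_rows()` is a hypothetical helper, the sample sizes are arbitrary, and the plotting step assumes the `{maps}` package is installed.

```r
# Draw n rows at random from a data frame (hypothetical helper)
sample_rows <- function(data, n, seed = 101) {
  set.seed(seed)  # fixed seed so the same sample is drawn each time
  data[sample(nrow(data), n), , drop = FALSE]
}

# Plot latitude vs longitude for increasingly large samples,
# assuming the {maps} package is available
if (requireNamespace("maps", quietly = TRUE)) {
  for (n in c(100, 500, 2000)) {
    Cities <- sample_rows(maps::world.cities, n)
    plot(lat ~ long, data = Cities, pch = 20, cex = 0.4,
         main = paste("n =", n))
  }
}
```

Each pass through the loop draws a larger sample, so each successive plot shows more of the familiar outline.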

For the sake of simplicity, we will use the `pointplot()` function (from the `{LST}` package) to start making data graphics. `pointplot()` requires two inputs:

- a data frame that is piped into `pointplot()`.
- a “**tilde expression**” specifying which variables from the data frame are to be rendered in graphical form. For the world-cities graphs above, the tilde expression is `lat ~ long`.

As an example you can follow along, we use a short data frame, `Anthro_F`, which records, among other variables, the wrist, ankle, and knee circumference of 184 college-aged women. Figure fig-wrist-ankle shows a point plot of wrist versus ankle circumference.

Wrist | Ankle | Knee |
---|---|---|
18.4 | 23.5 | 37.5 |
13.5 | 18.0 | 32.3 |
18.0 | 22.5 | 38.5 |
19.0 | 24.5 | 41.5 |

`Anthro_F |> pointplot(Wrist ~ Ankle)`

Each dot in Figure fig-wrist-ankle reflects one row of the `Anthro_F` data frame and is placed at coordinate (`Ankle`, `Wrist`). Table tbl-wrist-ankle2 shows four rows from `Anthro_F`; you should be able to locate the four corresponding dots in the Figure fig-wrist-ankle point plot.

The computer command used to create Figure fig-wrist-ankle is typical of the commands you will use throughout these lessons. Let’s highlight the components of the command.

\(\underbrace{\texttt{Anthro_F}}_\text{data frame}\ \ \color{orange}{\underbrace{\Large\texttt{|}\!\texttt{>}}_{\text{pipe}}} \ \ \ \color{green}{\underbrace{\texttt{pointplot(}}_\texttt{function}}\ \color{blue}{\underbrace{\texttt{Wrist}\ _{\LARGE{\texttt{~}}}\ \texttt{Ankle}}_\text{argument}}\ \color{green}{\texttt{)}}\)

This is a general pattern for computing on data frames, that is, providing a data frame as an input to a *function*. Every function has an identifying name, here `pointplot()`. The data frame `Anthro_F` is being piped into the function. The purpose of the function named `pointplot()` is to generate a data graphic.

It’s often the case that additional instructions are needed to describe exactly what the function is to do. You place such instructions—called “**arguments**”—inside the pair of parentheses that follow the function name.
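The same pattern of data frame, pipe, function, and arguments can be tried with a function built into R. In this sketch, `Four_rows` is an illustrative name for the four table rows shown earlier, and `head()` stands in for `pointplot()` so that nothing beyond base R (version 4.1 or later, for the `|>` pipe) is needed.

```r
# The four rows shown in the table, entered by hand
Four_rows <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

# Piping supplies the data frame as the function's first input;
# the 2 inside the parentheses is an additional argument
piped  <- Four_rows |> head(2)

# The same computation written without the pipe
direct <- head(Four_rows, 2)

identical(piped, direct)
```

The two forms give the same result; the piped form simply reads left to right, starting with the data.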

The tilde expression specifies which variable goes on the vertical axis and which one on the horizontal. The word “tilde” is the name of the wavy character `~`. The variable name to the left of the tilde goes on the vertical axis; the variable name to the right of the tilde goes on the horizontal axis. The variable names used must correspond to the names in the data frame being piped into the function.
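A tilde expression can be inspected on its own, which may help make the left/right convention concrete. The sketch below uses only base R; `frame_spec` and `Four_rows` are illustrative names, and base R's `plot()` is used here only because it happens to follow the same vertical ~ horizontal convention.

```r
frame_spec <- Wrist ~ Ankle

class(frame_spec)      # R stores a tilde expression as a "formula"
all.vars(frame_spec)   # the variable names, left-hand side first

vertical   <- all.vars(frame_spec[[2]])  # left of the tilde
horizontal <- all.vars(frame_spec[[3]])  # right of the tilde

# The four rows shown in the table, entered by hand
Four_rows <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5)
)

# Wrist lands on the vertical axis, Ankle on the horizontal
plot(frame_spec, data = Four_rows)
```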

When there is no explanatory variable, use `1` on the right-hand side, as in `Wrist ~ 1`.

## Categorical variables and jittering

Each of the horizontal and vertical axes in Figure fig-wrist-ankle-annot represents a numerical variable, with the axis tick-mark labels (e.g. “18”) marking the link between position and numerical value.

Graphical axes can also be used with *categorical* variables, as in Figure fig-height-sex where the horizontal axis represents `sex`. To accomplish this, the axis tick marks show the levels of the categorical variable, for instance **F** and **M**. If we were to follow the mathematical conventions for numerical variables, then each point would be located *exactly* at its respective `sex` value, as in Figure fig-height-sex(a). The space between the labelled tick marks is empty.

With categorical variables, there is a benefit to suspending the mathematical convention, and slightly spreading the points randomly around the labelled position as in Figure fig-height-sex(b). This spreading is called “**jittering**” and makes it easier to see the individual points.
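Jittering can be imitated by hand to see what it does. The following is a minimal sketch in base R, not the method `pointplot()` uses internally: the heights are made-up values, and the jitter width of 0.2 is an arbitrary choice.

```r
set.seed(201)

# Hypothetical heights (inches) for six people
People <- data.frame(
  sex    = factor(c("F", "F", "F", "M", "M", "M")),
  height = c(62, 64, 64, 68, 70, 70)
)

# Without jittering, every F sits exactly at 1 and every M exactly at 2
exact_x <- as.integer(People$sex)

# With jittering, each point is nudged sideways by a small random amount
jittered_x <- exact_x + runif(nrow(People), min = -0.2, max = 0.2)

plot(jittered_x, People$height, xaxt = "n", xlim = c(0.5, 2.5),
     xlab = "sex", ylab = "height")
axis(1, at = 1:2, labels = levels(People$sex))
```

Because the nudges are small, every point stays within the band around its own tick mark, so the category of each point remains unambiguous. Points with identical heights no longer overlap.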


```
Galton |> pointplot(height ~ sex, jitter="none")
Galton |> pointplot(height ~ sex, seed = 201)
```

Figure fig-height-sex is drawn from the `Galton` data frame.

## Color and faceting

Often, there will be more than one explanatory variable of interest. For instance, if there are two explanatory variables, the tilde expression will have two variable names on the right-hand side, for instance `Wrist ~ Ankle + Knee`. Graphically, a third “axis” is needed for the additional explanatory variable.

Mathematicians will point out that, in theory, the three Cartesian axes of 3-dimensional space can each be assigned to one of the three variables. Figure fig-3space-knee shows what this would look like in an interactive 3-D plot. The result is very difficult to make sense of.

Experience has shown that graphics with three variables can be more effective if the third “axis” is represented by **color**.


```
Anthro_F |> pointplot(Wrist ~ Ankle + Knee)
Anthro_F |> pointplot(Wrist ~ Ankle + Knee + Knee)
```

The right-hand panel in Figure fig-wrist-ankle-knee illustrates the technique of “**faceting**.” A facet of a graph is a sub-panel that represents a subset of the data. For instance, the middle panel in Figure fig-wrist-ankle-knee includes just those specimens with knee circumferences in the range 34 cm to 36 cm. Faceting is specified by the third variable (if any) on the right-hand side of the tilde expression. In the expression `Wrist ~ Ankle + Knee + Knee`, we are using `Knee` in two roles: color and faceting. The consequence is that only one color appears in each facet.
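Splitting a data frame into facets by ranges of a numerical variable can be sketched with base R's `cut()` and `split()`. This is an illustration of the idea, not how `pointplot()` chooses its facets: `Four_rows` holds the four table rows shown earlier, and the break points 30, 35, 40, 45 are arbitrary.

```r
# The four rows shown in the table, entered by hand
Four_rows <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

# Assign each row to a range of knee circumference (arbitrary breaks)
Four_rows$knee_range <- cut(Four_rows$Knee, breaks = c(30, 35, 40, 45))

# Each facet is the subset of rows falling in one range
Facets <- split(Four_rows, Four_rows$knee_range)

sapply(Facets, nrow)                          # rows per facet
sapply(Facets, function(d) range(d$Knee))     # Knee values within each facet
```

Within any one facet, the `Knee` values cover only a narrow range, so a color scale that also encodes `Knee` has little room to vary there.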

## Exercises

Explain why there is only one color in each facet in the right-hand panel of Figure fig-wrist-ankle-knee.

Write out the R commands to make these graphics, based on the `Whickham` data frame.

**Graph A**


*Answer*: `Whickham |> pointplot(outcome ~ smoker, ink = 0.5)`

**Graph B**

*Answer*: `Whickham |> pointplot(outcome ~ age, ink = 0.5)`

**Graph C**

*Answer*: `Whickham |> pointplot(age ~ smoker, ink = 0.5)`

The graphic below contains a single data layer. Four of the data points are annotated with letters in order to identify them specifically.

**Part 1**

- Is the income level of “a” greater than “b”?
  *Answer*: no
- Is the income level of “d” greater than “a”?
  *Answer*: no
- Is the number of rooms greater for “b” than for “a”?
  *Answer*: no. Even though the vertical position of “b” is higher than for “a,” they are in the *same jittering band*. All points within a jittering band have the same value of the variable that is being jittered.
- Is the number of rooms greater for “c” than for “a”?
  *Answer*: yes. They are in different jittering bands.

**Part 2**

Here is the data plotted in the figure.

```
row income number_of_rooms
---- ------- ----------------
1 0.90 1
2 1.00 3
3 0.31 3
4 0.85 1
5 1.09 3
6 1.19 2
7 1.01 1
8 1.09 3
9 1.16 2
10 2.86 2
```

The points a, b, c, and d, are shown in the table. For each of a, b, c, d, say which row corresponds to the point. *Answer*: a is row 8, b is row 7, c is row 2, d is row 1

With reference to the graphics frame shown below, indicate whether the variable on each axis is quantitative or categorical.

- Horizontal axis: quantitative or categorical
  *Answer*: categorical
- Vertical axis: quantitative or categorical
  *Answer*: quantitative

Based on the graphic above (which violates our convention of putting statistical annotations on top of the raw data), which group, A or B, has the larger number of instances in the data? Select one.

- Group A has more instances.
- Group B has more instances.
- The two groups have about the same number of instances.
- Violin plots don’t show this information.

*Answer*: Violin plots don’t show this information.

All of the violins shown in a given plot will have the same **area** regardless of the number of points for the group being represented. If the values are spread out (e.g. low density) the violin will be narrow, if they are clumped together (e.g. high density) the violin will be relatively wide. But in comparing two violins, there’s no way to say how many data points fall into each of them.

This is one of the reasons why it’s good to show the raw data along with the statistical annotations.
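The equal-area property can be checked numerically. The sketch below uses base R's `density()`, a kernel density estimate of the kind commonly used to draw violins; whether a particular violin-plotting function scales its shapes this way is an assumption. The two groups are simulated, and `area_under()` is a hypothetical helper.

```r
set.seed(1)

# Two simulated groups with very different numbers of points
small_group <- rnorm(50,   mean = 10, sd = 2)
large_group <- rnorm(5000, mean = 10, sd = 2)

# Area under the estimated density curve, by the trapezoid rule
area_under <- function(values) {
  d <- density(values)
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

area_under(small_group)
area_under(large_group)
```

Both areas come out very close to 1, even though one group has a hundred times as many points as the other. The shape records where the values are concentrated, not how many there are.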

Consider this data frame:

```
HealthGen Age SleepHrsNight
---------- ---- --------------
Vgood 28 9
Vgood 27 8
Vgood 17 6
Good 43 7
Good 27 6
Excellent 36 8
Good 29 6
Good 80 6
Excellent 22 8
Good 54 7
... and so on for 569 rows in total.
```

Here is a plot of the data. The identifying labels have been stripped off for the purpose of this exercise.

- What is the variable used for faceting?
  *Answer*: General health
- What is the variable on the horizontal axis?
  *Answer*: Age
- Is this plot jittered?
  *Answer*: No. Notice that the values of `SleepHrsNight` are discrete integers: 7, 8, 9 and so on. The data rows with each value of `SleepHrsNight` are all plotted at the same vertical position. If there were jittering, points with the same value of `SleepHrsNight` would be spread somewhat in the vertical direction.

The `LST::Butterfly` data frame records world records in the 100- and 200-meter butterfly swimming competition.

- Using `tilde_plot()`, make a graphic that tells an informative story about what world records depend on. When you have a graphic that you like, write a short narrative that guides a human reader through what is revealed by the graphic.
- The races cover different total distances (100 and 200 meters) but a given distance might be divided into multiple “lengths” according to the size of the pool. Make a graphic that clearly shows the effect of having to turn around at the end of each length in order to complete the total distance.

The following one connects the dots with line segments. It needs to be updated for `pointplot()`

The graph below is a violin plot. Using a pencil and your intuition, add a few dozen dots to the graphic as they would appear in a data layer superimposed on the violin layer. The dots should be jittered and be consistent with the shape of the violins.

*Answer*:

Where the violin is wider, there is a greater concentration of dots. In a jittered plot, the exact horizontal position of the dots has no significance.

The `SDSdata::FARS` data table contains statistics on motor-vehicle related fatalities each year in the US. The following command produces a data layer of the number of crashes.

`gf_point(crashes ~ year, data = FARS)`

- For data where there is a time sequence to the points, it can be helpful to guide the eye by connecting the points with a line. You can do this by piping the output of `gf_point()` into the `gf_line()` function. Produce the plot with the points connected by lines.

*Answer*:

```
gf_point(crashes ~ year, data = FARS) |>
gf_line()
```

- Reading the graphic: What is the numerical size of the drop from the year with the highest number of crashes to the year with the lowest number of crashes?
  *Answer*: about 10000 crashes.
- There is a dramatic fall in the number of crashes between 2005 and 2010. But how dramatic? For variables where zero is a meaningful value, as with `crashes`, it can be helpful to include zero on the y-axis. This helps the eye to see not just the change in numbers but the size of that change in proportion to the whole. You can set the scale of the y-axis by adding another function call to the graphing sequence: `gf_lims(y = c(0, 40000))`. Make such a graph.

*Answer*:

```
gf_point(crashes ~ year, data = FARS) |>
gf_line() |>
gf_lims(y = c(0, 40000))
```

Another, more convenient way to create a similar graph is to use `gf_lims(y = c(0, NA))`. Here, the `NA` is an instruction to the computer to figure out what the top limit should be automatically.

- From the graph with a y-axis starting at zero, estimate the proportional change in the number of crashes from the highest value to the lowest value.
*Answer*: a reduction of about 25%

The next exercise needs to be updated to `pointplot()`

The figure in Exercise 2.9 shows the number of fatal motor-vehicle related crashes in the US over the years. There is a substantial drop in number from 2005 to 2010. What might account for this?

There are many possible hypotheses. For instance:

1. Cars became safer in this period.
2. Drunk-driving laws and education programs became more effective.
3. Roads were improved.
4. The number of miles driven fell, reducing the number of accidents.

In this exercise, you’ll make some graphics to explore hypothesis (4).

“Adjust” the number of crashes by the number of miles driven, for instance by dividing one by the other.

::: {.cell}

`FARS <- FARS |> mutate(crash_rate = crashes / vehicle_miles)`

:::

Plot out the crash rate over the years. Does it show a drop from 2005 to 2010 similar to that seen in the plot of the number of crashes?

*Answer*:

- Check whether `crashes` and `vehicle_miles` are related by plotting one versus the other.
- Add a statistics layer showing a straight-line model of `crashes` as a function of `vehicle_miles`. You can do this by piping the data layer into the function `gf_lm()`.
- Add an interval layer by giving an additional argument: `gf_lm(interval = "confidence")`.

The statistical annotations created by `pointplot()` always extend over an interval (or “band”). Traditionally, statisticians have distinguished between two types of statistics:

- **point statistics** are a single number.
- **interval statistics** such as those produced by `pointplot()`.

Often, interval statistics are drawn using an I-beam shape called an “**error bar**” while point statistics are drawn with a point or a horizontal line.


For each graph, state which types of graphical layers appear.

*Answer*: (a) point statistic layer; (b) interval layer; (c) data layer; (d) data and interval layers; (e) point statistic and interval layers; (f) three layers: data, point statistic, and interval.

Figure fig-cat-cat involves two categorical explanatory variables.

Which variable is mapped to the horizontal axis? Which to color?

What is the model value of age for non-smoking survivors?

What are the *levels* of `domhand`?

Each of the following plots has been made by `pointplot()`. The name of the data frame is given. Your job is to write the entire command that will reproduce the plot.

DRAFT

Some plot

Another plot

And so on.

In Figure fig-wrist-ankle-model(b) each of the three facets has points of only one color. Explain why.

Decide which assignment of variables to graphical qualities you think is the most important. Note that we are going to use a convention: response, explanatory1, explanatory 2

`Galton |> pointplot(height ~ mother + sex)`

`Galton |> pointplot(height ~ sex + mother)`

a. In the plot that maps `mother` to the horizontal axis, explain how you would identify a child who is relatively short for their sex but who has a tall mother.

b. In the plot that maps `mother` to color, explain how you would identify a child who, as in (a), is relatively short for their sex but has a tall mother.

There’s an interesting pattern shown in this plot:

Won’t compile to HTML

```
Births |> filter(year %in% c(1980)) |>
  pointplot(births ~ date + wday, size=0.4) |>
  plotly::ggplotly()
```

a. The points split into two main groups based on the number of births each day. Explain in everyday terms what's going on.

b. There are some low-birth dates that are not weekends. Look at the specific date by hovering the cursor over the points. What's going on?

Outlier in `Knee` in Figure fig-wrist-angle. Find the specimen and filter it out.

The “body mass index” (BMI) is a familiar way of defining overweight. (Whether it is useful medically is controversial, but it is widely used.) BMI is an arithmetic combination of height and weight. Using the data in `Anthro_F`, make plots showing the relationship between `BMI`, `Height`, and `Weight`. There are six different ways of defining the graphics frame from three variables, e.g., `Height ~ BMI + Weight` or `Weight ~ Height + BMI`.

a. Three of the six possible frames just swap the x- and y-axes from the other three. Make a list of the three pairs of swapped-axis graphics frames.

b. Select one frame from each of the three pairs in (a) and graph it, producing three graphs.

c. Pick one of the three graphs from (b), whichever you like best, and use it to explain in graphical, everyday terms how BMI is related to height and weight.

As you know, the `.by=` argument to the wrangling verbs causes the operation to be done separately for each group defined by `.by=`.

There is a similar `.by=` argument for `pointplot()`. For instance,

```
Big |> pointplot(flipper ~ mass + species,
                 .by = ~ species)
```

This groupwise splitting up of a graph is called “**faceting**.”

Notice that, unlike the wrangling functions, `.by=` uses a tilde expression. This is because you might sometimes want to facet using two variables, one along the horizontal spread of facets, one along the vertical spread. The tilde-expression format lets you specify which facet is horizontal and which vertical.

Faceting is more sophisticated than merely making a new graph for each group. To illustrate, here is a single data graph just for the Chinstrap species of penguin:

```
Penguins |> filter(species == "Chinstrap") |>
  pointplot(flipper ~ mass + species)
```

Compare the x-y frame for the Chinstrap facet in the top graph to the x-y frame for the Chinstrap-only second graph. What’s different about the x- and y- axes?

Explain what’s nice about the faceting way of setting the bounds of the x- and y-axis.

DRAFT: Using jittering and transparency for quantitative variables. Point out that numerical values are sometimes discrete, as in the number of hours of sleep each night.

`NHANES::NHANES |> gf_point(SleepHrsNight ~ Depressed, point_ink = 0.3)`

`Warning: Removed 2245 rows containing missing values (`geom_point()`).`

`NHANES::NHANES |> gf_jitter(SleepHrsNight ~ Depressed, point_ink = 0.3)`

`Warning: Removed 2245 rows containing missing values (`geom_point()`).`

You’ll need to explain what the `NA` refers to.

DRAFT: a graph of newborn babies weights versus the age of the mother. Use the model annotation to describe the relationship, if any.

`Gestation |> pointplot(wt ~ age, point_ink = 0.1, annot="model")`

`Warning: Removed 2 rows containing missing values (`geom_point()`).`

DRAFT: Re-create the East-India graphic.

Consider this annotated point plot.

`Whickham |> pointplot(age ~ smoker, point_ink = 0.3, annot="violin")`

- What tilde expression was used?
- Which group, smokers or non-smokers, has a greater density of people over age 60?

The `Births2022` data frame records a random sample of 20,000 births in the US in 2022. Two of the variables, `meduc` and `feduc`, give the educational level of the mother and father respectively. The levels of these categorical variables correspond to “eighth grade or less,” “twelfth grade or less,” “high-school graduate,” “high-school graduate plus some college (but no degree),” “associate’s degree,” “bachelor’s degree,” “master’s degree,” and “professional degree” (such as a PhD, EdD, MD, LLB, DDS, JD). Educational data is missing (“NA”) for about 5% of mothers and 15% of fathers.

The graph is a point plot of the mother’s education level versus the father’s.

```
# alpha sets the transparency; ggplot2 has no `point_ink` parameter
ggplot(Births2022, aes(y=meduc, x=feduc)) +
  geom_jitter(alpha = 0.05, size=0.02, height=0.4, width=0.4) +
  theme_bw() +
  theme(aspect.ratio=1, axis.text.x =
          element_text(angle = 45, vjust=0.9, hjust=1)) +
  labs(x="Father's education", y="Mother's education")


- Is this a jittered point plot? Explain briefly how you can tell.
  *Answer*: Yes, it’s jittered both horizontally and vertically. The axis tick marks correspond to discrete categorical levels, but the points themselves are spread out a little bit around the discrete levels.
- Is transparency used? Explain briefly how you can tell.
  *Answer*: Yes. In the blocks with a low number of points, each dot is not a solid color.
- In principle, there are 9 \(\times\) 9 = 81 possible combinations of the mother’s and father’s education. Which combination is the most common? What’s the second most common combination?
  *Answer*: Most common: HS for both mother and father. Second most common: Bachelors for both mother and father.
- Is it more common for a woman with a Bachelor’s degree to marry a man with a high-school degree or vice versa?
  *Answer*: The square at mother=bachelors, father=HS is much darker than the similar square on the other side of the diagonal, that is, at father=bachelors, mother=HS.
- What would the graphic look like if jittering had not been used?
  *Answer*: There would be a single dot at each of the populated intersections, rather than the square cloud of dots seen in the actual graph.

Guides, scales, palettes

Identifying points

[Still in draft]

This works but won’t compile to HTML

```
AAUP |> pointplot(acsal ~ nonacsal + licensed) |>
  plotly::ggplotly()
```

Learning a new way of thinking is genuinely hard. As you learn statistical thinking, it may help to have a concise definition. The following definition captures much of the essence of statistical thinking:

Statistical thinking is the accounting for variation in the context of what remains unaccounted for.

Implicit in this definition is a pathway for learning to think statistically:

- Learn how to measure variation;
- Learn how to account for variation;
- Learn how to measure what remains unaccounted for.

In this Lesson, we will consider graphical ways to display variation.

## Variation

Variation itself is nature’s only irreducible essence. Variation is the hard reality, not a set of imperfect measures for a central tendency. Means and medians are the abstractions. (Stephen Jay Gould, 1941-2002, paleontologist and historian of science)

There is a **vari**ety of words to express differences from specimen-to-specimen, such as the di**ver**se durations of gestation. **Vari**ation is about how things **vary**. **Vari**ance has a non-technical meaning, as in a “zoning variance” which gives permission to depart from zoning rules. For us, **variance** will always be used in a technical sense: a number summarizing **vari**ation of the values in a **vari**able. Whenever you see the stem “**var**”, you should be thinking of specimen-to-specimen dif**fer**ences.
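To give a first feel for variance as a single number summarizing variation, here is a small sketch using R's built-in `var()` function. The two sets of gestation lengths are made up for the illustration; they are not taken from any data frame in these *Lessons*.

```r
# Two made-up sets of gestation lengths (days) with the same center
tight  <- c(278, 280, 282)   # specimens differ only a little
spread <- c(250, 280, 310)   # specimens differ a lot

var(tight)    # small number: little specimen-to-specimen difference
var(spread)   # large number: much specimen-to-specimen difference
```

Both sets are centered on 280 days, so the difference between the two results reflects only how much the specimens differ from one another.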

To illustrate variation, let’s consider a process fundamental to human life: gestation. We all know that human pregnancy “typically” lasts around nine-months but differs unpredictably from one birth to another.

Figure fig-gestation-jitter shows data from the `Gestation` data frame. In this data frame, each of the 1200 rows is one pregnancy and birth about which several measurements were made. The `gestation` variable records the length of the pregnancy (in days).


```
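# Relabel parity as a two-level categorical variable
Gestation <- Gestation |>
  mutate(parity = ifelse(parity == 0, "first-time", "previous-preg"))
# Jitter horizontally only; alpha sets the transparency
Plot1 <- Gestation |>
  ggplot(aes(x=parity, y=gestation)) +
  geom_jitter(alpha = 0.2, width=0.2, height=0)
Plot1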
```

Figure fig-gestation-jitter divides the 1200 births in the `Gestation` data frame according to the variable `parity`, which describes whether or not the pregnancy is the mother’s first.

The variation in `gestation` is evident directly from the dots in the graph. One strategy for describing variation is to specify an **interval**: the span between a lower and an upper value. For instance,

- The large majority of pregnancies last between 250 and 310 days. Or,
- The majority of pregnancies are between 275 and 290 days.
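One way to arrive at such intervals is to ask for the values that bracket a chosen fraction of the data. The sketch below does this with base R's `quantile()`. The gestation lengths are simulated (a bell-shaped spread around 280 days is an assumption), so the numbers will not match the real `Gestation` data frame; the fractions 95% and 50% are arbitrary choices standing in for “large majority” and “majority.”

```r
set.seed(42)

# Simulated gestation lengths in days (NOT the real Gestation data)
gestation_sim <- rnorm(1200, mean = 280, sd = 16)

# Interval covering the central 95% of the values
wide <- quantile(gestation_sim, c(0.025, 0.975))

# Interval covering the central 50% of the values
narrow <- quantile(gestation_sim, c(0.25, 0.75))

# Check: what fraction of values falls inside each interval?
coverage_wide   <- mean(gestation_sim >= wide[1]   & gestation_sim <= wide[2])
coverage_narrow <- mean(gestation_sim >= narrow[1] & gestation_sim <= narrow[2])
```

The narrower interval sits inside the wider one: covering more of the data requires a longer span.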

A more subtle description avoids setting hard bounds in favor of saying which durations are common and which not. This common-or-not description is called a “**distribution**.” The “**histogram**” is a famous style of presentation of a distribution. Even elementary-school students are introduced to histograms; they are easy to draw.

There are good reasons to avoid the busy display of a histogram. For instance, we want to be able to show relationships between variables and we want, whenever possible, to put the graphical summaries of data as a layer on top of the data themselves. And we have the computer as a tool for making graphics. Consequently, our preferred format for displaying distributions is a smooth shape, oriented along the vertical axis. The width of the shape expresses how common is the corresponding region of the vertical axis. The word “density” is often used when talking about distributions. Where the data points are closely spaced to one another, the density is high. Where data points are sparse, the density is low. You can see the density at any level of the vertical axis, just as you can read by eye the density of tufts of grass sprouting in a newly tilled field.
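The idea of density can be sketched with base R's `density()` function, which estimates how common each region of values is. The data here are simulated rather than taken from the `Gestation` data frame, and `density_at()` is a hypothetical helper for reading the curve at a given value.

```r
set.seed(42)

# Simulated gestation lengths in days (NOT the real Gestation data)
gestation_sim <- rnorm(1200, mean = 280, sd = 16)

# Estimate the density along the range of the data
d <- density(gestation_sim)

# Read off the estimated density at a particular duration
density_at <- function(value) approx(d$x, d$y, xout = value)$y

density_at(280)   # near the center, where points are closely spaced
density_at(240)   # far from the center, where points are sparse

# Draw the density as a shape oriented along the vertical axis:
# the width at each height shows how common that duration is
plot(c(-d$y, rev(d$y)), c(d$x, rev(d$x)), type = "l",
     xlab = "width = density", ylab = "simulated gestation (days)")
```

The outline is widest where the simulated durations cluster and tapers where they thin out.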

Figure fig-violin-intro shows the density display layered on top of the pregnancy data. For reasons that may be evident, this sort of display is called a “**violin plot**.”


```
# Layer a violin on top of the jittered points; alpha sets the transparency
Plot1 +
  geom_violin(aes(group=parity),
              fill="blue", alpha = 0.65, color=NA)
```

The shapes of the two violins in Figure fig-violin-intro are similar, suggesting that the variation in the duration of pregnancy is about the same for first-time mothers as for mothers in a second or later pregnancy.
