# 2 Data graphics

Year | Exports | Imports |
---|---|---|
1700 | 180 | 460 |
1701 | 170 | 480 |
1702 | 160 | 490 |
1703 | 150 | 500 |
1704 | 145 | 510 |
1705 | 140 | 525 |
1706 | 135 | 550 |
1707 | 125 | 565 |
1708 | 120 | 580 |
1709 | 110 | 590 |
1710 | 105 | 625 |
1711 | 105 | 650 |
1712 | 100 | 680 |
1713 | 100 | 710 |
1714 | 100 | 725 |
1715 | 100 | 755 |

A data frame is a specific way of organizing and storing data. To see the “big picture,” however, it can help to organize the data in other ways: drawing a literal picture of the data. We call such pictorial presentations **data graphics**.

Making pictures of data is a relatively modern idea. William Playfair (1759-1823) is credited as the inventor of novel graphical forms in which data values are presented graphically, rather than as numbers or text. To illustrate, consider the data from the 1700s (Table 2.1) that Playfair turned into a picture.

Playfair’s innovation, as in Figure 2.1, was successful because it was powerful. The pattern that is latent in the data frame becomes visually obvious to the human viewer. The picture shows not only the trade values each year but also the *trends* across the decades.
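To see how a picture exposes the pattern, here is a minimal base-R sketch that draws the sixteen rows of Table 2.1 as two lines. The data-frame name `Trade` and the derived `Gap` column are names chosen for this illustration; they are not part of Playfair's data or of the `{math300}` package, and the plot is not a reproduction of Playfair's own figure.

```r
# The sixteen rows of Table 2.1, typed in by hand.
# `Trade` is a name made up for this sketch.
Trade <- data.frame(
  Year    = 1700:1715,
  Exports = c(180, 170, 160, 150, 145, 140, 135, 125,
              120, 110, 105, 105, 100, 100, 100, 100),
  Imports = c(460, 480, 490, 500, 510, 525, 550, 565,
              580, 590, 625, 650, 680, 710, 725, 755)
)

# One line per column, with a vertical axis wide enough for both.
plot(Imports ~ Year, data = Trade, type = "l",
     ylim = range(Trade$Exports, Trade$Imports),
     ylab = "Trade value")
lines(Exports ~ Year, data = Trade, lty = 2)
legend("topleft", legend = c("Imports", "Exports"), lty = c(1, 2))

# The widening space between the two lines, computed row by row.
Trade$Gap <- Trade$Imports - Trade$Exports
```

In the table, the same fact takes sixteen subtractions to discover: `Trade$Gap` grows in every year. In the picture, the two lines simply move apart.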

The American Revolution is marked out by the graph; you can see the steady fall in English exports from 1775 to 1780, corresponding to the American boycott during the revolution. Exports pick up again after the revolution, but English imports increase even more rapidly, leading to a steadily expanding trade deficit by 1800. The historical consequences of this deficit are profound, with continuing implications. (See https://dtkaplan.github.io/Math300blog/posts/graphics-and-history/.)

Data graphics are becoming an important way for ordinary citizens to find out what’s happening in the world. It’s worthwhile to study collections of data graphics to see the creativity and range of approaches of graph-makers. Some examples: how people spend their day, life expectancy, wind patterns (right now!), historical sources of death.

The graphics found in statistics textbooks (Figure 2.3) are often highly stylized and don’t show data directly. Some forms—pie charts and bar charts—were introduced by Playfair more than 200 years ago. One of the primary motivations for the graphical forms in Figure 2.3 is that they *can be drawn easily by hand* or typewritten.

The types of graphics in Figure 2.3 can be effective pedagogical tools for teaching pupils about numbers and their representation. But our purpose in these *Lessons* is different: to display data directly along with guides to interpreting the possible patterns in the data.

## Annotated point plots

In the 200+ years since trade was first graphed, many different formats for drawing pictures of data have been invented. Two of these, the pie chart and the bar chart, were invented by Playfair himself. (See ?fig-playfair-pie.)

This *Lesson* introduces a powerful form of data graphics that is particularly well suited to support statistical thinking: the **annotated point plot**. A point plot provides a visual display of a data frame. The annotations summarize specific patterns in the data.

A point plot contains a simple mark for each row of a data frame. Two selected columns of the data frame are depicted as the vertical and horizontal axes of the **graphics frame**. For the sake of simplicity, we will use the `pointplot()` function (from the `{math300}` package) to start making data graphics. `pointplot()` requires two inputs:

- a data frame.
- a “**tilde expression**” specifying which variables from the data frame are to be rendered in graphical form.

To illustrate, consider the `Anthro_F` data frame, which records, among other variables, the wrist, ankle, and knee circumference of 184 college-aged women. Figure 2.4 shows a point plot of wrist versus ankle circumference.

Wrist | Ankle | Knee |
---|---|---|
18.4 | 23.5 | 37.5 |
13.5 | 18.0 | 32.3 |
18.0 | 22.5 | 38.5 |
19.0 | 24.5 | 41.5 |

`Anthro_F |> pointplot(Wrist ~ Ankle)`

Each dot in Figure 2.4 reflects one row of the `Anthro_F` data frame and is placed at coordinate (`Ankle`, `Wrist`). Table 2.2 shows four rows from `Anthro_F`; you should be able to locate the four corresponding dots in the Figure 2.4 point plot.
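As a rough check on the “one dot per row” idea, the four rows of Table 2.2 can be plotted with base R's `plot()`, which happens to accept the same kind of tilde expression. This is only a sketch: it uses just those four rows (stored under the made-up name `Four`), not the full `Anthro_F` data frame, and it is not the function used to draw Figure 2.4.

```r
# The four rows shown in Table 2.2; `Four` is a name invented for this sketch.
Four <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

# Wrist on the vertical axis, Ankle on the horizontal axis: one dot per row.
plot(Wrist ~ Ankle, data = Four, pch = 19)

# Label each dot with its row number so dots can be matched to rows.
text(Four$Ankle, Four$Wrist, labels = seq_len(nrow(Four)), pos = 3)
```

The second row, with the smallest ankle and the smallest wrist, lands at the lower left of the frame.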

The computer command used to create Figure 2.4 is typical of the commands you will use throughout these lessons. Let’s highlight the components of the command.

\(\underbrace{\texttt{Anthro_F}}_\text{data frame}\ \ \color{orange}{\underbrace{\Large\texttt{|}\!\texttt{>}}_{\text{pipe}}} \ \ \ \color{green}{\underbrace{\texttt{pointplot(}}_\texttt{function}}\ \color{blue}{\underbrace{\texttt{Wrist}\ _{\LARGE{\texttt{~}}}\ \texttt{Ankle}}_\text{argument}}\ \color{green}{\texttt{)}}\)

The point of the whole command is to perform a *function* on a *data frame*. Every function has an identifying name, here `pointplot()`. The data frame `Anthro_F` is being piped into the function. The purpose of the function named `pointplot()` is to generate a data graphic.

It’s often the case that additional instructions are needed to describe exactly what the function is to do. You place such instructions—called “**arguments**”—inside the pair of parentheses that follow the function name.
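The pipe itself is ordinary R (version 4.1 or later): `x |> f(y)` is another way of writing `f(x, y)`, so the piped data frame becomes the function's first input and whatever sits inside the parentheses follows it. Here is a small sketch, with the base-R function `head()` standing in for a graphics function and a made-up data frame `Four` holding the rows of Table 2.2.

```r
# `Four` is a name invented for this sketch; the values are from Table 2.2.
Four <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

piped  <- Four |> head(2)   # data frame piped in; 2 is the argument
direct <- head(Four, 2)     # the same call written without the pipe

identical(piped, direct)    # the two forms give the same result
```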

The critical instruction needed for the `pointplot()` function is what variables to use: which one goes on the vertical axis and which one on the horizontal. Such a which-variable-to-use instruction is written in the form of a “**tilde expression**.” The word “tilde” is the name of the wavy character `~`. The variable name to the left of the tilde goes on the vertical axis; the variable name to the right of the tilde goes on the horizontal axis. The variable names used must correspond to the names in the data frame being piped into the function.
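In R, a tilde expression is an object of class `formula`, and base R can pull the variable names out of each side. The helper below, `axes_of()`, is a hypothetical illustration of the left-is-vertical, right-is-horizontal convention; it is not how the plotting function is implemented.

```r
# Hypothetical helper: report which variable a two-sided tilde expression
# assigns to each axis, following the convention described in the text.
axes_of <- function(spec) {
  stopifnot(inherits(spec, "formula"), length(spec) == 3)
  list(
    vertical   = all.vars(spec[[2]]),     # name left of the tilde
    horizontal = all.vars(spec[[3]])[1]   # first name right of the tilde
  )
}

spec <- Wrist ~ Ankle
class(spec)     # a tilde expression is a "formula" object
axes_of(spec)
```

Note that nothing in the tilde expression refers to a particular data frame. The names are only looked up later, which is why they must match the column names of whatever data frame is piped in.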

Figure 2.4 is an un-annotated point plot. To add an annotation, an additional instruction must be provided as a second argument to the function. To illustrate, here’s the command to create an annotated point plot:

`Anthro_F |> pointplot(Wrist ~ Ankle, annot = "model")`

The argument `annot = "model"` (with “model” in quotes) directs `pointplot()` to look for a relationship between the variables shown in the plot and to graph that relationship. Many of the following *Lessons* are devoted to understanding what a model is and what it shows you, so we won’t go into any detail here. For now, note the graphical form of the model: not a dot but a band.

## Categorical variables and jittering

Each of the horizontal and vertical axes in Figure 2.5 represents a numerical variable, with the axis tick-mark labels (e.g. “18”) marking the link between position and numerical value.

Graphical axes can also be used with *categorical* variables, as in Figure 2.6 where the horizontal axis represents `sex`. To accomplish this, the axis tick marks show the levels of the categorical variable, for instance **F** and **M**. If we were to follow the mathematical conventions for numerical variables, then each point would be located *exactly* at its respective `sex` value, as in Figure 2.6(a). The space between the labelled tick marks is empty.

With categorical variables, there is a benefit to suspending the mathematical convention and spreading the points slightly and randomly around the labelled position, as in Figure 2.6(b). This spreading is called “**jittering**.” Categorical variables have “**discrete**” values, so without jittering many points are drawn on top of one another; jittering makes it easier to see the individual points.
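A minimal sketch of the idea behind jittering, in base R: give each level of the categorical variable a numeric position, then add a small random horizontal offset to every point. The data here are invented for illustration, and the offset width of 0.2 is an arbitrary choice, not necessarily the setting any particular plotting function uses.

```r
set.seed(201)  # make the random offsets reproducible

# Invented data: a categorical variable and a numerical one.
sex    <- factor(c("F", "F", "F", "M", "M", "M"))
height <- c(64, 64, 65.5, 69, 70, 70)

position <- as.integer(sex)   # F -> 1, M -> 2
jittered <- position + runif(length(position), min = -0.2, max = 0.2)

# Draw at the jittered positions but label the axis with the levels.
plot(jittered, height, xaxt = "n", xlim = c(0.5, 2.5),
     xlab = "sex", pch = 19)
axis(1, at = seq_along(levels(sex)), labels = levels(sex))
```

The two F points with height 64 would coincide exactly without the offset. The offset changes only where a dot is drawn, not the value of `sex` it stands for.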


```
Galton |> pointplot(height ~ sex, jitter="none")
Galton |> pointplot(height ~ sex, seed = 201)
```

Figure 2.6 uses the `Galton` data frame.

## Color and faceting

Often, there will be more than one explanatory variable of interest. For instance, if there are two explanatory variables, the tilde expression will have two variable names on the right-hand side, for instance `Wrist ~ Ankle + Knee`. Graphically, a third “axis” is needed for the additional explanatory variable.

Mathematicians will point out that, in theory, each of the three Cartesian axes in 3-dimensional space can be assigned to one of three variables. Figure 2.7 shows what this would look like in an interactive 3-D plot. The result is very difficult to make sense of.

Experience has shown that graphics with three variables are more effective if the third “axis” is represented by **color**.
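One way to sketch this in base R is to translate the third variable into a position along a color palette. The example uses only the four rows of Table 2.2 (under the made-up name `Four`); the palette and the 100-step scale are arbitrary choices for illustration, not those of any particular plotting function.

```r
# `Four` is a name invented for this sketch; the values are from Table 2.2.
Four <- data.frame(
  Wrist = c(18.4, 13.5, 18.0, 19.0),
  Ankle = c(23.5, 18.0, 22.5, 24.5),
  Knee  = c(37.5, 32.3, 38.5, 41.5)
)

palette_steps <- hcl.colors(100)   # 100 colors running along one palette

# Rescale Knee to 1..100 so the smallest knee gets the first color
# and the largest knee gets the last.
knee_range <- range(Four$Knee)
step <- 1 + round(99 * (Four$Knee - knee_range[1]) / diff(knee_range))

# Position shows Ankle and Wrist; color shows Knee.
plot(Wrist ~ Ankle, data = Four, pch = 19, col = palette_steps[step])
```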


```
Anthro_F |> pointplot(Wrist ~ Ankle + Knee, annot="model")
Anthro_F |> pointplot(Wrist ~ Ankle + Knee + Knee, annot="model")
```

Another effective way to represent a third variable is with **facets**. The variable to facet by goes in the third right-hand slot of the tilde expression.

## Violins for density

`Births2022 |> pointplot(mage ~ meduc, alpha=.01, size=0.1, annot="violin")`

```
Warning in pointplot(Births2022, mage ~ meduc, alpha = 0.01, size = 0.1, : x-axis variable is numerical, so only one violin drawn for all rows.
Perhaps you want to use ntiles() or factor() on that variable?
```

Are hardcover (H) books more likely to have many pages than paperback (P) books?

`moderndive::amazon_books |> pointplot(num_pages ~ hard_paper, alpha=0.1, annot="violin")`

```
Warning in pointplot(moderndive::amazon_books, num_pages ~ hard_paper, alpha = 0.1, : x-axis variable is numerical, so only one violin drawn for all rows.
Perhaps you want to use ntiles() or factor() on that variable?
```

`Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).`

`Warning: Removed 2 rows containing missing values (`geom_point()`).`

## Graphics frame

The data frame provides our standard organization of data. As you know, it consists of rows and columns. Each row is one specimen (also known as “unit of observation”). Each column is a variable, consisting of a series of **values**, one for each row. The values are either numerical or text: text for a categorical variable, numbers for a quantitative variable.

Another way to represent graphically the value of a variable is by showing discrete **facets**: mini-graphs that each show the data that fall into a particular value or range of a variable. This creates the possibility of representing a fourth or fifth variable in the graphic. In practice, however, such multi-variable graphics are difficult for humans to comprehend, which defeats much of the purpose of displaying data in graphical as opposed to spreadsheet form.

In these Lessons, a typical data graphic will represent two or three variables, using in order of precedence: vertical position, horizontal position, and last, color.

```
Anthro_F |> pointplot(Wrist ~ Ankle + Knee, alpha=0.5)
Anthro_F |> pointplot(Knee ~ Wrist + Ankle, alpha=0.5)
Anthro_F |> pointplot(Ankle ~ Knee + Wrist, alpha=0.5)
```

Figure: three point plots from the `Anthro_F` data frame, one for each of the tilde expressions `Wrist ~ Ankle + Knee`, `Knee ~ Wrist + Ankle`, and `Ankle ~ Knee + Wrist`.

It is usually most effective to show a relationship between two variables by placing them on the two axes. To judge from the plots, people with small wrists tend to have small ankles, people with small knees tend to have small wrists, and people who have small ankles tend to have small knees.

It takes some practice to comprehend relationships involving quantitative variables depicted with color. But color is a good choice for categorical variables with a handful of levels.

`Whickham |> pointplot(outcome ~ age + smoker)`

```
Births |>
  pointplot(births ~ date + wday)
```

## Statistical annotations

[STILL IN DRAFT]

Show violins, means and model values, intervals, confidence bands

`Galton |> pointplot(height ~ 1, annot="violin", alpha=0.3)`

```
Warning in pointplot(Galton, height ~ 1, annot = "violin", alpha = 0.3): x-axis variable is numerical, so only one violin drawn for all rows.
Perhaps you want to use ntiles() or factor() on that variable?
```

`Galton |> pointplot(height ~ sex, annot="violin", alpha=0.1)`

```
Warning in pointplot(Galton, height ~ sex, annot = "violin", alpha = 0.1): x-axis variable is numerical, so only one violin drawn for all rows.
Perhaps you want to use ntiles() or factor() on that variable?
```

## Distributions and density

For many people, the dots drawn in a point plot (or jitter plot) are reminiscent of seeds or pebbles scattered across an area. With this in mind, one way to interpret some aspects of point plots is in terms of the “**density**” of data points: density is high in some areas, lower in others, and negligible or nil in still others.

In general, “density” refers to a ratio: a count or amount per unit of space. In point plots, the “unit of space” is area. A high-density region has many data dots in each patch of area. Evidently, many people can perceive density in a point plot without any need to count, measure area, or calculate the ratio; it is an intuitive mode of perception.
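The ratio can be made concrete with a few lines of R. The function `patch_density()` below is a hypothetical helper written for this illustration: it counts the points that fall inside a rectangular patch and divides by the patch's area. The points are invented.

```r
# Hypothetical helper: points per unit area inside a rectangular patch.
patch_density <- function(x, y, xlim, ylim) {
  inside <- x >= xlim[1] & x <= xlim[2] & y >= ylim[1] & y <= ylim[2]
  area   <- diff(xlim) * diff(ylim)
  sum(inside) / area
}

# Invented points: six in the lower-left unit square, two farther out.
x <- c(0.1, 0.2, 0.4, 0.5, 0.7, 0.9, 1.5, 2.5)
y <- c(0.3, 0.8, 0.5, 0.1, 0.6, 0.9, 1.5, 0.5)

dense  <- patch_density(x, y, xlim = c(0, 1), ylim = c(0, 1))  # 6 points, area 1
sparse <- patch_density(x, y, xlim = c(1, 3), ylim = c(0, 2))  # 2 points, area 4
```

The eye makes the same comparison without the arithmetic: the small square looks crowded, the larger rectangle nearly empty.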

Figure 2.10 is a made-up point plot with five patches of different densities. The densities are 25, 50, 100, 200, and 400 points per unit area. Many people would find it easy and immediate to point out the least and most dense patches and even to put the patches in order by density. However, people are hard put to quantify even the *relative* densities. For instance, the largest patch has a smaller density than the next largest patch, but quantifying this by eye (without being told the densities) is not really possible.

In a point plot, the density tells us about the center and fringes. [Show a couple of jittered point plots of a normal and exponential distribution, and a bimodal distribution, and narrate them.]

Our eye can give a qualitative estimate of relative density, but not a precise quantitative one. Our graphical perception is much more precise when it comes to length or width. Ingeniously, designers of statistical graphics have created a device to display the density not in its native way but as a **length**.

For the reader this makes it easy to see small differences in density to which we would otherwise be insensitive.

It’s also a source of confusion, since width is being used when the real matter of interest is density.
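To make the device concrete, here is a base-R sketch of how such a shape (a “violin”) can be drawn: estimate the density of the values with `density()`, then use that estimate as the half-width of a shape mirrored around a center line. The data are simulated and the scaling is arbitrary; this shows the general idea, not necessarily how any particular plotting function draws its violins.

```r
set.seed(1)
values <- rnorm(500, mean = 66, sd = 3)   # simulated data, for illustration only

# est$x: positions along the value axis; est$y: estimated density there.
est <- density(values)

# Half-width proportional to density, scaled so the widest part is 0.4.
half_width <- 0.4 * est$y / max(est$y)

# Mirror the half-width around a vertical center line at 0.
plot(NULL, xlim = c(-0.5, 0.5), ylim = range(est$x),
     xaxt = "n", xlab = "", ylab = "value")
polygon(c(-half_width, rev(half_width)), c(est$x, rev(est$x)),
        col = "gray80")
```

Where the shape is wide, the dots are dense; where it narrows to a line, there are few or none. The horizontal extent carries no information about any horizontal variable, which is the source of confusion just mentioned.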

## Examples

```
natality2014::Natality_2014_10k |>
  pointplot(dbwt ~ ntiles(combgest, 5, format="interval") + sex,
            size=0.1, alpha=0.1, model_alpha=1, annot="violin")
```

```
natality2014::Natality_2014_10k |>
  pointplot(dbwt ~ splines::ns(combgest, 4) + sex,
            size=0.1, alpha=0.1, model_alpha=1, annot="model")
```

Why the leveling off for very long pregnancies? Perhaps they are very long only if the fetus is relatively small. Or perhaps the length of gestation has been overstated by a month.

`Galton |> pointplot(height ~ mother * sex * father, annot="model", alpha=0.5, size=0.5, model_alpha=0.5)`

## Data graphics

DRAFT DRAFT DRAFT

Show examples of data graphics and distinguish them from statistical annotations.

## Exercises

Figure 9.3 involves two categorical explanatory variables.

Which variable is mapped to the horizontal axis? Which to color?

What is the model value of age for non-smoking survivors?

What are the *levels* of `domhand`?

Each of the following plots has been made by `pointplot()`. The name of the data frame is given. Your job is to write the entire command that will reproduce the plot.

DRAFT

Some plot

Another plot

And so on.

In ?fig-wrist-ankle-model(b) each of the three facets has points of only one color. Explain why.

Decide which assignment of variables to graphical qualities you think is the most important. Note that we are going to use a convention: response, explanatory1, explanatory 2

`Galton |> pointplot(height ~ mother + sex)`

`Galton |> pointplot(height ~ sex + mother)`

a. In the plot that maps `mother` to the horizontal axis, explain how you would identify a child who is relatively short for their sex but who has a tall mother.

b. In the plot that maps `mother` to color, explain how you would identify a child who, as in (a), is relatively short for their sex but has a tall mother.

There’s an interesting pattern shown in this plot:

Won’t compile to HTML

```
Births |> filter(year %in% c(1980)) |>
  pointplot(births ~ date + wday, size=0.4) |>
  plotly::ggplotly()
```

a. The points split into two main groups based on the number of births each day. Explain in everyday terms what's going on.

b. There are some low-birth dates that are not weekends. Look at the specific date by hovering the cursor over the points. What's going on?

Outlier in `Knee` in ?fig-wrist-angle. Find the specimen and filter it out.

The “body mass index” (BMI) is a familiar way of defining overweight. (Whether it is useful medically is controversial, but it is widely used.) BMI is an arithmetic combination of height and weight. Using the data in `Anthro_F`, make plots showing the relationship between `BMI`, `Height`, and `Weight`. There are six different ways of defining the graphics frame from three variables, e.g., `Height ~ BMI + Weight` or `Weight ~ Height + BMI`.

a. Three of the six possible frames just swap the x- and y-axes from the other three. Make a list of the three pairs of swapped-axis graphics frames.

b. Select one frame from each of the three pairs in (a) and graph it, producing three graphs.

c. Pick one of the three graphs from (b), whichever you like best, and use it to explain in graphical, everyday terms how BMI is related to height and weight.

As you know, the `.by=` argument to the wrangling verbs causes the operation to be done separately for each group defined by `.by=`.

There is a similar `.by=` argument for `pointplot()`. For instance,

```
Big |> pointplot(flipper ~ mass + species,
                 .by = ~ species)
```

This groupwise splitting up of a graph is called “**faceting**.”

Notice that, unlike the wrangling functions, `.by=`

uses a tilde expression. This is because you might sometimes want to facet using two variables, one along the horizontal spread of facets, one along the vertical spread. The tilde-expression format lets you specify which facet is horizontal and which vertical.

Faceting is more sophisticated than merely making a new graph for each group. To illustrate, here is a single data graph just for the Chinstrap species of penguin:

```
Penguins |> filter(species == "Chinstrap") |>
  pointplot(flipper ~ mass + species)
```

Compare the x-y frame for the Chinstrap facet in the top graph to the x-y frame for the Chinstrap-only second graph. What’s different about the x- and y- axes?

Explain what’s nice about the faceting way of setting the bounds of the x- and y-axis.

DRAFT: Using jittering and transparency for quantitative variables. Point out that numerical values are sometimes discrete, as in the number of hours of sleep each night.

`NHANES::NHANES |> gf_point(SleepHrsNight ~ Depressed, alpha=0.3)`

`Warning: Removed 2245 rows containing missing values (`geom_point()`).`

`NHANES::NHANES |> gf_jitter(SleepHrsNight ~ Depressed, alpha=0.3)`

`Warning: Removed 2245 rows containing missing values (`geom_point()`).`

You’ll need to explain what the `NA` refers to.

DRAFT: a graph of newborn babies’ weights versus the age of the mother. Use the model annotation to describe the relationship, if any.

`Gestation |> pointplot(wt ~ age, alpha=0.1, annot="model")`

`Warning: Removed 2 rows containing missing values (`geom_point()`).`

DRAFT: Re-create the East-India graphic.

Consider this annotated point plot.

`Whickham |> pointplot(age ~ smoker, alpha=0.3, annot="violin")`

- What tilde expression was used?
- Which group, smokers or non-smokers, has a greater density of people over age 60?

The `Births2022` data frame records a random sample of 20,000 births in the US in 2022. Two of the variables, `meduc` and `feduc`, give the educational level of the mother and father respectively. The levels of these categorical variables correspond to “eighth grade or less,” “twelfth grade or less,” “high-school graduate,” “high-school graduate plus some college (but no degree),” “associate’s degree,” “bachelor’s degree,” “master’s degree,” and “professional degree” (such as a PhD, EdD, MD, LLB, DDS, JD). Educational data is missing (“NA”) for about 5% of mothers and 15% of fathers.

The graph is a point plot of the mother’s education level versus the father’s.

```
ggplot(Births2022, aes(y=meduc, x=feduc)) +
geom_jitter(alpha=0.05, size=0.02, height=0.4, width=0.4) +
theme_bw() +
theme(aspect.ratio=1, axis.text.x =
element_text(angle = 45, vjust=0.9, hjust=1)) +
labs(x="Father's education", y="Mother's education")
```

- Is this a jittered point plot? Explain briefly how you can tell.
  *Answer*: Yes, it’s jittered both horizontally and vertically. The axis tick marks correspond to discrete categorical levels, but the points themselves are spread out a little bit around the discrete levels.
- Is transparency used? Explain briefly how you can tell.
  *Answer*: Yes. In the blocks with a low number of points, each dot is not a solid color.
- In principle, there are 9 \(\times\) 9 = 81 possible combinations of the mother’s and father’s education. Which combination is the most common? What’s the second most common combination?
  *Answer*: Most common: HS for both mother and father. Second most common: Bachelors for both mother and father.
- Is it more common for a woman with a Bachelor’s degree to marry a man with a high-school degree or vice versa?
  *Answer*: The square at mother=bachelors, father=HS is much darker than the similar square on the other side of the diagonal, that is, at father=bachelors, mother=HS.
- What would the graphic look like if jittering had not been used?
  *Answer*: There would be a single dot at each of the populated intersections, rather than the square cloud of dots seen in the actual graph.

The graphic below contains a single data layer. Four of the data points are annotated with letters in order to identify them specifically.

**Part 1**

- Is the income level of “a” greater than “b”?
  *Answer*: no
- Is the income level of “d” greater than “a”?
  *Answer*: no
- Is the number of rooms greater for “b” than for “a”?
  *Answer*: no. Even though the vertical position of “b” is higher than for “a,” they are in the *same jittering band*. All points within a jittering band have the equivalent value in terms of the variable that is being jittered.
- Is the number of rooms greater for “c” than for “a”?
  *Answer*: yes. They are in different jittering bands.

**Part 2**

Here is the data plotted in the figure.

```
row income number_of_rooms
---- ------- ----------------
1 0.90 1
2 1.00 3
3 0.31 3
4 0.85 1
5 1.09 3
6 1.19 2
7 1.01 1
8 1.09 3
9 1.16 2
10 2.86 2
```

The points a, b, c, and d, are shown in the table. For each of a, b, c, d, say which row corresponds to the point. *Answer*: a is row 8, b is row 7, c is row 2, d is row 1

Guides, scales, palettes

Identifying points

[Still in draft]

This works but won’t compile to HTML

```
AAUP |> pointplot(acsal ~ nonacsal + licensed) |>
  plotly::ggplotly()
```