2  Data graphics

Table 2.1: Annual exports and imports in the trade between England and the East Indies
Year Exports Imports
1700 180 460
1701 170 480
1702 160 490
1703 150 500
1704 145 510
1705 140 525
1706 135 550
1707 125 565
1708 120 580
1709 110 590
1710 105 625
1711 105 650
1712 100 680
1713 100 710
1714 100 725
1715 100 755

A data frame is a specific way of organizing and storing data. To see the “big picture,” however, it can help to organize the data in other ways: drawing a literal picture of the data. We call such pictorial presentations data graphics.

Making pictures of data is a relatively modern idea. William Playfair (1759-1823) is credited as the inventor of novel graphical forms in which data values are presented graphically, rather than as numbers or text. To illustrate, consider the data from the 1700s (Table 2.1) that Playfair turned into a picture.

Playfair’s innovation, as in Figure 2.1, was successful because it was powerful. The pattern that is latent in the data frame becomes visually obvious to the human viewer. The picture shows not only the trade values each year but also the trends across the decades.

Figure 2.1: William Playfair’s 1801 presentation of year-by-year data on trade between England and the East Indies. Source: University of Pennsylvania Libraries
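To make the contrast between table and picture concrete, here is a minimal sketch in base R that types in Table 2.1 as a data frame and draws the two columns as lines against the year. The names `Trade` and `Deficit` are chosen only for this illustration; the `Deficit` column is computed here and is not part of Table 2.1. This is not a reconstruction of Playfair's drawing, only the same idea applied to the tabulated values.

```r
# Table 2.1 typed in as a data frame (values copied from the table).
Trade <- data.frame(
  Year    = 1700:1715,
  Exports = c(180, 170, 160, 150, 145, 140, 135, 125,
              120, 110, 105, 105, 100, 100, 100, 100),
  Imports = c(460, 480, 490, 500, 510, 525, 550, 565,
              580, 590, 625, 650, 680, 710, 725, 755)
)

# The gap between the two columns (computed here, not in Table 2.1).
Trade$Deficit <- Trade$Imports - Trade$Exports

# One line per column, drawn against the year.
matplot(Trade$Year, Trade[, c("Exports", "Imports")],
        type = "l", lty = 1, col = c("black", "red"),
        xlab = "Year", ylab = "Trade value")
legend("topleft", legend = c("Exports", "Imports"),
       lty = 1, col = c("black", "red"))
```

In the table, the widening gap has to be worked out row by row. In the picture, it shows up as two lines moving apart.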

The American Revolution is marked out by the graph: you can see the steady fall in English exports from 1775 to 1780, corresponding to the American boycott during the revolution. Exports pick up again after the revolution, but English imports increase even more rapidly, leading to a steadily expanding trade deficit by 1800. The historical consequences of this deficit are profound, with continuing implications. (See https://dtkaplan.github.io/Math300blog/posts/graphics-and-history/.)

Figure 2.2: A data graphic showing the change in the age distribution of the population from 1990 to 2050 (projected). For the full graphic, follow this link

Data graphics are becoming an important way for ordinary citizens to find out what’s happening in the world. It’s worthwhile to study collections of data graphics to see the creativity and range of approaches of graph-makers. Some examples: how people spend their day, life expectancy, wind patterns (right now!), historical sources of death.

The graphics found in statistics textbooks (Figure 2.3) are often highly stylized and don’t show data directly. Some forms—pie charts and bar charts—were introduced by Playfair more than 200 years ago. One of the primary motivations for the graphical forms in Figure 2.3 is that they can be drawn easily by hand or typewritten.

Histogram

Dot plot

Bar chart

Pie chart

Boxplot

Playfair’s pie chart

Playfair’s bar chart

Stem-and-leaf plot

Figure 2.3: Some graphics styles often featured in statistics textbooks

The types of graphics in Figure 2.3 can be effective pedagogical tools for teaching pupils about numbers and their representation. But our purpose in these Lessons is different: to display data directly, along with guides to interpreting the possible patterns in the data.

Annotated point plots

In the 200+ years since trade was first graphed, many different formats for drawing pictures of data have been invented. Two of these, the pie chart and the bar chart, were invented by Playfair himself. (See Figure 2.3.)

This Lesson introduces a powerful form of data graphics that is particularly well suited to support statistical thinking: the annotated point plot. A point plot provides a visual display of a data frame. The annotations summarize specific patterns in the data.

A point plot contains a simple mark for each row of a data frame. Two selected columns of the data frame are depicted as the vertical and horizontal axes of the graphics frame. For the sake of simplicity, we will use the pointplot() function (from the {math300} package) to start making data graphics. pointplot() requires two inputs:

  1. a data frame.
  2. a “tilde expression” specifying which variables from the data frame are to be rendered in graphical form.

A “point plot” is also known as a “scatter plot.”

To illustrate, consider the Anthro_F data frame, which records, among other variables, the wrist, ankle, and knee circumference of 184 college-aged women. Figure 2.4 shows a point plot of wrist versus ankle circumference.

Table 2.2: Some selected rows from the Anthro_F data frame.
Wrist Ankle Knee
18.4 23.5 37.5
13.5 18.0 32.3
18.0 22.5 38.5
19.0 24.5 41.5
Anthro_F |> pointplot(Wrist ~ Ankle)

Figure 2.4: A point plot of wrist versus ankle circumference.

Each dot in Figure 2.4 reflects one row of the Anthro_F data frame and is placed at the coordinate (Ankle, Wrist). Table 2.2 shows four rows from Anthro_F; you should be able to locate the four corresponding dots in the Figure 2.4 point plot.

The computer command used to create Figure 2.4 is typical of the commands you will use throughout these lessons. Let’s highlight the components of the command.

\(\underbrace{\texttt{Anthro_F}}_\text{data frame}\ \ \color{orange}{\underbrace{\Large\texttt{|}\!\texttt{>}}_{\text{pipe}}} \ \ \ \color{green}{\underbrace{\texttt{pointplot(}}_\texttt{function}}\ \color{blue}{\underbrace{\texttt{Wrist}\ _{\LARGE{\texttt{~}}}\ \texttt{Ankle}}_\text{argument}}\ \color{green}{\texttt{)}}\)

The point of the whole command is to apply a function to a data frame. Every function has an identifying name, here pointplot(). The data frame Anthro_F is being piped into the function. The purpose of pointplot() is to generate a data graphic.

It’s often the case that additional instructions are needed to describe exactly what the function is to do. You place such instructions—called “arguments”—inside the pair of parentheses that follow the function name.

The critical instruction needed by pointplot() is which variables to use: which one goes on the vertical axis and which one on the horizontal. Such a which-variable-to-use instruction is written in the form of a “tilde expression.” The word “tilde” is the name of the wavy character ~. The variable name to the left of the tilde goes on the vertical axis; the variable name to the right of the tilde goes on the horizontal axis. The variable names used must correspond to the names in the data frame being piped into the function.
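A tilde expression can also be examined on its own, without making any plot. The sketch below uses only base R; `frame_spec` and `Small` are names made up for this illustration, and `Small` holds just the four rows shown in Table 2.2 rather than the full Anthro_F data frame.

```r
# A tilde expression is an ordinary R "formula" object.
frame_spec <- Wrist ~ Ankle      # frame_spec is a made-up name

class(frame_spec)                # "formula"
all.vars(frame_spec)             # "Wrist" "Ankle"

# Left of the tilde: the vertical-axis variable.
vertical   <- all.vars(frame_spec[[2]])
# Right of the tilde: the horizontal-axis variable.
horizontal <- all.vars(frame_spec[[3]])

# A tiny stand-in data frame (the four rows of Table 2.2).
Small <- data.frame(Wrist = c(18.4, 13.5, 18.0, 19.0),
                    Ankle = c(23.5, 18.0, 22.5, 24.5),
                    Knee  = c(37.5, 32.3, 38.5, 41.5))

# The names in the expression must be columns of the data frame.
all(c(vertical, horizontal) %in% names(Small))   # TRUE
```

Writing the expression does not require the variables to exist. The names are looked up only when a function pairs the expression with a data frame.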

Figure 2.4 is an un-annotated point plot. To add an annotation, an additional instruction must be provided as a second argument to the function. To illustrate, here’s the command to create an annotated point plot:

Anthro_F |> pointplot(Wrist ~ Ankle, annot = "model")

Figure 2.5: Annotating the point plot with a model.

The argument annot = "model" (with “model” in quotes) directs pointplot() to look for a relationship between the variables shown in the plot and to graph that relationship. Many of the following Lessons are devoted to understanding what a model is and what it shows you, so we won’t go into any detail here. For now, note the graphical form of the model: not a dot but a band.

Categorical variables and jittering

Each of the horizontal and vertical axes in Figure 2.5 represents a numerical variable, with the axis tick-mark labels (e.g. “18”) marking the link between position and numerical value.

Graphical axes can also be used with categorical variables, as in Figure 2.6 where the horizontal axis represents sex. To accomplish this, the axis tick marks show the levels of the categorical variable, for instance F and M. If we were to follow the mathematical conventions for numerical variables, then each point would be located exactly at its respective sex value, as in Figure 2.6(a). The space between the labelled tick marks is empty.

With categorical variables, there is a benefit to suspending the mathematical convention and spreading the points slightly and randomly around the labelled position, as in Figure 2.6(b). This spreading is called “jittering.” Because categorical variables have “discrete” levels, many points would otherwise be drawn on top of one another; jittering makes it easier to see the individual points.

Galton |> pointplot(height ~ sex, jitter="none")
Galton |> pointplot(height ~ sex, seed = 201)

(a) A plain point plot

(b) A jittered point plot

Figure 2.6: Sex and height seen in the Galton data frame.
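The idea behind jittering can be sketched in a few lines of base R. This is only an illustration of the principle, using made-up data and an arbitrarily chosen maximum nudge of 0.2; it is not a description of how pointplot() computes its jitter internally.

```r
set.seed(201)   # fix the random numbers so the result is repeatable

# Made-up data: a categorical variable with two levels.
sex <- factor(c("F", "M", "F", "F", "M", "M", "F", "M"))

# Without jittering, each point sits exactly at its level's position: 1 or 2.
x_plain <- as.integer(sex)

# Jittering: nudge each position by a small random amount, here at most 0.2.
x_jittered <- x_plain + runif(length(x_plain), min = -0.2, max = 0.2)

# The nudge is small enough that every point still belongs
# unambiguously to its own level.
all(round(x_jittered) == x_plain)   # TRUE
```

The horizontal position within a band carries no information. Only the band that a point falls in tells you its level.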

Color and faceting

Often, there will be more than one explanatory variable of interest. For instance, if there are two explanatory variables, the tilde expression will have two variable names on the right-hand side, for instance Wrist ~ Ankle + Knee. Graphically a third “axis” is needed for the additional explanatory variable.

Figure 2.7: A 3-dimensional point plot of Wrist ~ Ankle + Knee. Such plots are hard to make sense of.

Mathematicians will point out that, in theory, the three Cartesian axes of 3-dimensional space can each be assigned to one of the three variables. Figure 2.7 shows what this would look like in an interactive 3-D plot. The result is very difficult to make sense of.

Experience has shown that graphics with three variables are more effective if the third “axis” is represented by color.


Anthro_F |> pointplot(Wrist ~ Ankle + Knee, annot="model")
Anthro_F |> pointplot(Wrist ~ Ankle + Knee + Knee, annot="model")

(a) Color alone

(b) Color and faceting together.

Figure 2.8: A point plot of wrist versus ankle and knee circumference using color to represent the knee circumference.

Another effective way to represent a third variable is with facets. The variable to facet by goes in the third right-hand slot of the tilde expression.

Violins for density

Births2022 |> pointplot(mage ~ meduc, alpha=.01, size=0.1, annot="violin")
Warning in pointplot(Births2022, mage ~ meduc, alpha = 0.01, size = 0.1, : x-axis variable is numerical, so only one violin drawn for all rows.
              Perhaps you want to use ntiles() or factor() on that variable?

Are hardcover (H) books more likely to have many pages than paperback (P) books?

moderndive::amazon_books |> pointplot(num_pages ~ hard_paper, alpha=0.1, annot="violin")
Warning in pointplot(moderndive::amazon_books, num_pages ~ hard_paper, alpha = 0.1, : x-axis variable is numerical, so only one violin drawn for all rows.
              Perhaps you want to use ntiles() or factor() on that variable?
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Graphics frame

The data frame provides our standard organization of data. As you know, it consists of rows and columns. Each row is one specimen (also known as “unit of observation”). Each column is a variable, consisting of a series of values, one for each row. The values are either numerical or text: text for a categorical variable, numbers for a quantitative variable.

Another way to represent graphically the value of a variable is by showing discrete facets: mini-graphs that each show the data that fall into a particular value or range of a variable. This creates the possibility of representing a fourth or fifth variable in the graphic. In practice, however, such multi-variable graphics are difficult for humans to comprehend, which defeats much of the purpose of displaying data in graphical as opposed to spreadsheet form.
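For a quantitative variable, faceting by “range” means first cutting the variable into a few groups. A minimal base R sketch of that step follows, assuming three equal-count groups; the values in `knee` are simulated stand-ins, not the real Anthro_F measurements, and the grouping rule pointplot() actually uses may differ.

```r
set.seed(42)
# Made-up values standing in for a quantitative column.
knee <- rnorm(184, mean = 36, sd = 2.5)

# Break points at the 0, 1/3, 2/3 and 1 quantiles give three groups
# with roughly equal numbers of rows.
breaks <- quantile(knee, probs = c(0, 1/3, 2/3, 1))
knee_group <- cut(knee, breaks = breaks, include.lowest = TRUE)

table(knee_group)

# Each facet would then show only the rows that fall in one group.
facets <- split(data.frame(knee), knee_group)
```

Each element of `facets` is a smaller data frame, one per mini-graph.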

In these Lessons, a typical data graphic will represent two or three variables, using in order of precedence: vertical position, horizontal position, and last, color.

Anthro_F |> pointplot(Wrist ~ Ankle + Knee, alpha=0.5)
Anthro_F |> pointplot(Knee ~ Wrist + Ankle, alpha=0.5)
Anthro_F |> pointplot(Ankle ~ Knee + Wrist, alpha=0.5)

(a) Wrist ~ Ankle + Knee

(b) Knee ~ Wrist + Ankle

(c) Ankle ~ Knee + Wrist

Figure 2.9: Three different views of the data from Anthro_F

It is usually most effective to show a relationship between two variables by placing them on the two axes. To judge from the plots, people with small wrists tend to have small ankles, people with small knees tend to have small wrists, and people who have small ankles tend to have small knees.

It takes some practice to comprehend relationships involving quantitative variables depicted with color. But color is a good choice for categorical variables with a handful of levels.

Whickham |> pointplot(outcome ~ age + smoker )

Births |> 
  pointplot(births ~ date + wday)

Statistical annotations

[STILL IN DRAFT]

Show violins, means and model values, intervals, confidence bands

Galton |> pointplot(height ~ 1, annot="violin", alpha=0.3)
Warning in pointplot(Galton, height ~ 1, annot = "violin", alpha = 0.3): x-axis variable is numerical, so only one violin drawn for all rows.
              Perhaps you want to use ntiles() or factor() on that variable?

Galton |> pointplot(height ~ sex, annot="violin", alpha=0.1)
Warning in pointplot(Galton, height ~ sex, annot = "violin", alpha = 0.1): x-axis variable is numerical, so only one violin drawn for all rows.
              Perhaps you want to use ntiles() or factor() on that variable?

Distributions and density

For many people, the dots drawn in a point plot (or jitter plot) are reminiscent of seeds or pebbles scattered across an area. With this in mind, one way to interpret some aspects of point plots is in terms of the “density” of data points: density is high in some areas, lower in others, and negligible or nil in still others.

Indeed, a popular synonym for “point plot” is “scatter-plot.”

In general, “density” refers to a ratio: a count or amount per unit of space. In point plots, the “unit of space” is area. A high-density region has many data dots in each patch of area. Evidently, many people can perceive density in a point plot without any need to count, measure area, or calculate the ratio; it is an intuitive mode of perception.

Figure 2.10 is a made-up point plot with five patches of different densities. The densities are 25, 50, 100, 200, and 400 points per unit area. Many people would find it easy and immediate to point out the least and most dense patches and even to put the patches in order by density. However, people are hard put to quantify even the relative densities. For instance, the largest patch has a smaller density than the next largest patch, but quantifying this by eye (without being told the densities) is not really possible.

Figure 2.10: Five point-plot patches of different sizes and densities. The density can be perceived independently of the area.
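The ratio that defines density is simple to compute once counts and areas are known. In the sketch below, the areas and point counts are invented for illustration (they are not taken from Figure 2.10) and were chosen so that the resulting densities match the five values quoted in the text.

```r
# Five made-up patches: invented areas and invented point counts.
Patches <- data.frame(
  area     = c(3, 1, 2, 0.5, 1.5),
  n_points = c(75, 50, 200, 100, 600)
)

# Density is a ratio: count per unit of area.
Patches$density <- Patches$n_points / Patches$area

# Sort from least dense to most dense.
Patches[order(Patches$density), ]
```

Note that the patch with the most points need not be the largest, and the largest patch need not be the densest. The eye judges the ratio directly, which is why the ordering is easy to see but the actual numbers are not.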

In a point plot, the density tells us about the center and fringes. [Show a couple of jittered point plots of a normal, an exponential, and a bimodal distribution and narrate them.]

Our eyes can give a qualitative estimate of relative density, but not a precise quantitative one. Our graphical perception is much more precise when it comes to length or width. Ingeniously, designers of statistical graphics have created a device to display density not in its native form but as a length.

  1. For the reader this makes it easy to see small differences in density to which we would otherwise be insensitive.

  2. It’s also a source of confusion, since width is being used when the real matter of interest is density.
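The translation from density to width can be sketched with base R's density() function, which estimates how densely values are packed at each position along an axis. The data here are simulated, the maximum half-width of 0.4 is an arbitrary choice, and the drawing is only a rough stand-in for the violin annotation; it makes no claim about how pointplot() draws its violins.

```r
set.seed(101)
# Made-up values standing in for one column of a data frame.
y <- rnorm(500, mean = 65, sd = 3)

# Kernel density estimate: the density of points at each height on the axis.
dens <- density(y)

# Half-width of the outline at each height, scaled so the widest part is 0.4.
half_width <- 0.4 * dens$y / max(dens$y)

# Draw the outline: the same curve mirrored left and right of a center line.
plot(c(-half_width, rev(half_width)), c(dens$x, rev(dens$x)),
     type = "l", xlim = c(-0.5, 0.5),
     xlab = "", ylab = "y", xaxt = "n")
```

The horizontal axis has no tick marks on purpose: the width is a stand-in for density, not a measurement of a variable.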

Examples

natality2014::Natality_2014_10k |>
  pointplot(dbwt ~ ntiles(combgest,5, format="interval") + sex, 
             size=0.1, alpha=0.1, model_alpha=1, annot="violin")

natality2014::Natality_2014_10k |>
  pointplot(dbwt ~ splines::ns(combgest, 4) + sex, 
             size=0.1, alpha=0.1, model_alpha=1, annot="model")

Why the leveling off for very long pregnancies? Perhaps they are very long only if the fetus is relatively small. Or perhaps the length of gestation has been overstated by a month.

Galton |> pointplot(height ~ mother * sex * father, annot="model", alpha=0.5, size=0.5, model_alpha=0.5)

Data graphics

DRAFT DRAFT DRAFT

Show examples of data graphics and distinguish them from statistical annotations.

Exercises

Figure 9.3 involves two categorical explanatory variables.

  1. Which variable is mapped to the horizontal axis? Which to color?

  2. What is the model value of age for non-smoking survivors?

  3. What are the levels of domhand?

Answers: (1) Horizontal: smoking status. Color: whether the nurse survived for at least 20 years after the initial interview. (2) About 41 years old.

Each of the following plots has been made by pointplot(). The name of the data frame is given. Your job is to write the entire command that will reproduce the plot.

DRAFT

  1. Some plot

  2. Another plot

  3. And so on.

In Figure 2.8(b), each of the three facets has points of only one color. Explain why.

Decide which assignment of variables to graphical qualities you think is the most important. Note that we are going to use a convention: response, explanatory1, explanatory 2

Galton |> pointplot(height ~ mother + sex)

Galton |> pointplot(height ~ sex + mother)

a. In the plot that maps `mother` to the horizontal axis, explain how you would identify a child who is relatively short for their sex but who has a tall mother.

b. In the plot that maps `mother` to color, explain how you would identify a child who, as in (a), is relatively short for their sex but has a tall mother.

There’s an interesting pattern shown in this plot:

Won’t compile to HTML

Births |> filter(year %in% c(1980)) |> 
  pointplot(births ~ date + wday, size=0.4) |>
  plotly::ggplotly()
a. The points split into two main groups based on the number of births each day. Explain in everyday terms what's going on.
b. There are some low-birth dates that are not weekends. Look at the specific date by hovering the cursor over the points. What's going on?

Outlier in Knee in ?fig-wrist-angle. Find the specimen and filter it out.

The “body mass index” (BMI) is a familiar way of defining overweight. (Whether it is useful medically is controversial, but it is widely used.) BMI is an arithmetic combination of height and weight. Using the data in Anthro_F, make plots showing the relationship between BMI, Height, and Weight. There are six different ways of defining the graphics frame from three variables, e.g., Height ~ BMI + Weight or Weight ~ Height + BMI.

a. Three of the six possible frames just swap the x- and y-axes from the other three. Make a list of the three pairs of swapped axis graphics frames.

b. Select one frame from each of the three pairs in (a) and graph it, producing three graphs. 

c. Pick one of the three graphs from (b)---whichever you like best---and use it to explain in graphical, everyday terms, how BMI is related to height and weight. 

As you know, the .by= argument to the wrangling verbs causes the operation to be done separately for each group defined by .by=.

There is a similar .by= argument for pointplot(). For instance,

Big |> pointplot(flipper ~ mass + species, 
                  .by =  ~ species)

This groupwise splitting up of a graph is called “faceting.”

Notice that, unlike the wrangling functions, .by= uses a tilde expression. This is because you might sometimes want to facet using two variables, one along the horizontal spread of facets, one along the vertical spread. The tilde-expression format lets you specify which facet is horizontal and which vertical.

Faceting is more sophisticated than merely making a new graph for each group. To illustrate, here is a single data graph just for the Chinstrap species of penguin:

Penguins |> filter(species == "Chinstrap") |>
  pointplot(flipper ~ mass + species)

  1. Compare the x-y frame for the Chinstrap facet in the top graph to the x-y frame for the Chinstrap-only second graph. What’s different about the x- and y- axes?

  2. Explain what’s nice about the faceting way of setting the bounds of the x- and y-axis.

DRAFT: Using jittering and transparency for quantitative variables. Point out that numerical values are sometimes discrete, as in the number of hours of sleep each night.

NHANES::NHANES |> gf_point(SleepHrsNight ~ Depressed, alpha=0.3)
Warning: Removed 2245 rows containing missing values (`geom_point()`).

NHANES::NHANES |> gf_jitter(SleepHrsNight ~ Depressed, alpha=0.3)
Warning: Removed 2245 rows containing missing values (`geom_point()`).

You’ll need to explain what the NA refers to.

DRAFT: a graph of newborn babies’ weights versus the age of the mother. Use the model annotation to describe the relationship, if any.

Gestation |> pointplot(wt ~ age, alpha=0.1, annot="model")
Warning: Removed 2 rows containing missing values (`geom_point()`).

DRAFT: Re-create the East-India graphic.

Consider this annotated point plot.

Whickham |> pointplot(age ~ smoker, alpha=0.3, annot="violin")

  1. What tilde expression was used?
  2. Which group, smokers or non-smokers, has a greater density of people over age 60?

The Births2022 data frame records a random sample of 20,000 births in the US in 2022. Two of the variables, meduc and feduc, give the educational level of the mother and father respectively. The levels of these categorical variables correspond to “eighth grade or less,” “twelfth grade or less,” “high-school graduate,” “high-school graduate plus some college (but no degree),” “associate’s degree,” “bachelor’s degree,” “master’s degree,” and “professional degree” (such as a PhD, EdD, MD, LLB, DDS, JD). Educational data is missing (“NA”) for about 5% of mothers and 15% of fathers.

The graph is a point plot of the mother’s education level versus the father’s.

ggplot(Births2022, aes(y=meduc, x=feduc)) + 
  geom_jitter(alpha=0.05, size=0.02, height=0.4, width=0.4) +
  theme_bw() +
  theme(aspect.ratio=1, axis.text.x = 
          element_text(angle = 45, vjust=0.9, hjust=1)) +
  labs(x="Father's education", y="Mother's education") 

  1. Is this a jittered point plot? Explain briefly how you can tell. Answer: Yes, it’s jittered both horizontally and vertically. The axis tick marks correspond to discrete categorical levels, but the points themselves are spread out a little bit around the discrete levels.
  2. Is transparency used? Explain briefly how you can tell. Answer: Yes. In the blocks with a low number of points, each dot is not a solid color.
  3. In principle, there are 9 \(\times\) 9 = 81 possible combinations of the mother’s and father’s education. Which combination is the most common? What’s the second most common combination? Answer: Most common: HS for both mother and father. Second most common: Bachelors for both mother and father.
  4. Is it more common for a woman with a Bachelor’s degree to marry a man with a high-school degree or vice versa? Answer: The square at mother=bachelors, father=HS is much darker than the similar square on the other side of the diagonal, that is, at father=bachelors, mother=HS. So the first combination is the more common one.
  5. What would the graphic look like if jittering had not been used? Answer: There would be a single dot at each of the populated intersections, rather than the square cloud of dots seen in the actual graph.

The graphic below contains a single data layer. Four of the data points are annotated with letters in order to identify them specifically.

Part 1

  1. Is the income level of “a” greater than “b”? Answer: no
  2. Is the income level of “d” greater than “a”? Answer: no
  3. Is the number of rooms greater for “b” than for “a”? Answer: no. Even though the vertical position of “b” is higher than that of “a,” they are in the same jittering band. All points within a jittering band have the same value of the variable that is being jittered.
  4. Is the number of rooms greater for “c” than for “a”? Answer: yes. They are in different jittering bands.

Part 2

Here is the data plotted in the figure.

 row   income   number_of_rooms
----  -------  ----------------
   1     0.90                 1
   2     1.00                 3
   3     0.31                 3
   4     0.85                 1
   5     1.09                 3
   6     1.19                 2
   7     1.01                 1
   8     1.09                 3
   9     1.16                 2
  10     2.86                 2

The points a, b, c, and d, are shown in the table. For each of a, b, c, d, say which row corresponds to the point. Answer: a is row 8, b is row 7, c is row 2, d is row 1

Guides, scales, palettes

Identifying points

[Still in draft]

This works but won’t compile to HTML

AAUP |> pointplot(acsal ~ nonacsal + licensed) |>
  plotly::ggplotly()