# 2  Data graphics

The statistical thinker seeks to identify patterns in data, such as possible relationships between variables. But it is not easy to do this by direct examination of a data frame, such as Table 2.1.

Translating a data frame into graphical form (data graphics) is essential for revealing or suggesting patterns. Figure 2.1 gives a clear view of the trade pattern over the 100-year period at a glance.

Making pictures of data is a relatively modern idea. William Playfair (1759-1823) is credited as the inventor of novel graphical forms in which data values are presented graphically rather than as numbers or text. To illustrate, consider the data from the 1700s (Table 2.1) that Playfair turned into a picture.

Playfair’s innovation, as in Figure 2.1, was successful because it was powerful. A pattern that may be obscure in the data frame becomes visually apparent to the human viewer. For example, consider the graphic in Figure 2.1 displaying data on trade between England and the East Indies in the 1700s. The graphic lets you look up the amount of trade each year, but it also shows patterns, such as the upward trend across the decades.

Data graphics can also make it easy to see deviations from trends, for instance, the dip in exports and flattening of imports during 1775-1780.

Students often encounter various types of data graphics as they progress through elementary and high school. Figure 2.2 shows a few examples commonly found in textbooks. Remarkably, it’s rare to encounter such textbook graphic types outside of a statistics course.

Modern data graphic designers are introducing even more variety; their graphics can be captivating, colorful, dynamic, and informative. Some online examples: how people spend their day, life expectancy, wind patterns (right now!), historical sources of death. The graphical types in Figure 2.2 were all invented long before computers became available to help us work with data.

We won’t use such graphical variety in these Lessons. Instead, we will use a single basic form of graphic, the “annotated point plot,” which can display multiple variables simultaneously and can combine into one view both the raw data and a summary of the patterns found in the data.

## Point plot

A point plot contains a simple mark (a dot) for each row of a data frame. In its most common form, a point plot displays two selected variables from the data frame. One variable is depicted as the vertical coordinate, and the other as the horizontal coordinate.

A “point plot” is also known as a “scatter plot.”

To illustrate how a point plot relates to the underlying data frame, consider Table 2.2, where the unit of observation is a city. (The data frame is available in R as maps::world.cities.)

Since world.cities contains several variables, many possible pairs of variables could be shown in point-plot form. For instance, suppose we choose the lat and long variables, which specify each city’s location in terms of latitude and longitude. Figure 2.3 shows a point plot of latitude versus longitude for world cities. By convention, the word “versus” in the phrase “latitude versus longitude” marks the role of each variable in the point plot: latitude on the vertical axis and longitude on the horizontal axis.

The dots in Figure 2.3 hint at some geographical patterns you learned about in geography class. In general, the purpose of a point plot is to hint at patterns in data.

To show how to construct a point plot, we will work with data on human body shape. The Anthro_F data frame records nineteen different measurements of body shape for each of 184 college-aged women. (See Table 2.3.)

In making a point plot of Anthro_F, we have to choose two variables to display. One variable will determine the vertical position of the dots; the other will set the horizontal position. For instance, in Figure 4.1 we choose Wrist for the vertical position and Ankle for the horizontal position. In words, the plot is “wrist versus ankle,” that is, “vertical versus horizontal.” (The codebook for Anthro_F, available via the R command ?Anthro_F, tells us that Ankle is measured as the circumference in centimeters, and similarly for Wrist.)

The pattern seen in Figure 4.1 can be described as an upward-sloping cloud. We will develop more formal descriptions of such clouds in later Lessons. But for now, focus on the R command that generated the point plot.

The point of an R command is to specify what action you want the computer to take. Here, the desired action is to make a point plot based on Anthro_F using the two variables Wrist and Ankle. Look carefully at the command for Figure 4.1:

Anthro_F |> point_plot(Wrist ~ Ankle)

The command includes all four of the names involved in the plot:

• The data frame Anthro_F
• The action point_plot
• The variables involved: Wrist and Ankle

These names are separated from one another by some punctuation marks:

• |>, the “pipe”
• (), a pair of parentheses
• ~, called “tilde”

Don’t be daunted by this punctuation, strange though it may seem at first. You will get used to it, since almost all the commands you use in these Lessons will have the same punctuation.

Highlighting in color helps to identify the different components of the Figure 4.1 command:

Most commands in these Lessons start with the name of a data frame. This is followed by the pipe, which indicates sending the data frame to the next command component. That next component specifies which action to take. By convention, the word “function” is used rather than “action.” You use differently named functions to carry out different kinds of actions.

You will need only a handful of functions for these Lessons, for instance, point_plot(), model_train(), conf_interval(), mutate(), and summarize(). This Lesson introduces point_plot(). The others will be introduced in later Lessons as we need them.

The function name is always followed by an opening parenthesis. Any details about what action to perform go between the opening and the corresponding closing parentheses. In computer terminology, such details are called “arguments.” The detail for the Figure 4.1 point_plot() is the choice of the two variables to be used and which one goes on the vertical axis. This detail is written as a “tilde expression.” The tilde expression given as the argument to point_plot() is Wrist ~ Ankle, which can be pronounced as “wrist versus ankle” or “wrist tilde ankle,” as you prefer.
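To make the anatomy of such a command concrete, here is a minimal sketch in base R. It does not call point_plot(); the three-row data frame Mini is made up for illustration and merely stands in for Anthro_F. The sketch shows that the pipe hands the data frame to the function as its first argument, and that a tilde expression is an ordinary R object that remembers the names written on each side of the tilde.

```r
# A made-up three-row data frame standing in for Anthro_F (illustration only).
Mini <- data.frame(Wrist = c(16.0, 14.5, 17.5),
                   Ankle = c(20.0, 20.2, 25.0))

# The pipe sends the object on its left to the function on its right
# as the first argument, so these two commands do the same thing.
n_piped  <- Mini |> nrow()
n_direct <- nrow(Mini)

# A tilde expression is stored by R as a "formula" object.
tilde <- Wrist ~ Ankle
class(tilde)      # the kind of object R created
all.vars(tilde)   # the variable names, left-hand side first
```

Nothing here draws a graph. The point is only that a data frame, a function, and a tilde expression are three separate pieces that the punctuation ties together.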

Fill in the names of variables in the correct place to make a point plot like Figure 4.1 but with Waist on the horizontal axis and Ankle on the vertical axis.

1. As you substitute the variable names in the slots named ..vert.. and ..horiz.., make sure not to erase the tilde character that separates the names. The tilde is essential.

2. The variable named on the left-hand side of the tilde expression will be used for the vertical axis. The right-hand side variable will be on the horizontal axis.

3. Take note that both names Waist and Ankle start with a capital letter.

Reproduce this plot, based on the Galton data frame.

1. The labels on the axes tell which variables are being plotted.

2. Construct a tilde expression that relates the vertical variable to the horizontal variable.

3. Remember the tilde character between the variable names!

## Response and explanatory variables

Another pronunciation for ~ is “… as a function of ….” So, Wrist ~ Ankle means “wrist circumference as a function of ankle circumference.” In mathematics, functions are often written using a notation like $$f(x)$$. In this notation, $$x$$ is the input to the function f(). The word “input” is used in so many different contexts that it’s helpful to use other technical words to highlight the context.

• In computer notation, such as f(x) or point_plot(Wrist ~ Knee), an expression inside the parentheses is called an argument. In f(x), the function is f() and the argument is x. In point_plot(Wrist ~ Knee), the function is point_plot() and the argument is the tilde expression Wrist ~ Knee.

• In statistics, in the phrase “wrist circumference as a function of ankle circumference” or, equivalently, the computer expression Wrist ~ Ankle referring to the Anthro_F data frame, we say that Ankle is an explanatory variable and Wrist is the response variable. In graphics, such as Figure 4.1, convention dictates that the response variable is shown along the vertical axis and the explanatory variable is shown along the horizontal axis.

In Figure 4.1, why did we choose Ankle as the explanatory variable and Wrist as the response variable for this example? No particular reason. We could equally well have chosen any of the Anthro_F variables in either role, depending on our interest. Typically, the statistical thinker will examine several different pairs to gain an understanding of how the various variables are related to one another.

We use the structure provided by tilde expressions to tell the computer which variable to use as the response and which ones to use as the explanatory variable(s).
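The roles can be read directly off a tilde expression, as the following base-R sketch suggests. The function roles() is a hypothetical helper written for this illustration; it is not part of any package used in these Lessons and assumes a two-sided tilde expression.

```r
# Hypothetical helper: report the role played by each variable
# named in a two-sided tilde expression.
roles <- function(tilde) {
  list(response    = all.vars(tilde[[2]]),   # left-hand side
       explanatory = all.vars(tilde[[3]]))   # right-hand side
}

one_explanatory <- roles(Wrist ~ Ankle)
two_explanatory <- roles(Wrist ~ Ankle + Knee)

one_explanatory
two_explanatory
```

Whatever appears to the left of the tilde is the response; everything to the right is explanatory, however many variables are listed there.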

Which of these tilde expressions puts Height in the role of the response variable and Age as the explanatory variable?

1. Age ~ Height
2. Height ~ Age

## Categorical variables and jittering

In the previous example, the point plot of Wrist versus Ankle, both variables are quantitative: the respective joints’ circumferences (in cm). Point plots are also suited to categorical variables. For example, Figure 2.5 shows a pair of point plots made from the Penguins data frame. The unit of observation is an individual penguin. The selected explanatory variable, species, is categorical. The response variable, mass, is quantitative.

When a categorical variable is used in a plot, the positions on the axis are labelled with the levels of the variable: “Adelie,” “Chinstrap,” and “Gentoo” for the explanatory variable of Figure 2.5.

When an axis represents a quantitative variable, every possible position on that axis refers to a specific value. For instance, the Adelie penguins range between 2850 and 4775 grams. On the vertical axis itself, marks are made at 3000 and 4000 grams, but we know that every position in between those marks corresponds proportionately to a specific numerical value.

In contrast, when an axis represents a categorical variable, positions are marked for each level of that variable. But positions in between the marks do not refer to fictitious “levels” that do not appear in the data. For instance, the position on the horizontal axis in Figure 2.5 that’s halfway between Adelie and Chinstrap is not reserved for individual penguins whose species is a mixture of Adelie and Chinstrap. There are no such penguins! (Or, at least, the concept of “species” doesn’t admit of such.) Every value of a categorical variable is one of the levels, which are discrete.

Using a coordinate axis to represent discrete categories makes common sense, but we are left with the issue of interpreting the space between those categories. In Figure 2.5 (left) the point plot has been made ignoring the space between categories. Every specimen is lined up directly above the corresponding level. The graphical result is that it’s hard to identify a single specimen since the dots are plotted on top of one another.

“Jittering” is a simple graphical technique that uses the space between the levels to spread out the dots at random, as in Figure 2.5 (right). This dramatically reduces overlap and facilitates seeing the individual specimens. Recognize, however, that the precise jittered position of a specimen does not carry information about that specimen. All of the specimens in the column of jittered dots above “Adelie” are the same with respect to species, even though they may have different mass.

The point_plot() function automatically uses jittering when positioning in graphical space the values of categorical variables.
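The idea behind jittering can be sketched in a few lines of base R. This is an illustration of the general technique, not a description of how point_plot() is implemented; the six penguins and the jitter width of 0.2 are made up for the example.

```r
set.seed(1)  # make the random jitter reproducible

# Made-up categorical data: species for six hypothetical penguins.
species <- factor(c("Adelie", "Adelie", "Adelie",
                    "Chinstrap", "Gentoo", "Gentoo"))

# Each level gets an integer position on the axis: 1, 2, 3.
position <- as.integer(species)

# Jittering adds a small random offset that carries no information.
# Keeping the offset well under 0.5 stops neighboring columns from mixing.
jittered <- position + runif(length(position), min = -0.2, max = 0.2)

data.frame(species, position, jittered)

# Rounding recovers the original position, so no information was lost.
all(round(jittered) == position)
```

The three Adelie penguins end up at three slightly different horizontal positions, yet all of them still sit unambiguously in the Adelie column.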

Here is a place for you to construct some graphics in order to answer the following questions.

1. Which of the variables in apgar5 ~ eclampsia is being jittered?

eclampsia is a categorical variable, so the data points are jittered horizontally. apgar5 is numerical, so it is not jittered. Note that at each value of apgar5 the points are arranged in a horizontal line; there is no vertical spread.

2. Is jittering used when plotting weight ~ meduc?

meduc is a categorical variable and is therefore jittered. weight, a quantitative variable, is not jittered.

3. Which of the variables in induction ~ fage is being jittered?

induction is jittered. The data points are scattered vertically in a band around each of the categorical levels, “Y”, “N”, and NA.

## Color and faceting

Often, there will be more than one explanatory variable of interest. A penguin’s mass might not just be a matter of species; there are bigger and smaller individuals within any species. Perhaps, for instance, the body shape (not just size) is different for the different species. One way to investigate this possibility is to display body mass as explained by both species and, say, bill_length.

To specify that there are two explanatory variables, place both their names on the right-hand side of the tilde expression, separating the names with a + or a *. Figure 2.6(a) shows a point plot made with two explanatory variables.

Figure 2.6(a) involves three variables. Consequently, each dot has three different graphical attributes:

• position in space along the vertical axis. This is denoted as y.
• position in space along the horizontal axis. This is denoted as x.
• color, denoted, naturally enough, as color.

In order to avoid long-winded sentences involving phrases like “the horizontal axis represents …,” we use the word “mapped.” For instance, in Figure 2.6, mass is mapped to y, bill_length is mapped to x, and species is mapped to color. Each mapping has a scale that translates the graphical property to a numerical or, in the case of color, categorical value.

point_plot() has been arranged so that the order of variable names in the tilde expression argument, mass ~ bill_length + species, exactly determines the mappings of variables to graphical properties. The response variable (that is, the variable named on the left-hand side of the tilde expression) is always mapped to y. The first variable on the right-hand side (bill_length in Figure 2.6) is always mapped to x. The second variable named on the right-hand side is always mapped to color.

In Figure 2.6 (right), four variables are shown: the response mass as well as the three explanatory variables bill_length, species, and sex. Each variable needs to be mapped to a unique graphical property. point_plot() maps the third explanatory variable (if any) to a property called “facet.” Facets are drawn as separate sub-panels. The scale for the mapping to facet consists of the labels at the top of each facet.

With point_plot(), different but closely related graphs of the same data can be made by swapping the order of variables named in the tilde expression. To illustrate, Figure 2.7 reverses the mappings of sex and species compared to Figure 2.6(b). The data are the same in the two plots, but the different orderings of explanatory variables emphasize different aspects of the relationship among the variables. For instance, in ?fig-mass-bill-species-sex(b) it’s easier to see that the sexes of each species differ in both mass and bill length. Chinstrap males and females have bill lengths that are the most distinct from one another.
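The ordering rule described above can be written out as a small base-R sketch. describe_mapping() is a hypothetical helper created for this illustration; it is not part of point_plot() and only mimics the stated rule (response to y, then x, color, and facet in the order the explanatory variables are named). It assumes every term in the tilde expression is a distinct variable name.

```r
# Hypothetical helper: list which graphical property each variable
# would be mapped to under the ordering rule described in the text.
describe_mapping <- function(tilde) {
  vars <- all.vars(tilde)                      # response first, then explanatory
  properties <- c("y", "x", "color", "facet")
  stopifnot(length(vars) <= length(properties))
  setNames(vars, properties[seq_along(vars)])
}

three_vars <- describe_mapping(mass ~ bill_length + species)
four_vars  <- describe_mapping(mass ~ bill_length + species + sex)
swapped    <- describe_mapping(mass ~ bill_length + sex + species)

three_vars
four_vars
swapped
```

Comparing four_vars with swapped shows that exchanging the last two names exchanges the roles of color and facet while leaving y and x alone.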

When there are multiple explanatory variables, the mappings to x, color, and facet strongly influence the interpretability of a point plot. In the following chunk, based on Figure 2.7, try several different arrangements of the explanatory variables. Pick the one you find most informative. (You need only to edit line three of the chunk. Leave the response variable as mass.)

## Graphical annotations

We can enhance our interpretation of patterns in the dots of a point plot by adding “notes” to the graphic, in other words, “annotating” the graphic. Lessons 3 and 4 introduce different formats of statistical annotations that highlight different features of the data.

Here, to illustrate what we mean by a graphical annotation, we will use a familiar non-statistical annotation. Figure 2.8 replots the locations of world cities with an annotation showing continents and islands.

Data shown without an annotation (Figure 2.3) may suggest a pattern. Adding an appropriate annotation enables you to judge the existence of the intuited pattern with much more confidence or, conversely, reject the pattern as a cloud-like illusion.

## Exercises

Exercise 2.1 Make this plot:

Each dot reflects one row of the Anthro_F data frame and is placed at coordinate (Ankle, Wrist). Here are three rows selected from Anthro_F.

| Waist | Height | Wrist | Knee | Ankle |
|-------|--------|-------|------|-------|
| 70.0  | 1.6256 | 16.0  | 37.0 | 20.0  |
| 65.5  | 1.6002 | 14.5  | 34.2 | 20.2  |
| 76.0  | 1.8034 | 17.5  | 38.2 | 25.0  |

Locate (by eye) the three corresponding dots in the Figure 4.1 point plot.


Exercise 2.2 Here are two graphs of the Anthro_F data:

Anthro_F |> point_plot(Forearm ~ Height)
Anthro_F |> point_plot(Wrist ~ Ankle)

Each of the plots consists of a cloud of points. Considering just the shape and orientation of the clouds, describe how they differ from one another.

Answer: The Wrist ~ Ankle cloud slopes upward, while the Forearm ~ Height cloud shows no clear slope.


Exercise 2.3 Consider these two point plots, both constructed from 200 rows sampled from the Whickham data frame. (Note: It’s sensible to look up the codebook/documentation for the frame using the command ?Whickham.)

1. In Plot (a), there are four clumps of dots. What about the variables being mapped to x and y is responsible for creating the four clumps? Answer: Both variables, smoker and outcome, are categorical; therefore point_plot() uses jittering to display them. Each variable happens to have two levels, so there are four different combinations of the values of smoker and outcome. Hence, four clumps.

2. In Plot (a), the clump on the upper right includes the fewest specimens. What do all the specimens in that clump have in common? Answer: All of them were smokers who had died by the time of the follow-up survey.

3. In Plot (b), there are two bands of dots. What about the variables involved produces this pattern? Answer: The outcome variable, mapped to y, is categorical, so jittering is used to place the dots. outcome has two levels, leading to the dots being broken up into two groups along y. But age is quantitative, and the specimens are broadly spread from ages 20 to 80. So the dots in each of the outcome groups get spread out along x.

4. Plot (b) shows an association between age and outcome that reflects a well known feature of human mortality. What is that feature? Answer: outcome records whether the person, interviewed in the 1970s, had died by the time of a 20-year follow-up survey. As a rule, older people are more likely to die in the next 20 years.

5. Plot (a) does not hint at the association between age and outcome seen in Plot (b). Give the simple reason why. Answer: Plot (a) does not display age.


Exercise 2.4

Here is a point plot. We won’t tell you the name of the data frame.

There’s no clear pattern to the dots, but that’s not the point of this exercise. Instead …

Write out on paper a few rows of an imagined data frame that could be the source of this graphic. You should get the variable names right, but the values you write for each numerical variable need merely be somewhere in the right range. For categorical variables, however, the levels should be exactly those shown in the graph.


Exercise 2.5 Reproduce each of the following graphs by constructing an appropriate tilde expression. The data frame is named Whickham, and you are welcome to look at the documentation, but all the information you need to figure out the tilde expression is already in the graphs. (You can enlarge a graph by clicking on it.)

• Graph A: Answer: outcome ~ smoker
• Graph B: Answer: outcome ~ age
• Graph C: Answer: age ~ smoker


Exercise 2.6 Consider the following two point plots, both made from the same data frame. The unit of observation is an antique grandfather clock sold at auction.

1. How many rows are there in the data frame? Answer: There is one dot for each row. Counting the dots gives 32.

2. Which variable is mapped to y? Which to x? Answer: price is mapped to y, age to x.

3. Which is the response variable? Answer: price. You can tell because the response variable is always mapped to y.

4. For each variable, say whether it is quantitative or categorical. Answer: Both age and price are quantitative.

5. In Plot B, how many explanatory variables are there? What are their names? Answer: The two explanatory variables are age and bidders.

6. Which variable is mapped to x? Which to color? Answer: bidders is mapped to x, age to color.

7. From Plot A we could see that age is quantitative. (It’s the age of each clock.) But in Plot B, the color scale is divided into three categories. What are the names of the levels of the color categories? Answer: The names are “[110-130]”, “[130-160]”, and “[160 to 190]”.

Note: The point_plot() function was written so that when a quantitative variable is mapped to color, the variable is displayed broken up into categories, each of which covers a range of numerical values, such as 110-130.
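Breaking a quantitative variable into ranges is a standard operation, and base R offers cut() for it. The sketch below illustrates the general idea only: the ages are made up, and nothing here is claimed about how point_plot() chooses its own ranges or labels them.

```r
# Made-up clock ages in years (illustration only).
age <- c(110, 118, 127, 135, 149, 158, 166, 175, 190)

# cut() divides the numeric range into intervals and returns a factor,
# that is, a categorical variable whose levels are ranges of values.
age_group <- cut(age, breaks = 3)

levels(age_group)   # the names of the three ranges
table(age_group)    # how many clocks fall in each range
```

Each clock now belongs to exactly one of three levels, which is what makes it possible to assign one color per level.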


Exercise 2.7

The Births2022 data frame records a random sample of 20,000 births in the US in 2022. Two of the variables, meduc and feduc, give the educational level of the mother and father respectively. The levels of these categorical variables correspond to “eighth grade or less,” “twelfth grade or less,” “high-school graduate,” “high-school graduate plus some college (but no degree),” “associate’s degree,” “bachelor’s degree,” “master’s degree,” and “professional degree” (such as a PhD, EdD, MD, LLB, DDS, JD). Educational data is missing (“NA”) for about 5% of mothers and 15% of fathers.

The graph is a point plot of the mother’s education level versus the father’s.

1. Is this a jittered point plot? Explain briefly how you can tell. Answer: Yes, it’s jittered both horizontally and vertically. The axis tick marks correspond to discrete categorical levels, but the points themselves are spread out a little bit around the discrete levels.
2. Is transparency used? Explain briefly how you can tell. Answer: Yes. In the blocks with a low number of points, each dot is not a solid color.
3. In principle, there are 9 $$\times$$ 9 = 81 possible combinations of the mother’s and father’s education. Which combination is the most common? What’s the second most common combination? Answer: Most common: HS for both mother and father. Second most common: Bachelors for both mother and father.
4. Is it more common for a woman with a Bachelor’s degree to marry a man with a high-school degree or vice versa? Answer: The former is more common. The square at mother=bachelors, father=HS is much darker than the similar square on the other side of the diagonal, that is, at father=bachelors, mother=HS.
5. What would the graphic look like if jittering had not been used? Answer: There would be a single dot at each of the populated intersections, rather than the square cloud of dots seen in the actual graph.


Exercise 2.8 Here’s a point_plot() variable mapping that you would never see in practice, but which may help you better understand the use of color and facets.

Explain why the graph consists of lines of dots in different locations and different colors in each of the panels. Answer: The variable mass is being mapped to all four graphical properties: x, y, color, and facet. Since each dot has the same x and y coordinate, the dots all appear on the same diagonal line in each facet. Likewise, each facet corresponds to one color.


Exercise 2.9

Here are two point plots that differ only in the way the explanatory variables are mapped to graphical properties.

Galton |> point_plot(height ~ mother + sex)
Galton |> point_plot(height ~ sex + mother)

1. In the plot that maps mother to x, explain how you would identify a child who is relatively short for their sex but who has a tall mother.

2. In the plot that maps mother to color, explain how you would identify a child who, as in (1), is relatively short for their sex but has a tall mother.


Exercise 2.10

The “body mass index” (BMI) is a familiar way of defining overweight. (Whether it is useful medically is controversial, but it is widely used.) BMI is an arithmetic combination of height and weight. Using the data in Anthro_F, make plots showing the relationship between BMI, Height, and Weight. There are six different ways of defining the graphics frame from three variables, e.g., Height ~ BMI + Weight or Weight ~ Height + BMI, and so on.

1. Make a list of the tilde expressions corresponding to the six different graphical frames using these variables.

2. Plot each one of the six possible graphical frames. From these choose one, whichever you like best, and use it to explain in graphical, everyday terms, how BMI is related to height and weight.


## Enrichment topics

It’s very easy to use point_plot() to draw a graph, but the resulting graph often violates standards for good communication. For instance, make this graph:

The y-axis is labeled “price,” the x-axis “bidders,” and the color scale “age.”

It’s good practice to label axes so that the units of the quantity are shown clearly. Referring to the documentation for Clock_auction, the units of price are US dollars, the units of age are years, and bidders is a count of the number of people who put in bids for the particular clock. A good choice for the labels in the plot would be: y-axis: “Price (USD)”; x-axis: “Number of bidders”; color: “Clock age (yrs)”.

The add_plot_labels() function allows you to enforce your own choices of labels. To see it in action, modify your code in the previous chunk to look like this:

Clock_auction |>
  point_plot(price ~ bidders + age) |>
  add_plot_labels(y = "Price (USD)",
                  x = "Number of bidders",
                  color = "Clock age (yrs)")

The examples in these Lessons tend not to apply the accepted conventions for labels. Instead, we typically use simply the name of the variable. That makes it easier for you to figure out how any particular graph was made. And you can always look at the documentation to find out about units, etc. That might be appropriate in the context of these Lessons. But, more generally, when communicating with people, labels on scales ought to be more informative than just the variable name.

Consider the relationship between the duration of pregnancy and the birth weight of the baby. Here’s a basic plot:

The cloud of points resembles a fish. The long tail corresponds to extremely to moderately pre-mature babies.

Notice that duration has not been jittered. That’s because it is a quantitative variable. But jittering would be appropriate because the vertical stripes are an artifact of rounding duration to the nearest week.

You can force point_plot() to jitter the variable mapped to x by adding this argument to the command: jitter = "x". (Note the quotes around "x".) Try it!
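Why jittering suits a rounded quantitative variable can be seen in a small base-R simulation. The durations below are simulated rather than taken from the birth data, and the jitter width of half a week is an assumption chosen to match rounding to the nearest week; point_plot() may well use a different width.

```r
set.seed(2)  # reproducible simulation

# Simulated pregnancy durations in weeks (made-up values, for illustration).
true_duration <- rnorm(1000, mean = 39, sd = 2)

# Recording to the nearest week collapses the values onto a few integers,
# which is what produces vertical stripes in a point plot.
recorded <- round(true_duration)
n_stripes <- length(unique(recorded))

# Jittering by up to half a week spreads each stripe across the range of
# true durations that would have been rounded to that week.
jittered <- recorded + runif(length(recorded), min = -0.5, max = 0.5)
n_positions <- length(unique(jittered))

c(stripes = n_stripes, jittered_positions = n_positions)
```

A thousand dots confined to a handful of stripes overlap heavily; after jittering they occupy many more horizontal positions, even though the jittered values are no closer to the true durations than the rounded ones were.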

Even with jittering, there is a lot of overplotting. The effect is to make it difficult to see what weight/duration values are most common.

A good way to deal with the over-plotting is by making the dots translucent. Do this by adding yet another argument, point_ink = 0.1. The number refers to the degree of translucency: 0 means completely transparent, 1 means completely opaque. There are so many data points that a point_ink value of 0.1 is actually quite large. Try making it smaller until you can easily see which value of duration is most common.

The “best” choice of transparency depends on what you are focusing on. To see the most common duration, a very low value of point_ink is called for. But such a low value would be counter-productive if the interest is in pre-mature babies.

Now consider the issue of twins. The plurality variable records such information.

Conventional wisdom is that twins tend to be lower in birth weight than singletons. Twins also tend to be born somewhat earlier than singletons. Can we see this in the data?

Here’s a possible graphic:

We had to increase point_ink to 0.5 in order to see the twins individually. But how do we know if the high-weight twins are hidden by the singletons, who are the vast majority?

Let’s try improving the plot by showing the different pluralities side by side. You can do this by modifying the tilde expression for point_plot() to weight ~ duration + 1 + plurality. This may look odd, but when you try it, you’ll see immediately why it makes sense.

With the twins drawn separately, you can afford to make point_ink smaller so that you can tell what are the most common values of weight and duration for each group. Do this until you can answer these questions:

1. Do twins tend to be lighter at birth and/or have shorter duration of pregnancy?
2. Looking at, say, 37 weeks duration for both singletons and twins, are the twins’ birth weights still discernibly lower? This sort of examination is usually described as “holding constant” a variable. “Holding constant” will be a major theme of these Lessons.

We will return to the question of how best to display these data in ?exr-06-900. We will need some new tools.