Data graphics are built from parts. Chapter 5 showed the parts assembled together. This chapter looks at the parts individually.
Of course, a data frame provides the basis for drawing a data graphic. The relationship between a data frame and a graphic is simple: Each case in the data frame becomes a mark in the graph. The designer of the graphic — you — chooses which variables the graphic will display and how each variable is to be represented graphically: position, size, color, and so on. The marks themselves are called glyphs. A data graphic has one glyph for each case in the data frame.
Key Graphics Vocabulary
frame: The relationship between position and the data being plotted.
glyph: The basic graphical “unit” that represents one case. Other terms used include “mark” and “symbol.” Variables set graphical attributes of the glyph: size, color, shape, and so on. The location of the glyph (location is an important graphical attribute!) is set by the two variables defining the frame.
aesthetic: Any graphical attribute of a glyph: size, location, shape, color, etc.
scale: The relationship between the value of a variable and the graphical attribute to be displayed for that value.
guide: A display of the scale for the human viewer, showing how the values of a variable are encoded as a graphical attribute. Common guides are x- and y-axis tick marks and color keys.
The frame of a graphic provides the space for drawing glyphs. But there is more to a frame than a blank canvas or piece of paper. The frame defines what position means. Most often, the frame is a rectangular region and position is described in terms of the familiar \((x, y)\) Cartesian coordinate system. In creating a frame, you must decide which variable in your data will correspond to the \(x\) coordinate, and which to the \(y\) coordinate.
For instance, consider a dataset relevant to economic productivity. Table 6.1 gives per capita GDP for each country as well as some of the explanatory candidates: average educational level in the population, length of roadways per unit area, and Internet use as a fraction of the population.
Table 6.1: Data relevant to economic performance. This is an excerpt from CountryData found in the
You define a frame by selecting two variables from the glyph-ready data frame. For instance, Figure 6.1 shows a frame based on GDP and length of roadways. The frame provides the meaning to location in space.
The frame itself doesn’t display any of the cases. Instead, the glyphs positioned in the frame represent the cases. There will be one glyph for each case in the data frame.
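The idea that a frame is a choice of two variables, and that each case then yields exactly one glyph, can be sketched in a few lines of Python. This is only an illustrative sketch: the rows, their values, and the helper `place_glyphs` are hypothetical and are not taken from CountryData.

```python
# Hypothetical rows standing in for a glyph-ready data frame.
# The values are invented for illustration; they are NOT from CountryData.
cases = [
    {"country": "A", "roadways": 0.2, "GDP": 1500.0},
    {"country": "B", "roadways": 1.1, "GDP": 22000.0},
    {"country": "C", "roadways": 0.6, "GDP": 9000.0},
]

def place_glyphs(cases, x_var, y_var):
    """Return one glyph per case, positioned by the two frame variables."""
    return [{"x": case[x_var], "y": case[y_var]} for case in cases]

# Choosing the frame means choosing which variable plays x and which plays y.
glyphs = place_glyphs(cases, x_var="roadways", y_var="GDP")
```

Swapping `x_var` and `y_var` gives a different frame built from the same cases; the number of glyphs stays equal to the number of cases either way.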
The basic shape used in scatter plots is a simple glyph: a dot, a square, a triangle, an x, and so on. Figure 6.2 uses small dots. Since each case is a country, each dot represents one country.
In Figure 6.2, the glyphs are simple. Only position in the frame distinguishes one glyph from another. The shape, size, etc. of all of the glyphs are identical. There’s nothing about the glyph itself that identifies the country. It’s possible to use a glyph with several attributes. Figure 6.3 uses both location and label, mapping country name to the label.
But glyphs can have several properties. The aspects of each glyph that we can perceive are called aesthetics, or equivalently graphical attributes. The word aesthetics applied in the context of glyphs is not used in the modern sense. Nowadays, most people associate aesthetics with notions of beauty and artistic taste. The earlier meaning of the word, properties relating to perception by the senses, is the one intended when it comes to glyphs.
Location in the frame supplies the \((x, y)\) aesthetics for a glyph, but other aesthetics can display other variables in the data frame. For instance, color could be used to show Internet use (as a fraction of the population), as in Figure 6.4. Another aesthetic is size. The size is fixed in Figure 6.4: it is the same for every country. Figure 6.5 maps the average years of education onto the size aesthetic.
There are four aesthetics in Figure 6.5. Each of the four aesthetics is set in correspondence with a variable; we say the variable is mapped to the aesthetic. Length of roadways is being mapped to horizontal position, GDP to vertical position, Internet connectivity to color, and educational attainment to size.
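One way to picture a mapping is as a small lookup table from aesthetic to variable. The sketch below writes the four mappings described for Figure 6.5 as a Python dictionary and applies it to each case. The rows and values are invented, and `educ` is an assumed column name, since the text does not name the education variable.

```python
# Hypothetical rows; values are invented for illustration only.
cases = [
    {"country": "A", "roadways": 0.2, "GDP": 1500.0, "net_users": 0.10, "educ": 4.0},
    {"country": "B", "roadways": 1.1, "GDP": 22000.0, "net_users": 0.80, "educ": 11.0},
]

# Aesthetic -> variable, following the mapping described for Figure 6.5.
# "educ" is an assumed column name; the text does not give one.
mapping = {"x": "roadways", "y": "GDP", "color": "net_users", "size": "educ"}

def map_aesthetics(case, mapping):
    """Look up, for one case, the data value assigned to each aesthetic."""
    return {aesthetic: case[variable] for aesthetic, variable in mapping.items()}

glyph_values = [map_aesthetics(case, mapping) for case in cases]
```

At this stage each glyph still carries raw data values. Turning those values into actual positions, colors, and sizes is the job of a scale.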
A scale is the relationship between a variable and the aesthetic to which it is mapped. For roadways, which is mapped to horizontal position, the scale says what value of the variable will correspond to position at the left edge of the frame, what value will correspond to the right edge, and where things fall in between.
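A position scale of this kind is often a simple linear interpolation: one data value is pinned to one edge of the frame, another to the opposite edge, and everything else falls proportionally in between. The following is a minimal sketch under that assumption; the domain (0 to 2) and the frame extent (0 to 400 drawing units) are hypothetical, and real plotting software may use other scales, such as logarithmic ones.

```python
def linear_scale(domain, value_range):
    """Build a function mapping data values in `domain` linearly onto `value_range`."""
    d_min, d_max = domain
    r_min, r_max = value_range
    if d_max == d_min:
        raise ValueError("domain must span more than one value")

    def scale(value):
        # Fraction of the way from d_min to d_max, carried over to the range.
        fraction = (value - d_min) / (d_max - d_min)
        return r_min + fraction * (r_max - r_min)

    return scale

# Hypothetical domain for the variable and a frame 400 drawing units across.
position_scale = linear_scale(domain=(0.0, 2.0), value_range=(0.0, 400.0))
```

Here `position_scale(0.0)` lands on one edge of the frame, `position_scale(2.0)` on the opposite edge, and `position_scale(1.0)` halfway between.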
Not all scales are about position. For instance, in Figure 6.5, net_users is translated to color. Similarly, average educational attainment (in years) is translated to size: the middle-sized dot corresponds to 7½ years of education.
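Color and size scales work on the same principle as position scales; only the output changes. The sketch below blends between two RGB colors for a fraction between 0 and 1, and converts years of education into a dot radius. The end-point colors, the 0 to 15 year domain, and the radius range are all assumptions chosen for illustration, not the settings used in Figure 6.5.

```python
def lerp(start, end, fraction):
    """Linear interpolation between start and end."""
    return start + fraction * (end - start)

def color_scale(fraction, low=(0, 0, 128), high=(255, 255, 0)):
    """Blend two RGB colors; `fraction` is expected to lie in [0, 1]."""
    return tuple(round(lerp(lo, hi, fraction)) for lo, hi in zip(low, high))

def size_scale(years, domain=(0.0, 15.0), radius_range=(1.0, 7.0)):
    """Translate years of education into a dot radius (assumed domain and range)."""
    d_min, d_max = domain
    fraction = (years - d_min) / (d_max - d_min)
    return lerp(radius_range[0], radius_range[1], fraction)
```

With these assumed settings, a country with 7.5 years of education gets the middle radius, `size_scale(7.5) == 4.0`, halfway between the smallest and largest dots.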
Scales translate values into aesthetic properties. Guides help the human reader to do the back translation. For position aesthetics, the most common sort of guide is the familiar axis with its tick marks and labels. But notice also the guide that tells how dot color corresponds to Internet connectivity. There’s still another guide telling how dot size corresponds to education.
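The back translation that a guide supports can itself be written down. In the sketch below, `tick_values` picks evenly spaced data values to label along an axis, and `invert_linear` recovers the data value encoded by a given position, assuming a linear scale. Both function names and the numbers used are hypothetical.

```python
def tick_values(d_min, d_max, count):
    """Evenly spaced data values to label along an axis (count must be >= 2)."""
    if count < 2:
        raise ValueError("need at least two ticks")
    step = (d_max - d_min) / (count - 1)
    return [d_min + i * step for i in range(count)]

def invert_linear(domain, value_range, position):
    """Back-translate a position in the frame to the data value it encodes."""
    d_min, d_max = domain
    r_min, r_max = value_range
    fraction = (position - r_min) / (r_max - r_min)
    return d_min + fraction * (d_max - d_min)

# Five tick labels for a hypothetical axis running from 0 to 2.
ticks = tick_values(0.0, 2.0, count=5)
```

A reader using an axis does informally what `invert_linear` does: locate the glyph, compare it with the labeled ticks, and read off the value.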
Using multiple aesthetics such as shape, color, and size to display multiple variables can produce a confusing, hard-to-read graph. Facets provide a simple and effective alternative. Figure 6.6 uses facets to show different levels of Internet connectivity, providing a better view than Figure 6.5.
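Faceting amounts to splitting the cases into groups and drawing each group in its own copy of the frame. A rough sketch follows, assuming Internet connectivity is cut into levels at hypothetical break points; the rows are invented, and the break points are not those used in Figure 6.6.

```python
from collections import defaultdict

# Hypothetical rows; values are invented for illustration only.
cases = [
    {"country": "A", "net_users": 0.10},
    {"country": "B", "net_users": 0.80},
    {"country": "C", "net_users": 0.30},
    {"country": "D", "net_users": 0.15},
]

def facet_level(net_users, breaks=(0.25, 0.5, 0.75)):
    """Assign a value to a connectivity level, numbered from 0 (assumed cut points)."""
    for level, upper in enumerate(breaks):
        if net_users < upper:
            return level
    return len(breaks)

def split_into_facets(cases, variable):
    """Group cases by facet level; each group is drawn in its own panel."""
    panels = defaultdict(list)
    for case in cases:
        panels[facet_level(case[variable])].append(case)
    return dict(panels)

panels = split_into_facets(cases, "net_users")
```

Every case appears in exactly one panel, so the panels together still contain one glyph per case.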
On occasion, data from more than one data frame are graphed together. For instance, suppose you want a display of one state’s hospital providers’ charges for different medical procedures. The glyph-ready data frame for New Jersey looks like Table 6.2. The glyph-ready table can be translated to a chart (Figure 6.7 (top)) using bars to give a fair impression of the range in charges for different medical procedures in New Jersey.
Table 6.2: Glyph-ready data for the barplot layer in Figure 6.7
|… and so on for 100 rows altogether.|
How do the New Jersey charges compare to those in other states? Tables 6.2 and 6.3 provide relevant data. The two data frames, one for New Jersey and one for the whole country, can be plotted with different types of glyph: bars for New Jersey and dots for the whole country as in Figure 6.8.
Table 6.3: Glyph-ready data frame for the scatter-plot layer in Figure 6.8
|… and so on for 5,025 rows altogether.|
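Plotting two data frames together can be thought of as stacking layers: each layer has its own data frame and its own kind of glyph, while all layers share one frame. The sketch below illustrates this with invented procedure codes, state labels, and charges; none of the values come from Tables 6.2 or 6.3, and `make_layer` is a hypothetical helper.

```python
# Hypothetical data; procedure codes, states, and charges are invented.
one_state_rows = [
    {"procedure": "P1", "charge": 30000.0},
    {"procedure": "P2", "charge": 50000.0},
]
all_state_rows = [
    {"procedure": "P1", "state": "S1", "charge": 20000.0},
    {"procedure": "P1", "state": "S2", "charge": 25000.0},
    {"procedure": "P2", "state": "S1", "charge": 41000.0},
]

def make_layer(rows, glyph, x_var, y_var):
    """One glyph per row, all of the same kind, placed in the shared frame."""
    return [{"glyph": glyph, "x": row[x_var], "y": row[y_var]} for row in rows]

# Both layers use the same frame: procedure on x, charge on y.
layers = [
    make_layer(one_state_rows, "bar", "procedure", "charge"),
    make_layer(all_state_rows, "dot", "procedure", "charge"),
]
```

Because the layers share the frame, a bar and the dots for the same procedure line up at the same horizontal position and can be compared directly.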
With the context provided by the individual states, it’s easy to see the charges in New Jersey are among the highest in the country for each medical procedure. (A description of each medical procedure number is given in the data frame DirectRecoveryGroups in the
Problem 6.1: The following chart contains four facets. Each shows the amount of a substance in different conditions:
Let’s deconstruct the chart to see if it follows the conventions for facets in graphics used in this book.
Problem 6.2: Consider this graph
Here are some of the variables and their levels:
concentration: numerical \(-3\) to \(5\)
target: CcpN, Uptake, Other
flux: zero or positive
gene: MaeN, PtsG, DctP, …
molecule: Glucose, Fructose, Gluconate, …
Problem 6.3: Consider this graphic:
Suppose the glyph-ready data underlying the graphic were structured as follows:
Consider these two kinds of glyph present in the graph: and
Problem 6.4: The graph, from Google Maps, shows mass transit options on a Monday morning for getting from Orinda, CA (in the East Bay), to Palo Alto, CA (in the West Bay).
Figure accompanying Problems 6.5 through 6.9: The figure presents forecasts for the US Senate elections in Nov. 2014. The numbers or words give the forecast probability of one party’s candidate (Democrat or Republican) winning. The forecasts are based on polls up through the end of August 2014. Individual results from several different polling organizations are shown. The graphic is an excerpt from the full graphic at , which shows predictions for all 36 Senate seats up for election in 2014. Source: New York Times