You have already seen some of the basic types of graphs. These notes are about the underlying similarities and commonalities of the various graph types. Recognizing these can help you design graphics that serve your specific purpose in displaying data.

Data Tables

A data graph is a visualization of the cases in a single data table.¹ Each case will become a mark in the graph. That mark might be simple, like a dot. The mark might be complex: intricate shapes, text, and so on.

Remember: A single data table is the origin of each data graph. Each individual case in the data table will become one mark.

Frames

Underlying every data graphic is the frame. Whether the graphic is on paper or on a screen, the frame defines what each location means. Most often, the frame is a rectangular region. Position in the frame is identified by two numbers.

You create a frame when you decide which two variables in your data will correspond to the two coordinates.

For instance, consider a dataset that you might use to explore the determinants of outward migration from countries. You might start your investigation with a data table like this, giving the migration count for each country as well as some of the explanatory candidates: inflation, unemployment, infant mortality.

country	unemployment	inflation	infant	migr
Serbia	20.10	2.20	6.16	0.00
United Arab Emirates	2.40	1.30	10.92	13.58
Philippines	7.40	2.80	17.64	1.23
Mongolia	9.00	8.20	23.15	0.85
Austria	4.90	2.10	4.16	1.76
Seychelles	2.00	4.30	10.77	1.00

The cases are countries; the variables are migration, inflation, unemployment, and infant mortality. To save space, only six of the 192 cases are shown here.

Typically, you will define a frame by selecting two variables from your data table. For instance, here’s a frame based on migration and unemployment:

[Figure: an empty frame defined by migration and unemployment]

The frame provides the meaning to location in space. For instance, imagine a country with a migration rate of 40 per 1000 people and unemployment of 25%. That country’s position in this frame would be:

[Figure: the same frame with the position of the imagined country marked]
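To make the idea of "position in the frame" concrete, here is a minimal sketch in Python. The class name `Frame`, the value ranges, the pixel dimensions, and the choice of which variable runs horizontally are all assumptions made for illustration; none of them is taken from the figure.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """A rectangular frame: two variables, their ranges, and a drawing size."""
    x_var: str
    y_var: str
    x_range: tuple          # (low, high) values shown horizontally
    y_range: tuple          # (low, high) values shown vertically
    width: float = 400.0    # drawing width in pixels (assumed)
    height: float = 300.0   # drawing height in pixels (assumed)

    def position(self, case):
        """Return the (horizontal, vertical) location of one case, in pixels."""
        x_lo, x_hi = self.x_range
        y_lo, y_hi = self.y_range
        px = (case[self.x_var] - x_lo) * self.width / (x_hi - x_lo)
        py = (case[self.y_var] - y_lo) * self.height / (y_hi - y_lo)
        return (px, py)


# Hypothetical ranges: migration 0 to 50 per 1000 people, unemployment 0 to 100%.
frame = Frame("migr", "unemployment", x_range=(0, 50), y_range=(0, 100))

# The imagined country from the text: migration 40 per 1000, unemployment 25%.
country = {"migr": 40, "unemployment": 25}
print(frame.position(country))
```

Changing the two variable names passed to `Frame` produces a different frame, and the same case then lands at a different location.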

Glyphs

The frame itself doesn’t display any of the cases. Instead, marks of ink in the frame will represent the cases. There will be one mark for each case.

A simple glyph is the basic shape used in scatter plots: a dot, a square, a triangle, an x, and so on. The following graph uses small blue dots. Since each case is a country, each blue dot represents one country.

[Figure: scatter plot of migration and unemployment, with one small blue dot per country]

The data table in the graph has 192 cases. So there are 192 dots.²
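The one-glyph-per-case rule, together with the usual practice of omitting cases that have a missing value, can be sketched as follows. This is only an illustration: the function name `make_glyphs` is hypothetical, and the case named "Atlantis" is made up to show the omission rule. The other six cases are the ones shown in the table earlier.

```python
# Six cases from the data table, plus one invented case with a missing value.
cases = [
    {"country": "Serbia", "unemployment": 20.10, "migr": 0.00},
    {"country": "United Arab Emirates", "unemployment": 2.40, "migr": 13.58},
    {"country": "Philippines", "unemployment": 7.40, "migr": 1.23},
    {"country": "Mongolia", "unemployment": 9.00, "migr": 0.85},
    {"country": "Austria", "unemployment": 4.90, "migr": 1.76},
    {"country": "Seychelles", "unemployment": 2.00, "migr": 1.00},
    {"country": "Atlantis", "unemployment": None, "migr": 3.00},  # hypothetical
]


def make_glyphs(cases, x_var, y_var):
    """Build one dot per case, skipping cases missing either frame variable."""
    glyphs = []
    for case in cases:
        if case[x_var] is None or case[y_var] is None:
            continue  # no position can be computed, so no glyph is drawn
        glyphs.append({"shape": "dot", "color": "blue",
                       "x": case[x_var], "y": case[y_var]})
    return glyphs


glyphs = make_glyphs(cases, "migr", "unemployment")
print(len(cases), "cases ->", len(glyphs), "glyphs")
```

Every glyph has the same shape and color here; only the two position values differ from one glyph to the next.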

You can see from the graph that there’s one dot — one country — near the top of the frame. That country happens to be Syria. The high migration results from the long civil war wracking that country.

There are six countries with unemployment rates greater than 50%. These are Zimbabwe, Djibouti, Liberia, Burkina Faso, Turkmenistan and the Republic of the Congo.

The glyphs are simple. Only position in the frame distinguishes one glyph from another. The shape, size, etc. of all of the glyphs are identical. There’s nothing about the glyph itself which identifies the country: there is no graphical attribute for the glyph that names the country. It’s no problem to add one, but the result is not very satisfactory:

[Figure: the same scatter plot with country names added to the glyphs]

The aspects of each glyph that we can perceive are called aesthetics, or graphical attributes. The word “aesthetics” applied in the context of glyphs is not used in the modern sense. Nowadays, most people associate “aesthetics” with notions of beauty and artistic taste. The earlier meaning of the word, properties relating to perception by the senses, is the one intended when it comes to glyphs.

Location in the frame is just one of the aesthetics for a glyph, that is, one of the ways glyphs can represent data to your visual perception. For instance, color could be used to show inflation rate.

[Figure: the scatter plot with dot color showing inflation rate]

Another aesthetic is size. Here size reflects the infant mortality rate:

[Figure: the scatter plot with dot color showing inflation rate and dot size showing infant mortality rate]

Scales and Guides

There are four aesthetics in the graph above. Each of the four aesthetics is set in correspondence with a variable; we say the variable is mapped to the aesthetic. Migration is being mapped to horizontal position, unemployment to vertical position, inflation to color, and infant mortality to size.
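A mapping of variables to aesthetics is, in essence, a small lookup table. The sketch below writes the four assignments just described as a Python dictionary and applies them to one case from the data table. The names `mapping` and `aesthetic_values` are hypothetical, chosen only for this illustration.

```python
# Which variable drives which aesthetic, following the description in the text.
mapping = {
    "x": "migr",              # horizontal position
    "y": "unemployment",      # vertical position
    "color": "inflation",
    "size": "infant",
}

# One case from the data table shown earlier.
mongolia = {"country": "Mongolia", "unemployment": 9.00,
            "inflation": 8.20, "infant": 23.15, "migr": 0.85}


def aesthetic_values(case, mapping):
    """Look up, for each aesthetic, the value of the variable mapped to it."""
    return {aesthetic: case[variable] for aesthetic, variable in mapping.items()}


print(aesthetic_values(mongolia, mapping))
```

The result still holds data values, not screen positions or colors. Turning each value into an actual graphical property is the job of a scale.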

A scale is the relationship between a variable and the aesthetic to which it is mapped. For unemployment, the scale says what value of the variable will correspond to position at the bottom of the frame, what value will correspond to the top of the frame, and where things fall in between.

Not all scales are about position. For instance, inflation is translated to color: black for inflation near zero, blue for inflation near 50%. Similarly, infant mortality is translated to size: the middle-sized dot corresponds to a yearly mortality of 60 per 1000.
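A scale can be thought of as a function from data values to values of a graphical attribute. Here is a minimal sketch of a linear position scale and a black-to-blue color scale. The function names, the pixel range, and the exact domains (0 to 100% for unemployment, 0 to 50% for inflation) are assumptions for illustration, not the settings used in the graph.

```python
def linear_scale(domain, out_range):
    """Return a function mapping values in `domain` linearly onto `out_range`."""
    d_lo, d_hi = domain
    r_lo, r_hi = out_range

    def scale(value):
        fraction = (value - d_lo) / (d_hi - d_lo)
        return r_lo + fraction * (r_hi - r_lo)

    return scale


def color_scale(domain, low_rgb, high_rgb):
    """Return a function blending from `low_rgb` to `high_rgb` across `domain`."""
    channels = [linear_scale(domain, (lo, hi))
                for lo, hi in zip(low_rgb, high_rgb)]

    def scale(value):
        return tuple(round(channel(value)) for channel in channels)

    return scale


# Assumed domains and pixel range, chosen only for illustration.
unemployment_to_y = linear_scale(domain=(0, 100), out_range=(0, 300))
inflation_to_color = color_scale(domain=(0, 50),
                                 low_rgb=(0, 0, 0),      # black near zero
                                 high_rgb=(0, 0, 255))   # blue near 50%

print(unemployment_to_y(25))     # a quarter of the way up the frame
print(inflation_to_color(0))     # black
print(inflation_to_color(50))    # blue
print(inflation_to_color(8.2))   # Mongolia's inflation rate
```

Under these assumed settings an inflation rate of 8.2% comes out much nearer to black than to blue, which hints at why a single linear color scale can hide differences among low values.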

Scales translate values into aesthetic properties. Guides help the human reader to do the back translation. For position aesthetics, the most common sort of guide is the familiar axis with its tick marks and labels. But notice also the guide that tells how dot color corresponds to inflation. There’s still another guide telling how dot size corresponds to infant mortality.
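The relationship between a scale and its guide can be sketched in a few lines. In this hypothetical example, `axis_guide` lists where tick marks and labels would go for a linear position scale, and `read_back` performs the back translation from a position to a value. The function names, the domain, and the pixel range are assumptions for illustration.

```python
def linear_scale(domain, out_range):
    """Return a function mapping values in `domain` linearly onto `out_range`."""
    d_lo, d_hi = domain
    r_lo, r_hi = out_range

    def scale(value):
        fraction = (value - d_lo) / (d_hi - d_lo)
        return r_lo + fraction * (r_hi - r_lo)

    return scale


def axis_guide(domain, out_range, n_ticks=5):
    """List (position, label) pairs for evenly spaced tick marks."""
    scale = linear_scale(domain, out_range)
    d_lo, d_hi = domain
    step = (d_hi - d_lo) / (n_ticks - 1)
    values = [d_lo + i * step for i in range(n_ticks)]
    return [(scale(value), f"{value:g}") for value in values]


def read_back(domain, out_range, position):
    """Back translation: recover the data value shown at a given position."""
    return linear_scale(out_range, domain)(position)


# Assumed: unemployment from 0 to 100% drawn over 300 pixels.
ticks = axis_guide(domain=(0, 100), out_range=(0, 300))
for position, label in ticks:
    print(f"tick at {position:5.1f} px labeled {label}")

print(read_back((0, 100), (0, 300), 150))  # value shown halfway up the axis
```

The guide is computed from the same domain and range as the scale, so the two cannot disagree. A color key or size key follows the same logic with a different output attribute.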

Scaling Sensibly

The chart above makes it difficult to draw any inferences about the relationship between migration and the other variables. Almost all the countries are in the same, small region. The dots overlap extensively; it’s hard to tell how many of the small dots there are or how they are spread out. There’s almost no information visible about inflation.

The scales used in the graph are the problem here. The scale that maps migration to position is concentrating the dots near 0. The scale that maps inflation to color is too concentrated: the viewer can’t distinguish between rates of 1% per year and, say, 10% per year, even though these may be very different economically. The infant mortality scale, with dot radius proportional to mortality, tends to focus attention on the very high-mortality countries.

Each of these difficulties can be addressed by an appropriate change of the scaling relationship between the variable and the graphical attribute. The following graph shows the same data, but on different scales:

Logarithmic spatial scales spread apart the clustered points.
The dot size is scaled by area instead of radius, producing a more visible (and more honest) perception of size.
The color is broken into discrete groups so that the viewer can easily distinguish high-inflation from low-inflation countries.

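[Figure: the same data on logarithmic position scales, with dot area showing infant mortality and discrete color groups showing inflation]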
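The three changes of scale can each be written as a small function. The sketch below is illustrative only: the function names, the domain of the logarithmic scale, the maximum dot radius, and the break points between inflation groups are all assumptions, not the values used in the graph. Note also that a value of exactly 0 has no logarithm, so a case with zero migration would need separate handling on a logarithmic scale.

```python
import math


def log_scale(domain, out_range):
    """Map values logarithmically: equal ratios get equal distances."""
    d_lo, d_hi = domain  # both ends must be positive
    r_lo, r_hi = out_range
    log_lo, log_hi = math.log10(d_lo), math.log10(d_hi)

    def scale(value):
        fraction = (math.log10(value) - log_lo) / (log_hi - log_lo)
        return r_lo + fraction * (r_hi - r_lo)

    return scale


def radius_for_area(value, max_value, max_radius):
    """Dot radius chosen so that the dot's AREA is proportional to `value`."""
    return max_radius * math.sqrt(value / max_value)


def inflation_group(rate, breaks=(2, 5, 10)):
    """Assign an inflation rate (% per year) to one of a few discrete groups."""
    labels = ["under 2%", "2 to 5%", "5 to 10%", "10% and over"]
    for upper, label in zip(breaks, labels):
        if rate < upper:
            return label
    return labels[-1]


# Assumed domain 0.1 to 100 drawn over 300 pixels: each factor of 10 in the
# data takes up the same 100 pixels, so small values are spread apart.
migration_to_x = log_scale(domain=(0.1, 100), out_range=(0, 300))
print(migration_to_x(1), migration_to_x(10))

# A quarter of the maximum value gets half the maximum radius,
# and therefore a quarter of the maximum area.
print(radius_for_area(15, max_value=60, max_radius=10))

print(inflation_group(1.3), "|", inflation_group(8.2))
```

With radius proportional to the value instead, the same dot would have one sixteenth of the maximum area, which is why radius scaling exaggerates the largest values.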

With the appropriate scaling, the countries are spread out in an approachable way. You have a hope of making an inference about the relationship among these variables. For instance, use the graph to judge the validity of this hypothesis: “Low inflation countries tend to have more outward migration than high inflation countries.” The answer may not be obvious, but at least the question is approachable.

Components of Data Graphics

You’ve seen four general components of data graphics:

Frame: Specified by picking two variables.
Glyph: A shape representing a single case. Variables set graphical attributes of the shape: size, color, shape, and so on. The location of the glyph — location is an important graphical attribute! — is set by the two variables defining the frame.
Scale: The relationship between the value of a variable and the graphical attribute to be displayed for that value.
Guide: An indication for the human viewer of the scale, that is, a graphic showing how a variable is encoded in its graphical attribute. Common guides are x- and y-axis tick marks and color keys.

These concepts apply to data graphics generally. Here are some examples based on a data table derived from the NHANES data.

Recall that in NHANES each case is an individual person. Here’s a small part of the data table:

age	sex	smoker	death	weight	bmi
81.00	female	no	alive	48.70	23.72
41.00	female	no	alive	52.70	20.59
65.00	male	no	cardiovascular death	81.20	28.91
28.00	male	no	alive	125.40	35.86
18.00	female	no	alive	52.80	25.25
20.00	female	yes	alive	63.80	25.56

It’s certainly possible to graph the NHANES data table itself. There are 31126 cases in the data, so there are 31126 marks in the graph. Here’s a graph showing the relationship between age, smoking status, and mortality for all of the individual people in the NHANES data.

[Figure: age, smoking status, and mortality for each individual in the NHANES data]

In the following examples, the individual-by-individual data in NHANES is part of the backstory. The next graphs are made from tables in which each case is a group of people; each table summarizes survival for the different groups.

ageGroup	count	total	fracAlive
Twenties	2777	2793	0.99
Thirties	2520	2549	0.99
Forties	2347	2411	0.97
Fifties	1870	1975	0.95
Sixties	2066	2271	0.91
Seventies	1365	1764	0.77
Eighties	752	1277	0.59
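The column names in this table are not explained, but the numbers are consistent with reading `count` as the number of people in the group who are alive, `total` as the size of the group, and `fracAlive` as their ratio rounded to two decimals. That reading is an assumption; the short Python check below confirms only that the arithmetic matches.

```python
# (ageGroup, count, total, fracAlive) rows copied from the table above.
rows = [
    ("Twenties", 2777, 2793, 0.99),
    ("Thirties", 2520, 2549, 0.99),
    ("Forties", 2347, 2411, 0.97),
    ("Fifties", 1870, 1975, 0.95),
    ("Sixties", 2066, 2271, 0.91),
    ("Seventies", 1365, 1764, 0.77),
    ("Eighties", 752, 1277, 0.59),
]

for age_group, count, total, frac_alive in rows:
    computed = round(count / total, 2)
    print(f"{age_group:10s} {count:5d} / {total:5d} = {computed:.2f}")
    # Under the assumed reading, the computed ratio matches the table.
    assert computed == frac_alive
```

Each of the seven rows is one case in this new table, even though each row summarizes hundreds or thousands of cases from the individual-level data.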

[Figure: bar chart of fraction alive by age group]

Seven cases in the data table, seven glyphs in the graph. Here are the details:

Frame: defined by two variables, “fraction alive” and “age.” Fraction alive is an ordinary quantitative variable, but “age” is categorical.
Glyph: a gray-colored bar. The base of the bar is always at 0, the top at the fraction alive.
Scales: “Fraction alive,” a variable running from 0 to 1, is scaled in a linear way in the vertical direction. “Age,” a categorical variable, is arranged in the natural order and spread evenly along the horizontal direction.
Guides: The usual number-line guide for the vertical axis. For the horizontal axis, the label for each age group at the position set by the scale for that group.
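The four components listed above can be traced through a short sketch that turns the seven cases into seven bar glyphs. The drawing size, the share of each slot filled by its bar, and the names `band_scale` and `bar_glyphs` are assumptions made for illustration.

```python
age_groups = ["Twenties", "Thirties", "Forties", "Fifties",
              "Sixties", "Seventies", "Eighties"]
frac_alive = [0.99, 0.99, 0.97, 0.95, 0.91, 0.77, 0.59]

WIDTH, HEIGHT = 350.0, 200.0   # assumed drawing size in pixels
BAR_FRACTION = 0.8             # assumed share of each slot filled by its bar


def band_scale(categories, width):
    """Spread categories evenly: return {category: (left, right)} slots."""
    slot = width / len(categories)
    return {c: (i * slot, (i + 1) * slot) for i, c in enumerate(categories)}


def bar_glyphs(categories, values, width, height):
    """One gray bar per case: base at 0, top at the scaled value."""
    slots = band_scale(categories, width)
    bars = []
    for category, value in zip(categories, values):
        left, right = slots[category]
        pad = (right - left) * (1 - BAR_FRACTION) / 2
        bars.append({
            "label": category,       # text for the horizontal-axis guide
            "left": left + pad,
            "right": right - pad,
            "bottom": 0.0,           # the base of every bar sits at 0
            "top": value * height,   # linear scale: 0..1 onto 0..height
            "color": "gray",
        })
    return bars


bars = bar_glyphs(age_groups, frac_alive, WIDTH, HEIGHT)
for bar in bars:
    print(bar["label"], round(bar["left"]), round(bar["right"]), round(bar["top"]))
```

The categorical scale has no arithmetic in it beyond equal spacing: the order of the list alone decides where each age group goes.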

Here’s another graph depicting survival rates. This one includes an additional variable describing whether the people in the group smoked. The data are:

smoker	ageGroup	count	total	fracAlive
yes	Twenties	774	779	0.99
yes	Thirties	695	708	0.98
yes	Forties	702	732	0.96
yes	Fifties	431	478	0.90
yes	Sixties	310	381	0.81
yes	Seventies	112	173	0.65
yes	Eighties	29	51	0.57
no	Twenties	1995	2006	0.99
no	Thirties	1811	1826	0.99
no	Forties	1618	1652	0.98
no	Fifties	1405	1460	0.96
no	Sixties	1706	1837	0.93
no	Seventies	1226	1553	0.79
no	Eighties	704	1195	0.59

There are 14 cases in the data table, so there will be 14 glyphs.

[Figure: bar chart of fraction alive by age group, with bar color showing smoking status]

A graphical attribute of bar color has been added. The color is set by the “Smoking Status” variable. A guide to the relationship between Smoking Status and color has been added.
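Adding a color aesthetic for a categorical variable amounts to one more lookup, plus a guide built from that same lookup. The sketch below uses the 14 cases from the table. The two color names and the function names are hypothetical, and the colors in the actual graph may differ.

```python
# (smoker, ageGroup, fracAlive) for the 14 cases in the table above.
cases = [
    ("yes", "Twenties", 0.99), ("yes", "Thirties", 0.98),
    ("yes", "Forties", 0.96), ("yes", "Fifties", 0.90),
    ("yes", "Sixties", 0.81), ("yes", "Seventies", 0.65),
    ("yes", "Eighties", 0.57),
    ("no", "Twenties", 0.99), ("no", "Thirties", 0.99),
    ("no", "Forties", 0.98), ("no", "Fifties", 0.96),
    ("no", "Sixties", 0.93), ("no", "Seventies", 0.79),
    ("no", "Eighties", 0.59),
]

# Hypothetical color choices for the two levels of the variable.
smoker_colors = {"yes": "orange", "no": "steelblue"}


def colored_bars(cases, colors):
    """One bar per case; bar color is set by the smoking-status variable."""
    return [{"ageGroup": age, "top": frac, "color": colors[smoker]}
            for smoker, age, frac in cases]


def color_guide(colors, title="Smoking Status"):
    """Legend entries that let a reader translate color back to status."""
    return [(title, status, color) for status, color in colors.items()]


bars = colored_bars(cases, smoker_colors)
print(len(bars), "glyphs")
for entry in color_guide(smoker_colors):
    print(entry)
```

How bars for the same age group are arranged relative to one another (side by side, stacked, or in separate panels) is a separate position decision that this sketch leaves out.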


¹ Sometimes you will want to layer data graphs on top of one another, but each layer is a visualization of a single data table.
² On occasion, a case may have a missing value for one or more variables. Typically, such cases are omitted when making the graph.