You have already seen some of the basic types of graphs. These notes are about the underlying similarities and commonalities of the various graph types. Recognizing these can help you design graphics that serve your specific purpose in displaying data.

Data Tables

A data graph is a visualization of the cases in a single data table.[1] Each case will become a mark in the graph. That mark might be simple, like a dot. The mark might be complex: intricate shapes, text, and so on.

Remember: A single data table is the origin of each data graph. Each individual case in the data table will become one mark.

Frames

Underlying data graphics is the frame. Whether the graphic is on paper or on a screen, the frame defines what each location means. Most often, the frame is a rectangular region. Position in the frame is identified by two numbers.

You create a frame when you decide which two variables in your data will correspond to the two coordinates.

For instance, consider a dataset such as you might use to explore the determinants of outward migration from countries. You might start your investigation with a data table like this, giving the migration count for each country as well as some of the explanatory candidates: inflation, unemployment, infant mortality.

country               unemployment  inflation  infant   migr
Serbia                       20.10       2.20    6.16   0.00
United Arab Emirates          2.40       1.30   10.92  13.58
Philippines                   7.40       2.80   17.64   1.23
Mongolia                      9.00       8.20   23.15   0.85
Austria                       4.90       2.10    4.16   1.76
Seychelles                    2.00       4.30   10.77   1.00

The cases are countries; the variables are migration, inflation, unemployment, and infant mortality. To save space, only six of the 192 cases are shown here.

Typically, you will define a frame by selecting two variables from your data table. For instance, here’s a frame based on migration and unemployment:

[Figure: an empty frame with migration and unemployment as its two coordinates]

The frame gives meaning to location in space. For instance, imagine a country with a migration rate of 40 per 1000 people and unemployment of 25%. That country’s position in this frame would be:

[Figure: the same frame with that country's position marked]

Glyphs

The frame itself doesn’t display any of the cases. Instead, marks of ink in the frame will represent the cases. There will be one mark for each case.

A simple glyph is the basic shape used in scatter plots: a dot, a square, a triangle, an x, and so on. The following graph uses small blue dots. Since each case is a country, each blue dot represents one country.

[Figure: scatter plot of migration and unemployment, one small blue dot per country]

The data table behind the graph has 192 cases, so there are 192 dots.[2]
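As a rough sketch of the one-case-one-glyph rule, the following Python snippet treats the six countries shown earlier as a list of records and turns each one into a dot positioned by the two frame variables. The function name `make_glyphs`, the dictionary layout, the choice of which variable goes on which axis, and the extra record with a missing value are all assumptions made for illustration; they are not taken from the software used to draw the graphs in these notes.

```python
# Six of the 192 cases from the migration table (values copied from the table above).
countries = [
    {"country": "Serbia", "unemployment": 20.10, "inflation": 2.20, "infant": 6.16, "migr": 0.00},
    {"country": "United Arab Emirates", "unemployment": 2.40, "inflation": 1.30, "infant": 10.92, "migr": 13.58},
    {"country": "Philippines", "unemployment": 7.40, "inflation": 2.80, "infant": 17.64, "migr": 1.23},
    {"country": "Mongolia", "unemployment": 9.00, "inflation": 8.20, "infant": 23.15, "migr": 0.85},
    {"country": "Austria", "unemployment": 4.90, "inflation": 2.10, "infant": 4.16, "migr": 1.76},
    {"country": "Seychelles", "unemployment": 2.00, "inflation": 4.30, "infant": 10.77, "migr": 1.00},
]


def make_glyphs(cases, x_var, y_var):
    """Return one dot per case, placed by the two variables that define the frame.

    Cases with a missing (None) value for either frame variable are skipped.
    """
    glyphs = []
    for case in cases:
        x, y = case.get(x_var), case.get(y_var)
        if x is None or y is None:
            continue  # no position in the frame, so no glyph
        glyphs.append({"shape": "dot", "x": x, "y": y})
    return glyphs


glyphs = make_glyphs(countries, x_var="migr", y_var="unemployment")
print(len(glyphs))  # 6: one dot per case

# Hypothetical case with a missing unemployment value: it produces no glyph.
incomplete = countries + [{"country": "Example", "unemployment": None, "migr": 2.5}]
print(len(make_glyphs(incomplete, "migr", "unemployment")))  # still 6
```

Every dot here has the same shape; only its position differs from case to case, which is the situation the scatter plot illustrates.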

You can see from the graph that there’s one dot — one country — near the top of the frame. That country happens to be Syria. The high migration results from the long civil war wracking that country.

There are six countries with unemployment rates greater than 50%. These are Zimbabwe, Djibouti, Liberia, Burkina Faso, Turkmenistan and the Republic of the Congo.

The glyphs are simple. Only position in the frame distinguishes one glyph from another. The shape, size, etc. of all of the glyphs are identical. There’s nothing about the glyph itself which identifies the country: there is no graphical attribute for the glyph that names the country. It’s no problem to add one, but the result is not very satisfactory:

[Figure: the same scatter plot with each dot labeled by country name]

The aspects of each glyph that we can perceive are called aesthetics, or graphical attributes. The word “aesthetics” applied in the context of glyphs is not used in the modern sense. Nowadays, most people associate “aesthetics” with notions of beauty and artistic taste. The earlier meaning of the word, properties relating to perception by the senses, is the one intended when it comes to glyphs.

Location in the frame is just one of the aesthetics for a glyph, that is, one of the ways glyphs can represent data to your visual perception. For instance, color could be used to show inflation rate.

[Figure: the scatter plot with dot color showing inflation rate]

Another aesthetic is size. Here size reflects the infant mortality rate:

[Figure: the scatter plot with dot color showing inflation and dot size showing infant mortality]

Scales and Guides

There are four aesthetics in the graph above. Each of the four aesthetics is set in correspondence with a variable; we say the variable is mapped to the aesthetic. Migration is being mapped to horizontal position, unemployment to vertical position, inflation to color, and infant mortality to size.
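One way to picture this mapping is as a small lookup table from aesthetic to variable. The sketch below is only an illustration: the dictionary keys (`x`, `y`, `color`, `size`) and the helper `values_for_aesthetics` are names assumed here, and the column names follow the data table shown earlier.

```python
# The mapping described in the text: each aesthetic is paired with one variable.
mapping = {
    "x": "migr",            # horizontal position
    "y": "unemployment",    # vertical position
    "color": "inflation",
    "size": "infant",
}


def values_for_aesthetics(case, mapping):
    """Look up, for one case, the data value that each aesthetic will display."""
    return {aesthetic: case[variable] for aesthetic, variable in mapping.items()}


# One case from the data table.
mongolia = {"country": "Mongolia", "unemployment": 9.00, "inflation": 8.20,
            "infant": 23.15, "migr": 0.85}

print(values_for_aesthetics(mongolia, mapping))
# {'x': 0.85, 'y': 9.0, 'color': 8.2, 'size': 23.15}
```

The mapping only says which variable feeds which aesthetic. Turning 8.2% inflation into an actual color, or 23.15 into an actual dot size, is the job of a scale.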

A scale is the relationship between a variable and the aesthetic to which it is mapped. For unemployment, the scale says what value of the variable will correspond to position at the bottom of the frame, what value will correspond to the top of the frame, and where things fall in between.

Not all scales are about position. For instance, inflation is translated to color: black for inflation near zero, blue for inflation near 50%. Similarly, infant mortality is translated to size: the middle-sized dot corresponds to a yearly mortality of 60 per 1000.

Scales translate values into aesthetic properties. Guides help the human reader to do the back translation. For position aesthetics, the most common sort of guide is the familiar axis with its tick marks and labels. But notice also the guide that tells how dot color corresponds to inflation. There’s still another guide telling how dot size corresponds to infant mortality.
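The forward and backward translation can be sketched with a simple linear position scale. This is a minimal illustration, not the scale used for the graphs in these notes: the function `linear_scale`, the 0 to 100% domain, and the 400-pixel axis length are all assumptions.

```python
def linear_scale(domain, output_range):
    """Build a scale (value -> aesthetic) and its inverse (aesthetic -> value)."""
    d0, d1 = domain
    r0, r1 = output_range

    def scale(value):
        return r0 + (value - d0) / (d1 - d0) * (r1 - r0)

    def inverse(position):
        return d0 + (position - r0) / (r1 - r0) * (d1 - d0)

    return scale, inverse


# Assumed setup: unemployment from 0% to 100% drawn along an axis 400 pixels long.
to_pixels, to_percent = linear_scale(domain=(0, 100), output_range=(0, 400))

print(to_pixels(25))      # 100.0  (scale: value to position)
print(to_percent(100.0))  # 25.0   (back translation: position to value)

# An axis guide is a handful of labeled positions that let a reader
# perform the back translation by eye.
ticks = [(to_pixels(v), f"{v}%") for v in (0, 25, 50, 75, 100)]
print(ticks)
```

A color key or size key plays the same role for non-position aesthetics: it shows a few reference values next to the color or size they translate to.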

Scaling Sensibly

The chart above makes it difficult to draw any inferences about the relationship between migration and the other variables. Almost all the countries are in the same, small region. The dots overlap extensively; it’s hard to tell how many of the small dots there are or how they are spread out. There’s almost no information visible about inflation.

The scales used in the graph are the problem here. The scale that maps migration to position is concentrating the dots near 0. The scale that maps inflation to color is too concentrated: the viewer can’t distinguish between rates of 1% per year and, say, 10% per year, even though these may be very different economically. The infant mortality scale, with dot radius proportional to mortality, tends to focus attention on the very high-mortality countries.

Each of these difficulties can be addressed by an appropriate change of the scaling relationship between the variable and the graphical attribute. The following graph shows the same data, but on different scales:

[Figure: the same data drawn with different scales for position, color, and size]

With the appropriate scaling, the countries are spread out in an approachable way. You have a hope of making an inference about the relationship among these variables. For instance, use the graph to judge the validity of this hypothesis: “Low inflation countries tend to have more outward migration than high inflation countries.” The answer may not be obvious, but at least the question is approachable.
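These notes do not say which scales the improved graph uses. As one plausible illustration, the sketch below compares a linear position scale with a square-root scale on the six migration values from the data table. The axis maximum of 40 and the choice of a square-root transformation are assumptions; a logarithmic scale is another common option, though it cannot place a value of exactly 0.

```python
import math

migr = [0.00, 13.58, 1.23, 0.85, 1.76, 1.00]  # the six values shown in the table
AXIS_MAX = 40.0  # assumed upper end of the migration axis


def linear_position(value, top=AXIS_MAX):
    """Fraction of the way along the axis under a linear scale."""
    return value / top


def sqrt_position(value, top=AXIS_MAX):
    """Fraction of the way along the axis under a square-root scale."""
    return math.sqrt(value) / math.sqrt(top)


for v in sorted(migr):
    print(f"{v:6.2f}  linear={linear_position(v):.3f}  sqrt={sqrt_position(v):.3f}")
```

Under the linear scale the four small nonzero values sit within a few percent of the axis length of one another; the square-root scale gives them noticeably more room while keeping their order.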

Components of Data Graphics

You’ve seen four general components of data graphics:

  1. Frame: Specified by picking two variables.
  2. Glyph: A shape representing a single case. Variables set graphical attributes of the shape: size, color, shape, and so on. The location of the glyph — location is an important graphical attribute! — is set by the two variables defining the frame.
  3. Scale: The relationship between the value of a variable and the graphical attribute to be displayed for that value.
  4. Guide: An indication for the human viewer of the scale, that is, of how a variable is encoded into its graphical attribute. Common guides are x- and y-axis tick marks and color keys.

These concepts apply to data graphics generally. Here are some examples based on a data table derived from the NHANES data.

Recall that in NHANES each case is an individual person. Here’s a small part of the data table:

  age  sex     smoker  death                 weight    bmi  diabetic
81.00  female  no      alive                  48.70  23.72      0.00
41.00  female  no      alive                  52.70  20.59      0.00
65.00  male    no      cardiovascular death   81.20  28.91      0.00
28.00  male    no      alive                 125.40  35.86      0.00
18.00  female  no      alive                  52.80  25.25      0.00
20.00  female  yes     alive                  63.80  25.56      0.00

It’s certainly possible to graph the NHANES data table itself. There are 31126 cases in the data, so there are 31126 marks in the graph. Here’s a graph showing the relationship between age, smoking status, and mortality for all of the individual people in the NHANES data.

[Figure: age, smoking status, and mortality for each individual in the NHANES data]

In the following examples, the individual-by-individual data in NHANES is part of the backstory. The next graphs are made from tables in which each case is a group of people and the variables summarize survival for that group.

ageGroup   count  total  fracAlive
Twenties    2777   2793       0.99
Thirties    2520   2549       0.99
Forties     2347   2411       0.97
Fifties     1870   1975       0.95
Sixties     2066   2271       0.91
Seventies   1365   1764       0.77
Eighties     752   1277       0.59
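In this table, fracAlive is count divided by total (for example, 2777 / 2793 rounds to 0.99). The sketch below shows one way a table of individuals could be collapsed into a table of groups. It is an illustration under stated assumptions: the helper `summarize`, the decade-based `DECADES` lookup, and the decision to skip ages outside the listed groups are hypothetical, and the input is only the six individuals shown in the NHANES excerpt, not the full data.

```python
from collections import defaultdict

# Assumed rule for assigning an age group: the tens digit of the age.
DECADES = {2: "Twenties", 3: "Thirties", 4: "Forties", 5: "Fifties",
           6: "Sixties", 7: "Seventies", 8: "Eighties"}


def summarize(people):
    """Collapse individual cases into one case per age group."""
    tally = defaultdict(lambda: {"count": 0, "total": 0})
    for person in people:
        group = DECADES.get(int(person["age"]) // 10)
        if group is None:
            continue  # age outside the groups listed in the summary table
        tally[group]["total"] += 1
        if person["death"] == "alive":
            tally[group]["count"] += 1
    return {g: {**t, "fracAlive": t["count"] / t["total"]} for g, t in tally.items()}


# The six individuals from the NHANES excerpt (age and death columns only).
people = [
    {"age": 81.0, "death": "alive"},
    {"age": 41.0, "death": "alive"},
    {"age": 65.0, "death": "cardiovascular death"},
    {"age": 28.0, "death": "alive"},
    {"age": 18.0, "death": "alive"},
    {"age": 20.0, "death": "alive"},
]

summary = summarize(people)
print(summary["Twenties"])  # {'count': 2, 'total': 2, 'fracAlive': 1.0}

# Check against the first row of the summary table: 2777 alive out of 2793.
print(round(2777 / 2793, 2))  # 0.99
```

Each key of `summary` is one case of the new table, so a graph made from it has one glyph per age group rather than one per person.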

[Figure: fraction alive for each of the seven age groups]

Seven cases in the data table, seven glyphs in the graph. Here are the details:

Here’s another graph depicting survival rates. This one includes an additional variable describing whether the people in the group smoked. The data are:

smoker  ageGroup   count  total  fracAlive
yes     Twenties     774    779       0.99
yes     Thirties     695    708       0.98
yes     Forties      702    732       0.96
yes     Fifties      431    478       0.90
yes     Sixties      310    381       0.81
yes     Seventies    112    173       0.65
yes     Eighties      29     51       0.57
no      Twenties    1995   2006       0.99
no      Thirties    1811   1826       0.99
no      Forties     1618   1652       0.98
no      Fifties     1405   1460       0.96
no      Sixties     1706   1837       0.93
no      Seventies   1226   1553       0.79
no      Eighties     704   1195       0.59

There are 14 cases in the data table, so there will be 14 glyphs.

[Figure: fraction alive by age group, with bars colored by smoking status]

A graphical attribute of bar color has been added. The color is set by the “Smoking Status” variable. A guide to the relationship between Smoking Status and color has been added.
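A color scale for a categorical variable can be as simple as a lookup from each level to a color, and its guide is the same lookup presented to the reader. The sketch below illustrates this with the 14 cases from the table. The specific colors, the use of fracAlive as bar height, and the names `color_scale`, `bars`, and `legend` are assumptions for illustration only.

```python
# Discrete scale: each level of the smoker variable gets one bar color.
# The specific colors are arbitrary choices for this sketch.
color_scale = {"yes": "orange", "no": "steelblue"}

# (smoker, ageGroup, fracAlive) for the 14 cases in the table above.
cases = [
    ("yes", "Twenties", 0.99), ("yes", "Thirties", 0.98), ("yes", "Forties", 0.96),
    ("yes", "Fifties", 0.90), ("yes", "Sixties", 0.81), ("yes", "Seventies", 0.65),
    ("yes", "Eighties", 0.57),
    ("no", "Twenties", 0.99), ("no", "Thirties", 0.99), ("no", "Forties", 0.98),
    ("no", "Fifties", 0.96), ("no", "Sixties", 0.93), ("no", "Seventies", 0.79),
    ("no", "Eighties", 0.59),
]

# One bar per case; bar height is assumed to show fracAlive.
bars = [{"ageGroup": age, "height": frac, "color": color_scale[smoker]}
        for smoker, age, frac in cases]

# The guide (a color key) lists each level next to its color
# so that a reader can translate color back into smoking status.
legend = [(f"smoker: {level}", color) for level, color in color_scale.items()]

print(len(bars))  # 14
print(legend)     # [('smoker: yes', 'orange'), ('smoker: no', 'steelblue')]
```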



  1. Sometimes you will want to layer data graphs on top of one another, but each layer is a visualization of a single data table.

  2. On occasion, a case may have a missing value for one or more variables. Typically, such cases are omitted when making the graph.