Although graphics are often decorative, the point of data graphics is to convey information, not to decorate.

These notes describe some general principles of informative data graphics, identify some common obstacles to creating informative graphics, and suggest ways to overcome them.

Principles

Compare: with clouds of points, mark specially the quantities you want compared. Dynamite plot.

Contrast: are the red dots higher up than the blue dots?

Relate: do two quantities go along with one another, e.g. correlation, curve, …? Draw in a smoother.

Play to Perceptual Strengths

You’re communicating with a human, so you have to work around the weaknesses of human perception and play to its strengths.

Relative differences are easier to perceive than absolute ones.
Nearby objects are easier to compare.
Objects in the same context are easier to compare.

Order Matters

Numbers have a natural order, but categories do not.

You can set the display order of a categorical variable by another variable’s value. See the Medicare example in GraphicsCommands.Rmd.
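As a minimal sketch of ordering categories by another variable's value, base R's reorder() can reset the factor levels before plotting. The charges data frame and its columns are made up for illustration; they are not the Medicare data.

```r
library(ggplot2)

# Hypothetical data: four categories with one numeric value each
charges <- data.frame(
  condition   = c("A", "B", "C", "D"),
  mean_charge = c(30, 10, 40, 20)
)

# By default a character variable is displayed in alphabetical order
default_order <- levels(factor(charges$condition))

# reorder() sets the factor levels by the value of a second variable
charges$condition <- reorder(charges$condition, charges$mean_charge)
value_order <- levels(charges$condition)

# The x axis now runs from the smallest mean_charge to the largest
p_ordered <- ggplot(charges, aes(x = condition, y = mean_charge)) +
  geom_point()
p_ordered
```

With the levels reset, any plot of condition uses the new order without further arguments.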

Order in Colors

Colors that you use to compare quantities should appear ordered.

Perhaps the most widely used ordering scheme for colors is the rainbow, as described by the mnemonic ROYGBIV. This order reflects the wavelength of light: short wavelength for violet, long wavelength for red.

Human perception, however, works differently. The rainbow colors are not perceived as strongly ordered, which is perhaps why we need the mnemonic ROYGBIV to relate the colors to the physical quantity of wavelength. To illustrate, glance at the following set of squares and see if you immediately know which color is at one extreme and which color is at the other extreme.

[Figure: a set of rainbow-colored squares]

Now check your intuition. Here are the same colored squares but with letters added to indicate the intended order:

[Figure: the same colored squares, with letters indicating the intended order]

A good way to think about color (for the purposes of selecting an ordered palette) is as three separate scales: hue, saturation, and value. Many people find it intuitive to discern an order in saturation and value, but hue is unordered. To illustrate, here are three palettes along each of the scales.

Hue

[Figure: palette varying in hue]

Saturation

[Figure: palette varying in saturation]

Value

[Figure: palette varying in value]
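Palettes along each of the three scales can be built with base R's hsv() function. This is only a sketch: the number of squares, the fixed hue of 0.6, and the helper show_palette() are choices made here for illustration, not the settings used for the figures above.

```r
n <- 7

# Vary one of hue, saturation, and value at a time; hold the other two fixed
hue_pal <- hsv(h = seq(0, 0.85, length.out = n), s = 1, v = 1)
sat_pal <- hsv(h = 0.6, s = seq(0.1, 1, length.out = n), v = 1)
val_pal <- hsv(h = 0.6, s = 1, v = seq(0.2, 1, length.out = n))

# Hypothetical helper: draw a palette as a row of colored blocks
show_palette <- function(pal, title) {
  barplot(rep(1, length(pal)), col = pal, border = NA, space = 0,
          axes = FALSE, main = title)
}

show_palette(hue_pal, "Hue")
show_palette(sat_pal, "Saturation")
show_palette(val_pal, "Value")

# Shuffling a palette, e.g. sample(sat_pal), gives a quick self-test of
# whether the order can be recovered by eye
```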

A Quiz on Color Order

Here are three palettes, one based on hue, one on saturation, one on value. The color squares have been shuffled differently in each palette. For each palette, try to pick out the sequence of squares that would put them in color order.

[Figure: shuffled palette] Saturation

[Figure: shuffled palette] Hue

[Figure: shuffled palette] Value

The middle palette varies hue while keeping saturation and value constant. Was it easier to see order in hue, or in saturation and value?

The Answers

[Figures: the three quiz palettes with the squares in their intended order]

Example Palettes

The following are palettes from ColorBrewer2.org. There are three major categories: sequential, diverging, and categorical.¹ The most important choice is that of the category. Within a category, selection of a palette is largely a matter of taste. Given that about 8% of males and 0.5% of females in the US have some form of color blindness, it can be important to choose a palette that is well perceived by the colorblind.² Other potentially important issues include whether the palette reproduces well in print, on an LCD screen, or when photocopied. The site gives information about all of these qualities.

These palettes are readily available in R³.

Sequential Palettes

When you want to use colors to represent numerical values, or ordered categorical values, a sequential palette is a good choice, particularly those palettes that keep hue constant.

[Figure: ColorBrewer sequential palettes, labeled by name]
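As a sketch of a sequential palette applied to a numerical variable, ggplot2's scale_fill_distiller() interpolates a ColorBrewer palette for continuous values. The choice of the faithfuld data (which ships with ggplot2) and of the "Blues" palette is an assumption made for illustration.

```r
library(ggplot2)

# faithfuld: a density estimate on a grid of waiting time and
# eruption duration for the Old Faithful geyser
p_seq <- ggplot(faithfuld, aes(x = waiting, y = eruptions, fill = density)) +
  geom_raster() +
  # direction = 1 maps low values to the light end of the palette
  scale_fill_distiller(palette = "Blues", direction = 1)
p_seq
```

Because the palette stays within a single hue, the eye reads darker as "more" without needing to consult the legend.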

Diverging Palettes

Sometimes you will want to mark clearly an important “central” value, with cases falling on either side of it. For instance, zero is an appropriate central value when comparing positive and negative values. The mean is often an appropriate central value as well.

These are diverging color palettes⁴. In each, a neutral color, e.g. white, is at the center. The name given can be used directly in scale_color_brewer() as the palette= argument.

[Figure: ColorBrewer diverging palettes, labeled by name]
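Here is a sketch of passing a diverging palette name to scale_color_brewer() through the palette= argument. The example uses the built-in mtcars data, takes the mean as the central value, and bins the deviations with hand-picked cut points; all of these are illustrative assumptions.

```r
library(ggplot2)

car_data <- mtcars

# Deviation of each car's mpg from the mean, so that zero is the central value
car_data$dev <- car_data$mpg - mean(car_data$mpg)

# Bin the deviations into ordered categories placed symmetrically around zero
car_data$dev_bin <- cut(car_data$dev, breaks = c(-Inf, -6, -2, 2, 6, Inf))

p_div <- ggplot(car_data, aes(x = wt, y = mpg, color = dev_bin)) +
  geom_point(size = 3) +
  # "RdBu" has a neutral color at its center, matching the middle bin
  scale_color_brewer(palette = "RdBu")
p_div
```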

Palettes for contrasting

When using color to represent an unordered, categorical variable, avoid palettes that suggest an order. Since hue is perceived as unordered, palettes with a range of hues but constant luminance and chrominance are appropriate.

[Figure: ColorBrewer categorical palettes, labeled by name]
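A palette with a range of hues but constant luminance and chroma can be sketched with base R's hcl() function. The particular chroma and luminance values and the number of colors are arbitrary choices made for illustration.

```r
n <- 6

# Evenly spaced hue angles around the color wheel; drop the last one
# because 375 degrees is the same hue as 15 degrees
hues <- seq(15, 375, length.out = n + 1)[1:n]

# Constant chroma (c) and luminance (l): only hue distinguishes the colors
qual_pal <- hcl(h = hues, c = 100, l = 65)

barplot(rep(1, n), col = qual_pal, border = NA, space = 0, axes = FALSE,
        main = "Varying hue, constant chroma and luminance")
```

None of the colors looks darker or more intense than its neighbors, so the palette does not suggest an order.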

Contrast vs Compare

Support Inference

Problems

Too Wide a Spread?

Log axes

Discretization, e.g. ntiles or logarithmic cut points.
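Both kinds of discretization can be sketched with base R's cut(). The simulated, widely spread variable x stands in for real data, and the choice of quartiles and of powers of 10 as cut points is an assumption made for illustration.

```r
set.seed(1)

# Simulated quantity spread over several orders of magnitude
x <- rlnorm(1000, meanlog = 5, sdlog = 2)

# ntiles: cut at quantiles so each group holds the same number of cases
q_breaks <- quantile(x, probs = seq(0, 1, by = 0.25))
quartile <- cut(x, breaks = q_breaks, include.lowest = TRUE,
                labels = paste0("Q", 1:4))

# Logarithmic cut points: one group per power of 10
log_breaks <- 10^seq(floor(log10(min(x))), ceiling(log10(max(x))))
decade <- cut(x, breaks = log_breaks, include.lowest = TRUE)

table(quartile)
table(decade)
```

Quantile groups have equal counts but unequal widths; the logarithmic groups have equal widths on a log scale but unequal counts.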

Too Many Cases?

transparency

density

Too Many Variables?

Too Much Data?

Redo this plot with the entire NHANES data: plot of chunk unnamed-chunk-18

Use this to talk about overplotting and alpha=.
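As a sketch of the two remedies for overplotting, transparency and density, the following uses simulated points rather than the NHANES data. The sample size, the alpha value, and the number of bins are arbitrary choices.

```r
library(ggplot2)
set.seed(42)

# Simulated data with enough cases that opaque points pile up
n <- 20000
pts <- data.frame(x = rnorm(n))
pts$y <- pts$x + rnorm(n)

# Transparency: each point is faint, so dark regions mean many cases
p_alpha <- ggplot(pts, aes(x = x, y = y)) +
  geom_point(alpha = 0.05)

# Density: count the cases in each cell of a grid and show the counts as color
p_density <- ggplot(pts, aes(x = x, y = y)) +
  geom_bin2d(bins = 50)

p_alpha
p_density
```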

Log Axes

Charges for different DRGs in the MedicareCharges data, as in GraphicsCommands.Rmd. Use log charges to show proportional differences.
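A sketch of a log axis, using simulated charges rather than the MedicareCharges data. The group labels and typical charge levels are made up; each group's typical charge is four times the previous one.

```r
library(ggplot2)
set.seed(7)

# Hypothetical charges for three groups, each about 4 times the one before
typical <- c(A = 5000, B = 20000, C = 80000)
sim_charges <- data.frame(
  drg    = rep(names(typical), each = 50),
  charge = unlist(lapply(typical, function(m) rlnorm(50, log(m), 0.4)))
)

# On a log scale, equal ratios correspond to equal distances
gap_ab <- log10(typical["B"]) - log10(typical["A"])
gap_bc <- log10(typical["C"]) - log10(typical["B"])

p_log <- ggplot(sim_charges, aes(x = drg, y = charge)) +
  geom_boxplot() +
  scale_y_log10()
p_log
```

On a linear axis the gap between B and C would look four times as large as the gap between A and B, even though the proportional difference is the same.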

Show the data!

Dynamite plots: why they are bad and how you can do better. Bars are good when you want to emphasize positive versus negative. Including 0 in the plot is enough to show position.

Jitter and alpha

See the weight example.
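As a sketch of showing the data rather than only a summary, the following jitters the individual points and overlays the mean with an interval of two standard errors. The data are simulated and the group names are hypothetical; this is not the weight example itself.

```r
library(ggplot2)
set.seed(3)

# Hypothetical data: weight (kg) in two groups
people <- data.frame(
  group  = rep(c("control", "treated"), each = 200),
  weight = c(rnorm(200, mean = 75, sd = 12), rnorm(200, mean = 80, sd = 12))
)

p_show <- ggplot(people, aes(x = group, y = weight)) +
  # Jitter sideways only, so the weights themselves are not distorted
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  # Mean with an interval of +/- 2 standard errors, drawn over the points
  stat_summary(fun.data = mean_se, fun.args = list(mult = 2), color = "red")
p_show
```

The summary is still there, but the reader can also see the spread and overlap of the two groups, which a bar with an error bar hides.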

Continuous or discrete

Conversion of quantitative to categorical.

Example: Blood pressure and diabetes

The NHANES data give … for 31000 people. Suppose you are interested in the question: does the waist-to-height ratio differ between diabetics and non-diabetics?

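library(dplyr); library(ggplot2)
# Mean waist-to-height ratio and its standard error, by sex and diabetes status
foo <- NHANES %>%
  group_by(sex, diabetic) %>%
  summarise(myIndex=mean(waist/height, na.rm=TRUE), n=n(),
            se=sd(waist/height, na.rm=TRUE)/sqrt(n))
# Dynamite plot: bars for the means, error bars at +/- 2 standard errors
ggplot(foo, aes(x=sex, group=diabetic, color=diabetic, y=myIndex, width=.5)) +
  geom_bar(alpha=.3, aes(fill=diabetic), position='dodge', stat='identity') +
  geom_errorbar(aes(ymax=myIndex+2*se, ymin=myIndex-2*se), position='dodge', weight=3, size=2) +
  ylim(0,80)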

[Figure: bar chart of mean waist-to-height ratio by sex and diabetes status, with error bars]

Perhaps a better plot would be:

NHANES <- mutate(NHANES, myIndex = waist/height)  # per-person index, as defined above
ggplot(NHANES, aes(x=sex, y=myIndex, fill=diabetic)) + geom_boxplot(notch=TRUE)

NB. For some reason, geom_boxplot() doesn’t use group. But color works.

MAYBE ALWAYS DO A SELECT to highlight the variables being used.

Exercises

What could you improve about this graphic if your point is to show that pulse pressure differs between diabetics and non-diabetics? Hint: Missing data is shown in gray, even though it isn’t in the legend.

# Summarise the waist-to-height ratio by sex and diabetes status
foo <- NHANES %>%
  group_by(sex, diabetic) %>%
  summarise(mn=mean(waist/height, na.rm=TRUE), n=n(),
            se=sd(waist/height, na.rm=TRUE)/sqrt(n))
ggplot(foo, aes(group=sex, x=diabetic, y=mn, width=.5)) +
  geom_bar(alpha=.3, aes(fill=sex), position='dodge', stat='identity') +
  geom_errorbar(aes(ymax=mn+2*se, ymin=mn-2*se), position='dodge', weight=3, size=2) +
  ylim(0,80) + ylab("Waist to Height Ratio (cm/m)") +
  xlab("Diabetes status")

Churchill War Rooms

Tons of bombs in World War II. http://www.peacockworks.com/wp-content/uploads/2012/12/Bombs-in-Weight-Dropped-in-WW2.jpg

Bombs dropped

Disposition of German submarines at end of war.

Stacked bar charts Source

U-boats Source

Flying bomb stats. Source: DTK July 2014

flying bombs

FOR GRAPHICAL DECISIONS

Redo the NHANES mortality as a logistic regression.

Show fraction alive as a function of sex and smoking status.

Show proportion in each age of the height ntiles. Stack them.

Some color palettes: http://learnr.wordpress.com/2009/04/15/ggplot2-qualitative-colour-palettes/

Tuning up graphics

Labels with xlab() and ylab()
Setting the x-y scale by hand.
Removing extraneous factors: use droplevels() to drop unpopulated levels
Setting the overall style, e.g. Economist, NYT
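Several of these adjustments can be sketched together on the built-in iris data. The axis labels, axis limits, and the use of theme_bw() as the overall style are arbitrary choices made for illustration.

```r
library(ggplot2)

# Keep two of the three species; the unused level is still attached
iris_subset <- subset(iris, Species != "setosa")
levels_before <- nlevels(iris_subset$Species)

# droplevels() removes levels that no longer have any cases
iris_subset <- droplevels(iris_subset)
levels_after <- nlevels(iris_subset$Species)

p_tuned <- ggplot(iris_subset,
                  aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  xlab("Sepal length (cm)") +   # axis labels
  ylab("Petal length (cm)") +
  xlim(4, 8) + ylim(0, 8) +     # x-y scale set by hand
  theme_bw()                    # overall style
p_tuned
```

Note that xlim() and ylim() discard any cases that fall outside the limits, so choose limits that contain all of the data you mean to show.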

Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.

¹ The term “qualitative” is sometimes used instead of “categorical.”
² In terms of the names shown below for palettes, the colorblind-suitable palettes are these. Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdYlBu. Categorical: Dark2, Paired, Set2. Sequential: any.
³ ggplot2 knows about these palettes by name. If you’re not using ggplot2 graphics, see the RColorBrewer package.
⁴ From and the RColorBrewer package.