Although graphics are often decorative, the point of data graphics is to convey information, not to decorate.
These notes describe some general principles of informative data graphics, identify some common obstacles in creating informative graphics, and suggest ways to overcome them.
Compare: with clouds of points, mark specially the quantities you want compared. Dynamite plot.
Contrast: are the red dots higher up than the blue dots?
Relate: do two quantities go along with one another (e.g. correlation, curve, …)? Draw in a smoother.
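As an illustration of the "relate" task, here is a minimal sketch of drawing a smoother through a cloud of points with `geom_smooth()`. The data frame `d` and its columns are made up for the example; they do not come from any data set used in these notes.

```r
library(ggplot2)

# Made-up data: y follows a curve in x, plus noise
set.seed(10)
d <- data.frame(x = runif(200, min = 0, max = 10))
d$y <- sin(d$x) + rnorm(200, sd = 0.4)

# The points show the raw data; the loess smoother shows the trend
p_smooth <- ggplot(d, aes(x = x, y = y)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess")
p_smooth
```

The smoother makes the relationship visible even when the individual points are noisy.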
You’re communicating with a human, so you have to work around the weaknesses of human perception and play to its strengths.
Numbers have a natural order, but categories do not.
You can set the display order of a categorical variable by another variable’s value. See the Medicare example in GraphicsCommands.Rmd.
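One way to order a categorical variable by another variable in R is `reorder()`. The sketch below uses a small made-up data frame; the names `charges`, `drg`, and `payment` are hypothetical and are not the Medicare example from GraphicsCommands.Rmd.

```r
library(ggplot2)

# Made-up data: one row per category, with a numeric value to sort by
charges <- data.frame(
  drg     = c("A", "B", "C", "D"),
  payment = c(5200, 18400, 950, 7600)
)

# reorder() returns a factor whose levels are sorted by the second argument,
# so the bars appear in order of payment rather than alphabetically
charges$drg_ordered <- reorder(charges$drg, charges$payment)

p_ordered <- ggplot(charges, aes(x = drg_ordered, y = payment)) +
  geom_bar(stat = "identity") +
  xlab("DRG (ordered by payment)")
p_ordered
```

Because the order is stored in the factor levels, every later plot of `drg_ordered` uses the same ordering.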
Colors that you use to compare quantities should appear ordered.
Perhaps the most widely used ordering scheme for colors is the rainbow, as described by the mnemonic ROYGBIV. This order reflects the wavelength of light: short wavelength for violet, long wavelength for red.
Human perception, however, works differently. The rainbow colors are not perceived as strongly ordered, which is perhaps why we need the mnemonic ROYGBIV to relate the color to the physical quantity of wavelength. To illustrate, glance at the following set of squares and see if you immediately know which color is at one extreme and which color is at the other extreme.
Now check your intuition. Here are the same colored squares but with letters added to indicate the intended order:
A good way to think about color (for the purposes of selecting an ordered palette) is as three separate scales: hue, saturation, and value. Many people find it intuitive to discern an order in saturation and value, but hue is unordered. To illustrate, here are three palettes along each of the scales.
#### Value
Here are three palettes, one based on hue, one on saturation, one on value. The color squares have been shuffled differently in each palette. For each palette, try to pick out the sequence of squares that would put them in color order.
[Figure: three shuffled palettes, in order: Saturation, Hue, Value]
The middle palette varies hue while keeping saturation and value constant. Was it easier to see order in hue, or in saturation and value?
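If you want to try this exercise yourself, the following sketch builds three palettes with base R's `hsv()`, each varying one coordinate while holding the other two fixed. The particular hue, saturation, and value settings are arbitrary choices for illustration, not the ones used for the figures in these notes.

```r
# Three five-color palettes, each varying one HSV coordinate
n <- 5
hue_palette        <- hsv(h = seq(0, 0.8, length.out = n), s = 1, v = 1)
saturation_palette <- hsv(h = 0.6, s = seq(0.2, 1, length.out = n), v = 1)
value_palette      <- hsv(h = 0.6, s = 1, v = seq(0.2, 1, length.out = n))

palettes <- list(Hue = hue_palette,
                 Saturation = saturation_palette,
                 Value = value_palette)

# Draw each palette as a row of squares in shuffled order
set.seed(1)
op <- par(mfrow = c(3, 1), mar = c(1, 1, 2, 1))
for (name in names(palettes)) {
  shuffled <- sample(palettes[[name]])
  barplot(rep(1, n), col = shuffled, axes = FALSE, space = 0, main = name)
}
par(op)
```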
The following are palettes from ColorBrewer2.org. There are three major categories: sequential, diverging, and categorical.[1] The most important choice is that of the category. Within a category, selection of a palette is largely a matter of taste. Given that about 8% of males and 0.5% of females in the US have some form of color blindness, it can be important to choose a palette that is well perceived by the colorblind.[2] Other potentially important issues include whether the palette reproduces well in print, on an LCD screen, or when photocopied.
These palettes are readily available in R.[3]
When you want to use colors to represent numerical values, or ordered categorical values, a sequential palette is a good choice, particularly those palettes that keep hue constant.
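A minimal sketch of a sequential palette applied to a numerical variable, using `scale_color_distiller()` from `ggplot2`. The data frame `pts` is made up for the example, and "Blues" is just one of the single-hue ColorBrewer palettes.

```r
library(ggplot2)

# Made-up data: z is the numerical value to be shown by color
set.seed(42)
pts <- data.frame(x = runif(200), y = runif(200))
pts$z <- pts$x + pts$y

# direction = 1 maps small values to light blue and large values to dark blue
p_seq <- ggplot(pts, aes(x = x, y = y, color = z)) +
  geom_point(size = 3) +
  scale_color_distiller(palette = "Blues", direction = 1)
p_seq
```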
Sometimes you will want to indicate clearly an important “central” value, with cases falling on either side of it. For instance, zero is an appropriate central value when comparing positive and negative values. The mean is often appropriate to use as the central value.
These are diverging color palettes.[4] In each, a neutral color, e.g. white, is at the center. The name given can be used directly in `scale_color_brewer()` as the `palette=` argument.
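Here is a sketch of passing a diverging palette name to `scale_color_brewer()`. Since that scale expects a categorical variable, the example first bins a made-up numerical variable (`change`) into an odd number of intervals symmetric about zero, so the neutral middle color lands on the central value. The data and the cut points are assumptions chosen for illustration.

```r
library(ggplot2)

# Made-up data: changes that can be positive or negative
set.seed(7)
d <- data.frame(x = 1:60, change = rnorm(60))

# Five bins, symmetric about zero; the middle bin gets the neutral color
d$bin <- cut(d$change, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf))

# drop = FALSE keeps all five bins in the legend even if one is empty
p_div <- ggplot(d, aes(x = x, y = change, color = bin)) +
  geom_point(size = 3) +
  scale_color_brewer(palette = "RdBu", drop = FALSE)
p_div
```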
When using color to represent an unordered, categorical variable, avoid palettes that suggest an order. Since hue is perceived as unordered, palettes with a range of hues but constant luminance and chrominance are appropriate.
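To see what such a palette looks like, this sketch uses base R's `hcl()` to generate colors that differ only in hue. The chroma and luminance settings (100 and 65) and the starting hue are arbitrary choices for illustration.

```r
# Six hues evenly spaced around the color wheel,
# with chroma (c) and luminance (l) held fixed
n <- 6
hues <- seq(15, 375, length.out = n + 1)[1:n]
categorical_palette <- hcl(h = hues, c = 100, l = 65)

# Show the palette as a row of squares
barplot(rep(1, n), col = categorical_palette, axes = FALSE, space = 0)
```

None of the squares should stand out as "more" or "less" than its neighbors, which is what you want for an unordered variable.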
Log axes
discretization, e.g. ntiles logarithmic cut points.
transparency
density
Redo this plot with the entire NHANES data:
Use this to talk about overplotting and `alpha=`.
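A sketch of the effect of `alpha=` on overplotting. The data here are synthetic stand-ins generated with `rnorm()`, not the NHANES data, and the value `alpha = 0.05` is just a starting point to adjust by eye.

```r
library(ggplot2)

# Synthetic stand-in for a large data set (not the NHANES data)
set.seed(123)
n <- 20000
big <- data.frame(height = rnorm(n, mean = 170, sd = 10))
big$weight <- 0.9 * big$height - 80 + rnorm(n, mean = 0, sd = 12)

# Opaque points: the middle of the cloud is a solid blob
p_opaque <- ggplot(big, aes(x = height, y = weight)) +
  geom_point()

# Transparent points: darkness now reflects how many points overlap
p_alpha <- ggplot(big, aes(x = height, y = weight)) +
  geom_point(alpha = 0.05)

p_opaque
p_alpha
```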
Charges for different DRGs in the MedicareCharges data, as in GraphicsCommands.Rmd. Use log charges to show proportional differences.
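To illustrate why a log scale shows proportional differences, here is a sketch with made-up charges (the data frame `charge_data` is hypothetical, not the MedicareCharges data). A doubling takes up the same vertical distance wherever it occurs on the axis.

```r
library(ggplot2)

# Made-up charges: B is double A, and D is double C
charge_data <- data.frame(
  drg    = c("A", "B", "C", "D"),
  charge = c(1000, 2000, 20000, 40000)
)

# scale_y_log10() puts the y axis on a logarithmic scale
p_log <- ggplot(charge_data, aes(x = drg, y = charge)) +
  geom_point(size = 3) +
  scale_y_log10()
p_log

# Equal ratios become equal distances on the log scale
step_low  <- log10(2000) - log10(1000)
step_high <- log10(40000) - log10(20000)
all.equal(step_low, step_high)
```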
See the weight example.
Conversion of quantitative to categorical.
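One common way to convert a quantitative variable to a categorical one is to cut it at its quantiles. The sketch below does this with base R's `cut()` and `quantile()` on a made-up `height` vector; the choice of four groups and the labels are assumptions for the example.

```r
# Made-up quantitative variable
set.seed(2)
height <- rnorm(1000, mean = 170, sd = 10)

# Cut points at the quartiles, so each group holds about a quarter of the cases
breaks <- quantile(height, probs = seq(0, 1, by = 0.25))

# include.lowest = TRUE keeps the minimum value in the first group
height_group <- cut(height, breaks = breaks, include.lowest = TRUE,
                    labels = c("Q1", "Q2", "Q3", "Q4"))

table(height_group)
```

The result is an ordered set of categories, so a sequential palette is a natural way to color it.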
The NHANES data gives … for 31000 people. Suppose you are interested in the question: does waist-to-height ratio differ between diabetics and non-diabetics?
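
    library(dplyr)
    library(ggplot2)

    # Mean waist-to-height index and its standard error for each sex/diabetic group
    foo <- NHANES %>%
      group_by(sex, diabetic) %>%
      summarise(myIndex = mean(waist / height, na.rm = TRUE),
                n = n(),
                se = sd(waist / height, na.rm = TRUE) / sqrt(n))

    # Bars for the group means, with error bars at plus or minus 2 standard errors
    ggplot(foo, aes(x = sex, y = myIndex, group = diabetic, color = diabetic)) +
      geom_bar(aes(fill = diabetic), alpha = .3, width = .5,
               position = 'dodge', stat = 'identity') +
      geom_errorbar(aes(ymax = myIndex + 2 * se, ymin = myIndex - 2 * se),
                    width = .5, position = 'dodge', size = 2) +
      ylim(0, 80)
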
Perhaps a better plot would be:

    ggplot(NHANES, aes(x = sex, y = myIndex, fill = diabetic)) +
      geom_boxplot(notch = TRUE)

NB. For some reason, `geom_boxplot()` doesn’t use `group`. But `color` works.
MAYBE ALWAYS DO A SELECT to highlight the variables being used.
What could you improve about this graphic if your point is to show that pulse pressure differs between diabetics and non-diabetics? Hint: Missing data is shown in gray, even though it isn’t in the legend.
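
    # Group means and standard errors of myIndex by sex and diabetes status
    foo <- NHANES %>%
      group_by(sex, diabetic) %>%
      summarise(mn = mean(myIndex, na.rm = TRUE),
                n = n(),
                se = sd(myIndex, na.rm = TRUE) / sqrt(n))

    # Bars for the group means, with error bars at plus or minus 2 standard errors
    ggplot(foo, aes(x = diabetic, y = mn, group = sex)) +
      geom_bar(aes(fill = sex), alpha = .3, width = .5,
               position = 'dodge', stat = 'identity') +
      geom_errorbar(aes(ymax = mn + 2 * se, ymin = mn - 2 * se),
                    width = .5, position = 'dodge', size = 2) +
      ylim(0, 80) +
      ylab("Waist to Height Ratio (cm/m)") +
      xlab("Diabetes status")
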
Redo the NHANES mortality as a logistic regression.
Show fraction alive as a function of sex and smoking status.
Show proportion in each age of the height ntiles. Stack them.
[Some color palettes](http://learnr.wordpress.com/2009/04/15/ggplot2-qualitative-colour-palettes/)
`xlab()` and `ylab()`
`droplevels()` to drop unpopulated levels
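A small sketch of what `droplevels()` does, using a made-up factor named `status`. After filtering, a factor still remembers levels that no longer occur, and those empty levels can show up in legends and on axes.

```r
# Made-up factor with three levels
status <- factor(c("yes", "no", "yes", "unknown"),
                 levels = c("no", "yes", "unknown"))

# Filtering removes the "unknown" case but not the "unknown" level
known <- status[status != "unknown"]
levels(known)

# droplevels() removes levels that have no cases
known <- droplevels(known)
levels(known)
```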
Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.
[1] The term “qualitative” is sometimes used instead of “categorical.”
[2] In terms of the names shown below for palettes, the colorblind-suitable palettes are these: Diverging: BrBG, PiYG, PRGn, PuOr, RdBu, RdYlBu; Categorical: Dark2, Paired, Set2; Sequential: any.
[3] `ggplot2` knows about these palettes by name. If you’re not using `ggplot2` graphics, see the `RColorBrewer` package.
[4] From