Data graphics provide one of the most accessible, compelling, and expressive modes to investigate and depict patterns in data. Statistical models are another important mode, but understanding them typically takes considerable training.

The point of these notes is to

Common Kinds of Graphs

There are many, many different genres of data graphics, and many different variations on each genre. Here are some of the most commonly encountered kinds.

You are likely to be familiar with some or all of these genres.

Scatter plots

The main purpose of a scatter plot is to show the relationship between two variables across several or many cases. Most often, there is a Cartesian coordinate system in which the x-axis represents the value of one variable and the y-axis the value of a second variable.

Example: Consider the NHANES data giving medical and morphometric measurements of individual people. Here is a scatter plot showing the relationship between two variables: height and age.

[Figure: scatter plot of height versus age for the NHANES cases]

Each dot is one case. The position of that dot signifies the values of the two variables for that case.

Displays of Distribution

A histogram shows how many cases fall into given ranges of the variable. For instance, here’s a histogram of heights from NHANES:

[Figure: histogram of NHANES heights]

The most common height is about 1.65 m; that’s the location of the tallest bar. Only a handful of people are taller than 2.0 m.
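As a minimal sketch of what a histogram computes, the following R code counts how many cases fall into each range and finds the range with the most cases. The heights here are simulated, not the NHANES measurements, and the 0.05 m bin width is an arbitrary choice made for illustration.

```r
# Simulated heights in meters (an assumption for illustration; not the NHANES data)
set.seed(1)
heights <- rnorm(1000, mean = 1.65, sd = 0.10)

# Bin edges every 0.05 m, chosen to cover all of the simulated values
edges <- seq(floor(min(heights) * 20) / 20,
             ceiling(max(heights) * 20) / 20,
             by = 0.05)

# Count the cases in each bin without drawing anything
h <- hist(heights, breaks = edges, plot = FALSE)
bin_counts <- h$counts

# The tallest bar corresponds to the bin holding the most cases
tallest <- which.max(bin_counts)
most_common_range <- c(edges[tallest], edges[tallest + 1])
```

Calling `plot(h)` would draw the bars; the height of each bar is the corresponding entry of `bin_counts`.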

A simple alternative to a histogram is a frequency polygon. It’s like a histogram, but instead of drawing bars it connects the bar heights with line segments.

To illustrate, here’s a frequency polygon of NHANES height. To show the relationship of the polygon with the histogram, the histogram has been drawn in the background.

[Figure: frequency polygon of NHANES height, with the histogram drawn in the background]

Frequency polygons make it straightforward to include other variables, like sex:

[Figure: frequency polygons of NHANES height, one for each sex]

It’s also common to display the distribution of cases across two variables.

[Figure: distribution of cases across two variables]

This is much like a scatter plot, but without the distraction of the individual dots.

Bar Charts

The familiar bar chart is effective when the objective is to compare a few different quantities.

Example: From the NHANES data, how likely is a person to have died during the follow-up period, based on their age group?

[Figure: bar chart comparing age groups in the NHANES follow-up data]

It’s easy to compare bars to their neighbors and to discern the overall trend: survival goes down as age goes up.

There’s plenty of space taken up by the bars to include information about other variables.

For instance, here’s a comparison between smokers and non-smokers:

[Figure: bar chart by age group comparing smokers and non-smokers]

At each age, smokers were less likely than non-smokers to have survived to the end of the study’s follow-up period.

Maps

Using a map to plot data helps both to identify particular cases and to show spatial patterns and discrepancies.

mosaic::mWorldMap( CountryData, 
                   key="country", 
                   fill="growth" )

[Figure: world map with countries filled according to population growth rate]

From this map of population growth (in % per year), you can see the handful of countries with negative growth (that is, population decline) such as Russia and Poland. The countries in Africa, with the exception of South Africa and Namibia, have fast population growth. (A rate of 5% per year means that the population will double every 14 years.) There is just one country with a large decline: Syria, just to the right of the center of the map.
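The doubling time quoted in parentheses follows from compound growth: a population growing at a constant rate r per year doubles after log(2) / log(1 + r) years. A quick check in R, assuming the growth rate stays fixed at 5% (the variable names are just for this illustration):

```r
# Annual growth rate of 5%, expressed as a proportion
rate <- 0.05

# Years needed for the population to double under constant compound growth
doubling_time <- log(2) / log(1 + rate)

# Population relative to its starting size after 14 years of 5% growth
growth_after_14 <- (1 + rate)^14
```

Here `doubling_time` comes out a little over 14 years, and `growth_after_14` is just under 2, which is consistent with the rounded figure in the text.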

Networks

A network is a set of connections, called edges, between elements, called vertices. A vertex corresponds to a case. The network describes which vertices are connected to other vertices.

The NCI60 data set (in the DCF package) is about the genetics of cancer. The data set looks at more than 40,000 probes for the expression of genes, in each of 60 cancer cell lines. In the network here, a vertex is a given cell line: a sample of cells from one person with a particular kind of cancer.

To measure the connections between cell lines, the correlation coefficient for probe expression in every pair of cell lines was calculated. Each pair’s correlation coefficient indicates the strength of connection between the two cell lines in that pair. Here, the 200 largest correlations are shown.
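The edge-selection step described above can be sketched in base R. The expression matrix below is random stand-in data with hypothetical names (`expr`, `n_edges`), not the NCI60 probes, and the sketch keeps the 5 largest correlations rather than 200. It is not necessarily how the network in these notes was built.

```r
# Stand-in expression data: 100 probes (rows) by 6 cell lines (columns).
# These are random numbers for illustration, not the NCI60 measurements.
set.seed(42)
expr <- matrix(rnorm(100 * 6), nrow = 100,
               dimnames = list(NULL, paste0("line", 1:6)))

# Correlation coefficient for every pair of cell lines (columns)
cors <- cor(expr)

# Keep each unordered pair once by looking only above the diagonal
pairs <- which(upper.tri(cors), arr.ind = TRUE)
edges <- data.frame(
  from = colnames(cors)[pairs[, "row"]],
  to   = colnames(cors)[pairs[, "col"]],
  correlation = cors[pairs]
)

# Retain only the strongest connections as the edges of the network
n_edges <- 5
edges <- edges[order(edges$correlation, decreasing = TRUE), ]
top_edges <- head(edges, n_edges)
```

Each row of `top_edges` names two vertices and the strength of the connection between them; a graph-drawing package could then lay these out as a network.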

[Figure: network of cancer cell lines connected by the 200 largest correlations]

Every vertex, that is, every cell line, is depicted as a dot. The dot’s color and label give the type of cancer involved. These are Ovarian, Colon, Central Nervous System, Melanoma, Renal, Breast, and Lung cancers.

The network shows that the melanoma cell lines (ME) are closely related to each other and not so much to other cell lines. The same is true for colon cancer cell lines (CO) and for central nervous system (CN) cell lines.
