Source file ⇒ TidyGapMinder.Rmd
Gapminder makes available a number of data sets at http://www.gapminder.org/data/.
Links there will lead you to publicly viewable google spreadsheets.
In the example below, we will use the first one in the list, which provides data for many countries about the percentage of the population (age 14-49) that had HIV over a number of years. We will use this data to make a plot something like this one
You can view this data here.
If you look at it, you will see that we have some work ahead of us to make this usable (i.e. glyph ready) for our plot. But first let’s import the data, then we will figure out how to get it glyph ready.
We can use the googlesheets
package makes it possible to grab the data directly from google. Make sure you have googlesheets
installed.
First we establish a connection to the google spreadsheet.
library(googlesheets)
google_connection <-
gs_url("https://docs.google.com/spreadsheets/d/1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA/pub?gid=0", visibility="public")
## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
## Putative key: 1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA
## Authentication will not be used.
## Worksheets feed constructed with public visibility
Then we import the data from the first worksheet at this connection. (You can specify other spreadsheets if you don’t want the first one, but the first one is the default.)
HIVdata <- gs_read(google_connection)
## Accessing worksheet titled "Data"
## Error in gsheets_GET(this_ws$exportcsv, to_xml = FALSE, use_auth = !ss$is_public): Not expecting content-type to be:
## text/html; charset=UTF-8
Unfortunately, this particular spreadsheet doesn’t read in quite right. But is we specify the range of cells that contain data, then it works. The connection object reports a bunch of information, including the number of rows and columns in the worksheets.
google_connection$ws$row_extent
## [1] 276 24 20 20 20 20
google_connection$ws$col_extent
## [1] 34 6 6 6 6 6
We can use this information to select a range, and then things work.
HIVdata <-
gs_read(google_connection,
range = cell_limits(c(1,1), c(276, 34))
)
## Accessing worksheet titled "Data"
dim(HIVdata)
## [1] 275 34
If you have trouble getting the data from google, you can load it from this CSV file instead.
HIVdata <- read.csv("HIV.csv", header = TRUE, stringsAsFactors = FALSE)
If we consider a case to to an observation of a country in a given year, then these data are not tidy. We would like to have a data table with variables country
, year
, and HIV.perc
There are several problems that need fixing.
head(HIVdata, 3)
## Estimated.HIV.Prevalence.....Ages.15.49. X1979 X1980 X1981 X1982 X1983
## 1 Abkhazia NA NA NA NA NA
## 2 Afghanistan NA NA NA NA NA
## 3 Akrotiri and Dhekelia NA NA NA NA NA
## X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## X2008 X2009 X2010 X2011
## 1 NA NA NA NA
## 2 NA 0.06 0.06 0.06
## 3 NA NA NA NA
The “name” of the first variable is really a title for the data set and should be changed to country
. We can do this by selecting all of the columns and renaming the ones that begin with "Estimated"
. (We’ll deal with those names that start with X
at a different point in the process – those columns have more important issues to fix first.)
HIVdata2 <-
HIVdata %>%
select(country = starts_with("Estimated"), starts_with("X"))
head(HIVdata2, 3)
## country X1979 X1980 X1981 X1982 X1983 X1984 X1985 X1986
## 1 Abkhazia NA NA NA NA NA NA NA NA
## 2 Afghanistan NA NA NA NA NA NA NA NA
## 3 Akrotiri and Dhekelia NA NA NA NA NA NA NA NA
## X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010
## 1 NA NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA 0.06 0.06
## 3 NA NA NA NA NA NA NA NA NA NA NA NA
## X2011
## 1 NA
## 2 0.06
## 3 NA
For each row of the form:
country X1979 X1980 X1981 ... X2011
countryname v2 v3 v4 ... v34
we would like to create multiple rows that look like
country year HIV.perc
countryname X1979 v2
countryname X1980 v3
countryname X1981 v4
... ... ...
countryname X2011 v34
That is, we want to gather the values from multiple columns into a single column using the names of those columns as another column (generically called the the key). That is we are turning the column names and column values into key-value pairs. All the key-value pairs from a given row of the wide data will be associated with copies of the variables that were not part of the key-value pair (in this case, the country gets repeated 33 times).
To do this we
gather()
function from the tidyr
packageyear
and HIV.perc
)X
, or equivalently, all columns except the country
column).HIVdata3 <-
HIVdata2 %>% gather( year, HIV.perc, -country)
head(HIVdata3, 3)
## country year HIV.perc
## 1 Abkhazia X1979 NA
## 2 Afghanistan X1979 NA
## 3 Akrotiri and Dhekelia X1979 NA
Now we need to remove the X
from the years. We can do this with the extract_numeric()
function from tidyr
:
HIVdata4 <-
HIVdata3 %>% mutate( year = extract_numeric(year))
head(HIVdata4, 3)
## country year HIV.perc
## 1 Abkhazia 1979 NA
## 2 Afghanistan 1979 NA
## 3 Akrotiri and Dhekelia 1979 NA
Now the data are ready for a plot. Let’s compare just a few countries. There is very little data before 1990, so we’ll filter the data to include just the more recent years.
HIVdata4 %>%
filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
"Zimbabwe", "United States")) %>%
filter(year > 1988) %>%
ggplot(aes(x = year, y = HIV.perc, color = country)) +
geom_line(size=2, alpha=0.5)
## Warning: Removed 6 rows containing missing values (geom_path).
If we reorder the countries based on the HIV prevalence, then our legend will be in a better order compared with the plot.
HIVdata4 %>%
filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
"Zimbabwe", "United States")) %>%
filter(year > 1988) %>%
mutate(country =
reorder(country, HIV.perc,
function(x) - max(x, na.rm=TRUE))) %>%
ggplot(aes(x = year, y = HIV.perc, color = country)) +
geom_line(size=2, alpha=0.5)
## Warning: Removed 6 rows containing missing values (geom_path).
## Exercises
Download an additional data file from Gapminder, tidy it, and merge it with the HIV data so that you can make a scatter plot of HIV prevelance vs some other variable that might reveal and interesting pattern? What will you do about the fact that you have multiple years?
Download still more data files and use them to add additional aesthetics to the plot.
Note: We haven’t talked much about dynamic plots, but Hans Rosling’s famous moving bubble plots are one way to deal with the time variable.