Using Gapminder Data

Obtaining Data from Gapminder

Gapminder makes available a number of data sets at http://www.gapminder.org/data/.
Links there will lead you to publicly viewable google spreadsheets.
In the example below, we will use the first one in the list, which provides data for many countries about the percentage of the population (age 14-49) that had HIV over a number of years. We will use this data to make a plot something like this one

You can view this data here.

If you look at it, you will see that we have some work ahead of us to make this usable (i.e. glyph ready) for our plot. But first let’s import the data, then we will figure out how to get it glyph ready.

Grabbing the data directly from google

We can use the googlesheets package makes it possible to grab the data directly from google. Make sure you have googlesheets installed.

First we establish a connection to the google spreadsheet.

library(googlesheets)
google_connection <-
  gs_url("https://docs.google.com/spreadsheets/d/1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA/pub?gid=0", visibility="public")

## Sheet-identifying info appears to be a browser URL.
## googlesheets will attempt to extract sheet key from the URL.
## Putative key: 1kWH_xdJDM4SMfT_Kzpkk-1yuxWChfurZuWYjfmv51EA
## Authentication will not be used.
## Worksheets feed constructed with public visibility

Then we import the data from the first worksheet at this connection. (You can specify other spreadsheets if you don’t want the first one, but the first one is the default.)

HIVdata <- gs_read(google_connection)

## Accessing worksheet titled "Data"

## Error in gsheets_GET(this_ws$exportcsv, to_xml = FALSE, use_auth = !ss$is_public): Not expecting content-type to be:
## text/html; charset=UTF-8

Unfortunately, this particular spreadsheet doesn’t read in quite right. But is we specify the range of cells that contain data, then it works. The connection object reports a bunch of information, including the number of rows and columns in the worksheets.

google_connection$ws$row_extent

## [1] 276  24  20  20  20  20

google_connection$ws$col_extent

## [1] 34  6  6  6  6  6

We can use this information to select a range, and then things work.

HIVdata <- 
  gs_read(google_connection,
          range = cell_limits(c(1,1), c(276, 34))
  )

## Accessing worksheet titled "Data"

dim(HIVdata)

## [1] 275  34

Grabbing the data via csv

If you have trouble getting the data from google, you can load it from this CSV file instead.

HIVdata <- read.csv("HIV.csv", header = TRUE, stringsAsFactors = FALSE)

Getting the Data Glyph Ready

If we consider a case to to an observation of a country in a given year, then these data are not tidy. We would like to have a data table with variables country, year, and HIV.perc There are several problems that need fixing.

head(HIVdata, 3)

##   Estimated.HIV.Prevalence.....Ages.15.49. X1979 X1980 X1981 X1982 X1983
## 1                                 Abkhazia    NA    NA    NA    NA    NA
## 2                              Afghanistan    NA    NA    NA    NA    NA
## 3                    Akrotiri and Dhekelia    NA    NA    NA    NA    NA
##   X1984 X1985 X1986 X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995
## 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##   X1996 X1997 X1998 X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007
## 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##   X2008 X2009 X2010 X2011
## 1    NA    NA    NA    NA
## 2    NA  0.06  0.06  0.06
## 3    NA    NA    NA    NA

The “name” of the first variable is really a title for the data set and should be changed to country. We can do this by selecting all of the columns and renaming the ones that begin with "Estimated". (We’ll deal with those names that start with X at a different point in the process – those columns have more important issues to fix first.)

HIVdata2 <- 
  HIVdata %>% 
  select(country = starts_with("Estimated"), starts_with("X"))
head(HIVdata2, 3)

##                 country X1979 X1980 X1981 X1982 X1983 X1984 X1985 X1986
## 1              Abkhazia    NA    NA    NA    NA    NA    NA    NA    NA
## 2           Afghanistan    NA    NA    NA    NA    NA    NA    NA    NA
## 3 Akrotiri and Dhekelia    NA    NA    NA    NA    NA    NA    NA    NA
##   X1987 X1988 X1989 X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998
## 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##   X1999 X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007 X2008 X2009 X2010
## 1    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 2    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA  0.06  0.06
## 3    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
##   X2011
## 1    NA
## 2  0.06
## 3    NA

For each row of the form:

country      X1979  X1980  X1981  ... X2011
countryname     v2     v3     v4  ...   v34

we would like to create multiple rows that look like

    country    year  HIV.perc
countryname   X1979        v2
countryname   X1980        v3
countryname   X1981        v4    
    ...        ...        ...
countryname   X2011       v34

That is, we want to gather the values from multiple columns into a single column using the names of those columns as another column (generically called the the key). That is we are turning the column names and column values into key-value pairs. All the key-value pairs from a given row of the wide data will be associated with copies of the variables that were not part of the key-value pair (in this case, the country gets repeated 33 times).

To do this we

use the gather() function from the tidyr package
specify names for the key and value columns (year and HIV.perc)
specify which columns are being gathered (all the ones that start with X, or equivalently, all columns except the country column).

HIVdata3 <-
  HIVdata2 %>% gather( year, HIV.perc, -country)
head(HIVdata3, 3)

##                 country  year HIV.perc
## 1              Abkhazia X1979       NA
## 2           Afghanistan X1979       NA
## 3 Akrotiri and Dhekelia X1979       NA

Now we need to remove the X from the years. We can do this with the extract_numeric() function from tidyr:

HIVdata4 <-
  HIVdata3 %>% mutate( year = extract_numeric(year))
head(HIVdata4, 3)

##                 country year HIV.perc
## 1              Abkhazia 1979       NA
## 2           Afghanistan 1979       NA
## 3 Akrotiri and Dhekelia 1979       NA

Now the data are ready for a plot. Let’s compare just a few countries. There is very little data before 1990, so we’ll filter the data to include just the more recent years.

HIVdata4 %>%
  filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
                        "Zimbabwe", "United States")) %>%
  filter(year > 1988) %>%
  ggplot(aes(x = year, y = HIV.perc, color = country)) +
  geom_line(size=2, alpha=0.5)

## Warning: Removed 6 rows containing missing values (geom_path).

If we reorder the countries based on the HIV prevalence, then our legend will be in a better order compared with the plot.

HIVdata4 %>%
  filter(country %in% c("Uganda", "Kenya", "Tanzania", "South Africa",
                        "Zimbabwe", "United States")) %>%
  filter(year > 1988) %>%
  mutate(country = 
           reorder(country, HIV.perc, 
                   function(x) - max(x, na.rm=TRUE))) %>% 
  ggplot(aes(x = year, y = HIV.perc, color = country)) +
  geom_line(size=2, alpha=0.5)

## Warning: Removed 6 rows containing missing values (geom_path).

## Exercises

Download an additional data file from Gapminder, tidy it, and merge it with the HIV data so that you can make a scatter plot of HIV prevelance vs some other variable that might reveal and interesting pattern? What will you do about the fact that you have multiple years?
Download still more data files and use them to add additional aesthetics to the plot.

Note: We haven’t talked much about dynamic plots, but Hans Rosling’s famous moving bubble plots are one way to deal with the time variable.

Using Gapminder Data

R Pruim

CVC 2015

Obtaining Data from Gapminder

Grabbing the data directly from google

Grabbing the data via csv

Getting the Data Glyph Ready