Working with data in packages is great for learning how R works, but eventually you will want to use your own data, or data that you obtain from somewhere else. These notes discuss ways to get data into R.

Some sources for data

Here are some places that have data sets available online

data.gov has more than 10,000 data sets in csv format (plus others in other formats)
publicdata.eu has more than 10,000 data sets in csv format (plus others in other formats)
World Bank has data in several formats, including csv an Excel
datahub has over 1,000 data sets in csv format (plus others in other formats)
CSV format data from several R packages
- It is easier to get this data from within the package, but they can be used as examples for import from CSV format. The resulting data is typically all tidied and ready to be used for analysis.

Importing CSV and Excel Data

Data can be stored in many file formats. The CSV (comma separated values) and TSV (tab separated values) file formats are among the most commonly used, and most data handling software can export to CSV and import from CSV, so it good to learn how to work with CSV data.

Excel can also export to CSV, but it is safer to import Excel data directly, since the export to CSV and reimport. Reading directly from Excel has several advantages:

It avoids having multiple copies of the data, which could become out of sync and cause confusion.
It makes it easy to automate work with Excel spreadsheets.
It allows you to work with Excel spreasheets even if you don’t have Excel running (or even installed) on your machine.
If we use the readxl package for this, then certain oddities of Excel files are taken care of for us. Most importantly, readxl takes care of date conversion an knows how to handle data from multiple versions of Excel. (This can be a source of major problems if you use Excel-generated CSV files.)

Exercise: Obtain a csv file or Excel spreadsheet containing data. You can create or use your own data or find a data set online.

Uploading Files to RStudio Server

This step is only required in the web version of RStudio when the file you want to import is on your local machine and not on the server. This will save a copy of the file on your local machine on the server.

In the Files tab, navigate to the folder where you want to store the file.
Choose “Upload File”
Navigate to the file on your local machine.

That’s all there is to it. You can use this procedure to upload files of any type.

Importing Data into your R Session

In the Environment Tab find “Import Dataset”.
Choose the appropriate file type

import from

Note: The files must live on the same machine as RStudio.  If you
are using the web version of RStudio, you may need to upload your file
first.  See the previous section.

In the form below, fill out the information for your data set
1. Navigate to the file, or paste in the URL.
2. Name the data set (the default name is derived from the file name).
3. Fill out the form to declare the structure of the file.
4. Hit “Import” to import the data.

This will place the data set in your working environment.

Using R commands directly

If you import data using the interface described above, you will see that RStudio executes some commands in the console. Using RStudio’s ability to import data is handy for interactive use, but it is not part of a reproducible workflow an can’t be use in RMarkdown documents. You can simply copy and paste the commands from the console (skip the View() command if you are copying into an RMarkdown document since View() is not available there.)

Since CSV files are very common and there are multiple ways to import CSV files, it is good to know a bit more about CSV imports. The main functions we can use for this are read.csv() and readr::read_csv() and mosaic::read.file(). These functions work similarly.

For our example, we will use the Anscombe.csv file which you can obtain using the command below.

download.file(
  "http://vincentarelbundock.github.io/Rdatasets/csv/car/Anscombe.csv",
  "Anscombe.csv")

Now let’s import the data using each of these three functions.

Anscombe1 <- read.csv("Anscombe.csv", header=TRUE)
require(readr)

## Loading required package: readr

Anscombe2 <- read_csv("Anscombe.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   education = col_integer(),
##   income = col_integer(),
##   young = col_double(),
##   urban = col_integer()
## )

require(mosaic)
Anscombe3 <- read.file("Anscombe.csv")

## Reading data with read.csv()

The following shows a few of the differences among these functions.

inspect(Anscombe1)

## 
## categorical variables:  
##   name  class levels  n missing                                  distribution
## 1    X factor     51 51       0 AK (2%), AL (2%), AR (2%) ...                
## 
## quantitative variables:  
##        name   class  min   Q1 median   Q3  max mean    sd  n missing
## 1 education integer  112  165    192  228  372  196  46.5 51       0
## 2    income integer 2081 2786   3257 3612 4425 3225 560.0 51       0
## 3     young numeric  326  342    354  369  440  359  24.0 51       0
## 4     urban integer  322  552    664  790 1000  665 151.3 51       0

inspect(Anscombe2)

## 
## categorical variables:  
##   name     class levels  n missing                                  distribution
## 1   X1 character     51 51       0 AK (2%), AL (2%), AR (2%) ...                
## 
## quantitative variables:  
##        name   class  min   Q1 median   Q3  max mean    sd  n missing
## 1 education integer  112  165    192  228  372  196  46.5 51       0
## 2    income integer 2081 2786   3257 3612 4425 3225 560.0 51       0
## 3     young numeric  326  342    354  369  440  359  24.0 51       0
## 4     urban integer  322  552    664  790 1000  665 151.3 51       0

inspect(Anscombe3)

## 
## categorical variables:  
##   name     class levels  n missing                                  distribution
## 1    X character     51 51       0 AK (2%), AL (2%), AR (2%) ...                
## 
## quantitative variables:  
##        name   class  min   Q1 median   Q3  max mean    sd  n missing
## 1 education integer  112  165    192  228  372  196  46.5 51       0
## 2    income integer 2081 2786   3257 3612 4425 3225 560.0 51       0
## 3     young numeric  326  342    354  369  440  359  24.0 51       0
## 4     urban integer  322  552    664  790 1000  665 151.3 51       0

read.csv() needs to be told that there is a header row containing the variable names. The others need to be told when there is not a header row.
read.csv() modifies variable names to make them “nice”; read_csv() does not. Notice that in Anscombe2 has no name for the first variable, while Anscombe1 renames this as X. We can rename this manually if we like:

names(Anscombe2)[1] <- "state"  # rename first variable
xyplot( education ~ income, data = Anscombe2)

read.csv() converts character data into factor data (by default); read_csv() does not.
read.file() uses the file name to guess which of several functions to use for file import. In this case it is using read.csv(). It can also read several other types of files. See ?read.file for details.
depending on the file structure, read_csv() or read.csv() may fail when the other succeeds. It’s good to know about both if you are using data from online sources.

Importing Excel Spreadsheets

In the same way, we can use R commands to import from an Excel spreadsheet (for example, by copying and pasting from the console after using the GUI). Here’s what that would look like.

download.file(
  "http://api.worldbank.org/v2/en/topic/17?downloadformat=excel",
  "Gender.xls")

The first two rows of this spreadsheet don’t contain data, so we’ll skip them.

require(readxl) 
Gender <- read_excel("Gender.xls", skip=2)
inspect(Gender[, 1:10])

## 
## categorical variables:  
##             name     class levels     n missing
## 1   Country Name character    248 33728       0
## 2   Country Code character    247 33592     136
## 3 Indicator Name character    136 33728       0
## 4 Indicator Code character    136 33728       0
##                                    distribution
## 1 Afghanistan (0.4%), Albania (0.4%) ...       
## 2 ABW (0.4%), AFG (0.4%), AGO (0.4%) ...       
## 3  (%) ...                                     
## 4 IC.FRM.FEMO.ZS (0.7%) ...                    
## 
## quantitative variables:  
##   name   class  min   Q1 median  Q3 max mean  sd    n missing
## 1 1960 numeric 1.94 38.5   61.4 143 655  118 135 1794   31934
## 2 1961 numeric 1.94 38.9   61.8 140 648  116 133 1794   31934
## 3 1962 numeric 1.79 39.4   62.4 139 641  116 131 1792   31936
## 4 1963 numeric 1.82 40.0   62.7 139 631  115 129 1789   31939
## 5 1964 numeric 1.79 40.4   63.0 139 630  114 127 1790   31938
## 6 1965 numeric 1.74 41.1   63.5 137 628  113 125 1792   31936

We would like to reformat this data so that year is one of the variables instead of having a column for each year. We can do this with the tidyr package; we just need to inicate which columns should be converte to data and give a name for the variable (key) and its values (value). The remaining variables are duplicated as needed.

require(tidyr)
Gender2 <- 
  Gender %>% 
  gather(
    "year", "value",
    `1960`:`2014`
  ) %>% 
  filter(!is.na(value)) %>%
  mutate(year = extract_numeric(year))

## extract_numeric() is deprecated: please use readr::parse_number() instead

inspect(Gender2)

## 
## categorical variables:  
##             name     class levels      n missing
## 1   Country Name character    247 446159       0
## 2   Country Code character    246 446036     123
## 3 Indicator Name character    136 446159       0
## 4 Indicator Code character    136 446159       0
##                                    distribution
## 1 Italy (0.7%), Greece (0.7%) ...              
## 2 ITA (0.7%), GRC (0.7%), PAN (0.7%) ...       
## 3  (%) ...                                     
## 4 SP.DYN.LE00.FE.IN (2.8%) ...                 
## 
## quantitative variables:  
##    name   class  min     Q1 median     Q3      max  mean      sd      n missing
## 1  year numeric 1960 1987.0 1998.0 2006.0     2014  1995      13 446159       0
## 2 value numeric    0   23.9   53.9   86.3 74842048 40855 1021226 446159       0

head(Gender2)

## # A tibble: 6 x 6
##   `Country Name` `Country Code`
##            <chr>          <chr>
## 1          Aruba            ABW
## 2          Aruba            ABW
## 3          Aruba            ABW
## 4          Aruba            ABW
## 5          Aruba            ABW
## 6          Aruba            ABW
## # ... with 4 more variables: `Indicator Name` <chr>, `Indicator Code` <chr>, year <dbl>,
## #   value <dbl>

Now we can pick one of the inicators and some countries and compare them over time.

Gender2 %>% 
  filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
         `Indicator Code` %in% c("SP.ADO.TFRT")) %>%
  xyplot(value ~ year, data = ., groups = `Country Code`, type = "l",
         auto.key=list(lines=TRUE, points=FALSE, columns=5), 
         ylab = Gender2$`Indicator Name`[1])

Gender2 %>% 
  filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
         `Indicator Code` %in% c("SP.ADO.TFRT")) %>%
  ggplot() +
  geom_line(aes(y=value, x=year, colour=`Country Code`)) +
  labs(y = Gender2$`Indicator Name`[1])

Exercise: Import an Excel spreadsheet into R an create a plot.

Importing from Google Spreadsheets

The googlesheets package makes it easy to read data from (and write data to) google spreadsheets. You can find a vignette describing how to use the package here:

http://htmlpreview.github.io/?https://raw.githubusercontent.com/jennybc/googlesheets/master/vignettes/basic-usage.html

Other Data Formats

Many packages provide the ability to read specialized forms of data (e.g. genetic sequence data) or data saved in formats native to other software (SAS, SPSS, etc.). The foreign package, for example, has functions to import data from Minitab, S, SAS, Stat, Systat, Weka, octave, and more.

For data specialized for particular purposes, you might check a relevant CRAN Task View.

For things related to biological applications, Bioconductor is another place to look. In particular, take a look at the Bioconductor Workflows.

Getting Data Into R

R Pruim

CVC 2016