Working with data in packages is great for learning how R works, but eventually you will want to use your own data, or data that you obtain from somewhere else. These notes discuss ways to get data into R.

Some sources for data

Here are some places that have data sets available online

Importing CSV Data

Data can be stored in many file formats. The CSV (comma separated values) and TSV (tab separated values) file formats are among the most commonly used, and most data handling software can export to CSV and import from CSV, so it good to learn how to work with CSV data.

Exercise: Obtain a csv file containing data. You can create your own csv file from a google or Excel spreadsheet, or you can find a CSV file online.

Note: If you have data in an Excel spreadsheet, you can “save as” csv. This isn’t exactly the same thing as exporting as csv because the open file switches formats and you are no longer working with a regular Excel file.

Uploading Files to RStudio Server

*This step is only required in the web version of RStudio when the file you want to import is on your local machine and not on the server.

  1. In the Files tab, navigate to the folder where you want to store the file.
  2. Choose “Upload File”
  3. Navigate to the file on your local machine.

That’s all there is to it. You can use this procedure to upload files of any type.

Importing Data into your R Session

  1. In the Environment Tab find “Import Dataset”.
  2. Choose “From Text File” or “From Web URL”

    • Note: text files must live on the same machine as RStudio. If you are using the web version of RStudio, you may need to upload your file first. See above.
  3. Navigate to the file, or paste in the URL.
  4. Name the data set (the default name is derived from the file name).
  5. Fill out the form to declare the structure of the file.
  6. Hit “Import” to import the data.

This will place the data set in your working environment.

Using R commands directly

Using RStudio’s ability to import data is handy for interactive use, but it is not part of a reproducible workflow an can’t be use in RMarkdown documents. So it is good to learn how to use R commands to import CSV files.

Note: When you use RStudio’s interactive interface, the command used to read the file is printe in the console, so you can also import interactively and copy this command into your RMarkdown document.

The main functions we can use for this are read.csv() and readr::read_csv(). These functions work similarly.

For our example, we will use the Anscombe.csv file which you can obtain using the command below.

download.file(
  "http://vincentarelbundock.github.io/Rdatasets/csv/car/Anscombe.csv",
  "Anscombe.csv")
Anscombe1 <- read.csv("Anscombe.csv", header=TRUE)
require(readr)
## Loading required package: readr
Anscombe2 <- read_csv("Anscombe.csv")

The following shows a few of the differences between the two functions.

glimpse(Anscombe1)
## Observations: 51
## Variables:
## $ X         (fctr) ME, NH, VT, MA, RI, CT, NY, NJ, PA, OH, IN, IL, MI,...
## $ education (int) 189, 169, 230, 168, 180, 193, 261, 214, 201, 172, 19...
## $ income    (int) 2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954, 3419...
## $ young     (dbl) 350.7, 345.9, 348.5, 335.3, 327.1, 341.0, 326.2, 333...
## $ urban     (int) 508, 564, 322, 846, 871, 774, 856, 889, 715, 753, 64...
glimpse(Anscombe2)
## Observations: 51
## Variables:
## $ [EMPTY]   (chr) "ME", "NH", "VT", "MA", "RI", "CT", "NY", "NJ", "PA"...
## $ education (int) 189, 169, 230, 168, 180, 193, 261, 214, 201, 172, 19...
## $ income    (int) 2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954, 3419...
## $ young     (dbl) 350.7, 345.9, 348.5, 335.3, 327.1, 341.0, 326.2, 333...
## $ urban     (int) 508, 564, 322, 846, 871, 774, 856, 889, 715, 753, 64...
  • read.csv() needs to be told that there is a header row containing the variable names.

  • read.csv() modifies variable names to make them “nice”;
    read_csv() does not.
    Notice that in Anscombe2 has no name for the first variable, while Anscombe1 renames this as X.

  • read.csv() converts character data into factor data (by default); read_csv() does not.

names(Anscombe2)[1] <- "state"
xyplot( education ~ income, data = Anscombe2)

Importing Excel Spreadsheets

It is also possible to import data directly from an Excel spreadsheet without first saving it as a CSV file.
This has some important advantages:

  • It avoids having multiple copies of the data, which could become out of sync and cause confusion.
  • It makes it easy to automate work with Excel spreadsheets.
  • It allows you to work with Excel spreasheets even if you don’t have Excel running (or even installed) on your machine
  • If we use the readxl package for this, then certain oddities of Excel files are taken care of for us.
    Most importantly, readxl takes care of date conversion an knows how to handle data from multiple versions of Excel.
download.file(
  "http://api.worldbank.org/v2/en/topic/17?downloadformat=excel",
  "Gender.xls")

The first two rows of this spreadsheet don’t contain data, so we’ll skip them.

require(readxl)
## Loading required package: readxl
Gender <- read_excel("Gender.xls", skip=2)
glimpse(Gender[, 1:10])
## Observations: 33728
## Variables:
## $ Country Name   (chr) "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "A...
## $ Country Code   (chr) "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW"...
## $ Indicator Name (chr) "Firms with female participation in ownership (...
## $ Indicator Code (chr) "IC.FRM.FEMO.ZS", "SE.ADT.1524.LT.FE.ZS", "SE.A...
## $ 1960           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1961           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1962           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1963           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1964           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1965           (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

We would like to reformat this data so that year is one of the variables instead of having a column for each year. We can do this with the tidyr package; we just need to inicate which columns should be converte to data an give a name for the variable (key) and its values (value). The remaining variables are duplicated as needed.

require(tidyr)
Gender2 <- 
  Gender %>% 
  gather(
    "year", "value",
    `1960`:`2014`
  ) %>% 
  filter(!is.na(value)) %>%
  mutate(year = as.numeric(as.character(year)))
glimpse(Gender2)
## Observations: 446159
## Variables:
## $ Country Name   (chr) "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "A...
## $ Country Code   (chr) "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW"...
## $ Indicator Name (chr) "Adolescent fertility rate (births per 1,000 wo...
## $ Indicator Code (chr) "SP.ADO.TFRT", "SP.DYN.AMRT.FE", "SP.DYN.AMRT.M...
## $ year           (dbl) 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,...
## $ value          (dbl) 106.20620, 181.48380, 227.25340, 67.12900, 64.0...

Now we can pick one of the inicators and some countries and compare them over time.

Gender2 %>% 
  filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
         `Indicator Code` %in% c("SP.ADO.TFRT")) %>%
  xyplot(value ~ year, data = ., groups = `Country Code`, type = "l",
         auto.key=list(lines=TRUE, points=FALSE, columns=5), 
         ylab = Gender2$`Indicator Name`[1])

Gender2 %>% 
  filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
         `Indicator Code` %in% c("SP.ADO.TFRT")) %>%
  ggplot() +
  geom_line(aes(y=value, x=year, colour=`Country Code`)) +
  labs(y = Gender2$`Indicator Name`[1])

Exercise: Import an Excel spreadsheet into R an create a plot.

Importing from Google Spreadsheets

The googlesheets package makes it easy to read data from (and write data to) google spreadsheets. You can find a vignette describing how to use the package here:

Other Data Formats

Many packages provide the ability to read specialized forms of data (e.g. genetic sequence data) or data saved in formats native to other software (SAS, SPSS, etc.). The foreign package, for example, has functions to import data from Minitab, S, SAS, Stat, Systat, Weka, octave, and more.

For data specialized for particular purposes, you might check a relevant CRAN Task View.

For things related to biological applications, Bioconductor is another place to look. In particular, take a look at the Bioconductor Workflows.