Working with data in packages is great for learning how R works, but eventually you will want to use your own data, or data that you obtain from somewhere else. These notes discuss ways to get data into R.
Here are some places that have data sets available online
data.gov has more than 10,000 data sets in csv format (plus others in other formats)
publicdata.eu has more than 10,000 data sets in csv format (plus others in other formats)
World Bank has data in several formats, including csv an Excel
datahub has over 1,000 data sets in csv format (plus others in other formats)
Data can be stored in many file formats. The CSV (comma separated values) and TSV (tab separated values) file formats are among the most commonly used, and most data handling software can export to CSV and import from CSV, so it good to learn how to work with CSV data.
Exercise: Obtain a csv file containing data. You can create your own csv file from a google or Excel spreadsheet, or you can find a CSV file online.
Note: If you have data in an Excel spreadsheet, you can “save as” csv. This isn’t exactly the same thing as exporting as csv because the open file switches formats and you are no longer working with a regular Excel file.
*This step is only required in the web version of RStudio when the file you want to import is on your local machine and not on the server.
That’s all there is to it. You can use this procedure to upload files of any type.
Choose “From Text File” or “From Web URL”
Hit “Import” to import the data.
This will place the data set in your working environment.
Using RStudio’s ability to import data is handy for interactive use, but it is not part of a reproducible workflow an can’t be use in RMarkdown documents. So it is good to learn how to use R commands to import CSV files.
Note: When you use RStudio’s interactive interface, the command used to read the file is printe in the console, so you can also import interactively and copy this command into your RMarkdown document.
The main functions we can use for this are read.csv()
and readr::read_csv()
. These functions work similarly.
For our example, we will use the Anscombe.csv
file which you can obtain using the command below.
download.file(
"http://vincentarelbundock.github.io/Rdatasets/csv/car/Anscombe.csv",
"Anscombe.csv")
Anscombe1 <- read.csv("Anscombe.csv", header=TRUE)
require(readr)
## Loading required package: readr
Anscombe2 <- read_csv("Anscombe.csv")
The following shows a few of the differences between the two functions.
glimpse(Anscombe1)
## Observations: 51
## Variables:
## $ X (fctr) ME, NH, VT, MA, RI, CT, NY, NJ, PA, OH, IN, IL, MI,...
## $ education (int) 189, 169, 230, 168, 180, 193, 261, 214, 201, 172, 19...
## $ income (int) 2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954, 3419...
## $ young (dbl) 350.7, 345.9, 348.5, 335.3, 327.1, 341.0, 326.2, 333...
## $ urban (int) 508, 564, 322, 846, 871, 774, 856, 889, 715, 753, 64...
glimpse(Anscombe2)
## Observations: 51
## Variables:
## $ [EMPTY] (chr) "ME", "NH", "VT", "MA", "RI", "CT", "NY", "NJ", "PA"...
## $ education (int) 189, 169, 230, 168, 180, 193, 261, 214, 201, 172, 19...
## $ income (int) 2824, 3259, 3072, 3835, 3549, 4256, 4151, 3954, 3419...
## $ young (dbl) 350.7, 345.9, 348.5, 335.3, 327.1, 341.0, 326.2, 333...
## $ urban (int) 508, 564, 322, 846, 871, 774, 856, 889, 715, 753, 64...
read.csv()
needs to be told that there is a header row containing the variable names.
read.csv()
modifies variable names to make them “nice”;
read_csv()
does not.
Notice that in Anscombe2
has no name for the first variable, while Anscombe1
renames this as X
.
read.csv()
converts character data into factor data (by default); read_csv()
does not.
names(Anscombe2)[1] <- "state"
xyplot( education ~ income, data = Anscombe2)
It is also possible to import data directly from an Excel spreadsheet without first saving it as a CSV file.
This has some important advantages:
readxl
package for this, then certain oddities of Excel files are taken care of for us.readxl
takes care of date conversion an knows how to handle data from multiple versions of Excel.download.file(
"http://api.worldbank.org/v2/en/topic/17?downloadformat=excel",
"Gender.xls")
The first two rows of this spreadsheet don’t contain data, so we’ll skip them.
require(readxl)
## Loading required package: readxl
Gender <- read_excel("Gender.xls", skip=2)
glimpse(Gender[, 1:10])
## Observations: 33728
## Variables:
## $ Country Name (chr) "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "A...
## $ Country Code (chr) "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW"...
## $ Indicator Name (chr) "Firms with female participation in ownership (...
## $ Indicator Code (chr) "IC.FRM.FEMO.ZS", "SE.ADT.1524.LT.FE.ZS", "SE.A...
## $ 1960 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1961 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1962 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1963 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1964 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ 1965 (dbl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
We would like to reformat this data so that year is one of the variables instead of having a column for each year. We can do this with the tidyr
package; we just need to inicate which columns should be converte to data an give a name for the variable (key
) and its values (value
). The remaining variables are duplicated as needed.
require(tidyr)
Gender2 <-
Gender %>%
gather(
"year", "value",
`1960`:`2014`
) %>%
filter(!is.na(value)) %>%
mutate(year = as.numeric(as.character(year)))
glimpse(Gender2)
## Observations: 446159
## Variables:
## $ Country Name (chr) "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "A...
## $ Country Code (chr) "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW"...
## $ Indicator Name (chr) "Adolescent fertility rate (births per 1,000 wo...
## $ Indicator Code (chr) "SP.ADO.TFRT", "SP.DYN.AMRT.FE", "SP.DYN.AMRT.M...
## $ year (dbl) 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,...
## $ value (dbl) 106.20620, 181.48380, 227.25340, 67.12900, 64.0...
Now we can pick one of the inicators and some countries and compare them over time.
Gender2 %>%
filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
`Indicator Code` %in% c("SP.ADO.TFRT")) %>%
xyplot(value ~ year, data = ., groups = `Country Code`, type = "l",
auto.key=list(lines=TRUE, points=FALSE, columns=5),
ylab = Gender2$`Indicator Name`[1])
Gender2 %>%
filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
`Indicator Code` %in% c("SP.ADO.TFRT")) %>%
ggplot() +
geom_line(aes(y=value, x=year, colour=`Country Code`)) +
labs(y = Gender2$`Indicator Name`[1])
Exercise: Import an Excel spreadsheet into R an create a plot.
The googlesheets
package makes it easy to read data from (and write data to) google spreadsheets. You can find a vignette describing how to use the package here:
Many packages provide the ability to read specialized forms of data (e.g. genetic sequence data) or data saved in formats native to other software (SAS, SPSS, etc.). The foreign
package, for example, has functions to import data from Minitab, S, SAS, Stat, Systat, Weka, octave, and more.
For data specialized for particular purposes, you might check a relevant CRAN Task View.
For things related to biological applications, Bioconductor is another place to look. In particular, take a look at the Bioconductor Workflows.