Working with data in packages is great for learning how R works, but eventually you will want to use your own data, or data that you obtain from somewhere else. These notes discuss ways to get data into R.
Here are some places that have data sets available online
data.gov has more than 10,000 data sets in csv format (plus others in other formats)
publicdata.eu has more than 10,000 data sets in csv format (plus others in other formats)
World Bank has data in several formats, including csv an Excel
datahub has over 1,000 data sets in csv format (plus others in other formats)
Data can be stored in many file formats. The CSV (comma separated values) and TSV (tab separated values) file formats are among the most commonly used, and most data handling software can export to CSV and import from CSV, so it good to learn how to work with CSV data.
Excel can also export to CSV, but it is safer to import Excel data directly, since the export to CSV and reimport. Reading directly from Excel has several advantages:
readxl
package for this, then certain oddities of Excel files are taken care of for us. Most importantly, readxl
takes care of date conversion an knows how to handle data from multiple versions of Excel. (This can be a source of major problems if you use Excel-generated CSV files.)Exercise: Obtain a csv file or Excel spreadsheet containing data. You can create or use your own data or find a data set online.
This step is only required in the web version of RStudio when the file you want to import is on your local machine and not on the server. This will save a copy of the file on your local machine on the server.
In the Files tab, navigate to the folder where you want to store the file.
Choose “Upload File”
Navigate to the file on your local machine.
That’s all there is to it. You can use this procedure to upload files of any type.
In the Environment Tab find “Import Dataset”.
Choose the appropriate file type
Note: The files must live on the same machine as RStudio. If you
are using the web version of RStudio, you may need to upload your file
first. See the previous section.
In the form below, fill out the information for your data set
Navigate to the file, or paste in the URL.
Name the data set (the default name is derived from the file name).
Fill out the form to declare the structure of the file.
Hit “Import” to import the data.
This will place the data set in your working environment.
If you import data using the interface described above, you will see that RStudio executes some commands in the console. Using RStudio’s ability to import data is handy for interactive use, but it is not part of a reproducible workflow an can’t be use in RMarkdown documents. You can simply copy and paste the commands from the console (skip the View()
command if you are copying into an RMarkdown document since View()
is not available there.)
Since CSV files are very common and there are multiple ways to import CSV files, it is good to know a bit more about CSV imports. The main functions we can use for this are read.csv()
and readr::read_csv()
and mosaic::read.file()
. These functions work similarly.
For our example, we will use the Anscombe.csv
file which you can obtain using the command below.
download.file(
"http://vincentarelbundock.github.io/Rdatasets/csv/car/Anscombe.csv",
"Anscombe.csv")
Now let’s import the data using each of these three functions.
Anscombe1 <- read.csv("Anscombe.csv", header=TRUE)
require(readr)
## Loading required package: readr
Anscombe2 <- read_csv("Anscombe.csv")
require(mosaic)
Anscombe3 <- read.file("Anscombe.csv")
## Reading data with read.csv()
The following shows a few of the differences among these functions.
inspect(Anscombe1)
##
## categorical variables:
## name class levels n missing distribution
## 1 X factor 51 51 0 AK (2%), AL (2%), AR (2%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 education integer 112 165 192 228 372 196 46.5 51 0
## 2 income integer 2081 2786 3257 3612 4425 3225 560.0 51 0
## 3 young numeric 326 342 354 369 440 359 24.0 51 0
## 4 urban integer 322 552 664 790 1000 665 151.3 51 0
inspect(Anscombe2)
##
## categorical variables:
## name class levels n missing distribution
## 1 character 51 51 0 AK (2%), AL (2%), AR (2%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 education integer 112 165 192 228 372 196 46.5 51 0
## 2 income integer 2081 2786 3257 3612 4425 3225 560.0 51 0
## 3 young numeric 326 342 354 369 440 359 24.0 51 0
## 4 urban integer 322 552 664 790 1000 665 151.3 51 0
inspect(Anscombe3)
##
## categorical variables:
## name class levels n missing distribution
## 1 X character 51 51 0 AK (2%), AL (2%), AR (2%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 education integer 112 165 192 228 372 196 46.5 51 0
## 2 income integer 2081 2786 3257 3612 4425 3225 560.0 51 0
## 3 young numeric 326 342 354 369 440 359 24.0 51 0
## 4 urban integer 322 552 664 790 1000 665 151.3 51 0
read.csv()
needs to be told that there is a header row containing the variable names. The others need to be told when there is not a header row.
read.csv()
modifies variable names to make them “nice”; read_csv()
does not. Notice that in Anscombe2
has no name for the first variable, while Anscombe1
renames this as X
. We can rename this manually if we like:
names(Anscombe2)[1] <- "state" # rename first variable
xyplot( education ~ income, data = Anscombe2)
read.csv()
converts character data into factor data (by default); read_csv()
does not.
read.file()
uses the file name to guess which of several functions to use for file import. In this case it is using read.csv()
. It can also read several other types of files. See ?read.file
for details.
depending on the file structure, read_csv()
or read.csv()
may fail when the other succeeds. It’s good to know about both if you are using data from online sources.
In the same way, we can use R commands to import from an Excel spreadsheet (for example, by copying and pasting from the console after using the GUI). Here’s what that would look like.
download.file(
"http://api.worldbank.org/v2/en/topic/17?downloadformat=excel",
"Gender.xls")
The first two rows of this spreadsheet don’t contain data, so we’ll skip them.
require(readxl)
Gender <- read_excel("Gender.xls", skip=2)
inspect(Gender[, 1:10])
##
## categorical variables:
## name class levels n missing
## 1 Country Name character 248 33728 0
## 2 Country Code character 247 33592 136
## 3 Indicator Name character 136 33728 0
## 4 Indicator Code character 136 33728 0
## distribution
## 1 Afghanistan (0.4%), Albania (0.4%) ...
## 2 ABW (0.4%), AFG (0.4%), AGO (0.4%) ...
## 3 (%) ...
## 4 IC.FRM.FEMO.ZS (0.7%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 1960 numeric 1.94 38.5 61.4 143 655 118 135 1794 31934
## 2 1961 numeric 1.94 38.9 61.8 140 648 116 133 1794 31934
## 3 1962 numeric 1.79 39.4 62.4 139 641 116 131 1792 31936
## 4 1963 numeric 1.82 40.0 62.7 139 631 115 129 1789 31939
## 5 1964 numeric 1.79 40.4 63.0 139 630 114 127 1790 31938
## 6 1965 numeric 1.74 41.1 63.5 137 628 113 125 1792 31936
We would like to reformat this data so that year is one of the variables instead of having a column for each year. We can do this with the tidyr
package; we just need to inicate which columns should be converte to data and give a name for the variable (key
) and its values (value
). The remaining variables are duplicated as needed.
require(tidyr)
Gender2 <-
Gender %>%
gather(
"year", "value",
`1960`:`2014`
) %>%
filter(!is.na(value)) %>%
mutate(year = extract_numeric(year))
inspect(Gender2)
##
## categorical variables:
## name class levels n missing
## 1 Country Name character 247 446159 0
## 2 Country Code character 246 446036 123
## 3 Indicator Name character 136 446159 0
## 4 Indicator Code character 136 446159 0
## distribution
## 1 Italy (0.7%), Greece (0.7%) ...
## 2 ITA (0.7%), GRC (0.7%), PAN (0.7%) ...
## 3 (%) ...
## 4 SP.DYN.LE00.FE.IN (2.8%) ...
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 year numeric 1960 1987.0 1998.0 2006.0 2014 1995 13 446159 0
## 2 value numeric 0 23.9 53.9 86.3 74842048 40855 1021226 446159 0
head(Gender2)
## Source: local data frame [6 x 6]
##
## Country Name Country Code Indicator Name
## (chr) (chr) (chr)
## 1 Aruba ABW Adolescent fertility rate (births per 1,000 women ages 15-19)
## 2 Aruba ABW Mortality rate, adult, female (per 1,000 female adults)
## 3 Aruba ABW Mortality rate, adult, male (per 1,000 male adults)
## 4 Aruba ABW Life expectancy at birth, female (years)
## 5 Aruba ABW Life expectancy at birth, male (years)
## 6 Aruba ABW Fertility rate, total (births per woman)
## Variables not shown: Indicator Code (chr), year (dbl), value (dbl)
Now we can pick one of the inicators and some countries and compare them over time.
Gender2 %>%
filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
`Indicator Code` %in% c("SP.ADO.TFRT")) %>%
xyplot(value ~ year, data = ., groups = `Country Code`, type = "l",
auto.key=list(lines=TRUE, points=FALSE, columns=5),
ylab = Gender2$`Indicator Name`[1])
Gender2 %>%
filter(`Country Code` %in% c("USA", "NLD", "SWE", "KOR", "JPN"),
`Indicator Code` %in% c("SP.ADO.TFRT")) %>%
ggplot() +
geom_line(aes(y=value, x=year, colour=`Country Code`)) +
labs(y = Gender2$`Indicator Name`[1])
Exercise: Import an Excel spreadsheet into R an create a plot.
The googlesheets
package makes it easy to read data from (and write data to) google spreadsheets. You can find a vignette describing how to use the package here:
Many packages provide the ability to read specialized forms of data (e.g. genetic sequence data) or data saved in formats native to other software (SAS, SPSS, etc.). The foreign
package, for example, has functions to import data from Minitab, S, SAS, Stat, Systat, Weka, octave, and more.
For data specialized for particular purposes, you might check a relevant CRAN Task View.
For things related to biological applications, Bioconductor is another place to look. In particular, take a look at the Bioconductor Workflows.