“Every easy data format is alike. Every difficult data format is difficult in its own way.” — inspired by Leo Tolstoy, Anna Karenina
When you are working on a data-oriented project, it’s likely that you already have a source of data; that may be what motivated the project in the first place.
These notes are meant to help you identify situations where accessing the data will be easy, and to spot and fix common errors in data files.
The data table is a very common format for storing data. If your data are in this format, you likely won’t have much trouble putting them into a form that can be read directly into R or any other widely used technical computing environment.
The first thing is to realize what names are used to refer to data tables. Here are several, each with a brief description:

- A spreadsheet, e.g. in Excel or Google Sheets.
- A CSV file, a plain-text format readable by a huge variety of software.
- A technical software package format, e.g. the native file formats of packages such as SAS, SPSS, or Stata.
- A relational database, possibly on a remote server.
- An HTML <table> embedded in a web page.

The particular software and techniques for reading data in one of these formats vary depending on the format. For Excel or Google spreadsheet data, it’s often sufficient to use the application software to export the data as a CSV file. There are also R packages for reading directly from an Excel spreadsheet, which is useful if the spreadsheet is being updated frequently.
For the technical software package formats, the foreign R package’s reading and writing functions are very useful. For relational databases, even if they are on a remote server, there are several useful R packages, including dplyr and data.table.
CSV and HTML <table> formats are frequently encountered. The next subsections give a bit more detail about how to read them into R.
CSV stands for comma-separated values. It’s a text format that can be read by a huge variety of software. It has a data table format, with the values of the variables in each case separated by commas. Here’s an example of the first several lines of a CSV file:
"name","sex","count","year"
"Mary","F",7065,1880
"Anna","F",2604,1880
"Emma","F",2003,1880
"Elizabeth","F",1939,1880
The top row usually (but not always) contains the variable names. Quotation marks are often used at the start and end of character strings; these quotation marks are not part of the content of the string.
Although CSV files are often named with the .csv suffix, it’s also common for them to have a .txt or other extension. You will also see characters other than commas used to delimit the fields; tabs and vertical bars are particularly common.
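As a quick illustration, the fread() function from the data.table package (introduced below) can detect such delimiters automatically when handed a literal string of data; the values in this snippet are made up:

```r
library( data.table )
# fread() sniffs out the delimiter; this made-up snippet uses vertical bars.
Babies <- fread( "name|sex|count\nMary|F|7065\nAnna|F|2604" )
Babies
```

The same call works unchanged for comma- or tab-delimited input, which is why an explicit sep= argument is rarely needed.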
An excellent function for reading CSV files into R is fread() in the data.table package.1 Combined with the XML and RCurl packages, you can even read files stored on other computers if you have a URL for the file. For instance, here’s a way to access a .csv file over the Internet.
library( XML )
library( RCurl )
library( data.table )
library( dplyr )   # loaded here to provide the %>% pipe operator
# Find your own URL!
# Remember the quotes around the character string.
myURL <- "http://www.mosaic-web.org/go/datasets/SaratogaHouses.csv"
MyDataTable <-
myURL %>%
getURLContent() %>%
data.table::fread()
head( MyDataTable )
Price Living.Area Baths Bedrooms Fireplace Acres Age
1: 142212 1982 1.0 3 N 2.00 133
2: 134865 1676 1.5 3 Y 0.38 14
3: 118007 1694 2.0 3 Y 0.96 15
4: 138297 1800 1.0 2 Y 0.48 49
5: 129470 2088 1.0 3 Y 1.84 29
6: 206512 1456 2.0 3 N 0.98 10
Reading from a file on your own computer is even easier. You just need to have the file path, as can be found using file.choose(). For instance:
# Call file.choose() then copy the string with the file path below
fileName <- "~/Project1/Important.csv"
MyDataTable2 <-
fileName %>%
data.table::fread()
Useful arguments to fread():

- stringsAsFactors=FALSE is useful particularly when you plan to be cleaning the data.
- nrows=0 — just read the variable names. This is helpful when you are checking into the format and variable names of a data table. Of course, you might also want to look at a few rows of data, by setting nrows to a small positive integer, e.g. nrows=3.
- select=c(1,4,5,10) allows you to specify the variables you want to read in. This is very useful for large data files with extraneous information.
- drop=c(2,3,6) is like select, but drops the specified columns instead.

Web pages are usually in HTML format. This format is much more general than the data table format; it’s used for text, page arrangement, etc. Sometimes, however, there is tabular data contained within the HTML page. This will often look in your browser like the following:
Within HTML, such tables are represented with the HTML <table> tag.
When you have the URL of a page containing one or more tables, it can be easy to read them into R as data tables. Here’s the pattern.
library( XML )
library( RCurl )
library( dplyr )   # loaded here to provide the %>% pipe operator
SetOfTables <-
"http://en.wikipedia.org/wiki/Mile_run_world_record_progression" %>%
getURLContent() %>%
readHTMLTable( stringsAsFactors=FALSE )
The result, SetOfTables, is not a table; it is a list of the tables found in the web page. You can see how many tables were found like this:
length( SetOfTables )
[1] 8
There are 8 tables identified by readHTMLTable().
To look at the first one,
head( SetOfTables[[1]], 2 ) # Note: double square brackets
Time Athlete Nationality Date Venue
1 4:28 Charles Westhall United Kingdom 26 July 1855 London
2 4:28 Thomas Horspool United Kingdom 28 September 1857 Manchester
You can look at the second, third, and so on in the same way. When you have found the table or tables you want, assign them to an object name. For instance, the two tables shown in the figure above happen to be the third and fourth in SetOfTables.
MyTable <- SetOfTables[[3]]
MyOtherTable <- SetOfTables[[4]]
head( MyTable, 2 )
Time | Athlete | Nationality | Date | Venue
---|---|---|---|---
4:52 | Cadet Marshall | United Kingdom | 2 September 1852 | Addiscome
4:45 | Thomas Finch | United Kingdom | 3 November 1858 | Oxford
head( MyOtherTable, 2 )
Time | Auto | Athlete | Nationality | Date | Venue
---|---|---|---|---|---
4:14.4 | | John Paul Jones | United States | 31 May 1913[5] | Allston, Mass.
4:12.6 | | Norman Taber | United States | 16 July 1915[5] | Allston, Mass.
A person somewhat knowledgeable about running would have little trouble interpreting the previous tables correctly. The Time is in minutes and seconds. The Date gives the day on which the record was set.
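To compare such records numerically, you would need to convert the minutes-and-seconds strings into seconds. A small base-R helper (hypothetical, not part of any package) might look like this:

```r
# Hypothetical helper: convert "m:ss" or "m:ss.s" record times to seconds.
timeToSeconds <- function( x ) {
  parts <- strsplit( x, ":", fixed=TRUE )
  sapply( parts, function(p) 60 * as.numeric(p[1]) + as.numeric(p[2]) )
}
timeToSeconds( c("4:28", "4:12.6") )   # 268 and 252.6 seconds
```

Converting to a common numeric unit is typical of the cleaning steps discussed next.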
Data cleaning refers to taking the information contained in a variable and transforming it to a form in which that information can be used. As an example, consider the spreadsheet from which the OrdwayBirds data table was extracted.
It’s a simple matter to use the Google toolbar to download these data as a CSV file:
When the menu choice is made, the CSV file is put into a directory, from which it can be read with fread(). Here, a random subset of 5000 rows is read.
library( dplyr )   # loaded here for %>% and sample_n()
Random5000 <-
fread( "~/Downloads/BirdBanding-Sheet1.csv" ) %>%
sample_n( size=5000 )
The initial step in cleaning a file is often to give the variables useful names. Here are the names that come originally with the file:
names( Random5000 )
[1] "1730-20842" "Year (19xx)" "Day"
[4] "Month" "Time of capture" "species name"
[7] "Sex" "Age" "Band number"
[10] "Trap identification" "Weather" "Banding report?"
[13] "Recapture?" "Recapture date (month)" "Recapture date (day)"
[16] "Condition" "Release" "Comments"
[19] "Data entry person" "Weight" "Wing chord"
[22] "Temperature" "\"Recapture, original\"" "\"Recapture, previous\""
[25] "Tail length" "Capture location"
At best, these are too long to be convenient.
One way to change the names is like this:
newNames <- c('code','year','day','month','time','species','sex','age','bandNum',
'trapID','weather','bandingReport','recapture','recapMonth',
'recapDay','condition','release','comments','dataEntryPerson',
'weight','wingChord','temperature','OrigRecapture','previousRecapture',
'tailLength','captureLocation')
names( Random5000 ) <- newNames
For simplicity, consider just a few variables:
BirdData <- Random5000 %>%
select( year, day, month, time, species,
weight, wingChord, tailLength )
head( BirdData )
year day month time species weight wingChord tailLength
1: 1973 18 8 20:00:00 Song Sparrow
2: 1972 11 10 14:30:00 Slate-colored Junco
3: 1976 23 9 8:00:00 Lincoln's Sparrow 17 61
4: 1976 5 4 15:00:00 Slate-colored Junco 23.5 76
5: 1979 15 5 10:00:00 Catbird 37.9 93
6: 1973 12 1 8:00:00 Curve-billed Thrasher
A first step can be to check the type of each variable.
str( BirdData )
Classes 'data.table' and 'data.frame': 5000 obs. of 8 variables:
$ year : int 1973 1972 1976 1976 1979 1973 1975 1973 1976 1975 ...
$ day : int 18 11 23 5 15 12 27 13 2 8 ...
$ month : int 8 10 9 4 5 1 11 4 3 10 ...
$ time : chr "20:00:00" "14:30:00" "8:00:00" "15:00:00" ...
$ species : chr "Song Sparrow" "Slate-colored Junco" "Lincoln's Sparrow" "Slate-colored Junco" ...
$ weight : chr "" "" "17" "23.5" ...
$ wingChord : chr "" "" "61" "76" ...
$ tailLength: chr "" "" "" "" ...
- attr(*, ".internal.selfref")=<externalptr>
Although the weight, wingChord, and tailLength variables are supposed to be numerical, they have been read in as character strings. The as.numeric() function can be used to convert them to actual numbers. If the entries are non-numeric, they will be turned into NA.
BirdData <- BirdData %>%
mutate( weight=as.numeric(weight),
wingChord=as.numeric(wingChord),
tailLength=as.numeric(tailLength) )
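To see the conversion rule in isolation, here is what as.numeric() does to a small made-up character vector:

```r
# Non-numeric entries, including the empty string, become NA (with a warning,
# suppressed here for tidiness).
raw  <- c( "17", "23.5", "", "unknown" )
vals <- suppressWarnings( as.numeric(raw) )
vals   # 17.0 23.5 NA NA
```

The blank entries in BirdData become NA in exactly the same way.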
It’s helpful to plot the distribution of each of the variables to look for outliers.
ggplot( data=BirdData, aes(x=weight)) + geom_density()
ggplot( data=BirdData, aes(x=wingChord)) + geom_density()
ggplot( data=BirdData, aes(x=tailLength)) + geom_density()
There are a few birds with a weight below zero, but no other obvious anomalies. When there are anomalous values, you can filter those cases out of the data or turn them into NA.
BirdData <- BirdData %>%
filter( weight > 0 )
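If you would rather keep the rows and lose only the impossible values, the alternative mentioned above, turning them into NA, can be sketched in base R like this (the weights here are made up):

```r
# Recode impossible (non-positive) weights to NA instead of dropping the rows.
w <- c( 12.5, -1, 37.9, 0 )
w[ w <= 0 ] <- NA
w   # 12.5 NA 37.9 NA
```

Keeping the rows preserves the other variables for those cases, which matters if, say, wingChord is valid even when weight is not.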
Similarly, check the other variables to make sure the values fall within plausible ranges. For instance, the month should be between 1 and 12 (inclusive):
BirdData %>%
group_by( month ) %>%
summarise( count=n() )
Source: local data table [12 x 2]
month count
1 1 152
2 2 101
3 3 211
4 4 409
5 5 509
6 6 173
7 7 181
8 8 99
9 9 408
10 10 752
11 11 306
12 12 146
Checking variables in this way is tedious, but saves trouble in the future.
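Such range checks can also be automated so that they fail loudly when violated; a minimal sketch using stopifnot() with made-up values:

```r
# Halt with an error if any month falls outside 1..12.
month <- c( 8, 10, 9, 4, 12 )
stopifnot( all( month >= 1 & month <= 12 ) )
```

Running checks like this in the script, rather than by eye, means a bad value in a later version of the data cannot slip through unnoticed.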
Some variables provide no easy way to check, for instance species. One helpful strategy is to count the number of instances of each level, and organize them in descending order of count.
SpeciesCount <- BirdData %>%
group_by( species ) %>%
summarise( count=n() ) %>%
arrange( desc(count) )
nrow(SpeciesCount)
[1] 131
Examine the counts by eye, using a spreadsheet program. To do this, write the data to a CSV file:
write.csv( SpeciesCount, file="SpeciesCount.csv" )
In the count of species, you can see that the first several have large counts. Species with very low counts may be misspellings. Create another variable in the spreadsheet to hold the corrected spelling. Then, read the spreadsheet file back into R, filter out the cases where corrections were made, and join them with the original data. (A join such as left_join() that keeps all the rows in the original table is appropriate.)
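A sketch of that join step, using a made-up one-row corrections table (the column names here are assumptions for illustration, not the real spreadsheet’s):

```r
library( dplyr )
# Made-up data: one correct spelling and one misspelling.
Birds <- data.frame( species=c("Catbird", "Catbrd"),
                     stringsAsFactors=FALSE )
Corrections <- data.frame( species="Catbrd", corrected="Catbird",
                           stringsAsFactors=FALSE )
Fixed <- Birds %>%
  left_join( Corrections, by="species" ) %>%   # keeps all rows of Birds
  mutate( species=ifelse( is.na(corrected), species, corrected ) ) %>%
  select( -corrected )
Fixed$species   # both entries are now "Catbird"
```

Rows without a matching correction get NA in the corrected column, so the ifelse() leaves their original spelling untouched.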
Another useful function, part of base R, is read.csv(). The fread() function is generally better: faster, more flexible, etc.