From Strings to Numbers

You’ve seen two major types of variables: quantitative and categorical. You’re used to using quoted character strings as the levels of categorical variables, and numbers as the values of quantitative variables.

Often, you will encounter data files which have variables whose meaning is numeric but whose representation is a character string. This can occur when one or more cases are given a non-numeric value, e.g., “not available” or “.”.

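To correct this, convert the variable with as.numeric(), applying as.character() first. The intermediate step matters when the variable is stored as a factor: applied directly to a factor, as.numeric() returns the internal level codes rather than the numbers the labels spell out.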
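A small sketch may make the two-step conversion clearer. The vector below is made up for illustration (it is not taken from OrdwayBirds), and it assumes the variable has been stored as a factor.

```r
# Hypothetical values, stored as a factor the way a mis-read column might be
day <- factor(c("12", "7", "not available", "30"))

# Converting directly gives the internal level codes, not the values
codes <- as.numeric(day)

# Going through character first recovers the numbers;
# "not available" becomes NA (R would normally warn about the coercion)
values <- suppressWarnings(as.numeric(as.character(day)))

print(codes)
print(values)   # 12  7 NA 30
```

The NA marks the case that was never numeric, which you can then inspect or filter out.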

For example, in the OrdwayBirds data, the Month, Day and Year variables are all being stored as categorical variables. Convert these to numbers with the following:

OrdwayBirds <- OrdwayBirds %>%
  mutate( Month=as.numeric(as.character(Month)),
          Year=as.numeric(as.character(Year)),
          Day=as.numeric(as.character(Day)))

Exercise: Not all of the Day values in OrdwayBirds are legitimate. Find the ones that are clearly out of place and filter them from the data.

Dates as Dates

Dates are generally written down as character strings, for instance, “29 October 2014”. As you know, dates have a natural order. When you plot values such as “16 December 2014” and “29 October 2014”, you expect the October date to come before the December date, even though the reverse is true alphabetically of the strings themselves.
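To see the difference between alphabetical and chronological order, here is a minimal sketch using the two dates just mentioned. It uses base R's as.Date() and assumes an English locale, so that the month names are recognized.

```r
# The two dates from the text, as character strings
when_chr <- c("29 October 2014", "16 December 2014")

# Sorting strings is alphabetical, so "16 ..." comes before "29 ..."
sorted_chr <- sort(when_chr)

# Assumption: an English locale, so that %B matches "October" and "December"
when_date <- as.Date(when_chr, format = "%d %B %Y")

# Sorting genuine dates is chronological: October, then December
sorted_date <- sort(when_date)

print(sorted_chr)
print(format(sorted_date))
```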

When you plot a value that is numeric, you expect the axis to be marked with a few round numbers. A plot from 0 to 100 might have ticks at 0, 20, 40, 60, 80, 100.
It’s similar for dates. When you are plotting dates within one month, you expect the day of the month to be shown on the axis. But if you are plotting a range of several years, it would be appropriate to show only the years on the axis.
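Base R's pretty() function computes round break points of this kind, and it accepts dates as well as numbers. The sketch below only illustrates the idea; it is not necessarily the rule that ggplot() applies, and the date ranges are invented.

```r
# Round tick positions for a numeric axis running from 0 to 100
num_ticks <- pretty(c(0, 100))
print(num_ticks)   # 0 20 40 60 80 100

# The same idea for dates: one month versus several years
# (both ranges are made up for illustration)
month_ticks <- pretty(as.Date(c("2014-10-01", "2014-10-31")))
years_ticks <- pretty(as.Date(c("2008-01-01", "2014-12-31")))
print(month_ticks)
print(years_ticks)
```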

When you are given dates stored as character strings, it can be useful to convert them to genuine dates. For instance, in the OrdwayBirds data, the Timestamp variable refers to the time the data were transcribed from the original lab notebook to the computer file. You can translate the character string into a genuine date using functions from the lubridate package. Consider a few of the date character strings:

OrdwayBirds %>% select( Timestamp ) %>% sample_n( size=3 )
               Timestamp
7523   2/17/2011 9:59:06
5109  1/24/2011 15:17:38
14977  4/4/2012 13:34:05

These dates are written in a format showing month/day/year hour:minute:second. The mdy_hms() function converts strings in this format to a genuine date-time value. As an example, suppose you want to examine when the entries were transcribed and who transcribed them. You might create a small data table like this.

library( lubridate )
WhenAndWho <- OrdwayBirds %>% 
  select(Who=DataEntryPerson, When=Timestamp) %>%
  mutate( When=mdy_hms(When) )

And as a plot …

ggplot( WhenAndWho, aes(When, Who)) + geom_point( alpha=0.2 ) 

Many of the same operations that apply to numbers can be used on dates. For example:

WhenAndWho %>% 
  group_by( Who ) %>% 
  summarise( start=min(When,na.rm=TRUE),
             finish=max(When, na.rm=TRUE)) %>%
  mutate( duration=finish-start)
Source: local data frame [9 x 4]

                   Who               start              finish      duration
1                                     <NA>                <NA>       NA secs
2        Abby Colehour 2011-04-23 15:50:24 2011-04-23 15:50:24        0 secs
3   Brennan Panzarella 2010-09-13 10:48:12 2011-04-10 21:58:56 18097844 secs
4        Caitlin Baker 2010-05-13 16:00:30 2010-05-28 19:41:52  1309282 secs
5        Emily Merrill 2010-06-08 09:10:01 2010-06-08 14:47:21    20240 secs
6         Jerald Dosch 2010-04-14 13:20:56 2010-04-14 13:20:56        0 secs
7         Jolani Daney 2010-06-08 09:03:00 2011-05-03 10:12:59 28429799 secs
8 Keith Bradley-Hewitt 2010-09-21 11:31:02 2011-05-06 17:36:38 19634736 secs
9 Mary Catherine Muñiz 2012-02-02 08:57:37 2012-04-30 14:06:27  7621730 secs
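Durations reported in seconds are hard to read. As a minimal sketch, the start and finish times from one row of the table (Caitlin Baker) can be subtracted with explicit units using base R's difftime(). The UTC time zone is an assumption made here so the example is reproducible.

```r
# Start and finish for one row of the table above (time zone assumed to be UTC)
start  <- as.POSIXct("2010-05-13 16:00:30", tz = "UTC")
finish <- as.POSIXct("2010-05-28 19:41:52", tz = "UTC")

# Subtraction with explicit units, converted to plain numbers
in_secs <- as.numeric(difftime(finish, start, units = "secs"))
in_days <- as.numeric(difftime(finish, start, units = "days"))

# Comparisons work as they do for numbers
is_later <- finish > start

print(in_secs)    # 1309282, matching the duration column
print(in_days)    # a little over 15 days
print(is_later)   # TRUE
```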

There are many similar lubridate functions for converting strings in other formats into dates, e.g. ymd(), dmy(), and so on. There are also functions such as hour() and yday() that extract one part of a date, here the hour of the day and the day of the year.
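Here is a brief sketch of a few of these functions. The first three strings are invented spellings of the same day, and parsing the month name assumes an English locale; the time stamp is one of the Timestamp strings shown earlier.

```r
library(lubridate)

# The same day written three ways; each parser is named for the order of the parts
d_ymd <- ymd("2014-10-29")
d_dmy <- dmy("29 October 2014")   # assumes an English locale for the month name
d_mdy <- mdy("10/29/2014")
print(c(d_ymd, d_dmy, d_mdy))

# One of the Timestamp strings shown earlier
when <- mdy_hms("1/24/2011 15:17:38")

print(hour(when))   # 15
print(yday(when))   # 24, since 24 January is the 24th day of the year
```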

Exercise

What does this plot tell you?

WhenAndWho %>% 
  ggplot( aes( x=Who, y=hour(When))) + geom_violin() + coord_flip()

Small Project (Optional)

Find the entries in OrdwayBirds where the species name is misspelled. Do these mistakes tend to happen at certain times of day? Make an appropriate graphic.