The OrdwayBirds data frame is a historical record of birds captured and released at the Katharine Ordway Natural History Study Area, a 278-acre preserve in Inver Grove Heights, Minnesota, owned and managed by Macalester College. Originally written by hand in a field notebook, the entries have been transcribed into electronic format under the supervision of Jerald Dosch, Dept. of Biology, Macalester College.
Due to mistakes in data entry, the SpeciesName variable needs some fixing. SpeciesName is intended to identify the species of each bird, but the spelling often varies among birds of the same biological species. This leads to mis-classification of birds. There are also problems with the Month and Day variables; they are supposed to be numerical, but mistakes prevent them from being correctly identified as such.
Fortunately, all these errors are easy to correct. The data frame OrdwaySpeciesNames collects all the variant spellings. Entry by entry, each mis-spelling was translated (by a human) into a standardized spelling. Thus, join() can be used to correct the mis-spellings in the OrdwayBirds table.
You are going to look at the month-to-month presence of different species. Think of your assignment as creating a manual for birders to guide them to the correct time of year to visit Ordway to see a particular species.
There are many variables that you won’t need for this activity, and you still have to fix the Month and Day variables. To keep things simple, copy and paste this command into a chunk at the start of the document.
The OrdwayBirds data are available in the dcData package.
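library(dplyr)  # provides %>%, select(), and mutate()
data("OrdwayBirds", package = "dcData")  # load the raw table
OrdwayBirds <-
  OrdwayBirds %>%
  select( SpeciesName, Month, Day ) %>%  # keep only the variables needed here
  mutate( Month = as.numeric(as.character(Month)),
          Day = as.numeric(as.character(Day)))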
The mutate() step is part of the data cleaning process: it converts Month and Day to numerical variables, as originally intended by the folks entering the data.
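To see why the conversion goes through as.character() first, here is a minimal sketch on a toy factor. The values are made up, and the sketch assumes Month and Day are stored as factors, which the as.character() step suggests but the text does not state. Calling as.numeric() directly on a factor returns its internal level codes rather than the numbers that were typed in.

```r
# Hypothetical stand-in for a Month column that was read in as a factor
month_raw <- factor(c("7", "11", "7", "unknown"))

# Direct conversion returns the factor's internal level codes, not the months
codes <- as.numeric(month_raw)

# Converting to character first recovers the typed values;
# entries that are not numbers become NA (R warns about this, silenced here)
months <- suppressWarnings(as.numeric(as.character(month_raw)))

print(codes)
print(months)
```

Entries that cannot be read as numbers end up as NA, which is why a later step in the activity drops missing values.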
Including mis-spellings, how many different species are there in the OrdwayBirds data?
Consider the OrdwaySpeciesNames data frame, also found in the dcData package.
How many distinct species are there in the SpeciesNameCleaned variable in OrdwaySpeciesNames?
You will find it helpful to use n_distinct(), a reduction function that counts the number of unique values in a variable.
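As a small illustration of n_distinct() used inside summarise(), the sketch below works on a made-up table (toy_birds and its values are hypothetical, not taken from OrdwayBirds). It also shows that a missing value counts as one more distinct value unless na.rm = TRUE is given.

```r
library(dplyr)

# Hypothetical species names, including a variant spelling and a missing entry
toy_birds <- data.frame(
  SpeciesName = c("Robin", "robin", "Robin", "Blue Jay", NA)
)

species_counts <- toy_birds %>%
  summarise(
    with_na    = n_distinct(SpeciesName),               # NA counts as a value
    without_na = n_distinct(SpeciesName, na.rm = TRUE)  # NA ignored
  )

print(species_counts)
```

Note that "Robin" and "robin" are counted separately, which is exactly the kind of inflation the variant spellings cause in the real data.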
Use the OrdwaySpeciesNames table to create a new data frame that corrects the mis-spellings in SpeciesName. This can be done easily using the inner_join() data verb.
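data("OrdwaySpeciesNames", package = "dcData")  # load the spelling lookup table
Corrected <-
  OrdwayBirds %>%
  inner_join( OrdwaySpeciesNames ) %>%  # matches on the variable name(s) the two tables share
  select( Species = SpeciesNameCleaned, Month, Day ) %>%
  na.omit()  # drop the rows with missing values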
Look at the names of the variables in OrdwaySpeciesNames and OrdwayBirds.
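One way to check which variable a join will match on is to compare the column names of the two tables. The sketch below uses two small made-up tables (birds and lookup are hypothetical stand-ins, not the real data) and assumes the lookup table pairs each variant spelling with a cleaned one.

```r
library(dplyr)

# Hypothetical capture records with variant spellings
birds <- data.frame(
  SpeciesName = c("Robbin", "Robin", "Blue Jay", "???"),
  Month       = c(4, 5, 5, 6)
)

# Hypothetical lookup table: variant spelling -> standardized spelling
lookup <- data.frame(
  SpeciesName        = c("Robbin", "Robin", "Blue Jay"),
  SpeciesNameCleaned = c("Robin", "Robin", "Blue Jay")
)

# Columns present in both tables are the ones a natural join matches on
shared <- intersect(names(birds), names(lookup))
print(shared)

# inner_join() keeps only the rows of birds that have a match in lookup
fixed <- birds %>%
  inner_join(lookup, by = "SpeciesName") %>%
  select(Species = SpeciesNameCleaned, Month)

print(fixed)
```

The row with the unrecognized name has no match in the lookup table, so the inner join drops it.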
How many bird captures are reported for each of the (corrected) species?
Name the variable that contains the total: count. Arrange the result in descending order, starting with the species with the most birds, and look through the list. Hint: Remember n(). Also, one of the arguments to one of the data verbs will be desc(count) to arrange the cases into descending order. Display the top 10 species in terms of the number of bird captures.
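The pattern described in the hint can be sketched on a made-up table (toy_corrected is a hypothetical stand-in for the corrected data, with invented rows). It is one possible arrangement of the verbs, not the only one.

```r
library(dplyr)

# Hypothetical corrected captures, one row per bird
toy_corrected <- data.frame(
  Species = c("Robin", "Robin", "Blue Jay", "Robin", "Blue Jay", "Wren")
)

counts <- toy_corrected %>%
  group_by(Species) %>%
  summarise(count = n()) %>%  # n() counts the rows in each group
  arrange(desc(count))        # largest counts first

# Show at most the first 10 rows
print(head(counts, 10))
```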
Define for yourself a “major species” as a species with more than a particular threshold count. Set your threshold so that there are 5 or 6 species designated as major.
Filter to produce a data frame with only the birds that belong to a major species. Hint: Remember that summary functions can be used case-by-case when filtering or mutating a data frame that has been grouped.
Save the output in a table called Majors.
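To illustrate the hint, the sketch below filters a grouped toy table with n(). The table, the name toy_majors, and the threshold of 2 are all hypothetical; the threshold for the real data has to be chosen from the species counts.

```r
library(dplyr)

# Hypothetical corrected captures, one row per bird
toy_corrected <- data.frame(
  Species = c("Robin", "Robin", "Blue Jay", "Robin", "Blue Jay", "Wren"),
  Month   = c(4, 4, 5, 5, 5, 6)
)

threshold <- 2  # made-up cutoff for this toy table

toy_majors <- toy_corrected %>%
  group_by(Species) %>%
  filter(n() >= threshold) %>%  # n() is evaluated once per species group
  ungroup()

print(toy_majors)
```

The result still has one row per bird; only the birds from species below the cutoff are gone.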
When you have correctly produced Majors, write a command that produces the month-by-month count of each of the major species. Call this table ByMonth.
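A month-by-month count amounts to grouping by two variables at once. The sketch below shows the idea on a made-up table (toy_majors and its rows are hypothetical), and it assumes a dplyr version that supports the .groups argument of summarise().

```r
library(dplyr)

# Hypothetical birds from two "major" species
toy_majors <- data.frame(
  Species = c("Robin", "Robin", "Robin", "Blue Jay", "Blue Jay"),
  Month   = c(4, 4, 5, 5, 5)
)

toy_by_month <- toy_majors %>%
  group_by(Species, Month) %>%                # one group per species-month pair
  summarise(count = n(), .groups = "drop")    # count birds, then drop grouping

print(toy_by_month)
```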
Display this month-by-month count with a bar chart arranged in a way that you think tells the story of what time of year the various species appear. You can use mplot() to explore different possibilities. Warning: mplot() and similar interactive functions should not appear in your Rmd file; they need to be used interactively from the console. Use the “Show Expression” button in mplot() to create an expression that you can copy and paste into a chunk in your Rmd document, so that the graph gets created when you compile it.
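For reference, a non-interactive chart expression that can live in an Rmd chunk might look like the sketch below. It is written directly with ggplot2 rather than copied from mplot(), the table toy_by_month and its counts are made up, and the layout (one panel per species) is just one option among several.

```r
library(ggplot2)

# Hypothetical month-by-month counts for two species
toy_by_month <- data.frame(
  Species = c("Robin", "Robin", "Blue Jay", "Blue Jay"),
  Month   = c(4, 5, 5, 6),
  count   = c(12, 7, 9, 3)
)

p <- ggplot(toy_by_month, aes(x = factor(Month), y = count)) +
  geom_col() +                        # bar heights taken from the count column
  facet_wrap(~ Species, ncol = 1) +   # one panel per species, stacked vertically
  labs(x = "Month", y = "Number of captures")

print(p)
```

Stacking the panels vertically keeps the month axis aligned, which makes it easier to compare when each species is present.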
Once you have the graph, use it to answer these questions:
(n_distinct() and >= 6.)