Week 4

Functions and their roles

Identify each of these functions as a Data Verb, a Transformation, a Summary Function, a Quick Presentation, or a Comparison Expression.

  • str()
  • group_by()
  • rank()
  • mean()
  • filter()
  • summary()
  • summarise()
  • anti_join()
  • merge()
  • glimpse()

Diamonds

These questions refer to the diamonds data table. Take a look at the codebook (using help()) so that you’ll understand the meaning of the tasks.[1] Using paper and pen, write an expression that will answer each of these questions.

  • What’s the largest diamond depth observed for each clarity group?
  • What color of diamond is most common for each of the different cuts?

Output from Summarising

Imagine a data table, Patients, with the categorical variables name, diagnosis, and sex, and the quantitative variable age.

You have a statement in the form

Patients %>%
  group_by(  some variables ) %>%
  summarise( count=n(), meanAge=mean(age) )

Replacing some variables with each of the following, say …

  • what variables will appear in the output
  • whether meanAge will contain any new information.
  1. sex
  2. diagnosis
  3. sex, diagnosis
  4. age, diagnosis
  5. age
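
To check your answers at the console, the statement can be run on a small stand-in table. The sketch below is only an illustration: the rows of Patients are invented (only the column names come from the exercise), it assumes the dplyr package is installed, and it groups by name, which is not one of the five cases, so the answers are still left to you.

```r
library(dplyr)

# Hypothetical rows, invented for illustration; only the column
# names (name, diagnosis, sex, age) come from the exercise.
Patients <- data.frame(
  name      = c("Ana", "Ben", "Cho", "Dev", "Eli", "Fay"),
  diagnosis = c("flu", "flu", "asthma", "asthma", "flu", "asthma"),
  sex       = c("F", "M", "F", "M", "M", "F"),
  age       = c(34, 51, 34, 67, 51, 29),
  stringsAsFactors = FALSE
)

# The statement from the exercise, with name standing in for
# "some variables". Swap in each of the five cases to compare.
ByName <- Patients %>%
  group_by(name) %>%
  summarise(count = n(), meanAge = mean(age))

print(ByName)
```

Because every name in this made-up table is distinct, each group holds a single patient. Compare the meanAge column with the original age column when deciding whether a summary adds new information.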

Wide and Narrow

Here are three data tables with the same information:

The same data presented in three versions

Version One

    name     sex year nbabies
    Harrison M   1912     170
    Roderick M   1912      46
    Terry    F   1912      17
    Terry    M   1912      49
    Harrison F   2012      15
    Harrison M   2012    2120
    Roderick M   2012     202
    Terry    F   2012      17
    Terry    M   2012     479

Version Two

    name     year   F    M
    Harrison 1912  NA  170
    Harrison 2012  15 2120
    Roderick 1912  NA   46
    Roderick 2012  NA  202
    Terry    1912  17   49
    Terry    2012  17  479

Version Three

    name     sex 1912 2012
    Harrison F     NA   15
    Harrison M    170 2120
    Roderick M     46  202
    Terry    F     17   17
    Terry    M     49  479

  1. What is the meaning of a case in each of the tables?
    • Version One
    • Version Two
    • Version Three
  2. Comparing Version One to Version Two, which table is narrow and which one is wide?
  3. What “key” variable from the narrow table is being used?
  4. There are no NAs in Version One, but there are in Versions Two and Three. Why?
  5. Version Two has 6 cases, while Version Three has only 5 cases. How can they contain the same information?
  6. Version Three was “spread” from Version One. What variable was used to denote the spread columns?
  7. Version One can be created by gathering columns from Version Two.
    • Which variables from Two were gathered into One?
    • What “key” variable, not explicitly named in Version Two, does appear in Version One?
    • From where in Version Two were the values taken to use as levels in the key variable created for Version One?
  • Suppose you want to create the following table, which shows the least popular name for each sex in each year:

    Source: local data frame [4 x 4]
    Groups: year, sex
    
          name sex year nbabies
    1 Roderick   M 1912      46
    2    Terry   F 1912      17
    3 Harrison   F 2012      15
    4 Roderick   M 2012     202
    What should the chain of commands look like to make this from Version One?
  • Suppose you want to calculate the ratio of female to male babies for each name in each year, like this:

    Source: local data frame [6 x 3]
    
          name year    ratio
    1 Harrison 1912       NA
    2 Harrison 2012 0.007075
    3 Roderick 1912       NA
    4 Roderick 2012       NA
    5    Terry 1912 0.346939
    6    Terry 2012 0.035491
    • Would you rather start from Version Two or Version Three?
    • If you were given Version One, would you rather work directly on that with the data verbs or, first, translate to one of the other forms?
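
To experiment with these questions at the console, Version One can be typed in by hand. The sketch below uses only base R. The object name VersionOne is an assumption made for illustration, and the values are copied from the Version One table in this section.

```r
# Version One, entered by hand from the table in this section.
# The object name VersionOne is hypothetical.
VersionOne <- data.frame(
  name    = c("Harrison", "Roderick", "Terry", "Terry", "Harrison",
              "Harrison", "Roderick", "Terry", "Terry"),
  sex     = c("M", "M", "F", "M", "F", "M", "M", "F", "M"),
  year    = c(1912, 1912, 1912, 1912, 2012, 2012, 2012, 2012, 2012),
  nbabies = c(170, 46, 17, 49, 15, 2120, 202, 17, 479),
  stringsAsFactors = FALSE
)

# Quick checks that the entry matches the printed table:
# nine cases and no missing values.
n_cases     <- nrow(VersionOne)
any_missing <- anyNA(VersionOne)

str(VersionOne)
```

From here you can try your own chain of data verbs, or your own spread and gather steps, and compare the result with the tables shown above.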


Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.


  1. Motivated by this problem set based on drills by Garrett Grolemund.