Week 4

Functions and their roles

Identify each of these functions as a Data Verb, a Transformation, a Summary Function, a Quick Presentation, or a Comparison Expression.

- str(): Quick Presentation
- group_by(): Data Verb
- rank(): Transformation
- mean(): Summary Function
- filter(): Data Verb
- summary(): Quick Presentation
- summarise(): Data Verb
- anti_join(): Data Verb
- merge(): Data Verb
- glimpse(): Quick Presentation

Diamonds

These questions refer to the diamonds data table. Take a look at the codebook (using help()) so that you’ll understand the meaning of the tasks.¹ Write, using paper and pen, an expression that will answer these questions.

What’s the largest diamond depth observed for each clarity group?
What color of diamond is most common for each of the different cuts?
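
As a warm-up on the pattern that both questions build on, here is a minimal sketch that answers a different question: the highest price observed for each cut. It assumes that the diamonds table comes from the ggplot2 package and that dplyr is installed. The names max_price_by_cut and max_price are made up for illustration.

```r
library(dplyr)
library(ggplot2)  # assumed source of the diamonds data table

# Split the cases by cut, then reduce each group to a single number.
max_price_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(max_price = max(price))

print(max_price_by_cut)
```

The result has one case per level of the grouping variable, which is the shape the first question asks for.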

Output from Summarising

Imagine a data table, Patients, with categorical variables name, diagnosis, sex, and quantitative variable age.

You have a statement in the form

Patients %>%
  group_by(  some variables ) %>%
  summarise( count=n(), meanAge=mean(age) )

For each of the following replacements for some variables, say:

- which variables will appear in the output
- whether meanAge will contain any new information.

sex
diagnosis
sex, diagnosis
age, diagnosis
age
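
To check your answers by experiment, you might build a small stand-in for Patients. The sketch below is only an illustration: every row is invented, and it groups by name, which is not one of the cases listed above. Because each name occurs once in this made-up table, every group holds a single patient, so count is always 1 and meanAge just repeats age.

```r
library(dplyr)

# Hypothetical data: these rows are invented for illustration only.
Patients <- data.frame(
  name      = c("Ann", "Bob", "Cal", "Dee", "Eve", "Fay"),
  diagnosis = c("flu", "flu", "asthma", "asthma", "flu", "asthma"),
  sex       = c("F", "M", "M", "F", "F", "F"),
  age       = c(34, 51, 27, 34, 62, 45),
  stringsAsFactors = FALSE
)

# The statement from the exercise, with name as the grouping variable.
by_name <- Patients %>%
  group_by(name) %>%
  summarise(count = n(), meanAge = mean(age))

print(by_name)
```

Swapping name for each of the listed choices shows which columns survive and whether meanAge summarises more than one case.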

Wide and Narrow

Here are three data tables with the same information:

The same data presented in three versions

Version One

name	sex	year	nbabies
Harrison	M	1912	170
Roderick	M	1912	46
Terry	F	1912	17
Terry	M	1912	49
Harrison	F	2012	15
Harrison	M	2012	2120
Roderick	M	2012	202
Terry	F	2012	17
Terry	M	2012	479

Version Two

name	year	F	M
Harrison	1912	NA	170
Harrison	2012	15	2120
Roderick	1912	NA	46
Roderick	2012	NA	202
Terry	1912	17	49
Terry	2012	17	479

Version Three

name	sex	1912	2012
Harrison	F	NA	15
Harrison	M	170	2120
Roderick	M	46	202
Terry	F	17	17
Terry	M	49	479

What is the meaning of a case in each of the tables?
- Version One
- Version Two
- Version Three
Comparing Version One to Version Two, which table is narrow and which one is wide?
What “key” variable from the narrow table is being used?
There are no NAs in Version One, but there are in Versions Two and Three. Why?
Version Two has 6 cases, while Version Three has only 5 cases. How can they contain the same information?
Version Three was “spread” from Version One. What variable was used to denote the spread columns?
Version One can be created by gathering columns from Version Two.
- Which variables from Two were gathered into One?
- What “key” variable, not explicitly named in Version Two, does appear in Version One?
- Where in Version Two were the values taken from to use as levels of the key variable created for Version One?
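
If the mechanics of spreading and gathering are unfamiliar, the following sketch shows both operations on a separate, made-up table, so it does not answer the questions above. It assumes the tidyr package. The table Sales and its variables city, quarter, and units are hypothetical.

```r
library(tidyr)

# Hypothetical narrow table: one case per (city, quarter) pair.
Sales <- data.frame(
  city    = c("Oslo", "Oslo", "Lima", "Lima", "Cairo"),
  quarter = c("Q1", "Q2", "Q1", "Q2", "Q2"),
  units   = c(10, 12, 7, 9, 4),
  stringsAsFactors = FALSE
)

# Narrow to wide: each level of the key variable becomes its own column.
# Cairo has no Q1 row, so its Q1 cell is filled with NA.
SalesWide <- spread(Sales, key = quarter, value = units)

# Wide back to narrow: the Q1 and Q2 columns are stacked into a
# key/value pair; na.rm = TRUE drops the cell that was only padding.
SalesNarrow <- gather(SalesWide, key = quarter, value = units, Q1, Q2,
                      na.rm = TRUE)

print(SalesWide)
print(SalesNarrow)
```

The wide table has fewer cases than the narrow one, yet gathering recovers every original row, which is the same relationship the three versions above have to one another.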

Suppose you want to create the following table, which shows the least popular name for each sex in each year:

Source: local data frame [4 x 4]
Groups: year, sex

      name sex year nbabies
1 Roderick   M 1912      46
2    Terry   F 1912      17
3 Harrison   F 2012      15
4 Roderick   M 2012     202

What should the chain of commands look like to make this from Version One?

Suppose you want to calculate the ratio of female to male babies for each name in each year, like this:
```
Source: local data frame [6 x 3]

      name year    ratio
1 Harrison 1912       NA
2 Harrison 2012 0.007075
3 Roderick 1912       NA
4 Roderick 2012       NA
5    Terry 1912 0.346939
6    Terry 2012 0.035491
```
- Would you rather start from Version Two or Version Three?
- If you were given Version One, would you rather work directly on that with the data verbs or, first, translate to one of the other forms?

Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.

¹ Motivated by a problem set based on drills by Garrett Grolemund.