Chapter 10 More Data Verbs

You have already seen two data verbs:

summarise()
group_by()

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

Result <- 
  BabyNames %>% 
  group_by( sex, year ) %>%
  summarise( total = sum( count ) )

Table 10.1: Grouping BabyNames by sex and year then using summarise() to add up the total births for each group.

sex	year	total
F	1892	212347
M	1892	122038
F	1935	1048395
M	1935	1040892
F	1978	1531592
M	1978	1642137

This chapter introduces four new data verbs that, like summarise() and group_by(), perform their action on a single data frame.

select()
filter()
mutate()
arrange()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

10.1 Choosing variables with select()

Selecting from a data frame means choosing one or more variables from the table, creating a new table that has just a subset of the variables in the original. For now, think of select() as enabling you to simplify a table to eliminate variables that you don’t need or to rename a variable.

The use of select() is straightforward. It takes a data frame as input along with the names of one or more variables. To illustrate, consider the CountryData table in the dcData package. There are 76 variables in CountryData.

names(CountryData)

 [1] "country"      "area"         "pop"          "growth"       "birth"       
 [6] "death"        "migr"         "maternal"     "infant"       "life"        
[11] "fert"         "health"       "HIVrate"      "HIVpeople"    "HIVdeath"    
   ... and so on ...
[61] "petroImp"     "gasProd"      "gasCons"      "gasExp"       "gasImp"      
[66] "gasRes"       "mainlines"    "cell"         "netHosts"     "netUsers"    
[71] "airports"     "railways"     "roadways"     "waterways"    "marine"      
[76] "military"

By modern standards for computing, 76 is not at all a large number of variables. But for a printed display, it would be ungainly to have to print all columns. Instead, we might choose to show only the variables that are most important for our analysis. To display fewer columns, use select, as was used in Table 10.2 to subset the variables of the original CountryData data frame and create the smaller Demographics data frame.

Demographics <- 
  CountryData %>% 
  select(country, pop, area)

Table 10.2: The Demographics table created by selection from CountryData.

country	pop	area
Afghanistan	31822848	652230
Akrotiri	15700	123
Albania	3020209	28748
Algeria	38813722	2381741
American Samoa	54517	199
… and so on for 256 rows altogether.

If you want to rename a variable, use rename(). The arguments to rename() specify which variables to rename and what the new name will be. For instance, to change area to kmSquared use the argument kmSquared = area. (Table 10.3)

Table 10.3: Renaming the area variable from Table 10.2.

country	pop	kmSquared
Afghanistan	31822848	652230
Akrotiri	15700	123
Albania	3020209	28748
Algeria	38813722	2381741
American Samoa	54517	199
Andorra	85458	468
… and so on for 256 rows altogether.

10.2 Removing cases with filter()

To filter means to remove unwanted material. The data verb filter() removes unwanted cases, passing through to the result only those cases that are wanted or needed.11 Filtering and selecting are complementary actions for subsetting a data frame. select() passes the specified subset of variables; filter() passes the specified subset of cases.

With filtering, the cases to pass through are specified by one or more criteria or tests. The tests usually involve comparisons between variables or between a variable and a scalar. Table 10.9 gives examples of comparison functions and their uses.

For instance, here’s how you can filter out the boys, producing a result with only the girls’ names, shown in Table 10.4.

BabyNames %>% 
  filter(sex != "M")

Table 10.4: Using filter() to exclude the males.

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
… and so on for 1,062,432 rows altogether.

Or, you could filter on the basis of year, for instance, keeping the cases for either sex for babies born after 1990:

BabyNames %>% 
  filter( year > 1990 )

Table 10.5: Baby names after 1990.

name	sex	count	year
Ashley	F	43483	1991
Jessica	F	43394	1991
Brittany	F	29089	1991
… and so on for 697,983 rows altogether.

You can specify as many tests as you like. For instance, to keep only the cases for females born after 1990:

BabyNames %>% 
  filter(year > 1990, sex == "F")

Table 10.6: Filtered for rows that satisify all of multiple conditions: after 1990 and sex "F".

name	sex	count	year
Ashley	F	43483	1991
Jessica	F	43394	1991
Brittany	F	29089	1991
… and so on for 413,105 rows altogether.

The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990. To specify either-or, use the vertical bar | rather than a comma to separate the conditions.

BabyNames %>% 
  filter(year > 1990 | sex == "F")

Table 10.7: Filtered for rows that satisfy any of multiple conditions: after 1990 or sex "F".

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
… and so on for 1,347,310 rows altogether.

It’s also possible to test for a variable being any of several different values. For instance, Table 10.8 includes just the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010))

Table 10.8: Filtering to keep only babies born in 1980, 1990, or 2000.

name	sex	count	year
Jennifer	F	58385	1980
Amanda	F	35818	1980
Jessica	F	33920	1980
Melissa	F	31631	1980
Sarah	F	25744	1980
Heather	F	19971	1980
… and so on for 107,935 rows altogether.

Table 10.9: Examples of the sorts of comparisons used as criteria in filter().

Comparison	Function	Meaning	Example
Exact equality	`==`	`year == 1995`	matches a number
		`name == "Julian"`	matches a character string
Greater than	`>`	`year > 1995`	years after 1995
		`name > "C"`	names starting with “D” or after
Less than	`<`	`year < 1995`	years before 1995
		`name < "C"`	names starting with “A” or “B”
Less than or equal to	`<=`	`year <= 1995`	year is 1995 or before
		`name <= "C"`	names alphabetically before “C”
Greater than or equal to	`>=`		Analogous to `<=`

Contained in	`%in%`	`year %in% c(1940, 1941)`	matches either 1940 or 1941
		`name %in% c("April","May","June")`	A springtime name

10.3 Creating variables with mutate()

The word “mutate” means to change in form or nature. The data verb mutate() is more specific: to change a variable or add new variables based on the existing ones. Mutation leaves the cases exactly as they were; the output of mutate() will have exactly as many rows as the input data frame.

To illustrate, consider Table 10.2. It has variables pop and area giving the population and area (in km\(^2\)) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Computing the density is a matter of using a transformation verb: divide population by area. The mutate() verb let’s you carry out this transformation and put the result in a new variable. For instance:22 Note that in this R statement Demographics is the input to mutate() and the result is being stored under the name Demographics. This style can be read as “revise the Demographics table and then….” Of course, you could choose to give the result a new name or to pipe the result to another stage of data wrangling, etc.

Demographics <-
  Demographics %>% 
  mutate(pop_density = pop / area)

Table 10.10: Demographics with a pop_density variable added.

country	pop	area	pop_density
Afghanistan	31822848	652230	48.79084
Akrotiri	15700	123	127.64228
Albania	3020209	28748	105.05806
Algeria	38813722	2381741	16.29637
American Samoa	54517	199	273.95477
Andorra	85458	468	182.60256
… and so on for 256 rows altogether.

The arguments to mutate() specify what the new variable or variables are to be called (pop_density above) and how to calculate the values of those variables. This calculation will always involve a transformation variable, just as the summarise() verb always involves a reduction verb.

10.4 Setting in order with arrange()

To “arrange” means to set in order. The arrange() data verb sorts out the cases in an order that you specify: it might be alphabetical or numeric, in ascending or descending order. arrange() does not change the variables in the data frame or filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, Table 10.11 shows the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() )

Table 10.11: Ballot count of first-choice candidates.

First	total
ABDUL M RAHAMAN “THE ROCK”	338
ALICIA K. BENNETT	351
BETSY HODGES	28935
BILL KAHN	97
BOB “AGAIN” CARNEY JR	56
BOB FINE	2094
… and so on for 38 rows altogether.

The ascending alphabetical order in Table 10.11 might be good for some purposes. But if your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending descending numeric order as in Table 10.12. By default, arrange() puts cases in ascending order: from smallest to largest for numbers and from A to Z for character strings. You can set the order to be descending by using the desc() function.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() ) %>%
  arrange( desc(total) )

Table 10.12: Arranging the candidates in descending order by vote total.

First	total
BETSY HODGES	28935
MARK ANDREW	19584
DON SAMUELS	8335
CAM WINTON	7511
JACKIE CHERRYHOMES	3524
BOB FINE	2094
… and so on for 38 rows altogether.

10.5 Even more data verbs

A data verb is any function that takes a data frame as input and gives a data frame as the return value. In addition to the ones already introduced, you may find these useful in your work:

Data Verb	Purpose	Example	Arguments
`head()`	Grab the first few rows	`BabyNames %>% head(5)`	Number of rows to grab
`tail()`	Grab the last few rows	`BabyNames %>% tail(3)`	Number of rows
`transmute()`	Create new variables	Like `mutate()`, but only returns the new variables
`rename()`	Rename variables	Does not affect variables that are not referenced	`newName = oldName`
`sample_n()`	Random sample of rows	`BabyNames %>% sample_n(size = 5)`	Size of sample.

10.6 Summary functions

To look at a data frame, these functions are useful. Each takes the data frame as the only argument.

Function	Purpose
`glimpse()`	Quick summary of the table
`str()`	Quick summary
`summary()`	Quick summary
`nrow()`	How many cases in the data frame
`ncol()`	How many variables in the data frame
`names()`	The variable names
`View()`	Shows the table like a spreadsheet. In RStudio only.

10.7 Exercises

Problem 10.1: Identify each of these functions as either a Data Verb, a Transformation, a Reduction Function, or a Summary Function. (Hint: If you are unfamiliar with the function, use help().)

str()
group_by()
rank()
mean()
filter()
summary()
summarise()
glimpse()

Problem 10.2: Each of these tasks can be performed using a single data verb. For each task, say which verb it is:

Add a new column that is the ratio between two variables.
Sort the cases in descending order of a variable.
Create a new data table that includes only those cases that meet a criterion.
From a data table with three categorical variables A, B, & C, and a quantitative variable X, produce an output that has the same cases but only the variables A and X.
From a data table with three categorical variables A, B, & C, and a quantitative variable X, produce an output that has a separate case for each of the combinations of the levels of A and B. (Hints: This requires two different data verbs. It might be easier to see the answer if the problem statement added, “and gives the maximum value of X over all the cases that have a given combination of A and B.”)

Problem 10.3: For each of these computations, say what R function is the most appropriate:

Count the number of cases in a data table.
List the names of the variables in a data table.
For data tables in an R package, display the documentation (“codebook”) for the data table.
Load a package into your R session.
Mark a data table as grouped by one or more categorical variables.

Problem 10.4: Here is a reminder of what the BabyNames data frame looks like. Take this table as the input to the data wranging statement you will construct.

name	sex	count	year
Juana	F	121	1956
Stamatia	F	5	1981
Kenesha	F	16	2004
Aubreigh	F	63	2010
Yadriel	M	17	2012
Mariam	F	52	1922
… and so on for 1,792,091 rows altogether.

For each of the following outputs, identify the data verb linking the input to the output and write down the details (i.e., arguments) of the operation. (Hints: Look at the number of cases. The order of variables may also be a clue.)

Output Table A

name	sex	count	year
Stamatia	F	5	1981
Cathrin	F	5	1916
Kenesha	F	16	2004
Mariam	F	52	1922
Lue	F	58	1957
Aubreigh	F	63	2010
… and so on for 1,792,091 rows altogether.

Output Table B

name	sex	count	year
Juana	F	121	1956
Stamatia	F	5	1981
Kenesha	F	16	2004
Aubreigh	F	63	2010
Mariam	F	52	1922
Adelaide	F	471	1920
… and so on for 1,062,432 rows altogether.

Output Table C

name sex count year

Yadriel M 17 2012

Ferdie M 6 1915

… and so on for 729,659 rows altogether.
Output Table D

total

333417770
Output Table E

name count

Juana 121

Stamatia 5

Kenesha 16

Aubreigh 63

Yadriel 17

Mariam 52

… and so on for 1,792,091 rows altogether.

name	sex	count	year
Yadriel	M	17	2012
Ferdie	M	6	1915
… and so on for 729,659 rows altogether.

total
333417770

name	count
Juana	121
Stamatia	5
Kenesha	16
Aubreigh	63
Yadriel	17
Mariam	52
… and so on for 1,792,091 rows altogether.

Problem 10.5: In the ranked-choice ballot system that underlies the Minneapolis2013 data available in dcData, once a voter has picked a first choice candidate, there is no advantage in listing that same candidate as second or third choice.

In answering each of the following, you should include three things:

A statement in English (using data verbs!) of your strategy for carrying out the calculation.
The implementation of the calculation in R.
The data table that is the result of the calculation. This will be printed automatically from (b). You do not have to format the result beautifully. It’s sufficient to show the first few lines in the data table appear. (Remember head()).

Here are the questions.

How many people chose the same candidate for both First and Second place?
Of the ballots where the First and Second place choices are the same, what were the top 3 choices?
Of the people who selected Ole Savior for First, what were the top three Second choices?

Problem 10.6: These questions refer to the diamonds data table in the ggplot2 package. Take a look at the codebook (using help()) so that you’ll understand the meaning of the tasks. (Motivated by Garrett Grolemund.)

Each of the following tasks can be accomplished by a statement of the form

diamonds %>% 
  verb1( args1 ) %>% 
  verb2( args2 ) %>% 
  arrange(desc( args3 )) %>% 
  head( 1 )

For each task, give appropriate R functions or arguments to substitute in place of verb1, verb2, args1, args2, and args3.

Which color diamonds seem to be largest on average (in terms of carats)?
Which clarity of diamonds has the largest average “table” per carat?

Problem 10.7: In the 2013 Minneapolis mayoral election, there were the 36 candidates as well as no choice (“undervote”) or disallowed multiple choices (“overvote”).] This means there are \(38 \times 38 = 1444\) different possibilities for a voter’s pair of first and second choices, and \(38 \times 38 \times 38 = 54,872\) possibilities for the triplet of first-second-third choices. It studying the patterns of choice combinations, it can be helpful to simplify by reducing the number of possibilities. A sensible approach is to group the minor candidates together.

Construct a data wrangling command to translate each of the minor candidates from the candidate’s name to “minor.” Define a minor candidate as getting fewer than 5000 first-choice votes.

Problem 10.8: Here are the first letters of the names of several data verbs. (Repeated letters are for different verbs.) Complete the name and say briefly what the verb does.

Rather than looking at a list of verbs, do this just from memory.