Chapter 10 More Data Verbs

You have already seen two data verbs:

  • summarise()
  • group_by()

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

Result <- 
  BabyNames %>% 
  group_by( sex, year ) %>%
  summarise( total = sum( count ) )

Table 10.1: Grouping BabyNames by sex and year then using summarise() to add up the total births for each group.

sex year total
F 1892 212347
M 1892 122038
F 1935 1048395
M 1935 1040892
F 1978 1531592
M 1978 1642137

This chapter introduces four new data verbs that, like summarise() and group_by(), perform their action on a single data frame.

  • select()
  • filter()
  • mutate()
  • arrange()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

10.1 Choosing variables with select()

Selecting from a data frame means choosing one or more variables from the table, creating a new table that has just a subset of the variables in the original. For now, think of select() as enabling you to simplify a table to eliminate variables that you don’t need or to rename a variable.

The use of select() is straightforward. It takes a data frame as input along with the names of one or more variables. To illustrate, consider the CountryData table in the dcData package. There are 76 variables in CountryData.

names(CountryData)
 [1] "country"      "area"         "pop"          "growth"       "birth"       
 [6] "death"        "migr"         "maternal"     "infant"       "life"        
[11] "fert"         "health"       "HIVrate"      "HIVpeople"    "HIVdeath"    
   ... and so on ...
[61] "petroImp"     "gasProd"      "gasCons"      "gasExp"       "gasImp"      
[66] "gasRes"       "mainlines"    "cell"         "netHosts"     "netUsers"    
[71] "airports"     "railways"     "roadways"     "waterways"    "marine"      
[76] "military"      

By modern standards for computing, 76 is not at all a large number of variables. But for a printed display, it would be ungainly to have to print all columns. Instead, we might choose to show only the variables that are most important for our analysis. To display fewer columns, use select, as was used in Table 10.2 to subset the variables of the original CountryData data frame and create the smaller Demographics data frame.

Demographics <- 
  CountryData %>% 
  select(country, pop, area) 

Table 10.2: The Demographics table created by selection from CountryData.

country pop area
Afghanistan 31822848 652230
Akrotiri 15700 123
Albania 3020209 28748
Algeria 38813722 2381741
American Samoa 54517 199
… and so on for 256 rows altogether.

If you want to rename a variable, use rename(). The arguments to rename() specify which variables to rename and what the new name will be. For instance, to change area to kmSquared use the argument kmSquared = area. (Table 10.3)

Table 10.3: Renaming the area variable from Table 10.2.

country pop kmSquared
Afghanistan 31822848 652230
Akrotiri 15700 123
Albania 3020209 28748
Algeria 38813722 2381741
American Samoa 54517 199
Andorra 85458 468
… and so on for 256 rows altogether.

10.2 Removing cases with filter()

To filter means to remove unwanted material. The data verb filter() removes unwanted cases, passing through to the result only those cases that are wanted or needed.1 Filtering and selecting are complementary actions for subsetting a data frame. select() passes the specified subset of variables; filter() passes the specified subset of cases.

With filtering, the cases to pass through are specified by one or more criteria or tests. The tests usually involve comparisons between variables or between a variable and a scalar. Table 10.9 gives examples of comparison functions and their uses.

For instance, here’s how you can filter out the boys, producing a result with only the girls’ names, shown in Table 10.4.

BabyNames %>% 
  filter(sex != "M") 

Table 10.4: Using filter() to exclude the males.

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
… and so on for 1,062,432 rows altogether.

Or, you could filter on the basis of year, for instance, keeping the cases for either sex for babies born after 1990:

BabyNames %>% 
  filter( year > 1990 )

Table 10.5: Baby names after 1990.

name sex count year
Ashley F 43483 1991
Jessica F 43394 1991
Brittany F 29089 1991
… and so on for 697,983 rows altogether.

You can specify as many tests as you like. For instance, to keep only the cases for females born after 1990:

BabyNames %>% 
  filter(year > 1990, sex == "F")

Table 10.6: Filtered for rows that satisify all of multiple conditions: after 1990 and sex "F".

name sex count year
Ashley F 43483 1991
Jessica F 43394 1991
Brittany F 29089 1991
… and so on for 413,105 rows altogether.

The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990. To specify either-or, use the vertical bar | rather than a comma to separate the conditions.

BabyNames %>% 
  filter(year > 1990 | sex == "F") 

Table 10.7: Filtered for rows that satisfy any of multiple conditions: after 1990 or sex "F".

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
… and so on for 1,347,310 rows altogether.

It’s also possible to test for a variable being any of several different values. For instance, Table 10.8 includes just the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010))

Table 10.8: Filtering to keep only babies born in 1980, 1990, or 2000.

name sex count year
Jennifer F 58385 1980
Amanda F 35818 1980
Jessica F 33920 1980
Melissa F 31631 1980
Sarah F 25744 1980
Heather F 19971 1980
… and so on for 107,935 rows altogether.

Table 10.9: Examples of the sorts of comparisons used as criteria in filter().

Comparison Function Meaning Example
Exact equality == year == 1995 matches a number
name == "Julian" matches a character string
Greater than > year > 1995 years after 1995
name > "C" names starting with “D” or after
Less than < year < 1995 years before 1995
name < "C" names starting with “A” or “B”
Less than or equal to <= year <= 1995 year is 1995 or before
name <= "C" names alphabetically before “C”
Greater than or equal to >= Analogous to <=
Contained in %in% year %in% c(1940, 1941) matches either 1940 or 1941
name %in% c("April","May","June") A springtime name

10.3 Creating variables with mutate()

The word “mutate” means to change in form or nature. The data verb mutate() is more specific: to change a variable or add new variables based on the existing ones. Mutation leaves the cases exactly as they were; the output of mutate() will have exactly as many rows as the input data frame.

To illustrate, consider Table 10.2. It has variables pop and area giving the population and area (in km\(^2\)) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Computing the density is a matter of using a transformation verb: divide population by area. The mutate() verb let’s you carry out this transformation and put the result in a new variable. For instance:2 Note that in this R statement Demographics is the input to mutate() and the result is being stored under the name Demographics. This style can be read as “revise the Demographics table and then….” Of course, you could choose to give the result a new name or to pipe the result to another stage of data wrangling, etc.

Demographics <-
  Demographics %>% 
  mutate(pop_density = pop / area)

Table 10.10: Demographics with a pop_density variable added.

country pop area pop_density
Afghanistan 31822848 652230 48.79084
Akrotiri 15700 123 127.64228
Albania 3020209 28748 105.05806
Algeria 38813722 2381741 16.29637
American Samoa 54517 199 273.95477
Andorra 85458 468 182.60256
… and so on for 256 rows altogether.

The arguments to mutate() specify what the new variable or variables are to be called (pop_density above) and how to calculate the values of those variables. This calculation will always involve a transformation variable, just as the summarise() verb always involves a reduction verb.

10.4 Setting in order with arrange()

To “arrange” means to set in order. The arrange() data verb sorts out the cases in an order that you specify: it might be alphabetical or numeric, in ascending or descending order. arrange() does not change the variables in the data frame or filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, Table 10.11 shows the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() ) 

Table 10.11: Ballot count of first-choice candidates.

First total
ABDUL M RAHAMAN “THE ROCK” 338
ALICIA K. BENNETT 351
BETSY HODGES 28935
BILL KAHN 97
BOB “AGAIN” CARNEY JR 56
BOB FINE 2094
… and so on for 38 rows altogether.

The ascending alphabetical order in Table 10.11 might be good for some purposes. But if your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending descending numeric order as in Table 10.12. By default, arrange() puts cases in ascending order: from smallest to largest for numbers and from A to Z for character strings. You can set the order to be descending by using the desc() function.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() ) %>%
  arrange( desc(total) ) 

Table 10.12: Arranging the candidates in descending order by vote total.

First total
BETSY HODGES 28935
MARK ANDREW 19584
DON SAMUELS 8335
CAM WINTON 7511
JACKIE CHERRYHOMES 3524
BOB FINE 2094
… and so on for 38 rows altogether.

10.5 Even more data verbs

A data verb is any function that takes a data frame as input and gives a data frame as the return value. In addition to the ones already introduced, you may find these useful in your work:

Data Verb Purpose Example Arguments
head() Grab the first few rows BabyNames %>% head(5) Number of rows to grab
tail() Grab the last few rows BabyNames %>% tail(3) Number of rows
transmute() Create new variables Like mutate(), but only returns the new variables
rename() Rename variables Does not affect variables that are not referenced newName = oldName
sample_n() Random sample of rows BabyNames %>% sample_n(size = 5) Size of sample.

10.6 Summary functions

To look at a data frame, these functions are useful. Each takes the data frame as the only argument.

Function Purpose
glimpse() Quick summary of the table
str() Quick summary
summary() Quick summary
nrow() How many cases in the data frame
ncol() How many variables in the data frame
names() The variable names
View() Shows the table like a spreadsheet. In RStudio only.

10.7 Exercises

Problem 10.1: Identify each of these functions as either a Data Verb, a Transformation, a Reduction Function, or a Summary Function. (Hint: If you are unfamiliar with the function, use help().)

  1. str()
  2. group_by()
  3. rank()
  4. mean()
  5. filter()
  6. summary()
  7. summarise()
  8. glimpse()

Problem 10.2: Each of these tasks can be performed using a single data verb. For each task, say which verb it is:

  1. Add a new column that is the ratio between two variables.
  2. Sort the cases in descending order of a variable.
  3. Create a new data table that includes only those cases that meet a criterion.
  4. From a data table with three categorical variables A, B, & C, and a quantitative variable X, produce an output that has the same cases but only the variables A and X.
  5. From a data table with three categorical variables A, B, & C, and a quantitative variable X, produce an output that has a separate case for each of the combinations of the levels of A and B. (Hints: This requires two different data verbs. It might be easier to see the answer if the problem statement added, “and gives the maximum value of X over all the cases that have a given combination of A and B.”)

Problem 10.3: For each of these computations, say what R function is the most appropriate:

  1. Count the number of cases in a data table.
  2. List the names of the variables in a data table.
  3. For data tables in an R package, display the documentation (“codebook”) for the data table.
  4. Load a package into your R session.
  5. Mark a data table as grouped by one or more categorical variables.

Problem 10.4: Here is a reminder of what the BabyNames data frame looks like. Take this table as the input to the data wranging statement you will construct.

name sex count year
Juana F 121 1956
Stamatia F 5 1981
Kenesha F 16 2004
Aubreigh F 63 2010
Yadriel M 17 2012
Mariam F 52 1922
… and so on for 1,792,091 rows altogether.

For each of the following outputs, identify the data verb linking the input to the output and write down the details (i.e., arguments) of the operation. (Hints: Look at the number of cases. The order of variables may also be a clue.)

  1. Output Table A
    name sex count year
    Stamatia F 5 1981
    Cathrin F 5 1916
    Kenesha F 16 2004
    Mariam F 52 1922
    Lue F 58 1957
    Aubreigh F 63 2010
    … and so on for 1,792,091 rows altogether.
  2. Output Table B
    name sex count year
    Juana F 121 1956
    Stamatia F 5 1981
    Kenesha F 16 2004
    Aubreigh F 63 2010
    Mariam F 52 1922
    Adelaide F 471 1920
    … and so on for 1,062,432 rows altogether.
  3. Output Table C
    name sex count year
    Yadriel M 17 2012
    Ferdie M 6 1915
    … and so on for 729,659 rows altogether.
  4. Output Table D
    total
    333417770
  5. Output Table E
    name count
    Juana 121
    Stamatia 5
    Kenesha 16
    Aubreigh 63
    Yadriel 17
    Mariam 52
    … and so on for 1,792,091 rows altogether.

Problem 10.5: In the ranked-choice ballot system that underlies the Minneapolis2013 data available in dcData, once a voter has picked a first choice candidate, there is no advantage in listing that same candidate as second or third choice.

In answering each of the following, you should include three things:

  1. A statement in English (using data verbs!) of your strategy for carrying out the calculation.
  2. The implementation of the calculation in R.
  3. The data table that is the result of the calculation. This will be printed automatically from (b). You do not have to format the result beautifully. It’s sufficient to show the first few lines in the data table appear. (Remember head()).

Here are the questions.

  1. How many people chose the same candidate for both First and Second place?

  2. Of the ballots where the First and Second place choices are the same, what were the top 3 choices?

  3. Of the people who selected Ole Savior for First, what were the top three Second choices?

Problem 10.6: These questions refer to the diamonds data table in the ggplot2 package. Take a look at the codebook (using help()) so that you’ll understand the meaning of the tasks. (Motivated by Garrett Grolemund.)

Each of the following tasks can be accomplished by a statement of the form

diamonds %>% 
  verb1( args1 ) %>% 
  verb2( args2 ) %>% 
  arrange(desc( args3 )) %>% 
  head( 1 )

For each task, give appropriate R functions or arguments to substitute in place of verb1, verb2, args1, args2, and args3.

  1. Which color diamonds seem to be largest on average (in terms of carats)?
  2. Which clarity of diamonds has the largest average “table” per carat?

Problem 10.7: In the 2013 Minneapolis mayoral election, there were the 36 candidates as well as no choice (“undervote”) or disallowed multiple choices (“overvote”).] This means there are \(38 \times 38 = 1444\) different possibilities for a voter’s pair of first and second choices, and \(38 \times 38 \times 38 = 54,872\) possibilities for the triplet of first-second-third choices. It studying the patterns of choice combinations, it can be helpful to simplify by reducing the number of possibilities. A sensible approach is to group the minor candidates together.

Construct a data wrangling command to translate each of the minor candidates from the candidate’s name to “minor.” Define a minor candidate as getting fewer than 5000 first-choice votes.

Problem 10.8: Here are the first letters of the names of several data verbs. (Repeated letters are for different verbs.) Complete the name and say briefly what the verb does.

Rather than looking at a list of verbs, do this just from memory.

  • A
  • G
  • J
  • M
  • F
  • S
  • S