Chapter 9 More Data Verbs

You have already seen two data verbs:

  • summarise()
  • group_by()

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

Result <- 
  BabyNames %>% 
  group_by( sex, year ) %>%
  summarise( total = sum( count ) )

Table 9.1: Grouping BabyNames by sex and year then using summarise() to add up the total births for each group.

sex year total
F 1892 212347
M 1892 122038
F 1935 1048395
M 1935 1040892
F 1978 1531592
M 1978 1642137

This chapter introduces four new data verbs that, like summarise() and group_by(), perform their action on a single data frame.

  • select()
  • filter()
  • mutate()
  • arrange()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

9.1 Choosing variables with select()

Selecting from a data frame means choosing one or more variables from the table, creating a new table that has just a subset of the variables in the original. For now, think of select() as enabling you to simplify a table to eliminate variables that you don’t need or to rename a variable.

The use of select() is straightforward. It takes a data frame as input along with the names of one or more variables. To illustrate, consider the CountryData table in the DataComputing package. There are 76 variables in CountryData.

names(CountryData)
 [1] "country"      "area"         "pop"          "growth"       "birth"       
 [6] "death"        "migr"         "maternal"     "infant"       "life"        
[11] "fert"         "health"       "HIVrate"      "HIVpeople"    "HIVdeath"    
   ... and so on ...
[61] "petroImp"     "gasProd"      "gasCons"      "gasExp"       "gasImp"      
[66] "gasRes"       "mainlines"    "cell"         "netHosts"     "netUsers"    
[71] "airports"     "railways"     "roadways"     "waterways"    "marine"      
[76] "military"      

By modern standards for computing, 76 is not at all a large number of variables. But for a printed display, it would be ungainly to have to print all columns. Instead, we might choose to show only the variables that are most important for our analysis. To display fewer columns, use select, as was used in Table 9.2 to subset the variables of the original CountryData data frame and create the smaller Demographics data frame.

Demographics <- 
  CountryData %>% 
  select(country, pop, area) 

Table 9.2: The Demographics table created by selection from CountryData.

country pop area
Afghanistan 31822848 652230
Akrotiri 15700 123
Albania 3020209 28748
Algeria 38813722 2381741
American Samoa 54517 199
… and so on for 256 rows altogether.

If you want to rename a variable, use rename(). The arguments to rename() specify which variables to rename and what the new name will be. For instance, to change area to kmSquared use the argument kmSquared = area. (Table 9.3)

Table 9.3: Renaming the area variable from Table 9.2.

country pop kmSquared
Afghanistan 31822848 652230
Akrotiri 15700 123
Albania 3020209 28748
Algeria 38813722 2381741
American Samoa 54517 199
Andorra 85458 468
… and so on for 256 rows altogether.

9.2 Removing cases with filter()

To filter means to remove unwanted material. The data verb filter() removes unwanted cases, passing through to the result only those cases that are wanted or needed.1 Filtering and selecting are complementary actions for subsetting a data frame. select() passes the specified subset of variables; filter() passes the specified subset of cases.

With filtering, the cases to pass through are specified by one or more criteria or tests. The tests usually involve comparisons between variables or between a variable and a scalar. Table 9.9 gives examples of comparison functions and their uses.

For instance, here’s how you can filter out the boys, producing a result with only the girls’ names, shown in Table 9.4.

BabyNames %>% 
  filter(sex != "M") 

Table 9.4: Using filter() to exclude the males.

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
… and so on for 1,062,432 rows altogether.

Or, you could filter on the basis of year, for instance, keeping the cases for either sex for babies born after 1990:

BabyNames %>% 
  filter( year > 1990 )

Table 9.5: Baby names after 1990.

name sex count year
Ashley F 43483 1991
Jessica F 43394 1991
Brittany F 29089 1991
… and so on for 697,983 rows altogether.

You can specify as many tests as you like. For instance, to keep only the cases for females born after 1990:

BabyNames %>% 
  filter(year > 1990, sex=="F")

Table 9.6: Filtered for rows that satisify all of multiple conditions: after 1990 and sex "F".

name sex count year
Ashley F 43483 1991
Jessica F 43394 1991
Brittany F 29089 1991
… and so on for 413,105 rows altogether.

The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990. To specify either-or, use the vertical bar | rather than a comma to separate the conditions.

BabyNames %>% 
  filter(year > 1990 | sex == "F") 

Table 9.7: Filtered for rows that satisfy any of multiple conditions: after 1990 or sex "F".

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
… and so on for 1,347,310 rows altogether.

It’s also possible to test for a variable being any of several different values. For instance, Table 9.8 includes just the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010))

Table 9.8: Filtering to keep only babies born in 1980, 1990, or 2000.

name sex count year
Jennifer F 58385 1980
Amanda F 35818 1980
Jessica F 33920 1980
Melissa F 31631 1980
Sarah F 25744 1980
Heather F 19971 1980
… and so on for 107,935 rows altogether.

Table 9.9: Examples of the sorts of comparisons used as criteria in filter().

Comparison Function Meaning Example
Exact equality == year == 1995 matches a number
name == "Julian" matches a character string
Greater than > year > 1995 years after 1995
name > "C" names starting with “D” or after
Less than < year < 1995 years before 1995
name < "C" names starting with “A” or “B”
Less than or equal to <= year <= 1995 year is 1995 or before
name <= "C" names alphabetically before “C”
Greater than or equal to >= Analogous to <=
Contained in %in% year %in% c(1940, 1941) matches either 1940 or 1941
name %in% c("April","May","June") A springtime name

9.3 Creating variables with mutate()

The word “mutate” means to change in form or nature. The data verb mutate() is more specific: to change a variable or add new variables based on the existing ones. Mutation leaves the cases exactly as they were; the output of mutate() will have exactly as many rows as the input data frame.

To illustrate, consider Table 9.2. It has variables pop and area giving the population and area (in km\(^2\)) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Computing the density is a matter of using a transformation verb: divide population by area. The mutate() verb let’s you carry out this transformation and put the result in a new variable. For instance:2 Note that in this R statement Demographics is the input to mutate() and the result is being stored under the name Demographics. This style can be read as “revise the Demographics table and then….” Of course, you could choose to give the result a new name or to pipe the result to another stage of data wrangling, etc.

Demographics <-
  Demographics %>% 
  mutate(pop_density = pop / area)

Table 9.10: Demographics with a pop_density variable added.

country pop area pop_density
Afghanistan 31822848 652230 48.79084
Akrotiri 15700 123 127.64228
Albania 3020209 28748 105.05806
Algeria 38813722 2381741 16.29637
American Samoa 54517 199 273.95477
Andorra 85458 468 182.60256
… and so on for 256 rows altogether.

The arguments to mutate() specify what the new variable or variables are to be called (pop_density above) and how to calculate the values of those variables. This calculation will always involve a transformation variable, just as the summarise() verb always involves a reduction verb.

9.4 Setting in order with arrange()

To “arrange” means to set in order. The arrange() data verb sorts out the cases in an order that you specify: it might be alphabetical or numeric, in ascending or descending order. arrange() does not change the variables in the data frame or filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, Table 9.11 shows the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() ) 

Table 9.11: Ballot count of first-choice candidates.

First total
ABDUL M RAHAMAN “THE ROCK” 338
ALICIA K. BENNETT 351
BETSY HODGES 28935
BILL KAHN 97
BOB “AGAIN” CARNEY JR 56
BOB FINE 2094
… and so on for 38 rows altogether.

The ascending alphabetical order in Table 9.11 might be good for some purposes. But if your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending descending numeric order as in Table 9.12. By default, arrange() puts cases in ascending order: from smallest to largest for numbers and from A to Z for character strings. You can set the order to be descending by using the desc() function.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total = n() ) %>%
  arrange( desc(total) ) 

Table 9.12: Arranging the candidates in descending order by vote total.

First total
BETSY HODGES 28935
MARK ANDREW 19584
DON SAMUELS 8335
CAM WINTON 7511
JACKIE CHERRYHOMES 3524
BOB FINE 2094
… and so on for 38 rows altogether.

9.5 Even more data verbs

A data verb is any function that takes a data frame as input and gives a data frame as the return value. In addition to the ones already introduced, you may find these useful in your work:

Data Verb Purpose Example Arguments
head() Grab the first few rows BabyNames %>% head(5) Number of rows to grab
tail() Grab the last few rows BabyNames %>% tail(3) Number of rows
transmute() Create new variables Like mutate(), but only returns the new variables
rename() Rename variables Does not affect variables that are not referenced newName = oldName
sample_n() Random sample of rows BabyNames %>% sample_n(size = 5) Size of sample.

9.6 Summary functions

To look at a data frame, these functions are useful. Each takes the data frame as the only argument.

Function Purpose
glimpse() Quick summary of the table
str() Quick summary
summary() Quick summary
nrow() How many cases in the data frame
ncol() How many variables in the data frame
names() The variable names
View() Shows the table like a spreadsheet. In RStudio only.