Introducing a few more data verbs

You have already seen two data verbs:

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

Result <- 
  BabyNames %>% 
  group_by( sex, year ) %>%
  summarise( total=sum( count ) )
sex year total
M 1923 1096189
M 1925 1115920
M 1942 1380735
F 1952 1854203
F 1969 1686942
F 1978 1531592

These notes will introduce you to five more data verbs:

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

Select

Selecting from a data table means choosing one or more variables from the table. Reasons to do this:

The syntax is similar to that of group_by() or summarise(). A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.

To illustrate, here’s the first few cases in the BabyNames data table:

name sex count year
Mary F 7065 1880
Anna F 2604 1880
Emma F 2003 1880
Elizabeth F 1939 1880
Minnie F 1746 1880
Margaret F 1578 1880

And here is the result of selecting just the name and year variables:

name year
Mary 1880
Anna 1880
Emma 1880
Elizabeth 1880
Minnie 1880
Margaret 1880

If you want to rename a variable, use a named argument, as with when=year in the following:

name when
Mary 1880
Anna 1880
Emma 1880
Elizabeth 1880
Minnie 1880
Margaret 1880

Filter

To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering constrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.

In selecting, the variables are specified by name, e.g.

BabyNames %>% select( year, count )

With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==, >, <, %in%, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:

BabyNames %>% filter( sex=="F") %>%
  sample_n( size=10 ) %>% xtable()
name sex count year
Lamar F 5 1928
Lakeysha F 91 1978
Beverly F 8701 1939
Carmesha F 11 1992
Sherilyn F 30 1993
Aubreyanna F 11 2000
Kerrigan F 102 1996
Teana F 18 2006
Tacoria F 5 1988
Fumiyo F 5 1919

Here are the cases for either sex for babies born after 1990:

BabyNames %>% filter( year > 1990 ) %>% 
  sample_n( size=10 ) %>% xtable()
name sex count year
Amayrany F 7 2009
Byanka F 10 1998
Champayne F 17 1991
Lanessa F 13 2013
Iysha F 6 1994
Elani F 12 1995
Corie F 12 2009
Brieanna F 13 2013
Daxten M 7 2008
Domanique M 5 2006

Here are the girls born after 1990:

BabyNames %>% filter( year > 1990, sex=="F")
name sex count year
Kirstan F 17 2000
Janielle F 11 2007
Azaria F 89 1999
Alajah F 5 1995
Juanita F 180 2007
Jhanae F 8 2011
Rim F 5 2007
Sharra F 9 1991
Serita F 5 2002
Avril F 120 2009
Zarianna F 6 2004
Alaya F 52 1997
Natacha F 9 2010
Conley F 12 2010
Almadelia F 5 2008
Tulah F 5 2010
Melisa F 196 1999
Asija F 6 2000
Mirka F 5 2011
Dariane F 8 1992
Empress F 36 2012
Martika F 69 1993
Shawanda F 17 1997
Taia F 11 1993
Laynie F 74 2012
Caila F 45 1991
Tysheonna F 5 1998
Mailani F 5 2006
Amonie F 19 2009
Harlo F 23 2011

You can specify as many tests as you like. The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:

BabyNames %>% filter( year>1990 | sex=="F") 

It’s also possible to test for a variable being any of several different variables. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010)) %>%
  sample_n( size=12 ) %>% 
  arrange(year) %>%
  xtable()
name sex count year
Chriselda F 5 1990
Kla F 13 1990
Anquinetta F 5 1990
Carmello M 7 1990
Arnita F 6 1990
Asma F 63 2000
Marinna F 9 2000
Kaylib M 5 2000
Lazerrick M 6 2000
Jayvonte M 6 2000
Jermell M 6 2000
Jennifer M 6 2010

Mutate

The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.

Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData data table has variables pop and area giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you creating a new variable that is population / area.

Result <-
  CountryData %>% select( country, pop, area ) %>%
  mutate( popDensity=pop/area ) %>% 
#  na.omit() %>% 
  sample_n(size=10) %>% xtable()

The resulting data table, named Result here, contains all the original variables as well as the newly created ones.

Result has 77 variables. Rather than show them all, here are just some relevant variables for a few of the countries.

country pop area popDensity
Sint Maarten 39689.00 34.00 1167.32
West Bank 2731052.00 5860.00 466.05
Isle of Man 86866.00 572.00 151.86
Costa Rica 4755234.00 51100.00 93.06
Cameroon 23130708.00 475440.00 48.65
South Africa 48375645.00 1219090.00 39.68

Arrange

Arranging sets the order of cases. It does not change the variables — that’s a job for select(). Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  head() %>% xtable()
First total
ABDUL M RAHAMAN “THE ROCK” 338
ALICIA K. BENNETT 351
BETSY HODGES 28935
BILL KAHN 97
BOB “AGAIN” CARNEY JR 56
BOB FINE 2094

The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending order.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  arrange( desc(total) ) %>%
  head() %>% xtable()
First total
BETSY HODGES 28935
MARK ANDREW 19584
DON SAMUELS 8335
CAM WINTON 7511
JACKIE CHERRYHOMES 3524
BOB FINE 2094

By default, the arrangement goes in ascending order: from lowest to highest. You can also arrange for the order to be descending, with the highest value first.

Join

The last major data verb is “join.” That’s described here.

Some other data verbs

Elaboration: Data verb languages

The notation used in these notes is dplyr. Keep in mind that this is just one of several notations. Some of them are:

Here is the same expression in these different notations:

Please use the comment system to make suggestions, point out errors, or to discuss the topic.

comments powered by Disqus