Introducing a few more data verbs

You have already seen two data verbs:

summarise()
group_by()

Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:

Result <- 
  BabyNames %>% 
  group_by( sex, year ) %>%
  summarise( total=sum( count ) )

sex	year	total
M	1923	1096189
M	1925	1115920
M	1942	1380735
F	1952	1854203
F	1969	1686942
F	1978	1531592

These notes will introduce you to five more data verbs:

select()
mutate()
filter()
arrange()
join()

As with group_by() and summarise(), each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.

Select

Selecting from a data table means choosing one or more variables from the table. Reasons to do this:

Simplify the table you are working on.
Rename one or more of the variables to make it more convenient to work with.

The syntax is similar to that of group_by() or summarise(). A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.

To illustrate, here’s the first few cases in the BabyNames data table:

name	sex	count	year
Mary	F	7065	1880
Anna	F	2604	1880
Emma	F	2003	1880
Elizabeth	F	1939	1880
Minnie	F	1746	1880
Margaret	F	1578	1880

And here is the result of selecting just the name and year variables:

name	year
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880

If you want to rename a variable, use a named argument, as with when=year in the following:

name	when
Mary	1880
Anna	1880
Emma	1880
Elizabeth	1880
Minnie	1880
Margaret	1880

Filter

To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering constrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.

In selecting, the variables are specified by name, e.g.

BabyNames %>% select( year, count )

With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==, >, <, %in%, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:

BabyNames %>% filter( sex=="F") %>%
  sample_n( size=10 ) %>% xtable()

name	sex	count	year
Lamar	F	5	1928
Lakeysha	F	91	1978
Beverly	F	8701	1939
Carmesha	F	11	1992
Sherilyn	F	30	1993
Aubreyanna	F	11	2000
Kerrigan	F	102	1996
Teana	F	18	2006
Tacoria	F	5	1988
Fumiyo	F	5	1919

Here are the cases for either sex for babies born after 1990:

BabyNames %>% filter( year > 1990 ) %>% 
  sample_n( size=10 ) %>% xtable()

name	sex	count	year
Amayrany	F	7	2009
Byanka	F	10	1998
Champayne	F	17	1991
Lanessa	F	13	2013
Iysha	F	6	1994
Elani	F	12	1995
Corie	F	12	2009
Brieanna	F	13	2013
Daxten	M	7	2008
Domanique	M	5	2006

Here are the girls born after 1990:

BabyNames %>% filter( year > 1990, sex=="F")

name	sex	count	year
Kirstan	F	17	2000
Janielle	F	11	2007
Azaria	F	89	1999
Alajah	F	5	1995
Juanita	F	180	2007
Jhanae	F	8	2011
Rim	F	5	2007
Sharra	F	9	1991
Serita	F	5	2002
Avril	F	120	2009
Zarianna	F	6	2004
Alaya	F	52	1997
Natacha	F	9	2010
Conley	F	12	2010
Almadelia	F	5	2008
Tulah	F	5	2010
Melisa	F	196	1999
Asija	F	6	2000
Mirka	F	5	2011
Dariane	F	8	1992
Empress	F	36	2012
Martika	F	69	1993
Shawanda	F	17	1997
Taia	F	11	1993
Laynie	F	74	2012
Caila	F	45	1991
Tysheonna	F	5	1998
Mailani	F	5	2006
Amonie	F	19	2009
Harlo	F	23	2011

You can specify as many tests as you like. The filter() function will pass through only those cases that pass all the tests.

Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:

BabyNames %>% filter( year>1990 | sex=="F")

It’s also possible to test for a variable being any of several different variables. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:

BabyNames %>% 
  filter( year %in% c(1980, 1990, 2000, 2010)) %>%
  sample_n( size=12 ) %>% 
  arrange(year) %>%
  xtable()

name	sex	count	year
Chriselda	F	5	1990
Kla	F	13	1990
Anquinetta	F	5	1990
Carmello	M	7	1990
Arnita	F	6	1990
Asma	F	63	2000
Marinna	F	9	2000
Kaylib	M	5	2000
Lazerrick	M	6	2000
Jayvonte	M	6	2000
Jermell	M	6	2000
Jennifer	M	6	2010

Mutate

The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.

Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData data table has variables pop and area giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you creating a new variable that is population / area.

Result <-
  CountryData %>% select( country, pop, area ) %>%
  mutate( popDensity=pop/area ) %>% 
#  na.omit() %>% 
  sample_n(size=10) %>% xtable()

The resulting data table, named Result here, contains all the original variables as well as the newly created ones.

Result has 77 variables. Rather than show them all, here are just some relevant variables for a few of the countries.

country	pop	area	popDensity
Sint Maarten	39689.00	34.00	1167.32
West Bank	2731052.00	5860.00	466.05
Isle of Man	86866.00	572.00	151.86
Costa Rica	4755234.00	51100.00	93.06
Cameroon	23130708.00	475440.00	48.65
South Africa	48375645.00	1219090.00	39.68

Arrange

Arranging sets the order of cases. It does not change the variables — that’s a job for select(). Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.

For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  head() %>% xtable()

First	total
ABDUL M RAHAMAN “THE ROCK”	338
ALICIA K. BENNETT	351
BETSY HODGES	28935
BILL KAHN	97
BOB “AGAIN” CARNEY JR	56
BOB FINE	2094

The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total in descending order.

Minneapolis2013 %>%
  group_by( First ) %>% 
  summarise( total=n() ) %>%
  arrange( desc(total) ) %>%
  head() %>% xtable()

First	total
BETSY HODGES	28935
MARK ANDREW	19584
DON SAMUELS	8335
CAM WINTON	7511
JACKIE CHERRYHOMES	3524
BOB FINE	2094

By default, the arrangement goes in ascending order: from lowest to highest. You can also arrange for the order to be descending, with the highest value first.

Join

The last major data verb is “join.” That’s described here.

Some other data verbs

sample_n( ) chooses random cases from a data table. The number of cases to be sampled is set by the named argument, size=. Among other things, sample_n() is useful for dra

Elaboration: Data verb languages

The notation used in these notes is dplyr. Keep in mind that this is just one of several notations. Some of them are:

dplyr - R
data.table - R
SQL - database servers

Here is the same expression in these different notations:

dplyr BabyNames %>% group_by(year,sex) %>% summarise( nNames=n() )
SQL "BabyNames" > GROUP_BY("year", "sex") > SUMMARISE(COUNT() AS "nNames")
data.table BabyNames[, length(count), by=c("sex","year") ]

Please use the comment system to make suggestions, point out errors, or to discuss the topic.

comments powered by Disqus