You have already seen two data verbs:
summarise()
group_by()
Although these are being written in computer notation, it’s also perfectly legitimate to express actions using them as English verbs. For instance: “Group the baby names by sex and year. Then summarize the groups by adding up the total number of births for each group. This will be the result.” That’s English. Here’s the equivalent statement in computer notation:
Result <-
BabyNames %>%
group_by( sex, year ) %>%
summarise( total=sum( count ) )
sex | year | total |
---|---|---|
M | 1923 | 1096189 |
M | 1925 | 1115920 |
M | 1942 | 1380735 |
F | 1952 | 1854203 |
F | 1969 | 1686942 |
F | 1978 | 1531592 |
These notes will introduce you to five more data verbs:
select()
mutate()
filter()
arrange()
join()
As with group_by()
and summarise()
, each is a standard English word whose action on data is reflected in the colloquial, everyday meaning. And, like English, intricate and detailed statements can be made by combining the words into expressions.
Selecting from a data table means choosing one or more variables from the table. Reasons to do this:
The syntax is similar to that of group_by()
or summarise()
. A data table is provided as input along with the names of the variables you are selecting. The result produced is a new data table with just those variables.
To illustrate, here’s the first few cases in the BabyNames
data table:
name | sex | count | year |
---|---|---|---|
Mary | F | 7065 | 1880 |
Anna | F | 2604 | 1880 |
Emma | F | 2003 | 1880 |
Elizabeth | F | 1939 | 1880 |
Minnie | F | 1746 | 1880 |
Margaret | F | 1578 | 1880 |
And here is the result of selecting just the name
and year
variables:
name | year |
---|---|
Mary | 1880 |
Anna | 1880 |
Emma | 1880 |
Elizabeth | 1880 |
Minnie | 1880 |
Margaret | 1880 |
If you want to rename a variable, use a named argument, as with when=year
in the following:
name | when |
---|---|
Mary | 1880 |
Anna | 1880 |
Emma | 1880 |
Elizabeth | 1880 |
Minnie | 1880 |
Margaret | 1880 |
To “filter” means to remove unwanted material. The data verb “filter” removes unwanted cases, passing through to the result only those cases that are wanted or needed. Filtering constrasts with selecting. Selecting passes the specified variables; filtering passes the specified cases.
In selecting, the variables are specified by name, e.g.
BabyNames %>% select( year, count )
With filtering, the cases are specified by one or more criteria or tests. The tests are generally constructed with variables and functions like ==
, >
, <
, %in%
, and so on. For instance, here’s how you can filter out the boys, producing a result with only the girls’ names:
BabyNames %>% filter( sex=="F") %>%
sample_n( size=10 ) %>% xtable()
name | sex | count | year |
---|---|---|---|
Lamar | F | 5 | 1928 |
Lakeysha | F | 91 | 1978 |
Beverly | F | 8701 | 1939 |
Carmesha | F | 11 | 1992 |
Sherilyn | F | 30 | 1993 |
Aubreyanna | F | 11 | 2000 |
Kerrigan | F | 102 | 1996 |
Teana | F | 18 | 2006 |
Tacoria | F | 5 | 1988 |
Fumiyo | F | 5 | 1919 |
Here are the cases for either sex for babies born after 1990:
BabyNames %>% filter( year > 1990 ) %>%
sample_n( size=10 ) %>% xtable()
name | sex | count | year |
---|---|---|---|
Amayrany | F | 7 | 2009 |
Byanka | F | 10 | 1998 |
Champayne | F | 17 | 1991 |
Lanessa | F | 13 | 2013 |
Iysha | F | 6 | 1994 |
Elani | F | 12 | 1995 |
Corie | F | 12 | 2009 |
Brieanna | F | 13 | 2013 |
Daxten | M | 7 | 2008 |
Domanique | M | 5 | 2006 |
Here are the girls born after 1990:
BabyNames %>% filter( year > 1990, sex=="F")
name | sex | count | year |
---|---|---|---|
Kirstan | F | 17 | 2000 |
Janielle | F | 11 | 2007 |
Azaria | F | 89 | 1999 |
Alajah | F | 5 | 1995 |
Juanita | F | 180 | 2007 |
Jhanae | F | 8 | 2011 |
Rim | F | 5 | 2007 |
Sharra | F | 9 | 1991 |
Serita | F | 5 | 2002 |
Avril | F | 120 | 2009 |
Zarianna | F | 6 | 2004 |
Alaya | F | 52 | 1997 |
Natacha | F | 9 | 2010 |
Conley | F | 12 | 2010 |
Almadelia | F | 5 | 2008 |
Tulah | F | 5 | 2010 |
Melisa | F | 196 | 1999 |
Asija | F | 6 | 2000 |
Mirka | F | 5 | 2011 |
Dariane | F | 8 | 1992 |
Empress | F | 36 | 2012 |
Martika | F | 69 | 1993 |
Shawanda | F | 17 | 1997 |
Taia | F | 11 | 1993 |
Laynie | F | 74 | 2012 |
Caila | F | 45 | 1991 |
Tysheonna | F | 5 | 1998 |
Mailani | F | 5 | 2006 |
Amonie | F | 19 | 2009 |
Harlo | F | 23 | 2011 |
You can specify as many tests as you like. The filter()
function will pass through only those cases that pass all the tests.
Sometimes you may want to set “either-or” criteria, say the babies who are female or born after 1990:
BabyNames %>% filter( year>1990 | sex=="F")
It’s also possible to test for a variable being any of several different variables. For instance, here are the babies born in any of 1980, 1990, 2000, and 2010:
BabyNames %>%
filter( year %in% c(1980, 1990, 2000, 2010)) %>%
sample_n( size=12 ) %>%
arrange(year) %>%
xtable()
name | sex | count | year |
---|---|---|---|
Chriselda | F | 5 | 1990 |
Kla | F | 13 | 1990 |
Anquinetta | F | 5 | 1990 |
Carmello | M | 7 | 1990 |
Arnita | F | 6 | 1990 |
Asma | F | 63 | 2000 |
Marinna | F | 9 | 2000 |
Kaylib | M | 5 | 2000 |
Lazerrick | M | 6 | 2000 |
Jayvonte | M | 6 | 2000 |
Jermell | M | 6 | 2000 |
Jennifer | M | 6 | 2010 |
The word “mutate” means to change in form or nature. The data verb “mutate” is a bit more specific: to change a variable or add new variables based on the existing ones. The data verb always refers to variables; mutation leaves the cases exactly as they were.
Often, mutation is used to combine or transform existing variables into a new variable. For instance, the CountryData
data table has variables pop
and area
giving the population and area (in km^2) of each country. Suppose you wanted to know the population density, that is, how many people per unit area. Using mutate, you creating a new variable that is population / area.
Result <-
CountryData %>% select( country, pop, area ) %>%
mutate( popDensity=pop/area ) %>%
# na.omit() %>%
sample_n(size=10) %>% xtable()
The resulting data table, named Result
here, contains all the original variables as well as the newly created ones.
Result
has 77 variables. Rather than show them all, here are just some relevant variables for a few of the countries.
country | pop | area | popDensity |
---|---|---|---|
Sint Maarten | 39689.00 | 34.00 | 1167.32 |
West Bank | 2731052.00 | 5860.00 | 466.05 |
Isle of Man | 86866.00 | 572.00 | 151.86 |
Costa Rica | 4755234.00 | 51100.00 | 93.06 |
Cameroon | 23130708.00 | 475440.00 | 48.65 |
South Africa | 48375645.00 | 1219090.00 | 39.68 |
Arranging sets the order of cases. It does not change the variables — that’s a job for select()
. Similarly, arranging does not filter the cases. Arranging merely sets the order of cases according to some criterion that you specify.
For instance, here are the first-choices from the Minneapolis mayoral election in 2013 found by counting the ballots:
Minneapolis2013 %>%
group_by( First ) %>%
summarise( total=n() ) %>%
head() %>% xtable()
First | total |
---|---|
ABDUL M RAHAMAN “THE ROCK” | 338 |
ALICIA K. BENNETT | 351 |
BETSY HODGES | 28935 |
BILL KAHN | 97 |
BOB “AGAIN” CARNEY JR | 56 |
BOB FINE | 2094 |
The alphabetical order in the above might be good for some purposes. If your goal is to show who won and how they did compared to the other candidates, it’s better to arrange the results by total
in descending order.
Minneapolis2013 %>%
group_by( First ) %>%
summarise( total=n() ) %>%
arrange( desc(total) ) %>%
head() %>% xtable()
First | total |
---|---|
BETSY HODGES | 28935 |
MARK ANDREW | 19584 |
DON SAMUELS | 8335 |
CAM WINTON | 7511 |
JACKIE CHERRYHOMES | 3524 |
BOB FINE | 2094 |
By default, the arrangement goes in ascending order: from lowest to highest. You can also arrange for the order to be descending, with the highest value first.
The last major data verb is “join.” That’s described here.
sample_n( )
chooses random cases from a data table. The number of cases to be sampled is set by the named argument, size=
. Among other things, sample_n()
is useful for draThe notation used in these notes is dplyr
. Keep in mind that this is just one of several notations. Some of them are:
Here is the same expression in these different notations:
BabyNames %>% group_by(year,sex) %>% summarise( nNames=n() )
"BabyNames" > GROUP_BY("year", "sex") > SUMMARISE(COUNT() AS "nNames")
BabyNames[, length(count), by=c("sex","year") ]
Please use the comment system to make suggestions, point out errors, or to discuss the topic.