Many questions take forms such as these:
The functions min() and max() are obvious candidates for carrying out such tasks, but they don’t do quite the right thing. For instance:
BabyNames %>%
summarise( biggest=max( count ) )
biggest
1 99674
quickly reveals that the most popular name was given 99674 times. But the result doesn’t indicate what that popular name was or in what year it was given.
What’s needed here is a data verb that will return the one biggest case. The job of choosing cases that meet a criterion belongs to filter().
BabyNames %>%
filter( count==max( count ) )
name sex count year
1 Linda F 99674 1947
Note the difference between using filter() and using summarise(): rather than creating a new variable, as summarise() does (its name is biggest in the example), filter() needs a criterion or test. The criterion count==max( count ) (with the double equals sign ==) asks for the case where the value of count matches the largest value of count. That will be the biggest case.
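The contrast between the two verbs can be seen on a tiny table. This is only a sketch, assuming the dplyr package is installed; the table Toy and its values are invented for illustration and are not part of BabyNames.

```r
library(dplyr)

# Hypothetical three-row table, invented for this illustration
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy"),
  count = c(10, 30, 20)
)

# summarise() collapses the table to a single new variable
Biggest <- Toy %>%
  summarise(biggest = max(count))

# filter() keeps the whole case that passes the test
BiggestCase <- Toy %>%
  filter(count == max(count))
```

Here Biggest has one variable, biggest, holding 30, while BiggestCase holds the entire row for Bob.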
It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. more than 90% as popular.
BabyNames %>%
filter( count > 0.90*max( count ))
name sex count year
1 Linda F 99674 1947
2 James M 94758 1947
3 Robert M 91652 1947
4 Linda F 96210 1948
5 Linda F 90994 1949
6 Michael M 90629 1956
7 Michael M 92711 1957
8 Michael M 90512 1958
Frequently, however, the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank().
The rank() function does something very simple: it replaces each number in a set with the position of that number relative to the others. For instance, the rank of the number 5 in the set 2,5,4,7,2,9,9,8 is 4 because 5 is the 4th smallest number in the set. (The three numbers 2, 2, 4 are smaller than 5.)
Note that rank() is not a data verb. Data verbs take a data table as input and return a data table. In contrast, rank() takes a variable as input and returns a set of the same size that tells where each number in the set stands. The smallest number in a set of \(n\) numbers will have rank 1; the largest will have rank \(n\).[1] Since rank() transforms a variable into a variable, it’s particularly suitable for use in mutate() and filter().
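As a minimal sketch of rank() inside those two verbs, assuming dplyr is installed: the table Toy below is invented for illustration and deliberately has no tied counts.

```r
library(dplyr)

# Hypothetical table, invented for this illustration (no tied counts)
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Di"),
  count = c(10, 30, 20, 40)
)

# mutate() adds the rank as a new variable, one value per case
Ranked <- Toy %>%
  mutate(position = rank(count))

# filter() uses the rank as a test: keep the two smallest counts
Smallest2 <- Toy %>%
  filter(rank(count) <= 2)
```

In Ranked, the new variable position is 1, 3, 2, 4: the smallest count gets rank 1 and the largest gets rank 4. Smallest2 keeps only the rows for Ann and Cy.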
To illustrate, here are the 10 most popular names in the BabyNames data table. (Note: desc() is used so that rank() will work in descending order, rather than from smallest to biggest.)
BabyNames %>%
filter( rank( desc(count) ) <= 10 )
name sex count year
1 Linda F 99674 1947
2 James M 94758 1947
3 Robert M 91652 1947
4 Linda F 96210 1948
5 James M 88610 1948
6 Linda F 90994 1949
7 Michael M 88481 1954
8 Michael M 90629 1956
9 Michael M 92711 1957
10 Michael M 90512 1958
Of course, rank() can be used with other data verbs, such as group_by().
BabyNames %>%
group_by( sex ) %>%
filter( rank( desc(count) ) <= 10 )
Source: local data frame [20 x 4]
Groups: sex
name sex count year
1 Mary F 73981 1921
2 Mary F 72172 1922
3 Mary F 71631 1923
4 Mary F 73520 1924
5 James M 87428 1946
6 Linda F 99674 1947
7 Mary F 71679 1947
8 James M 94758 1947
9 Robert M 91652 1947
10 John M 88318 1947
11 Linda F 96210 1948
12 James M 88610 1948
13 Linda F 90994 1949
14 Linda F 80433 1950
15 Linda F 73928 1951
16 Michael M 88481 1954
17 Michael M 88281 1955
18 Michael M 90629 1956
19 Michael M 92711 1957
20 Michael M 90512 1958
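You can also perform other tasks, for instance finding the most popular names of each sex over all time by adding up the counts over the years. (After summarise(), the result is still grouped by sex, so rank() works within each sex.)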
BabyNames %>%
group_by( sex, name ) %>%
summarise( count=sum(count) ) %>%
filter( rank( desc(count) ) <= 10 ) %>%
arrange( desc(count) )
Source: local data frame [20 x 3]
Groups: sex
sex name count
1 M James 5091189
2 M John 5073958
3 M Robert 4789776
4 M Michael 4293460
5 F Mary 4112464
6 M William 4038447
7 M David 3565229
8 M Joseph 2557792
9 M Richard 2552302
10 M Charles 2356886
11 M Thomas 2275889
12 F Elizabeth 1591439
13 F Patricia 1570135
14 F Jennifer 1461186
15 F Linda 1450328
16 F Barbara 1432543
17 F Margaret 1238016
18 F Susan 1120083
19 F Dorothy 1105281
20 F Sarah 1055860
Often, there are ties. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were not quite tied.
Some rank-like functions are useful if ties are an issue. For instance, row_number() breaks ties in favor of the first case encountered.
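The difference shows up on the small set of numbers used earlier, which contains two ties. The sketch below assumes dplyr is installed. The function min_rank() is a further dplyr rank-like function that the text above does not mention; it is included only for comparison.

```r
library(dplyr)

x <- c(2, 5, 4, 7, 2, 9, 9, 8)   # the set used earlier in the text

avg    <- rank(x)        # tied values share the mean of their ranks
first  <- row_number(x)  # ties broken in favor of the first case encountered
lowest <- min_rank(x)    # tied values all get the smallest of their ranks
```

The results are 1.5, 4, 3, 5, 1.5, 7.5, 7.5, 6 for rank(); 1, 4, 3, 5, 2, 7, 8, 6 for row_number(); and 1, 4, 3, 5, 1, 7, 7, 6 for min_rank(). Only row_number() is guaranteed to use every whole-number rank exactly once.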
For example, the following finds the third most popular name in each year.
PopularByYear <- BabyNames %>%
group_by( year ) %>%
filter( rank( desc(count) ) == 3)
head( PopularByYear, 10 )
Source: local data frame [10 x 4]
Groups: year
name sex count year
1 Mary F 7065 1880
2 Mary F 6919 1881
3 Mary F 8148 1882
4 Mary F 8012 1883
5 William M 8897 1884
6 William M 8044 1885
7 William M 8252 1886
8 William M 7470 1887
9 William M 8705 1888
10 William M 7772 1889
tail( PopularByYear, 10 )
Source: local data frame [10 x 4]
Groups: year
name sex count year
125 Emily F 25021 2004
126 Michael M 23789 2005
127 Joshua M 22298 2006
128 Ethan M 21013 2007
129 Ethan M 20194 2008
130 Ethan M 19828 2009
131 Sophia F 20601 2010
132 Isabella F 19850 2011
133 Isabella F 19026 2012
134 Olivia F 18256 2013
The next command is almost identical to the one above, but with == replaced by <=, so it keeps the three most popular names in each year rather than only the third.
Top3ByYear <- BabyNames %>%
group_by( year ) %>%
filter( rank( desc(count) ) <= 3) %>%
arrange( year )
head( Top3ByYear, 15 )
Source: local data frame [15 x 4]
Groups: year
name sex count year
1 Mary F 7065 1880
2 John M 9655 1880
3 William M 9532 1880
4 Mary F 6919 1881
5 John M 8769 1881
6 William M 8524 1881
7 Mary F 8148 1882
8 John M 9557 1882
9 William M 9298 1882
10 Mary F 8012 1883
11 John M 8894 1883
12 William M 8387 1883
13 Mary F 9217 1884
14 John M 9388 1884
15 William M 8897 1884
tail( Top3ByYear, 15 )
Source: local data frame [15 x 4]
Groups: year
name sex count year
388 Isabella F 22273 2009
389 Jacob M 21133 2009
390 Ethan M 19828 2009
391 Isabella F 22872 2010
392 Sophia F 20601 2010
393 Jacob M 22074 2010
394 Sophia F 21799 2011
395 Isabella F 19850 2011
396 Jacob M 20310 2011
397 Sophia F 22245 2012
398 Emma F 20871 2012
399 Isabella F 19026 2012
400 Sophia F 21075 2013
401 Emma F 20788 2013
402 Olivia F 18256 2013
If there is a tie, there’s no guarantee that any of the names will be in 3rd place: two names tied for 3rd and 4th would each get rank 3.5 from rank(), so neither would have rank 3. Using row_number() instead of rank() would help here.
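A minimal sketch of that substitution, assuming dplyr is installed. The table Toy, with a deliberate tie between the 3rd and 4th largest counts, is invented for illustration.

```r
library(dplyr)

# Hypothetical counts for one year, with a tie between 3rd and 4th
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Di"),
  count = c(40, 30, 20, 20),
  year  = 2000
)

# rank() gives both tied names rank 3.5, so no case passes the test
WithRank <- Toy %>%
  group_by(year) %>%
  filter(rank(desc(count)) == 3)

# row_number() gives the tied names 3 and 4, so exactly one case passes
WithRowNumber <- Toy %>%
  group_by(year) %>%
  filter(row_number(desc(count)) == 3)
```

WithRank comes back empty, while WithRowNumber contains the single row for Cy, the first of the tied cases in the table. Which tied name is kept depends only on row order, so this settles the count of rows returned rather than saying anything about the names themselves.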
[1] Unless there are ties for the smallest or the largest.