Biggest, smallest, and in between

Many questions ask which cases in a data table are the biggest, the smallest, or somewhere in between.

The functions min() and max() are obvious candidates for carrying out such tasks, but they don’t do quite the right thing. For instance,

BabyNames %>% 
  summarise( biggest=max( count ) )
  biggest
1   99674

quickly reveals that the most popular name was given 99674 times. But the result doesn’t indicate what that popular name was or in what year it was given.

What’s needed here is a data verb that will return the one biggest case. The job of choosing cases that meet a criterion belongs to filter().

BabyNames %>% 
  filter( count==max( count ) )
   name sex count year
1 Linda   F 99674 1947

Note the difference between using filter() and using summarise(): rather than creating a new variable as summarise() does (it’s named biggest in the example), filter() needs a criterion or test. The criterion count==max( count ) (with the double equals sign ==) asks for the case where the value of count matches the largest value of count. That will be the biggest case.
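
The contrast between the two verbs can be sketched on a small made-up table. The table toy and its names and counts are hypothetical stand-ins for illustration, not values from BabyNames.

```r
library(dplyr)

# Hypothetical toy table (an assumption for illustration, not the real BabyNames data)
toy <- data.frame(
  name  = c("Ann", "Bob", "Cam"),
  count = c(12, 30, 7),
  stringsAsFactors = FALSE
)

# summarise() collapses the table to a single number: the largest count
biggest <- toy %>%
  summarise(biggest = max(count))

# filter() keeps the entire case whose count equals that largest value
top_case <- toy %>%
  filter(count == max(count))
```

Here biggest holds only the number 30, while top_case holds the whole row for "Bob", including the name.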

It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. at least 90% as popular.

BabyNames %>% 
  filter( count >= 0.90*max( count ))
     name sex count year
1   Linda   F 99674 1947
2   James   M 94758 1947
3  Robert   M 91652 1947
4   Linda   F 96210 1948
5   Linda   F 90994 1949
6 Michael   M 90629 1956
7 Michael   M 92711 1957
8 Michael   M 90512 1958

Frequently, however, the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank().

The function rank() does something very simple: it replaces each number in a set with where that number stands with respect to the others. For instance, the rank of the number 5 in the set 2,5,4,7,2,9,9,8 is 4 because 5 is the 4th smallest number in the set. (The three numbers 2, 2, 4 are smaller than 5.)
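
The example in the text can be checked directly with base R. The variable names x and r are chosen here only for illustration.

```r
# The set from the text
x <- c(2, 5, 4, 7, 2, 9, 9, 8)

# rank() returns one rank per element, in the original order of x
r <- rank(x)

# Pick out the rank belonging to the number 5
r[x == 5]   # 4
```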

Note that rank() is not a data verb. Data verbs take a data table as input and return a data table. In contrast, rank() takes a variable as input and returns a set of the same size that tells where each number in the set stands. The smallest number in a set of \(n\) numbers will have rank 1; the largest will have rank \(n\).[1] Since rank() transforms a variable into a variable, it’s particularly suitable for use in mutate() and filter().

To illustrate, here are the 10 most popular names in the BabyNames data table. (Note: desc() is used so that rank() will work in descending order, rather than from smallest to biggest.)

BabyNames %>%
  filter( rank( desc(count) ) <= 10 )
      name sex count year
1    Linda   F 99674 1947
2    James   M 94758 1947
3   Robert   M 91652 1947
4    Linda   F 96210 1948
5    James   M 88610 1948
6    Linda   F 90994 1949
7  Michael   M 88481 1954
8  Michael   M 90629 1956
9  Michael   M 92711 1957
10 Michael   M 90512 1958
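
The effect of desc() can be seen on a short vector. This is a minimal sketch; the values in x are made up for illustration.

```r
library(dplyr)

# Made-up values, no ties
x <- c(30, 10, 50, 20)

# Ordinary ranking: 1 goes to the smallest value
up <- rank(x)           # 3 1 4 2

# Ranking in descending order: 1 goes to the biggest value
down <- rank(desc(x))   # 2 4 1 3

# The two biggest values are those with descending rank 1 or 2
top2 <- x[down <= 2]    # 30 50
```

A test such as rank( desc(count) ) <= 10 works the same way: it is true only for the cases holding the ten biggest values.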

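Of course, rank() can be combined with other data verbs, such as group_by(). After grouping, the ranking is done separately within each group, so the following gives the 10 biggest cases for each sex.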

BabyNames %>%
  group_by( sex ) %>%
  filter( rank( desc(count) ) <= 10 )
Source: local data frame [20 x 4]
Groups: sex

      name sex count year
1     Mary   F 73981 1921
2     Mary   F 72172 1922
3     Mary   F 71631 1923
4     Mary   F 73520 1924
5    James   M 87428 1946
6    Linda   F 99674 1947
7     Mary   F 71679 1947
8    James   M 94758 1947
9   Robert   M 91652 1947
10    John   M 88318 1947
11   Linda   F 96210 1948
12   James   M 88610 1948
13   Linda   F 90994 1949
14   Linda   F 80433 1950
15   Linda   F 73928 1951
16 Michael   M 88481 1954
17 Michael   M 88281 1955
18 Michael   M 90629 1956
19 Michael   M 92711 1957
20 Michael   M 90512 1958
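
Ranking within groups is easier to follow on a table small enough to check by eye. The table toy below is hypothetical (made-up names and counts), and the cutoff of 2 is chosen only to keep the result short.

```r
library(dplyr)

# Hypothetical toy table: three cases per sex
toy <- data.frame(
  name  = c("Ann", "Bea", "Cat", "Dan", "Eli", "Fox"),
  sex   = c("F", "F", "F", "M", "M", "M"),
  count = c(12, 30, 7, 25, 3, 18),
  stringsAsFactors = FALSE
)

# Keep the 2 biggest cases within each sex
top2 <- toy %>%
  group_by(sex) %>%
  filter(rank(desc(count)) <= 2) %>%
  ungroup()
```

Because the ranks are computed within each group, "Ann" (count 12) is kept as the second biggest among the F cases, even though her count is below that of "Fox" (18), who is second biggest among the M cases.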

You can also perform other tasks, for instance finding the most popular names over all time by adding up the counts over the years:

BabyNames %>%
  group_by( sex, name ) %>% 
  summarise( count=sum(count) ) %>%
  filter( rank( desc(count) ) <= 10 ) %>% 
  arrange( desc(count) )
Source: local data frame [20 x 3]
Groups: sex

   sex      name   count
1    M     James 5091189
2    M      John 5073958
3    M    Robert 4789776
4    M   Michael 4293460
5    F      Mary 4112464
6    M   William 4038447
7    M     David 3565229
8    M    Joseph 2557792
9    M   Richard 2552302
10   M   Charles 2356886
11   M    Thomas 2275889
12   F Elizabeth 1591439
13   F  Patricia 1570135
14   F  Jennifer 1461186
15   F     Linda 1450328
16   F   Barbara 1432543
17   F  Margaret 1238016
18   F     Susan 1120083
19   F   Dorothy 1105281
20   F     Sarah 1055860

Ties

Often, there are ties. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were not quite tied.

Some rank-like functions are useful if ties are an issue. For instance, row_number() breaks ties in favor of the first case encountered.
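
The different treatments of ties can be compared on a short vector with one tied pair. In this sketch the values in x are made up. The function min_rank() is not discussed in the text above; it is another dplyr ranking function, included here only for comparison.

```r
library(dplyr)

# Made-up values: the two 20s are tied
x <- c(10, 20, 20, 30)

# rank(): tied values share the mean of the ranks they would occupy (2 and 3)
r_mean <- rank(x)         # 1.0 2.5 2.5 4.0

# min_rank(): tied values share the smallest of those ranks
r_min <- min_rank(x)      # 1 2 2 4

# row_number(): ties are broken in favor of the first case encountered
r_first <- row_number(x)  # 1 2 3 4
```

With row_number(), a test such as row_number( desc(count) ) <= 10 keeps exactly 10 cases (provided there are at least 10), whereas with rank() a tie at the cutoff can keep more or fewer.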

Examples


  1. Unless there are ties for the smallest or the largest.