Biggest, smallest, and in between

Many questions take forms such as these:

“Find the largest …”
“Find the three largest …”
“Find the smallest within each group …”

The functions min() and max() are obvious candidates for carrying out such tasks, but they don’t do quite the right thing. For instance

BabyNames %>% 
  summarise( biggest=max( count ) )

  biggest
1   99674

quickly reveals that the most popular name in any single year was given 99674 times. But the result doesn’t indicate what that popular name was or in what year it was given.

What’s needed here is a data verb that will return the one biggest case. The job of choosing cases that meet a criterion belongs to filter().

BabyNames %>% 
  filter( count==max( count ) )

   name sex count year
1 Linda   F 99674 1947

Note the difference between filter() and summarise(): rather than creating a new variable as summarise() does (it’s named biggest in the example), filter() needs a criterion or test. The criterion count==max( count ) (with the double equals sign ==) asks for the cases where the value of count matches the largest value of count. That will be the biggest case.

It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. more than 90% as popular.

BabyNames %>% 
  filter( count > 0.90*max( count ))

     name sex count year
1   Linda   F 99674 1947
2   James   M 94758 1947
3  Robert   M 91652 1947
4   Linda   F 96210 1948
5   Linda   F 90994 1949
6 Michael   M 90629 1956
7 Michael   M 92711 1957
8 Michael   M 90512 1958

Frequently, however, the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank().

The rank() function does something very simple: it replaces each number in a set with the position of that number relative to the others. For instance, the rank of the number 5 in the set 2,5,4,7,2,9,9,8 is 4 because 5 is the 4th smallest number in the set. (The three numbers 2, 2, 4 are smaller than 5.)
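As a quick check of this example, here is a minimal sketch that applies base R's rank() to the same set. The variable names x and r are chosen only for illustration.

```r
# The set from the example in the text
x <- c(2, 5, 4, 7, 2, 9, 9, 8)

# rank() returns one rank per element, in the original order
r <- rank(x)
r
# 1.5 4.0 3.0 5.0 1.5 7.5 7.5 6.0

# The rank of the number 5
r[x == 5]
# 4
```

The two 2s share rank 1.5 and the two 9s share rank 7.5. How rank() treats such ties is covered in the Ties section.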

Note that rank() is not a data verb. Data verbs take a data table as input and return a data table. In contrast, rank() takes a variable as input and returns a set of the same size that tells where each number in the set stands. The smallest number in a set of \(n\) numbers will have rank 1; the largest will have rank \(n\).¹ Since rank() transforms a variable into a variable, it’s particularly suitable for use in mutate() and filter().

To illustrate, here are the 10 largest cases in the BabyNames data table, that is, the 10 combinations of name, sex, and year with the highest counts. (Note: desc() is used so that rank() will work in descending order, rather than from smallest to biggest.)

BabyNames %>%
  filter( rank( desc(count) ) <= 10 )

      name sex count year
1    Linda   F 99674 1947
2    James   M 94758 1947
3   Robert   M 91652 1947
4    Linda   F 96210 1948
5    James   M 88610 1948
6    Linda   F 90994 1949
7  Michael   M 88481 1954
8  Michael   M 90629 1956
9  Michael   M 92711 1957
10 Michael   M 90512 1958
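To see the ranks themselves rather than only filtering on them, they can be stored as new variables with mutate(). The sketch below uses a small made-up table (Toy and its names and counts are hypothetical, not taken from BabyNames) and shows the effect of desc().

```r
library(dplyr)

# Hypothetical toy table; not part of BabyNames
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee"),
  count = c(30, 10, 40, 20)
)

Ranked <- Toy %>%
  mutate(
    from_smallest = rank(count),        # 1 = smallest count
    from_largest  = rank(desc(count))   # 1 = largest count
  )
Ranked
```

In this toy table, Cy has the largest count, so from_largest is 1 for Cy while from_smallest is 4.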

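Of course, rank() can be used together with other data verbs, such as group_by(). The ranking is then done separately within each group, so the result below contains the 10 largest cases for each sex.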

BabyNames %>%
  group_by( sex ) %>%
  filter( rank( desc(count) ) <= 10 )

Source: local data frame [20 x 4]
Groups: sex

      name sex count year
1     Mary   F 73981 1921
2     Mary   F 72172 1922
3     Mary   F 71631 1923
4     Mary   F 73520 1924
5    James   M 87428 1946
6    Linda   F 99674 1947
7     Mary   F 71679 1947
8    James   M 94758 1947
9   Robert   M 91652 1947
10    John   M 88318 1947
11   Linda   F 96210 1948
12   James   M 88610 1948
13   Linda   F 90994 1949
14   Linda   F 80433 1950
15   Linda   F 73928 1951
16 Michael   M 88481 1954
17 Michael   M 88281 1955
18 Michael   M 90629 1956
19 Michael   M 92711 1957
20 Michael   M 90512 1958

You can also perform other tasks, for instance finding the most popular names over all time by adding up the counts over the years:

BabyNames %>%
  group_by( sex, name ) %>% 
  summarise( count=sum(count) ) %>%
  filter( rank( desc(count) ) <= 10 ) %>% 
  arrange( desc(count) )

Source: local data frame [20 x 3]
Groups: sex

   sex      name   count
1    M     James 5091189
2    M      John 5073958
3    M    Robert 4789776
4    M   Michael 4293460
5    F      Mary 4112464
6    M   William 4038447
7    M     David 3565229
8    M    Joseph 2557792
9    M   Richard 2552302
10   M   Charles 2356886
11   M    Thomas 2275889
12   F Elizabeth 1591439
13   F  Patricia 1570135
14   F  Jennifer 1461186
15   F     Linda 1450328
16   F   Barbara 1432543
17   F  Margaret 1238016
18   F     Susan 1120083
19   F   Dorothy 1105281
20   F     Sarah 1055860
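One detail in the pipeline above is easy to miss: after summarise(), the result is still grouped by sex (as the Groups: sex line in the output indicates), so rank() is computed separately for each sex. The sketch below illustrates this on a hypothetical toy table. It assumes dplyr 1.0.0 or later, where the .groups argument is available; here it only spells out what summarise() does by default in this situation.

```r
library(dplyr)

# Hypothetical toy table; not part of BabyNames
Toy <- data.frame(
  name  = c("Ann", "Ann", "Bea", "Al", "Al", "Bo"),
  sex   = c("F", "F", "F", "M", "M", "M"),
  year  = c(2000, 2001, 2000, 2000, 2001, 2000),
  count = c(5, 7, 20, 3, 4, 1)
)

TopBySex <- Toy %>%
  group_by(sex, name) %>%
  # Dropping only the last grouping level (name) leaves the result grouped by sex
  summarise(count = sum(count), .groups = "drop_last") %>%
  # rank() therefore runs within each sex, keeping one top name per sex
  filter(rank(desc(count)) <= 1) %>%
  arrange(desc(count))
TopBySex
```

The result has one row per sex: Bea (total 20) for F and Al (total 7) for M.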

Ties

Often, there are ties. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were not quite tied.
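A minimal sketch of this behavior, using base R's rank() and its ties.method argument on a made-up vector (the variable names are illustrative only):

```r
# Made-up values with a tie between the 2nd and 3rd smallest
x <- c(10, 20, 20, 30)

# Default: tied values share the mean of the ranks they span
avg <- rank(x)
avg
# 1.0 2.5 2.5 4.0

# Tied values all get the lowest of the ranks they span
lowest <- rank(x, ties.method = "min")
lowest
# 1 2 2 4

# Ties are broken in order of appearance
first <- rank(x, ties.method = "first")
first
# 1 2 3 4
```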

Some rank-like functions are useful if ties are an issue. For instance, row_number() breaks ties in favor of the first case encountered.

Examples

Find the 3rd most popular name in each year

PopularByYear <- BabyNames %>% 
  group_by( year ) %>% 
  filter( rank( desc(count) ) == 3) 
head( PopularByYear, 10 )

Source: local data frame [10 x 4]
Groups: year

      name sex count year
1     Mary   F  7065 1880
2     Mary   F  6919 1881
3     Mary   F  8148 1882
4     Mary   F  8012 1883
5  William   M  8897 1884
6  William   M  8044 1885
7  William   M  8252 1886
8  William   M  7470 1887
9  William   M  8705 1888
10 William   M  7772 1889

tail( PopularByYear, 10 )

Source: local data frame [10 x 4]
Groups: year

        name sex count year
125    Emily   F 25021 2004
126  Michael   M 23789 2005
127   Joshua   M 22298 2006
128    Ethan   M 21013 2007
129    Ethan   M 20194 2008
130    Ethan   M 19828 2009
131   Sophia   F 20601 2010
132 Isabella   F 19850 2011
133 Isabella   F 19026 2012
134   Olivia   F 18256 2013

Find the top 3 most popular names in each year

This is almost identical to the above, but with == replaced by <=.

Top3ByYear <- BabyNames %>% 
  group_by( year ) %>% 
  filter( rank( desc(count) ) <= 3) %>%  
  arrange( year )
head( Top3ByYear, 15 )

Source: local data frame [15 x 4]
Groups: year

      name sex count year
1     Mary   F  7065 1880
2     John   M  9655 1880
3  William   M  9532 1880
4     Mary   F  6919 1881
5     John   M  8769 1881
6  William   M  8524 1881
7     Mary   F  8148 1882
8     John   M  9557 1882
9  William   M  9298 1882
10    Mary   F  8012 1883
11    John   M  8894 1883
12 William   M  8387 1883
13    Mary   F  9217 1884
14    John   M  9388 1884
15 William   M  8897 1884

tail( Top3ByYear, 15 )

Source: local data frame [15 x 4]
Groups: year

        name sex count year
388 Isabella   F 22273 2009
389    Jacob   M 21133 2009
390    Ethan   M 19828 2009
391 Isabella   F 22872 2010
392   Sophia   F 20601 2010
393    Jacob   M 22074 2010
394   Sophia   F 21799 2011
395 Isabella   F 19850 2011
396    Jacob   M 20310 2011
397   Sophia   F 22245 2012
398     Emma   F 20871 2012
399 Isabella   F 19026 2012
400   Sophia   F 21075 2013
401     Emma   F 20788 2013
402   Olivia   F 18256 2013

If there is a tie, there’s no guarantee that any name will have rank exactly 3: two names tied for 3rd and 4th place each get rank 3.5 from rank(), so neither passes the test. Using row_number() instead of rank() would help here.
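The sketch below illustrates the problem on a hypothetical one-year table in which two names tie for 3rd place. The table OneYear and its contents are invented for illustration. The dplyr function min_rank(), which is not used elsewhere in this section, is included as one further option for keeping all tied names.

```r
library(dplyr)

# Hypothetical counts for one year; "Cal" and "Dot" tie for 3rd place
OneYear <- data.frame(
  name  = c("Abe", "Bea", "Cal", "Dot", "Eve"),
  count = c(50, 40, 30, 30, 10)
)

# rank() gives both tied names rank 3.5, so no case has rank exactly 3
WithRank <- OneYear %>% filter(rank(desc(count)) == 3)
nrow(WithRank)
# 0

# row_number() gives the tied names 3 and 4, in order of appearance
WithRowNumber <- OneYear %>% filter(row_number(desc(count)) == 3)
WithRowNumber$name

# min_rank() gives both tied names rank 3, so a top-3 filter keeps 4 cases
WithMinRank <- OneYear %>% filter(min_rank(desc(count)) <= 3)
nrow(WithMinRank)
# 4
```

Which choice is appropriate depends on the question: row_number() always returns exactly one 3rd-place case, but the choice between tied names is then decided only by their order in the table.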

¹ Unless there are ties for the smallest or the largest.