Chapter 12 Ranks

Many questions take forms such as these:

  • “Find the largest …”
  • “Find the three largest …”
  • “Find the smallest within each group …”

The functions min() and max() are reasonable candidates for carrying out such tasks, but they are limited. For instance, to find the name whose yearly count was the highest:

BabyNames %>% 
  filter(count == max(count))

name sex count year
Linda F 99674 1947

The filter() function needs a criterion. The criterion count == max(count) passes through the case where the value of count matches the largest value of count. (Note the double equals sign ‘==’ used in evaluating the criterion.) That will be the biggest case.
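To watch the criterion being evaluated case by case, here is a minimal sketch on a data frame small enough to check by eye. The data frame Tiny and its values are invented for illustration; they are not taken from BabyNames.

```r
library(dplyr)

# Tiny is a made-up stand-in for BabyNames (hypothetical names and counts).
Tiny <- tibble(
  name  = c("Ana", "Ben", "Cal", "Dee"),
  count = c(120, 340, 95, 340)
)

# max(count) is a single number; count == max(count) is TRUE or FALSE
# for each case. Storing it in a variable makes the criterion visible.
Tiny %>%
  mutate(is_biggest = count == max(count))

# filter() keeps only the cases for which the criterion is TRUE.
# Two cases share the largest count here, so both pass through.
Biggest <- Tiny %>%
  filter(count == max(count))
```

If several cases share the largest value, all of them pass the criterion, so the result can have more than one row.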

It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. at least 90% as popular.

BabyNames %>% 
  filter(count > 0.90 * max(count))

name sex count year
Linda F 99674 1947
James M 94758 1947
Robert M 91652 1947
Linda F 96210 1948
Linda F 90994 1949
Michael M 90629 1956
… and so on for 8 rows altogether.

Frequently the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank(). The rank() function does something simple but powerful: it replaces each number in a set with its position when the numbers are ordered from smallest to largest. For instance, look at the tiny data frame Set shown in Table 12.1. What’s the rank of the number 5 in the numbers variable?

Table 12.1: A data frame with one variable. We will refer to this as Set

numbers
2
5
4
7
2
9
9
8

The rank of the number 5 in the set is four. Why? 5 is the fourth smallest number in the set. (Three of the numbers in the set, namely 2, 2, and 4, are smaller than 5.) Unless there are ties, the smallest number in a set of \(n\) numbers will have rank 1 and the largest will have rank \(n\). Table 12.2 shows the rank of each number in the set:

Set %>% mutate(the_rank = rank(numbers))

Table 12.2: The ranks of the values in numbers.

numbers the_rank
2 1.5
5 4.0
4 3.0
7 5.0
2 1.5
9 7.5
9 7.5
8 6.0

Note that the two 2’s have the same rank, as do the two 9’s. They’re tied.
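The tied ranks in Table 12.2 can be reproduced by hand. The sketch below rebuilds Set from Table 12.1, sorts the values, and averages the positions that each tied value occupies. The names Sorted, ByHand, Direct, and position are introduced here only for illustration.

```r
library(dplyr)

# The data frame from Table 12.1.
Set <- tibble(numbers = c(2, 5, 4, 7, 2, 9, 9, 8))

# Sort the values and number the positions 1 through 8.
Sorted <- Set %>%
  arrange(numbers) %>%
  mutate(position = 1:n())

# Each distinct value gets the mean of the positions it occupies.
# The two 2's hold positions 1 and 2, so each is ranked 1.5;
# the two 9's hold positions 7 and 8, so each is ranked 7.5.
ByHand <- Sorted %>%
  group_by(numbers) %>%
  summarise(the_rank = mean(position))

# rank() produces the same ranks directly, in the original case order.
Direct <- Set %>%
  mutate(the_rank = rank(numbers))
```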

Suppose you want to find the 3rd most popular name of all time. Use rank() together with desc(), which reverses the ordering so that the largest total gets rank 1.

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter(rank(desc(total)) == 3) 

Table 12.3: The third most popular name of all time

name total
Robert 4809858

Or, to find the top three most popular names, replace == in the above by <=.

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter(rank(desc(total)) <= 3)

Table 12.4: The all-time top three most popular names in BabyNames.

name total
James 5114325
John 5095590
Robert 4809858
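The two pipelines above count from the top by wrapping the variable in desc(). As a small sketch of what that wrapping does, the following compares rank() with and without desc(). The data frame Totals and its values are made up for illustration.

```r
library(dplyr)

# Totals is a hypothetical summary table, standing in for the
# per-name totals computed from BabyNames.
Totals <- tibble(
  name  = c("Ana", "Ben", "Cal", "Dee", "Eli"),
  total = c(500, 1200, 300, 900, 700)
)

# rank(total) counts up from the smallest value;
# rank(desc(total)) counts up from the largest.
Ranked <- Totals %>%
  mutate(from_smallest = rank(total),
         from_largest  = rank(desc(total)))

# filter() keeps the qualifying cases in their existing order;
# arrange() is added here to list them from most to least popular.
TopThree <- Totals %>%
  filter(rank(desc(total)) <= 3) %>%
  arrange(desc(total))
```

Note that filter() does not reorder cases, so an explicit arrange() is needed whenever the output should appear in rank order.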

When applied to grouped data, rank() will be calculated separately within each group. That is, the rank of a value will be with respect to the other cases in that group. For instance, here’s the third most popular name each year.

BabyNames %>% 
  group_by(year) %>% 
  filter(rank(desc(count)) == 3) 

Table 12.5: The third most popular name in each year

name sex count year
Mary F 7065 1880
Mary F 6919 1881
Mary F 8148 1882
Mary F 8012 1883
William M 8897 1884
William M 8044 1885
… and so on for 134 rows altogether.
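The effect of grouping on rank() is easiest to see side by side. This sketch computes the rank once over all cases and once within each year. The data frame Tiny and its counts are hypothetical, not real BabyNames values.

```r
library(dplyr)

# Hypothetical counts for three names in two years.
Tiny <- tibble(
  name  = c("Ana", "Ben", "Cal", "Ana", "Ben", "Cal"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  count = c(50, 80, 65, 90, 40, 70)
)

# Ungrouped: ranks run from 1 to 6 across all cases.
Overall <- Tiny %>%
  mutate(rank_overall = rank(desc(count)))

# Grouped by year: ranks restart at 1 within each year.
WithinYear <- Tiny %>%
  group_by(year) %>%
  mutate(rank_in_year = rank(desc(count))) %>%
  ungroup()
```

Using mutate() rather than filter() keeps every case, which is a convenient way to inspect the ranks before choosing a cutoff.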

Sometimes, two or more numbers are tied in rank. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were even slightly different. There are other rank-like transformation verbs that handle ties differently. For instance, row_number() breaks ties in favor of the first case encountered. (See Table 12.6.)

Set %>% 
  mutate(the_rank = rank(numbers), 
         ties_broken = row_number(numbers))

Table 12.6: row_number() breaks ties in rank.

numbers the_rank ties_broken
2 1.5 1
5 4.0 4
4 3.0 3
7 5.0 5
2 1.5 2
9 7.5 7
9 7.5 8
8 6.0 6
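As a further comparison, the sketch below adds two more members of the same dplyr family, min_rank() and dense_rank(), alongside the two functions in Table 12.6. These two functions are not discussed in the text above; the variable names lowest and no_gaps are chosen here only for illustration.

```r
library(dplyr)

# The data frame from Table 12.1.
Set <- tibble(numbers = c(2, 5, 4, 7, 2, 9, 9, 8))

Compared <- Set %>%
  mutate(the_rank    = rank(numbers),        # ties share the mean rank
         ties_broken = row_number(numbers),  # ties broken by order of appearance
         lowest      = min_rank(numbers),    # ties share the lowest rank
         no_gaps     = dense_rank(numbers))  # ties share a rank; no ranks skipped
```

The choice matters when filtering on an exact rank. With rank(), tied values can land on fractional ranks such as 1.5, so a criterion like == 2 may match no case at all, whereas row_number() assigns each whole number from 1 to \(n\) exactly once.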