Chapter 13 Ranks

Many questions take forms such as these:

  • “Find the largest …”
  • “Find the three largest …”
  • “Find the smallest within each group …”

The functions min() and max() are reasonable candidates for carrying out such tasks, but they are limited. For instance, to find the name whose yearly count was the highest:

BabyNames %>% 
  filter(count == max(count))
name sex count year
Linda F 99674 1947

The filter() function needs a criterion. The criterion count == max(count) passes through the case where the value of count matches the largest value of count. That will be the biggest case. Note the double equals sign ‘==’ used in evaluating the criterion.
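
As a minimal sketch of how this criterion behaves, the following uses a small made-up data frame (the name Tiny and its values are hypothetical, not taken from BabyNames). The comparison produces one TRUE or FALSE per row, and filter() keeps only the rows where it is TRUE.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
Tiny <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  count = c(12, 40, 7, 25)
)

# One logical value per row: TRUE only where count equals the maximum
is_biggest <- Tiny$count == max(Tiny$count)

# filter() keeps the rows where the criterion is TRUE
biggest <- Tiny %>%
  filter(count == max(count))
```

If several rows shared the maximum value, all of them would pass the criterion.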

It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. more than 90% as popular.

BabyNames %>% 
  filter(count > 0.90 * max(count))
name sex count year
Linda F 99674 1947
James M 94758 1947
Robert M 91652 1947
Linda F 96210 1948
Linda F 90994 1949
Michael M 90629 1956
… and so on for 8 rows altogether.
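
The same idea can be tried on a small scale. In this sketch the data frame Tiny and its values are hypothetical, chosen so that the cutoff is easy to check by eye.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
Tiny <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Dee"),
  count = c(95, 100, 60, 91)
)

# The cutoff is computed from the whole column: 0.90 * 100
cutoff <- 0.90 * max(Tiny$count)

# Keep every row whose count exceeds the cutoff
near_top <- Tiny %>%
  filter(count > 0.90 * max(count))
```

Here three of the four rows pass, since only Cy's count falls below the cutoff.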

Frequently the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank(). The rank() function does something simple but powerful: it replaces each number in a set with its position relative to the others. For instance, look at the tiny data frame Set shown in Table 13.1. What’s the rank of the number 5 in the numbers variable?

Table 13.1: A data frame with one variable. We will refer to this as Set

numbers
2
5
4
7
2
9
9
8

The rank of the number 5 in the set is four. Why? 5 is the fourth smallest number in the set. (Three of the numbers in the set, namely 2, 2, and 4, are smaller than 5.) Unless there are ties, the smallest number in a set of \(n\) numbers will have rank 1 and the largest will have rank \(n\). Table 13.2 shows the rank of each number in the set:

Set %>% mutate(the_rank = rank(numbers))

Table 13.2: The ranks of the values in numbers.

numbers the_rank
2 1.5
5 4.0
4 3.0
7 5.0
2 1.5
9 7.5
9 7.5
8 6.0

Note that the two 2’s have the same rank, as do the two 9’s. They’re tied.
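
One way to see where these values come from is to compute them by hand. The sketch below, in base R, counts how many values are strictly smaller than each number and then adds the average position within its group of ties. The helper name rank_by_hand is hypothetical, introduced only for illustration. It is intended to agree with rank() under its default tie handling, assuming there are no missing values.

```r
# The numbers variable from Table 13.1
numbers <- c(2, 5, 4, 7, 2, 9, 9, 8)

# Hypothetical helper: rank = (count of strictly smaller values)
#                            + (average position among the values tied with v)
rank_by_hand <- function(x) {
  sapply(x, function(v) sum(x < v) + (sum(x == v) + 1) / 2)
}

by_hand  <- rank_by_hand(numbers)
built_in <- rank(numbers)
```

For the number 5, three values are smaller and there are no ties, giving 3 + 1 = 4. For each 2, no values are smaller and the two tied values would occupy positions 1 and 2, giving the mean 1.5.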

Suppose you want to find the 3rd most popular name of all time. Use rank().

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter(rank(desc(total)) == 3) 

Table 13.3: The third most popular name of all time

name total
Robert 4809858

Or, to find the top three most popular names, replace == in the above by <=.

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter( rank(desc(total)) <= 3) 

Table 13.4: The all-time top three most popular names in BabyNames.

name total
James 5114325
John 5095590
Robert 4809858
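
The role of desc() is easier to see when the rank is stored as its own variable before filtering. The following sketch does this on a made-up data frame with the same kinds of variables as BabyNames. The names Toy, Totals, and popularity, and all of the values, are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with BabyNames-like variables (values invented)
Toy <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Ana", "Ben", "Cy", "Dee"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001, 2001),
  count = c(10, 30, 5, 20, 15, 5, 40)
)

# Add up the counts over all years: Ana 30, Ben 45, Cy 10, Dee 40
Totals <- Toy %>%
  group_by(name) %>%
  summarise(total = sum(count))

# desc() reverses the ordering, so rank 1 goes to the largest total
Totals <- Totals %>%
  mutate(popularity = rank(desc(total)))

# Keep the two most popular names, listed from most to less popular
top_two <- Totals %>%
  filter(popularity <= 2) %>%
  arrange(popularity)
```

Without desc(), rank 1 would go to the smallest total, and the same filter would return the two least popular names.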

When applied to grouped data, rank() will be calculated separately within each group. That is, the rank of a value will be with respect to the other cases in that group. For instance, here’s the third most popular name each year.

BabyNames %>% 
  group_by(year) %>% 
  filter(rank(desc(count)) == 3) 

Table 13.5: The third most popular name in each year

name sex count year
Mary F 7065 1880
Mary F 6919 1881
Mary F 8148 1882
Mary F 8012 1883
William M 8897 1884
William M 8044 1885
… and so on for 134 rows altogether.
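
To check that ranks restart within each group, it can help to keep the rank as a variable and inspect it. This is a small sketch on hypothetical data. The names Toy, Ranked, and rank_in_year are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
Toy <- data.frame(
  name  = c("Ana", "Ben", "Cy", "Ana", "Ben", "Cy"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  count = c(10, 30, 5, 20, 35, 50)
)

# Because of group_by(), each count is ranked only against its own year
Ranked <- Toy %>%
  group_by(year) %>%
  mutate(rank_in_year = rank(desc(count))) %>%
  ungroup()

# The second most popular name in each year
second_each_year <- Ranked %>%
  filter(rank_in_year == 2)
```

One caution: when counts are tied within a group, rank() can assign a fractional rank such as 2.5, in which case a test like == 2 or == 3 may match no row for that group.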

Sometimes, two or more numbers are tied in rank. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were even slightly different. There are other rank-like transformation verbs that handle ties differently. For instance, row_number() breaks ties in favor of the first case encountered. (See Table 13.6.)

Set %>% 
  mutate(the_rank = rank(numbers), 
         ties_broken = row_number(numbers))

Table 13.6: row_number() breaks ties in rank.

numbers the_rank ties_broken
2 1.5 1
5 4.0 4
4 3.0 3
7 5.0 5
2 1.5 2
9 7.5 7
9 7.5 8
8 6.0 6
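
As examples of other rank-like functions, dplyr also provides min_rank() and dense_rank(). Neither is named in the text above, so the comparison below is offered only as a sketch of how tie handling can differ, applied to the same Set data frame. The variable names lowest and dense, and the result name Compared, are hypothetical.

```r
library(dplyr)

# The Set data frame from Table 13.1
Set <- data.frame(numbers = c(2, 5, 4, 7, 2, 9, 9, 8))

Compared <- Set %>%
  mutate(
    the_rank    = rank(numbers),        # tied values share the mean rank
    ties_broken = row_number(numbers),  # ties broken by order of appearance
    lowest      = min_rank(numbers),    # tied values share the smallest rank, leaving gaps
    dense       = dense_rank(numbers)   # tied values share a rank, with no gaps
  )
```

With min_rank(), both 2's receive rank 1 and the next value, 4, receives rank 3. With dense_rank(), the 4 receives rank 2, so the largest rank equals the number of distinct values.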

13.1 Exercises

Problem 13.1: For each sex, find the 5 most popular names in BabyNames, adding up the counts over all the years.

Problem 13.2: Let’s investigate how the diversity of name use has changed over time. Using BabyNames, for each year, find the fraction of all babies born in that year who were given a name in the top 100 for that year. Make a line graph showing how this fraction has changed over the years.

  1. Produce a data table showing, for each year, the number of babies given the 100 most popular names in that year and the number of babies given names not in the top 100. Like this …
year ranking total
2013 Below 2588290
2013 Top_100 1019807
2012 Below 2599397
2012 Top_100 1039441
2011 Below 2589001
2011 Top_100 1054997
… and so on for 268 rows altogether.
  2. Use pivot_wider() and mutate() to find the fraction of babies with names in the top 100 each year.

  3. Make a line graph showing how this fraction has changed over the years. Be sure to include an informative title and axis labels.

Problem 13.3: Find the names in BabyNames that were very popular in some year but very unpopular in some other year. Take being in the top 50 in a year as being “very popular” and being below the top 1000 in a year as being “very unpopular.”