Chapter 13 Ranks

Many questions take forms such as these:

“Find the largest …”
“Find the three largest …”
“Find the smallest within each group …”

The functions min() and max() are reasonable candidates for carrying out such tasks, but they are limited. For instance, to find the name whose yearly count was the highest:

BabyNames %>% 
  filter(count == max(count))

name	sex	count	year
Linda	F	99674	1947

The filter() function needs a criterion. The criterion count == max(count) passes through the case where the value of count matches the largest value of count. That will be the biggest case. (Note the double equals sign ‘==’ used in evaluating the criterion.)
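As a minimal sketch of the same pattern, the code below uses a small made-up data frame, Toy, which is a hypothetical stand-in for BabyNames (same variable names, invented values). It shows that min() works the same way as max(), and that a tie for the largest value passes through every tied case.

```r
library(dplyr)

# Toy is hypothetical: it mimics the variables of BabyNames with invented values.
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee"),
  sex   = c("F", "M", "M", "F"),
  count = c(120, 300, 45, 300),
  year  = c(2000, 2000, 2001, 2001)
)

# Case(s) whose count matches the largest count; both tied rows pass through.
largest <- Toy %>%
  filter(count == max(count))

# The same pattern with min() picks out the smallest case.
smallest <- Toy %>%
  filter(count == min(count))
```

Here largest contains two rows because two cases share the maximum count. This is one of the limitations of the min() and max() approach: the criterion cannot say how many cases to keep.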

It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. with a count more than 90% of the largest count.

BabyNames %>% 
  filter(count > 0.90 * max(count))

name	sex	count	year
Linda	F	99674	1947
James	M	94758	1947
Robert	M	91652	1947
Linda	F	96210	1948
Linda	F	90994	1949
Michael	M	90629	1956
… and so on for 8 rows altogether.
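The cutoff implied by this criterion can be checked by hand. The short sketch below uses only the numbers displayed in the output above; the names max_count, cutoff, and shown are introduced here for illustration.

```r
# Largest yearly count, copied from the output above (Linda, 1947).
max_count <- 99674

# Cutoff implied by the criterion count > 0.90 * max(count).
cutoff <- 0.90 * max_count

# Counts from the six rows displayed above.
shown <- c(99674, 94758, 91652, 96210, 90994, 90629)

# TRUE when every displayed count exceeds the cutoff.
all_pass <- all(shown > cutoff)
```

The cutoff works out to 89706.6, so any case with a count of 89707 or more passes the criterion.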

Frequently the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank(). The rank() function does something simple but powerful: it replaces each number in a set with its position relative to the others when the set is ordered from smallest to largest. For instance, look at the tiny data frame Set shown in Table 13.1. What’s the rank of the number 5 in the numbers variable?

Table 13.1: A data frame with one variable. We will refer to this as Set.

numbers
2
5
4
7
2
9
9
8

The rank of the number 5 in the set is four. Why? 5 is the fourth smallest number in the set. (Three of the numbers in the set, namely 2, 2, and 4, are smaller than 5.) Unless there are ties, the smallest number in a set of \(n\) numbers will have rank 1 and the largest will have rank \(n\). Table 13.2 shows the rank of each number in the set:

Set %>% mutate(the_rank = rank(numbers))

Table 13.2: The ranks of the values in numbers.

numbers	the_rank
2	1.5
5	4.0
4	3.0
7	5.0
2	1.5
9	7.5
9	7.5
8	6.0

Note that the two 2’s have the same rank, as do the two 9’s. They’re tied.
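To try this out, Set can be rebuilt by hand from the values in Table 13.1. The sketch below does so with data.frame() (an assumption about how Set was created) and also ranks from the largest value downward by wrapping the variable in desc(). The variable name from_largest is introduced here for illustration.

```r
library(dplyr)

# Rebuild Set from the values listed in Table 13.1.
Set <- data.frame(numbers = c(2, 5, 4, 7, 2, 9, 9, 8))

Ranked <- Set %>%
  mutate(
    the_rank     = rank(numbers),        # 1 = smallest
    from_largest = rank(desc(numbers))   # 1 = largest
  )
```

With desc(), the two 9’s share rank 1.5 and the two 2’s share rank 7.5, so the ordering is reversed but ties are still treated the same way.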

Suppose you want to find the 3rd most popular name of all time. Use rank().

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter(rank(desc(total)) == 3)

Table 13.3: The third most popular name of all time

name	total
Robert	4809858

Or, to find the top three most popular names, replace == in the above by <=.

BabyNames %>% 
  group_by(name) %>%
  summarise(total = sum(count)) %>%
  filter( rank(desc(total)) <= 3)

Table 13.4: The all-time top three most popular names in BabyNames.

name	total
James	5114325
John	5095590
Robert	4809858
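Note that filter() keeps the rows it passes in their existing order; it does not sort them by rank. The sketch below illustrates this with a hypothetical data frame, Totals, whose names and totals are invented. Adding arrange() puts the result in rank order.

```r
library(dplyr)

# Totals is hypothetical: invented names and totals, not taken from BabyNames.
Totals <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee", "Eve"),
  total = c(500, 900, 150, 700, 300)
)

# Keep the cases whose descending rank is 3 or better.
# The rows come out in their original order: Ann, Bob, Dee.
top3 <- Totals %>%
  filter(rank(desc(total)) <= 3)

# Sort from most to least popular: Bob, Dee, Ann.
top3_sorted <- top3 %>%
  arrange(desc(total))
```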

When applied to grouped data, rank() will be calculated separately within each group. That is, the rank of a value will be with respect to the other cases in that group. For instance, here’s the third most popular name each year.

BabyNames %>% 
  group_by(year) %>% 
  filter(rank(desc(count)) == 3)

Table 13.5: The third most popular name in each year

name	sex	count	year
Mary	F	7065	1880
Mary	F	6919	1881
Mary	F	8148	1882
Mary	F	8012	1883
William	M	8897	1884
William	M	8044	1885
… and so on for 134 rows altogether.
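The effect of grouping is easiest to see on a small example. The sketch below uses a hypothetical data frame, Toy, with invented counts for two years, and compares the grouped calculation with the same filter applied to ungrouped data.

```r
library(dplyr)

# Toy is hypothetical: invented counts for four names in each of two years.
Toy <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee", "Ann", "Bob", "Cy", "Dee"),
  year  = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  count = c(40, 30, 20, 10, 15, 45, 35, 25)
)

# Grouped: ranks are computed within each year, so one row per year passes.
third_each_year <- Toy %>%
  group_by(year) %>%
  filter(rank(desc(count)) == 3)

# Ungrouped: ranks are computed over all eight rows, so only one row passes.
third_overall <- Toy %>%
  filter(rank(desc(count)) == 3)
```

The grouped version returns Cy for 2000 and Dee for 2001. The ungrouped version returns the single row for Cy in 2001, whose count of 35 is the third largest overall.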

Sometimes, two or more numbers are tied in rank. The rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were even slightly different. There are other rank-like transformation verbs that handle ties differently. For instance, row_number() breaks ties in favor of the first case encountered. (See Table 13.6.)

Set %>% 
  mutate(the_rank = rank(numbers), 
         ties_broken = row_number(numbers))

Table 13.6: row_number() breaks ties in rank.

numbers	the_rank	ties_broken
2	1.5	1
5	4.0	4
4	3.0	3
7	5.0	5
2	1.5	2
9	7.5	7
9	7.5	8
8	6.0	6
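Two more rank-like functions from dplyr, min_rank() and dense_rank(), are sketched below for comparison. They are not discussed in the text above; the variable names lowest_rank and no_gaps are introduced here for illustration. With min_rank(), tied values all get the lowest of the ranks they span. With dense_rank(), tied values share a rank and no rank numbers are skipped afterward.

```r
library(dplyr)

# Rebuild Set from the values listed in Table 13.1.
Set <- data.frame(numbers = c(2, 5, 4, 7, 2, 9, 9, 8))

Compared <- Set %>%
  mutate(
    the_rank    = rank(numbers),        # ties get the mean rank
    ties_broken = row_number(numbers),  # ties broken by order of appearance
    lowest_rank = min_rank(numbers),    # ties get the lowest rank they span
    no_gaps     = dense_rank(numbers)   # ties share a rank, no gaps after
  )
```

The choice matters when filtering. If two cases are tied for second and third place, rank() gives both the value 2.5, so a criterion such as rank(...) == 3 passes neither of them.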

13.1 Exercises

Problem 13.1: For each sex, find the 5 most popular names in BabyNames, adding up the counts over all the years.

Problem 13.2: Let’s investigate how the diversity of name use has changed over time. Using BabyNames, for each year, find the fraction of all babies born in that year who were given a name in the top 100 for that year. Make a line graph showing how this fraction has changed over the years.

Produce a data table showing, for each year, the number of babies given the 100 most popular names in that year and the number of babies given names not in the top 100. Like this …

year	ranking	total
2013	Below	2588290
2013	Top_100	1019807
2012	Below	2599397
2012	Top_100	1039441
2011	Below	2589001
2011	Top_100	1054997
… and so on for 268 rows altogether.

Use pivot_wider() and mutate() to find the fraction of babies with names in the top 100 each year.
Make a line graph showing how this fraction has changed over the years. Be sure to include an informative title and axis labels.

Problem 13.3: Find the names in BabyNames that were very popular in some year but which were very unpopular in some other year. Take being in the top 50 in a year as being “very popular” and being below the top 1000 in a year as being “very unpopular.”