Many questions ask about the biggest or smallest cases in a data frame.
Functions such as max() are reasonable candidates for carrying out such tasks, but they are limited. For instance, to find the name whose yearly count was the highest, use max() inside filter().
The filter() function needs a criterion. The criterion count == max(count) passes through the case where the value of count matches the largest value of count. That will be the biggest case. Note the double equals sign ‘==’ used in evaluating the criterion.
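As a minimal sketch of this criterion, the example below uses a small made-up data frame, toy_names, as a hypothetical stand-in for BabyNames. Its names and counts are invented for illustration only.

```r
library(dplyr)

# Hypothetical toy data standing in for BabyNames (all values are made up).
toy_names <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Ann", "Bob", "Cy"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  count = c(120, 300, 45, 150, 280, 60)
)

# Keep only the case(s) whose count equals the largest count.
# Note the double equals sign: == tests for equality, = would not.
biggest <- toy_names %>%
  filter(count == max(count))

biggest
```

If several cases shared the largest count, all of them would pass through the filter.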
It’s also possible to ask for the cases that are almost as popular as the biggest, e.g. at least 90% as popular.
… and so on for 8 rows altogether.
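The "at least 90% as popular" criterion can be sketched the same way. The data frame toy_names is again a hypothetical stand-in with invented values, so the number of rows it returns says nothing about the real data.

```r
library(dplyr)

# Hypothetical toy data standing in for BabyNames (all values are made up).
toy_names <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Ann", "Bob", "Cy"),
  year  = c(2000, 2000, 2000, 2001, 2001, 2001),
  count = c(120, 300, 45, 150, 280, 60)
)

# Keep every case whose count is at least 90% of the largest count.
nearly_biggest <- toy_names %>%
  filter(count >= 0.9 * max(count))

nearly_biggest
```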
Frequently the question will be framed in terms of the \(n\) biggest or smallest values, not as a fraction of the largest. To perform such tasks, a new transformation verb is helpful: rank().
The rank() function does something simple but powerful: it replaces each number in a set with where that number stands with respect to the others. For instance, look at the tiny data frame Set shown in Table 13.1. What’s the rank of the number 5 in the set?
Table 13.1: A data frame with one variable. We will refer to this as Set.
The rank of the number 5 in the set is four. Why? 5 is the fourth smallest number in the set. (Three of the numbers in the set, namely 2, 2, and 4, are smaller than 5.) Unless there are ties, the smallest number in a set of \(n\) numbers will have rank 1 and the largest will have rank \(n\). Table 13.2 shows the rank of each number in the set:
Table 13.2: The ranks of the values in Set
Note that the two 2’s have the same rank, as do the two 9’s. They’re tied.
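To see these ranks computed, here is a small sketch using base R's rank(). The vector x is an assumption: its values were chosen to agree with the description above (two 2’s, a 4, a 5, and two 9’s) and are not necessarily the exact contents of Table 13.1.

```r
# Illustrative values only; assumed, not copied from Table 13.1.
x <- c(2, 5, 4, 7, 2, 9, 9, 8)

# rank() replaces each number with its standing among the others.
ranks <- rank(x)

# Show each number next to its rank.
data.frame(number = x, rank = ranks)
```

In this sketch the number 5 gets rank 4, and each pair of tied values shares a single rank.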
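Suppose you want to find the 3rd most popular name of all time. Use rank() in the filter() criterion, keeping the case whose rank, counted from the most popular name down, equals 3.
Table 13.3: The third most popular name of all time
Or, to find the top three most popular names, replace == in that criterion by <=.
Table 13.4: The all-time top three most popular names in BabyNames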
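One way this might look in code is sketched below. It assumes the totals are built with group_by() and summarise() and that desc() is used so that rank 1 goes to the largest total; the source does not show the exact commands. The data frame toy_names and the variable total are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for BabyNames (all values are made up).
toy_names <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee", "Ann", "Bob", "Cy", "Dee"),
  year  = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  count = c(120, 300, 45, 200, 150, 280, 60, 90)
)

# Add up each name's count over all the years.
totals <- toy_names %>%
  group_by(name) %>%
  summarise(total = sum(count))

# rank() gives 1 to the smallest value, so rank desc(total)
# to give 1 to the largest total instead.
third <- totals %>%
  filter(rank(desc(total)) == 3)

# Replacing == by <= keeps the top three rather than only the third.
top_three <- totals %>%
  filter(rank(desc(total)) <= 3) %>%
  arrange(desc(total))
```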
When applied to grouped data,
rank() will be calculated separately within each group. That is, the rank of a value will be with respect to the other cases in that group. For instance, here’s the third most popular name each year.
Table 13.5: The third most popular name in each year
… and so on for 134 rows altogether.
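A minimal sketch of the grouped case follows, again on a hypothetical toy data frame with invented values. The only assumption beyond the text is the use of desc() so that rank 1 is the most popular name within each year.

```r
library(dplyr)

# Hypothetical toy data standing in for BabyNames (all values are made up).
toy_names <- data.frame(
  name  = c("Ann", "Bob", "Cy", "Dee", "Ann", "Bob", "Cy", "Dee"),
  year  = c(2000, 2000, 2000, 2000, 2001, 2001, 2001, 2001),
  count = c(120, 300, 45, 200, 150, 280, 60, 90)
)

# After group_by(year), rank() is computed separately within each year,
# so the filter keeps one third-place name per year.
third_by_year <- toy_names %>%
  group_by(year) %>%
  filter(rank(desc(count)) == 3) %>%
  ungroup()

third_by_year
```

With the grouping removed, the same filter would instead return the single case ranked third across the whole data frame.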
Sometimes, two or more numbers are tied in rank. The
rank() function deals with these by assigning all the tied values the same rank, which is the mean of the ranks those values would have had if they were even slightly different. There are other rank-like transformation verbs that handle ties differently. For instance,
row_number() breaks ties in favor of the first case encountered. (See Table 13.6.)
Table 13.6: row_number() breaks ties in rank.
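The difference between the two treatments of ties can be sketched side by side. The data frame tied and its column names are hypothetical and are not the contents of Table 13.6.

```r
library(dplyr)

# Hypothetical values containing two pairs of ties.
tied <- data.frame(value = c(2, 5, 2, 9, 9))

compared <- tied %>%
  mutate(
    # rank(): tied values share the mean of the ranks they occupy.
    by_rank = rank(value),
    # row_number(): ties are broken in favor of the first case encountered.
    by_row_number = row_number(value)
  )

compared
```

Here the two 2’s both get 1.5 from rank(), while row_number() assigns 1 to the first of them and 2 to the second.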
Problem 13.1: For each sex, find the 5 most popular names in BabyNames, adding up the counts over all the years.
Problem 13.2: Let’s investigate how the diversity of name use has changed over time. Using
BabyNames, for each year, find the fraction of all babies born in that year who were given a name in the top 100 for that year. Make a line graph showing how this fraction has changed over the years.
… and so on for 268 rows altogether.
Use mutate() to find the fraction of babies with names in the top 100 each year.
Make a line graph showing how this fraction has changed over the years. Be sure to include an informative title and axis labels.
Problem 13.3: Find the names in
BabyNames that were very popular in some year but which were very unpopular in some other year. Take being in the top 50 in a year as being “very popular” and being below the top 1000 in a year as being “very unpopular.”