Each row of a data frame is an individual case. Often it’s useful to re-organize the same data in a different way with different meanings to a case. This can make it easier to perform wrangling tasks such as comparisons, joins, and the inclusion of new data.
Consider the format of
BP_wide shown in Table 12.1, where the case is a research study subject and there are separate variables for the measurement of systolic blood pressure (SBP) before and after exposure to a stressful environment.
BP_wide: A wide format
Exactly the same data can be presented in the format of the
BP_narrow data frame (Table 12.2, where the case is an individual occasion for blood-pressure measurement.
BP_narrow: A narrow format; also known as ‘long’ format
Each of the formats
BP_narrow has its advantages and its deficits. For example, it’s easy to find the before-and-after change in blood pressure using
%>% BP_wide mutate(change = after - before)
On the other hand, a narrow (or long) format is more flexible for including additional variables, for example the date of the measurement or the diastolic blood pressure as in Table 12.3. Narrow format also makes it feasible to add in additional measurement occasions. For instance, Table 12.3 shows several “after” measurements for subject WJC. (Such “repeated measures” are a common feature of scientific studies.)
A simple strategy allows you to get the benefits of either format: convert from wide to narrow or from narrow to wide as suits your purpose.
Transforming from wide to narrow is the action of a data verb: a wide data frame is the input and a narrow data frame is the output. The reverse task, transforming from narrow to wide, involves another data verb. Different authors use different names for these verbs: e.g., melting versus casting, stacking versus unstacking, folding versus unfolding, gathering versus spreading. We primarily use pivot wider and pivot longer implemented by
pivot_wider( ) and
pivot_longer( ) in the
tidyr is a close relation to
dplyr, the main data wrangling package used in this book. Both were created by Hadley Wickham and feature prominently among the
tidyverse packages. A prior iteration of the
tidyr package utilized
gather() functions to accomplish the purposes of
pivot_wider( ) function converts from a relatively narrow form to wider form. Carrying out this operation involves specifying some information in the arguments to the function. The mechanics of this transformation are dictated by specifying a pair of arguments which define the names to be associated with new variables to be established for the wider form, as well as the corresponding values to be populated for each case.
For instance, in the narrow form
BP_narrow (Table 12.2) the values found in the
sbp variable will be spread between two variables–
after–in the corresponding wide form,
BP_wide (Table 12.1). Note that the different categorical levels in
after) associated with each case in
BP_narrow determines the destination for that
sbp value in
Notice that each case had been defined by a visit for a subject (e.g., BHO + before) in the
BP_narrow form, and now each case is defined only by a subject in the
BP_wide form of the data. Each form simply re-organizes the same
spb information in a different way with different meanings to a case.
BP_wide can be done like this:
%>% BP_narrow pivot_wider(names_from = when, values_from = sbp)
Now consider how to transform
BP_narrow. The names of the variables to be gathered together,
after, will become the categorical levels in the narrow form. That is, all systolic blood pressure values will be gathered into a single variable
sbp and we will use the accompanying
after information to specify
when the blood pressure reading for that subject was recorded.
when is suggested, but in fact the data analyst has to invent a name for this variable. There are all sorts of sensible possibilities, for instance
before_or_after would have been just as effective.
Similarly, a name must be invented for the variable that is to hold the values in the variables being gathered. Again, there are many reasonable possibilities. It’s sensible to choose a name that reflects the kind of thing those value are, in this case systolic blood pressure. So,
sbp was a good choice.
Finally, the analyst needs to specify which are the variables to be gathered (or not). For instance, it hardly makes sense to gather
subject with the other variables; it will remain as a separate variable in the narrow result. Values in
subject will be repeated as necessary to give each case in the narrow format its own correct value of
Here’s how to gather variables from
BP_wide and convert the result into
%>% BP_wide pivot_longer(cols = c(before, after), names_to = "when", values_to = "sbp")
Again, notice the distinction in the function arguments between new variable names invented by the data analyst (e.g.,
"sbp") to be used in the resulting narrow form in contrast to the original variable names specified from the wide form by the argument
cols = c(before, after).
In “A Boy Named Sue,” country singer Johnny Cash famously told the story of a boy toughened in life by a name commonly given to girls. Indeed, “Sue” is given to about 300 times as many girls as boys according to Table 12.4.
%>% BabyNames filter(name == "Sue") %>% group_by(name, sex) %>% summarise(total = sum(count))
Table 12.4: 99.6% of Sues are female.
On the other hand, Batman’s partner, Robin, had no such problems. Table 12.5 shows that about 15% of Robins are male.
%>% BabyNames filter(name == "Robin") %>% group_by(name, sex) %>% summarise(total = sum(count))
Table 12.5: 15% of Robins are male.
This technique works well if you want to look at gender balance in one name at a time, but suppose you want to find the most gender-neutral names from all 92,600 names in
BabyNames? For this, it would be useful to have the results in a wide format, like Table 12.6.
<- BabyWide %>% BabyNames group_by(name, sex) %>% summarise(total = sum(count)) %>% pivot_wider(names_from = sex, values_from = total, values_fill = 0)
Table 12.6: A wide format facilitates examining gender balance in
|… and so on for 92,600 rows altogether.|
To start, add up the counts in
BabyNames over the years and convert to a wide format as in Table 12.6, using
sex. Note the argument
values_fill = 0 passed to the
spread() function that produced Table 12.6. For a name like Ab or Abe, where there are no females, the entry for
F would be populated with zero rather than treated as missing.
One way to define “approximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.
Here’s a statement to find the most balanced gender-neutral names out of the names with more than 50,000 babies of each sex. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one. (The
pmin() transformation function returns the smaller of the two arguments for each individual case.)
%>% BabyWide filter(M > 50000, F > 50000) %>% mutate(ratio = pmin(M / F, F / M) ) %>% arrange(desc(ratio))
Table 12.7: The most gender-balanced common names.
|… and so on for 19 rows altogether.|
Problem 12.1: The
BabyNames data table (in the
dcData package) looks like this:
The same data can be presented as a wide-format table:
An R statement to convert from narrow to wide format can have the following form:
BothSexes <- BabyNames %>% pivot_wider(names_from = ??? , values_from = count) %>% filter(M > 5 | F > 5)
???in the R statement?
Now that you have
abs(log(F / M)). The smaller this number, the more balanced the name count.
Problem 12.2: Consider the following forms for organizing a subset of the
BabyNames data as we examine the most popular name of either sex each year.
Table 12.8: Version One
|… and so on for 9 rows altogether.|
Table 12.9: Version Two
Table 12.10: Version Three
What is the meaning of a case in each of the tables?
What should the chain of R commands look like to make “Version Two” from the data table “Version One” shown above? Also, explain which of the two versions is “wider.”
What should the chain of R commands look like to make “Version One” from the data table “Version Three” shown above? Also, explain which of the two versions is “wider.”
Suppose you want to calculate the ratio of male to female in each name in each year. Is it better to start from “Version Two” or “Version Three?” Why?
If you were given “Version One,” should you immediately begin using data verbs to calculate the ratio of male to female, or should you first translate to “Version Two” or “Version Three?” Explain.
Problem 12.3: Here are three data frames, A, B, and C, with different organizations of the same data:
Data Frame A
Data Frame B
Data Frame C
Comparing each of the following pairs of tables, is one considered “wider” as described in the chapter? If so, say which is wider; if not, explain.
Which table format do you think would make it easiest to find the change from 2000 to 2001 for each country. How would you do it?
Suppose you have another table,
ContinentData, which gives the continent that each country is in. Which table format do you think would make it easiest to find the sum of the values for each continent for each of the years? How would you do it?
Problem 12.4: Consider the data table
BP_wide shown here:
Think about the result of the following conversion from wide to narrow.
%>% BP_wide pivot_longer(cols = c(subject, before, after), names_to = "when", values_to = "sbp")
Explain why the resulting data frame might be considered “too narrow” meaning that we have converted it to a narrow form that is now less useful for most practical purposes.
Show how the code should be modified in order to convert
BP_wide to a more useful narrow form. Recall that “narrow” as defined in the chapter may not necessarily result in fewer columns.