Chapter 12 Wide versus Narrow Data

Each row of a data frame is an individual case. Often it’s useful to re-organize the same data in a different way with different meanings to a case. This can make it easier to perform wrangling tasks such as comparisons, joins, and the inclusion of new data.

Consider the format of BP_wide shown in Table 12.1, where the case is a research study subject and there are separate variables for the measurement of systolic blood pressure (SBP) before and after exposure to a stressful environment.

Table 12.1: BP_wide: A wide format

subject before after
BHO 160 160
GWB 115 135
WJC 105 145

Exactly the same data can be presented in the format of the BP_narrow data frame (Table 12.2, where the case is an individual occasion for blood-pressure measurement.

Table 12.2: BP_narrow: A narrow format; also known as ‘long’ format

subject when sbp
BHO before 160
GWB before 115
WJC after 145
GWB after 135
WJC before 105
BHO after 160

Each of the formats BP_wide and BP_narrow has its advantages and its deficits. For example, it’s easy to find the before-and-after change in blood pressure using BP_wide:

BP_wide %>%
  mutate(change = after - before)

On the other hand, a narrow (or long) format is more flexible for including additional variables, for example the date of the measurement or the diastolic blood pressure as in Table 12.3. Narrow format also makes it feasible to add in additional measurement occasions. For instance, Table 12.3 shows several “after” measurements for subject WJC. (Such “repeated measures” are a common feature of scientific studies.)

Table 12.3: A data frame extending the information in Tables 12.2 and 12.1 to include additional variables and repeated measurements. The narrow format facilitates including new cases or variables.

subject when sbp dbp date
BHO before 160 69 2007-06-19
GWB before 115 54 1998-04-21
BHO before 155 65 2005-11-08
WJC after 145 75 2002-11-15
WJC after NA 65 2010-03-26
WJC after 130 60 2013-09-15
GWB after 135 NA 2009-05-08
WJC before 105 60 1990-08-17
BHO after 160 78 2017-06-04

A simple strategy allows you to get the benefits of either format: convert from wide to narrow or from narrow to wide as suits your purpose.

12.1 Data verbs for converting wide to narrow and vice versa

Transforming from wide to narrow is the action of a data verb: a wide data frame is the input and a narrow data frame is the output. The reverse task, transforming from narrow to wide, involves another data verb. Different authors use different names for these verbs: e.g., melting versus casting, stacking versus unstacking, folding versus unfolding, gathering versus spreading. We primarily use pivot wider and pivot longer implemented by pivot_wider( ) and pivot_longer( ) in the tidyr package. tidyr is a close relation to dplyr, the main data wrangling package used in this book. Both were created by Hadley Wickham and feature prominently among the tidyverse packages. A prior iteration of the tidyr package utilized spread() and gather() functions to accomplish the purposes of pivot_wider() and pivot_longer(), respectively.

Pivoting wider

The pivot_wider( ) function converts from a relatively narrow form to wider form. Carrying out this operation involves specifying some information in the arguments to the function. The mechanics of this transformation are dictated by specifying a pair of arguments which define the names to be associated with new variables to be established for the wider form, as well as the corresponding values to be populated for each case.

For instance, in the narrow form BP_narrow (Table 12.2) the values found in the sbp variable will be spread between two variables–before and after–in the corresponding wide form, BP_wide (Table 12.1). Note that the different categorical levels in when (e.g., before, and after) associated with each case in BP_narrow determines the destination for that sbp value in BP_wide.

Notice that each case had been defined by a visit for a subject (e.g., BHO + before) in the BP_narrow form, and now each case is defined only by a subject in the BP_wide form of the data. Each form simply re-organizes the same spb information in a different way with different meanings to a case.

Translating BP_narrow to BP_wide can be done like this:

BP_narrow %>%
  pivot_wider(names_from = when, values_from = sbp)

Pivot longer

Now consider how to transform BP_wide into BP_narrow. The names of the variables to be gathered together, before and after, will become the categorical levels in the narrow form. That is, all systolic blood pressure values will be gathered into a single variable sbp and we will use the accompanying before or after information to specify when the blood pressure reading for that subject was recorded.

The name when is suggested, but in fact the data analyst has to invent a name for this variable. There are all sorts of sensible possibilities, for instance before_or_after would have been just as effective.

Similarly, a name must be invented for the variable that is to hold the values in the variables being gathered. Again, there are many reasonable possibilities. It’s sensible to choose a name that reflects the kind of thing those value are, in this case systolic blood pressure. So, sbp was a good choice.

Finally, the analyst needs to specify which are the variables to be gathered (or not). For instance, it hardly makes sense to gather subject with the other variables; it will remain as a separate variable in the narrow result. Values in subject will be repeated as necessary to give each case in the narrow format its own correct value of subject.

Here’s how to gather variables from BP_wide and convert the result into BP_narrow:

BP_wide %>%  
  pivot_longer(cols = c(before, after), names_to = "when", values_to = "sbp")

Again, notice the distinction in the function arguments between new variable names invented by the data analyst (e.g., "when" and "sbp") to be used in the resulting narrow form in contrast to the original variable names specified from the wide form by the argument cols = c(before, after).

Example: Gender-neutral names

In “A Boy Named Sue,” country singer Johnny Cash famously told the story of a boy toughened in life by a name commonly given to girls. Indeed, “Sue” is given to about 300 times as many girls as boys according to Table 12.4.

BabyNames %>% 
  filter(name == "Sue") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count)) 

Table 12.4: 99.6% of Sues are female.

name sex total
Sue F 144410
Sue M 519

On the other hand, Batman’s partner, Robin, had no such problems. Table 12.5 shows that about 15% of Robins are male.

BabyNames %>% 
  filter(name == "Robin") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count)) 

Table 12.5: 15% of Robins are male.

name sex total
Robin F 288396
Robin M 43829

This technique works well if you want to look at gender balance in one name at a time, but suppose you want to find the most gender-neutral names from all 92,600 names in BabyNames? For this, it would be useful to have the results in a wide format, like Table 12.6.

BabyWide <- 
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count)) %>%
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

Table 12.6: A wide format facilitates examining gender balance in BabyNames.

name M F
Addison 12179 90462
Adrian 205313 14770
Alex 258131 8055
Alexis 58962 324081
Ali 21526 9071
Alva 11164 8572
… and so on for 92,600 rows altogether.

To start, add up the counts in BabyNames over the years and convert to a wide format as in Table 12.6, using sex. Note the argument values_fill = 0 passed to the spread() function that produced Table 12.6. For a name like Ab or Abe, where there are no females, the entry for F would be populated with zero rather than treated as missing.

One way to define “approximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.

Here’s a statement to find the most balanced gender-neutral names out of the names with more than 50,000 babies of each sex. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one. (The pmin() transformation function returns the smaller of the two arguments for each individual case.)

BabyWide %>%
  filter(M > 50000, F > 50000) %>%
  mutate(ratio = pmin(M / F, F / M) ) %>%
  arrange(desc(ratio)) 

Table 12.7: The most gender-balanced common names.

name M F ratio
Riley 85039 76811 0.90
Jackie 78061 90217 0.87
Casey 108072 74699 0.69
Jessie 109031 165106 0.66
Angel 203376 90531 0.45
Leslie 112463 263049 0.43
… and so on for 19 rows altogether.

12.2 Exercises

Problem 12.1: The BabyNames data table (in the dcData package) looks like this:

name sex count year
Juana F 121 1956
Stamatia F 5 1981
Kenesha F 16 2004
Aubreigh F 63 2010
Yadriel M 17 2012
Mariam F 52 1922

The same data can be presented as a wide-format table:

name year F M
Aarish 2012 NA 13
Adeeb 2008 NA 12
Adelaide 1920 471 NA
Adom 2002 NA 8
Adon 1975 NA 6
Adrianah 1999 7 NA

An R statement to convert from narrow to wide format can have the following form:

BothSexes <-
  BabyNames %>%
  pivot_wider(names_from = ??? , values_from = count) %>%
  filter(M > 5 | F > 5)
  1. Which variable should replace ??? in the R statement?

Now that you have BothSexes

  1. Find the 10 names with the closest balance between females and males. You can define gender-balance quantitatively as abs(log(F / M)). The smaller this number, the more balanced the name count.
  2. Find the 10 names with the closest balance that have more than 100 babies of either sex.
  3. Challenge. Find the 10 names with the closest balance that have lasting popularity. Define lasting popularity as having more than 100 babies in at least 20 of the years.
  4. Challenge. Are there names that have switched gender balance over the years. Find, for each year, the gender ratio. Pull out ones where the maximum and minumum over the years differ by a large amount. Then plot out over time.

Problem 12.2: Consider the following forms for organizing a subset of the BabyNames data as we examine the most popular name of either sex each year.

Table 12.8: Version One

name sex year nbabies
Ida F 1912 2740
Ida M 1912 14
Robert F 1912 64
Robert M 1912 12837
Theresa F 1912 1268
Ida F 2012 99
… and so on for 9 rows altogether.

Table 12.9: Version Two

name year F M
Ida 1912 2740 14
Ida 2012 99 NA
Robert 1912 64 12837
Robert 2012 7 6910
Theresa 1912 1268 NA
Theresa 2012 232 NA

Table 12.10: Version Three

name sex 1912 2012
Ida F 2740 99
Ida M 14 NA
Robert F 64 7
Robert M 12837 6910
Theresa F 1268 232
  1. What is the meaning of a case in each of the tables?

    • Version One
    • Version Two
    • Version Three
  2. What should the chain of R commands look like to make “Version Two” from the data table “Version One” shown above? Also, explain which of the two versions is “wider.”

  3. What should the chain of R commands look like to make “Version One” from the data table “Version Three” shown above? Also, explain which of the two versions is “wider.”

  4. Suppose you want to calculate the ratio of male to female in each name in each year. Is it better to start from “Version Two” or “Version Three?” Why?

  5. If you were given “Version One,” should you immediately begin using data verbs to calculate the ratio of male to female, or should you first translate to “Version Two” or “Version Three?” Explain.

Problem 12.3: Here are three data frames, A, B, and C, with different organizations of the same data:

Data Frame A

Year Algeria Brazil Columbia
2000 7 12 16
2001 9 14 18

Data Frame B

Country Y2000 Y2001
Algeria 7 9
Brazil 12 14
Columbia 16 18

Data Frame C

Country Year Value
Algeria 2000 7
Algeria 2001 9
Brazil 2000 12
Columbia 2001 18
Columbia 2000 16
Brazil 2001 14
  1. Comparing each of the following pairs of tables, is one considered “wider” as described in the chapter? If so, say which is wider; if not, explain.

    • A versus C
    • B versus C
    • A versus B
  2. Which table format do you think would make it easiest to find the change from 2000 to 2001 for each country. How would you do it?

  3. Suppose you have another table, ContinentData, which gives the continent that each country is in. Which table format do you think would make it easiest to find the sum of the values for each continent for each of the years? How would you do it?

Problem 12.4: Consider the data table BP_wide shown here:

Table 12.11: BP_wide

subject before after
BHO 120 160
GWB 115 135
WJC 135 145

Think about the result of the following conversion from wide to narrow.

BP_wide %>%
  pivot_longer(cols = c(subject, before, after), 
               names_to = "when", values_to = "sbp")
  1. Explain why the resulting data frame might be considered “too narrow” meaning that we have converted it to a narrow form that is now less useful for most practical purposes.

  2. Show how the code should be modified in order to convert BP_wide to a more useful narrow form. Recall that “narrow” as defined in the chapter may not necessarily result in fewer columns.