Chapter 12 Wide versus Narrow Data

Each row of a data frame is an individual case. Often it’s useful to re-organize the same data in a different way with different meanings to a case. This can make it easier to perform wrangling tasks such as comparisons, joins, and the inclusion of new data.

Consider the format of BP_wide shown in Table 12.1, where the case is a research study subject and there are separate variables for the measurement of systolic blood pressure (SBP) before and after exposure to a stressful environment.

Table 12.1: BP_wide: A wide format

subject	before	after
BHO	160	160
GWB	115	135
WJC	105	145

Exactly the same data can be presented in the format of the BP_narrow data frame (Table 12.2, where the case is an individual occasion for blood-pressure measurement.

Table 12.2: BP_narrow: A narrow format; also known as ‘long’ format

subject	when	sbp
BHO	before	160
GWB	before	115
WJC	after	145
GWB	after	135
WJC	before	105
BHO	after	160

Each of the formats BP_wide and BP_narrow has its advantages and its deficits. For example, it’s easy to find the before-and-after change in blood pressure using BP_wide:

BP_wide %>%
  mutate(change = after - before)

On the other hand, a narrow (or long) format is more flexible for including additional variables, for example the date of the measurement or the diastolic blood pressure as in Table 12.3. Narrow format also makes it feasible to add in additional measurement occasions. For instance, Table 12.3 shows several “after” measurements for subject WJC. (Such “repeated measures” are a common feature of scientific studies.)

Table 12.3: A data frame extending the information in Tables 12.2 and 12.1 to include additional variables and repeated measurements. The narrow format facilitates including new cases or variables.

subject	when	sbp	dbp	date
BHO	before	160	69	2007-06-19
GWB	before	115	54	1998-04-21
BHO	before	155	65	2005-11-08
WJC	after	145	75	2002-11-15
WJC	after	NA	65	2010-03-26
WJC	after	130	60	2013-09-15
GWB	after	135	NA	2009-05-08
WJC	before	105	60	1990-08-17
BHO	after	160	78	2017-06-04

A simple strategy allows you to get the benefits of either format: convert from wide to narrow or from narrow to wide as suits your purpose.

12.1 Data verbs for converting wide to narrow and vice versa

Transforming from wide to narrow is the action of a data verb: a wide data frame is the input and a narrow data frame is the output. The reverse task, transforming from narrow to wide, involves another data verb. [Click to see note.]Different authors use different names for these verbs: e.g., melting versus casting, stacking versus unstacking, folding versus unfolding, gathering versus spreading. We primarily use pivot wider and pivot longer implemented by pivot_wider( ) and pivot_longer( ) in the tidyr package. [Click to see note.]tidyr is a close relation to dplyr, the main data wrangling package used in this book. Both were created by Hadley Wickham and feature prominently among the tidyverse packages. [Click to see note.]A prior iteration of the tidyr package utilized spread() and gather() functions to accomplish the purposes of pivot_wider() and pivot_longer(), respectively.

Pivoting wider

The pivot_wider( ) function converts from a relatively narrow form to wider form. Carrying out this operation involves specifying some information in the arguments to the function. The mechanics of this transformation are dictated by specifying a pair of arguments which define the names to be associated with new variables to be established for the wider form, as well as the corresponding values to be populated for each case.

For instance, in the narrow form BP_narrow (Table 12.2) the values found in the sbp variable will be spread between two variables–before and after–in the corresponding wide form, BP_wide (Table 12.1). Note that the different categorical levels in when (e.g., before, and after) associated with each case in BP_narrow determines the destination for that sbp value in BP_wide.

Notice that each case had been defined by a visit for a subject (e.g., BHO + before) in the BP_narrow form, and now each case is defined only by a subject in the BP_wide form of the data. Each form simply re-organizes the same spb information in a different way with different meanings to a case.

Translating BP_narrow to BP_wide can be done like this:

BP_narrow %>%
  pivot_wider(names_from = when, values_from = sbp)

Pivot longer

Now consider how to transform BP_wide into BP_narrow. The names of the variables to be gathered together, before and after, will become the categorical levels in the narrow form. That is, all systolic blood pressure values will be gathered into a single variable sbp and we will use the accompanying before or after information to specify when the blood pressure reading for that subject was recorded.

The name when is suggested, but in fact the data analyst has to invent a name for this variable. There are all sorts of sensible possibilities, for instance before_or_after would have been just as effective.

Similarly, a name must be invented for the variable that is to hold the values in the variables being gathered. Again, there are many reasonable possibilities. It’s sensible to choose a name that reflects the kind of thing those value are, in this case systolic blood pressure. So, sbp was a good choice.

Finally, the analyst needs to specify which are the variables to be gathered (or not). For instance, it hardly makes sense to gather subject with the other variables; it will remain as a separate variable in the narrow result. Values in subject will be repeated as necessary to give each case in the narrow format its own correct value of subject.

Here’s how to gather variables from BP_wide and convert the result into BP_narrow:

BP_wide %>%  
  pivot_longer(cols = c(before, after), names_to = "when", values_to = "sbp")

Again, notice the distinction in the function arguments between new variable names invented by the data analyst (e.g., "when" and "sbp") to be used in the resulting narrow form in contrast to the original variable names specified from the wide form by the argument cols = c(before, after).

Example: Gender-neutral names

In “A Boy Named Sue,” country singer Johnny Cash famously told the story of a boy toughened in life by a name commonly given to girls. Indeed, “Sue” is given to about 300 times as many girls as boys according to Table 12.4.

BabyNames %>% 
  filter(name == "Sue") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count))

Table 12.4: 99.6% of Sues are female.

name	sex	total
Sue	F	144410
Sue	M	519

On the other hand, Batman’s partner, Robin, had no such problems. Table 12.5 shows that about 15% of Robins are male.

BabyNames %>% 
  filter(name == "Robin") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count))

Table 12.5: 15% of Robins are male.

name	sex	total
Robin	F	288396
Robin	M	43829

This technique works well if you want to look at gender balance in one name at a time, but suppose you want to find the most gender-neutral names from all 92,600 names in BabyNames? For this, it would be useful to have the results in a wide format, like Table 12.6.

BabyWide <- 
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count)) %>%
  pivot_wider(names_from = sex, values_from = total, values_fill = 0)

Table 12.6: A wide format facilitates examining gender balance in BabyNames.

name	M	F
Addison	12179	90462
Adrian	205313	14770
Alex	258131	8055
Alexis	58962	324081
Ali	21526	9071
Alva	11164	8572
… and so on for 92,600 rows altogether.

To start, add up the counts in BabyNames over the years and convert to a wide format as in Table 12.6, using sex. [Click to see note.]Note the argument values_fill = 0 passed to the spread() function that produced Table 12.6. For a name like Ab or Abe, where there are no females, the entry for F would be populated with zero rather than treated as missing.

One way to define “approximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.

Here’s a statement to find the most balanced gender-neutral names out of the names with more than 50,000 babies of each sex. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one. (The pmin() transformation function returns the smaller of the two arguments for each individual case.)

BabyWide %>%
  filter(M > 50000, F > 50000) %>%
  mutate(ratio = pmin(M / F, F / M) ) %>%
  arrange(desc(ratio))

Table 12.7: The most gender-balanced common names.

name	M	F	ratio
Riley	85039	76811	0.90
Jackie	78061	90217	0.87
Casey	108072	74699	0.69
Jessie	109031	165106	0.66
Angel	203376	90531	0.45
Leslie	112463	263049	0.43
… and so on for 19 rows altogether.

12.2 Exercises

Problem 12.1: The BabyNames data table (in the dcData package) looks like this:

name	sex	count	year
Juana	F	121	1956
Stamatia	F	5	1981
Kenesha	F	16	2004
Aubreigh	F	63	2010
Yadriel	M	17	2012
Mariam	F	52	1922

The same data can be presented as a wide-format table:

name	year	F	M
Aarish	2012	NA	13
Adeeb	2008	NA	12
Adelaide	1920	471	NA
Adom	2002	NA	8
Adon	1975	NA	6
Adrianah	1999	7	NA

An R statement to convert from narrow to wide format can have the following form:

BothSexes <-
  BabyNames %>%
  pivot_wider(names_from = ??? , values_from = count) %>%
  filter(M > 5 | F > 5)

Which variable should replace ??? in the R statement?

Now that you have BothSexes …

Find the 10 names with the closest balance between females and males. You can define gender-balance quantitatively as abs(log(F / M)). The smaller this number, the more balanced the name count.
Find the 10 names with the closest balance that have more than 100 babies of either sex.
Challenge. Find the 10 names with the closest balance that have lasting popularity. Define lasting popularity as having more than 100 babies in at least 20 of the years.
Challenge. Are there names that have switched gender balance over the years. Find, for each year, the gender ratio. Pull out ones where the maximum and minumum over the years differ by a large amount. Then plot out over time.

Problem 12.2: Consider the following forms for organizing a subset of the BabyNames data as we examine the most popular name of either sex each year.

Table 12.8: Version One

name	sex	year	nbabies
Ida	F	1912	2740
Ida	M	1912	14
Robert	F	1912	64
Robert	M	1912	12837
Theresa	F	1912	1268
Ida	F	2012	99
… and so on for 9 rows altogether.

Table 12.9: Version Two

name	year	F	M
Ida	1912	2740	14
Ida	2012	99	NA
Robert	1912	64	12837
Robert	2012	7	6910
Theresa	1912	1268	NA
Theresa	2012	232	NA

Table 12.10: Version Three

name	sex	1912	2012
Ida	F	2740	99
Ida	M	14	NA
Robert	F	64	7
Robert	M	12837	6910
Theresa	F	1268	232

What is the meaning of a case in each of the tables?
- Version One
- Version Two
- Version Three
What should the chain of R commands look like to make “Version Two” from the data table “Version One” shown above? Also, explain which of the two versions is “wider.”
What should the chain of R commands look like to make “Version One” from the data table “Version Three” shown above? Also, explain which of the two versions is “wider.”
Suppose you want to calculate the ratio of male to female in each name in each year. Is it better to start from “Version Two” or “Version Three?” Why?
If you were given “Version One,” should you immediately begin using data verbs to calculate the ratio of male to female, or should you first translate to “Version Two” or “Version Three?” Explain.

Problem 12.3: Here are three data frames, A, B, and C, with different organizations of the same data:

Data Frame A

Year	Algeria	Brazil	Columbia
2000	7	12	16
2001	9	14	18

Data Frame B

Country	Y2000	Y2001
Algeria	7	9
Brazil	12	14
Columbia	16	18

Data Frame C

Country	Year	Value
Algeria	2000	7
Algeria	2001	9
Brazil	2000	12
Columbia	2001	18
Columbia	2000	16
Brazil	2001	14

Comparing each of the following pairs of tables, is one considered “wider” as described in the chapter? If so, say which is wider; if not, explain.
- A versus C
- B versus C
- A versus B
Which table format do you think would make it easiest to find the change from 2000 to 2001 for each country. How would you do it?
Suppose you have another table, ContinentData, which gives the continent that each country is in. Which table format do you think would make it easiest to find the sum of the values for each continent for each of the years? How would you do it?

Problem 12.4: Consider the data table BP_wide shown here:

Table 12.11: BP_wide

subject	before	after
BHO	120	160
GWB	115	135
WJC	135	145

Think about the result of the following conversion from wide to narrow.

BP_wide %>%
  pivot_longer(cols = c(subject, before, after), 
               names_to = "when", values_to = "sbp")

Explain why the resulting data frame might be considered “too narrow” meaning that we have converted it to a narrow form that is now less useful for most practical purposes.
Show how the code should be modified in order to convert BP_wide to a more useful narrow form. Recall that “narrow” as defined in the chapter may not necessarily result in fewer columns.