Chapter 11 Wide versus Narrow Data

Each row of a data frame is an individual case. Often it’s useful to re-organize the same data in a different way with different meanings to a case. This can make it easier to perform wrangling tasks such as comparisons, joins, and the inclusion of new data.

Consider the format of BP_wide shown in Table 11.1, where the case is a research study subject and there are separate variables for the measurement of systolic blood pressure (SBP) before and after exposure to a stressful environment.

Table 11.1: BP_wide: A wide format

subject before after
BHO 160 160
GWB 115 135
WJC 105 145

Exactly the same data can be presented in the format of the BP_narrow data frame (Table 11.2, where the case is an individual occasion for blood-pressure measurement.

Table 11.2: BP_narrow: A narrow format

subject when sbp
BHO before 160
GWB before 115
WJC after 145
GWB after 135
WJC before 105
BHO after 160

Each of the formats BP_wide and BP_narrow has its advantages and its deficits. For example, it’s easy to find the before-and-after change in blood pressure using BP_wide:

BP_wide %>%
  mutate(change = after - before)

On the other hand, a narrow format is more flexible for including additional variables, for example the date of the measurement or the diastolic blood pressure as in Table 11.3. Narrow format also makes it feasible to add in additional measurement occasions. For instance, Table 11.3 shows several “after” measurements for subject WJC. (Such “repeated measures” are a common feature of scientific studies.)

Table 11.3: A data frame extending the information in Tables 11.2 and 11.1 to include additional variables and repeated measurements. The narrow format facilitates including new cases or variables.

subject when sbp dbp date
BHO before 160 69 2007-06-19
GWB before 115 54 1998-04-21
BHO before 155 65 2005-11-08
WJC after 145 75 2002-11-15
WJC after NA 65 2010-03-26
WJC after 130 60 2013-09-15
GWB after 135 NA 2009-05-08
WJC before 105 60 1990-08-17
BHO after 160 78 2017-06-04

A simple strategy allows you to get the benefits of either format: convert from wide to narrow or from narrow to wide as suits your purpose.

11.1 Data verbs for converting wide to narrow and vice versa

Transforming from wide to narrow is the action of a data verb: a wide data frame is the input and a narrow data frame is the output. The reverse task, transforming from narrow to wide, involves another data verb. Different authors use different names for these verbs: e.g., melting versus casting, stacking versus unstacking, folding versus unfolding, gathering versus spreading. These notes use spread and gather They are implemented by spread() and gather() in the tidyr package. tidyr is a close relation to dplyr, the main data wrangling package used in this book. Both were created by Hadley Wickham and feature prominently among the tidyverse packages. Note: at the time of this writing, the most recent version of the tidyr package (version 1.0.0) intends to transition from the spread() and gather() functions toward new pivot_longer() and pivot_wider() functions. The released version of the tidyr package build was not yet stable enough for our purposes, but the reader may wish to examine those functions independently.

Spreading

The spread() function converts from narrow to wide. Carrying out this operation involves specifying some information in the arguments to the function. The value is the variable in the narrow format which is to be divided up into multiple variables in the resulting wide format. The key is the name of the variable in the narrow format that identifies for each case individually which column in the wide format will receive the value.

For instance, in the narrow form BP_narrow (Table 11.2) the value variable is sbp. In the corresponding wide form, BP_wide (Table 11.1), the information in sbp will be spread between two variables: before and after. The key variable in BP_narrow is when. Note that the different categorical levels in when set which variable in BP_wide will be the destination for the sbp value of each case.

Only the key and value variables are involved in the transformation from narrow to wide. Other variables in the narrow table, such as subject in BP_narrow, are used to define the cases.

Translating BP_narrow to BP_wide can be done like this:

BP_narrow %>%
  spread(key = when, value = sbp)

Gathering

Now consider how to transform BP_wide into BP_narrow. The names of the variables to be gathered together, before and after, will become the categorical levels in the narrow form. That is, they will make up the key variable in the narrow form. The data analyst has to invent a name for this variable. There are all sorts of sensible possibilities, for instance before_or_after. In gathering BP_wide into BP_narrow, the concise variable name when was chosen.

Similarly, a name must be invented for the variable that is to hold the values in the variables being gathered. Again, there are many reasonable possibilities. It’s sensible to choose a name that reflects the kind of thing those value are, in this case systolic blood pressure. So, sbp is a good choice.

Finally, the analyst needs to specify which are the variables to be gathered. For instance, it hardly makes sense to gather subject with the other variables; it will remain as a separate variable in the narrow result. Values in subject will be repeated as necessary to give each case in the narrow format its own correct value of subject.

Here’s how to gather variables from BP_wide and convert the result into BP_narrow:

BP_wide %>%  
  gather( key=when, value=sbp, before, after )

Again, the names of the key and value arguments are given as arguments. These are the names invented by the data analyst; those names are not part of the wide input to gather(). Arguments after the key and value are the names of the variables to be gathered.

Example: Gender-neutral names

In “A Boy Named Sue”, country singer Johnny Cash famously told the story of a boy toughened in life by being given a girl’s name. Indeed, Sue is given to about 300 times as many girls as boys according to Table 11.4.

BabyNames %>% 
  filter(name == "Sue") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count)) 

Table 11.4: 99.6% of Sues are female.

name sex total
Sue F 144410
Sue M 519

On the other hand, Batman’s partner, Robin, had no such problems. Table 11.5 shows that about 15% of Robins are male.

BabyNames %>% 
  filter(name == "Robin") %>%
  group_by(name, sex) %>% 
  summarise(total = sum(count)) 

Table 11.5: 15% of Robins are male.

name sex total
Robin F 288396
Robin M 43829

This technique works well if you want to look at gender balance in one name at a time, but suppose you want to find the most gender-neutral names from all 92,600 names in BabyNames? For this, it would be useful to have the results in a wide format, like Table 11.6.

BabyWide <- 
  BabyNames %>%
  group_by(name, sex) %>%
  summarise(total = sum(count)) %>%
  spread(key = sex, value = total, fill = 0)

Table 11.6: A wide format facilitates examining gender balance in BabyNames.

name F M
Addison 90462 12179
Adrian 14770 205313
Alex 8055 258131
Alexis 324081 58962
Ali 9071 21526
Alva 8572 11164
… and so on for 92,600 rows altogether.

To start, add up the counts in BabyNames over the years and convert to a wide format as in Table 11.6, using sex as the “key” variable in the conversion. Note the argument fill = 0 passed to the spread() function that produced Table 11.6. For a name like Ab or Abe, where there are no females, the entry for F would be populated with zero rather than treated as missing.

One way to define “approximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.

Here’s a statement to find the most balanced gender-neutral names out of the names with more than 50,000 babies of each sex. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one. (The pmin() transformation function returns the smaller of the two arguments for each individual case.)

BabyWide %>%
  filter(M > 50000, F > 50000) %>%
  mutate(ratio = pmin(M / F, F / M) ) %>%
  arrange(desc(ratio)) 

Table 11.7: The most gender-balanced common names.

name F M ratio
Riley 76811 85039 0.90
Jackie 90217 78061 0.87
Casey 74699 108072 0.69
Jessie 165106 109031 0.66
Angel 90531 203376 0.45
Leslie 263049 112463 0.43
… and so on for 19 rows altogether.