Wide and narrow data

Here’s a small subset of the data in BabyNames:

       name sex count year
1     Rotha   F     7 1907
2    Julian   M   535 1948
3    Julian   F     7 1948
4 Christina   F  6600 1967
5 Christina   M    22 1967
6      Song   F    11 1994
7    Wayman   M     9 1997

The above is not the only way that the data table can be arranged. For instance, the following contains the same information:

       name year    F   M
1 Christina 1967 6600  22
2    Julian 1948    7 535
3    Wayman 1997    0   9
4     Rotha 1907    7   0
5      Song 1994   11   0

The first table is a narrow format, the second is a wide format. In the narrow format, there is a variable that describes the meaning of another variable. For instance, the sex variable tells whether count refers to a male or a female. A variable like sex is called a key variable because its value serves as the key or guide to another variable.

In the wide format, there is no sex variable. The distinction between males and females is made by putting the count information into two variables: F and M.

There are reasons to prefer each of the forms. The best choice is a matter of judgement and depends on the task to be performed.

The wide format is useful if you want to compare variables. Suppose, for example, that you want to identify the names that are given to both boys and girls. In the wide format, you could simply compare the M and F variables, looking for cases where the numbers are approximately the same.¹

Wide %>%
  mutate( ratio=pmin(M/F, F/M) ) %>%  
  arrange( desc(ratio) )

       name year    F   M    ratio
1    Julian 1948    7 535 0.013084
2 Christina 1967 6600  22 0.003333
3    Wayman 1997    0   9 0.000000
4     Rotha 1907    7   0 0.000000
5      Song 1994   11   0 0.000000

You can see that Christina and Julian are — out of this small set — the names that are most gender-neutral.
It’s easy for wide formats to get out of hand. For instance, this wide table gives each name over time:

name	sex	2005	2006	2007	2008	2009	2010	2011	2012	2013
Aaron	F	25.00	24.00	23.00	29.00	32.00	23.00	21.00	21.00	10.00
Aaron	M	7799.00	8289.00	8932.00	8526.00	7961.00	7446.00	7603.00	7502.00	7246.00
Ab	M	30.00	32.00	41.00	24.00	10.00	6.00	5.00	0.00	7.00
Abbie	F	445.00	440.00	468.00	400.00	370.00	326.00	339.00	341.00	242.00
Abbie	M	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00
Abbott	F	5.00	0.00	0.00	0.00	0.00	0.00	0.00	5.00	0.00

This format is useful for some purposes, for instance calculating whether the use of a name went up or down between two years. (Example: Aaron for males increased between 2005 and 2008.) But if you want to plot out the usage over time, the narrow format is easier to handle.

What’s important is that you realize that there is a choice between wide and narrow and:

be able to identify situations where there is a key variable that describes another variable
know that you can translate between wide and narrow, and that software will do this automatically.

Software

The data verb for turning a narrow data table into a wide one is “spread.” Going the other way, from wide to narrow, is to “gather.” You won’t use these verbs nearly so much as filter, mutate, summarise, group, and join.

Spreading

The spread() function performs this. You need to specify three things:

The variable to use as the key.
The variable that contains the value.
What to fill in when there is no value, or a missing value for a particular key.

In the example above the statement was:

Wide <- BabyNames %>% 
  spread( key=sex, value=count, fill=0 )

set.seed(101)
inds <- sample( 1:nrow(Wide), size=10 )
Wide[inds,] %>% arrange( year )

        name year   F  M
1    Demaris 1914  11  0
2    Erastus 1939   0  7
3     Alesia 1964 419  0
4     Coleen 1972 184  0
5     Devlin 1977   5 21
6  Guillaume 1984   0  5
7        Mea 1996  12  0
8     Jameka 2004   8  0
9      Hazen 2006   0 33
10     Masao 2011   0  8

A fill of zero is appropriate here; when a name is missing in a year, there were no children given that name. Often, NA is the appropriate fill; that’s the default.

Gathering

Gathering is also straightforward. You need to specify

which variables you want to gather
what you want to call the key generated from the names of the gathered variables.
what you want to call the value that the key depends on. As an example, here’s turning the wide baby names format back into narrow form.

Narrow <- 
  Wide %>% 
  gather( key=gender, value=howMany, M, F )

        name year gender howMany
1    Demaris 1914      M       0
2    Demaris 1914      F      11
3    Erastus 1939      M       7
4    Erastus 1939      F       0
5     Alesia 1964      M       0
6     Alesia 1964      F     419
7     Coleen 1972      M       0
8     Coleen 1972      F     184
9     Devlin 1977      M      21
10    Devlin 1977      F       5
11 Guillaume 1984      M       5
12 Guillaume 1984      F       0
13       Mea 1996      M       0
14       Mea 1996      F      12
15    Jameka 2004      M       0
16    Jameka 2004      F       8
17     Hazen 2006      M      33
18     Hazen 2006      F       0
19     Masao 2011      M       8
20     Masao 2011      F       0

You could, of course, filter out the cases where no children were given a name.

Aside: Gender Neutal Names

Out of the very small dataset used in the example of gender-neutral names, Julian and Christina had entries for both sexes. If Julian and Christina don’t strike you as particularly gender neutral, take note that they are nearby misspellings of Christian and Julianne: much more common names with the opposite gender orientation to Christina and Julian.

BabyNames %>% 
  filter( name==c("Christian", "Julianne") ) %>% 
  group_by(name, sex) %>%  
  summarise( n=sum(count) )

Source: local data frame [3 x 3]
Groups: name

       name sex      n
1 Christian   F   6491
2 Christian   M 237753
3  Julianne   F  14154

Rather than using that tiny dataset, here’s a statement to find the 10 most balanced gender-neutral names out of the names with more than 50,000 babies. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one.

BabyNames %>% 
  spread( key=sex, value=count, fill=0 ) %>% 
  group_by( name ) %>%
  summarise( M=sum(M), F=sum(F) ) %>%
  filter( M>50000, F>50000) %>%
  mutate( ratio=pmin(M/F, F/M) ) %>%
  filter( row_number(desc(ratio)) <= 10) %>%
  arrange( desc(ratio) )

Source: local data frame [10 x 4]

     name      M      F  ratio
1   Riley  85039  76811 0.9032
2  Jackie  78061  90217 0.8653
3   Casey 108072  74699 0.6912
4  Jessie 109031 165106 0.6604
5   Angel 203376  90531 0.4451
6  Leslie 112463 263049 0.4275
7  Marion  71662 187647 0.3819
8  Jordan 345635 125588 0.3634
9  Taylor 107264 304305 0.3525
10 Willie 447403 146124 0.3266

Please use the comment system to make suggestions, point out errors, or to discuss the topic.

comments powered by Disqus

Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.

One way to define “appoximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.↩