Here’s a small subset of the data in BabyNames
:
name sex count year
1 Rotha F 7 1907
2 Julian M 535 1948
3 Julian F 7 1948
4 Christina F 6600 1967
5 Christina M 22 1967
6 Song F 11 1994
7 Wayman M 9 1997
The above is not the only way that the data table can be arranged. For instance, the following contains the same information:
name year F M
1 Christina 1967 6600 22
2 Julian 1948 7 535
3 Wayman 1997 0 9
4 Rotha 1907 7 0
5 Song 1994 11 0
The first table is a narrow format, the second is a wide format. In the narrow format, there is a variable that describes the meaning of another variable. For instance, the sex
variable tells whether count
refers to a male or a female. A variable like sex
is called a key variable because its value serves as the key or guide to another variable.
In the wide format, there is no sex
variable. The distinction between males and females is made by putting the count information into two variables: F
and M
.
There are reasons to prefer each of the forms. The best choice is a matter of judgement and depends on the task to be performed.
The wide format is useful if you want to compare variables. Suppose, for example, that you want to identify the names that are given to both boys and girls. In the wide format, you could simply compare the M
and F
variables, looking for cases where the numbers are approximately the same.1
Wide %>%
mutate( ratio=pmin(M/F, F/M) ) %>%
arrange( desc(ratio) )
name year F M ratio
1 Julian 1948 7 535 0.013084
2 Christina 1967 6600 22 0.003333
3 Wayman 1997 0 9 0.000000
4 Rotha 1907 7 0 0.000000
5 Song 1994 11 0 0.000000
You can see that Christina and Julian are — out of this small set — the names that are most gender-neutral.
It’s easy for wide formats to get out of hand. For instance, this wide table gives each name over time:
name | sex | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 |
---|---|---|---|---|---|---|---|---|---|---|
Aaron | F | 25.00 | 24.00 | 23.00 | 29.00 | 32.00 | 23.00 | 21.00 | 21.00 | 10.00 |
Aaron | M | 7799.00 | 8289.00 | 8932.00 | 8526.00 | 7961.00 | 7446.00 | 7603.00 | 7502.00 | 7246.00 |
Ab | M | 30.00 | 32.00 | 41.00 | 24.00 | 10.00 | 6.00 | 5.00 | 0.00 | 7.00 |
Abbie | F | 445.00 | 440.00 | 468.00 | 400.00 | 370.00 | 326.00 | 339.00 | 341.00 | 242.00 |
Abbie | M | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Abbott | F | 5.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 5.00 | 0.00 |
This format is useful for some purposes, for instance calculating whether the use of a name went up or down between two years. (Example: Aaron for males increased between 2005 and 2008.) But if you want to plot out the usage over time, the narrow format is easier to handle.
What’s important is that you realize that there is a choice between wide and narrow and:
The data verb for turning a narrow data table into a wide one is “spread.” Going the other way, from wide to narrow, is to “gather.” You won’t use these verbs nearly so much as filter, mutate, summarise, group, and join.
The spread()
function performs this. You need to specify three things:
In the example above the statement was:
Wide <- BabyNames %>%
spread( key=sex, value=count, fill=0 )
set.seed(101)
inds <- sample( 1:nrow(Wide), size=10 )
Wide[inds,] %>% arrange( year )
name year F M
1 Demaris 1914 11 0
2 Erastus 1939 0 7
3 Alesia 1964 419 0
4 Coleen 1972 184 0
5 Devlin 1977 5 21
6 Guillaume 1984 0 5
7 Mea 1996 12 0
8 Jameka 2004 8 0
9 Hazen 2006 0 33
10 Masao 2011 0 8
A fill of zero is appropriate here; when a name is missing in a year, there were no children given that name. Often, NA
is the appropriate fill; that’s the default.
Gathering is also straightforward. You need to specify
Narrow <-
Wide %>%
gather( key=gender, value=howMany, M, F )
name year gender howMany
1 Demaris 1914 M 0
2 Demaris 1914 F 11
3 Erastus 1939 M 7
4 Erastus 1939 F 0
5 Alesia 1964 M 0
6 Alesia 1964 F 419
7 Coleen 1972 M 0
8 Coleen 1972 F 184
9 Devlin 1977 M 21
10 Devlin 1977 F 5
11 Guillaume 1984 M 5
12 Guillaume 1984 F 0
13 Mea 1996 M 0
14 Mea 1996 F 12
15 Jameka 2004 M 0
16 Jameka 2004 F 8
17 Hazen 2006 M 33
18 Hazen 2006 F 0
19 Masao 2011 M 8
20 Masao 2011 F 0
You could, of course, filter out the cases where no children were given a name.
Out of the very small dataset used in the example of gender-neutral names, Julian and Christina had entries for both sexes. If Julian and Christina don’t strike you as particularly gender neutral, take note that they are nearby misspellings of Christian and Julianne: much more common names with the opposite gender orientation to Christina and Julian.
BabyNames %>%
filter( name==c("Christian", "Julianne") ) %>%
group_by(name, sex) %>%
summarise( n=sum(count) )
Source: local data frame [3 x 3]
Groups: name
name sex n
1 Christian F 6491
2 Christian M 237753
3 Julianne F 14154
Rather than using that tiny dataset, here’s a statement to find the 10 most balanced gender-neutral names out of the names with more than 50,000 babies. Remember, a ratio of 1 means exactly balanced; a ratio of 0.5 means two to one in favor of one sex; 0.33 means three to one.
BabyNames %>%
spread( key=sex, value=count, fill=0 ) %>%
group_by( name ) %>%
summarise( M=sum(M), F=sum(F) ) %>%
filter( M>50000, F>50000) %>%
mutate( ratio=pmin(M/F, F/M) ) %>%
filter( row_number(desc(ratio)) <= 10) %>%
arrange( desc(ratio) )
Source: local data frame [10 x 4]
name M F ratio
1 Riley 85039 76811 0.9032
2 Jackie 78061 90217 0.8653
3 Casey 108072 74699 0.6912
4 Jessie 109031 165106 0.6604
5 Angel 203376 90531 0.4451
6 Leslie 112463 263049 0.4275
7 Marion 71662 187647 0.3819
8 Jordan 345635 125588 0.3634
9 Taylor 107264 304305 0.3525
10 Willie 447403 146124 0.3266
Please use the comment system to make suggestions, point out errors, or to discuss the topic.
Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.
One way to define “appoximately the same” is to take the smaller of the ratios M/F and F/M. If females greatly outnumber males, then F/M will be large, but M/F will be small. If the sexes are about equal, then both ratios will be near 1. The smaller will never be greater than 1, so the most balanced names are those with the smaller of the ratios near 1.↩