Each year, the US Social Security Administration publishes a list of the most popular names given to babies. In 2014, the list shows Emma and Olivia leading for girls, Noah and Liam for boys.

The BabyNames data table in the DCF package comes from the Social Security Administration’s listing of the names given to babies in each year, and the number of babies of each sex given that name. (Only names with 5 or more babies are published by the SSA.)

Warm-ups

A few simple questions about the data.

When starting, it can be helpful to work with a small subset of the data. When you have your data wrangling statements in working order, shift to the entire data table.

SmallSubset <-
  BabyNames %>%
  filter(year > 2000) %>%
  sample_n(size = 200)

Note: Chunks in this template are headed with {r eval=FALSE}. Change this to {r} when you are ready to compile.

1. How many babies are represented?

SmallSubset %>%
  summarise(total = ????) # a reduction verb

2. How many babies are there in each year?

SmallSubset %>% 
  group_by(????) %>% 
  summarise(total = ????)

3. How many distinct names in each year?

SmallSubset %>%
  group_by(????) %>%
  summarise(name_count = n_distinct(????))

4. How many distinct names of each sex in each year?

SmallSubset %>%
  group_by(????, ????) %>%
  summarise(????)
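The warm-ups all follow the same pattern: optionally group the rows, then reduce each group to a single row. As a rough illustration of that pattern only, here is a sketch on a made-up table. `Sales`, its columns, and its values are hypothetical and have nothing to do with BabyNames.

```r
library(dplyr)

# Hypothetical toy table (not part of BabyNames): units sold per shop and fruit.
Sales <- tibble(
  shop  = c("A", "A", "A", "B", "B", "B"),
  fruit = c("apple", "pear", "apple", "apple", "fig", "fig"),
  units = c(3, 5, 2, 4, 1, 6)
)

# Whole-table reduction: summarise() collapses all rows into one.
overall <- Sales %>%
  summarise(total = sum(units))

# Group-wise reduction: one output row per shop.
by_shop <- Sales %>%
  group_by(shop) %>%
  summarise(total = sum(units), fruit_count = n_distinct(fruit))
```

Without `group_by()`, `summarise()` returns a single row for the whole table; with it, the same reduction is applied once per group.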

Popularity of Jane and Mary

1. Track the yearly number of Janes and Marys over the years.

Result <-
  BabyNames %>%
  ????(name %in% c("Jane", "Mary")) %>% # just the Janes and Marys
  group_by(????, ????) %>% # for each year for each name
  summarise(count = ????)

2. Plot out the result of (1)

Put year on the x-axis and the count of each name on the y-axis. Note that ggplot() commands use + rather than %>%.

ggplot(data=Result, aes(x = year, y = count)) +
  geom_point()
  • Map the name (Mary or Jane) to the aesthetic of color. Remember that mapping to aesthetics is always done inside the aes() function.
  • Instead of using dots as the glyph, use a line that connects consecutive values: geom_line().
  • Change the y-axis label to “Yearly Births”: + ylab("Yearly Births")
  • Set the line thickness to size=2. Remember that “setting” refers to adjusting the value of an aesthetic to a constant. Thus, it’s outside the aes() function.
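The difference between mapping and setting can be sketched on a small made-up table. `Toy` and its columns are hypothetical. The sketch also assumes ggplot2 3.4.0 or later, where line thickness is given as `linewidth`; in older versions the argument is `size`, as in the bullet above.

```r
library(ggplot2)

# Hypothetical toy data: two groups measured over four years.
Toy <- data.frame(
  year  = rep(2001:2004, times = 2),
  group = rep(c("a", "b"), each = 4),
  value = c(1, 3, 2, 4, 2, 2, 5, 3)
)

p <- ggplot(Toy, aes(x = year, y = value, color = group)) +  # mapping: inside aes()
  geom_line(linewidth = 2) +  # setting a constant: outside aes(); use size = 2 on older ggplot2
  ylab("Value")
```

Printing `p` draws one line per group, each in its own color, with the same thickness for both.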

3. Look at the proportion of births rather than the count

Result2 <-
  BabyNames %>%
  group_by(year) %>%
  mutate(total = ????(count)) %>%
  filter(????) %>%
  mutate(proportion = ???? / ????)
  • Why is sex a variable in Result2? Eliminate it, keeping just the girls. Note: It would likely be better to add up the boys and girls, but this is surprisingly hard. It becomes much easier once you have another data verb to work with: inner_join().
  • What happens if the filter() step is put before the mutate() step?
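A grouped `mutate()` keeps every row and attaches the group-level value to each one, which is what makes a within-group proportion possible. Here is a minimal sketch on a made-up table; `Sales` and its columns are hypothetical, not BabyNames variables.

```r
library(dplyr)

# Hypothetical toy table: units sold per shop and fruit.
Sales <- tibble(
  shop  = c("A", "A", "B", "B"),
  fruit = c("apple", "pear", "apple", "fig"),
  units = c(3, 7, 4, 1)
)

Shares <- Sales %>%
  group_by(shop) %>%
  mutate(shop_total = sum(units)) %>%  # the shop's total is repeated on each of its rows
  ungroup() %>%
  mutate(share = units / shop_total)   # each row's fraction of its shop's total
```

Unlike `summarise()`, the result still has one row per original row, so later steps can filter or plot individual cases alongside their group totals.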

Just as you did with count vs year, graph proportion vs year.

Result2 %>%
  Your ggplot statements go here!
  • Add a vertical line to mark a year in which something happened that might relate to an increase or decrease in the popularity of the name. Example: The movie What Ever Happened to Baby Jane? came out in 1962. The glyph is a vertical line: geom_vline().
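As a sketch of how a reference line can be layered onto a time series, the example below uses a made-up table. `Toy`, its values, and `event_year` are all hypothetical and do not come from BabyNames.

```r
library(ggplot2)

# Hypothetical toy series; the values are invented for illustration only.
Toy <- data.frame(
  year       = 1960:1965,
  proportion = c(0.010, 0.011, 0.012, 0.009, 0.008, 0.007)
)
event_year <- 1962  # hypothetical year to highlight

p <- ggplot(Toy, aes(x = year, y = proportion)) +
  geom_line() +
  geom_vline(xintercept = event_year, linetype = "dashed")  # constant position, set outside aes()
```

Because `xintercept` is a fixed value rather than a variable in the data, it is set directly in `geom_vline()` instead of being mapped inside `aes()`.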

4. Pick out names of interest to you

Plot out their popularity over time.