Each year, the US Social Security Administration publishes a list of the most popular names given to babies. In 2014, the list shows Emma and Olivia leading for girls, Noah and Liam for boys.
The BabyNames data table in the DCF package comes from the Social Security Administration’s listing of the names given to babies in each year, and the number of babies of each sex given that name. (Only names with 5 or more babies are published by the SSA.)
A few simple questions about the data.
When starting, it can be helpful to work with a small subset of the data. When you have your data wrangling statements in working order, shift to the entire data table.
SmallSubset <-
  BabyNames %>%
  filter(year > 2000) %>%
  sample_n(size = 200)
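As a minimal sketch of this subset-first workflow, the chunk below uses a made-up table (ToyNames, a hypothetical stand-in for BabyNames) so that it runs without the DCF package. The call to set.seed() is an addition, assumed here only so that the random sample is the same on every run.

```r
library(dplyr)

# Hypothetical stand-in for BabyNames, with the same four columns.
ToyNames <- tibble(
  name  = rep(c("Ann", "Bob", "Cy", "Dee"), times = 5),
  sex   = rep(c("F", "M", "M", "F"), times = 5),
  year  = rep(c(1999, 2001, 2003, 2005, 2007), each = 4),
  count = 1:20
)

set.seed(101)  # assumption: fix the seed so the sample is reproducible

ToySubset <-
  ToyNames %>%
  filter(year > 2000) %>%   # keep only the rows after 2000
  sample_n(size = 5)        # then draw 5 of those rows at random
```

Once the statements work on the small table, the same pipeline can be pointed at the full one by swapping the table name.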
Note: Chunks in this template are headed with {r eval=FALSE}. Change this to {r} when you are ready to compile.
SmallSubset %>%
  summarise(total = ????) # a reduction verb

SmallSubset %>%
  group_by(????) %>%
  summarise(total = ????)

SmallSubset %>%
  group_by(????) %>%
  summarise(name_count = n_distinct(????))

SmallSubset %>%
  group_by(????, ????) %>%
  summarise(????)
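The chunks above share one pattern: optionally group the rows, then reduce each group to a single row with summarise(). Here is a sketch of that pattern on a made-up sales table. The table Sales and its column names are invented for illustration and are unrelated to BabyNames.

```r
library(dplyr)

# Made-up data: units sold, by store and item.
Sales <- tibble(
  store = c("A", "A", "A", "B", "B"),
  item  = c("pen", "ink", "pen", "pen", "cap"),
  units = c(3, 1, 2, 5, 4)
)

# No grouping: the whole table is reduced to one row.
Overall <-
  Sales %>%
  summarise(total = sum(units))

# Grouped: one output row per store.
ByStore <-
  Sales %>%
  group_by(store) %>%
  summarise(total = sum(units),          # units sold in that store
            items = n_distinct(item))    # number of different items sold
```

With two grouping variables, for instance group_by(store, item), the result would have one row for each store and item combination that appears in the data.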
Result <-
  BabyNames %>%
  ????(name %in% c("Jane", "Mary")) %>% # just the Janes and Marys
  group_by(????, ????) %>%              # for each year for each name
  summarise(count = ????)
Put year on the x-axis and the count of each name on the y-axis. Note that ggplot() commands use + rather than %>%.
ggplot(data=Result, aes(x = year, y = count)) +
geom_point()
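A small sketch of how the two operators meet: %>% carries a data table into ggplot(), and from there on layers are attached with +. The table Heights and its columns are made up for illustration. The example also contrasts an aesthetic mapped to a variable (inside aes()) with one set to a constant (outside aes()).

```r
library(dplyr)
library(ggplot2)

# Made-up data: heights of two children at three ages.
Heights <- tibble(
  age    = c(1, 2, 3, 1, 2, 3),
  height = c(75, 86, 95, 74, 85, 94),
  child  = c("x", "x", "x", "y", "y", "y")
)

P <-
  Heights %>%
  filter(age >= 2) %>%              # wrangling steps chain with %>%
  ggplot(aes(x = age, y = height,
             color = child)) +      # mapped: color varies with child
  geom_point(size = 3) +            # set: every point gets size 3
  ylab("Height (cm)")               # graphics layers are added with +
```

Printing P draws the plot. When the same idea is applied to lines, note that ggplot2 3.4.0 and later prefer linewidth over size for line thickness.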
aes() function.
geom_line().
+ ylab("Yearly Births")
size=2. Remember that “setting” refers to adjusting the value of an aesthetic to a constant. Thus, it’s outside the aes() function.
Result2 <-
  BabyNames %>%
  group_by(year) %>%
  mutate(total = ????(count)) %>%
  filter(????) %>%
  mutate(proportion = ???? / ????)
sex a variable in Result2? Eliminate it, keeping just the girls. Note: It would likely be better to add up the boys and girls, but this is surprisingly hard. It becomes much easier once you have another data verb to work with: inner_join().
filter() step is put before the mutate() step?
Just as you did with count vs year, graph proportion vs year.
Result2 %>%
Your ggplot statements go here!
geom_vline().
Plot out their popularity over time.