---
title: "What Happened to Jane?"
author: "Data Computing"
date: "CVC 2015"
output: html_document
---

```{r include=FALSE}
library(babynames)
library(ggplot2)
library(mosaic)
your_turn <- function(...) {
  "Your statement should replace 'your_turn()' in this chunk."
}
```

Each year, the US Social Security Administration publishes a list of the most popular names given to babies.  [The 2014 list](http://www.ssa.gov/oact/babynames/#ht=2) shows Emma and Olivia leading for girls, and Noah and Liam for boys.

The `babynames` data table in the `babynames` package comes from the Social Security Administration's listing of the names given to babies in each year, and the number of babies of each sex given that name. (The SSA publishes only names given to 5 or more babies.)


## Warm-ups

A few simple questions about the data.

When starting, it can be helpful to work with a small subset of the data.  When you have your data wrangling statements in working order, shift to the entire data table.

```{r}
SmallSubset <-
  babynames %>%
  filter(year > 2000) %>%
  sample_n(size = 200)
```


Note: The exercise chunks in this template are headed with `{r eval=FALSE}`.  Change this to `{r}` once you have filled in a chunk and are ready to compile.
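The warm-ups each call for a reduction verb inside `summarise()`. As a minimal sketch of the idea, the chunk below uses a small made-up table (`Toy` and its columns are hypothetical, not part of `babynames`) and reduction verbs other than the ones the exercises need.

```{r}
library(dplyr)

# Hypothetical toy table, not taken from babynames
Toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  n     = c(10, 20, 5, 5, 30)
)

# A reduction verb collapses many values into a single value
Toy %>%
  summarise(largest = max(n))

# With group_by(), the reduction is carried out separately for each group
ToyByGroup <-
  Toy %>%
  group_by(group) %>%
  summarise(largest = max(n), kinds = n_distinct(n))
ToyByGroup
```

Without `group_by()` the result has one row; with it, there is one row per group.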

### 1. How many babies are represented?

```{r eval=FALSE}
SmallSubset %>%
  summarise(total = ????(n)) # a reduction verb
```

### 2. How many babies are there in each year?

```{r eval=FALSE}
SmallSubset %>% 
  group_by(????) %>% 
  summarise(total = ????(n))
```

### 3. How many distinct names in each year?

```{r eval=FALSE}
SmallSubset %>%
  group_by(????) %>%
  summarise(name_count = n_distinct(????))
```

### 4. How many distinct names of each sex in each year?

```{r eval=FALSE}
SmallSubset %>%
  group_by(????, ????) %>%
  summarise(????)
```

## Popularity of Jane and Mary

### 1. Track the number of Janes and Marys born each year.

```{r eval=FALSE}
Result <-
  babynames %>%
  ????(name %in% c("Jane", "Mary")) %>% # just the Janes and Marys
  group_by(????, ????) %>% # for each year for each name
  summarise(count = ????)
```

### 2. Plot out the result of (1)

Put `year` on the x-axis and the count of each name on the y-axis.  Note that `ggplot()` commands use `+` rather than `%>%`.

```{r eval=FALSE}
ggplot(data=Result, aes(x = year, y = count)) +
  geom_point()
```

* *Map* the name (Mary or Jane) to the aesthetic of color.  Remember that mapping to aesthetics is always done inside the `aes()` function.
* Instead of using dots as the glyph, use a line that connects consecutive values: `geom_line()`.
* Change the y-axis label to "Yearly Births": `+ ylab("Yearly Births")`
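* *Set* the line thickness to `size=2`. Remember that "setting" refers to adjusting the value of an aesthetic to a constant.  Thus, it's *outside* the `aes()` function. (In ggplot2 3.4.0 and later, line thickness is controlled by `linewidth`, so use `linewidth=2` there.)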
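
The difference between *mapping* and *setting* can be sketched on a small made-up table. Everything here (`Toy`, `when`, `value`, `kind`) is hypothetical and chosen only for illustration; it is not the answer to the exercise.

```{r}
library(ggplot2)

# Hypothetical toy data, not taken from babynames
Toy <- data.frame(
  when  = rep(2001:2004, times = 2),
  value = c(1, 3, 2, 5, 4, 2, 3, 1),
  kind  = rep(c("a", "b"), each = 4)
)

# Mapping: color depends on a variable, so it goes inside aes()
mapped <- ggplot(Toy, aes(x = when, y = value, color = kind)) +
  geom_line()

# Setting: color is one constant for the whole layer, so it goes outside aes()
set_constant <- ggplot(Toy, aes(x = when, y = value, group = kind)) +
  geom_line(color = "blue")

mapped
set_constant
```

In the first plot each `kind` gets its own color and a legend; in the second, both lines are blue and no legend is drawn.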

### 3. Look at the *proportion* of births rather than the count

```{r eval=FALSE}
Result2 <-
  babynames %>%
  group_by(year) %>%
  mutate(total = ????(n)) %>%
  filter(????) %>%
  mutate(proportion = n / total)
```

* Why is `sex` a variable in `Result2`? Eliminate it, keeping just the girls.
    Note: It would likely be better to add up the boys and girls, but this is surprisingly hard.  It becomes much easier once you have another data verb to work with: `inner_join()`.  
* What happens if the `filter()` step is put *before* the `mutate()` step?

Just as you did with count vs year, graph proportion vs year.

```{r eval=FALSE}
Result2 %>%
  your_turn() # Your ggplot statements go here!
```

* Add a vertical line to mark a year in which something happened that might relate to an increase or decrease in the popularity of the name.  Example: The movie [*What Ever Happened to Baby Jane?*](http://en.wikipedia.org/wiki/What_Ever_Happened_to_Baby_Jane%3F_%281962_film%29) came out in 1962.  The glyph is a vertical line: `geom_vline()`.
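
As a rough sketch of how `geom_vline()` is used, here is a made-up series with a marked year. The table `Toy` and the value of `event_year` are hypothetical and stand in for your own data and event.

```{r}
library(ggplot2)

# Hypothetical toy data, not taken from babynames
Toy <- data.frame(when = 2001:2006, value = c(2, 3, 5, 4, 6, 7))
event_year <- 2003 # hypothetical year to mark

# geom_vline() draws a vertical line at the x position given by xintercept
marked <- ggplot(Toy, aes(x = when, y = value)) +
  geom_line() +
  geom_vline(xintercept = event_year, linetype = "dashed")
marked
```

Here `xintercept` is *set* to a constant, so it sits outside `aes()`.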
  
### 4. Pick out names of interest to you

Plot out their popularity over time.