Visualizing Movie Ratings

A set of 100,000 ratings of movies by individuals was collected in the late 1990s by the GroupLens research team at the University of Minnesota. The GroupLens team provides the data directly at http://grouplens.org/datasets/movielens/100k/. These data were reformatted for the Data Computing book. Download them to your own computer with this statement:

You only need to download the data once. But each time you start a new R session, you will need to load() the data into your R session. (Every time you knit a document, you are starting a new session just for the purpose of compiling that document.)

Show the appeal of different genres to the different sexes

Which genres are related?

Look at the correlation between genres using cor().

Another possibility: look at the co-occurrence, that is, the fraction of movies in genre A that are also in genre B.
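To make the co-occurrence fraction concrete, here is a minimal sketch on a made-up table of 0/1 genre indicators. The `toy` data frame and its values are invented for illustration and are not part of the MovieLens data. It uses the same definition as above, shared movies divided by the number of movies in the first genre, but computes all pairs at once with base R's crossprod().

```r
# Toy 0/1 genre indicators for five hypothetical movies
# (invented values, not from the MovieLens data)
toy <- data.frame(
  Action   = c(1, 1, 1, 0, 0),
  Comedy   = c(0, 1, 0, 1, 1),
  Thriller = c(1, 1, 0, 0, 0)
)

X <- as.matrix(toy)

# crossprod(X) counts the movies shared by each pair of genres.
# Dividing by colSums(X) recycles down the columns, so row i is
# divided by the number of movies in genre i.
co_occur <- crossprod(X) / colSums(X)

co_occur["Thriller", "Action"]  # fraction of Thriller movies that are also Action
co_occur["Action", "Thriller"]  # fraction of Action movies that are also Thriller

# For comparison, the correlation matrix is symmetric
cor(X)
```

Unlike correlation, this measure is not symmetric: the fraction of genre A that is also genre B generally differs from the fraction of genre B that is also genre A.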

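library(dplyr)    # %>%, filter(), group_by(), mutate()
library(tidyr)    # gather()
library(ggplot2)  # ggplot() and the geoms used below

# For each ordered pair of genres, compute the fraction of movies
# in the first genre that are also in the second genre.
co_occurance <- function(genres){
  # x and y are 0/1 indicator columns, one entry per movie
  f <- function(x,y) {sum(x * y) / sum(x)}
  M <- matrix(0, nrow = ncol(genres), ncol = ncol(genres))
  for (first in seq_len(ncol(genres))) {
    for (second in seq_len(ncol(genres))) {
      M[first, second] <- f(genres[[first]], genres[[second]])
    }
  }
  M <- as.data.frame(M)
  names(M) <- names(genres)
  M$genre <- names(genres)
  # Reshape to long form: one row per (genre, genre2) pair
  M %>%
    tidyr::gather(key = genre2, value = co_occur, -genre)
}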

Genres <- Movies[,6:23]
tmp <- cor(Genres) %>% as.data.frame(stringsAsFactors = FALSE)
tmp$genre <- row.names(tmp)
Genre_pairs <-
  tmp %>% 
  gather(key = genre2, value = correlation, -genre) %>%
  filter(genre != genre2) %>%
  group_by(genre) %>%
  mutate(cor_sign = as.character(sign(correlation))) %>%
  mutate(cor_size = abs(correlation))
Genre_co_occur <- co_occurance(Genres) %>%
  filter(genre != genre2)
Genre_pairs %>%
  ggplot(aes(x = genre2, y = genre)) + 
  geom_point(aes(size = cor_size, color = cor_sign)) +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))

Genre_co_occur %>%
  ggplot(aes(x = genre2, y = genre)) +
  geom_point(aes(size = co_occur)) +
  theme(axis.text.x  = element_text(angle=90, vjust=0.5))
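The reshaping step done with gather() above can also be written with tidyr's pivot_longer(), which the tidyr documentation recommends for new code. The following is only a sketch on an invented three-genre table (`toy` is hypothetical, not the Movies data). It produces the same genre pairs, though the rows come out in a different order.

```r
library(tidyr)

# Hypothetical 0/1 genre indicators, for illustration only
toy <- data.frame(
  Action   = c(1, 1, 1, 0, 0),
  Comedy   = c(0, 1, 0, 1, 1),
  Thriller = c(1, 1, 0, 0, 0)
)

# Wide correlation table with the row genre stored in a column
wide <- as.data.frame(cor(toy))
wide$genre <- row.names(wide)

# One row per (genre, genre2) pair, analogous to
# gather(key = genre2, value = correlation, -genre)
long <- pivot_longer(wide, cols = -genre,
                     names_to = "genre2", values_to = "correlation")
long
```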

As a network

library(igraph)
Keep_pairs <- 
  Genre_pairs %>%
  filter(cor_size > 0.05, cor_sign == "1") %>%
  filter(genre > genre2) 
Vertices <- Keep_pairs %>% 
  edgesToVertices(from = genre, to = genre2) 
 
Edges <- 
  Vertices %>%
  edgesForPlotting(ID = ID, x, y, Edges = Keep_pairs, from = genre, to = genre2)
Vertices %>%
  ggplot(aes(x = x, y = y)) + geom_point()+
  geom_segment(data = Edges, 
               aes(x = x, y = y, xend = xend, yend = yend, 
                   color = correlation, size = correlation)) + 
  theme_map() + 
    geom_label(aes(label = ID), fill = "white")
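As an alternative sketch, vertex positions for a network like this can also be computed directly with igraph, which is loaded above. The `toy_pairs` table and its correlation values are invented for illustration, and this is not necessarily how the helper functions used above compute their layout.

```r
library(igraph)

# Hypothetical genre pairs with positive correlations
# (values invented for illustration)
toy_pairs <- data.frame(
  genre       = c("Thriller", "Thriller", "Romance", "War"),
  genre2      = c("Action", "Crime", "Drama", "Drama"),
  correlation = c(0.20, 0.15, 0.10, 0.12)
)

# The first two columns are read as the edge list;
# remaining columns become edge attributes.
g <- graph_from_data_frame(toy_pairs, directed = FALSE)

# Force-directed layout; set the seed so positions are reproducible
set.seed(1)
xy <- layout_with_fr(g)

# One row per genre, with plotting coordinates
vertices <- data.frame(ID = V(g)$name, x = xy[, 1], y = xy[, 2])
vertices
```

The resulting `vertices` table has one row per genre with x and y coordinates, which is the shape of data that the geom_point() and geom_label() layers above expect.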

And for co-occurrences …

Who are the reviewers?

Users %>%
  ggplot(aes(x = age)) + 
  geom_density(aes(fill = occupation), 
               color = NA, alpha = .7, position = "fill") + 
  facet_wrap( ~ sex)

Users %>%
  ggplot(aes(x = age)) + 
  geom_density(aes(fill = sex), 
               color = NA, alpha = .4, position = "fill")

Users %>%
  group_by(occupation) %>%
  tally() %>%
  arrange(desc(n))

## # A tibble: 21 × 2
##       occupation     n
##            <chr> <int>
## 1        student   196
## 2          other   105
## 3       educator    95
## 4  administrator    79
## 5       engineer    67
## 6     programmer    66
## 7      librarian    51
## 8         writer    45
## 9      executive    32
## 10     scientist    31
## # ... with 11 more rows
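The group_by(), tally(), arrange() sequence above can be shortened with dplyr's count(). A minimal sketch on an invented `toy_users` table (hypothetical, not the Users data):

```r
library(dplyr)

# Hypothetical reviewers, for illustration only
toy_users <- data.frame(
  occupation = c("student", "student", "educator",
                 "student", "engineer", "educator")
)

# count() groups, tallies, and (with sort = TRUE) orders by descending n
occupation_counts <- toy_users %>%
  count(occupation, sort = TRUE)
occupation_counts
```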

Ratings as people age

All %>%
  filter( genre != "unknown") %>%
  ggplot(aes(x = age, color = sex, y = rating)) + 
  geom_smooth() + 
  facet_wrap( ~ genre, scales = "free")

## `geom_smooth()` using method = 'gam'

All %>% 
  ggplot(aes(x = age, color = sex, y = rating)) +
  geom_smooth()

## `geom_smooth()` using method = 'gam'