Groups with data verbs

The group_by() and summarise() data verbs go hand in hand. As you know, group_by() establishes which variables define the groups in a data table; summarise() then does calculations within each group and returns one case per group. For instance,

StatePopulation <- 
  ZipGeography %>%
  group_by( State ) %>% 
  summarise( totalPop=sum( Population, na.rm=TRUE ))
Source: local data frame [6 x 2]

          State totalPop
1                 741752
2 Massachusetts  6349048
3 New Hampshire  1235735
4      New York 18974619
5  Rhode Island  1048319
6         Maine  1273094
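The example above depends on the ZipGeography table. As a self-contained sketch of the same pattern, the code below uses a small made-up table (ToyZips, with hypothetical states "A" and "B") so that the one-case-per-group result can be checked by hand. The column nZips and the use of n() are additions for illustration and are not part of the original example.

```r
library(dplyr)

# Hypothetical toy data standing in for ZipGeography
ToyZips <- data.frame(
  State      = c("A", "A", "B", "B", "B"),
  Population = c(100, NA, 40, 60, 50)
)

ToyStatePop <-
  ToyZips %>%
  group_by(State) %>%
  summarise(totalPop = sum(Population, na.rm = TRUE),  # NA values are skipped
            nZips    = n())                            # number of cases in each group

ToyStatePop
```

The five input cases collapse to two output cases, one for each value of State.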

You can also do grouping operations with filter() and mutate(). Whenever you use a summary function within filter() or mutate() on grouped data, the calculation is performed groupwise. A variable, or a transformation of one, is evaluated case by case, while the output of a summary function such as sum() is a single value for the entire group.
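The distinction between per-case and per-group values can be sketched on a small made-up table. Toy, grp, x, groupTotal, and share are hypothetical names chosen for illustration; only the dplyr verbs themselves come from the text.

```r
library(dplyr)

# Hypothetical toy data, not taken from ZipGeography
Toy <- data.frame(
  grp = c("a", "a", "b", "b", "b"),
  x   = c(1, 3, 2, 2, 6)
)

Shares <-
  Toy %>%
  group_by(grp) %>%
  mutate(groupTotal = sum(x),     # summary function: one value per group, repeated for each case
         share = x / groupTotal)  # x refers to the individual case

AboveMean <-
  Toy %>%
  group_by(grp) %>%
  filter(x > mean(x))             # each case is compared with the mean of its own group
```

In Shares, every case keeps its own row and gains its group's total. In AboveMean, a case is kept only if it exceeds the mean of its own group, not the mean of the whole table.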

For each state, find the ZIP code that accounts for the largest fraction of the state’s population:

PopFrac <-
  ZipGeography %>%
  group_by( State ) %>% 
  mutate( frac=Population / sum(Population, na.rm=TRUE )) %>%
  filter( rank(desc(frac))==1 )
PopFrac
Source: local data frame [6 x 4]
Groups: State

          State Population   CityName   ZIP
1                    64300  Vega Baja 00693
3 Massachusetts      61737   Brockton 02301
4  Rhode Island      46381  Pawtucket 02860
5 New Hampshire      36732 Manchester 03103
6         Maine      40358     Bangor 04401
7       Vermont      38940 Burlington 05401

Leading the list, with a blank entry for State, is Vega Baja, a city in Puerto Rico, which is part of the US but not a state.
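One caveat about the filter( rank(desc(frac))==1 ) step: base R's rank() averages the ranks of tied values by default, so if two cases in a group tie for the largest value, each gets rank 1.5 and neither passes the test. The sketch below, on a made-up table (Tied, with hypothetical columns grp and x), contrasts this with dplyr's min_rank(), which gives every tied case the smallest rank. Whether ZipGeography actually contains such ties is not something the text says; this only illustrates the behavior.

```r
library(dplyr)

# Hypothetical toy data with a tie for the largest value in group "a"
Tied <- data.frame(
  grp = c("a", "a", "a", "b", "b"),
  x   = c(5, 5, 1, 2, 7)
)

WithRank <-
  Tied %>%
  group_by(grp) %>%
  filter(rank(desc(x)) == 1)      # tied cases in "a" both get rank 1.5, so "a" drops out

WithMinRank <-
  Tied %>%
  group_by(grp) %>%
  filter(min_rank(desc(x)) == 1)  # tied cases in "a" both get rank 1, so both are kept
```

Which behavior is preferable depends on whether you want every tied case, exactly one case, or none; min_rank() is shown here only as one option.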



Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.