The group_by() and summarise() data verbs go hand in hand. As you know, group_by() establishes which variables define the groups in a data table; summarise() carries out calculations within each of those groups, reducing each group to a single case. For instance, to find the total population of each state:
StatePopulation <-
  ZipGeography %>%
  group_by( State ) %>%
  summarise( totalPop=sum( Population, na.rm=TRUE ))
StatePopulation
Source: local data frame [6 x 2]
          State totalPop
1                 741752
2 Massachusetts  6349048
3 New Hampshire  1235735
4      New York 18974619
5  Rhode Island  1048319
6         Maine  1273094
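To see the collapse from many cases to one case per group on a small scale, here is a minimal sketch. The data frame Toy and its values are invented for illustration and are not taken from ZipGeography; the sketch assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data (invented values, not from ZipGeography)
Toy <- data.frame(
  State      = c("A", "A", "B", "B", "B"),
  Population = c(10, 30, 5, NA, 15)
)

# Five cases go in; one case per State comes out.
# na.rm = TRUE drops the missing Population before summing.
ToyTotals <-
  Toy %>%
  group_by( State ) %>%
  summarise( totalPop = sum( Population, na.rm = TRUE ))

ToyTotals
```

Without na.rm = TRUE, the total for group B would be NA, because one of its cases has a missing Population.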
You can also do grouping operations with filter() and mutate(). Whenever you use a summary function within filter() or mutate() on grouped data, the summary is calculated groupwise. A variable or transformation refers to the individual case, but the output of the summary function refers to the entire group.
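The difference between casewise and groupwise quantities can be sketched on a small table. As before, Toy and its values are hypothetical, invented only for this illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data (invented values)
Toy <- data.frame(
  State      = c("A", "A", "B", "B"),
  Population = c(10, 30, 5, 15)
)

# mutate() keeps every case. Population is the value for the individual
# case; sum(Population) is the total for that case's group.
ToyShare <-
  Toy %>%
  group_by( State ) %>%
  mutate( stateTotal = sum( Population ),
          share = Population / stateTotal ) %>%
  ungroup()

# filter() compares each case to its own group's mean, not the overall mean.
ToyAboveMean <-
  Toy %>%
  group_by( State ) %>%
  filter( Population > mean( Population )) %>%
  ungroup()

ToyShare
ToyAboveMean
```

In ToyShare all four cases survive, each carrying its group's total. In ToyAboveMean the case with Population 15 is kept because it exceeds the mean of group B (10), even though it is below the mean of the whole table (15 is not greater than 15).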
Find the ZIP code in each state with the largest Population as a fraction of the state’s population:
PopFrac <-
  ZipGeography %>%
  group_by( State ) %>%
  mutate( frac=Population / sum(Population, na.rm=TRUE )) %>%
  filter( rank(desc(frac))==1 )
PopFrac
Source: local data frame [6 x 4]
Groups: State
          State Population   CityName   ZIP
1                    64300  Vega Baja 00693
3 Massachusetts      61737   Brockton 02301
4  Rhode Island      46381  Pawtucket 02860
5 New Hampshire      36732 Manchester 03103
6         Maine      40358     Bangor 04401
7       Vermont      38940 Burlington 05401
Leading the list is Vega Baja, a city in Puerto Rico, which is part of the US, but not a state.
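One detail of the rank(desc(frac))==1 filter is worth a small sketch: base R's rank() averages tied ranks by default, so when two cases tie for the largest value in a group, neither gets rank 1 and the group drops out of the result. The sketch below uses hypothetical toy data (invented values) and compares rank() with dplyr's min_rank(), which gives tied cases the smallest rank they share. It is an illustration of the tie behavior, not a claim about what ZipGeography contains.

```r
library(dplyr)

# Hypothetical toy data: group A has a tie for the largest Population
Toy <- data.frame(
  State      = c("A", "A", "A", "B", "B"),
  Population = c(30, 30, 10, 5, 15)
)

# rank() gives the two tied cases in A rank 1.5 each, so neither equals 1
WithRank <-
  Toy %>%
  group_by( State ) %>%
  filter( rank( desc( Population )) == 1 ) %>%
  ungroup()

# min_rank() gives both tied cases rank 1, so both are kept
WithMinRank <-
  Toy %>%
  group_by( State ) %>%
  filter( min_rank( desc( Population )) == 1 ) %>%
  ungroup()

WithRank      # only the B case with Population 15
WithMinRank   # both tied A cases, plus the B case with Population 15
```

Which behavior is preferable depends on the question: min_rank() keeps every case tied for first, while rank() with its default settings silently returns nothing for a group with a tie at the top.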
Written by Daniel Kaplan for the Data & Computing Fundamentals Course. Development was supported by grants from the National Science Foundation for Project Mosaic (NSF DUE-0920350) and from the Howard Hughes Medical Institute.