Data Verbs

In English grammar, a verb is a word that expresses an action.[1] In working with data, a verb is a function that transfigures a data table in some specified way.

The word “transfigure” is used intentionally. A data verb turns a data table into a new data table. The new data table may have the same kinds of cases as the original or it may have different kinds of cases. For instance, data verbs can be used to transfigure a data table with cases about individual people (each living in a country) into a data table where the cases are countries.
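As a minimal sketch of that people-to-countries example, the code below uses a small made-up table (the `people` data frame and its variable names are hypothetical, not from the DCF package). It relies on group_by() and summarise(), two verbs introduced later in these notes, and assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical data: each case is one person living in a country.
people <- data.frame(
  person  = c("Ana", "Ben", "Chloe", "Dev", "Eli"),
  country = c("BR", "BR", "FR", "IN", "IN"),
  age     = c(34, 51, 28, 45, 39)
)

# Transfigure: in the result, each case is a country rather than a person.
countries <- people %>%
  group_by( country ) %>%
  summarise( residents = n(), meanAge = mean(age) )

countries
```

The input has five cases (people); the output has three (countries), with variables that describe each country as a whole.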

The basic data verbs you will study in these notes[2] are summarise() and group_by().

Keep these things in mind about data-verb functions:

Summarise

The summarise() data verb[3] calculates values that combine the cases in a specified way. Common calculations include:

  • n(): counts the cases.
  • sum(), mean(), median(), sd(): summarize a numerical variable.
  • max(), min(): find the maximum and minimum of a numerical variable.
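To make the list concrete, here is a small sketch that applies each of these calculations inside one summarise() call. The `scores` table and the output variable names are invented for illustration; only the summary functions themselves come from the list above.

```r
library(dplyr)

# Hypothetical table with a single numerical variable.
scores <- data.frame( points = c(4, 8, 15, 16, 23, 42) )

result <- scores %>%
  summarise( count    = n(),             # number of cases
             total    = sum(points),
             average  = mean(points),
             middle   = median(points),
             spread   = sd(points),      # standard deviation
             largest  = max(points),
             smallest = min(points) )

result
```

Whatever the number of cases in the input, the output here has exactly one case, with one variable per calculation.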

To illustrate the uses of summarise(), consider the WorldCities data table (in the DCF package).

data( WorldCities, package="DCF" )
names( WorldCities )
 [1] "code"          "name"          "latitude"     
 [4] "longitude"     "country"       "countryRegion"
 [7] "population"    "regionCode"    "region"       
[10] "date"         

The table lists 23,018 of the most populous cities in the world.

To confirm the number of cities, summarize the cases with a simple count:

WorldCities %>%
  summarise( count=n() )
  count
1 23018

Perhaps you want to know the total population in these cities, or the average population per city or the smallest city in the table.

WorldCities %>%
  summarise( averPop=mean(population),
             totalPop=sum(population),
             smallest=min(population) )
  averPop  totalPop smallest
1  112625 2.592e+09        0

The average city on the list has about 110,000 people. The total population of people living in these cities is about 2.6 billion — a bit more than one-third of the world population. The smallest city has — this seems strange! — zero people; not really a city at all.

Some things to notice about the use of summarise():

  • The chaining syntax, %>%, has been used to pass the first argument to summarise(). This will be useful later on, when there is more than one step in a transfiguration. It’s OK to end a line with %>%, but never start a line with it.
  • The output of summarize() is a data table. True, there is only one case in the output of the example: “all cities.” The summarise() function aggregates over the cases in the input data — the output therefore has a different meaning for case.
  • Named assignment has been used to give names to the variables in the output data.frame.
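The points above can be checked on a tiny table. This sketch assumes a made-up `cities` data frame (not the real WorldCities data) and compares the chained form with the equivalent direct call, in which the data table is written explicitly as the first argument.

```r
library(dplyr)

# Hypothetical stand-in for a table of cities.
cities <- data.frame(
  name       = c("A", "B", "C"),
  population = c(1200, 0, 560000)
)

# Chained form: %>% passes cities as the first argument of summarise().
piped <- cities %>%
  summarise( count = n(), totalPop = sum(population) )

# Direct form: the same call with the first argument written out.
direct <- summarise( cities, count = n(), totalPop = sum(population) )

all.equal( piped, direct )   # the two forms give the same data table
names( piped )               # names come from the named arguments
```

Both results are data tables with a single case, and their variable names are exactly the names given in the summarise() call.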

Group_by

The group_by() data verb sets things up so that other data verbs will perform their action on a group-by-group basis. For instance:

WorldCities %>% 
  group_by( country ) %>%
  summarise( count=n() ) %>% head
Source: local data frame [6 x 2]

  country count
1      AD     2
2      AE    12
3      AF    50
4      AG     1
5      AI     1
6      AL    20

The transfiguration produced by group_by() is very subtle. You won’t notice it unless you perform some other action on the output of group_by(); it really just marks the data table as being grouped and arranges its structure to be as efficient as possible for those other operations.
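One way to see how little group_by() changes on its own is to inspect its output directly. The sketch below uses a made-up `cities` table and assumes a reasonably recent version of dplyr, which provides the helper functions is_grouped_df() and group_vars().

```r
library(dplyr)

# Hypothetical table of cities with a country code.
cities <- data.frame(
  name       = c("A", "B", "C", "D"),
  country    = c("AD", "AE", "AE", "AF"),
  population = c(1200, 30000, 560000, 45000)
)

grouped <- cities %>% group_by( country )

nrow( grouped ) == nrow( cities )   # same cases as before
names( grouped )                    # same variables as before
is_grouped_df( grouped )            # but the table is now marked as grouped
group_vars( grouped )               # the variable(s) defining the groups
```

The cases and variables are unchanged; only the grouping mark is new, and it takes effect when a later verb such as summarise() is applied.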

Using group_by() along with summarize() and n() provides a basis for counting up the number of cases in groups and subgroups, or for calculating group-wise statistics.

Note that the functions used within summarise(), such as n(), max(), and sum(), are not data verbs. Remember that a data verb takes a data table as input and returns a transfigured data table as output. In contrast, the summary functions, n(), sum(), and so on, take a set of numbers as input and return a single number as output. The summarise() function inserts the numbers returned by the summary functions into the output data frame.
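The distinction can be illustrated by calling a summary function on a plain set of numbers and then using the same function inside summarise(). The vector and table below are invented for the purpose of this sketch.

```r
library(dplyr)

# A plain set of numbers (hypothetical heights in cm).
heights <- c(150, 160, 170, 180)

# A summary function: numbers in, a single number out.
avg_number <- mean( heights )
is.numeric( avg_number )      # TRUE: just a number, not a data table

# A data verb: data table in, data table out.
people <- data.frame( height = heights )
avg_table <- people %>% summarise( avg = mean(height) )
is.data.frame( avg_table )    # TRUE: summarise() wrapped the number in a table
```

The value is the same in both cases; what differs is the container, which is why only summarise() can take part in a chain of data verbs.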

Is That All?

Summarise()[4] is a team player; it works best in combination with other data verbs. Once you learn those data verbs, particularly filter() and mutate(), you’ll be able to carry out many more operations. The richness of the data-verb system comes from the ways the different verbs can be combined together. With just group_by() and summarize() you will be very limited.

An example

To illustrate, consider the NHANES data containing body shape, health, and mortality information. The case is an individual person.

data( NHANES, package="DCF" )

One of the variables is hdl, the HDL cholesterol level. (HDL is the “good” kind of cholesterol, as opposed to LDL cholesterol, which is reported in the LDL variable.)

Suppose you want to know a typical HDL level. “Typical” is with respect to the values for all the cases, not just the value for a single case. Since all the cases are being combined, summarize() is appropriate.

NHANES %>% summarise( typical=mean( hdl, na.rm=TRUE ) )
  typical
1   52.37

Or, perhaps you want the typical HDL, the shortest height, and the number of cases.

NHANES %>% 
  summarise( typical=mean( hdl, na.rm=TRUE ), 
                      shortest=min( height, na.rm=TRUE ),
                      count=n() )
  typical shortest count
1   52.37     0.79 31126

  • Each of the numerical calculations has been instructed to ignore missing data. (That’s what the na.rm=TRUE argument is doing.) Here’s what happens without na.rm=TRUE: a single missing value is enough to turn the whole result into NA.

    NHANES %>% summarise( typical=mean( hdl ), 
                      shortest=min( height ))
      typical shortest
    1      NA       NA
  • The n() function doesn’t need to be passed a variable. All of the variables in a data table have the same number of cases.
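The effect of missing data can be reproduced on a small made-up table (`patients` and its values are hypothetical). The sketch also shows an idiom that does not appear in these notes: since n() counts every case, missing or not, sum(!is.na(x)) can be used to count only the cases where a variable is actually recorded.

```r
library(dplyr)

# Hypothetical measurements with some missing values.
patients <- data.frame(
  hdl    = c(45, NA, 60, 55),
  height = c(1.70, 1.62, NA, 1.81)
)

# Without na.rm=TRUE, one missing value makes the summary NA.
withNA <- patients %>% summarise( typical = mean(hdl) )

# With na.rm=TRUE, missing values are dropped before averaging.
clean <- patients %>%
  summarise( typical  = mean( hdl, na.rm=TRUE ),
             count    = n(),                # all cases, missing or not
             hdlKnown = sum( !is.na(hdl) )  # cases with a recorded hdl
  )

withNA
clean
```

Comparing count with hdlKnown shows how many cases actually contributed to the average.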

Group_by

As with WorldCities, group_by() sets things up so that summarise() performs its action on a group-by-group basis. For instance, grouping the NHANES cases by sex:

NHANES %>% 
  group_by( sex ) %>%
  summarise( typical=mean( hdl, na.rm=TRUE ), 
             shortest=min( height, na.rm=TRUE ),
             n=n() )
Source: local data frame [2 x 4]

     sex typical shortest     n
1   male   48.91    0.790 15184
2 female   55.70    0.797 15942

Using group_by() along with summarize() and n() provides a basis for counting up the number of cases in groups and subgroups:

NHANES %>% 
  group_by( sex, smoker ) %>%
  summarise( n=n() )
Source: local data frame [6 x 3]
Groups: sex

     sex smoker     n
1   male    yes  2330
2   male     no 11965
3   male     NA   889
4 female    yes  1796
5 female     no 13284
6 female     NA   862
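The same pattern can be tried on a small invented table (`survey` below is hypothetical). The sketch highlights two things visible in the output above: a missing value in a grouping variable forms its own group, and the result is still grouped by the first variable, as the "Groups: sex" line indicates. Recent versions of dplyr may also print an informational message about the grouping of the result.

```r
library(dplyr)

# Hypothetical survey responses; one smoker value is missing.
survey <- data.frame(
  sex    = c("male", "male", "female", "female", "female", "male"),
  smoker = c("yes",  "no",   "no",     NA,       "no",     "no")
)

counts <- survey %>%
  group_by( sex, smoker ) %>%
  summarise( n = n() )

counts
group_vars( counts )   # the result remains grouped by sex
sum( counts$n )        # the group counts add up to the number of cases
```

Each combination of sex and smoker that occurs in the data becomes one case in the output, including the combination where smoker is NA.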



  1. And, to be more complete, a state of being. But that isn’t relevant here.

  2. Other data verbs are filter(), mutate(), arrange(), join(), gather(), and spread().

  3. Also known by the American spelling, summarize().

  4. The function is summarise() with a lower-case s at the start. But English grammar calls for starting a sentence with a capital letter. You should use summarise() in your calculations.