4  Data wrangling

The data that comes to us is often not in the ideal organization for display in a way that suits our purposes. This Lesson (and the next) is about basic computer techniques for re-organizing data frames. There is huge benefit of sticking to the conventions of data frames: the hard work of programming has already been done for us by experts. Instead of loops and arrays—the kinds of things you learn in an introductory computer programming course—there is a small set of basic operations that are easily invoked to re-organize data. Each of these operations (except, perhaps, the one covered in Lesson 6) can be understood intuitively.

We can’t anticipate every way that you might find it useful to re-organize data to suit your purposes, simply because we don’t know what your purposes will be in any given project. Even so, the experience of millions of data professionals around the world gives us confidence that, whatever you seek to accomplish can be done by stringing together “data wrangling operations” in an appropriate order. An analogy that’s relevant is writing and playing the harmony of rock music on a guitar. The guitarist has to learn a few chords. The music is composed or played by organizing a sequence of these chords and short breaks, customizing each one for duration and loudness. A new and unique piece of music can be composed by the careful combination of chords and their sequence.

Three kinds of functions

When data wrangling, you will be working with a small set of functions that always take a data frame as input, and always produce a data frame as output. At the core of that small set are five wrangling functions that you will be using extensively: select(), summarize(), mutate(), filter(), and arrange(). To illustrate:

  1. Output has only the selected variables.
Tiny |> select(mass, species)
 mass  species   
-----  ----------
 3950  Chinstrap 
 4400  Adelie    
 5600  Gentoo    
 4700  Gentoo    
 3500  Adelie    
 5600  Gentoo    
 3950  Adelie    
 3250  Chinstrap 
  ... and so on for 4 rows in total.
  1. Output is a data frame with one row, summarizing in some way the input data frame.
Tiny |> summarize(mf = mean(flipper))
       mf
 --------
  205.125
  1. Output has a new column, calculated from one or more of the columns in the input.
Tiny |> mutate(root = sqrt(mass))
species      mass   flipper  sex       root
----------  -----  --------  -------  -----
Chinstrap    3950       201  male      62.8
Adelie       4400       196  male      66.3
Gentoo       5600       228  male      74.8
Gentoo       4700       219  female    68.6
Adelie       3500       189  female    59.2
Gentoo       5600       228  male      74.8
Adelie       3950       189  male      62.8
Chinstrap    3250       191  female    57.0
  ... and so on for 3 rows in total.
  1. The output has only those rows that meet a specified condition.
Tiny |> filter(species == "Gentoo")
species    mass   flipper  sex    
--------  -----  --------  -------
Gentoo     5600       228  male   
Gentoo     4700       219  female 
Gentoo     5600       228  male   
  1. The output has the rows re-arranged on the basis of a variable.
Tiny |> arrange(mass)
species      mass   flipper  sex    
----------  -----  --------  -------
Chinstrap    3250       191  female 
Adelie       3500       189  female 
Chinstrap    3950       201  male   
Adelie       3950       189  male   
Adelie       4400       196  male   
Gentoo       4700       219  female 
Gentoo       5600       228  male   
Gentoo       5600       228  male   

Like rock-and-roll cords, these wrangling functions can be put into a pipeline: a sequence of wrangling steps connected by the pipe symbol |>. Also like rock-and-role, it’s best to listen to some examples to get an idea of the range of overall effects you can create.

Wrangling “adverbs”

Before proceeding to examples of wrangling tasks and how they can be accomplished, let’s look carefully at the “grammar” involved in wrangling commands. In English, sentences have a subject and a verb, and often other components such as adverbs. We will use that analogy here.

By analogy, consider the short English sentence, “Jane runs.” Jane is the subject and “runs” is the verb. English sentences don’t produce an “output,” but try to imagine the situation as subject Jane being the input, “run” modifying Jane in a specific way, and a running version of Jane is the output. Extending your imagination further, think of “Jane runs” written in R as Jane |> runs().

In data-wrangling sentences, the wrangling verbs (that is, functions) almost always are modified by an adverb. As you know, an adverb is a word that modifies a verb. For example, “Jane runs swiftly,” or “Jane sometimes runs.” Again, return to your imagination where these two sentences, with adverbs, might be written in R as Jane |> runs(swiftly) and Jane |> runs(sometimes). In English you can have multiple adverbs, as in “Jane sometimes runs swiftly.” Imagine this in R as Jane |> runs(sometimes, swiftly).

  1. Every wrangling command starts with a subject, the data frame at the head of the pipeline. The above examples use Tiny as the subject.

  2. The subject is followed by a pipe symbol (|>) and then, always, by a wrangling verb. The chief characteristic of a wrangling verb is that it takes a data frame as input and produces a data frame as output.

  3. Almost always, the wrangling verb is modified by an adverb. The adverb goes in the position of the second argument and there may be a third argument or more.

There is only a small set of possible wrangling verbs: arrange(), group_by(), and so on. But there is an infinite variety of adverbs. Key to successful data wrangling is the choice of a verb and an adverb that accomplishes the task at hand. We turn now to the structure of proper wrangling adverbs.

First, a wrangling adverb must be a proper R expression, that is, obeying the rules of R grammar. Primarily this means that if a function is used in an adverb, it will be followed by parentheses.

Second, a wrangling adverb almost always includes the name of one or more variables that are in the data frame piped as input to the wrangling verb.

An error message is not a criticism of you or your work. Ideally, you should see error messages as hints about what’s not yet correct about your command. Reading the error message is essential to being able to make use of the hint. Admittedly, many R error messages are cryptic, but usually they will tell you where in the command lies the problem.

Third, the R expression forming an adverb will work only inside the parentheses of a wrangling verb. If you try to use an adverbial expression as a full, stand-alone R command you will usually get an error message.

Perhaps the simplest wrangling adverbs consist purely of the name of a variable in the data frame. Examples: select(species) or arrange(mass). The adverbs here are just species or mass while select() and arrange() are the wrangling verbs being modified by the adverbs.

Another very common adverb pattern is a function applied to a variable name. For example, summarize(mean(flipper)) or mutate(max(mass)). If the function is arithmetical, use arithmetic notation, e.g. mutate(flipper / mass).

The wrangling verbs summarize() and mutate() create columns. It’s nice if those columns have a simple name. You can set the name to be used by preceding the adverb by the name would want followed by an equal sign. Examples: summarize(mn = mean(flipper)) or mutate(ratio = flipper / mass).

Functions in adverbs

As mentioned previously, a wrangling adverb may be as simple as the name of a variable, or much more complicated, involving both variable names and functions. The English analogy to a complicated wrangling adverb is an adverbial phrase, as in, “Jane runs swiftly as the wind.”

It’s important to note that the functions used in a wrangling adverbial phrase are not themselves wrangling verbs. Wrangling verbs alway take a data frame as the first input. In contrast, functions in adverbial phrases usually take a variable or R expression (as in flipper / mass) as the first input.

Adverbial-phrase functions like mean() produce as output a single quantity. So mean(flipper) calculates a single number as output. We will call these “shrinkage” verbs, since they condense a set of values into a single value.

Other adverbial-phrase functions take a set of values and provide as output another set of values. Typical of these are the mathematical functions, for instance, sqrt() or division. You can use either type of adverbial-phrase function with most wrangling verbs, depending on what you are trying to accomplish. Within summarize() however, use only shrinkage verbs.

Groupwise calculations

Use .by=species rather THAN GROUP_BY()

Exercises

  1. Compute the mother’s weight gain during pregnancy as indicated by natality2014::Natality_2014_100k and graph it versus baby’s weight dbwt. Is there an obvious relationship between the two variables? CHANGE THIS TO Births2022 where the variables are weight_pre, weight_delivery, weight.

Explain why the values in the count variable of the output is different between these two similar-looking R commands:

Tiny |> summarize(count = n_distinct(species))
  count
 ------
      3
Tiny |> summarize(count = n_distinct(species), .by = species)
species      count
----------  ------
Chinstrap        1
Adelie           1
Gentoo           1

The n() adverbial wrangling shrinkage function counts the number of rows. It is unusual in that it doesn’t need any input; it is counting rows, not values in a variable.

  1. There is a simple relationship between the rows column in these two statements. What is that relationship?
Tiny |> summarize(rows = n())
  rows
 -----
     8
Tiny |> summarize(rows = n(), .by = species)
species      rows
----------  -----
Chinstrap       2
Adelie          3
Gentoo          3
  1. Replace Tiny with Big in your wrangling statements and explain if your answer to (1) still holds up.

Using the .by=variable argument in wrangling verbs causes the wrangling operation to be done separately for each of the groups of rows defined by the variable.

Here are two ways you might re-arrange the rows of Tiny:

  1. Tiny |> arrange(mass, .by=species)
  2. Tiny |> arrange(species, .by=mass)

Run both commands and observe how the outputs differ from one another. Give a brief explanation of what features of the Tiny data cause them to differ.

Suppose you are tasked to create a new data frame out of the dataframe Penguins that has a ratio column given flipper / mass for each penguin.

  1. Should you use mutate() or summarize() to create the new data frame? Answer: Use mutate(). You’re creating a new column with the same rows as in the original: one for each penguin.

  2. Do the wrangling to create ratio, then use tilde_graph() to graph ratio ~ species with a violin annotation.

Answer:
Penguins |>
  mutate(ratio = flipper / mass) |>
  tilde_graph(ratio ~ species, annot = "violin")

  1. Reading the graph you created in (b), say which species tends to have the lowest ratio of flipper length to mass. Answer: Although there is some overlap with the other two species, Gentoo penguins tend to have the lowest ratio.

  2. Form the wrangling command to calculate the mean and variance of ratio for each species of penguin.

Answer:
Penguins |>
  mutate(ratio = flipper / mass) |>
  group_by(species) |>
  summarize(mean(ratio), var(ratio))
species      mean(ratio)   var(ratio)
----------  ------------  -----------
Adelie         0.0519743     3.38e-05
Chinstrap      0.0528887     2.14e-05
Gentoo         0.0429841     1.20e-05

See cheetah-loose-saucer for a data cleaning example.


BEING HARVESTED, not yet part of the chapter.

What do the names n_distinct and species refer to. Just from the open parenthesis directly following n_distinct you can tell that it is a function. That doesn’t tell you what the function does; just that it’s a function. Similarly, you have to refer to context to know what species is about. If you remember from the first few examples in this Lesson—Tiny |> print() and Tiny |> names()—you’ll recognize species as the name of one of the variables in Tiny.

Two function families

To a very great extent, the functions you work with will belong to one of two families, the family being identified by the type of information object expected for the first argument (input). These are:

  • Functions whose first argument must be a data frame. Usually, a pipe |> will be used to insert the data frame into the function. Examples: tilde_graph() which you met in Lesson 2; summarize() and several other “data wrangling” functions you will meet in the next Lessons such as filter(), select(), and mutate().
  • Functions where the first argument is a variable from a data frame, or some arithmetic or similar combination of variables. Examples: n_distinct(), and mean(), as well as median(), var(), sd(), and a handful of other statistics-related functions.
Tip

As you learn the 20 or so functions we will need for these Lessons, a good habit is to note whether the function takes a data frame or a variable as the first argument (first input). There are a few functions that take other kinds of first arguments, but mostly you will work with functions whose first argument (first input) must be either a data frame or a variable.



Second arguments

WHY n_distinct() must be used within summarize(): you CAN’T USE an expresson like n_distinct(species), which takes a variable as input, on its own, outside of an appropriate function like summarize() that takes a data frame as its first argument. The data frame in the position of the first argument sets the context for the name of the variable.

ARITHMETIC ON VARIABLES, e.g. mean(cos(flipper)) or mean(flipper^2).

Functions whose first argument is a variable

START BY PUTTING “variable” in quotation marks to set the stage for later on, when you do arithmetic on variables.

Syntax with parens,

Example: head(), nrow(), names(). Find the mean flipper length with summarize(). Note that the output is a data frame.


Just a handful of wrangling operations are needed to accomplish a wide range of re-organizations of data frames. Here’s the list:

  • mutate(): create one or more new variables by combining already existing variables;
  • filter(): extract a subset of rows from a data frame that meet one or more criteria that you specify;
  • arrange(): sort the rows of a data frame according to a criterion you specify;
  • summarize(): create a new data frame with a single row summarizing, in whatever manner you wish, the possibly many rows in the original data frame.
  • group by(): instruct the computer to perform any of the previous operations on a group-by-group basis, using groups defined by a variable you create or choose;
  • select(): pull out a subset of columns (that is, variables) from a data frame.

Every one of these wrangling functions takes a data frame as the first argument and produces a (new) data frame as the output.

filter and summarize

mutate, arrange, group_by,

Shrinkage verbs

Much of the time when we use mutate(), we are generating a new variable out of those already in the data frame. All sorts of mathematical and character transformations are available. For instance, the KidsFeet data frame records the length and width of 39 third- and fourth-grade children. If we would like to work with the aspect ratio of the feet, which is length divided by width, mutate() will do the work for us.

We will use a short subsample of the KidsFeet data for the purposes of demonstration.
KidsFeet |> 
  mutate(aspect = length / width)

?(caption)

sex    length   width   aspect
----  -------  ------  -------
B        26.1     9.1     2.87
B        24.5     9.7     2.53
B        23.6     9.0     2.62
G        26.0     9.0     2.89
G        24.5     9.0     2.72

This Lesson introduces transformations of a different kind, some of which you are already familiar with. We will call these “shrinkage” transformations because, rather than dealing with the data frame rows one at a time, these transformations work on the rows collectively.

Perhaps the simplest shrinkage transformation is averaging. An average, of course, combines (shrinks) many numerical values to give a single representative one. Two examples of averages are the mean and median. When mutate() encounters a shrinkage transformation of this sort, it inserts the same value for all of the rows. You can think of mean or median as shrinking the range of values of its argument into a single number.

KidsFeet |> mutate(mean(length))

?(caption)

sex    length   width   mean(length)
----  -------  ------  -------------
B        26.1     9.1           24.9
B        24.5     9.7           24.9
B        23.6     9.0           24.9
G        26.0     9.0           24.9
G        24.5     9.0           24.9

Usually, we prefer to give column names that are short and have no special characters. To accomplish this, use named arguments to mutate(). The names are up to you. Here’s an example:

KidsFeet |> mutate(mlen = mean(length))

?(caption)

sex    length   mlen
----  -------  -----
B        26.1   24.9
B        24.5   24.9
B        23.6   24.9
G        26.0   24.9
G        24.5   24.9
Rank of a variable

Two interesting shrinkage transforms with important uses in statistics are rank() and percent_rank(). These tell where each row would stand if the values had been sorted in ascending order:

KidsFeet |> mutate(rank=rank(length), percentile=100*percent_rank(length))

?(caption)

sex    length   rank   percentile
----  -------  -----  -----------
B        26.1    5.0          100
B        24.5    2.5           25
B        23.6    1.0            0
G        26.0    4.0           75
G        24.5    2.5           25

Unlike mean(), the rank() transformations do not insert the same value in each row. Still, the output values depend collectively on the values in the input. This is completely characteristic of a shrinkage transformation.

Groupwise operations

The group_by() wrangling verb sets up mutate() to use shrinkage transformations separately for each group. For instance:

KidsFeet |> 
  group_by(sex) |>
  mutate(rank=rank(length))

?(caption)

sex    length   rank
----  -------  -----
B        26.1      3
B        24.5      2
B        23.6      1
G        26.0      2
G        24.5      1

Rank 1 appears twice in ?tbl-grouped-rank, once for the girls and once for the both. Within both groups, rank 1 is assigned to the row with the smallest length.

In Lesson 8 we will work extensively with groupwise means as a way of summarizing the similarities within a group and the differences between groups.