The data that comes to us is often not organized in a way that suits our purposes. This Lesson (and the next) is about basic computer techniques for re-organizing data frames. There is a huge benefit to sticking to the conventions of data frames: the hard work of programming has already been done for us by experts. Instead of loops and arrays (the kinds of things you learn in an introductory computer programming course), there is a small set of basic operations that are easily invoked to re-organize data. Each of these operations (except, perhaps, the one covered in Lesson 6) can be understood intuitively.
We can’t anticipate every way that you might find it useful to re-organize data, simply because we don’t know what your purposes will be in any given project. Even so, the experience of millions of data professionals around the world gives us confidence that whatever you seek to accomplish can be done by stringing together “data wrangling operations” in an appropriate order. A relevant analogy is writing and playing the harmony of rock music on a guitar. The guitarist has to learn a few chords. The music is composed or played by organizing a sequence of these chords and short breaks, customizing each one for duration and loudness. A new and unique piece of music can be composed by the careful combination of chords and their sequence.
When data wrangling, you will be working with a small set of functions that always take a data frame as input, and always produce a data frame as output. At the core of that small set are five wrangling functions that you will be using extensively: select(), summarize(), mutate(), filter(), and arrange(). To illustrate:
Tiny |> select(mass, species)
mass species
----- ----------
3950 Chinstrap
4400 Adelie
5600 Gentoo
4700 Gentoo
3500 Adelie
5600 Gentoo
3950 Adelie
3250 Chinstrap
Tiny |> summarize(mf = mean(flipper))
mf
--------
205.125
Tiny |> mutate(root = sqrt(mass))
species mass flipper sex root
---------- ----- -------- ------- -----
Chinstrap 3950 201 male 62.8
Adelie 4400 196 male 66.3
Gentoo 5600 228 male 74.8
Gentoo 4700 219 female 68.6
Adelie 3500 189 female 59.2
Gentoo 5600 228 male 74.8
Adelie 3950 189 male 62.8
Chinstrap 3250 191 female 57.0
Tiny |> filter(species == "Gentoo")
species mass flipper sex
-------- ----- -------- -------
Gentoo 5600 228 male
Gentoo 4700 219 female
Gentoo 5600 228 male
Tiny |> arrange(mass)
species mass flipper sex
---------- ----- -------- -------
Chinstrap 3250 191 female
Adelie 3500 189 female
Chinstrap 3950 201 male
Adelie 3950 189 male
Adelie 4400 196 male
Gentoo 4700 219 female
Gentoo 5600 228 male
Gentoo 5600 228 male
Like rock-and-roll chords, these wrangling functions can be put into a pipeline: a sequence of wrangling steps connected by the pipe symbol |>. Also like rock-and-roll, it’s best to listen to some examples to get an idea of the range of overall effects you can create.
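To make the idea of a pipeline concrete, here is a minimal sketch that strings four of the wrangling functions together. It assumes the dplyr package and R 4.1 or later (for the |> pipe), and it re-types Tiny from the eight rows printed above; the name male_ratios is ours, chosen only for illustration.

```r
library(dplyr)

# Tiny, re-typed from the rows printed in this Lesson (an assumption:
# the book's own Tiny may be constructed differently).
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

# Each step takes a data frame as input and hands a data frame to the next.
male_ratios <- Tiny |>
  filter(sex == "male") |>            # keep only the rows for males
  mutate(ratio = flipper / mass) |>   # add a new column
  select(species, ratio) |>           # keep just two columns
  arrange(ratio)                      # sort rows, smallest ratio first

male_ratios
```

Changing the order of the steps can change the result. For instance, select() placed before filter() would discard sex, and the filter step would then fail.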
Before proceeding to examples of wrangling tasks and how they can be accomplished, let’s look carefully at the “grammar” involved in wrangling commands. In English, sentences have a subject and a verb, and often other components such as adverbs. We will use that analogy here.
Consider the short English sentence, “Jane runs.” “Jane” is the subject and “runs” is the verb. English sentences don’t produce an “output,” but try to imagine the situation as the subject Jane being the input, “runs” modifying Jane in a specific way, and a running version of Jane being the output. Extending your imagination further, think of “Jane runs” written in R as Jane |> runs().
In data-wrangling sentences, the wrangling verbs (that is, functions) almost always are modified by an adverb. As you know, an adverb is a word that modifies a verb. For example, “Jane runs swiftly,” or “Jane sometimes runs.” Again, return to your imagination where these two sentences, with adverbs, might be written in R as Jane |> runs(swiftly) and Jane |> runs(sometimes). In English you can have multiple adverbs, as in “Jane sometimes runs swiftly.” Imagine this in R as Jane |> runs(sometimes, swiftly).
Every wrangling command starts with a subject, the data frame at the head of the pipeline. The above examples use Tiny as the subject.
The subject is followed by a pipe symbol (|>) and then, always, by a wrangling verb. The chief characteristic of a wrangling verb is that it takes a data frame as input and produces a data frame as output.
Almost always, the wrangling verb is modified by an adverb. The adverb goes in the position of the second argument and there may be a third argument or more.
There is only a small set of possible wrangling verbs: arrange(), group_by(), and so on. But there is an infinite variety of adverbs. Key to successful data wrangling is the choice of a verb and an adverb that accomplishes the task at hand. We turn now to the structure of proper wrangling adverbs.
First, a wrangling adverb must be a proper R expression, that is, obeying the rules of R grammar. Primarily this means that if a function is used in an adverb, it will be followed by parentheses.
Second, a wrangling adverb almost always includes the name of one or more variables that are in the data frame piped as input to the wrangling verb.
Third, the R expression forming an adverb will work only inside the parentheses of a wrangling verb. If you try to use an adverbial expression as a full, stand-alone R command you will usually get an error message.
An error message is not a criticism of you or your work. Ideally, you should see error messages as hints about what’s not yet correct about your command. Reading the error message is essential to being able to make use of the hint. Admittedly, many R error messages are cryptic, but usually they will tell you where in the command the problem lies.
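The point about stand-alone adverbs can be tried out directly. The sketch below assumes dplyr, re-types Tiny from the rows printed earlier, and assumes there is no object named flipper in your R session. It captures the error message with tryCatch() so that the script keeps running; the names standalone and inside are ours.

```r
library(dplyr)

# Tiny, re-typed from the rows printed in this Lesson (an assumption).
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191)
)

# As a stand-alone command, R has no idea where to look for `flipper`,
# so the expression fails. We keep the error message as a character string.
standalone <- tryCatch(
  mean(flipper),
  error = function(e) conditionMessage(e)
)
standalone

# Inside a wrangling verb, the data frame supplies the context for `flipper`.
inside <- Tiny |> summarize(mf = mean(flipper))
inside
```

The error message names the object that R could not find, which is exactly the kind of hint worth reading.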
Perhaps the simplest wrangling adverbs consist purely of the name of a variable in the data frame. Examples: select(species) or arrange(mass). The adverbs here are just species or mass, while select() and arrange() are the wrangling verbs being modified by the adverbs.
Another very common adverb pattern is a function applied to a variable name. For example, summarize(mean(flipper)) or mutate(max(mass)). If the function is arithmetical, use arithmetic notation, e.g. mutate(flipper / mass).
The wrangling verbs summarize() and mutate() create columns. It’s nice if those columns have a simple name. You can set the name by preceding the adverb with the name you want, followed by an equal sign. Examples: summarize(mn = mean(flipper)) or mutate(ratio = flipper / mass).
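A short sketch of the difference that naming makes, assuming dplyr and a re-typed copy of the flipper column of Tiny. The object names unnamed and named are ours, for illustration only.

```r
library(dplyr)

# Only the flipper column of Tiny is needed here (re-typed; an assumption).
Tiny <- data.frame(
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191)
)

# Without a name, the new column is labelled with the text of the adverb.
unnamed <- Tiny |> summarize(mean(flipper))
names(unnamed)

# With `mn =`, the column gets the short name we asked for.
named <- Tiny |> summarize(mn = mean(flipper))
names(named)
```

A short name without parentheses is much easier to refer to in later steps of a pipeline.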
As mentioned previously, a wrangling adverb may be as simple as the name of a variable, or much more complicated, involving both variable names and functions. The English analogy to a complicated wrangling adverb is an adverbial phrase, as in, “Jane runs swiftly as the wind.”
It’s important to note that the functions used in a wrangling adverbial phrase are not themselves wrangling verbs. Wrangling verbs always take a data frame as the first input. In contrast, functions in adverbial phrases usually take a variable or R expression (as in flipper / mass) as the first input.
Adverbial-phrase functions like mean() produce as output a single quantity. So mean(flipper) calculates a single number as output. We will call these “shrinkage” functions, since they condense a set of values into a single value.
Other adverbial-phrase functions take a set of values and provide as output another set of values. Typical of these are the mathematical functions, for instance, sqrt() or division. You can use either type of adverbial-phrase function with most wrangling verbs, depending on what you are trying to accomplish. Within summarize(), however, use only shrinkage functions.
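The contrast between the two kinds of adverbial-phrase functions shows up clearly when both are used in the same mutate(). This is a sketch assuming dplyr and a re-typed copy of the mass column of Tiny; the names both and one are ours.

```r
library(dplyr)

# Only the mass column of Tiny is needed here (re-typed; an assumption).
Tiny <- data.frame(
  mass = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250)
)

# In mutate(), mean() yields one number that is repeated on every row,
# while sqrt() yields a separate value for each row.
both <- Tiny |> mutate(avg = mean(mass), root = sqrt(mass))
both

# In summarize(), the single number from mean() becomes a one-row data frame.
one <- Tiny |> summarize(avg = mean(mass))
one
```

The avg column of both holds the same value in all eight rows, and that value matches the single entry of one.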
Use .by = species rather than group_by().
natality2014::Natality_2014_100k and graph it versus baby’s weight dbwt. Is there an obvious relationship between the two variables? (Change this to Births2022, where the variables are weight_pre, weight_delivery, and weight.)
Explain why the values in the count variable of the output are different between these two similar-looking R commands:
Tiny |> summarize(count = n_distinct(species))
count
------
3
Tiny |> summarize(count = n_distinct(species), .by = species)
species count
---------- ------
Chinstrap 1
Adelie 1
Gentoo 1
The n() adverbial wrangling shrinkage function counts the number of rows. It is unusual in that it doesn’t need any input; it is counting rows, not values in a variable.
There is a relationship between the values in the rows column in these two statements. What is that relationship?
Tiny |> summarize(rows = n())
rows
-----
8
Tiny |> summarize(rows = n(), .by = species)
species rows
---------- -----
Chinstrap 2
Adelie 3
Gentoo 3
Replace Tiny with Big in your wrangling statements and explain whether your answer to (1) still holds up.
Using the .by= variable argument in wrangling verbs causes the wrangling operation to be done separately for each of the groups of rows defined by the variable.
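Here is a minimal sketch of group-wise summarizing done two ways: with the .by= argument and with a separate group_by() step. It assumes dplyr version 1.1.0 or later (where .by= was introduced) and a re-typed copy of two columns of Tiny; the names by_arg and grouped are ours.

```r
library(dplyr)

# Two columns of Tiny, re-typed from the rows printed in this Lesson
# (an assumption).
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250)
)

# Grouping declared inside the verb: one output row per species,
# listed in the order the species first appear in Tiny.
by_arg <- Tiny |> summarize(mean_mass = mean(mass), .by = species)
by_arg

# Grouping declared as a separate step: same means,
# but the groups come out sorted alphabetically.
grouped <- Tiny |>
  group_by(species) |>
  summarize(mean_mass = mean(mass))
grouped
```

Both forms compute the same group means. They differ in the order of the output rows, and .by= applies only to the one verb in which it appears.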
Here are two ways you might re-arrange the rows of Tiny:
Tiny |> arrange(mass, .by=species)
Tiny |> arrange(species, .by=mass)
Run both commands and observe how the outputs differ from one another. Give a brief explanation of what features of the Tiny data cause them to differ.
Suppose you are tasked with creating a new data frame from the data frame Penguins that has a ratio column, given by flipper / mass, for each penguin.
Should you use mutate() or summarize() to create the new data frame? Answer: Use mutate(). You’re creating a new column with the same rows as in the original: one for each penguin.
Do the wrangling to create ratio, then use tilde_graph() to graph ratio ~ species with a violin annotation.
Penguins |>
  mutate(ratio = flipper / mass) |>
  tilde_graph(ratio ~ species, annot = "violin")
Reading the graph you created in (b), say which species tends to have the lowest ratio of flipper length to mass. Answer: Although there is some overlap with the other two species, Gentoo penguins tend to have the lowest ratio.
Form the wrangling command to calculate the mean and variance of ratio for each species of penguin.
Penguins |>
  mutate(ratio = flipper / mass) |>
  group_by(species) |>
  summarize(mean(ratio), var(ratio))
species mean(ratio) var(ratio)
---------- ------------ -----------
Adelie 0.0519743 3.38e-05
Chinstrap 0.0528887 2.14e-05
Gentoo 0.0429841 1.20e-05
See cheetah-loose-saucer for a data cleaning example.
BEING HARVESTED, not yet part of the chapter.
What do the names n_distinct and species refer to? Just from the open parenthesis directly following n_distinct you can tell that it is a function. That doesn’t tell you what the function does, just that it’s a function. Similarly, you have to refer to context to know what species is about. If you remember the first few examples in this Lesson, Tiny |> print() and Tiny |> names(), you’ll recognize species as the name of one of the variables in Tiny.
To a very great extent, the functions you work with will belong to one of two families, the family being identified by the type of information object expected for the first argument (input). These are:
Functions whose first argument is a data frame. The pipe |> will be used to insert the data frame into the function. Examples: tilde_graph(), which you met in Lesson 2; summarize(); and several other “data wrangling” functions you will meet in the next Lessons, such as filter(), select(), and mutate().
Functions whose first argument is a variable. Examples: n_distinct() and mean(), as well as median(), var(), sd(), and a handful of other statistics-related functions.
As you learn the 20 or so functions we will need for these Lessons, a good habit is to note whether the function takes a data frame or a variable as the first argument (first input). There are a few functions that take other kinds of first arguments, but mostly you will work with functions whose first argument (first input) must be either a data frame or a variable.
Why n_distinct() must be used within summarize(): you can’t use an expression like n_distinct(species), which takes a variable as input, on its own, outside of an appropriate function like summarize() that takes a data frame as its first argument. The data frame in the position of the first argument sets the context for the name of the variable.
Arithmetic on variables, e.g. mean(cos(flipper)) or mean(flipper^2).
START BY PUTTING “variable” in quotation marks to set the stage for later on, when you do arithmetic on variables.
Syntax with parens. Examples: head(), nrow(), names(). Find the mean flipper length with summarize(). Note that the output is a data frame.
Just a handful of wrangling operations are needed to accomplish a wide range of re-organizations of data frames. Here’s the list:
Every one of these wrangling functions takes a data frame as the first argument and produces a (new) data frame as the output.
Much of the time when we use mutate(), we are generating a new variable out of those already in the data frame. All sorts of mathematical and character transformations are available. For instance, the KidsFeet data frame records the foot length and width of 39 third- and fourth-grade children. If we would like to work with the aspect ratio of the feet, which is length divided by width, mutate() will do the work for us.
We use only a few rows of the KidsFeet data for the purposes of demonstration.
KidsFeet |>
  mutate(aspect = length / width)
sex length width aspect
---- ------- ------ -------
B 26.1 9.1 2.87
B 24.5 9.7 2.53
B 23.6 9.0 2.62
G 26.0 9.0 2.89
G 24.5 9.0 2.72
This Lesson introduces transformations of a different kind, some of which you are already familiar with. We will call these “shrinkage” transformations because, rather than dealing with the data frame rows one at a time, these transformations work on the rows collectively.
Perhaps the simplest shrinkage transformation is averaging. An average, of course, combines (shrinks) many numerical values to give a single representative one. Two examples of averages are the mean and median. When mutate() encounters a shrinkage transformation of this sort, it inserts the same value for all of the rows. You can think of mean or median as shrinking the range of values of its argument into a single number.
KidsFeet |> mutate(mean(length))
sex length width mean(length)
---- ------- ------ -------------
B 26.1 9.1 24.9
B 24.5 9.7 24.9
B 23.6 9.0 24.9
G 26.0 9.0 24.9
G 24.5 9.0 24.9
Usually, we prefer to give column names that are short and have no special characters. To accomplish this, use named arguments to mutate(). The names are up to you. Here’s an example:
KidsFeet |> mutate(mlen = mean(length))
sex length mlen
---- ------- -----
B 26.1 24.9
B 24.5 24.9
B 23.6 24.9
G 26.0 24.9
G 24.5 24.9
Two interesting shrinkage transforms with important uses in statistics are rank() and percent_rank(). These tell where each row would stand if the values had been sorted in ascending order:
KidsFeet |> mutate(rank = rank(length), percentile = 100 * percent_rank(length))
sex length rank percentile
---- ------- ----- -----------
B 26.1 5.0 100
B 24.5 2.5 25
B 23.6 1.0 0
G 26.0 4.0 75
G 24.5 2.5 25
Unlike mean(), the rank() transformations do not insert the same value in each row. Still, the output values depend collectively on the values in the input. This is completely characteristic of a shrinkage transformation.
The group_by() wrangling verb sets up mutate() to use shrinkage transformations separately for each group. For instance:
KidsFeet |>
  group_by(sex) |>
  mutate(rank = rank(length))
sex length rank
---- ------- -----
B 26.1 3
B 24.5 2
B 23.6 1
G 26.0 2
G 24.5 1
Rank 1 appears twice in the table above, once for the girls and once for the boys. Within both groups, rank 1 is assigned to the row with the smallest length.
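The same group-wise ranking can also be written with the .by= argument in place of a separate group_by() step. This is a sketch rather than the form used above: it assumes dplyr version 1.1.0 or later, and the data frame Feet is our own re-typing of the five demonstration rows, not the full KidsFeet data.

```r
library(dplyr)

# The five demonstration rows, re-typed (Feet is a hypothetical name).
Feet <- data.frame(
  sex    = c("B", "B", "B", "G", "G"),
  length = c(26.1, 24.5, 23.6, 26.0, 24.5)
)

# rank() is applied separately within each value of sex;
# the rows stay in their original order.
ranked <- Feet |> mutate(rank = rank(length), .by = sex)
ranked
```

Because the grouping lives inside the mutate() call, the output is an ordinary ungrouped data frame, so later steps in a pipeline are not affected by the grouping.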
In Lesson 8 we will work extensively with groupwise means as a way of summarizing the similarities within a group and the differences between groups.