Constructing a data graphic transforms a data frame into a new representation—the graphic—which can be useful to the human viewer for identifying patterns, relationships, and trends that exist in the data but which are obscured by the tabular form of a data frame. We use the computer to generate data graphics because it is fast, reliable, and flexible.
We will do a lot of transforming of data frames into forms that are more useful for understanding and allow us to answer questions about the data. The generic word for transforming information into a different form is “computation.” Only in recent decades has computation involved the familiar electronic devices with screens and keyboards. For all of human history, we have transformed information. For instance, here are two pieces of information: The Babylonian farmer has 5 𒁀𒌷𒂵 (bushels) of barley with a market price of 2 dishekels per bushel. The buyer needs this information in another form: the total cost. The computation is simple: multiply the number of bushels by the price: \(2\ \text{dishekels per bushel} \times5 \ \text{bushels}\rightarrow 10 \ \text{dishekels}\).
The input(s) to the computation are the pieces of information. The output is another piece of information in a different form, say a graphic. The computation itself is one of a set of standard operations; for the Babylonian, the operation is called “multiplication.” There are, of course, other computations that have been given names, for instance “factoring into primes,” “sorting into alphabetical order,” or “square-root.” The human who wants the output of the computation has to apply the named computation to the input(s) and receive the output. And, of course, the human has to be familiar with the various possible transformations in order to select or construct the one that will fit the task at hand.
Instructors should note and point out to their students that our use of R will be limited to a very small part of the language. The origins of R go back to the 1980s, but the part we will use—an official part of the language—is only about a dozen years old. Characteristic of this modern component of the language is the pipe operator (written |>) which makes more readable the chains of commands often seen in data wrangling procedures.
The reader consulting the Internet or other resources will often encounter R statements written without pipes. These will look foreign, but the difference is superficial. The pipe style can be translated into the older, non-pipe style without loss of functionality but with (in the opinion of many programmers) considerable loss of readability. We’ll make a habit in the next few chapters of providing the old-style commands for those who are used to that style.
Computer languages, like the R language that we will use in these Lessons, are a means to specify the correct computation, apply the computation to the inputs, and store, display, or pass along the output as an input for further computation. In R and many other computer languages, the computations are called “functions,” and each function has a name. The R “sentences” in which functions are applied to objects are called “commands.” A part of a command, analogous to a phrase in a sentence, is called an “expression.”
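As a small illustration of these terms, the sketch below applies a few named computations to inputs. The functions sqrt() and sort() are standard parts of R; the fruit names are made up for the example, and the barley figures are the ones from the Babylonian farmer's sale.

```r
# Multiplication is written with the * symbol: 5 bushels at 2 dishekels each.
5 * 2

# sqrt() is R's name for the "square-root" computation; 16 is its input.
sqrt(16)

# sort() puts words into alphabetical order.
# c() is a helper that collects several values into a single input.
sort(c("pear", "apple", "fig"))
```

Each line that is not a comment is a complete command. Within the last command, c("pear", "apple", "fig") is an expression: a part of the command that is itself a computation.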
A simple example involving an aptly named data frame, Tiny, can get us started. This uses the print() function whose job is to render its input into a form suitable for immediate display, that is, to let you “see” the contents of an information object.
Tiny records data about penguins of a few different species. There are only eight specimens in Tiny, which makes it easy to see the whole data frame but hard to draw any conclusions about real penguins. Tiny is drawn from Big, which records the measurements on hundreds of penguins.
Tiny |> print()
species mass flipper sex
1 Chinstrap 3950 201 male
2 Adelie 4400 196 male
3 Gentoo 5600 228 male
4 Gentoo 4700 219 female
5 Adelie 3500 189 female
6 Gentoo 5600 228 male
7 Adelie 3950 189 male
8 Chinstrap 3250 191 female
There are two names used in the above command: Tiny and print. In R, names always refer to an information object. Just as in the real world, the objects come in a variety of types. For instance, “toaster” is the name of a common kitchen appliance while “Lusitania” is the name of a ship. For people who do not speak English, “toaster” is just a word; they won’t know what kind of object it refers to. And for those unfamiliar with the history of World War I, “Lusitania” is without meaning.
In English, words and names have grammatical conventions. For instance, in a word like “toaster” we use quotation marks to signify that it is the word we are talking about rather than a particular object. Proper names in English, like “Danny,” are written with initial capitalization. In contrast, names of information objects in R are written without quotation marks and capitalization, if any, is taken literally. The name Tiny has an initial capitalization and tiny or tiNY—legitimate names both—don’t refer to the data object named Tiny.
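A quick way to convince yourself that capitalization matters is to store two different objects under names that differ only in case. This is a minimal sketch; the names Danny and danny and the values stored under them are hypothetical, chosen only for illustration.

```r
# Store two different objects under names that differ only in capitalization.
# (The <- storage syntax is explained later in this Lesson.)
Danny <- 1
danny <- 2

# R treats the two names as unrelated, so the objects they refer to differ.
identical(Danny, danny)   # FALSE
```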
Almost always, you will have to deduce the type of an information object by context. (For instance, we said earlier that Tiny refers to a data frame and print names a function.) To make things easier, functions named in the text of these Lessons will be followed by parentheses. For instance, the name in the previous parenthetical note will be written print() just as a reminder of what kind of information object is being named.
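When context is not enough, R itself can report what kind of object a name refers to. The sketch below assumes a hand-typed stand-in for Tiny, copied from the printout above; is.data.frame() and is.function() are standard R functions that are not otherwise used in these Lessons.

```r
# A stand-in for the book's Tiny, typed in by hand from the printed rows.
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

is.data.frame(Tiny)   # TRUE: Tiny names a data frame
is.function(print)    # TRUE: print names a function
is.function(Tiny)     # FALSE: a data frame is not a function
```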
In addition to names, an R command usually has punctuation. For instance, the command Tiny |> print() has two bits of punctuation:
() signifies that we are not merely naming a function, but putting it into action. By “putting into action” we mean “applying the function to inputs.” Naturally, you will need to know where the inputs are coming from. In Tiny |> print(), the input to print() is coming from a “pipe.”
|> as punctuation signifies a “pipe.” The point of a pipe is to hand off the object on the left (namely Tiny in this example) as an input to the function.
The word “input” is very general and we will need it to describe other things later in these Lessons. For that reason, we will use another word—a technical word—in place of “input” when talking about R functions. That word is “argument.” Of course, “argument” means several things in English, for instance a dispute or a set of reasons. When talking about functions, we are using the mathematical/computing sense of the word: an information object (such as a data frame) passed into a function so that the function can do its work on the object.
For the next few Lessons, we’ll write “input (argument)” and “argument (input)” just to remind you what “argument” stands for. With a little practice, you’ll remember that “argument” refers to what gets passed into a function.
In these Lessons, a typical command will start with the name of a data frame, followed by pipe punctuation (that is, |>), followed in turn by the name of a function.
Commands similar to Tiny |> print() are these, which report different aspects of the data-frame argument to the function:
Tiny |> nrow()
[1] 8
Tiny |> ncol()
[1] 4
Tiny |> names()
[1] "species" "mass" "flipper" "sex"
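For readers used to the older, non-pipe style, each of these commands can be rewritten by moving the piped object inside the parentheses. The sketch below checks that the two styles give the same output; it assumes a hand-typed stand-in for Tiny copied from the printout above, and uses the standard function identical() to compare outputs.

```r
# A stand-in for the book's Tiny, typed in by hand from the printed rows.
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

# Pipe style on the left, older style on the right.
# identical() reports TRUE when two outputs are exactly the same.
identical(Tiny |> nrow(),  nrow(Tiny))
identical(Tiny |> ncol(),  ncol(Tiny))
identical(Tiny |> names(), names(Tiny))
```

All three comparisons report TRUE: the pipe only changes where the first argument is written, not what the function does with it.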
Much of the time, the functions you will use will get the first input from a pipe but will require additional inputs to specify the details of the operation being requested. You have seen this already with tilde_graph(), for instance:
Tiny |> tilde_graph(flipper ~ mass + sex)
Figure 3.1: tilde_graph() takes two arguments: the first is a data frame, the second is a tilde expression specifying the desired organization of the graphic in terms of variables. The first argument (Tiny) is being piped into tilde_graph(), the second argument goes in the parentheses that follow tilde_graph(). There are only eight specimens in Tiny. Try the command but using Big instead of Tiny to see a more compelling graph.
In the above command, the first argument to tilde_graph() is being piped in. The second argument, contained in the parentheses following the name tilde_graph, specifies details about the operation, in this case which variables to display on the vertical and horizontal axes and which variable to represent with color.
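The tilde expression is itself an information object, and it can be examined apart from any graphics function. The sketch below uses only standard R; the name spec is hypothetical, and class() and all.vars() are standard functions not otherwise used in these Lessons. It shows what the expression contains, not how tilde_graph() draws it.

```r
# Store the tilde expression under a (hypothetical) name.
# (The <- storage syntax is explained later in this Lesson.)
spec <- flipper ~ mass + sex

# R calls this kind of object a "formula".
class(spec)

# The variable names it mentions, starting with the one left of the tilde.
all.vars(spec)
```

Notice that the expression can be written down without any data frame in sight: it only names variables. The variables are looked up later, in whatever data frame is given as the first argument.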
Tiny |> summarize(n_distinct(species))
n_distinct(species)
1 3
Another, similar example:
Tiny |> summarize(mean(flipper))
mean(flipper)
1 205.125
We will come back to the functions used in the above examples—summarize(), n_distinct(), and mean()—in Lesson sec-wrangling, although the names are so suggestive that you may already be able to intuit what the commands are doing. For now, however, let’s look at the structure of the commands, just as we might look at the structure of an English sentence in terms of verbs and punctuation.
At the heart of the commands, right after the |> pipe, is a function named summarize. (Ordinarily, we will write function names followed by () as a hint that the name refers to a function. We are breaking that convention here since the literal name of a function does not include the parentheses.) There are two clues that summarize names a function:
The name is immediately followed by an open parenthesis: (. This is always a definitive sign that a name refers to a function.
summarize comes immediately after the pipe symbol. The pipe symbol must always be followed by a function.
You can also see that summarize() in each command is being given a second argument. (The first argument is being piped in; the second is contained in the parentheses following summarize.) The second argument (n_distinct(species) and mean(flipper), respectively) has a structure of its own. Noting the parentheses following both n_distinct and mean, we can deduce that these are also functions. And, as is usually the case with functions, they are being passed an argument of their own: species and flipper, respectively.
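The two-argument structure is perhaps easiest to see when the same command is written in the older, non-pipe style, where both arguments sit inside the parentheses. This is a sketch under two assumptions: that summarize() is the function of that name from the dplyr package, and that a hand-typed stand-in for Tiny is an acceptable substitute for the book's data frame.

```r
library(dplyr)  # assumption: summarize() comes from dplyr

# A stand-in for the book's Tiny, typed in by hand from the printed rows.
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

# Pipe style: the first argument (Tiny) arrives through the pipe.
piped <- Tiny |> summarize(mean(flipper))

# Older style: both arguments are written inside the parentheses.
unpiped <- summarize(Tiny, mean(flipper))

# The two styles produce the same one-row data frame.
all.equal(piped, unpiped)
```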
The output from a function
Earlier, we defined a computation as an operation that transforms inputs (arguments) into an output. Now it’s time to talk about what happens to the output of a computation. We will consider three things to do with an output: printing it, storing it, and using it as an input to another computation.
Printing
The most common thing that a newcomer does with the output from a function is to look at it on the computer screen. This is called, not surprisingly, “printing the output.” Example:
Tiny |> nrow()
[1] 8
The nrow() function counts the number of rows in its argument and returns that number as the output. Since nothing else has been specified about what to do with the output, R prints it to the screen.
Earlier, we accomplished such printing with this command:
Tiny |> print()
You won’t need to print() outputs, since this happens automatically if you don’t specify anything else to happen. Idiomatically, you need only type the name of an information object to see it printed, like this:
Tiny
species mass flipper sex
---------- ----- -------- -------
Chinstrap 3950 201 male
Adelie 4400 196 male
Gentoo 5600 228 male
Gentoo 4700 219 female
Adelie 3500 189 female
Gentoo 5600 228 male
Adelie 3950 189 male
Chinstrap 3250 191 female
Storing
Occasionally, you will want to store the created output to use it later on. The storing process has its own unique syntax, which is easy to spot since it involves a symbol, <-, that is not used for any other purpose in R. As an example, we will store the output from Tiny |> summarize(mean(flipper)) under the name My_summary:
My_summary <- Tiny |> summarize(mean(flipper))
Always, there is an expression to the right of <-. Here, that expression is the computation we used earlier: Tiny |> summarize(mean(flipper)). Similarly, there is always an expression to the left of <-. In these Lessons, the left-hand expression will always be a name, the name under which we want to store the output of the right-hand expression.
In English, such a storage command is referred to with the noun “assignment” or the verb “assign,” as in, “We’re assigning the output to My_summary.”
Experience shows that it is easier to read a command like the above if it is broken up like the lines of a poem. You will see in these Lessons so many examples of command line-breaking that you will soon pick up the style. For now, an example will suffice to draw the practice to your attention.
My_summary <- Tiny |>
  summarize(mean(flipper))
Line breaks in such commands come after the pipe symbol.
Using the output as an input
Continuing a pipeline.
Using a sub-expression inside the parentheses.
Using a previously assigned name.
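The three approaches can be sketched with functions already seen in this Lesson. The example below is illustrative only: it assumes a hand-typed stand-in for Tiny, the name Var_names is hypothetical, and length() is a standard R function (not otherwise used here) that counts the items in its argument.

```r
# A stand-in for the book's Tiny, typed in by hand from the printed rows.
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

# 1. Continuing a pipeline: the output of names() is piped on to length().
Tiny |> names() |> length()

# 2. Using a sub-expression inside the parentheses.
length(names(Tiny))

# 3. Using a previously assigned name.
Var_names <- Tiny |> names()
Var_names |> length()
```

Each of the three commands prints 4, the number of variable names in Tiny. Which style to use is a matter of readability: pipelines read left to right, sub-expressions read from the inside out, and assigned names let you reuse an output in several later commands.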
Names
Strictly speaking, a name in R can be any character or sequence of characters at all, even Babylonian cuneiform marks. The price you pay for such flexibility is the need to use back-quotes around the name.
`𒁀𒌷𒂵` <- Tiny |> summarize(mean(flipper))
`𒁀𒌷𒂵`
mean(flipper)
1 205.125
The back-quoting is awkward and tedious. There is a loophole, however. You can avoid backquoting any name that consists only of alphanumeric characters, the period, and the underscore _.
Additional restrictions: an unbackquoted name cannot start with a numeral or an underscore.
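One way to check whether a candidate name falls within the loophole is to ask R to repair it and see whether anything changed. The sketch below relies on the standard function make.names(), which is not otherwise used in these Lessons; the candidate names are made up for illustration. One caveat: make.names() also alters R's reserved words (such as if), which this simple check would therefore report as not usable.

```r
# Hypothetical candidate names, some inside the loophole and some not.
candidates <- c("My_summary", "my.summary2", "2nd_summary",
                "_summary", "mean(flipper)")

# make.names() rewrites each string into a name that needs no back-quotes.
# A candidate that comes back unchanged was already usable as it stands.
usable <- make.names(candidates) == candidates

data.frame(candidates, usable)
```

The first two candidates are reported as usable. The third starts with a numeral, the fourth starts with an underscore, and the fifth contains parentheses, so each of those would need back-quotes.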
The data frame that we have just assigned to the name 𒁀𒌷𒂵 has a variable named mean(flipper). This is certainly a descriptive name, but it will be hard to use later on since it will require back-quoting. You can avoid this by telling summarize() that you want the result to be named in accordance with the loophole, like this:
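The naming is done by writing name = in front of the computation inside summarize(). The first command below is the one quoted in Exercise 4.2; the second applies the same idea to the flipper summary. This is a sketch under assumptions: summarize() and n_distinct() are taken to be the dplyr functions of those names, Tiny is a hand-typed stand-in, and the name mean_flipper is hypothetical.

```r
library(dplyr)  # assumption: summarize() and n_distinct() come from dplyr

# A stand-in for the book's Tiny, typed in by hand from the printed rows.
Tiny <- data.frame(
  species = c("Chinstrap", "Adelie", "Gentoo", "Gentoo",
              "Adelie", "Gentoo", "Adelie", "Chinstrap"),
  mass    = c(3950, 4400, 5600, 4700, 3500, 5600, 3950, 3250),
  flipper = c(201, 196, 228, 219, 189, 228, 189, 191),
  sex     = c("male", "male", "male", "female",
              "female", "male", "male", "female")
)

# The result is stored in a variable named ns rather than n_distinct(species).
Species_count <- Tiny |> summarize(ns = n_distinct(species))
Species_count

# The same idea for the mean flipper length; mean_flipper needs no back-quotes.
Flipper_summary <- Tiny |> summarize(mean_flipper = mean(flipper))
Flipper_summary
```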
Exercise 4.1
Which of these is not a valid expression for handing a data frame named Big as an argument to the head() function?
Big |> head()
head(Big)
head() <| Big
Big -> head()
Hint: Try them out and see what happens.
Exercise 4.2
In the text, an example command was given:
Tiny |> summarize(ns = n_distinct(species))
ns
---
3
Tiny is a data frame and summarize() is a function whose first argument (the slot filled by the pipe) must always be a data frame.
What kind of information object is species? ## A variable
What kind of information object is n_distinct()? ## A function that takes as input a variable.
What would be the output of the command if you replaced species by sex? ## the number of distinct sexes represented in the data frame.
What would be the output of the command if you replaced species by flipper? ## The number of distinct flipper lengths. In the small sample shown in the text, the lengths 228 and 189 each appear twice, so the eight rows contain only 6 distinct flipper lengths.
Big is a superset from which the 8 rows in Tiny were selected.
What will be the output of Tiny |> summarize(ns = n_distinct(species)) if you replace Tiny with Big in the command? ## Still 3. Evidently, all of the species in Big appear in one row or another in Tiny.
Using Big as the input to summarize() and flipper as the variable given as an argument to the function n_distinct(), what will be the result of the computation? ## 56.
Are there any repeats in the flipper lengths recorded in Big? (Hint: Take the answer in ii and compare it to the output of Big |> nrow().)
Exercise 4.3
None of the following are complete commands, that is, each of them will lead to an error message rather than an output.
ns = n_distinct(species)
summarize()
Tiny |> summarize()
Tiny |> summarize(species)
For each, give a brief explanation of what’s missing or why the expressions listed can’t work.
Exercise 4.4
What kind of first argument—data frame or variable—does each of these functions require?