6  Computing with functions and arguments

You have already seen the R command patterns that we will use throughout these Lessons: pipelines composed of actions separated by |>, names for data frames, functions and their arguments. This Lesson recapitulates and explains those patterns, the better to help you construct your own commands. The Lesson also emphasizes a small technical vocabulary that helps in communicating with other people to share insights and identify errors.

Learning command patterns is a powerful enabler in statistical thinking. The computer is an indispensable tool for statistical thinking. In addition to doing the work of calculations, computer software is a medium for describing and sharing with others statistical techniques and concepts. For instance, three fundamental operations in statistics—randomize, repeat, collect—are easily expressed in computer languages but have no counterpart in the algebraic notation traditionally uused in mathematics education.

Chain of operations

A typical computing task consists of a chain of operations. For instance, each wrangling operation receives a data-frame object from an incoming pipe |>, operates on that data frame to perform the action described by the function’s name and arguments, and produces another data frame as output. Depending on the overall task, the output from the operation may be piped into another action or displayed on the screen.

There are also operations, like point_plot(), that translate a data frame into another kind of object: a graphic. Starting with Lesson sec-regression, we will work with model_train(), a function that translates a data frame into a model. In this Lesson, we will meet two other ways of dealing with the output of a chain of operations: storing the output under a name for later use and formatting a data frame into a table suited to human readers.

Manufacturing processes provide a helpful analogy for understanding the step-by-step structure of a computation. Simple manufacturing processes might involve one or a handful of work steps arranged in a chain. Complex operations can involve many chains coming together in elaborate configurations. The video in Figure fig-pencil-video shows the steps of pencil manufacturing. These involves several inputs:

Figure 6.1: Manufacturing a pencil, step by step.

The overall manufacturing process takes several inputs that are shaped and combined to produce the pencil: cedar wood slabs, glue, graphite, enamel paint Each input is processed in a step-by-step manner. At some steps, two partially processed components are combined. For instance, there is a step that grooves the cedar slabs (which are sourced from another production line). The next step put glue in the groves. In still another step, the graphite rods (which come from their own production process) are placed into the glue-filled groves. (Lesson sec-databases introduces the data-wrangling process, join, that combines two inputs.)

There are several different forms of conveyors in the pencil manufacturing line that carry the materials from one manufacturing step to the next. We need only one type of conveyor—the pipe—to connect computing steps.

Manufacturing processes often involve storage or delivery. The video in Figure fig-pencil-video ends before the final steps in the process: boxing the pencils, warehousing them, and the parts of the chain that deliver them to the consumer end-user.Storage, retrieval, and customer use all have their counterparts in computing processes. By default, the object produced by the computing chain is directly delivered to the customer, here by displaying it in some appropriate place, for instance directly under the computer command, or in a viewing panel or document:

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)
country year GDP pop GDPpercap
Korea 2020 874 32 27.31250
France 2020 1203 55 21.87273
Cuba 1950 60 8 7.50000
France 1950 250 40 6.25000

In our computer notation, the storage operation can look like this:

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by=year) -> High_income_countries

At the very end of the pipeline chain, there is a storage arrow (->, as opposed to the pipe, |>) followed by a storage name (High_income_countries). The effect is to place the object at the output end of the chain to be stored in computer memory in a location identified by the storage name.

Retrieval from storage is even simpler: just use the storage name as an input. For instance:

High_income_countries |> 
  select(-GDP, -pop) |> 
  filter(year == 2020) |> 
  kable(digits=2)
country year GDPpercap
Korea 2020 27.31
France 2020 21.87
Pointing out storage from the start

In a previous example we placed the storage arrow (->) at the end of a left-to-right chain of operations. In practice, programmers and authors prefer another arrangement—which we will use from now on in these Lessons—where the storage arrow is at the left end of the chain. The storage arrow still points to the storage name. For instance,

High_income_countries <- Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by=year) 

Using this storage_name <- idiom it is easier to scan code for storage names and to spot when the output of the chain is to be delivered directly to the customer.

What’s in a pipe?

The pipe—that is, |>—carries material from one operation to another. In computer-speak, the word “object” describes this material. That is, pipes convey objects.

Objects come in different “types.” Computer programmers learn to deal with dozens of object types. Fortunately, we can accomplish what we need in statistical computing with just a handful. You have already met two types:

  1. data frames
  2. graphics, consisting of one or more layers, e.g. the point plot as one layer and the annotations as another layer placed on top.

In later lessons, we will introduce two more types—(3) models and (4) simulations.

Pipes connect to functions

At the receiving end of a pipe is an operation on the object conveyed by the pipe. A better word for such an operation is “function.” It is easy to spot the functions in a pipeline: they always consist of a name—such as summarize or point_plot—followed directly by ( and, eventually, a closing ). For example, in

Listing 6.1
Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)

the first function is named mutate. The function output is being piped to a second function, named filter. From now on, whenever we name a function we will write the name followed by () to remind the reader that the name refers to a function: so mutate() and filter(). There are other things that names can refer to. For instance, Nats at the start of the pipeline is a data frame, and GDP, GDPpercap and pop, and year are variables. Such names for non-functions are never followed directly by (.

One of the names appearing in Listing lst-GDP-mean is mean. What kind of computing object does this name refer to?

Because the name is directly followed by a parentheses, we know mean must refer to a function.

Following our convention for writing function names, we should have written the name as mean(), but that would have made the question too easy!

Arguments (inside the parentheses)

Almost always when using a function the human writer of a computer expression needs to specify some details of how the function is to work. These details are always put inside the parentheses following the name of the function. To illustrate, consider the task of plotting the data in the SAT data frame. The skeleton of the computer command is

This skeleton is not a complete command, as becomes evident when the (incomplete) command is evaluated:

What’s missing from the erroneous command is a detail needed to complete the operation. This missing detail is what variables from SAT to map to y and x. This detail should be provided to point_plot() as an argument. As you saw in Lesson sec-point-plots, the argument is written as a tilde expression, for instance sat ~ frac to map sat to y and frac to x. Once we have constructed the appropriate argument for the task at hand, we place it inside the parentheses that follow the function name.

Run Listing lst-sat-point-error with no argument in the parentheses following point_plot. The resulting error message is, admittedly, cryptic. Nonetheless, scan the error message to look for something familiar that provides a clue to how you can fix the command.

Then add in the missing argument to put frac on the x-axis and sat on the y-axis.

Many functions have more than one argument. Some arguments, like the tilde expression argument to point_plot(), may be required. When an argument is not required, the argument itself is given a name and it will have a default value. In the case of point_plot(), there is a second argument named annot= to specify what kind of annotation layer to add on top of the point plot. The default value of annot= turns off the annotation layer.

Named arguments, like annot=, will always be followed by a single equal sign, followed by the value to which that argument is to be set. For instance, point_plot() allows four different values for annot=:

  1. the default (which turns off the annotation)
  2. annot = "violin" specifying a density display annotation
  3. annot = "model" which annotates with a graph of a model
  4. annot = "bw" which creates a traditional “box-and-whiskers” display of distribution. (We will not use such box-and-whiskers annotations in these Lessons, preferring violins instead. Still, they are often seen in practice.)

In these Lessons, the single = sign always signifies a named argument.

A closely related use for = is to give a name to a calculated result from mutate() or summarize(). For instance, suppose you want to calculate the mean sat score and mean fraction in the SAT data frame. This is easy:

SAT |> summarize(mean(sat), mean(frac))
mean(sat) mean(frac)
965.92 35.24

We will often use this unnamed style when the results are intended for the human reader. But if such a calculation is to be fed down the pipeline to further calculations, it can be helpful to give simple names to the result. Frivolously, we’ll illustrate using the names eel and fish:

The reason for the frivolity here is to point out that you get to choose the names for the results calculated by mutate() and summarize(). Needless to say, it’s best to avoid frivolous or misleading names.

Listing lst-store-arrow-sat uses the storage arrow to store a summary of the SAT data frame under the name Results. Run the chunk and note that there is no output printed.

Listing 6.2
  1. Explain why nothing is being printed.

  2. Add a second to Listing lst-store-arrow-sat that will cause Results to be printed. (Hint: The second command will be very short and simple.)

  3. Identify which of the following statements use the named-argument syntax correctly. Answer first just from reading the statement. Then confirm your answer by copying the statement into Listing lst-store-arrow-sat.

    When the statement is not correct, explain why.

    1. Results <- SAT |> summarize(eel = mean(sat), mean(frac) = fish)
    2. Results <- SAT |> summarize(eel <- mean(sat), fish <- mean(frac))
    3. Results <- SAT |> summarize(eel == mean(sat), fish == mean(frac))
    4. Results <- SAT |> summarize(eel = mean(sat), fish = mean(frac))
  4. The following command is valid but uses = in place of the storage arrow. Explain how you can tell, nonetheless, that Results is not a named argument.

    1. `Results = SAT |> summarize(mean(sat), mean(frac))
  1. The use of the storage arrow suppresses printing.
  2. The second command simply refers to the storage name: Results
    1. The name must always be placed to the left of =.
    2. You cannot use the storage arrow (<-) in place of = for a named argument. Instead of a simple name for the output column, the entire argument gets used as the name. This is very inconvenient when you want to refer to the column in a later calculation.
    3. Double equal signs (==) are for comparing the left and right side, rather than creating a column name.
    4. The statement is correct.
  3. Named arguments are only seen as arguments. That is, they must be inside the parentheses following a function name.

Variable names in arguments

Many of the functions we use are on the receiving end of a pipe carrying a data frame. Examples, perhaps already familiar to you: filter(), point_plot(), mutate(), and so on.

A good analogy for a data frame is a shipping box. Inside the shipping box: one or more variables. When a function receives the shipping box data frame, it opens it, providing access to each variable contained therein. In constructing arguments to the function, you do not have to think about the box, just the contents. You refer to the contents only by their names. select() provides a good example, since each argument can be simply the name of a variable, e.g. 

For most uses, the arguments to a function will be an expressions constructed out of variable names. Some examples:

  • SAT |> filter(frac > 50) where the argument checks whether each value of frac is greater than 50.
  • SAT |> mutate(efficiency = sat / expend) where the argument gives a name (efficiency) to an arithmetic combination of sat and expend.
  • SAT |> point_plot(frac ~ expend) where the argument to point_plot() is an expression involving both frac and expend.
  • SAT |> filter(expend > median(expend)) where the argument involves calculating the median expenditure across the state using the median() reduction function, then comparing the calculated median to the actual expenditure in each state. The overall effect is to remove any state with a below-median expenditure from the output of filter().
  • SAT |> select(-state, -frac) uses the - sign to exclude the variables from the output.

Is the argument sat ~ frac for point_plot() a named argument?

No. The symbol between sat and frac is a tilde. Named arguments always use the single equal sign:

The first argument to point_plot() is named tilde =. Placing sat ~ frac as the first argument is entirely equivalent to using the wordier tilde = sat ~ frac. All other arguments, however, must be referred to by name.

Exercises

Exercise 6.1  

Which of these are not valid expressions for handing a data frame named Big as the input to the head() function.

  1. Penguins |> head()
  2. head(Penguins)
  3. head() <| Penguins
  4. Penguins -> head()
  5. head() <- Penguins

Hint: Try them out and see what happens. (In the expressions that generate error messages, the word “assignment” is used instead of “storage.” They mean the same thing. See Enrichment topic enr-why-storage.)

id=Q06-101


Exercise 6.2  

id=Q06-102


Exercise 6.3  

Consider this example of a wrangling command:

Tiny is a data frame and summarize() is a wrangling function whose input must always be a data frame.

  1. What kind of information object is species? Answer: A variable

  2. What kind of information object is n_distinct()? Answer: A function that takes as input a variable.

  3. What would be the output of the command if you replaced species by sex? Answer: The number of distinct sexes represented in the data frame.

  4. What would be the output of the command if you replace species by flipper? Answer: The number of distinct flipper lengths. In the small sample contained in Tiny, there are no repeats in the flipper length.

  5. Big is a superset from which the 8 rows in Tiny were selected.

    1. What will be the output of Tiny |> summarize(ns = n_distinct(species)) if you replace Tiny with Big in the command? Answer: Still 3. Evidently, all of the species in Big appear in one row or another in Tiny.

    2. Using Big as the input to summarize() and flipper as the variable given as an object to the function n_distinct(), what will be the result of the computation? Answer: 56

    3. Are there any repeats in the flipper lengths recorded in Big? (Hint: Take the answer in (ii) and compare it to the output of Big |> nrow().) Answer: Yes, there are many repeated values in the flipper column in Big; out of 344 values, there are only 56 distinct values.

id=Q06-103


Exercise 6.4  

None of the following are complete commands, that is, each of them will lead to an error message rather than an output.

  1. ns = n_distinct(species)
  2. summarize()
  3. Tiny |> summarize()
  4. Tiny |> summarize(species)

For each, give a brief explanation of what’s missing or why the expressions listed can’t work.

Answer:

  1. The statement as given is correct as an argument to summarize(), but can work only if you pipe a data frame that has a variable species (such as the Penguins data frame).
  2. summarize() is a data wrangling function. It must be piped a data frame as input.
  3. You can write, e.g., Tiny |> summarize() but the result will be a data frame with zero rows. There would be no point in this; you should give an argument to summarize saying what calculation to do, e.g. as in (1).
  4. species is the name of a variable within Tiny. But summarize() wants it’s arguments to be functions applied to variables, e.g. summarize(n_distinct(species)) or summarize(how_many = n_distinct(species)).

id=Q06-104


Exercise 6.5  

These two commands differ in only one place, whether there is a .by argument. Yet they produce different outputs. Explain what .by is doing to shape the output

id=Q06-105



Enrichment topics

Using your web browser, open this link in a new tab: https://www.mosaic-web.org/go/datasets/engines.csv.

Depending on how your browser is set up, you will either be directed to a web page showing a data frame about engines or the browser will download a file named “engines.csv” onto your computer.

The .csv suffix on the file name indicates that the file stored at the address https://www.mosaic-web.org/go/datasets/engines.csv is in a format called “comma separated values.” The CSV format is a common way to store spreadsheet files.

In these Lessons most data frames will be accessed in a single step, by name. However, in professional work, data is stored in computer files or on the interweb. For such data, two steps are needed to access the data from within R.

Step 1. Read the file into R, translate the contents into the native R format for data frames, and store the data frame under a name. For a CSV file, an appropriate R function to read and translate the file is readr::read_csv(). As an argument to the function, give the address of the file, making sure to enclose the address in quotation marks: "https://www.mosaic-web.org/go/datasets/engines.csv". This will cause readr::read_csv() to access the web address, then copy and translate the contents into an R format format for data frames. Use the storage arrow <- to store the data frame under the name Engines.

Step 2. Use the storage name, in this example Engines, to access the data frame from within R.

Your task: Read in the “engines.csv” file to R as a data frame, storing it as Engines. Then use nrow() to calculate the number of rows in the data frame. In addition, use names to see the variable names.

Sometimes, you will see an argument written as letters and numbers inside quotation marks, as in annot = "model". The quotation marks instruct the computer to take the contents literally instead of pointing to a function or a variable. (In computer terminology, the content of the quotation marks is called a character string.)

The style of R commands does not use quotations around the names of objects, functions, and variables are not placed in quotations. When you see quotation marks in an example in these Lessons, take note. They are needed, for instance, in saying what kind of annotation should be drawn by point_plot(). If you forget to use the quotation marks where they are needed, the computer will signal an error. Try it!

The error message is terse, but it gives hints; for example, 'arg' suggests the error is about an argument, annot is the name of the problematic argument, and character is meant to point you to some issue involving character strings.

Written English uses space to separate words. It is helpful to the human reader to follow analogous forms in R commands.

  • Use spaces around storage arrows and pipes: x <- 7 |> sqrt() reads better than x<-7|>sqrt().
  • Use spaces between an argument name and its value: mutate(percap = GDP / pop) rather than mutate(percap=GDP/pop).
  • When writing long pipelines, put a newline after the pipe symbol. You can see several instances of this in previous examples in this Lesson. DO NOT, however, start a line with a pipe symbol.

In English, a sentence like “Walk the dog!” is an imperative, a command. Similarly, in R, commands are always imperatives. The English imperative sentence, “Jane, walk the dog!” directs the imperative to a particular actor, namely Jane. The R imperative is always directed to “the computer,” as in, “Computer, select the country and GDP columns for the output.”

“Walk the dog!” has both a verb (“walk”) and a noun (“the dog”). The noun in such an imperative is the object of the verb; the entity that the action (walk) is to be applied to.

R structures sentences/commands differently. Every sentence is a command. The actor is always the computer, there’s no reason to state that explicitly. So the imperative in R looks like this:

the_dog |> walk()

In word order, the object of the action preceeds the action. In data-wrangling commands, the object is always a data frame.

Now a little about arguments …. “Walk the dog!” doesn’t specify an important detail: Who is to hold the leash? An argument can fill in this detail:

the_dog |> walk(“Carlos”)`

Calling the <- token the “storage arrow” is unconventional. Those experienced with computing know that the act of giving a computer object a named storage location is called “assignment.” From the student’s point of view, however, “assignment” has many meanings which have nothing to do with computer storage. For instance, in many courses students are obliged to hand in their work at regular intervals: “assignments.” Synonyms for “assignment” are “task,” “duty,” “job,” and “chore.”

Warning

Kable() won’t work meaningfully in webr-r. So do we want to include this:

We are using the word “table” to refer specifically to a printed display intended for a human reader, as opposed to data frames which, although often readable, are oriented around computer memory.

The readability of tabular content goes beyond placing the content in neatly aligned columns and rows to include the issue of the number of “significant digits” to present. All of the functions we use for statistical computations make use of internal hardware that deals with numbers to a precision of fifteen digits. Such precision is warranted for internal calculations, which often build on one another. But fifteen digits is much more than can be readily assimilated by the human reader. To see why, let’s display calculate yearly GDP growth (in percent) with all the digits that are carried along in internal calculations:

Growth_rate <- Nats |> 
  pivot_wider(country, 
              values_from = c(GDP, pop), 
              names_from = year) |>
  mutate(yearly_growth = 
           100.*((GDP_2020 / GDP_1950)^(1/70.)-1)) |>
  select(country, yearly_growth)
Growth_rate
country yearly_growth
Korea 3.14547099309945
Cuba 0.411820047041944
France 2.26982406656688
India 1.87345150307259

GDP, like many quantities, can be measured only approximately. It would be generous to ascribe a precision of about 1 part in 100 to GDP. Informally, this suggests that only the first two or three digits of a calculation based on GDP can have any real meaning.

The problem of significant digits has two parts: 1) how many digits are worth displaying and 2) how to instruct the computer to display only that number of digits. Point (1) often depends on expert knowledge of a field. Point (2) is much more straightforward; use a computer function that controls the number of digits printed. There are many such functions. For simplicity, we focus on one widely used in the R community, kable().

The purpose of kable() can be described in plain English: to format tabular output for the human reader. Whenever encountering a new function, you will want to find out what are the inputs and what is the output. The primary input to kable() is a data frame. Additional arguments, if any, specify details of the formatting, such as the number of digits to show. For instance:

Growth_rate |> 
  kable(digits = 1, 
        caption = "Annual growth in GDP from 1950 to 2020",
        col.names = c("", "Growth rate (%)"))
Annual growth in GDP from 1950 to 2020
Growth rate (%)
Korea 3.1
Cuba 0.4
France 2.3
India 1.9

The output of kable(), perhaps surprisingly, is not a data frame. Instead, the output is instructions intended for the display’s typesetting facility. The typesetting instructions for web-browsers are often written in a special-purpose language called HTML. So far as these Lessons are concerned, is not important that you understand the HTML instructions. Even so, we show them to you to emphasize an important point: You can’t use the output of kable() as the input to data-wrangling or graphics operation.

<table>
<caption>Annual growth in GDP from 1950 to 2020</caption>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:right;"> Growth rate (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Korea </td>
   <td style="text-align:right;"> 3.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cuba </td>
   <td style="text-align:right;"> 0.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> France </td>
   <td style="text-align:right;"> 2.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> India </td>
   <td style="text-align:right;"> 1.9 </td>
  </tr>
</tbody>
</table>
Under construction

Calculators, Hollerith cards, Fisher’s quote, MCMC, machine learning.

We will take a statistical view of the appropriate number of digits to show in sec-confidence-intervals.