# 6  Computing with functions and arguments

You have already seen the R command patterns that we will use throughout these Lessons: pipelines composed of actions separated by `|>`, names for data frames, functions and their arguments. This Lesson recapitulates and explains those patterns, the better to help you construct your own commands. The Lesson also emphasizes a small technical vocabulary that helps in communicating with other people to share insights and identify errors.

Learning command patterns is a powerful enabler in statistical thinking. The computer is an indispensable tool for statistical thinking. In addition to doing the work of calculations, computer software is a medium for describing statistical techniques and concepts and sharing them with others. For instance, three fundamental operations in statistics (randomize, repeat, collect) are easily expressed in computer languages but have no counterpart in the algebraic notation traditionally used in mathematics education.

## Chain of operations

A typical computing task consists of a chain of operations. For instance, each wrangling operation receives a data-frame object from an incoming pipe `|>`, operates on that data frame to perform the action described by the function's name and arguments, and produces another data frame as output. Depending on the overall task, the output from the operation may be piped into another action or displayed on the screen.

There are also operations, like `point_plot()`, that translate a data frame into another kind of object: a graphic. Starting with Lesson sec-regression, we will work with `model_train()`, a function that translates a data frame into a model. In this Lesson, we will meet two other ways of dealing with the output of a chain of operations: storing the output under a name for later use and formatting a data frame into a table suited to human readers.

Manufacturing processes provide a helpful analogy for understanding the step-by-step structure of a computation. Simple manufacturing processes might involve one or a handful of work steps arranged in a chain. Complex operations can involve many chains coming together in elaborate configurations. The video in Figure fig-pencil-video shows the steps of pencil manufacturing.

The overall manufacturing process takes several inputs that are shaped and combined to produce the pencil: cedar wood slabs, glue, graphite, and enamel paint. Each input is processed in a step-by-step manner. At some steps, two partially processed components are combined. For instance, there is a step that grooves the cedar slabs (which are sourced from another production line). The next step puts glue in the grooves. In still another step, the graphite rods (which come from their own production process) are placed into the glue-filled grooves. (Lesson sec-databases introduces the data-wrangling process, join, that combines two inputs.)

There are several different forms of conveyors in the pencil manufacturing line that carry the materials from one manufacturing step to the next. We need only one type of conveyor, the pipe, to connect computing steps.

Manufacturing processes often involve storage or delivery. The video in Figure fig-pencil-video ends before the final steps in the process: boxing the pencils, warehousing them, and the parts of the chain that deliver them to the consumer end-user. Storage, retrieval, and customer use all have their counterparts in computing processes. By default, the object produced by the computing chain is directly delivered to the customer, here by displaying it in some appropriate place, for instance directly under the computer command, or in a viewing panel or document:

```r
Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)
```

| country | year | GDP | pop | GDPpercap |
|:---|---:|---:|---:|---:|
| Korea | 2020 | 874 | 32 | 27.31250 |
| France | 2020 | 1203 | 55 | 21.87273 |
| Cuba | 1950 | 60 | 8 | 7.50000 |
| France | 1950 | 250 | 40 | 6.25000 |

In our computer notation, the storage operation can look like this:

```r
Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year) -> High_income_countries
```

At the very end of the pipeline chain, there is a storage arrow (`->`, as opposed to the pipe, `|>`) followed by a storage name (`High_income_countries`). The effect is to store the object at the output end of the chain in computer memory, in a location identified by the storage name.

Retrieval from storage is even simpler: just use the storage name as an input. For instance:

```r
High_income_countries |>
  select(-GDP, -pop) |>
  filter(year == 2020) |>
  kable(digits = 2)
```

| country | year | GDPpercap |
|:---|---:|---:|
| Korea | 2020 | 27.31 |
| France | 2020 | 21.87 |

Pointing out storage from the start

In a previous example we placed the storage arrow (`->`) at the end of a left-to-right chain of operations. In practice, programmers and authors prefer another arrangement (which we will use from now on in these Lessons) where the storage arrow is at the left end of the chain. The storage arrow still points to the storage name. For instance,

```r
High_income_countries <- Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)
```

Using this `storage_name <-` idiom, it is easier to scan code for storage names and to spot when the output of the chain is to be delivered directly to the customer.
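The two arrangements can be compared in a small sketch. It uses only base R and a made-up vector of numbers rather than the `Nats` data frame, and the storage names `Roots_right` and `Roots_left` are invented for the illustration.

```r
# Storage arrow at the right end of the chain, pointing to the storage name
c(1, 4, 9) |> sqrt() -> Roots_right

# Storage arrow at the left end of the chain, still pointing to the storage name
Roots_left <- c(1, 4, 9) |> sqrt()

# Neither command displays anything. Using a storage name retrieves the object.
Roots_left
identical(Roots_right, Roots_left)   # both arrangements store the same object
```

Either way, the chain itself is unchanged; only the position of the storage name differs.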

## What's in a pipe?

The pipe (that is, `|>`) carries material from one operation to another. In computer-speak, the word "object" describes this material. That is, pipes convey objects.

Objects come in different "types." Computer programmers learn to deal with dozens of object types. Fortunately, we can accomplish what we need in statistical computing with just a handful. You have already met two types:

1. data frames
2. graphics, consisting of one or more layers, e.g. the point plot as one layer and the annotations as another layer placed on top.

In later lessons, we will introduce two more types: (3) models and (4) simulations.
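One way to see that objects have types is to ask R directly. The sketch below is an aside using base R's `class()` function, which these Lessons do not otherwise rely on. The data frame `Toy` holds made-up values, and `lm()` is a base R model-fitting function used here only as an example of an object that is not a data frame.

```r
# A tiny made-up data frame (hypothetical values, for illustration only)
Toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 7))

class(Toy)                     # "data.frame"

Fit <- lm(y ~ x, data = Toy)   # a model object built from the data frame
class(Fit)                     # "lm"
```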

## Pipes connect to functions

At the receiving end of a pipe is an operation on the object conveyed by the pipe. A better word for such an operation is "function." It is easy to spot the functions in a pipeline: they always consist of a name (such as `summarize` or `point_plot`) followed directly by `(` and, eventually, a closing `)`. For example, in the pipeline used earlier in this Lesson,

`Nats |> mutate(GDPpercap = GDP / pop) |> filter(GDPpercap > mean(GDPpercap), .by = year)`

the first function is named `mutate`. The function output is being piped to a second function, named `filter`. From now on, whenever we name a function we will write the name followed by `()` to remind the reader that the name refers to a function: so `mutate()` and `filter()`. There are other things that names can refer to. For instance, `Nats` at the start of the pipeline is a data frame, and `GDP`, `GDPpercap`, `pop`, and `year` are variables. Such names for non-functions are never followed directly by `(`.

One of the names appearing in Listing lst-GDP-mean is `mean`. What kind of computing object does this name refer to?

Because the name is directly followed by an opening parenthesis, we know `mean` must refer to a function.

Following our convention for writing function names, we should have written the name as `mean()`, but that would have made the question too easy!
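The same question can be put to the computer. As a small aside, base R's `is.function()` reports whether a name refers to a function. The vector `GDP` below is created just for this sketch from the two 2020 values displayed earlier; it is not the variable inside `Nats`.

```r
GDP <- c(874, 1203)   # the 2020 GDP values for Korea and France shown earlier

is.function(mean)     # TRUE: `mean` names a function, so we write mean()
is.function(GDP)      # FALSE: `GDP` here names a plain vector of numbers

mean(GDP)             # the parentheses apply the function to its argument
```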

## Arguments (inside the parentheses)

Almost always when using a function the human writer of a computer expression needs to specify some details of how the function is to work. These details are always put inside the parentheses following the name of the function. To illustrate, consider the task of plotting the data in the `SAT` data frame. The skeleton of the computer command is `SAT |> point_plot()`.

This skeleton is not a complete command, as becomes evident when the (incomplete) command is evaluated: the result is an error message rather than a plot.

What's missing from the erroneous command is a detail needed to complete the operation: which variables from `SAT` to map to y and x. This detail should be provided to `point_plot()` as an argument. As you saw in Lesson sec-point-plots, the argument is written as a tilde expression, for instance `sat ~ frac` to map `sat` to y and `frac` to x. Once we have constructed the appropriate argument for the task at hand, we place it inside the parentheses that follow the function name.

Run Listing lst-sat-point-error with no argument in the parentheses following `point_plot`. The resulting error message is, admittedly, cryptic. Nonetheless, scan the error message to look for something familiar that provides a clue to how you can fix the command.

Then add in the missing argument to put `frac` on the x-axis and `sat` on the y-axis.

Many functions have more than one argument. Some arguments, like the tilde expression argument to `point_plot()`, may be required. When an argument is not required, the argument itself is given a name and it will have a default value. In the case of `point_plot()`, there is a second argument named `annot=` to specify what kind of annotation layer to add on top of the point plot. The default value of `annot=` turns off the annotation layer.

Named arguments, like `annot=`, will always be followed by a single equal sign, followed by the value to which that argument is to be set. For instance, `point_plot()` allows four different values for `annot=`:

1. the default (which turns off the annotation)
2. `annot = "violin"` specifying a density display annotation
3. `annot = "model"` which annotates with a graph of a model
4. `annot = "bw"` which creates a traditional "box-and-whiskers" display of distribution. (We will not use such box-and-whiskers annotations in these Lessons, preferring violins instead. Still, they are often seen in practice.)

In these Lessons, the single `=` sign always signifies a named argument.
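The pattern of a default value that can be overridden by a named argument is common throughout R. As a sketch outside the Lessons' own functions, base R's `round()` has a named argument `digits =` whose default value is 0.

```r
x <- 3.14159

round(x)               # digits = is left at its default of 0, giving 3
round(x, digits = 2)   # the named argument overrides the default, giving 3.14
```

Leaving out `digits =` does not cause an error, just as leaving out `annot =` in `point_plot()` simply turns off the annotation layer.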

A closely related use for `=` is to give a name to a calculated result from `mutate()` or `summarize()`. For instance, suppose you want to calculate the mean sat score and mean fraction in the `SAT` data frame. This is easy:

```r
SAT |> summarize(mean(sat), mean(frac))
```

| mean(sat) | mean(frac) |
|---:|---:|
| 965.92 | 35.24 |

We will often use this unnamed style when the results are intended for the human reader. But if such a calculation is to be fed down the pipeline to further calculations, it can be helpful to give simple names to the result. Frivolously, we'll illustrate using the names `eel` and `fish`:

`SAT |> summarize(eel = mean(sat), fish = mean(frac))`

The reason for the frivolity here is to point out that you get to choose the names for the results calculated by `mutate()` and `summarize()`. Needless to say, it's best to avoid frivolous or misleading names.
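To see why simple names help further down the pipeline, here is a minimal sketch. It assumes the `dplyr` package is installed, and it uses a made-up data frame called `Toy_SAT` (three invented rows, not the real `SAT` data). The names `mean_sat`, `mean_frac`, and `ratio` are likewise chosen just for this example.

```r
library(dplyr)   # assumed to be installed; provides summarize() and mutate()

# Made-up values, NOT the real SAT data frame
Toy_SAT <- data.frame(sat = c(900, 1000, 1100), frac = c(60, 30, 9))

Summary <- Toy_SAT |>
  summarize(mean_sat = mean(sat), mean_frac = mean(frac)) |>
  mutate(ratio = mean_sat / mean_frac)   # the simple names are easy to reuse here

Summary
```

Without the names, the second step would have to refer to columns called `mean(sat)` and `mean(frac)`, which is awkward to type and to read.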

Listing lst-store-arrow-sat uses the storage arrow to store a summary of the `SAT` data frame under the name `Results`. Run the chunk and note that there is no output printed.

1. Explain why nothing is being printed.

2. Add a second command to Listing lst-store-arrow-sat that will cause `Results` to be printed. (Hint: The second command will be very short and simple.)

3. Identify which of the following statements use the named-argument syntax correctly. Answer first just from reading the statement. Then confirm your answer by copying the statement into Listing lst-store-arrow-sat. When the statement is not correct, explain why.

    1. `Results <- SAT |> summarize(eel = mean(sat), mean(frac) = fish)`
    2. `Results <- SAT |> summarize(eel <- mean(sat), fish <- mean(frac))`
    3. `Results <- SAT |> summarize(eel == mean(sat), fish == mean(frac))`
    4. `Results <- SAT |> summarize(eel = mean(sat), fish = mean(frac))`

4. The following command is valid but uses `=` in place of the storage arrow. Explain how you can tell, nonetheless, that `Results` is not a named argument.

    `Results = SAT |> summarize(mean(sat), mean(frac))`

Answers:

1. The use of the storage arrow suppresses printing.
2. The second command simply refers to the storage name: `Results`
3. For the four statements:
    1. The name must always be placed to the left of `=`.
    2. You cannot use the storage arrow (`<-`) in place of `=` for a named argument. Instead of a simple name for the output column, the entire argument gets used as the name. This is very inconvenient when you want to refer to the column in a later calculation.
    3. Double equal signs (`==`) are for comparing the left and right side, rather than creating a column name.
    4. The statement is correct.
4. Named arguments are only seen as arguments. That is, they must be inside the parentheses following a function name.

## Variable names in arguments

Many of the functions we use are on the receiving end of a pipe carrying a data frame. Examples, perhaps already familiar to you: `filter()`, `point_plot()`, `mutate()`, and so on.

A good analogy for a data frame is a shipping box. Inside the shipping box: one or more variables. When a function receives the shipping box data frame, it opens it, providing access to each variable contained therein. In constructing arguments to the function, you do not have to think about the box, just the contents. You refer to the contents only by their names. `select()` provides a good example, since each argument can be simply the name of a variable, e.g. `SAT |> select(state, frac)`.

For most uses, the arguments to a function will be expressions constructed out of variable names. Some examples:

• `SAT |> filter(frac > 50)` where the argument checks whether each value of `frac` is greater than 50.
• `SAT |> mutate(efficiency = sat / expend)` where the argument gives a name (`efficiency`) to an arithmetic combination of `sat` and `expend`.
• `SAT |> point_plot(frac ~ expend)` where the argument to `point_plot()` is an expression involving both `frac` and `expend`.
• `SAT |> filter(expend > median(expend))` where the argument involves calculating the median expenditure across the states using the `median()` reduction function, then comparing the calculated median to the actual expenditure in each state. The overall effect is to remove any state with a below-median expenditure from the output of `filter()`.
• `SAT |> select(-state, -frac)` uses the `-` sign to exclude the variables from the output.
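The examples above can be tried on a small scale. The sketch below assumes the `dplyr` package is installed and uses a made-up four-row data frame, `Toy_SAT`, that borrows the variable names of `SAT` but none of its values.

```r
library(dplyr)   # assumed to be installed

# Made-up values, NOT the real SAT data frame
Toy_SAT <- data.frame(
  state  = c("A", "B", "C", "D"),
  sat    = c(900, 1000, 1100, 1050),
  expend = c(4, 5, 6, 7),
  frac   = c(70, 40, 10, 20)
)

Toy_SAT |> filter(frac > 50)                   # keeps only state A
Toy_SAT |> mutate(efficiency = sat / expend)   # adds a column named efficiency
Toy_SAT |> filter(expend > median(expend))     # the median is 5.5; keeps C and D
Toy_SAT |> select(-state, -frac)               # leaves sat and expend
```

In every argument, the variables are referred to by name alone; the data frame arriving through the pipe supplies their values.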

Is the argument `sat ~ frac` for `point_plot()` a named argument?

No. The symbol between `sat` and `frac` is a tilde. Named arguments always use the single equal sign.

The first argument to `point_plot()` is named `tilde =`. Placing `sat ~ frac` as the first argument is entirely equivalent to using the wordier `tilde = sat ~ frac`. All other arguments, however, must be referred to by name.
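The equivalence between giving a first argument by position and giving it by name can be checked with any function whose first argument has a name. As an illustration outside the Lessons' own functions, base R's `sd()` names its first argument `x`.

```r
v <- c(2, 4, 6)   # made-up numbers

sd(v)        # first argument given by position
sd(x = v)    # the same argument given by name; the result is identical
```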

## Exercises

Exercise 6.1

Which of these are not valid expressions for handing a data frame named `Penguins` as the input to the `head()` function?

1. `Penguins |> head()`
2. `head(Penguins)`
3. `head() <| Penguins`
4. `Penguins -> head()`
5. `head() <- Penguins`

Hint: Try them out and see what happens. (In the expressions that generate error messages, the word "assignment" is used instead of "storage." They mean the same thing. See Enrichment topic enr-why-storage.)

id=Q06-101

Exercise 6.2

id=Q06-102

Exercise 6.3

Consider this example of a wrangling command:

`Tiny |> summarize(ns = n_distinct(species))`

`Tiny` is a data frame and `summarize()` is a wrangling function whose input must always be a data frame.

1. What kind of information object is `species`? Answer: A variable

2. What kind of information object is `n_distinct()`? Answer: A function that takes as input a variable.

3. What would be the output of the command if you replaced `species` by `sex`? Answer: The number of distinct sexes represented in the data frame.

4. What would be the output of the command if you replace `species` by `flipper`? Answer: The number of distinct flipper lengths. In the small sample contained in `Tiny`, there are no repeats in the flipper length.

5. `Big` is a superset from which the 8 rows in `Tiny` were selected.

    (i) What will be the output of `Tiny |> summarize(ns = n_distinct(species))` if you replace `Tiny` with `Big` in the command? Answer: Still 3. Evidently, all of the species in `Big` appear in one row or another in `Tiny`.

    (ii) Using `Big` as the input to `summarize()` and `flipper` as the variable given as the argument to the function `n_distinct()`, what will be the result of the computation? Answer: 56

    (iii) Are there any repeats in the flipper lengths recorded in `Big`? (Hint: Take the answer in (ii) and compare it to the output of `Big |> nrow()`.) Answer: Yes, there are many repeated values in the `flipper` column in `Big`; out of 344 values, there are only 56 distinct values.

id=Q06-103

Exercise 6.4

None of the following are complete commands, that is, each of them will lead to an error message rather than an output.

1. `ns = n_distinct(species)`
2. `summarize()`
3. `Tiny |> summarize()`
4. `Tiny |> summarize(species)`

For each, give a brief explanation of what's missing or why the expressions listed can't work.

1. The statement as given is correct as an argument to `summarize()`, but can work only if you pipe a data frame that has a variable `species` (such as the `Penguins` data frame).
2. `summarize()` is a data wrangling function. It must be piped a data frame as input.
3. You can write, e.g., `Tiny |> summarize()` but the result will be a data frame with no columns. There would be no point in this; you should give an argument to `summarize()` saying what calculation to do, e.g. as in (1).
4. `species` is the name of a variable within `Tiny`. But `summarize()` wants its arguments to be functions applied to variables, e.g. `summarize(n_distinct(species))` or `summarize(how_many = n_distinct(species))`.

id=Q06-104

Exercise 6.5

These two commands differ in only one place, whether there is a `.by` argument. Yet they produce different outputs. Explain what `.by` is doing to shape the output.

id=Q06-105

## Enrichment topics

Depending on how your browser is set up, you will either be directed to a web page showing a data frame about engines or the browser will download a file named â€śengines.csvâ€ť onto your computer.

The `.csv` suffix on the file name indicates that the file stored at the address https://www.mosaic-web.org/go/datasets/engines.csv is in a format called "comma separated values." The CSV format is a common way to store spreadsheet files.

In these Lessons most data frames will be accessed in a single step, by name. However, in professional work, data is often stored in computer files or on the web. For such data, two steps are needed to access the data from within R.

Step 1. Read the file into R, translate the contents into the native R format for data frames, and store the data frame under a name. For a CSV file, an appropriate R function to read and translate the file is `readr::read_csv()`. As an argument to the function, give the address of the file, making sure to enclose the address in quotation marks: `"https://www.mosaic-web.org/go/datasets/engines.csv"`. This will cause `readr::read_csv()` to access the web address, then copy and translate the contents into an R format for data frames. Use the storage arrow `<-` to store the data frame under the name `Engines`.

Step 2. Use the storage name, in this example `Engines`, to access the data frame from within R.
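The two steps can be rehearsed without a network connection. The sketch below is a stand-in, not the engines task itself: it first writes a tiny made-up CSV file to a temporary location, and it uses base R's `read.csv()` in place of `readr::read_csv()` so that no extra package is needed. The file contents, the column names, and the storage name `Small` are all invented for the illustration.

```r
# Preparation (only for this sketch): create a small CSV file in a temporary place
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,cylinders", "alpha,4", "beta,6"), csv_path)

# Step 1: read the file, translate it into a data frame, store it under a name.
# The address is a character string, so it would be quoted if typed directly.
Small <- read.csv(csv_path)

# Step 2: use the storage name to access the data frame
Small |> head()
```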

Your task: Read in the "engines.csv" file to R as a data frame, storing it as `Engines`. Then use `nrow()` to calculate the number of rows in the data frame. In addition, use `names()` to see the variable names.

Sometimes, you will see an argument written as letters and numbers inside quotation marks, as in `annot = "model"`. The quotation marks instruct the computer to take the contents literally instead of pointing to a function or a variable. (In computer terminology, the content of the quotation marks is called a character string.)

In R commands, the names of objects, functions, and variables are not placed in quotation marks. When you see quotation marks in an example in these Lessons, take note. They are needed, for instance, in saying what kind of annotation should be drawn by `point_plot()`. If you forget to use the quotation marks where they are needed, the computer will signal an error. Try it!

The error message is terse, but it gives hints; for example, `'arg'` suggests the error is about an argument, `annot` is the name of the problematic argument, and `character` is meant to point you to some issue involving character strings.
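The difference between a quoted string and an unquoted name can be explored with a small stand-in function. In the sketch below, `draw()` is hypothetical: it is written only for this illustration and is not how `point_plot()` is implemented, so the wording of its error message may differ from what `point_plot()` reports.

```r
# Hypothetical function that accepts only certain character strings for `annot`
draw <- function(annot = c("none", "violin", "model", "bw")) {
  annot <- match.arg(annot)   # base R helper: checks the string against the choices
  paste("annotation:", annot)
}

draw(annot = "model")   # quoted: the literal character string "model"
draw()                  # no argument: the first choice, "none", is the default

# Unquoted: R looks for an object named `model`, finds none, and signals an error.
# tryCatch() captures the error message so that it can be displayed.
message_text <- tryCatch(draw(annot = model),
                         error = function(e) conditionMessage(e))
message_text
```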

Written English uses space to separate words. It is helpful to the human reader to follow analogous forms in R commands.

• Use spaces around storage arrows and pipes: `x <- 7 |> sqrt()` reads better than `x<-7|>sqrt()`.
• Use spaces between an argument name and its value: `mutate(percap = GDP / pop)` rather than `mutate(percap=GDP/pop)`.
• When writing long pipelines, put a newline after the pipe symbol. You can see several instances of this in previous examples in this Lesson. DO NOT, however, start a line with a pipe symbol.
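The line-break advice can be seen in a short sketch using base R and made-up numbers. A line that ends with `|>` is visibly unfinished, so R keeps reading on the next line. A line that starts with `|>` would follow a line that already looks like a complete command.

```r
# Each line ends with the pipe, so R knows that more is coming
total <- c(1, 4, 9) |>
  sqrt() |>
  sum()

total
```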

In English, a sentence like "Walk the dog!" is an imperative, a command. Similarly, in R, commands are always imperatives. The English imperative sentence, "Jane, walk the dog!" directs the imperative to a particular actor, namely Jane. The R imperative is always directed to "the computer," as in, "Computer, select the `country` and `GDP` columns for the output."

"Walk the dog!" has both a verb ("walk") and a noun ("the dog"). The noun in such an imperative is the object of the verb: the entity that the action (walk) is to be applied to.

R structures sentences/commands differently. Every sentence is a command. The actor is always the computer, so there's no reason to state that explicitly. So the imperative in R looks like this:

`the_dog |> walk()`

In word order, the object of the action precedes the action. In data-wrangling commands, the object is always a data frame.

Now a little about arguments ... "Walk the dog!" doesn't specify an important detail: Who is to hold the leash? An argument can fill in this detail:

`the_dog |> walk("Carlos")`
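To make the analogy concrete, here is a playful sketch in which `walk()` is a hypothetical function invented for this purpose. The argument name `holder`, its default value, and the dog's name are all made up.

```r
# Hypothetical function: the piped object arrives as the first argument, `dog`
walk <- function(dog, holder = "someone") {
  paste(holder, "walks", dog)
}

the_dog <- "Rex"            # a made-up object to be acted on

the_dog |> walk()           # "someone walks Rex"
the_dog |> walk("Carlos")   # "Carlos walks Rex"
```

The object before the pipe fills the first argument of `walk()`; whatever is written inside the parentheses fills in the remaining details.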

Calling the `<-` token the "storage arrow" is unconventional. Those experienced with computing know that the act of giving a computer object a named storage location is called "assignment." From the student's point of view, however, "assignment" has many meanings which have nothing to do with computer storage. For instance, in many courses students are obliged to hand in their work at regular intervals: "assignments." Synonyms for "assignment" are "task," "duty," "job," and "chore."

Warning

`kable()` won't work meaningfully in webr-r. So do we want to include this:

We are using the word "table" to refer specifically to a printed display intended for a human reader, as opposed to data frames which, although often readable, are oriented around computer memory.

The readability of tabular content goes beyond placing the content in neatly aligned columns and rows to include the issue of the number of "significant digits" to present. All of the functions we use for statistical computations make use of internal hardware that deals with numbers to a precision of fifteen digits. Such precision is warranted for internal calculations, which often build on one another. But fifteen digits is much more than can be readily assimilated by the human reader. To see why, let's calculate yearly GDP growth (in percent) and display it with all the digits that are carried along in internal calculations:

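```r
Growth_rate <- Nats |>
  # id_cols is given by name; recent versions of tidyr do not accept it by position
  pivot_wider(id_cols = country,
              values_from = c(GDP, pop),
              names_from = year) |>
  # average yearly growth (in percent) over the 70 years from 1950 to 2020
  mutate(yearly_growth =
           100 * ((GDP_2020 / GDP_1950)^(1 / 70) - 1)) |>
  select(country, yearly_growth)
Growth_rate
```

| country | yearly_growth |
|:---|---:|
| Korea | 3.14547099309945 |
| Cuba | 0.411820047041944 |
| France | 2.26982406656688 |
| India | 1.87345150307259 |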

GDP, like many quantities, can be measured only approximately. It would be generous to ascribe a precision of about 1 part in 100 to GDP. Informally, this suggests that only the first two or three digits of a calculation based on GDP can have any real meaning.

The problem of significant digits has two parts: 1) how many digits are worth displaying and 2) how to instruct the computer to display only that number of digits. Point (1) often depends on expert knowledge of a field. Point (2) is much more straightforward; use a computer function that controls the number of digits printed. There are many such functions. For simplicity, we focus on one widely used in the R community, `kable()`.
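Before turning to `kable()`, the arithmetic and the effect of limiting digits can be checked by hand for one country. This sketch uses only base R and France's GDP values from the `Nats` rows displayed earlier (250 in 1950 and 1203 in 2020). The functions `signif()` and `round()` are base R and are shown here as an aside, not as the Lessons' method.

```r
# Average yearly growth (percent) over 70 years, using France's GDP values
growth <- 100 * ((1203 / 250)^(1 / 70) - 1)

print(growth, digits = 15)   # shown with fifteen significant digits
signif(growth, 2)            # two significant digits: 2.3
round(growth, 1)             # one digit after the decimal point: 2.3
```

The first display should agree with the France row in the `Growth_rate` output above.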

The purpose of `kable()` can be described in plain English: to format tabular output for the human reader. Whenever encountering a new function, you will want to find out what are the inputs and what is the output. The primary input to `kable()` is a data frame. Additional arguments, if any, specify details of the formatting, such as the number of digits to show. For instance:

```r
Growth_rate |>
  kable(digits = 1,
        caption = "Annual growth in GDP from 1950 to 2020",
        col.names = c("", "Growth rate (%)"))
```

Annual growth in GDP from 1950 to 2020

|  | Growth rate (%) |
|:---|---:|
| Korea | 3.1 |
| Cuba | 0.4 |
| France | 2.3 |
| India | 1.9 |

The output of `kable()`, perhaps surprisingly, is not a data frame. Instead, the output is instructions intended for the display's typesetting facility. The typesetting instructions for web browsers are often written in a special-purpose language called HTML. So far as these Lessons are concerned, it is not important that you understand the HTML instructions. Even so, we show them to you to emphasize an important point: You can't use the output of `kable()` as the input to a data-wrangling or graphics operation.
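A loose analogy in base R may help, with the caveat that `format()` is not `kable()` and produces plain text rather than HTML. Once numbers have been formatted for display, they are character strings and no longer behave as numbers. The two values below are copied from the `Growth_rate` output.

```r
growth <- c(3.14547099309945, 0.411820047041944)   # Korea and Cuba, from above

shown <- format(round(growth, 1), nsmall = 1)      # formatted for the human reader
shown                # "3.1" "0.4": character strings

is.numeric(shown)    # FALSE: the formatted version is text
is.numeric(growth)   # TRUE: keep the unformatted object for further computing
```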

```html
<table>
<caption>Annual growth in GDP from 1950 to 2020</caption>
<tr>
<th style="text-align:left;">  </th>
<th style="text-align:right;"> Growth rate (%) </th>
</tr>
<tbody>
<tr>
<td style="text-align:left;"> Korea </td>
<td style="text-align:right;"> 3.1 </td>
</tr>
<tr>
<td style="text-align:left;"> Cuba </td>
<td style="text-align:right;"> 0.4 </td>
</tr>
<tr>
<td style="text-align:left;"> France </td>
<td style="text-align:right;"> 2.3 </td>
</tr>
<tr>
<td style="text-align:left;"> India </td>
<td style="text-align:right;"> 1.9 </td>
</tr>
</tbody>
</table>
```

Under construction

Calculators, Hollerith cards, Fisher's quote, MCMC, machine learning.

We will take a statistical view of the appropriate number of digits to show in Lesson sec-confidence-intervals.