6  Computing with functions and arguments

Published

2021-01-15

You have already encountered several computer commands for working with data frames and constructing graphics. The command patterns you have seen in fact cover the large majority of what’s needed for these Lessons. This Lesson gives an explicit description and explanation of the patterns, the better to enable you to construct your own commands. Part of this involves providing a nomenclature that supports communicating with other people in order to share insights about how to accomplish tasks and to identify and describe pattern errors.

This blog post gives a little history of the long-standing link between statistics and computing.

The computer is the indispensable tool for statistical thinking. In addition to doing the work of calculations, computer software is a medium for describing and sharing with others statistical techniques and concepts. For instance, three fundamental operations in statistics (randomize, repeat, collect) are easily written in computer languages but have no counterpart in the algebraic notation traditionally used in mathematics education.

Chain of operations

A typical computing task consists of a chain of operations. For instance, each wrangling operation receives input from the ingoing pipe, modifies it, and produces a data frame as an output. This data frame is then “piped” to the next operation. There are also operations that convert the input data frame into a graphic or printed table for the human end-user.

Manufacturing processes provide a useful analogy for understanding the step-by-step structure of a computation. Data frames constitute the basic inputs. Simple manufacturing processes might involve one or a handful of work-steps arranged in a chain. (In these Lessons, most of the computing chains will be short and individually simple.) Complex operations can involve many chains coming together in elaborate configurations. The video in Figure 6.1 shows the steps of pencil manufacturing.

Figure 6.1: Manufacturing a pencil, step by step.

There are several inputs that are shaped and combined to produce the pencil: cedar wood slabs, glue, graphite, enamel paint. Each input is processed in a step-by-step manner. At some steps, two partially processed components are combined. For instance, the cedar slabs (which themselves come from another production line) are grooved in one step. The next step puts glue in the grooves. In still another step, the graphite rods (which come from their own production process) are placed into the glue-filled grooves. (Lesson 7 introduces the data-wrangling process, join, that combines two inputs.)

There are several different forms of conveyor in the pencil manufacturing line that carry the materials from one manufacturing step to the next. We need only one type of conveyor—the pipe—to connect computing steps.

Manufacturing processes often involve storage or delivery. The video in Figure 6.1 ends before the final steps in the process: boxing the pencils, warehousing them, and the parts of the chain that deliver them to the consumer end-user. Storage, retrieval, and customer use all have their counterparts in computing processes. By default, the object produced by the computing chain is directly delivered to the customer, here by displaying it in some appropriate place, for instance directly under the computer command, or in a viewing panel or document:

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)
country    year    GDP   pop   GDPpercap
--------  -----  -----  ----  ----------
Korea      2020    874    32    27.31250
France     2020   1203    55    21.87273
Cuba       1950     60     8     7.50000
France     1950    250    40     6.25000

In our computer notation, the storage operation can look like this:

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by=year) -> High_income_countries

At the very end of the pipeline chain, there is an arrow symbol (->, as opposed to the pipe, |>) followed by a name (High_income_countries). This puts the object created by the pipeline process into storage in a spot identified by the storage name.

Retrieval from storage is even simpler: just use the storage name as an input. For instance:

High_income_countries |> 
  select(-GDP, -pop) |> 
  filter(year == 2020) |> 
  kable(digits=2)
country year GDPpercap
Korea 2020 27.31
France 2020 21.87
Pointing out storage from the start

In a previous example we placed the storage arrow (->) at the end of a left-to-right chain of operations. This is a perfectly valid organization which we used to emphasize the progression of a computational object from step to step and out to storage.

In practice, the -> is rarely used. Most programmers and authors prefer another arrangement—which we will use from now on in these Lessons—where the storage arrow is at the very start of the chain. The storage arrow still points to the storage name. That is,

High_income_countries <- Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by=year) 

Using this storage_name <- idiom it is easier to scan code for storage names and to spot when the output of the chain is to be delivered directly to the customer.
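
The two storage styles can be compared side by side. The following is a minimal sketch, assuming the dplyr package (version 1.1.0 or later, for .by) is installed. The data frame Toy and the storage names A and B are hypothetical, with invented values; they are not part of Nats.

```r
library(dplyr)

# Hypothetical stand-in data; the values are invented for this sketch.
Toy <- data.frame(
  group = c("a", "a", "b", "b"),
  x     = c(10, 20, 30, 60),
  n     = c(2, 5, 3, 4)
)

# Storage arrow at the end of the chain, pointing right
Toy |>
  mutate(ratio = x / n) |>
  filter(ratio > mean(ratio), .by = group) -> A

# Storage arrow at the start of the chain, pointing left
B <- Toy |>
  mutate(ratio = x / n) |>
  filter(ratio > mean(ratio), .by = group)

# Both styles put the same object into storage
identical(A, B)

# Retrieval: use the storage name as the input to a new chain
B |> select(group, ratio)
```

Since neither style changes what is stored, the choice between them is purely a matter of readability.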

What’s in a pipe?

The pipe—that is, |>—carries material from one operation to another. In computer-speak, the word “object” is used to describe this material. That is, pipes convey objects.

Objects come in different “types.” Computer programmers learn to deal with dozens of object types. Fortunately, we can accomplish what we need in statistical computing with just a handful. You have already met two types:

  1. data frames
  2. graphics, consisting of one or more layers, e.g. the point plot as one layer and the annotations as another layer placed on top.

In later lessons, we will introduce two more types—(3) models and (4) simulations.

Pipes connect to functions

At the receiving end of a pipe is an operation on the object conveyed by the pipe. A better word for such an operation is “function.” It is easy to spot the functions in a pipeline: they always consist of a name—such as summarize or pointplot—followed directly by ( and, eventually, a closing ). For example, in

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)

the first function is named mutate. The function output is being piped to a second function, named filter. From now on, whenever we name a function we will write the name followed by () to remind the reader that the name refers to a function: so mutate() and filter(). There are other things that names can refer to. For instance, Nats at the start of the pipeline is a data frame, and GDP, GDPpercap, pop, and year are variables. Such names for non-functions are never followed directly by (.

Example: What does mean refer to?

Another name appearing in the previous code block is mean. What kind of thing does this name refer to?

Because the name is directly followed by an open parenthesis, we know mean must refer to a function.

Following our convention for writing function names, we should have written the name as mean(), but that would have made the question too easy!
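
R itself can confirm whether a name refers to a function. This is a small sketch in base R (version 4.1 or later, for the |> pipe). The storage names values, piped, and direct are hypothetical, and is.function() is a base R helper not otherwise used in these Lessons.

```r
# Ask R whether a name refers to a function
is.function(mean)   # TRUE
is.function(sqrt)   # TRUE

# A pipe hands its object to the function as the first argument,
# so these two commands carry out the same computation.
values <- c(4, 9, 16)
piped  <- values |> mean()
direct <- mean(values)
identical(piped, direct)
```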

Arguments (inside the parentheses)

Almost always when using a function the human writer of a computer expression needs to specify some details of how the function is to work. These details are always put inside the parentheses following the name of the function. To illustrate, consider the task of plotting the data in the SAT data frame. The skeleton of the computer command is

SAT |> pointplot()

This skeleton is not a complete command, as becomes evident when the (incomplete) command is evaluated:

SAT |> pointplot()
Error in data_from_tilde(data, tilde): argument "tilde" is missing, with no default

What’s missing from the erroneous command is a detail needed to complete the operation: which variable from SAT will be put on the vertical axis and which on the horizontal axis. This detail is provided to pointplot() as an argument. As you saw in Lesson 2, this argument is to be written in the form of a tilde expression, for instance sat ~ frac. The argument, of course, is placed inside the parentheses that follow the function name, like this:

SAT |> pointplot(sat ~ frac) 

Many functions have more than one argument. Some arguments may be required, like the tilde expression argument to pointplot(). When an argument is not required, the argument itself is given a name and it will have a default value. In the case of pointplot(), there is a second argument named annot= to specify what kind of annotation layer to add on top of the point plot. The default value of annot= turns off the annotation layer.

Named arguments, like annot=, will always be followed by a single equal sign, followed by the value to which that argument is to be set. For instance, pointplot() allows three different values for annot=: the default (which turns off the annotation), or annot = "violin" specifying a density display annotation, or annot = "model" specifying that the annotation layer shows a model.

In these Lessons, the single = sign always signifies a named argument.
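
Named arguments with default values are not special to pointplot(). As an illustration under the assumption that only base R is available, the sketch below uses two familiar functions: mean(), whose named argument na.rm= defaults to FALSE, and round(), whose named argument digits= defaults to 0. The storage name x is hypothetical.

```r
x <- c(2.5, NA, 4.25)

# Leaving out na.rm= means the default value (FALSE) is used
mean(x)                 # NA, because the missing value is kept

# Setting the named argument overrides the default
mean(x, na.rm = TRUE)   # 3.375

# digits= is a named argument of round(); its default is 0
round(3.14159)              # 3
round(3.14159, digits = 2)  # 3.14
```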

A closely related use for = is to give a name to a calculated result from mutate() or summarize(). For instance, suppose you want to calculate the mean sat score and mean fraction in the SAT data frame. This is easy:

SAT |> summarize(mean(sat), mean(frac))
 mean(sat)   mean(frac)
----------  -----------
    965.92        35.24

We will often use this unnamed style when the results are intended for the human reader. But if such a calculation is being used to feed the pipeline to further calculations, it can be helpful to give simple names to the result. Frivolously, we’ll illustrate using the names eel and fish:

SAT |> summarize(eel = mean(sat), fish = mean(frac))
    eel    fish
-------  ------
 965.92   35.24

The reason for the frivolity here is to point out that you get to choose the names for the results calculated by mutate() and summarize(). Needless to say, it’s best to avoid frivolous or misleading names.
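
To see why simple names help when a result feeds further calculation, here is a minimal sketch assuming the dplyr package is installed. The data frame Scores, its values, and the names avg_score and avg_hours are all invented for illustration.

```r
library(dplyr)

# Hypothetical data with invented values
Scores <- data.frame(score = c(80, 90, 100), hours = c(1, 2, 6))

# Unnamed style: the column names are the expressions themselves
unnamed <- Scores |> summarize(mean(score), mean(hours))
names(unnamed)

# Named style: you choose the column names
named <- Scores |> summarize(avg_score = mean(score), avg_hours = mean(hours))
names(named)

# A chosen name is easy to refer to in the next step of the pipeline
named |> mutate(score_per_hour = avg_score / avg_hours)
```

With the unnamed style, a later step would have to refer to a column called mean(score), which is awkward to type and easy to confuse with a function call.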

Variable names in arguments

Many of the functions we use are on the receiving end of a pipe carrying a data frame. Examples, perhaps already familiar to you: filter(), pointplot(), mutate(), and so on.

A good analogy for a data frame is a shipping box. Inside the shipping box: one or more variables. When a function receives the shipping box data frame, it opens it, providing access to each of the variables contained therein. In constructing arguments to the function, you do not have to think about the box, just the contents. You refer to the contents only by their names. select() provides a good example, since each argument can be simply the name of a variable, e.g.

SAT |> select(state, frac) |> head()
state         frac
-----------  -----
Alabama          8
Alaska          47
Arizona         27
Arkansas         6
California      45
Colorado        29

Even select() sometimes uses expressions constructed out of variable names, such as the - that directs select() to omit a variable from the output.

SAT |> select(-state, -frac) |> head()
 expend   ratio   salary   verbal   math    sat
-------  ------  -------  -------  -----  -----
  4.405    17.2   31.144      491    538   1029
  8.963    17.6   47.951      445    489    934
  4.778    19.3   32.175      448    496    944
  4.459    17.1   28.934      482    523   1005
  4.992    24.0   41.078      417    485    902
  5.443    18.4   34.571      462    518    980

For most uses, the arguments to a function will be an expression constructed out of variable names. Some examples:

  • SAT |> filter(frac > 50) where the argument checks whether each value of frac is greater than 50.
  • SAT |> mutate(efficiency = sat / expend) where the argument gives a name (efficiency) to an arithmetic combination of sat and expend.
  • SAT |> pointplot(frac ~ expend) where the argument to pointplot() is an expression involving both frac and expend.
  • SAT |> filter(expend > median(expend)) where the argument involves calculating the median expenditure across the states using the median() reduction function, then comparing the calculated median to the actual expenditure in each state. The overall effect is to remove from the output of filter() any state with a below-median expenditure.
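
The last of these examples combines a reduction with a row-by-row comparison, and it can help to see the two pieces separately. This is a sketch only, assuming the dplyr package is installed. The data frame Spend and its values are invented and are not taken from SAT.

```r
library(dplyr)

# Hypothetical data with invented values
Spend <- data.frame(
  state  = c("A", "B", "C", "D", "E"),
  expend = c(4, 9, 5, 7, 6)
)

# Step 1: the reduction function collapses the whole variable to one number
Spend |> summarize(med = median(expend))

# Step 2: the comparison is made separately for each row
Flagged <- Spend |> mutate(above = expend > median(expend))
Flagged

# filter() does both steps at once and keeps only the rows where
# the comparison is TRUE
Above <- Spend |> filter(expend > median(expend))
Above
```

Here the median is 6, so only the rows for the hypothetical states B and D remain in Above.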
Styling with space

Written English uses space to separate words. It is helpful to the human reader to follow analogous forms in R commands.

  • Use spaces around storage arrows and pipes: x <- 7 |> sqrt() reads better than x<-7|>sqrt().
  • Use spaces between an argument name and its value: mutate(percap = GDP / pop) rather than mutate(percap=GDP/pop).
  • When writing long pipelines, put a newline after the pipe symbol. You can see several instances of this in previous examples in this Lesson. DO NOT, however, start a line with a pipe symbol.
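
The rule about where to put the newline is more than a matter of taste. The sketch below, in base R (version 4.1 or later), hands two versions of the same short pipeline to R as text, so that the failure can be shown without halting the script. The functions parse(), eval(), and tryCatch() are base R tools used here only as a demonstration device, and the names good, bad, and result are hypothetical.

```r
# Newline AFTER the pipe: R knows the command continues on the next line
good <- "c(1, 4, 9) |>\n  sqrt()"
eval(parse(text = good))   # 1 2 3

# Newline BEFORE the pipe: R treats the first line as a finished command,
# then cannot make sense of a line that begins with |>
bad <- "c(1, 4, 9)\n  |> sqrt()"
result <- tryCatch(
  parse(text = bad),
  error = function(e) "syntax error"
)
result
```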

Displaying tables

The strategy for computing in these lessons is to turn a potentially complicated task into a series of simple steps connected by pipes. This strategy applies also to the generation of printed content intended for a human reader. You can think of this as a “print it prettily” operation placed at the end of a chain of operations connected by pipes.

We include within “printed” the display on a screen.

The readability of tabular content goes beyond placing the content in neatly aligned columns and rows to include the issue of the number of “significant digits” to present. All of the functions we use for statistical computations make use of internal hardware that deals with numbers to a precision of about fifteen digits. Such precision is warranted for internal calculations, which often build on one another. But fifteen digits is much more than can be readily assimilated by the human reader. To see why, let’s calculate yearly GDP growth (in percent) and display it with all the digits that are carried along in internal calculations:

Growth_rate <- Nats |> 
  pivot_wider(id_cols = country, values_from = c(GDP, pop), names_from = year) |>
  mutate(yearly_growth = 100.*((GDP_2020 / GDP_1950)^(1/70.)-1)) |>
  select(country, yearly_growth)
Growth_rate
  country     yearly_growth
1   Korea  3.14547099309945
2    Cuba 0.411820047041944
3  France  2.26982406656688
4   India  1.87345150307259
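
The growth formula inside mutate() can be checked by hand for a single country. The sketch below uses the two GDP values for France that appear in the Nats output earlier in this Lesson. The storage names growth_factor and years are hypothetical.

```r
# GDP for France, as displayed in the Nats output earlier in this Lesson
GDP_1950 <- 250
GDP_2020 <- 1203
years    <- 2020 - 1950   # 70 years

# Overall growth factor across the whole period
growth_factor <- GDP_2020 / GDP_1950

# A constant yearly factor g satisfies g^70 = growth_factor,
# so g is the 70th root of growth_factor. Subtracting 1 and
# multiplying by 100 expresses the yearly growth in percent.
yearly_growth <- 100 * (growth_factor^(1 / years) - 1)
yearly_growth
```

The result agrees with the France row of Growth_rate, about 2.27 percent per year.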

GDP, like many quantities, can be measured only approximately. It would be generous to ascribe a precision of about 1 part in 100 to GDP. Informally, this suggests that only the first two or three digits of a calculation based on GDP can have any real meaning.

The problem of significant digits has two parts: 1) how many digits are worth displaying and 2) how to instruct the computer to display only that number of digits. Point (1) often depends on expert knowledge of a field. Point (2) is much more straightforward; use a computer function that controls the number of digits printed. There are many such functions. For simplicity, we focus on one widely used in the R community, kable().

We will take a statistical view of the appropriate number of digits to show in Chapter 20.

The purpose of kable() can be described in plain English: to format tabular output for the human reader. Whenever encountering a new function, you will want to find out what are the inputs and what is the output. The primary input to kable() is a data frame. Additional arguments, if any, specify details of the formatting, such as the number of digits to show. For instance:

Growth_rate |> 
  kable(digits = 1, 
        caption = "Annual growth in GDP from 1950 to 2020",
        col.names = c("", "Growth rate (%)"))
Annual growth in GDP from 1950 to 2020
Growth rate (%)
Korea 3.1
Cuba 0.4
France 2.3
India 1.9

The output of kable(), perhaps surprisingly, is not a data frame. Instead, the output is instructions intended for the display’s typesetting facility. The typesetting instructions for web-browsers are often written in a special-purpose language called HTML. So far as these Lessons are concerned, it is not important that you understand the HTML instructions. Even so, we show them to you to emphasize an important point: You can’t use the output of kable() as the input to a data-wrangling or graphics operation.

<table>
<caption>Annual growth in GDP from 1950 to 2020</caption>
 <thead>
  <tr>
   <th style="text-align:left;">  </th>
   <th style="text-align:right;"> Growth rate (%) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Korea </td>
   <td style="text-align:right;"> 3.1 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Cuba </td>
   <td style="text-align:right;"> 0.4 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> France </td>
   <td style="text-align:right;"> 2.3 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> India </td>
   <td style="text-align:right;"> 1.9 </td>
  </tr>
</tbody>
</table>
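
That the output of kable() is no longer a data frame can be checked directly. This is a minimal sketch, assuming the knitr package is installed. The data frame Rates, its values, and the storage name Formatted are invented for illustration, and format = "html" is set explicitly so that the result does not depend on where the command is run.

```r
library(knitr)

# Hypothetical data with invented values
Rates <- data.frame(country = c("A", "B"), rate = c(3.14159, 0.41182))

# Format the table for the human reader
Formatted <- Rates |> kable(digits = 1, format = "html")

is.data.frame(Rates)       # TRUE
is.data.frame(Formatted)   # FALSE
class(Formatted)           # "knitr_kable"
```

The practical consequence is to keep kable() as the last step of a chain. For further wrangling or graphics, start again from the stored data frame (here Rates) rather than from the formatted output.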

Exercises

Which of these are not valid expressions for handing a data frame named Big as an argument to the head() function?

  1. Big |> head()
  2. head(Big)
  3. head() <| Big
  4. Big -> head()

Hint: Try them out and see what happens.

Is the argument sat ~ frac for pointplot() a named argument? Answer: No. The symbol between sat and frac is a tilde. Named arguments always use the single equal sign: =.

In the text, an example command was given:

Tiny |> summarize(ns = n_distinct(species))
  ns
 ---
   3

Tiny is a data frame and summarize() is a function whose first argument (the slot filled by the pipe) must always be a data frame.

  1. What kind of information object is species? ## A variable

  2. What kind of information object is n_distinct()? ## A function that takes as input a variable.

  3. What would be the output of the command if you replaced species by sex? ## the number of distinct sexes represented in the data frame.

  4. What would be the output of the command if you replace species by flipper? ## The number of distinct flipper lengths. In the small sample contained in Tiny, there are no repeats in the flipper length.

  5. Big is a superset from which the 8 rows in Tiny were selected.

    1. What will be the output of Tiny |> summarize(ns = n_distinct(species)) if you replace Tiny with Big in the command? ## Still 3. Evidently, all of the species in Big appear in one row or another in Tiny.

    2. Using Big as the input to summarize() and flipper as the variable given as an argument to the function n_distinct(), what will be the result of the computation? # 56.

    3. Are there any repeats in the flipper lengths recorded in Big? (Hint: Take the answer in ii and compare it to the output of Big |> nrow().)

None of the following are complete commands, that is, each of them will lead to an error message rather than an output.

  1. ns = n_distinct(species)
  2. summarize()
  3. Tiny |> summarize()
  4. Tiny |> summarize(species)

For each, give a brief explanation of what’s missing or why the expressions listed can’t work.

What kind of first argument—data frame or variable—does each of these functions require?

  1. nrow()
  2. mean()
  3. n_distinct()

DRAFT. Using sort() or shuffle() within mutate.

Using R: what happens when you close R?

These two commands differ in only one place, whether there is a .by argument. Yet they produce different outputs. Explain what .by is doing to shape the output.

Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap))
Nats |>
  mutate(GDPpercap = GDP / pop) |>
  filter(GDPpercap > mean(GDPpercap), .by = year)

DRAFT: Show a table of the number of births and average weight on each weekday or month. Use some kable formatting, then ask the students to reproduce the whole deal.

For each of the following, make up an R expression that uses an object named fireplace. The expression should have enough context to be able to identify the name as belonging to

  1. a data table Answer: Put it at the start of a chain, e.g. fireplace |> nrow()
  2. a function Answer: Follow it by an open parenthesis, e.g. fireplace()
  3. the name of a named argument Answer: Follow it by = inside the parentheses of a function, e.g., fun(fireplace = 7)
  4. a variable Answer: Place it inside the parentheses of a function, but not in the position of the name of a named argument, e.g., fun(fireplace) or fun(x = fireplace)

Consider these R expressions. (You don’t have to know what the various functions do to solve this problem.)

# prepare the data
Princes <-
  babynames::babynames |>
  filter(name == "Prince") |>
  summarise(yearlyTotal = sum(n), .by = c(year, sex))

# now graph it!
Princes |>
  pointplot(yearlyTotal ~ year + sex, annot = "model")

There are several kinds of named objects in the above expressions.

  1. function name
  2. data table name
  3. variable name
  4. name of a named argument

Using the naming convention and position rules, identify what kind of object each of the following names is used for. That is, assign one of the types 1 through 4 above to each name.

  1. babynames::babynames
  2. filter
  3. name
  4. ==
  5. .by
  6. year
  7. sex
  8. summarise
  9. sum
  10. n
  11. pointplot
  12. yearlyTotal in the first command
  13. yearlyTotal in the second command

Answer:

  1. babynames::babynames: data table. It’s at the head of a chain.
  2. filter: function. Functions are followed by (.
  3. name: a variable (See ==.)
  4. ==: a function (Tricky, see below.)
  5. .by: name of a named argument.
  6. year: variable name
  7. sex: variable name
  8. summarise: function
  9. sum: function
  10. n: variable name
  11. pointplot: function. It’s followed by (.
  12. yearlyTotal in the first command: name of a named argument
  13. yearlyTotal in the second command: variable name.

The tricky one here is ==. This is a function. Like the mathematical functions + and *, etc., == doesn’t use parentheses and goes between its arguments. name == "Prince" is equivalent to "=="(name, "Prince"). It’s easy to mistake == for =. Keep in mind that == is a function and = goes after the name of a named argument. Some other similar functions that you might encounter: !=, >, >=, %in%.
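
The claim that == is a function can be tried out in base R. In this sketch the operator name is wrapped in backticks, which is the usual R way to refer to such a function by name. The vector name and the storage names usual and prefix are made up for the illustration.

```r
name <- c("Prince", "Ada", "Prince")

# The usual way: == goes between its two arguments
usual <- name == "Prince"

# The same function called like any other: name first, arguments in parentheses
prefix <- `==`(name, "Prince")

identical(usual, prefix)   # TRUE

# Other in-between functions work the same way
`+`(2, 3)             # 5
`%in%`("Ada", name)   # TRUE
```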

For each of these computations, say what R function is the most appropriate:

  1. Count the number of cases in a data table. Answer: nrow()
  2. List the names of the variables in a data table. Answer: names()
  3. For data tables in an R package, display the documentation (“codebook”) for the data table. Answer: help() or ?
  4. Load the LST package into your R session. Answer: library(LST)

Some of these are legitimate storage names, others are not. For the ones that are not legitimate, say what is wrong.

  1. essay14 Answer: no problems
  2. first-essay Answer: a dash (-) is not one of the allowed punctuation marks in an object name.
  3. "MyData" Answer: being in quotes, "MyData" is a constant, not an object name.
  4. third_essay Answer: no problems. An underscore is legitimate in an object name.
  5. small sample Answer: a space is not allowed in an object name.
  6. functionList Answer: no problems
  7. FuNcTiOnLiSt Answer: no problems. Admittedly, it’s a perverse and hard to type name, but it’s legal.
  8. .MyData. Answer: no problems. Periods are allowed in an object name. It doesn’t matter where they occur. It would even be legal to use a name like this: ..... <- 7. But this is bad style!
  9. sqrt() Answer: parentheses are not allowed in function names. In the text of this book, the author uses parentheses when referring to a function. That’s just to help remind you that the object name, in this case sqrt is referring to a function as opposed to a data table or variable.
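
One way to check candidate names without trial and error is the base R function make.names(), which rewrites any string into a legal name. A string that comes back unchanged was already legal. This is a sketch of that idea, not a method used elsewhere in these Lessons, and the storage names candidates and is_valid are hypothetical.

```r
candidates <- c("essay14", "first-essay", "third_essay", "small sample",
                "FuNcTiOnLiSt", ".MyData.", "3ralph")

# make.names() repairs illegal names, e.g. replacing a dash or space by a period.
# A candidate is legal exactly when the repair leaves it unchanged.
is_valid <- candidates == make.names(candidates)

data.frame(candidates, is_valid)
```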

These questions refer to the diamonds data table in the ggplot2 package. Take a look at the codebook (using help()) so that you’ll understand the meaning of the tasks. (Motivated by Garrett Grolemund.)

Consider this command pattern, which can be made to perform a specific task by substituting a real function or argument instead of the placeholders verb1, …, args1, …

diamonds |> 
  verb1( args1, .by = args2 ) |> 
  verb2(verb3( args3 )) |> 
  head( 1 )

For each of the following tasks, give appropriate R functions or arguments to substitute in place of verb1, verb2, verb3, args1, args2, and args3.

  1. Which color diamonds seem to be largest on average (in terms of carats)?

  2. Which clarity of diamonds has the largest average “table” per carat?

Answer:

# Task 1
diamonds |> 
  summarise(size=mean(carat, na.rm = TRUE), .by = color) |> 
  arrange(desc(size)) |> 
  head(1)
color        size
------  ---------
J        1.162137
# Task 2
diamonds |> 
  summarise(ave_table=mean(table, na.rm = TRUE), .by = clarity) |> 
  arrange(desc(ave_table)) |> 
  head(1)
clarity    ave_table
--------  ----------
I1          58.30378


Consider this R command:

babynames::babynames |> filter(name == "Prince")

  1. Is the result of the calculation going to be stored? Answer: No. Storage is indicated by the <- storage arrow, pointing to the storage name. If so, what is the storage name? Answer: No storage, so no storage name.

  2. Re-write the above command to store the result under the name Results. Answer: Prepend the command like this: Results <- What kind of object will be stored as Results? Answer: A data frame, the result from the filter() operation.

  3. Continue the pipeline in (2) with |> pointplot(n ~ year + sex). What kind of object will be stored as Results? Answer: A graphics object.

  4. Run the command from (3). Then, in a second command, display the stored Results. Recalling that the n variable is the number of babies given the name “Prince,” what does the graphic tell you about the popularity and gender of the name?

Answer:

The command to display a stored object is simply the storage name.

The graph shows that the name “Prince” has grown in popularity since about 1995. The large majority of babies given that name are male.

Consider this list of some possible mistakes in storing a value under a name.

  1. No storage arrow
  2. Unmatched quotes in character string
  3. Improper syntax for function argument
  4. Invalid storage name
  5. No mistake

For each of the following assignment statements, say what is the mistake.

  1. ralph <- sqrt 10 Answer: Improper syntax for function argument
  2. ralph2 <-- "Hello to you!" Answer: No storage arrow
  3. 3ralph <- "Hello to you!" Answer: Invalid storage name
  4. ralph4 <- "Hello to you! Answer: Unmatched quotes in character string
  5. ralph5 <- date() Answer: There’s no mistake. It’s fine as is.
Warning

Instructors should note and point out to their students that our use of R will be limited to a very small part of the language. The origins of R go back to the 1980s, but the part we will use—an official part of the language—is only about a dozen years old. Characteristic of this modern component of the language is the pipe operator (written |>) which makes more readable the chains of commands often seen in data wrangling procedures.

The reader consulting the Internet or other resources will often encounter R statements written without pipes. These will look foreign, but the difference is superficial. The pipe style can be translated into the older, non-pipe style without loss of functionality but with (in the opinion of many programmers) considerable loss of readability. We’ll make a habit in the next few chapters of providing the old-style commands for those who are used to that style.
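
As a concrete illustration of the translation, here is one short chain written three ways. This is a sketch assuming the dplyr package is installed. The data frame Toy, its values, and the storage names piped, nested, Step1, and stepwise are all invented.

```r
library(dplyr)

# Hypothetical data with invented values
Toy <- data.frame(x = c(10, 20, 30, 60), n = c(2, 5, 3, 4))

# Pipe style: read from top to bottom
piped <- Toy |>
  mutate(ratio = x / n) |>
  filter(ratio > 4.5)

# Older style, nested: read from the inside out
nested <- filter(mutate(Toy, ratio = x / n), ratio > 4.5)

# Older style, step by step with intermediate storage
Step1    <- mutate(Toy, ratio = x / n)
stepwise <- filter(Step1, ratio > 4.5)

# All three produce the same data frame
identical(piped, nested)
identical(piped, stepwise)
```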

MAYBE HAVE A DIAGRAM SHOWING THE BASIC TYPES (data frame, graphics, models, simulations) and which functions carry from one type to another.

Diagram: the basic object types and the functions that carry one type to another.

  • Data frame -> model_train() -> Model
  • Data frame -> pointplot() -> Graphic
  • Data frame -> mutate(), summarize(), arrange(), select(), filter() -> Data frame
  • Model -> conf_interval(), model_eval() -> Data frame
  • Simulation -> sample() -> Data frame

As an example for graphics -> graphics, the following ….

SAT |> 
  pointplot(sat ~ frac) |>
  gf_labs(x = "A nonsense label for the horizontal axis", y = "The vertical axis", title="SAT data frame")

Computer languages, like the R language that we will use in these Lessons, are a means to specify the correct computation, apply the computation to the inputs, and store, display, or pass along the output as an input for further computation. In R and many other computer languages, the computations are called “functions,” and each function has a name. The R “sentences” in which functions are applied to objects are called “commands.” A part of a command, analogous to a phrase in a sentence, is called an “expression.”

A simple example involving an aptly named data frame, Tiny, can get us started. This uses the print() function whose job is to render its input into a form suitable for immediate display, that is, to let you “see” the contents of an information object.

Tiny |> summarize(n_distinct(species))

  n_distinct(species)
1                   3

Another, similar example:

Tiny |> summarize(mean(flipper))
  mean(flipper)
1       205.125


We will come back to the functions used in the above examples—summarize(), n_distinct(), and mean()—in Lesson 5, although the names are so suggestive that you may already be able to intuit what the commands are doing. For now, however, let’s look at the structure of the commands, just as we might look at the structure of an English sentence in terms of verbs and punctuation.

At the heart of the commands, right after the |> pipe, is a function named summarize. There are two clues that summarize names a function:

  1. The name is immediately followed by an open parenthesis: (. This is always a definitive sign that a name refers to a function.
  2. summarize comes immediately after the pipe symbol. The pipe symbol must always be followed by a function.

You can also see that summarize() in each command is being given a second argument. (The first argument is being piped in, the second is contained in the parentheses following summarize.) The second argument (n_distinct(species) and mean(flipper), respectively) has a structure of its own. Noting the parentheses following both n_distinct and mean, we can deduce that these are also functions. And, as is usually the case with functions, they are being passed an argument of their own: species and flipper, respectively.


Tiny records data about penguins of a few different species. There are only eight specimens in Tiny, which makes it easy to see the whole data frame but hard to draw any conclusions about real penguins. Tiny is drawn from Big, which records the measurements on hundreds of penguins.

Ordinarily, we will write function names followed by () as a hint that the name refers to a function. We are breaking that convention here since the literal name of a function does not include the parentheses.