# 2 A pattern for data analysis

Lesson 1 presented the organization of data into data frames. This Lesson starts our journey in statistical thinking by introducing a pattern for **data analysis**. The pattern will be central to almost everything we do in these Lessons.

Such a pattern is often called a “**paradigm**.” A paradigm is a flexible framework of concepts and procedures for structuring thought.

The word “analysis” refers to taking something apart in order to examine its structure. Correspondingly, “data analysis” amounts to taking data apart. As you saw in Lesson 1, data frames consist of specimens (rows) and variables (columns). Our data analysis paradigm deals with the relationships among variables.

- One variable is selected by the analyst (that is, you!) to play the role of a **response variable**. The goal of the analysis is to *explain* or *account for* the row-by-row variation in the response variable.

- Other variables, the **explanatory variables**, will be used to explain the response variable. That is, the row-by-row variation in the explanatory variables will be matched up to the variation in the response variable, enabling us to take the response variable apart into components, each component corresponding to an explanatory variable.

A component can also correspond to *interactions* between pairs of explanatory variables.

For instance, the original motivation for collecting the `Galton` data discussed in Lesson 1 was to understand the genetic heritability of *height*. Height varies from person to person. Francis Galton’s interest was to understand a person’s height in terms of the parents’ contribution and other factors such as sex. Consequently, `height` is a suitable choice for the *response* variable, while `mother`, `father`, and `sex` would be *explanatory* variables.

## Notation

We will be applying the data analysis paradigm of *response* versus *explanatory* variables hundreds of times in the following Lessons. We need notation to make it clear which variables are in which roles.

- In English, we will use phrases like “child’s height *versus* mother’s height” or “height *as a function of* father’s height and child’s sex.”
- The R computer notation uses the *tilde character*: the squiggly line `~`. The response variable name always goes on the left side of the tilde, the names of the explanatory variables on the right side, separated by plus signs. For instance, the two English phrases in the previous bullet point would be written (using the names of the `Galton` variables) as:
  - `height ~ mother`
  - `height ~ father + sex`
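As a rough illustration of the notation, the sketch below writes the two tilde expressions as R formula objects and asks R which name sits on each side of the tilde. It uses only base R (`deparse()`, `terms()`, `labels()`). The helper names `response_of()` and `explanatory_of()` are hypothetical, made up for this example. No data frame is needed, because a formula only records variable names.

```r
# Two formulas written with the Galton variable names.
# A formula stores names only, so the Galton data frame is not required here.
f1 <- height ~ mother
f2 <- height ~ father + sex

# Hypothetical helpers (not from any package) that read off the two roles.
# In a two-sided formula, element 2 is the left side of the tilde.
response_of <- function(f) deparse(f[[2]])
# terms() parses the right side; labels() returns one entry per explanatory term.
explanatory_of <- function(f) labels(terms(f))

response_of(f1)      # "height"
explanatory_of(f1)   # "mother"
response_of(f2)      # "height"
explanatory_of(f2)   # "father" "sex"
```

The same response variable appears in both formulas; only the choice of explanatory variables changes.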

Often, we will use *generic variable names* to make a general point about graphics, data analysis, or modeling. These are:

- `y`: the response variable
- `x`: a numerical explanatory variable
- `g`: a categorical explanatory variable
- `c`: a second or third explanatory variable of any sort.

Think of `g` as standing for “group,” since the value of a categorical variable identifies which of two or more groups a specimen belongs to. `c` stands for “covariate” or “confounder,” terms that will be introduced in later Lessons.

## Too abstract?

It will be easier to understand the importance of this *data analysis paradigm* once you have seen it in applications. For instance, in the graphics to be introduced in Lesson 3, the *response* variable will always go on the vertical axis. The first explanatory variable will be represented by position on the horizontal axis. A second explanatory variable (if any) will be shown as color.
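To make the axis and color roles concrete, here is a minimal sketch using the generic names `y`, `x`, and `g`. It rests on assumptions: the data are simulated purely for illustration (the data frame name `sim`, the seed, and the numbers in the simulation are arbitrary), and the plot uses base R `plot()` as a stand-in, which is not necessarily the graphics function that Lesson 3 introduces.

```r
# Simulated data, for illustration only: these are not real measurements.
set.seed(101)  # arbitrary seed so the example is repeatable
n <- 50
sim <- data.frame(
  x = runif(n, min = 0, max = 10),                       # numerical explanatory variable
  g = factor(sample(c("A", "B"), n, replace = TRUE))     # categorical explanatory variable
)
# Response variable: depends on x and g, plus random noise.
sim$y <- 2 + 0.5 * sim$x + ifelse(sim$g == "B", 1.5, 0) + rnorm(n)

# Response y on the vertical axis, first explanatory variable x on the
# horizontal axis, second explanatory variable g shown as color.
plot(y ~ x, data = sim, col = as.integer(sim$g), pch = 19)
legend("topleft", legend = levels(sim$g),
       col = seq_along(levels(sim$g)), pch = 19, title = "g")
```

The formula `y ~ x` passed to `plot()` follows the same convention as before: the response is on the left of the tilde, and it ends up on the vertical axis.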

I suggest that, as you proceed through the following Lessons, you review this short Lesson occasionally until it becomes second nature.