2  A pattern for data analysis

Lesson 1 presented the organization of data into data frames. This Lesson starts our journey in statistical thinking by introducing a pattern for data analysis. The pattern will be central to almost everything we do in these Lessons.

A more technical word for “pattern” is “paradigm.” A paradigm is a flexible framework of concepts and procedures for structuring thought.

The word “analysis” refers to taking something apart in order to examine its structure. Correspondingly, “data analysis” amounts to taking data apart. As you saw in Lesson 1, data frames consist of specimens (rows) and variables (columns). Our data analysis paradigm deals with the relationships among variables.

  1. One variable is selected by the analyst (that is, you!) to play the role of a response variable. The goal of the analysis is to explain or account for the row-by-row variation in the response variable.
  2. Other variables, called the explanatory variables, will be used to explain the response variable. That is, the row-by-row variation in the explanatory variables will be matched up to the variation in the response variable, enabling us to take the response variable apart into components, each component corresponding to an explanatory variable.
Later, we will consider interactions between pairs of explanatory variables.
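
As a rough sketch of what “taking the response variable apart into components” can look like, the R code below uses a small made-up data frame (not the Galton data) with a response y and a categorical explanatory variable g. The column names explained and leftover are hypothetical labels chosen for this illustration, and using group means is only one simple way to form a component; later Lessons develop the actual methods.

```r
# Made-up values for illustration only; these are not real measurements.
toy <- data.frame(
  y = c(60, 62, 64, 68, 70, 72),        # response variable
  g = c("F", "F", "F", "M", "M", "M")   # categorical explanatory variable
)

# Component matched to g: each row gets the mean of y within its own group.
toy$explained <- ave(toy$y, toy$g)      # 62 62 62 70 70 70

# Whatever row-by-row variation in y is not matched to g.
toy$leftover <- toy$y - toy$explained   # -2 0 2 -2 0 2

# The two components add back up to the response, row by row.
all.equal(toy$explained + toy$leftover, toy$y)   # TRUE
```

In this sketch, the 8-unit gap between the two group means is the part of the variation in y that lines up with g, while the leftover column holds the variation within each group.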

For instance, the original motivation for collecting the Galton data discussed in Lesson 1 was to understand the genetic heritability of height. Height varies from person to person. Francis Galton’s interest was to understand a person’s height in terms of the parents’ contribution and other factors such as sex. Consequently, height is a suitable choice for the response variable, while mother, father, and sex would be explanatory variables.

Notation

We will be applying the data analysis paradigm of response versus explanatory variables hundreds of times in the following Lessons. We need notation to make it clear which variables are in which roles.

  • In English, we will use phrases like “child’s height versus mother’s height” or “height as a function of father’s height and child’s sex.”
  • The R computer notation uses the tilde character, the squiggly line ~. The response variable name always goes on the left side of the tilde, the names of the explanatory variables on the right side, separated by plus signs. For instance, the two English phrases in the previous bullet point would be written (using the names of the Galton variables) as:
    • height ~ mother
    • height ~ father + sex
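
To see how R itself treats this notation, here is a minimal sketch using only base R. A tilde expression can be stored as a formula object without any data being present. The helper names response_name and explanatory_names are hypothetical, introduced only for this illustration.

```r
# A tilde expression creates a formula object; no data frame is needed yet.
f1 <- height ~ mother
f2 <- height ~ father + sex

class(f1)       # "formula"

# all.vars() lists the variable names in a formula, left side first.
all.vars(f1)    # "height" "mother"
all.vars(f2)    # "height" "father" "sex"

# Hypothetical helpers: in a two-sided formula, element 2 is the left side
# (the response) and element 3 is the right side (the explanatory variables).
response_name     <- function(f) all.vars(f[[2]])
explanatory_names <- function(f) all.vars(f[[3]])

response_name(f2)       # "height"
explanatory_names(f2)   # "father" "sex"
```

Because the roles are fixed by position around the tilde, swapping the two sides (mother ~ height) would name a different response variable and so describe a different analysis.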

Often, we will use generic variable names to make a general point about graphics, data analysis, or modeling. These are:

  • y — the response variable
  • x — a numerical explanatory variable
  • g — a categorical explanatory variable
  • c — a second or third explanatory variable of any sort.
Think of g as standing for “group,” since the value of a categorical variable identifies which of two or more groups a specimen belongs to. c stands for “covariate” or “confounder,” terms that will be introduced in later Lessons.

Too abstract?

It will be easier to understand the importance of this data analysis paradigm once you have seen it in applications. For instance, in the graphics to be introduced in Lesson 3, the response variable will always go on the vertical axis. The first explanatory variable will be represented by position on the horizontal axis. A second explanatory variable (if any) will be shown as color.
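
The axis and color convention just described can be sketched with base R graphics. This is only a stand-in: the graphics function used in Lesson 3 may well be different, and the data frame below holds made-up values under the generic names y, x, and g rather than real measurements.

```r
# Made-up values for illustration only.
toy <- data.frame(
  y = c(61, 63, 66, 67, 70, 72),                 # response variable
  x = c(62, 64, 66, 63, 65, 67),                 # numerical explanatory variable
  g = factor(c("F", "F", "F", "M", "M", "M"))    # categorical explanatory variable
)

# Response (left of the tilde) on the vertical axis, first explanatory
# variable on the horizontal axis, second explanatory variable as color.
group_colors <- as.integer(toy$g)                # 1 for "F", 2 for "M"
plot(y ~ x, data = toy, col = group_colors, pch = 19)
legend("topleft", legend = levels(toy$g),
       col = seq_along(levels(toy$g)), pch = 19)
```

Note that the formula y ~ x already tells plot() which variable goes on which axis: the left side of the tilde becomes the vertical axis.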

I suggest that, as you proceed through the following Lessons, you review this short Lesson occasionally until it becomes second nature.