2 A pattern for data analysis
Lesson 1 presented the organization of data into data frames. This Lesson starts our journey in statistical thinking by introducing a pattern for data analysis. The pattern will be central to almost everything we do in these Lessons.
The word “analysis” refers to taking something apart in order to examine its structure. Correspondingly, “data analysis” amounts to taking data apart. As you saw in Lesson 1, data frames consist of specimens (rows) and variables (columns). Our data analysis paradigm deals with the relationships among variables.
- One variable is selected by the analyst (that is, you!) to play the role of a response variable. The goal of the analysis is to explain or account for the row-by-row variation in the response variable.
- Other variables—the explanatory variables—will be used to do the explaining of the response variable. That is, the row-by-row variation in the explanatory variables will be matched up to the variation in the response variable, enabling us to take the response variable apart into components, each component corresponding to an explanatory variable.
For instance, the original motivation for collecting the `Galton` data discussed in Lesson 1 was to understand the genetic heritability of height. Height varies from person to person. Francis Galton’s interest was to understand a person’s height in terms of the parents’ contribution and other factors such as sex. Consequently, `height` is a suitable choice for the response variable, while `mother`, `father`, and `sex` would be explanatory variables.
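As a rough sketch of what assigning roles looks like in practice, the R snippet below builds a tiny data frame and separates the response column from the explanatory columns. The data frame `toy` and its values are invented for illustration (they are not rows from the `Galton` data), and the helper names `response` and `explanatory` are assumptions of this sketch; only the column names follow the text.

```r
# Invented rows for illustration only; these are NOT values from the Galton data.
toy <- data.frame(
  height = c(70, 64, 68, 62),
  mother = c(65, 65, 63, 63),
  father = c(71, 71, 68, 68),
  sex    = c("M", "F", "M", "F")
)

# The analyst chooses the roles (hypothetical helper names).
response    <- "height"
explanatory <- c("mother", "father", "sex")

# The response column: the row-by-row variation we want to account for.
toy[[response]]

# The explanatory columns: the variables used to do the accounting.
toy[explanatory]
```

Nothing in the data frame itself marks a column as response or explanatory. The roles are a choice made by the analyst for a particular question.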
Notation
We will be applying the data analysis paradigm of response versus explanatory variables hundreds of times in the following Lessons. We need notation to make it clear which variables are in which roles.
- In English, we will use phrases like “child’s height versus mother’s height” or “height as a function of father’s height and child’s sex.”
- The R computer notation uses the tilde character: the squiggly line `~`. The response variable name always goes on the left side of the tilde, the names of the explanatory variables on the right side, separated by plus signs. For instance, the two English phrases in the previous bullet point would be written (using the names of the `Galton` variables) as:
  - `height ~ mother`
  - `height ~ father + sex`
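To make the left-side and right-side convention concrete, here is a small sketch in base R. It treats the two expressions above as formula objects and pulls the variable names back out of each side. The object names `f1`, `f2`, `response`, `explanatory`, and `f2_rebuilt` are assumptions of this sketch; `all.vars()` and `reformulate()` are standard functions in base R.

```r
# A formula is an ordinary R object; writing one does not require any data.
f1 <- height ~ mother
f2 <- height ~ father + sex

class(f2)     # "formula"
all.vars(f2)  # "height" "father" "sex"

# For a two-sided formula, element 2 is the left side of the tilde
# and element 3 is the right side.
response    <- all.vars(f2[[2]])  # the response variable
explanatory <- all.vars(f2[[3]])  # the explanatory variables

# reformulate() goes the other way: from role names to a formula.
f2_rebuilt <- reformulate(explanatory, response = response)
f2_rebuilt
```

Swapping the two sides of the tilde swaps the roles, so `mother ~ height` would ask a different question than `height ~ mother`.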
Often, we will use generic variable names to make a general point about graphics, data analysis, or modeling. These are:
- `y`: the response variable
- `x`: a numerical explanatory variable
- `g`: a categorical explanatory variable
- `c`: a second or third explanatory variable of any sort.
Think of `g` as standing for “group,” since the value of a categorical variable identifies which of two or more groups a specimen belongs to. The name `c` stands for “covariate” or “confounder,” terms that will be introduced in later Lessons.
Too abstract?
It will be easier to understand the importance of this data analysis paradigm once you have seen it in applications. For instance, in the graphics to be introduced in Lesson 3, the response variable will always go on the vertical axis. The first explanatory variable will be represented by position on the horizontal axis. A second explanatory variable (if any) will be shown as color.
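As a preview of that mapping, the sketch below uses base R's `plot()` with a formula. This is only one way to draw such a graph and is not necessarily the graphics function Lesson 3 will use. The data frame `toy`, its values, and the color choices are invented for illustration.

```r
# Invented values for illustration; not the Galton data.
toy <- data.frame(
  y = c(70, 64, 68, 62, 71, 63),                # response
  x = c(65, 65, 63, 63, 66, 64),                # numerical explanatory variable
  g = factor(c("M", "F", "M", "F", "M", "F"))   # categorical explanatory variable
)

# Second explanatory variable -> color: one color per level of g.
palette_by_group <- c(F = "darkorange", M = "steelblue")
point_colors <- unname(palette_by_group[as.character(toy$g)])

# With a formula, plot() puts the response (left of the tilde) on the
# vertical axis and the explanatory variable on the horizontal axis.
plot(y ~ x, data = toy, col = point_colors, pch = 19)
legend("topleft", legend = names(palette_by_group),
       col = palette_by_group, pch = 19, title = "g")
```

The formula `y ~ x` alone fixes which variable goes on which axis. The color vector is built separately because base `plot()` has no slot in the formula for a color variable.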
I suggest that, as you proceed through the following Lessons, you review this short Lesson occasionally until it becomes second nature.