12 Regression modeling
This Lesson is very early in the construction phase.
In Lesson 10 we used graphics to explore data, sometimes to understand the distribution of a single variable, sometimes to look for patterns displayed by two or three variables when placed together. The descriptions from that exploratory data analysis (EDA) were usually qualitative, using words like “central hump,” “skew,” “positively correlated,” “negatively correlated,” and so on.
Now we turn to a new technique, regression modeling, that extracts from data a quantitative description of the relationships between variables.
Response and explanatory variables
Model specification with tilde expressions
The tilde expression
y ~ x + a + b
is a way of listing the response variable (on the left side of the tilde) and the explanatory variable(s) (on the right side).
We will write it this way, with abstract names: y is the response variable; x is an explanatory variable; and a, b, and such are other explanatory variables.
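For instance, suppose we wanted to model fuel economy. (A sketch only: the lm() function and the mtcars data set are illustrative choices, not examples from this Lesson.)

    # Specify a model: mpg is the response; wt and hp are explanatory
    lm(mpg ~ wt + hp, data = mtcars)

Here mpg plays the role of y, while wt and hp play the roles of x and a.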
Model function
Example: Minimal mathematics
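One way to read “model function”: the fitted model defines an ordinary function that takes values of the explanatory variables as inputs and returns a fitted value of the response. A minimal sketch, assuming a model fitted with lm() to the mtcars data (both illustrative choices):

    # Fit a model of fuel economy (mpg) as a function of weight (wt)
    mod <- lm(mpg ~ wt, data = mtcars)

    # Wrap the fitted model as an ordinary function of wt
    mod_fun <- function(wt) predict(mod, newdata = data.frame(wt = wt))

    # Evaluate it: the fitted mpg for a 3000-pound car (wt is in 1000s of lbs)
    mod_fun(3)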
Fitting
Fitting a model to data produces the coefficients: the numbers that fill in the formula of the model function.
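A sketch of fitting, again with the illustrative lm() and mtcars choices:

    mod <- lm(mpg ~ wt, data = mtcars)
    coef(mod)
    # (Intercept)          wt
    #   37.285126   -5.344472
    # i.e., the fitted formula is: mpg = 37.29 - 5.34 * wt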
Example: y ~ 1 and the mean
The 1 is called the “intercept” term. There’s hardly ever a good reason to leave out the intercept term, so the R regression system always inserts it even if you don’t put it in yourself. If you insist on suppressing the intercept, you can do so by using -1 on the explanatory side of the model specification.
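A sketch of the connection to the mean (the toy vector is an illustrative assumption): the intercept-only model y ~ 1 has a single coefficient, and least-squares fitting makes that coefficient exactly the mean of the response.

    y <- c(2, 4, 6, 8, 10)
    coef(lm(y ~ 1))   # (Intercept) = 6
    mean(y)           # 6, the same value

    # Suppressing the intercept in a model with an explanatory variable
    coef(lm(mpg ~ wt - 1, data = mtcars))   # one wt coefficient, no intercept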
Categorical explanatory variables
When an explanatory variable is categorical, R picks one of its levels as the base level and compares each of the other levels to it. The fitted model gets one coefficient for each of the other levels of the categorical variable; each coefficient describes how that level differs, on average, from the base level.
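A sketch with mtcars (illustrative), treating the number of cylinders as a categorical variable:

    # cyl takes the values 4, 6, and 8; factor() marks it as categorical
    mod <- lm(mpg ~ factor(cyl), data = mtcars)
    coef(mod)
    # (Intercept)  : mean mpg at the base level (4 cylinders)
    # factor(cyl)6 : how 6-cylinder cars differ, on average, from the base level
    # factor(cyl)8 : how 8-cylinder cars differ, on average, from the base level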
Model output
Fitted model values
The outputs of the model function, one for each row of the data used in fitting.
Residuals
The difference between the actual value of the response variable and the corresponding fitted model value. Residuals are centered on zero by the nature of the fitting process.
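A sketch (same illustrative model as above): fitted() returns the fitted model values, resid() the residuals, and with an intercept in the model the residuals average to zero.

    mod <- lm(mpg ~ wt, data = mtcars)
    head(fitted(mod))   # the model's output for the first few rows
    head(resid(mod))    # actual mpg minus fitted mpg, row by row
    mean(resid(mod))    # essentially zero, up to round-off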
Regression with lots of data
Example: tiny relationships. Graphically, adding more data may never make a pattern visible to the eye. But regression gains power as data accumulate: the more data we have, the better regression can pull even tiny relationships out of the noise, as the sketch below shows.
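A simulation sketch of the idea (the slope, noise level, and sample size are all illustrative assumptions):

    set.seed(1)
    n <- 100000
    x <- rnorm(n)
    y <- 0.01 * x + rnorm(n)   # a tiny true slope buried in noise

    # A scatter plot of (x, y) looks like a formless cloud, yet the
    # regression pins down the slope despite the noise.
    summary(lm(y ~ x))$coefficients["x", ]

With this much data the estimated slope lands near the true 0.01 with a standard error of roughly 1/sqrt(n) ≈ 0.003, so the tiny relationship is clearly distinguishable from zero even though no plot would reveal it.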