12  Regression modeling

Under construction!

This Lesson is very early in the construction phase.

In Lesson 10 we used graphics to explore data, sometimes to understand the distribution of a single variable, sometimes to look for patterns displayed by two or three variables when placed together. The descriptions from that exploratory data analysis (EDA) were usually qualitative, using words like “central hump,” “skew,” “positively correlated,” “negatively correlated,” and so on.

Now we turn to a new technique, regression modeling, that extracts from data a quantitative description of the relationships between variables.

Response and explanatory variables

Model specification with tilde expressions

y ~ x + a + b is a way of listing the response variable (on the left side) and the explanatory variable(s) (on the right side).

We will usually write model specifications this way, with abstract names: y is the response variable, x is an explanatory variable, and a, b, and so on are additional explanatory variables.
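As a small sketch of how this looks in R: a tilde expression is an ordinary R object (a "formula") that can be stored and inspected before any data is involved. The names y, x, a, and b are the abstract placeholders from the text, not columns of a real data frame, and `spec` is just an illustrative variable name.

```r
# A tilde expression can be stored on its own; no data is needed yet.
spec <- y ~ x + a + b

class(spec)   # "formula"

# all.vars() lists the variable names in order of appearance,
# so the first one is the response (left side of the tilde).
response    <- all.vars(spec)[1]
explanatory <- all.vars(spec)[-1]

response      # "y"
explanatory   # "x" "a" "b"
```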

Model function

Example: Minimal mathematics

Fitting

Produces coefficients in a formula
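To make this concrete, here is a minimal sketch that assumes base R's lm() function and the built-in mtcars data frame, neither of which this Lesson names. The helper `model_fun` is hypothetical, written only to show the coefficients being used in a formula.

```r
# Fit the specification mpg ~ wt to the built-in mtcars data.
mod <- lm(mpg ~ wt, data = mtcars)

# Fitting produces one coefficient per term: an intercept and a slope on wt.
b <- coef(mod)
names(b)   # "(Intercept)" "wt"

# The coefficients plug into a formula: mpg = intercept + slope * wt.
# model_fun is an illustrative helper, not part of any package.
model_fun <- function(wt) b[["(Intercept)"]] + b[["wt"]] * wt

# Model output for a car with wt = 3 (wt is recorded in thousands of pounds).
model_fun(3)
```

The same number comes out of predict(mod, data.frame(wt = 3)), which is the usual way to evaluate a fitted model without writing the formula by hand.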

Example: y ~ 1 and the mean

The 1 is called the “intercept” term. There’s hardly ever a good reason to leave out the intercept term, so the R regression system inserts it automatically even if you don’t write it yourself. If you insist on suppressing the intercept, you can do so by including -1 on the explanatory side of the model specification, as in y ~ x - 1.
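A short sketch of both points, again assuming base R's lm() and the built-in mtcars data (choices made here for illustration, not taken from the Lesson): the single coefficient from an intercept-only model equals the mean of the response, and the intercept shows up whether or not it is written.

```r
# Intercept-only model: the one coefficient is the mean of the response.
mod_mean <- lm(mpg ~ 1, data = mtcars)
coef(mod_mean)
mean(mtcars$mpg)   # the same number

# The intercept is inserted even when it is not written ...
with_intercept <- names(coef(lm(mpg ~ wt, data = mtcars)))
with_intercept     # "(Intercept)" "wt"

# ... and is dropped only when -1 appears on the explanatory side.
no_intercept <- names(coef(lm(mpg ~ wt - 1, data = mtcars)))
no_intercept       # "wt"
```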

Categorical explanatory variables

Compare one level to each of the other levels.

The base level.

One coefficient for each of the other levels of the categorical variable
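The sketch below illustrates these three points, assuming base R's lm() with its default coding for factors and the built-in iris data frame (an illustrative choice, not one made in the Lesson). Species has three levels, so the fit has an intercept for the base level plus one coefficient for each of the other two.

```r
# Species is a factor; its first level serves as the base level by default.
levels(iris$Species)   # "setosa" "versicolor" "virginica"

mod_cat <- lm(Sepal.Length ~ Species, data = iris)
cf <- coef(mod_cat)
names(cf)   # "(Intercept)" "Speciesversicolor" "Speciesvirginica"

# Compare against the group means computed directly.
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# The intercept is the mean of the base level (setosa) ...
cf[["(Intercept)"]]
group_means[["setosa"]]

# ... and each other coefficient is that level's mean minus the base mean.
cf[["Speciesversicolor"]]
group_means[["versicolor"]] - group_means[["setosa"]]
```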

Model output

Fitted model values

Residuals

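The residuals are centered on zero by the nature of the fitting process: whenever the model includes an intercept term, the residuals from the fitted model average out to zero.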
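As a quick check of how fitted values and residuals relate, here is a sketch assuming base R's lm(), fitted(), and resid() applied to the built-in mtcars data (illustrative choices, not named in the Lesson).

```r
mod <- lm(mpg ~ wt, data = mtcars)

fit <- fitted(mod)   # model output for each row of the training data
res <- resid(mod)    # response minus fitted value, row by row

# Fitted value plus residual reconstructs the response exactly.
all.equal(unname(fit + res), mtcars$mpg)

# The residuals average to zero, up to floating-point rounding.
mean(res)
```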

Regression with lots of data

Example: tiny relationships. Graphically, a weak pattern may never become visible in a plot, even as we add more data. With regression, however, more data means more power to pull such relationships out of the data.
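One way to see this is with simulated data, sketched below. Everything in it is an assumption made for illustration: the true slope of 0.05, the sample sizes, the random seed, and the helper name `simulate_weak`. The point is only that the confidence interval on the slope narrows as the sample grows.

```r
set.seed(101)   # arbitrary seed, so the sketch is reproducible

# Hypothetical helper: a response that depends only weakly on x
# (true slope 0.05) buried in noise with standard deviation 1.
simulate_weak <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 0.05 * x + rnorm(n))
}

small <- simulate_weak(100)
large <- simulate_weak(100000)

# 95% confidence interval on the slope from each fit.
ci_small <- confint(lm(y ~ x, data = small))["x", ]
ci_large <- confint(lm(y ~ x, data = large))["x", ]

ci_small   # wide: the data say little about so small a slope
ci_large   # narrow: the interval closes in on the true slope

# Interval widths, for direct comparison.
width_small <- unname(diff(ci_small))
width_large <- unname(diff(ci_large))
```

A scatter plot of either data frame looks like a shapeless cloud, yet the larger sample pins the slope down to a narrow range.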