# 12 Regression modeling

In Lesson 10 we used graphics to explore data, sometimes to understand the distribution of a single variable, sometimes to look for patterns displayed by two or three variables when placed together. The descriptions from EDA were usually *qualitative*, using words like “central hump,” “skew,” “positively correlated,” “negatively correlated,” and so on.

Now we turn to a new technique, **regression modeling**, that extracts from data a *quantitative* description of the relationships between variables.

## Response and explanatory variables

In a regression model, one variable plays the role of the **response variable**: the variable whose variation we want to describe. The other variables are **explanatory variables**: the inputs used to account for that variation.

## Model specification with tilde expressions

A tilde expression such as `y ~ x + a + b` is a way of listing the response variable (on the left side of the `~`) and the explanatory variable(s) (on the right side).

We will write model specifications this way, with abstract names: `y` is the response variable, `x` is an explanatory variable, and `a`, `b`, and such are other explanatory variables.

## Model function

Fitting a model turns the specification into a **model function**: a formula that takes the explanatory variables as inputs and returns a fitted value of the response. Only minimal mathematics is needed to read such a function; for example, `y ~ x` corresponds to a straight-line function of the form y = a + b x.
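One way to see the model function explicitly is to build it from the fitted coefficients. A minimal sketch, again assuming the `mtcars` data:

```r
mod <- lm(mpg ~ hp, data = mtcars)
b <- coef(mod)

# The model function: takes hp as input, returns a fitted mpg.
f <- function(hp) b[1] + b[2] * hp

f(100)  # the fitted mpg for a 100-horsepower car
```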

## Fitting

Fitting a model to data produces **coefficients**: the numbers that go into the model formula.

Example: the simplest possible specification, `y ~ 1`, has just one coefficient, and fitting it produces the mean of the response variable.

The `1` is called the “intercept” term. There’s hardly ever a good reason to leave out the intercept term, so the R regression system *always* inserts it even if you don’t put it in yourself. If you insist on suppressing the intercept, you can do so by using `-1` on the explanatory side of the model specification.
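A short sketch of both points, assuming the `mtcars` data:

```r
# Fitting y ~ 1 gives a single coefficient: the mean of the response.
mod <- lm(mpg ~ 1, data = mtcars)
coef(mod)         # the (Intercept) coefficient
mean(mtcars$mpg)  # the same number

# Suppress the intercept with -1 on the explanatory side.
lm(mpg ~ hp - 1, data = mtcars)
```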

## Categorical explanatory variables

When an explanatory variable is categorical, one of its levels serves as the **base level**. Fitting produces one coefficient for each of the other levels of the categorical variable, and each coefficient compares that level to the base level.
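For instance (a sketch treating `cyl` in `mtcars` as a categorical variable):

```r
# cyl has levels 4, 6, and 8; level 4 becomes the base level.
mod <- lm(mpg ~ factor(cyl), data = mtcars)
coef(mod)
# (Intercept)  : the fitted mpg at the base level (cyl == 4)
# factor(cyl)6 : how level 6 compares to the base level
# factor(cyl)8 : how level 8 compares to the base level
```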

## Model output

Evaluating the model function at each case in the data produces the **fitted model values**: the model's output, one value per case.
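In R, the `fitted()` function extracts these values from a fitted model; a sketch, again using `mtcars`:

```r
mod <- lm(mpg ~ hp, data = mtcars)
head(fitted(mod))  # one fitted model value per case in the data
```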

## Residuals

A **residual** is the difference between the actual value of the response variable for a case and the fitted model value for that case. Residuals are centered on zero by the nature of the fitting process.
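Continuing the sketch above, `resid()` extracts the residuals, and their mean is zero (up to round-off) whenever the model includes an intercept term:

```r
head(resid(mod))  # residual = actual response minus fitted model value
mean(resid(mod))  # essentially zero, by construction of the fit
```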

## Regression with lots of data

Example: tiny relationships. Graphically, adding more data may not make a faint pattern any easier to see, but with regression, the more data you have, the more power you have to pull subtle relationships out of it.
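To illustrate, here is a sketch with simulated data: a true slope of 0.02 is buried under noise with a standard deviation of 1, making it invisible in a scatterplot, yet with 100,000 cases the fit pins it down clearly.

```r
set.seed(1)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)  # a tiny relationship, swamped by noise

# The slope estimate comes out near 0.02, with a standard error
# small enough to distinguish it from zero.
summary(lm(y ~ x))$coefficients
```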