11  Residuals

Having looked at various patterns, including straight-line relationships, it is time to look at how “far” the data are from the pattern.

Patterns that are closer to the data are more “likely.”

Show a grossly misfit pattern, point out that it’s not very likely.

Error models

An “error model” is a simple description of how the residuals, that is, the “deviations” of the data from the pattern, are distributed.

The most common one is the bell-shaped normal curve, dnorm(r_i, mean=0, sd=??), where r_i is the residual for the i-th point.
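A minimal sketch of what the normal error model does with a residual. The residuals are made up, and the two standard deviations are arbitrary stand-ins, since the text leaves sd unspecified.

```r
# Made-up residuals, for illustration only
r <- c(0, 1, 3)

# Error model evaluated with two hypothetical standard deviations
narrow <- dnorm(r, mean = 0, sd = 1)
wide   <- dnorm(r, mean = 0, sd = 3)

# Side by side: small residuals score highest under either error model,
# but the wide model is more tolerant of the large residual (r = 3)
data.frame(r, narrow, wide)
```

The choice of sd matters: a narrow error model rewards points close to the pattern and heavily penalizes points far from it.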

The error model is the means by which we can measure the likelihood of a pattern model.

We have a pattern model, e.g. \(y = b_0 + b_1 x\) and an error model, e.g. dnorm(mean=0, sd=??).

Example: some data showing x-y correlation. Stipulate a straight-line model. Calculate the point-by-point residuals. Plug them into dnorm() to get the likelihood of the model against each point in the data. The likelihood of the whole set of points is the product of the likelihoods for the individual points. Change the pattern model and see how the likelihood changes. (Change the error model by setting its standard deviation.)
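The steps in this example might be sketched as follows. The data are simulated here purely for illustration, and the function name `log_likelihood` is hypothetical. The sketch sums the logs of the per-point likelihoods rather than multiplying the likelihoods directly, since a product of many small numbers can underflow.

```r
# Simulated x-y data, for illustration only (not from the text)
set.seed(101)
x <- runif(50, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(50, mean = 0, sd = 1)

# Log-likelihood of a straight-line pattern model y = b0 + b1 * x
# combined with a normal error model with standard deviation sd
log_likelihood <- function(b0, b1, sd) {
  r <- y - (b0 + b1 * x)                        # point-by-point residuals
  sum(dnorm(r, mean = 0, sd = sd, log = TRUE))  # log of the product of densities
}

good_line  <- log_likelihood(b0 = 2, b1 = 0.5, sd = 1)  # close to how the data were made
flat_line  <- log_likelihood(b0 = 2, b1 = 0,   sd = 1)  # changed pattern model
wide_noise <- log_likelihood(b0 = 2, b1 = 0.5, sd = 5)  # changed error model

c(good_line = good_line, flat_line = flat_line, wide_noise = wide_noise)
```

Both the badly fitting line and the overly wide error model give a lower log-likelihood than the model that resembles how the data were generated.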

::: {.callout-note}
## A simple likelihood calculation

Pattern model: a single value for all points.

Error model: dnorm(mean=0, sd=??)

We have some data, e.g. PGA_index. Our model has two parameters: the single value used for all points and the sd of the error model. Evaluate the error model at each deviation of the data from the candidate value.

The plot below is zoomed in near the peak; start by zooming out to see the whole range.

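library(dplyr)    # bind_rows(), filter()
library(ggplot2)

# Assumes the PGA_index data frame (with a `dist` column) is already loaded.
Res <- list()
param_candidates <- seq(250, 320, by=1)  # candidate single values for the pattern model
sd_candidates <- seq(1, 20, by=1)        # candidate standard deviations for the error model

for (k in seq_along(sd_candidates)) {
  sd <- sd_candidates[k]
  # One row per data point, one column per candidate parameter
  deviations <- outer(PGA_index$dist, param_candidates, FUN="-")
  # Log of the error model evaluated at each deviation
  log_probs <- dnorm(deviations, mean=0, sd=sd, log=TRUE)
  # Sum of logs = log of the product: the log-likelihood of each candidate
  Likelihoods <- colSums(log_probs)
  Res[[k]] <- tibble::tibble(L = Likelihoods, param = param_candidates, sd=sd)
}

All <- bind_rows(Res)
# L is the log-likelihood; the axis limits zoom in on the highest values
ggplot(All |> filter(L > -175), aes(x=param, y=L, color=as.character(sd), group=sd)) +
  geom_line() +
  ylim(-175,-125) +
  xlim(290,310)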
Warning: Removed 133 rows containing missing values (`geom_line()`).

Likelihood (technically)

It will be hard for you to overcome the everyday use of “likelihood” as synonymous with “probability” or “chance.” In the technical meaning of likelihood, chance has nothing at all to do with it. Better synonyms for the technical meaning are “plausibility,” “prospect,” “match,” or “verisimilitude.”

Features of a likelihood:

- The data are fixed: they are whatever is in the data frame.
- There is a hypothesis: a pattern model and an error model, each with its parameters.
- The data are evaluated according to the pattern model and the error model.

If \(\mathbb{D}\) is the data, and \(\alpha, \beta, \ldots\) are the parameters of the pattern and error model, then the likelihood is \({\cal L}_\mathbb{D} (\alpha, \beta, \ldots)\). This is an entirely deterministic calculation: For given \(\mathbb{D}, \alpha, \beta, \ldots\) everyone will get exactly the same value for the likelihood.
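As one concrete instance, take the straight-line pattern model \(y = b_0 + b_1 x\) with a normal error model. The symbols \(\sigma\) (the sd of the error model), \(n\) (the number of rows in the data), and \(r_i\) (the residual of row \(i\)) are notation assumed here for illustration. The likelihood can then be written as

\[
{\cal L}_\mathbb{D}(b_0, b_1, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{r_i^2}{2\sigma^2}\right), \qquad r_i = y_i - (b_0 + b_1 x_i).
\]

Taking logarithms turns the product into a sum, which is the form usually computed in practice:

\[
\ln {\cal L}_\mathbb{D}(b_0, b_1, \sigma) = -n \ln\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} r_i^2 .
\]

Nothing in either expression is random once \(\mathbb{D}\) and the parameters are given.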

We are usually interested in comparing likelihoods for different parameter values.
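A small sketch of such a comparison, using the single-value pattern model. The data are simulated, the name `loglik` is hypothetical, and the error-model sd is held at an arbitrary value so that only the pattern parameter varies.

```r
# Simulated data, for illustration only
set.seed(42)
y <- rnorm(30, mean = 300, sd = 8)

# Log-likelihood of a single-value pattern model with a normal error model
loglik <- function(param, sd = 8) {
  sum(dnorm(y - param, mean = 0, sd = sd, log = TRUE))
}

# Compare two specific parameter values
loglik(300) - loglik(290)

# Search the interval for the value with the highest log-likelihood
best <- optimize(loglik, interval = c(250, 320), maximum = TRUE)
c(best_param = best$maximum, sample_mean = mean(y))
```

With a normal error model and a fixed sd, the single value with the highest likelihood coincides with the sample mean, up to the tolerance of the numerical search.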

Errors?

The term “error model” suggests that the deviations of the data from the pattern result from an error somewhere. This is completely misleading. A better term for the deviations is “residual”: what’s left over when we subtract the hypothesized pattern from the data.

Sometimes the residuals contain important information, or demonstrate that some other factor is at work.
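A sketch of how residuals can point to another factor. Everything here is simulated: the variable names, the two types "A" and "B", and their rates are hypothetical.

```r
# Simulated data: two types that follow lines with different slopes
set.seed(7)
n <- 100
fuel_use  <- runif(n, min = 200, max = 800)
type      <- sample(c("A", "B"), n, replace = TRUE)
rate      <- ifelse(type == "A", 9, 10)
emissions <- rate * fuel_use + rnorm(n, mean = 0, sd = 20)

# Pattern model that ignores type: one line through the origin
fit <- lm(emissions ~ fuel_use - 1)

# Residual = data minus the hypothesized pattern
res <- emissions - fitted(fit)

# Average residual within each type
type_means <- tapply(res, type, mean)
type_means
```

The residuals are not structureless noise: one type sits systematically below the single line and the other above it, which suggests adding type to the pattern model.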

Example: CO₂ annual emissions from different cars.

ggplot(MPG, aes(x=fuel_year, y=CO2_year)) + geom_point()

The big pattern is the diagonal line. But some vehicles deviate from that line. Is this an error or something else?

See this site for a listing of CO₂ produced by burning fuels of different types. Diesel produces roughly 15% more CO₂ per gallon than gasoline (about 10.2 versus 8.9 kg per gallon). Refer to the kg CO₂ per volume column. Note that cars use motor fuel.

ggplot(MPG, aes(x=fuel_year, y=CO2_year, color=fuel)) + geom_point() + geom_lm()
Warning: Using the `size` aesthetic with geom_line was deprecated in ggplot2 3.4.0.
ℹ Please use the `linewidth` aesthetic instead.

lm(CO2_year ~ fuel_year*fuel - fuel - fuel_year - 1, data = MPG) |> conf_interval()
| term | .lwr | .coef | .upr |
|---|---|---|---|
| fuel_year:fuelDU | 10.167002 | 10.196572 | 10.226142 |
| fuel_year:fuelG | 8.880599 | 8.886041 | 8.891484 |
| fuel_year:fuelGM | 8.860811 | 8.886846 | 8.912882 |
| fuel_year:fuelGP | 8.858321 | 8.865385 | 8.872449 |
| fuel_year:fuelGPR | 8.861161 | 8.867703 | 8.874245 |
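Each coefficient is the slope of CO2_year against fuel_year for one fuel type, so the slopes can be compared directly. The sketch below copies two of them from the output above. It assumes that fuelDU is the diesel category and fuelG is regular gasoline, which the output itself does not state.

```r
# Slopes copied from the .coef column of the output above
slope_diesel   <- 10.196572  # fuel_year:fuelDU (assumed to be diesel)
slope_gasoline <- 8.886041   # fuel_year:fuelG  (assumed to be gasoline)

# Percent by which the diesel slope exceeds the gasoline slope
ratio <- slope_diesel / slope_gasoline
round(100 * (ratio - 1), 1)
```

The confidence intervals for the gasoline variants lie close to one another, while the fuelDU interval is far from all of them, so the points off the main diagonal reflect a different fuel rather than an error.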

Exercises

Model index ~ dist + accuracy for PGA_index. Which golfer doesn’t follow the pattern?