Topic 8 Nonlinearity in linear models

Basic idea: Use linear regression to construct models \(f(x)\) that are nonlinear in \(x\).

8.1 Smoothers

Sometimes the model form we want to impose is described by broad properties:

  1. The “smoothness” of the model.
  2. The generalizability of the model, e.g. extrapolation outside the range of the inputs

The broad class of model forms used here are called smoothers. They are a linear combination of a set of functions, called basis functions, that have nice properties.

8.1.1 Ideas of smoothness

  • Continuity: Is the \(n\)th derivative continuous? The higher is \(n\), the smoother is the function.
  • Bumpiness: The integral of the absolute value of the 2nd derivative.

  • Aside: Is continuity always good? Regression discontinuity video

8.1.2 Polynomials

Polynomials have been central to math education for a long time, and there has been a rich theory of them since around the 13th century. For instance:

The fundamental theorem of algebra states that every non-constant single-variable polynomial with complex coefficients has at least one complex root. This includes polynomials with real coefficients, since every real number is a complex number with an imaginary part equal to zero.

  • Polynomials are completely smooth in the sense of continuity: all derivatives are continuous.
  • But they may be bumpy
  • And their behavior for large and small \(x\) is crazy.

8.1.3 The model matrix

The set of basis functions evaluated at the inputs \(x\).

make_model_matrix <- function(x, basis_funs) {
  MM <- matrix(0, nrow=length(x), ncol=length(basis_funs))
  for (i in 1:length(basis_funs)) {
    MM[,i] = basis_funs[[i]](x)
  }
  
  return(MM)
}

Polynomial basis functions:

polynomial_basis_3 <- list(
  function(x) 1,
  function(x) x,
  function(x) x^2,
  function(x) x^3
)
monomial <- function(k) function(x) x^k

make_polynomial_basis <- function(p) {
  lapply(0:p, FUN=monomial)
}
show_smoother(basis=make_polynomial_basis(10),
              data=sample(CPS85, size=100), bootstrap=10, confidence=TRUE)
## Warning in sqrt(rowSums(MMM %*% vcov(mod) * MMM)): NaNs produced

8.1.4 Sigmoidal Functions

sigmoidal_basis <- list(
  function(x) 1, 
  function(x) dnorm(x, mean=25, sd = 10),
  function(x) dnorm(x, mean=40, sd = 15),
  function(x) dnorm(x, mean=55, sd=10)
)

8.1.5 Hat functions

hat <- function(from, width) 
  function(x) {ifelse(x>from & x<(from+width), 1, 0)}
hat_basis <- list(
  function(x) 1,
  hat(20, 10), 
  hat(30, 10),
  hat(40, 10),
  hat(43, 17),
  hat(22, 6),
  hat(50, 10)
)

8.1.6 Fourier analysis

E.g. taking a function apart into Fourier components — sines and cosines

fourier <- function(fun, period) {
  function(x) fun(2*pi*x / period)
}
fourier_set <- list(
  function(x) 1,
#  fourier(cos, 5),
#  fourier(cos, 10),
  fourier(cos, 20),
  fourier(cos, 30),
#  fourier(sin, 5),
#  fourier(sin, 10),
  fourier(sin, 20),
  fourier(sin, 30))
show_smoother(basis=fourier_set, data=mosaic::sample(Wage, size=100), bootstrap=20, confidence=TRUE)

show_smoother(basis=hat_basis, data=mosaic::sample(Wage, size=100), bootstrap=20, confidence=FALSE)

8.2 Steps

step_fun <- function(where) { function(x) ifelse(x > where, 1, 0)}
step_basis <- list(
  function(x) 1,
  step_fun(30),
  step_fun(35),
  step_fun(50),
  step_fun(55)
)
show_smoother(basis=step_basis, data=mosaic::sample(Wage, size=100), bootstrap=0, confidence=FALSE)

8.3 Other functions

  • triangles
  • gaussian: dnorm()
  • sigmoidals: pnorm()
  • spline basis: 1, \(x\), \(x^2\), \(x^3\), \((x-\xi_j)^3_+\)

8.4 Holes in the data

Leave out the middle of the data

8.5 Bootstrapping

8.6 Normal theory confidence bands

  • covariance matrix for model coefficients
  • rowSums(MM %*% cov * MM)
data(SwimRecords, package = "mosaicData")
model_formula <- time ~ year
MM <- model.matrix(model_formula, data=SwimRecords)
mod <- lm(model_formula, data = SwimRecords)
SE <- rowSums(MM %*% vcov(mod) * MM)
top <- fitted(mod) + 2*SE
bottom <- fitted(mod) - 2*SE
#if(!is.null(knitr::current_input() ))
   plot(time ~ year, data=SwimRecords)
points(MM[,2], top, pch=20)
points(MM[,2], bottom, pch=20)

8.7 Splines

8.7.1 B-splines

library(splines)
x <- seq(0, 100, by = 0.1)
funs <- splines::bs(x, df = 15)
plot(x, funs[,1], type="l")
for (k in 1:ncol(funs))
  lines(x, funs[,k], col=topo.colors(ncol(funs))[k])

mod <- lm(time ~ sex * splines::bs(year, df=5), data=SwimRecords)
plot(time ~ year, data=SwimRecords)
with(SwimRecords, points(year, fitted(mod), pch=20))

For_plotting <- expand.grid(sex = c("F", "M"), year = x)
preds <- predict(mod, newdata = For_plotting)
plot(preds ~ year, col = sex, pch=20, data = For_plotting)
with(SwimRecords, points(year, time))

8.7.2 Natural splines

funs <- splines::ns(x, df = 15)
plot(x, funs[,1], type="l")
for (k in 1:ncol(funs))
  lines(x, funs[,k], col=topo.colors(ncol(funs))[k])

mod <- lm(time ~ sex * splines::bs(year, df=5), data=SwimRecords)
plot(time ~ year, data=SwimRecords)
with(SwimRecords, points(year, fitted(mod), pch=20))

For_plotting <- expand.grid(sex = c("F", "M"), year = x)
preds <- predict(mod, newdata = For_plotting)
plot(preds ~ year, col = sex, pch=20, data = For_plotting)
with(SwimRecords, points(year, time))

8.7.3 Smoothing splines

mod <- with(SwimRecords, smooth.spline(year, time, df = 5))
multiples <- duplicated( SwimRecords$time )
mod2 <- with(SwimRecords, smooth.spline(year, time, cv = TRUE))
## Warning in smooth.spline(year, time, cv = TRUE): cross-validation with non-
## unique 'x' values seems doubtful
mod2$df
## [1] 3.47454
plot(time ~ year, data=SwimRecords)
lines(mod, col="red", lwd=2)
lines(mod2, col="blue", lwd=3)

8.7.4 Smoothers in k dimensions

8.8 GAMS

SOMETHING’S WRONG

library(gam)
mod <- gam(time ~ sex * s(year, 2), data=SwimRecords)
preds <- predict(mod, newdata=X)
plot(X$year, preds)