Topic 8 Nonlinearity in linear models
Basic idea: Use linear regression to construct models \(f(x)\) that are nonlinear in \(x\).
8.1 Smoothers
Sometimes the model form we want to impose is described by broad properties:
- The “smoothness” of the model.
- The generalizability of the model, e.g. extrapolation outside the range of the inputs
The broad class of model forms used here are called smoothers. They are a linear combination of a set of functions, called basis functions, that have nice properties.
8.1.1 Ideas of smoothness
- Continuity: Is the \(n\)th derivative continuous? The higher is \(n\), the smoother is the function.
Bumpiness: The integral of the absolute value of the 2nd derivative.
Aside: Is continuity always good? Regression discontinuity video
8.1.2 Polynomials
Polynomials have been central to math education for a long time, and there has been a rich theory of them since around the 13th century. For instance:
The fundamental theorem of algebra states that every non-constant single-variable polynomial with complex coefficients has at least one complex root. This includes polynomials with real coefficients, since every real number is a complex number with an imaginary part equal to zero.
- Polynomials are completely smooth in the sense of continuity: all derivatives are continuous.
- But they may be bumpy
- And their behavior for large and small \(x\) is crazy.
8.1.3 The model matrix
The set of basis functions evaluated at the inputs \(x\).
make_model_matrix <- function(x, basis_funs) {
MM <- matrix(0, nrow=length(x), ncol=length(basis_funs))
for (i in 1:length(basis_funs)) {
MM[,i] = basis_funs[[i]](x)
}
return(MM)
}Polynomial basis functions:
polynomial_basis_3 <- list(
function(x) 1,
function(x) x,
function(x) x^2,
function(x) x^3
)monomial <- function(k) function(x) x^k
make_polynomial_basis <- function(p) {
lapply(0:p, FUN=monomial)
}show_smoother(basis=make_polynomial_basis(10),
data=sample(CPS85, size=100), bootstrap=10, confidence=TRUE)## Warning in sqrt(rowSums(MMM %*% vcov(mod) * MMM)): NaNs produced

8.1.4 Sigmoidal Functions
sigmoidal_basis <- list(
function(x) 1,
function(x) dnorm(x, mean=25, sd = 10),
function(x) dnorm(x, mean=40, sd = 15),
function(x) dnorm(x, mean=55, sd=10)
)8.1.5 Hat functions
hat <- function(from, width)
function(x) {ifelse(x>from & x<(from+width), 1, 0)}
hat_basis <- list(
function(x) 1,
hat(20, 10),
hat(30, 10),
hat(40, 10),
hat(43, 17),
hat(22, 6),
hat(50, 10)
)8.1.6 Fourier analysis
E.g. taking a function apart into Fourier components — sines and cosines
fourier <- function(fun, period) {
function(x) fun(2*pi*x / period)
}
fourier_set <- list(
function(x) 1,
# fourier(cos, 5),
# fourier(cos, 10),
fourier(cos, 20),
fourier(cos, 30),
# fourier(sin, 5),
# fourier(sin, 10),
fourier(sin, 20),
fourier(sin, 30))
show_smoother(basis=fourier_set, data=mosaic::sample(Wage, size=100), bootstrap=20, confidence=TRUE)
show_smoother(basis=hat_basis, data=mosaic::sample(Wage, size=100), bootstrap=20, confidence=FALSE)
8.2 Steps
step_fun <- function(where) { function(x) ifelse(x > where, 1, 0)}
step_basis <- list(
function(x) 1,
step_fun(30),
step_fun(35),
step_fun(50),
step_fun(55)
)
show_smoother(basis=step_basis, data=mosaic::sample(Wage, size=100), bootstrap=0, confidence=FALSE)
8.3 Other functions
- triangles
- gaussian:
dnorm() - sigmoidals:
pnorm() - spline basis: 1, \(x\), \(x^2\), \(x^3\), \((x-\xi_j)^3_+\)
8.4 Holes in the data
Leave out the middle of the data
8.5 Bootstrapping
8.6 Normal theory confidence bands
- covariance matrix for model coefficients
rowSums(MM %*% cov * MM)
data(SwimRecords, package = "mosaicData")
model_formula <- time ~ year
MM <- model.matrix(model_formula, data=SwimRecords)
mod <- lm(model_formula, data = SwimRecords)
SE <- rowSums(MM %*% vcov(mod) * MM)
top <- fitted(mod) + 2*SE
bottom <- fitted(mod) - 2*SE
#if(!is.null(knitr::current_input() ))
plot(time ~ year, data=SwimRecords)
points(MM[,2], top, pch=20)
points(MM[,2], bottom, pch=20)
8.7 Splines
8.7.1 B-splines
library(splines)
x <- seq(0, 100, by = 0.1)
funs <- splines::bs(x, df = 15)
plot(x, funs[,1], type="l")
for (k in 1:ncol(funs))
lines(x, funs[,k], col=topo.colors(ncol(funs))[k])
mod <- lm(time ~ sex * splines::bs(year, df=5), data=SwimRecords)
plot(time ~ year, data=SwimRecords)
with(SwimRecords, points(year, fitted(mod), pch=20))
For_plotting <- expand.grid(sex = c("F", "M"), year = x)
preds <- predict(mod, newdata = For_plotting)
plot(preds ~ year, col = sex, pch=20, data = For_plotting)
with(SwimRecords, points(year, time))
8.7.2 Natural splines
funs <- splines::ns(x, df = 15)
plot(x, funs[,1], type="l")
for (k in 1:ncol(funs))
lines(x, funs[,k], col=topo.colors(ncol(funs))[k])
mod <- lm(time ~ sex * splines::bs(year, df=5), data=SwimRecords)
plot(time ~ year, data=SwimRecords)
with(SwimRecords, points(year, fitted(mod), pch=20))
For_plotting <- expand.grid(sex = c("F", "M"), year = x)
preds <- predict(mod, newdata = For_plotting)
plot(preds ~ year, col = sex, pch=20, data = For_plotting)
with(SwimRecords, points(year, time))
8.7.3 Smoothing splines
mod <- with(SwimRecords, smooth.spline(year, time, df = 5))
multiples <- duplicated( SwimRecords$time )
mod2 <- with(SwimRecords, smooth.spline(year, time, cv = TRUE))## Warning in smooth.spline(year, time, cv = TRUE): cross-validation with non-
## unique 'x' values seems doubtful
mod2$df## [1] 3.47454
plot(time ~ year, data=SwimRecords)
lines(mod, col="red", lwd=2)
lines(mod2, col="blue", lwd=3)
8.7.4 Smoothers in k dimensions
8.8 GAMS
SOMETHING’S WRONG
library(gam)
mod <- gam(time ~ sex * s(year, 2), data=SwimRecords)
preds <- predict(mod, newdata=X)
plot(X$year, preds)