Topic 4 Classifiers

4.1 Classification overview

  • Response variable: categorical, typically with just a few levels (often 2 or 3).
  • Two types of outputs from classification models:
    1. The predicted category given the inputs
    2. Probability of each category given the inputs
      • Type (2) can be fitted with maximum likelihood.
  • Trade-offs:
    • Flexibility vs interpretability
    • Accuracy vs bias
  • Four model architectures
    1. Logistic regression. Especially important for interpretability.
    2. Linear discriminant analysis
    3. Quadratic discriminant analysis
    4. K nearest neighbors

4.2 Day 10 preview

  1. Probability and odds
  2. Multivariate Gaussians (maybe)
  3. Programming activity: Poker hands

4.3 Probability and odds

Probability \(p(event)\) is a number between zero and one.

A simple way to make a probability model for a yes/no variable: encode the outcome as zero or one, then use regression.

Whickham$alive <- as.numeric(with(Whickham, outcome == "Alive"))

Model of mortality in Whickham

res <- mean( alive ~ smoker, data=Whickham)
res
##        No       Yes 
## 0.6857923 0.7611684
res / (1-res)
##       No      Yes 
## 2.182609 3.187050
mod2 <- glm(alive ~ age, data=Whickham, family = "binomial")
f <- makeFun(mod2)
plotFun(f(age) ~ age, age.lim = c(20,100))

plotPoints(jitter(alive) ~ age, data=Whickham, add=TRUE,
           pch=20, alpha=.3)

If we’re going to fit by likelihood, the estimated probability must lie strictly between 0 and 1 — but a straight-line model can produce estimates \(\leq 0\) or \(\geq 1\).
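The problem shows up at the extremes. A minimal sketch with made-up toy data (not Whickham): a straight-line fit to a 0/1 outcome happily predicts values below 0 or above 1, where a probability interpretation breaks down.

```r
# Toy 0/1 data (hypothetical, for illustration only)
age   <- c(20, 30, 40, 50, 60, 70)
alive <- c(1, 1, 1, 1, 0, 0)

fit <- lm(alive ~ age)                      # straight-line fit
predict(fit, newdata = data.frame(age = c(18, 90)))
# roughly 1.28 at age 18 and -0.36 at age 90: neither is a valid probability
```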

4.4 Log Odds

Gerolamo Cardano (1501-1576) defined odds as the ratio of favorable to unfavorable outcomes.

For an event whose probability is \(p\), its odds are \(w = \frac{p}{1-p}\).

A probability is a number between 0 and 1.

An odds is a ratio of two positive numbers, e.g. 5:9 or 9:5, and so can be anywhere in \((0, \infty)\).

“Odds are against it” can be taken to mean that the odds are less than 1: more unfavorable outcomes than favorable ones.

Given odds \(w\), the probability is \(p = \frac{w}{1+w}\). There’s a one-to-one correspondence between probability and odds.

The log odds is a number between \(-\infty\) and \(\infty\).
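The conversions above are easy to check numerically. A quick sketch (the helper names `p_to_odds` and `odds_to_p` are made up here, not standard functions):

```r
# Probability -> odds -> log odds, and back again
p_to_odds <- function(p) p / (1 - p)
odds_to_p <- function(w) w / (1 + w)

p <- 0.75
w <- p_to_odds(p)        # 3: three favorable for each unfavorable
log(w)                   # about 1.10, the log odds
odds_to_p(exp(log(w)))   # 0.75: the round trip recovers p
```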

4.5 Why use odds?

Making Book

Several horses run in a race. People bet a total amount \(H_i\) on horse \(i\).

What should be the winnings when horse \(j\) wins? Payoff means you get your original stake back plus your winnings.

If it’s arranged to pay the horse-\(j\) bettors winnings totaling
\(\sum_{i \neq j} H_i\) (that is, \(\frac{\sum_{i \neq j} H_i}{H_j}\) per unit staked) plus the return of their stakes \(H_j\),
the payout is exactly the whole pool and the net income will be zero for the bookie.
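A worked example with made-up numbers: suppose bets of \(H_A = 50\), \(H_B = 30\), \(H_C = 20\), so the pool is 100. If horse \(A\) wins, its backers are paid winnings of \(H_B + H_C = 50\) plus their stakes \(H_A = 50\) back, i.e. the entire pool, and the bookie nets zero. Per dollar staked on \(A\), the winnings are \(\frac{30 + 20}{50} = 1\), i.e. odds of 1:1 against \(A\).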

Shaving the odds means paying less than the zero-net-income winnings, so the bookie comes out ahead.

Link function

You can build a linear regression to predict the log odds, \(\ln w\). The output of the linear regression is free to range from \(-\infty\) to \(\infty\). Then, to measure likelihood, exponentiate to get the odds \(w\), then \(p = \frac{w}{1+w}\).

4.6 Use of glm()

Response should be 0 or 1. We don’t take the log odds of the response. Instead, the likelihood of a case is

  • \(p\) if the outcome is 1
  • \(1-p\) if the outcome is 0

Multiply these together over all the cases to get the total likelihood.
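A small numeric sketch with made-up outcomes (in practice each case gets its own \(p\) from the model, but a single \(p\) shows the mechanics):

```r
# Likelihood of a candidate probability p for 0/1 outcomes y
y <- c(1, 1, 0, 1, 0)
p <- 0.6
prod(ifelse(y == 1, p, 1 - p))   # 0.6^3 * 0.4^2 = 0.03456
```

In practice one maximizes the log of this product, which turns it into a sum.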

4.7 Interpretation of coefficients

Each coefficient adds to the log odds in the usual linear-regression way. A negative coefficient makes the outcome less likely; a positive one makes it more likely.
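Equivalently, adding \(b\) to the log odds multiplies the odds by \(e^b\). A sketch using the balance coefficient reported in the Default model of section 4.8:

```r
b <- 5.647e-03   # balance coefficient from the model in 4.8, per dollar
exp(100 * b)     # about 1.76: each extra $100 of balance multiplies
                 # the odds of default by roughly 1.76
```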

4.8 Example: Logistic regression of default

names(Default)
## [1] "default" "student" "balance" "income"
ggplot(Default, 
       aes(x = income, y = balance, alpha = default, color = default)) + 
  geom_point() #+ facet_wrap( ~ student)

model_of_default <-
  glm(default == "Yes" ~ balance + income, data = Default, family = "binomial")
f <- makeFun(model_of_default)
plotFun(f(income=income, balance=balance) ~ income + balance,
        income.lim = c(0,70000), balance.lim = c(0, 3000))

summary(model_of_default)
## 
## Call:
## glm(formula = default == "Yes" ~ balance + income, family = "binomial", 
##     data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4725  -0.1444  -0.0574  -0.0211   3.7245  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.154e+01  4.348e-01 -26.545  < 2e-16 ***
## balance      5.647e-03  2.274e-04  24.836  < 2e-16 ***
## income       2.081e-05  4.985e-06   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1579.0  on 9997  degrees of freedom
## AIC: 1585
## 
## Number of Fisher Scoring iterations: 8
logodds <- predict(model_of_default, newdata = list(balance = 1000, income = 40000)) #,
                   # type = "response")
logodds
##         1 
## -5.061006
odds <- exp(logodds)
odds / (1 + odds)
##           1 
## 0.006299244
logistic <- function(x) {exp(x) / (1 + exp(x))}
logistic(-3.36)
## [1] 0.03356922
table(Default$default)
## 
##   No  Yes 
## 9667  333