General Approach

  1. Use a function for the method you are interested in

    1. provide it a data set and a description of the model

    2. many R functions use a formula to describe the response and predictor variables used in the model

model <- function_for_method(y ~ x1 + x2 + x3, data = SomeData)
  2. Evaluate by predicting with the model object that the function produces.
predict(model, newdata = NewData)
  3. Extract specific information from the model (which variables are important? how well does prediction work? plot of the model? etc.)
extract_info_from_model(model)
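
As a concrete sketch of this same pattern (not part of the original notes), here is the workflow with lm() as the fitting function and some made-up data:

SomeData <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
SomeData$y <- 2 * SomeData$x1 - SomeData$x2 + rnorm(20)
model <- lm(y ~ x1 + x2 + x3, data = SomeData)   # 1. fit with a formula + data
predict(model, newdata = SomeData[1:3, ])        # 2. predict on (possibly new) data
coef(model)                                      # 3. extract information from the model
summary(model)$r.squared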

Toy example showing how a partitioning step works: each candidate split is scored by the sum of squared residuals (SSR) around the group means, and the split with the smaller SSR is preferred. Below, splitting on x1 < 6 (SSR ≈ 106.7) does better than splitting on x2 (SSR = 222.75).

Toy <- data.frame(
  y = c(13, 14, 10, 17, 19, 30),
  x1 = c(3, 4, 5, 7, 9, 15),
  x2 = c("A", "A", "B", "A", "A", "B")
)
library(dplyr)   # for %>%, group_by(), and mutate()
Toy %>% group_by(x1 < 6) %>% 
  mutate(y.hat = mean(y),
         resid = y - y.hat,
         r2 = resid^2
         ) %>%
  ungroup() %>%
  mutate(SSR = sum(r2))
## # A tibble: 6 x 8
##       y    x1     x2 `x1 < 6`    y.hat      resid         r2      SSR
##   <dbl> <dbl> <fctr>    <lgl>    <dbl>      <dbl>      <dbl>    <dbl>
## 1    13     3      A     TRUE 12.33333  0.6666667  0.4444444 106.6667
## 2    14     4      A     TRUE 12.33333  1.6666667  2.7777778 106.6667
## 3    10     5      B     TRUE 12.33333 -2.3333333  5.4444444 106.6667
## 4    17     7      A    FALSE 22.00000 -5.0000000 25.0000000 106.6667
## 5    19     9      A    FALSE 22.00000 -3.0000000  9.0000000 106.6667
## 6    30    15      B    FALSE 22.00000  8.0000000 64.0000000 106.6667
Toy %>% group_by(x2) %>% 
  mutate(y.hat = mean(y),
         resid = y - y.hat,
         r2 = resid^2
         ) %>%
  ungroup() %>%
  mutate(SSR = sum(r2))
## # A tibble: 6 x 7
##       y    x1     x2 y.hat  resid       r2    SSR
##   <dbl> <dbl> <fctr> <dbl>  <dbl>    <dbl>  <dbl>
## 1    13     3      A 15.75  -2.75   7.5625 222.75
## 2    14     4      A 15.75  -1.75   3.0625 222.75
## 3    10     5      B 20.00 -10.00 100.0000 222.75
## 4    17     7      A 15.75   1.25   1.5625 222.75
## 5    19     9      A 15.75   3.25  10.5625 222.75
## 6    30    15      B 20.00  10.00 100.0000 222.75

Recursive Partitioning (rpart)

Also called CART (classification and regression trees)

KidsFeet data (from the mosaicData package)

head(KidsFeet)
##     name birthmonth birthyear length width sex biggerfoot domhand
## 1  David          5        88   24.4   8.4   B          L       R
## 2   Lars         10        87   25.4   8.8   B          L       L
## 3   Zach         12        87   24.5   9.7   B          R       R
## 4   Josh          1        88   25.2   9.8   B          L       R
## 5   Lang          2        88   25.1   8.9   B          L       R
## 6 Scotty          3        88   25.7   9.7   B          R       R

A model

library(rpart)        # rpart() fits the tree
library(rpart.plot)   # prp() draws it
mod <- rpart(width ~ sex + length + biggerfoot + domhand, data = KidsFeet)
mod
## n= 39 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
## 1) root 39 9.867692 8.992308  
##   2) length< 25.15 25 4.561600 8.756000  
##     4) sex=G 14 2.597143 8.614286 *
##     5) sex=B 11 1.325455 8.936364 *
##   3) length>=25.15 14 1.417143 9.414286 *
prp(mod)

mod <- rpart(width ~ sex + length + biggerfoot + domhand + name, data = KidsFeet,
             control = rpart.control(minsplit = 3, cp = 0.00001)
)
prp(mod)

Cross-Validation

Train the model on one portion of the data (the training data), then check how well it predicts a different portion (the testing data), and judge the quality of the model by how well it does on the test data. This keeps the model from overfitting the training data.
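
A minimal sketch of this idea (not from the original notes), using a random training/testing split of KidsFeet and the rpart model from above:

set.seed(123)
train_rows <- sample(nrow(KidsFeet), 30)    # about 3/4 of the 39 cases for training
Train <- KidsFeet[ train_rows, ]
Test  <- KidsFeet[-train_rows, ]
fit <- rpart(width ~ sex + length + biggerfoot + domhand, data = Train)
test_preds <- predict(fit, newdata = Test)
mean((Test$width - test_preds)^2)           # mean squared prediction error on the test data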

Random Forest

Idea: Make many randomized trees and combine them.

  1. Each tree is grown on a bootstrap sample of the data, and at each split only a random subset of the explanatory variables is considered (often 1/3 of the variables for regression, or the square root of the number of variables for classification).

  2. Make a CART using just those candidate explanatory variables.

  3. The result is the average (for regression) or majority vote (for classification) of the outputs of the trees in the forest.

require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
mod2 <- randomForest(width ~ sex + length + biggerfoot + domhand, data = KidsFeet)
gmodel(mod2, ~ length + sex + biggerfoot)

effect_size(mod2, ~ length)
##       slope length to:length sex biggerfoot domhand
## 1 0.2163715   24.5  25.81759   B          L       R
importance(mod2)
##            IncNodePurity
## sex            1.0690121
## length         2.6741353
## biggerfoot     0.2885772
## domhand        0.3257205

XGBoost

XGBoost (extreme gradient boosting) fits an ensemble of trees by gradient boosting and is a current state-of-the-art generalization of these tree-based methods. The xgboost package in R will do it.

A nice introduction to the algorithm is here: http://xgboost.readthedocs.io/en/latest/model.html
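
A minimal sketch of fitting a boosted model (not from the original notes); xgboost wants a numeric matrix of predictors, so the factor sex is recoded as 0/1 here and the number of boosting rounds is chosen arbitrarily:

library(xgboost)
X <- cbind(length = KidsFeet$length,
           sexG   = as.numeric(KidsFeet$sex == "G"))   # numeric predictor matrix
fit <- xgboost(data = X, label = KidsFeet$width,
               nrounds = 25, verbose = 0)              # 25 boosting rounds, quiet output
head(predict(fit, X))                                  # fitted widths for the first cases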

Some comments

These methods exist on many platforms

  • R typically provides an interface to the same underlying code (often in C/C++) that you can also use via other mechanisms (like Python).

  • This lets you work in R syntax using algorithms created in other programming languages.

  • The origins of R go back to attempts to provide an interface to FORTRAN so that analysts wouldn’t need to write a new FORTRAN program for each analysis.

Smoothing over differences between methods

Because different methods were coded by different people, the interfaces are not always compatible. A couple packages have tried to smooth over these differences in interface:

  • Zelig
  • caret
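
For example (a sketch that is not in the original notes), caret's train() provides one interface to many underlying methods; only the method argument changes:

library(caret)
fit_tree <- train(width ~ sex + length + biggerfoot + domhand, data = KidsFeet,
                  method = "rpart")   # recursive partitioning via rpart
fit_rf   <- train(width ~ sex + length + biggerfoot + domhand, data = KidsFeet,
                  method = "rf")      # random forest via randomForest
predict(fit_tree, newdata = head(KidsFeet))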

Unsupervised Learning

Attempting to find clusters/groups in data without any “correct answers” to compare with.
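
For instance (a sketch not in the original notes), k-means clustering looks for groups in the foot measurements without using any response variable:

km <- kmeans(scale(KidsFeet[, c("length", "width")]), centers = 2)   # 2 clusters, chosen arbitrarily
km$cluster    # the cluster each child was assigned to
km$centers    # cluster centers on the standardized scale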

Singular Value Decomposition

See the Scottish Parliament example in Data Computing (Kaplan).
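
As a tiny illustration (not the book's example), base R's svd() factors a numeric matrix into singular vectors and singular values:

M <- scale(as.matrix(KidsFeet[, c("length", "width")]))   # a small numeric matrix
s <- svd(M)
s$d    # singular values
# M is recovered (up to rounding) as s$u %*% diag(s$d) %*% t(s$v)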