Use a function for the method you are interested in.
Provide it a data set and a description of the model.
Many R functions use a formula to describe the response and predictor variables used in the model:
model <- function_for_method (y ~ x1 + x2 + x3, data = SomeData)
predict(model, newdata = NewData)
extract_info_from_model(model)
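As a concrete instance of this pattern, here is a sketch using base R's lm(); the data set here is invented purely for illustration:

SomeData <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))
model <- lm(y ~ x1 + x2 + x3, data = SomeData)   # fit a linear model
predict(model, newdata = SomeData[1:3, ])        # predictions for (new) cases
coef(model)                                      # extract the fitted coefficients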
Toy <- data.frame(
y = c(13, 14, 10, 17, 19, 30),
x1 = c(3, 4, 5, 7, 9, 15),
x2 = c("A", "A", "B", "A", "A", "B")
)
A regression tree chooses splits that make the sum of squared residuals (SSR) small. Compare two candidate splits of the toy data by hand. First, split on whether x1 < 6:
Toy %>% group_by(x1 < 6) %>%
mutate(y.hat = mean(y),
resid = y - y.hat,
r2 = resid^2
) %>%
ungroup() %>%
mutate(SSR = sum(r2))
## # A tibble: 6 x 8
## y x1 x2 `x1 < 6` y.hat resid r2 SSR
## <dbl> <dbl> <fctr> <lgl> <dbl> <dbl> <dbl> <dbl>
## 1 13 3 A TRUE 12.33333 0.6666667 0.4444444 106.6667
## 2 14 4 A TRUE 12.33333 1.6666667 2.7777778 106.6667
## 3 10 5 B TRUE 12.33333 -2.3333333 5.4444444 106.6667
## 4 17 7 A FALSE 22.00000 -5.0000000 25.0000000 106.6667
## 5 19 9 A FALSE 22.00000 -3.0000000 9.0000000 106.6667
## 6 30 15 B FALSE 22.00000 8.0000000 64.0000000 106.6667
Now the competing split, grouping on x2:
Toy %>% group_by(x2) %>%
mutate(y.hat = mean(y),
resid = y - y.hat,
r2 = resid^2
) %>%
ungroup() %>%
mutate(SSR = sum(r2))
## # A tibble: 6 x 7
## y x1 x2 y.hat resid r2 SSR
## <dbl> <dbl> <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 13 3 A 15.75 -2.75 7.5625 222.75
## 2 14 4 A 15.75 -1.75 3.0625 222.75
## 3 10 5 B 20.00 -10.00 100.0000 222.75
## 4 17 7 A 15.75 1.25 1.5625 222.75
## 5 19 9 A 15.75 3.25 10.5625 222.75
## 6 30 15 B 20.00 10.00 100.0000 222.75
The x1 < 6 split gives the smaller SSR (106.67 vs 222.75), so a tree algorithm would prefer it. This method is also called CART (Classification And Regression Trees).
library(mosaicData)   # the KidsFeet data set
head(KidsFeet)
## name birthmonth birthyear length width sex biggerfoot domhand
## 1 David 5 88 24.4 8.4 B L R
## 2 Lars 10 87 25.4 8.8 B L L
## 3 Zach 12 87 24.5 9.7 B R R
## 4 Josh 1 88 25.2 9.8 B L R
## 5 Lang 2 88 25.1 8.9 B L R
## 6 Scotty 3 88 25.7 9.7 B R R
library(rpart)   # recursive partitioning (CART)
mod <- rpart(width ~ sex + length + biggerfoot + domhand, data = KidsFeet)
mod
## n= 39
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 39 9.867692 8.992308
## 2) length< 25.15 25 4.561600 8.756000
## 4) sex=G 14 2.597143 8.614286 *
## 5) sex=B 11 1.325455 8.936364 *
## 3) length>=25.15 14 1.417143 9.414286 *
library(rpart.plot)   # prp() plots rpart trees
prp(mod)
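The fitted tree supports the same generic interface as other models; a small usage sketch (not from the original notes):

predict(mod, newdata = head(KidsFeet))   # predicted widths for the first six kids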
# Deliberately overfit: name uniquely identifies each child, and the tiny
# minsplit and cp let the tree keep splitting with almost no penalty.
mod <- rpart(width ~ sex + length + biggerfoot + domhand + name, data = KidsFeet,
             control = rpart.control(minsplit = 3, cp = 0.00001)
)
prp(mod)
Train the model on some data (the training data), check how well it works on different data (the testing data), and assess the quality of the model by how well it does on the test data. This keeps the model from overfitting the data. A sketch of such a split is below.
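A minimal train/test sketch (the random split, seed, and sizes here are illustrative assumptions, not from the notes):

set.seed(123)
train_rows <- sample(nrow(KidsFeet), size = 30)   # 30 kids for training
Train <- KidsFeet[ train_rows, ]
Test  <- KidsFeet[-train_rows, ]                  # remaining 9 for testing
mod_train <- rpart(width ~ sex + length + biggerfoot + domhand, data = Train)
sum((Test$width - predict(mod_train, newdata = Test))^2)   # out-of-sample SSR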
Idea: Make many random trees (a forest).
Grow each tree on a bootstrap sample of the rows of the data.
At each split, consider only a random subset of the explanatory variables (often 1/3 of the variables for regression, or the square root of the number of variables for classification), and make the CART using just those.
The result is the average of the outputs of the trees in the forest (for regression; classification forests take a majority vote).
require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
mod2 <- randomForest(width ~ sex + length + biggerfoot + domhand, data = KidsFeet)
gmodel(mod2, ~ length + sex + biggerfoot)
effect_size(mod2, ~ length)
## slope length to:length sex biggerfoot domhand
## 1 0.2163715 24.5 25.81759 B L R
importance(mod2)
## IncNodePurity
## sex 1.0690121
## length 2.6741353
## biggerfoot 0.2885772
## domhand 0.3257205
XGBoost (eXtreme Gradient Boosting of trees) is the current state-of-the-art generalization of this idea; the xgboost package in R will do it.
A nice introduction to the algorithm is here: http://xgboost.readthedocs.io/en/latest/model.html
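A minimal sketch with xgboost's matrix interface (the settings here are illustrative assumptions; xgboost wants a numeric matrix, so model.matrix() converts the factors first):

library(xgboost)
X <- model.matrix(width ~ sex + length + biggerfoot + domhand, data = KidsFeet)[, -1]
mod3 <- xgboost(data = X, label = KidsFeet$width, nrounds = 50, verbose = 0)
head(predict(mod3, newdata = X))   # fitted widths for the first six kids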
R typically provides an interface to the same underlying code (often in C/C++) that you can also use via other mechanisms (like Python).
This lets you work in R syntax while using algorithms created in other programming languages.
The origins of R go back to attempts to provide an interface to FORTRAN so that analysts wouldn't need to write a new FORTRAN program for each analysis.
Because different methods were coded by different people, the interfaces are not always compatible. A couple of packages have tried to smooth over these differences in interface (see the caret sketch after this list):
Zelig
caret
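For example, caret's train() gives one consistent syntax across many methods; a minimal sketch (assuming caret and randomForest are installed; not from the original notes):

library(caret)
mod4 <- train(width ~ sex + length + biggerfoot + domhand, data = KidsFeet,
              method = "rf")             # swap "rf" for "rpart", "lm", ...
predict(mod4, newdata = head(KidsFeet))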
Clustering attempts to find clusters/groups in data without any “correct answers” to compare with.
See the Scottish Parliament example in Data Computing (Kaplan).
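As a tiny illustration, base R's kmeans() (not the book's Scottish Parliament example; the choice of two clusters is arbitrary):

km <- kmeans(KidsFeet[, c("length", "width")], centers = 2)   # no response variable used
km$cluster   # which cluster each child lands in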