A very large sector of the economy has to do with merchandise retailing. The people who manage retailing have to make a variety of decisions, for instance, how to price goods, where to place them in the store, and how much to spend on advertising.
In this example, we work with some data about the sales and prices of child car seats. The data come from the ISLR package: a data table named Carseats.
Install the ISLR package on your R system.
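If the package isn't already on your system, installation is the usual one-time command at the R console:

install.packages("ISLR")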
Companies generally hold sales and marketing data confidential. This leads to an unfortunate situation for students and educators: realistic data are not readily available for teaching and practice. The Carseats data have been generated from a simulation.
Look at the Carseats codebook.
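One way to do that from within R is through the help system; str() gives a quick look at the variables themselves:

library(ISLR)
help(Carseats)   # the codebook: a description of each variable
str(Carseats)    # the variables and their types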
For instance (an example drawn from students' suggestions):
library(ISLR)      # the Carseats data
library(dplyr)
library(ggplot2)
# Quartiles of competitors' price define four facets
Carseats %>%
  mutate(competition = mosaic::ntiles(CompPrice, 4)) %>%
  ggplot(aes(y = Sales, x = Price)) +
  geom_point(aes(color = Population)) +
  facet_wrap(~ competition)
Can you see a relationship between Sales and Price? Does it change with the level of competition? Does Population seem to matter?
Could you convince your (imagined) product manager of the importance or unimportance of these relationships?
Some refinement:
Linear modeling to produce coefficients and confidence intervals. This is perhaps the earliest machine-learning technique. (But it's usually not classified as "machine learning"; if it had been invented in 2010 instead of 1910, it would have earned the label.)
model_1 <- Sales ~ Price + Population + CompPrice
# hypothesis tests on each coefficient
htest(model_1, data = Carseats, test = "coefficients")
## term estimate std.error statistic p.value
## 1 (Intercept) 5.632874697 0.9727948176 5.790404 1.431305e-08
## 2 Price -0.088157922 0.0058936327 -14.958164 2.029425e-40
## 3 Population 0.001711276 0.0007714885 2.218149 2.711135e-02
## 4 CompPrice 0.092966453 0.0091402498 10.171106 9.607284e-22
htest(model_1, data = Carseats, test = "anova")
## term df sumsq meansq statistic p.value
## 1 Price 1 630.030405 630.030405 123.604393 3.561548e-25
## 2 Population 1 6.464456 6.464456 1.268249 2.607775e-01
## 3 CompPrice 1 527.307559 527.307559 103.451405 9.607284e-22
## 4 Residuals 396 2018.472278 5.097152 NA NA
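Since the point of linear modeling here is coefficients and confidence intervals, note that base R will produce the intervals directly from a fitted model. A minimal sketch (fit_1 is just a scratch name for this purpose):

# Fit model_1 with lm() and ask for 95% confidence intervals on the coefficients
fit_1 <- lm(Sales ~ Price + Population + CompPrice, data = Carseats)
confint(fit_1, level = 0.95)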
# The . on the right-hand side means "all the other variables in the data"
model_2 <- Sales ~ .
htest(model_2, data = Carseats, test = "coefficients")
## term estimate std.error statistic p.value
## 1 (Intercept) 5.6606230631 0.6034486581 9.3804551 5.596251e-19
## 2 CompPrice 0.0928153421 0.0041476529 22.3777990 7.935340e-72
## 3 Income 0.0158028363 0.0018451176 8.5646772 2.579912e-16
## 4 Advertising 0.1230950886 0.0111236855 11.0660346 6.353734e-25
## 5 Population 0.0002078771 0.0003704559 0.5611385 5.750270e-01
## 6 Price -0.0953579188 0.0026710774 -35.7001707 1.175168e-124
## 7 ShelveLocGood 4.8501827110 0.1531099670 31.6777725 1.192737e-109
## 8 ShelveLocMedium 1.9567148062 0.1261056428 15.5164730 1.383807e-42
## 9 Age -0.0460451630 0.0031817142 -14.4718098 2.924395e-38
## 10 Education -0.0211018389 0.0197204930 -1.0700462 2.852637e-01
## 11 UrbanYes 0.1228863965 0.1129760904 1.0877204 2.773938e-01
## 12 USYes -0.1840928246 0.1498422926 -1.2285772 2.199750e-01
htest(model_2, data = Carseats, test = "anova")
## term df sumsq meansq statistic p.value
## 1 CompPrice 1 13.0666859 13.0666859 12.5855321 4.363308e-04
## 2 Income 1 79.0733616 79.0733616 76.1616479 7.829441e-17
## 3 Advertising 1 219.3512681 219.3512681 211.2741095 1.605525e-38
## 4 Population 1 0.3824026 0.3824026 0.3683214 5.442756e-01
## 5 Price 1 1198.8668836 1198.8668836 1154.7210803 2.373409e-118
## 6 ShelveLoc 2 1047.4749941 523.7374971 504.4519426 1.177738e-108
## 7 Age 1 217.3879264 217.3879264 209.3830638 2.972702e-38
## 8 Education 1 1.0503266 1.0503266 1.0116505 3.151346e-01
## 9 Urban 1 1.2202272 1.2202272 1.1752948 2.789892e-01
## 10 US 1 1.5671074 1.5671074 1.5094019 2.199750e-01
## 11 Residuals 388 402.8335143 1.0382307 NA NA
Displaying the model:
fitted_mod <- lm(model_2, data = Carseats)
# plot the model output against four explanatory variables, at three chosen ages
gmodel(fitted_mod, ~ Price + ShelveLoc + Age + Advertising,
       Age = c(30, 50, 70))
How would you re-arrange the variables to make the effect (or lack of effect) of each variable clear?
A middle-aged machine-learning technique: recursive partitioning
library(rpart)        # recursive-partitioning trees
library(rpart.plot)   # prp() draws the tree
model_3 <- rpart(Sales ~ ., data = Carseats)
prp(model_3)
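If the drawn tree is hard to read, rpart's own print methods give the same information in text form:

print(model_3)    # the splits, node by node
printcp(model_3)  # the complexity-parameter table used for pruning

And the same kind of graphical display of the model as before: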
gmodel(model_3, ~ Price + ShelveLoc + Age + Advertising,
Age = c(30, 50, 70))
Things to note from the graphical model: the tree's output is piecewise constant, changing in steps at the split points rather than smoothly.
A newish machine-learning technique: random forests
library(randomForest)
model_4 <- randomForest(Sales ~ ., data = Carseats)
importance(model_4)   # variable-importance scores
## IncNodePurity
## CompPrice 284.35289
## Income 232.21808
## Advertising 277.93418
## Population 184.03746
## Price 760.56054
## ShelveLoc 754.11964
## Age 330.90807
## Education 126.64105
## Urban 25.38849
## US 39.98243
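The randomForest package can also draw these importance scores directly, which makes the dominance of Price and ShelveLoc easy to see:

varImpPlot(model_4)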
gmodel(model_4, ~ Price + ShelveLoc + Age + Advertising,
Age = c(30, 50, 70))
Note the more nuanced relationship between price and sales, and between shelf location and sales. Compared to the linear models and ANOVA, variables such as CompPrice and Income seem to be playing more of a role.
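One way to inspect that nuance is a partial-dependence plot, which the randomForest package provides; for instance, for CompPrice:

# partial dependence of the predicted Sales on CompPrice
partialPlot(model_4, Carseats, CompPrice)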
Compare models of different sizes:
small_mod <- randomForest(Sales ~ Price + Age + ShelveLoc, data = Carseats)
med_mod   <- randomForest(Sales ~ Price + Age + ShelveLoc + CompPrice + Advertising,
                          data = Carseats)
big_mod   <- randomForest(Sales ~ ., data = Carseats)
# cross-validated prediction error for each of the three models
CV_results <- cv_pred_error(small_mod, med_mod, big_mod)
gf_point(mse ~ model, data = CV_results)
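The plot shows the trial-to-trial spread; for a single number per model, average over the trials. (This assumes, as the plot formula suggests, that cv_pred_error() returns columns named mse and model.)

CV_results %>%
  group_by(model) %>%
  summarize(mean_mse = mean(mse))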
Alas, gmodel() is not yet working on bootstrap ensembles of random forest models.
# an ensemble of 100 bootstrap refits of the medium model
E <- ensemble(med_mod, nreps = 100)
# effect of CompPrice on Sales, at each shelf location
One <- effect_size(E, ~ CompPrice, ShelveLoc = c("Good", "Medium", "Bad"))
gf_point(slope ~ ShelveLoc, data = One)
# effect of shelf location itself
Two <- effect_size(E, ~ ShelveLoc, ShelveLoc = "Good")
# the same calculation as One, repeated
Three <- effect_size(E, ~ CompPrice, ShelveLoc = c("Good", "Medium", "Bad"))
gf_point(slope ~ ShelveLoc, data = Three)
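The scatter of slopes across the ensemble amounts to an interval estimate. A rough 95% interval per shelf location can be read off with base R (assuming the slope and ShelveLoc columns plotted above):

# 2.5% and 97.5% quantiles of the bootstrapped slopes
aggregate(slope ~ ShelveLoc, data = Three,
          FUN = quantile, probs = c(0.025, 0.975))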
Let’s take a look using linear models:
small_mod <- lm(Sales ~ Price + Age + ShelveLoc, data = Carseats)
med_mod <- lm(Sales ~ Price + Age + ShelveLoc + CompPrice + Advertising, data = Carseats)
big_mod <- lm(Sales ~ ., data = Carseats)
CV_results2 <- cv_pred_error(small_mod, med_mod, big_mod)
gf_point(mse ~ model, data = CV_results2)
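To put the two techniques side by side, the two sets of trial results can be stacked; bind_rows() is from dplyr (loaded earlier), and the technique labels here are mine:

All_results <- bind_rows(forest = CV_results, linear = CV_results2,
                         .id = "technique")
# cross-validated MSE for each model, colored by modeling technique
ggplot(All_results, aes(x = model, y = mse, color = technique)) +
  geom_point()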