Business and marketing decisions

A very large sector of the economy has to do with merchandise retailing. The people who manage retailing have to make a variety of decisions: what price to charge, how much to spend on advertising, where to place a product on the shelves, and so on.

In this example, we work with data about the sales and prices of child car seats. The data come from the ISLR package, in a data table named Carseats.

Install the ISLR package on your R system.
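For instance, from the R console. The examples below also use the mosaic, ggformula, rpart, rpart.plot, and randomForest packages, all on CRAN; helper functions such as htest() and gmodel() appear to come from course-provided software.

install.packages(c("ISLR", "mosaic", "ggformula",
                   "rpart", "rpart.plot", "randomForest"))
library(ISLR)   # makes the Carseats data table available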

The data

Companies generally hold sales and marketing data confidential. This leads to an unfortunate situation for students and educators: realistic data are not readily available for teaching and practice. The Carseats data have been generated from a simulation.

Look at the Carseats codebook.
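In R, the codebook is the data table's help page; a quick first look:

help(Carseats)   # the codebook
head(Carseats)   # the first few of the 400 rows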

  1. Familiarize yourself with the variables
  2. Figure out what the unit of observation is.
  3. Using your everyday experience, common sense, judgment, and whatever you know about retail, marketing and economics, frame a sensible hypothesis about what might be the determinants of the number of carseats sold at each location.
  4. Generate a graphic that has a chance of effectively supporting or refuting your hypothesis.

For instance (drawn from the students):

Hypothesis: Sales fall as Price rises, rise with Population, and rise when the competitor’s price (CompPrice) is high.

Graphic:

library(dplyr)     # for %>% and mutate()
library(ggplot2)

Carseats %>% 
  mutate(competition = mosaic::ntiles(CompPrice, 4)) %>%  # quartiles of competitor price
  ggplot(aes(y = Sales, x = Price)) +
  geom_point(aes(color = Population)) + 
  facet_wrap(~ competition)  # one panel per competition quartile

Can you see:

  • a relationship between sales and price?
  • a relationship between population and sales?
  • a relationship between competitor’s price and sales of your product?

Could you convince your (imagined) product manager of the importance or unimportance of these relationships?

Some refinements:

  • change the roles of the variables
  • divide Population into groups to make the graphic easier to interpret; see the sketch below.
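A sketch combining both refinements: Population now defines the facets and CompPrice the color, with mosaic::ntiles(..., format = "interval") producing labeled groups:

Carseats %>%
  mutate(pop_group = mosaic::ntiles(Population, 3, format = "interval")) %>%
  ggplot(aes(y = Sales, x = Price)) +
  geom_point(aes(color = CompPrice)) +
  facet_wrap(~ pop_group)   # one panel per population group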

Quantitative presentations:

Linear Modeling

Linear modeling produces coefficients and confidence intervals. It is perhaps the earliest machine-learning technique, though it’s usually not classified as “machine learning”: if it had been invented in 2010 instead of 1910, it would have earned the label.
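As a point of comparison with the htest() tables below, base R’s lm() fits the same model and confint() gives the confidence intervals directly:

fit <- lm(Sales ~ Price + Population + CompPrice, data = Carseats)
coef(fit)      # the coefficient estimates
confint(fit)   # 95% confidence intervals for each coefficient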

model_1 <- Sales ~ Price + Population + CompPrice
htest(model_1, data = Carseats, test = "coefficients")
##          term     estimate    std.error  statistic      p.value
## 1 (Intercept)  5.632874697 0.9727948176   5.790404 1.431305e-08
## 2       Price -0.088157922 0.0058936327 -14.958164 2.029425e-40
## 3  Population  0.001711276 0.0007714885   2.218149 2.711135e-02
## 4   CompPrice  0.092966453 0.0091402498  10.171106 9.607284e-22
htest(model_1, data = Carseats, test = "anova" )
##         term  df       sumsq     meansq  statistic      p.value
## 1      Price   1  630.030405 630.030405 123.604393 3.561548e-25
## 2 Population   1    6.464456   6.464456   1.268249 2.607775e-01
## 3  CompPrice   1  527.307559 527.307559 103.451405 9.607284e-22
## 4  Residuals 396 2018.472278   5.097152         NA           NA
model_2 <- Sales ~ .   # the dot means: all the other variables as explanatory
htest(model_2, data = Carseats, test = "coefficients")
##               term      estimate    std.error   statistic       p.value
## 1      (Intercept)  5.6606230631 0.6034486581   9.3804551  5.596251e-19
## 2        CompPrice  0.0928153421 0.0041476529  22.3777990  7.935340e-72
## 3           Income  0.0158028363 0.0018451176   8.5646772  2.579912e-16
## 4      Advertising  0.1230950886 0.0111236855  11.0660346  6.353734e-25
## 5       Population  0.0002078771 0.0003704559   0.5611385  5.750270e-01
## 6            Price -0.0953579188 0.0026710774 -35.7001707 1.175168e-124
## 7    ShelveLocGood  4.8501827110 0.1531099670  31.6777725 1.192737e-109
## 8  ShelveLocMedium  1.9567148062 0.1261056428  15.5164730  1.383807e-42
## 9              Age -0.0460451630 0.0031817142 -14.4718098  2.924395e-38
## 10       Education -0.0211018389 0.0197204930  -1.0700462  2.852637e-01
## 11        UrbanYes  0.1228863965 0.1129760904   1.0877204  2.773938e-01
## 12           USYes -0.1840928246 0.1498422926  -1.2285772  2.199750e-01
htest(model_2, data = Carseats, test = "anova")
##           term  df        sumsq       meansq    statistic       p.value
## 1    CompPrice   1   13.0666859   13.0666859   12.5855321  4.363308e-04
## 2       Income   1   79.0733616   79.0733616   76.1616479  7.829441e-17
## 3  Advertising   1  219.3512681  219.3512681  211.2741095  1.605525e-38
## 4   Population   1    0.3824026    0.3824026    0.3683214  5.442756e-01
## 5        Price   1 1198.8668836 1198.8668836 1154.7210803 2.373409e-118
## 6    ShelveLoc   2 1047.4749941  523.7374971  504.4519426 1.177738e-108
## 7          Age   1  217.3879264  217.3879264  209.3830638  2.972702e-38
## 8    Education   1    1.0503266    1.0503266    1.0116505  3.151346e-01
## 9        Urban   1    1.2202272    1.2202272    1.1752948  2.789892e-01
## 10          US   1    1.5671074    1.5671074    1.5094019  2.199750e-01
## 11   Residuals 388  402.8335143    1.0382307           NA            NA

Displaying the model:

fitted_mod <- lm(model_2, data = Carseats)
gmodel(fitted_mod, ~ Price + ShelveLoc + Age + Advertising,
       Age = c(30, 50, 70))

How would you re-arrange the variables to make the effect (or lack of effect) of each variable clear?
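One possibility, assuming gmodel() maps the first explanatory variable in the formula to the x-axis and later ones to color and facets (an assumption about its interface): putting Advertising first would show its weak effect directly.

gmodel(fitted_mod, ~ Advertising + ShelveLoc + Age + Price,
       Age = c(30, 50, 70), Price = c(50, 100, 150))   # hypothetical rearrangement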

Regression tree

A middle-aged machine-learning technique

library(rpart)        # fits the recursive-partitioning tree
library(rpart.plot)   # provides prp() for drawing it

model_3 <- rpart(Sales ~ ., data = Carseats)
prp(model_3)

gmodel(model_3, ~ Price + ShelveLoc + Age + Advertising,
       Age = c(30, 50, 70))

Things to note from the graphical model:

  • Advertising doesn’t matter, except for high levels of advertising with the product in a “good” shelf location.
  • In general, people will look at bad shelf locations if the price is low.
  • Old people won’t reach down to the lower shelf even if the price is low.

Random Forest

A newish machine-learning technique

library(randomForest)

model_4 <- randomForest(Sales ~ ., data = Carseats)
importance(model_4)   # larger IncNodePurity means a more influential variable
##             IncNodePurity
## CompPrice       284.35289
## Income          232.21808
## Advertising     277.93418
## Population      184.03746
## Price           760.56054
## ShelveLoc       754.11964
## Age             330.90807
## Education       126.64105
## Urban            25.38849
## US               39.98243
gmodel(model_4, ~ Price + ShelveLoc + Age + Advertising,
       Age = c(30, 50, 70))
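randomForest also draws these importances with its built-in varImpPlot():

varImpPlot(model_4)   # sorted dot chart of IncNodePurity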

The random forest reveals a more nuanced relationship between price and sales, and between shelf location and sales.

Compared to the linear models and ANOVA, variables such as CompPrice and Income seem to play more of a role.

Deciding on variables

Compare models:

small_mod <- randomForest(Sales ~ Price + Age + ShelveLoc, data = Carseats)
med_mod <- randomForest(Sales ~ Price + Age + ShelveLoc + CompPrice + Advertising, data = Carseats)
big_mod <- randomForest(Sales ~ ., data = Carseats)
CV_results <- cv_pred_error(small_mod, med_mod, big_mod)  # cross-validated prediction error
gf_point(mse ~ model, data = CV_results)  # mean square error for each model

Alas, gmodel() is not yet working on bootstrap ensembles of random forest models.

Quantifying the effect of variables

E <- ensemble(med_mod, nreps = 100)  # 100 bootstrap replications of the fitted model
One <- effect_size(E, ~ CompPrice, ShelveLoc = c("Good", "Medium", "Bad"))
gf_point(slope ~ ShelveLoc, data = One)  # effect of CompPrice at each shelf location

Two <- effect_size(E, ~ ShelveLoc, ShelveLoc = "Good")
Three <- effect_size(E, ~ CompPrice, ShelveLoc = c("Good", "Medium", "Bad"))  # same computation as One
gf_point(slope ~ ShelveLoc, data = Three)
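To attach numbers to the picture, summarize the bootstrap slopes. A sketch, assuming only that effect_size() returns the slope and ShelveLoc columns plotted above:

Three %>%
  group_by(ShelveLoc) %>%
  summarize(mean_slope = mean(slope),
            lower = quantile(slope, 0.025),   # bootstrap 95% interval
            upper = quantile(slope, 0.975))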

Let’s take a look using linear models:

small_mod <- lm(Sales ~ Price + Age + ShelveLoc, data = Carseats)
med_mod <- lm(Sales ~ Price + Age + ShelveLoc + CompPrice + Advertising, data = Carseats)
big_mod <- lm(Sales ~ ., data = Carseats)
CV_results2 <- cv_pred_error(small_mod, med_mod, big_mod)
gf_point(mse ~ model, data = CV_results2)