CVC 2017

A note about these slides

The support for documentation creation in RStudio is great.

  • These slides are HTML, but I created them in RMarkdown (+ a little bit of HTML fiddling)

  • A single RMarkdown file can generate PDF, HTML, or Word

  • No need to know HTML, LateX or Word

  • But if you do, you can take advantage

  • You will learn about RMarkdown later today

Less Volume, More Creativity

Less Volume, More Creativity

Mike McCarthy

Head coach, Green Bay Packers (NFL Football)

  • Packers subscribe to "draft and develop"
  • Among the youngest teams in the league every year
  • Coaching staff constantly teaching young players

Mike McCarthy

Head coach, Green Bay Packers (NFL Football)

  • Packers subscribe to "draft and develop"
  • Among the youngest teams in the league every year
  • Coaching staff constantly teaching young players

Joe from Fitchburg, WI:

Do you have a favorite Mike McCarthy quote?

Mike McCarthy

Head coach, Green Bay Packers (NFL Football)

  • Packers subscribe to "draft and develop"
  • Among the youngest teams in the league every year
  • Coaching staff constantly teaching young players

Joe from Fitchburg, WI:

Do you have a favorite Mike McCarthy quote? Mine is "statistics are for losers".

Mike McCarthy

Head coach, Green Bay Packers (NFL Football)

  • Packers subscribe to "draft and develop"
  • Among the youngest teams in the league every year
  • Coaching staff constantly teaching young players

Joe from Fitchburg, WI:

Do you have a favorite Mike McCarthy quote? Mine is "statistics are for losers".

Vic Ketchman (packers.com):

"Less volume, more creativity."

Source: Ask Vic @ packers.com

More Mike McCarthy Quotes

You’ve got to watch that you don’t do too much. We have a philosophy on our coaching staff about less volume, more creativity.

A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that … In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players."

More Mike McCarthy Quotes

Q. (for McCarthy) How many offensive and defensive plays might you have coming into a game on average?

A. That's an excellent question because years ago when I first got into the NFL we had 150 passes in our game plan. I've put a sign on all of the coordinators' doors - less volume, more creativity. We function with more concepts with less volume. We're more around 50 (passes) into a game plan.

Source: http://www.jsonline.com/packerinsider/106968233.html (Nov 10, 2010)

The Minimal R Exercise

List every R command used throughout a course

Organize by syntactic similarity and by purpose

Scratch everything you could have done without

Replace dissimilar tools with more similar tools

Aim for a set of commands that is

  • small: fewer is better
  • coherent: commands should be as similar as possible
  • powerful: can do what needs doing

Result: Minimal R for Intro Stats

Less Volume, More Creativity

It is not enough to use R, it must be used efficiently and elegantly.

The mosaic package attempts to be part of one solution.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.

— Antoine de Saint-Exupery (writer, poet, pioneering aviator)

Introductory Tutorial

The Most Important Template

 

goal ( yyy ~ xxx , data = mydata )

 

The Most Important Template

 

goal (  y  ~  x  , data = mydata )

The Most Important Template

 

goal (  y  ~  x  , data = mydata , …)

 

Other versions:

# simpler version
goal( ~ x, data = mydata )          
# fancier version
goal( y ~ x | z , data = mydata )   
# unified version
goal( formula , data = mydata )     

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

 

What must R know to do that?

2 Questions

 

goal (  y  ~  x  , data = mydata )

 

What do you want R to do? (goal)

  • This determines the function to use

What must R know to do that?

  • This determines the inputs to the function
  • Must identify the variables and data frame

How do we make this plot?

How do we make this plot?

What is the Goal?

What does R need to know?

How do we make this plot?

What is the Goal?

  • a plot with points

What does R need to know?

  • which variable goes where
  • which data set

How do we tell R to make this plot?

What is the Goal?

  • a plot with points (gf_point())

What does R need to know?

  • which variable goes where (births ~ dayofyear)
  • which data set (data = Births78)
    • use ?Births78 for documentation

How do we make this plot?

 

goal (  y  ~  x  , data = mydata )

 

gf_point( births ~ dayofyear, data = Births78) 

Your turn: How do you make this plot?

Two Questions?

Your turn: How do you make this plot?

  1. Command: gf_boxplot()

  2. The data: HELPrct
  • Variables: age, substance
  • use ?HELPrct for info about data

Your turn: How do you make this plot?

gf_boxplot(age ~ substance, data = HELPrct)

Graphical Summaries: One Variable

gf_histogram( ~ age, data = HELPrct) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note: When there is one variable it is on the right side of the formula.

Graphical Summaries: One Variable

gf_histogram( ~ age, data = HELPrct, binwidth = 5) 

Note: When there is one variable it is on the right side of the formula.

Graphical Summaries: Overview

One Variable

gf_histogram( ~ age, data = HELPrct ) 
  gf_density( ~ age, data = HELPrct ) 
     gf_dens( ~ age, data = HELPrct ) 
 gf_freqpoly( ~ age, data = HELPrct ) 
       gf_qq( ~ age, data = HELPrct ) 
  gf_boxplot( ~ age, data = HELPrct ) 
      gf_bar( ~ sex, data = HELPrct )

Two Variables

  gf_point(  i1 ~ age,       data = HELPrct ) 
gf_boxplot( age ~ substance, data = HELPrct ) 
  • i1 average number of drinks (standard units) consumed per day in the past 30 days (measured at baseline)

The Graphics Template

One variable

plotname ( ~  x  , data = mydata , …)

  • gf_histogram(), gf_qq(), gf_dens(), gf_freqpoly(), gf_bar()

 

Two Variables

plotname (  y  ~  x  , data = mydata , …)

  • gf_point(), gf_line(), gf_boxplot()

Your turn

Create a plot of your own choosing with one of these data sets

names(KidsFeet)    # 4th graders' feet
?KidsFeet
names(Utilities)   # utility bill data
?Utilities
names(Births)      # like Births78, but more years
?Births
require(NHANES)
names(NHANES)      # body shape, etc.
?NHANES

Facets

  • Use y ~ x | z to create multipanel plots.
gf_freqpoly( ~ age | sex, data = HELPrct, binwidth = 5)

Facets

  • Use y ~ x | z to create multipanel plots.
gf_line( births ~ date | wday, data = Births78)

Facets in 2d

  • Use y ~ x | z1 ~ z2 to create a grid of multipanel plots.
gf_freqpoly( ~ age | sex ~ substance, data = HELPrct, binwidth = 5)

Other attributes

So far we have focused on position, but we can also change attributes like color, size, and shape. Two ways to map attributes to variables/exressions:

  • , attribute = ~ expr

  • + attribute:variable (or attribute::(expression)) as part of formula

gf_line( births ~ date,  color = ~ wday, data = Births78)  
gf_line( births ~ date + color:wday,     data = Births78)  

Mix and Match

We can combine attributes and facets.

gf_freqpoly( ~ age | sex, data = HELPrct, color = ~ substance, binwidth = 5)

Bells & Whistles

  • titles
  • axis labels
  • fonts and colors for labels
  • custom scales (user chosen colors, sizes, etc.)
  • etc, etc.

My approach:

  • Let the students ask or
  • Let the data analysis drive

Bells & Whistles

This is ugly, but hints at some of the things that can be modified with bells and whistles.

Note: The ggplot2 system, on which gformula is built, is based on a notation of layers. It is easy to add multiple layers onto a single plot. In this example, we have layered points and lines (using both gf_point() and gf_line()). More on this later…

Numerical Summaries: One Variable

Big idea: Replace plot name with summary name

  • Nothing else changes
gf_histogram( ~ age, data = HELPrct)
        mean( ~ age, data = HELPrct )
## [1] 35.7

Other Summaries

The mosaic package includes formula aware versions of mean(), sd(), var(), min(), max(), sum(), IQR(), …

Also provides favstats() to compute our favorites.

favstats( ~ age, data = HELPrct )
##  min Q1 median Q3 max mean   sd   n missing
##   19 30     35 40  60 35.7 7.71 453       0

Tallying

tally( ~ sex, data = HELPrct)
## sex
## female   male 
##    107    346
tally( ~ substance, data = HELPrct)
## substance
## alcohol cocaine  heroin 
##     177     152     124

Numerical Summaries: Two Variables

Three ways to think about this. All do the same thing.

sd(   age ~ substance, data = HELPrct )
sd( ~ age | substance, data = HELPrct )
sd( ~ age, groups = substance, data = HELPrct )
## alcohol cocaine  heroin 
##    7.65    6.69    7.99

Numerical Summaries: Tables

tally( sex ~ substance, data = HELPrct )
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94
tally( ~ sex + substance, data = HELPrct )
##         substance
## sex      alcohol cocaine heroin
##   female      36      41     30
##   male       141     111     94

Numerical Summaries

mean( age ~ substance | sex, data = HELPrct )
##  A.F  C.F  H.F  A.M  C.M  H.M    F    M 
## 39.2 34.9 34.7 38.0 34.4 33.1 36.3 35.5
mean( age ~ substance | sex, data = HELPrct, .format = "table" )
##   substance sex mean
## 1         A   F 39.2
## 2         A   M   38
## 3         C   F 34.9
## 4         C   M 34.4
## 5         H   F 34.7
## 6         H   M 33.1
  • I've abbreviated the names to make things fit on slide
  • Also works for median(), min(), max(), sd(), var(), favstats(), etc.

One Template to Rule a Lot

  • single and multiple variable graphical summaries
  • single and multiple variable numerical summaries
  • linear models
      mean( age ~ sex, data = HELPrct )
gf_boxplot( age ~ sex, data = HELPrct ) 
        lm( age ~ sex, data = HELPrct )
## female   male 
##   36.3   35.5
## (Intercept)     sexmale 
##      36.252      -0.784

Exercises

Answer each question with both a numerical summary and a graphical summary.

  1. Are boys feet larger than girls feet? (KidsFeet)

  2. Do boys and girls differently shaped feet? (KidsFeet)

Getting Oriented to RStudio

Access to RStudio Sever

Menu

Panes and Tabs

  • 4 panes (but some may be minimized).
  • Each pane can have multiple tabs.
  • Options let you decide which tabs go in which panes.

The Console – ephemeral, interactive commands

R is case sensitive

  • many students are not case sensitive

Arrows and Tab

  • up/down arrows scroll through history
  • TAB completion can simplify typing

If all else fails, try ESC

  • If you see a + prompt, it means R is waiting for more input
  • If this is unintentional, you probably have a typo
  • ESC will get you pack to the command prompt

Caclulation and Assignment

  • Basic calculation is similar to a calculator
  • Results can be stored in named variables with <- (or ->)
  • Good place to get started with R (but we will learn other options shortly)
product <- 5 * 3 * 27
product
## [1] 405
sqrt(100)
## [1] 10
log10(product)
## [1] 2.61

Environment Tab

  • Shows the objects in the working environment
  • Broom icon clears the working environment

RStudio Environment Tab

History Tab

  • Searchable list of previous commands
  • Commands can be copied to console or to files

RStudio History Tab

Packages Tab

  • List all installed packages
  • Install new packages
  • Load/unload packages by clicking check boxes
  • Click on name of package to find documentation, vignettes, and other info

RStudio Packages Tab

Help!

RStudio provides several ways to get help

  • Help tab documents functions and data sets in packages
  • ? followed by name of function or data set
  • Package vignettes
  • TAB completion provides useful hints

RStudio History Tab

Some Other Things in mosaic

Some other things

The mosaic package includes some other things, too

  • Data sets (you've already seen some of them)
  • xtras: xchisq.test(), xpnorm(), xqqmath()
  • mplot()
  • mplot(HELPrct) interactive plot creation
  • replacements for plot() in some situations
  • simplified histogram() controls (e.g., width)
  • simplified ways to add onto lattice plots

xpnorm()

xpnorm( 700, mean = 500, sd = 100)
## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 700) = P(Z <= 2) = 0.9772
##  P(X >  700) = P(Z >  2) = 0.02275

## [1] 0.977

xpnorm()

xpnorm( c(300, 700), mean = 500, sd = 100)
## 
## If X ~ N(500, 100), then 
## 
##  P(X <= 300) = P(Z <= -2) = 0.02275
##      P(X <= 700) = P(Z <=  2) = 0.97725
##  P(X >  300) = P(Z >  -2) = 0.97725
##      P(X >  700) = P(Z >   2) = 0.02275

## [1] 0.0228 0.9772

xchisq.test()

xchisq.test(phs)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  x
## X-squared = 20, df = 1, p-value = 8e-07
## 
##    104.00   10933.00 
## (  146.52) (10890.48)
## [12.05]  [ 0.16] 
## <-3.51>  < 0.41> 
##    
##    189.00   10845.00 
## (  146.48) (10887.52)
## [12.05]  [ 0.16] 
## < 3.51>  <-0.41> 
##    
## key:
##  observed
##  (expected)
##  [contribution to X-squared]
##  <Pearson residual>

Modeling

Modeling is really the starting point for the mosaic design.

  • linear models (lm() and glm()) defined the template
  • lattice and ggformula graphics use the template (so we chose one of these for plotting)
  • we added functionality so numerical summaries can be done with the same template
  • additional things added to make modeling easier for beginners.

Models as Functions

model <- lm(width ~ length * sex, data = KidsFeet)
Width <- makeFun(model)
Width( length = 25, sex = "B")
##    1 
## 9.17
Width( length = 25, sex = "G")
##    1 
## 8.94

Models as Functions – Plotting

gf_point( width ~ length + color::sex, data = KidsFeet) %>%
  gf_fun(Width(length, sex = "B") ~ length, color = "navy") %>%
  gf_fun(Width(length, sex = "G") ~ length, color = "red") %>%
  gf_refine(scale_color_manual(values = c("navy", "red")))

Resampling – You can do() it!

Lady Tasting Tea

  • Often used on first day of class

  • Story

  • woman claims she can tell whether milk has been poured into tea or vice versa.

  • Question: How do we test this claim?

Computer Simulation

Use rflip() to simulate flipping coins

rflip()
## 
## Flipping 1 coin [ Prob(Heads) = 0.5 ] ...
## 
## H
## 
## Number of Heads: 1 [Proportion Heads: 1]

Computer Simulation

Faster if we flip multiple coins at once:

rflip(10)
## 
## Flipping 10 coins [ Prob(Heads) = 0.5 ] ...
## 
## H H H T T T H H H T
## 
## Number of Heads: 6 [Proportion Heads: 0.6]
  • easier to consider heads = correct; tails = incorrect than to compare with a given pattern
  • this switch bothers me more than it bothers my students

Let's do that a lot of times

rflip(10) simulates 1 lady tasting 10 cups 1 time.

We can do that many times to see how guessing ladies do:

do(2) * rflip(10)
##    n heads tails prop
## 1 10     6     4  0.6
## 2 10     5     5  0.5
  • do() is clever about what it remembers
  • 2 isn't many – we'll do many next.

The distribution of guessing ladies

Ladies <- do(5000) * rflip(10)
head(Ladies, 1)
##    n heads tails prop
## 1 10     4     6  0.4
gf_histogram( ~ heads, data = Ladies, binwidth = 1 )

How often does guessing score 9 or 10?

tally( ~ (heads >= 9) , data = Ladies)
## (heads >= 9)
##  TRUE FALSE 
##    52  4948

How often does guessing score 9 or 10?

tally( ~ (heads >= 9) , data = Ladies, format = "prop")
## (heads >= 9)
##   TRUE  FALSE 
## 0.0104 0.9896

How often does guessing score 9 or 10?

tally( ~ (heads >= 9) , data = Ladies, format = "prop")
## (heads >= 9)
##   TRUE  FALSE 
## 0.0104 0.9896
 prop( ~ (heads >= 9), data = Ladies)
##   TRUE 
## 0.0104

A general approach to randomization

  1. Do it for your data
  2. Do it for "random" data
  3. Do it lots of times for "random" data
  • definition of "random" is important, but can often be handled by shuffle() or resample()

Example: Do mean ages differ by sex?

diffmean(age ~ sex, data = HELPrct)
## diffmean 
##   -0.784
do(1) * 
  diffmean(age ~ shuffle(sex), data = HELPrct)
##   diffmean
## 1    0.109
Null <- do(5000) * 
  diffmean(age ~ shuffle(sex), data = HELPrct)

Example: Do mean ages differ by sex?

prop( ~ (abs(diffmean) > 0.7841), data = Null )
##  TRUE 
## 0.362
gf_histogram( ~ diffmean, data = Null, bins = 50) %>%
  gf_vline(xintercept = -.7841) 

Example: Bootstrap for difference in means

Bootstrap <- do(5000) * 
  diffmean(age ~ sex, data= resample(HELPrct))

gf_histogram( ~ diffmean, data = Bootstrap, bins = 50) %>%
  gf_vline(xintercept = -.7841, color = "red")

Example: CI for difference in mean ages

cdata( ~ diffmean, data = Bootstrap, p = .95)
##       low        hi central.p 
##    -2.479     0.889     0.950
confint(Bootstrap, method = "quantile")
##       name lower upper level     method estimate
## 1 diffmean -2.48 0.889  0.95 percentile   -0.784

Example: CI for difference in mean ages

confint(Bootstrap)  # default uses st. err
##       name lower upper level     method estimate
## 1 diffmean -2.48 0.889  0.95 percentile   -0.784

Randomization and linear models

do(1) * lm(width ~ length, data = KidsFeet)
##   Intercept length sigma r.squared    F numdf dendf .row .index
## 1      2.86  0.248 0.396     0.411 25.8     1    37    1      1
do(3) * lm( width ~ shuffle(length), data = KidsFeet)
##   Intercept  length sigma r.squared     F numdf dendf .row .index
## 1     11.64 -0.1071 0.496   0.07664 3.071     1    37    1      1
## 2     11.54 -0.1031 0.498   0.07110 2.832     1    37    1      2
## 3      9.75 -0.0308 0.515   0.00635 0.236     1    37    1      3

Randomization and linear models

do(1) * 
  lm(width ~ length + sex, data = KidsFeet)
##   Intercept length   sexG sigma r.squared    F numdf dendf .row .index
## 1      3.64  0.221 -0.233 0.385      0.46 15.3     2    36    1      1
do(3) * 
  lm( width ~ length + shuffle(sex), data = KidsFeet)
##   Intercept length    sexG sigma r.squared    F numdf dendf .row .index
## 1      2.94  0.243  0.0779 0.400     0.417 12.9     2    36    1      1
## 2      3.45  0.228 -0.2083 0.388     0.451 14.8     2    36    1      2
## 3      2.84  0.248  0.0157 0.402     0.411 12.6     2    36    1      3

Randomization and linear models

Null <- do(5000) * 
  lm( width ~ length + shuffle(sex), data = KidsFeet)
gf_histogram( ~ sexG, data = Null) %>%
  gf_vline(xintercept = -0.2325, color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Randomization and linear models

gf_histogram( ~ sexG, data = Null) %>%
  gf_vline(xintercept = -0.2325, color = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

prop(~ (sexG <= -0.2325), data = Null)
##  TRUE 
## 0.037