Randy Pruim
2014 CVC Workshop
The new support for documentation creation in RStudio is great.
These slides are HTML, but I created them in RMarkdown (+ a little bit of HTML fiddling)
Other slides will be PDF created using beamer – but again, I wrote them in RMarkdown
A single RMarkdown file can generate PDF, HTML, or Word
A lot of times you end up putting in a lot more volume, because you are teaching fundamentals and you are teaching concepts that you need to put in, but you may not necessarily use because they are building blocks for other concepts and variations that will come off of that … In the offseason you have a chance to take a step back and tailor it more specifically towards your team and towards your players.“
Mike McCarthy, Head Coach, Green Bay Packers |
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.
— Antoine de Saint-Exupery (writer, poet, pioneering aviator) |
One key to successfully introducing R is finding a set of commands that is
It is not enough to use R, it must be used elegantly.
Two examples of this principle:
Goal: a minimal set of R commands for Intro Stats
Result: Minimal R Vignette
Much of the work on the mosaic
package has been motivated
by
# simpler version
goal( ~ x, data = mydata )
# fancier version
goal( y ~ x | z , data = mydata )
# unified version
goal( formula , data = mydata )
xyplot()
)births ~ dayofyear
)data=Births78
)
xyplot( births ~ dayofyear, data=Births78)
The data: HELPrct
Variables: age
, substance
Command: bwplot()
bwplot( age ~ substance, data=HELPrct)
bwplot( substance ~ age, data=HELPrct )
histogram( ~ age, data=HELPrct)
Note: When there is one variable it is on the right side of the formula.
histogram( ~age, data=HELPrct )
densityplot( ~age, data=HELPrct )
bwplot( ~age, data=HELPrct )
qqmath( ~age, data=HELPrct )
freqpolygon( ~age, data=HELPrct )
bargraph( ~sex, data=HELPrct )
xyplot( i1 ~ age, data=HELPrct )
bwplot( age ~ substance, data=HELPrct )
bwplot( substance ~ age, data=HELPrct )
histogram()
, qqmath()
, densityplot()
, freqpolygon()
, bargraph()
xyplot()
, bwplot()
Create a plot of your own choosing with one of these data sets
names(KidsFeet) # 4th graders' feet
?KidsFeet
names(Utilities) # utility bill data
?Utilities
names(NHANES) # body shape, etc.
?NHANES
groups =
group to overlay.y ~ x | z
to create multipanel plots.densityplot( ~ age | sex, data=HELPrct,
groups=substance,
auto.key=TRUE)
My approach:
xyplot( births ~ dayofyear, data=Births78,
groups=dayofyear %% 7, type='l',
auto.key=list(columns=4, lines=T, points=F),
par.settings=list(
superpose.line=list( lty=1 ) ))
July 1978
Su Mo Tu We Th Fr Sa
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31
Big idea: Replace plot name with summary name
histogram( ~ age, data=HELPrct )
mean( ~ age, data=HELPrct )
[1] 35.65
The mosaic package includes formula aware versions of
mean()
,
sd()
,
var()
,
min()
,
max()
,
sum()
,
IQR()
, …
Also provides favstats()
to compute our favorites.
favstats( ~ age, data=HELPrct )
min Q1 median Q3 max mean sd n missing
19 30 35 40 60 35.65 7.71 453 0
tally( ~ sex, data=HELPrct)
female male
107 346
tally( ~ substance, data=HELPrct)
alcohol cocaine heroin
177 152 124
Three ways to think about this. All do the same thing.
sd( age ~ substance, data=HELPrct )
sd( ~ age | substance, data=HELPrct )
sd( ~ age, groups=substance, data=HELPrct )
alcohol cocaine heroin
7.652 6.693 7.986
tally( sex ~ substance, data=HELPrct )
substance
sex alcohol cocaine heroin
female 0.2034 0.2697 0.2419
male 0.7966 0.7303 0.7581
tally( ~ sex + substance, data=HELPrct )
substance
sex alcohol cocaine heroin
female 36 41 30
male 141 111 94
mean( age ~ substance | sex, data=HELPrct )
A.F C.F H.F A.M C.M H.M F M
39.17 34.85 34.67 37.95 34.36 33.05 36.25 35.47
median()
, min()
, max()
, sd()
, var()
, favstats()
, etc. mean( age ~ sex, data=HELPrct )
bwplot( age ~ sex, data=HELPrct )
lm( age ~ sex, data=HELPrct )
female male
36.25 35.47
(Intercept) sexmale
36.2523 -0.7841
The mosaic
package includes some other things, too
xchisq.test()
, xpnorm()
, xqqmath()
mPlot()
– interactive plot designhistogram()
controls (e.g., width
)xpnorm( 700, mean=500, sd=100)
If X ~ N(500,100), then
P(X <= 700) = P(Z <= 2) = 0.9772
P(X > 700) = P(Z > 2) = 0.0228
[1] 0.9772
xpnorm( c(300, 700), mean=500, sd=100)
If X ~ N(500,100), then
P(X <= 300) = P(Z <= -2) = 0.0228
P(X <= 700) = P(Z <= 2) = 0.9772
P(X > 300) = P(Z > -2) = 0.9772
P(X > 700) = P(Z > 2) = 0.0228
[1] 0.02275 0.97725
xchisq.test(phs)
Pearson's Chi-squared test with Yates' continuity correction
data: phs
X-squared = 24.43, df = 1, p-value = 7.71e-07
104.00 10933.00
( 146.52) (10890.48)
[12.34] [ 0.17]
<-3.51> < 0.41>
189.00 10845.00
( 146.48) (10887.52)
[12.34] [ 0.17]
< 3.51> <-0.41>
key:
observed
(expected)
[contribution to X-squared]
<residual>
Modeling is really the starting point for the mosaic
design.
lm()
and glm()
) defined the templatemodel <- lm(width ~ length * sex,
data=KidsFeet)
Width <- makeFun(model)
Width( length=25, sex="B")
1
9.168
Width( length=25, sex="G")
1
8.939
xyplot( width ~ length, data=KidsFeet,
groups=sex, auto.key=TRUE )
plotFun( Width(length, sex="B") ~ length,
col=1, add=TRUE)
plotFun( Width(length, sex="G") ~ length,
col=2, add=TRUE)
Often used on first day of class
Story
Use rflip()
to simulate flipping coins
rflip()
[1] 1
attr(,"n")
[1] 1
attr(,"prob")
[1] 0.5
attr(,"sequence")
[1] "H"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"
Faster if we flip multiple coins at once:
rflip(10)
[1] 6
attr(,"n")
[1] 10
attr(,"prob")
[1] 0.5
attr(,"sequence")
[1] "H" "H" "H" "T" "T" "T" "H" "H" "H" "T"
attr(,"verbose")
[1] TRUE
attr(,"class")
[1] "cointoss"
heads
= correct; tails
= incorrect than to compare with a given pattern
rflip(10)
simulates 1 lady tasting 10 cups 1 time.
We can do that many times to see how guessing ladies do:
do(2) * rflip(10)
n heads tails prop
1 10 7 3 0.7
2 10 6 4 0.6
do()
is clever about what it remembersLadies <- do(5000) * rflip(10)
head(Ladies, 1)
n heads tails prop
1 10 7 3 0.7
histogram( ~ heads, data=Ladies, width=1 )
tally( ~(heads >= 9) , data=Ladies)
TRUE FALSE
46 4954
tally( ~(heads >= 9) , data=Ladies,
format="prop")
TRUE FALSE
0.0092 0.9908
tally( ~(heads >= 9) , data=Ladies,
format="prop")
TRUE FALSE
0.0092 0.9908
prop( ~(heads >= 9), data=Ladies)
TRUE
0.0092
shuffle()
or resample()
diffmean(age ~ sex, data=HELPrct)
diffmean
-0.7841
do(1) *
diffmean(age ~ shuffle(sex), data=HELPrct)
diffmean
1 1.357
Null <- do(5000) *
diffmean(age ~ shuffle(sex), data=HELPrct)
prop( ~(abs(diffmean) > 0.7841), data=Null )
TRUE
0.359
histogram(~ diffmean, data=Null, v=-.7841)
Bootstrap <- do(5000) *
diffmean(age~sex, data= resample(HELPrct))
histogram( ~diffmean, data=Bootstrap,
v=-.7841, glwd=4 )
cdata(~diffmean, data=Bootstrap, p=.95)
low hi central.p
-2.4347 0.8129 0.9500
confint(Bootstrap, method="quantile")
name lower upper level method
1 diffmean -2.435 0.8129 0.95 quantile
confint(Bootstrap) # default uses st. err
name lower upper level method estimate margin.of.error
1 diffmean -2.439 0.842 0.95 stderr -0.7985 1.64
do(1) * lm(width ~ length, data=KidsFeet)
Intercept length sigma r.squared
1 2.862 0.2479 0.3963 0.411
do(3) * lm( width ~ shuffle(length), data=KidsFeet)
Intercept length sigma r.squared
1 10.339 -0.05447 0.5113 0.019832
2 7.434 0.06301 0.5095 0.026545
3 8.581 0.01663 0.5159 0.001848
do(1) *
lm(width ~ length + sex, data=KidsFeet)
Intercept length sexG sigma r.squared
1 3.641 0.221 -0.2325 0.3849 0.4595
do(3) *
lm( width ~ length + shuffle(sex),
data=KidsFeet)
Intercept length sexG sigma r.squared
1 2.907 0.2467 -0.02617 0.4016 0.4117
2 2.837 0.2541 -0.26178 0.3781 0.4784
3 2.778 0.2503 0.05202 0.4009 0.4136
Null <- do(5000) *
lm( width ~ length + shuffle(sex),
data=KidsFeet)
histogram( ~ sexG, data=Null,
v=-0.2325, glwd=4)
histogram(~sexG, data=Null,
v=-0.2325, glwd=4)
prop(~ (sexG <= -0.2325), data=Null)
TRUE
0.0334