The data frame `math300::Gilbert` records 1641 shifts at a VA hospital. (For the story behind the data, give the command `?Gilbert`.) Whether or not a patient died during a shift is indicated by the variable `death`. The variable `gilbert` marks whether or not nurse Kristen Gilbert was on duty during the shift.
A. Are the variables numerical or categorical?
B. The data were used as evidence during Gilbert’s trial for murder. The data can be reduced to a compact form by data wrangling:
Gilbert %>%
  group_by(death, gilbert) %>%
  tally()
death | gilbert | n |
---|---|---|
no | not | 1027 |
no | on_duty | 157 |
yes | not | 357 |
yes | on_duty | 100 |
Calculate the proportion of times a death occurred when Gilbert was on duty, and the proportion of times a death occurred when Gilbert was not on duty. Does the shift evidence support the claim that Gilbert was a contributing factor to the deaths?
C. In the Lessons, we will mainly use regression modeling rather than wrangling as a data-reduction technique. In the trial, the relevant question was whether Gilbert was a factor in the deaths. Express this question as a tilde expression involving the variables `death` and `gilbert`.
D. The response variable in your tilde expression from (C) is categorical. Regression modeling requires a numerical response variable. Use `mutate()` and `zero_one()` to transform the variable `death` from categorical form to a numerical form that has the value 1 when a death occurred and 0 otherwise. This will become the response variable for the regression: `death ~ gilbert`.
E. Carry out the regression using `lm()`. Look at the coefficients from the fitted model with `conf_interval()`; these are in the `.coef` column. Explain how the coefficients correspond to the proportions you calculated in (B).
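A sketch of the commands involved in (D) and (E). (The `one =` argument to `zero_one()` is an assumption about how the 1-level is specified; check the help page for the exact form.)

```r
# Sketch only; the `one =` argument name for zero_one() is an assumption.
Gilbert_numeric <- Gilbert %>%
  mutate(death = zero_one(death, one = "yes"))   # 1 when a death occurred, 0 otherwise

lm(death ~ gilbert, data = Gilbert_numeric) %>%
  conf_interval()    # the coefficients appear in the .coef column
```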
Write out the R commands to make these graphics, based on the `Whickham` data frame.
Graph A
Graph B
Graph C
It is undeniably odd that the units of the variance are the square of the units of the variable for which the variance is calculated. For example, if the variable records height in units of meters, the variance of height will have units of square meters. Sometimes, the units of the variance have no physical meaning to most people. A case in point: if the variable records temperature in degrees C, the variance will have units of degrees C squared.
The odd units of the variance make it hard to estimate from a graph. But there is a trick that makes it straightforward.
1. Graph the values of the variable with a jittered point plot. The horizontal axis has no content because variance is about a single variable, not a pair of variables. The horizontal space is needed only to provide room for the jittered points to spread out.
2. Mark the graph with two horizontal lines. The space between the lines should include the central two-thirds of the points in the graph.
3. Find the intercept of each of the two lines with the vertical axis. We’ll call the two intercepts “top” and “bottom.”
4. Find the numerical difference between the top and the bottom. Divide this difference by 2 to get a “by-eye” estimate of the standard deviation. That is, the standard deviation is the length of a vertical line drawn from the mid-point of “top” and “bottom” to the top.
5. Square the standard deviation to get the value of the variance.
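For reference, a graph in this style might be drawn with ggplot2 along the following lines. (The data frame `Dat` and variable `y` are placeholders; for bell-shaped data, mean ± one standard deviation encloses roughly the central two-thirds of the points.)

```r
library(ggplot2)
# `Dat` is a placeholder data frame with a numerical variable `y`.
ggplot(Dat, aes(x = 1, y = y)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.3) +
  # Two horizontal lines enclosing about the central two-thirds of the points
  geom_hline(yintercept = mean(Dat$y) + sd(Dat$y), color = "blue") +
  geom_hline(yintercept = mean(Dat$y) - sd(Dat$y), color = "blue") +
  scale_x_continuous(breaks = NULL) + xlab("")
```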
For each of the following graphics (in the style described in (1)), estimate the standard deviation and from that the variance of the data in the plot.
Graph | Standard Dev. | Variance | units of sd | units of var |
---|---|---|---|---|
A | ||||
B | ||||
C | ||||
D |
Consider this violin plot.
For each group, judge by eye what fraction of the data points have a value of 2 or below.
Which of the three groups has multiple peaks in its density?
Which group has the lowest median?
Here is the variance of three variables from the `Galton` data frame.
Galton %>%
  summarize(vh = var(height), vm = var(mother), vf = var(father))
vh | vm | vf |
---|---|---|
12.8373 | 5.322365 | 6.102164 |
All three variables—`height`, `mother`, `father`—give the height of a person. The `height` variable is special only because the people involved are the (adult-aged) children of the respective mother and father.
Height is measured in inches in `Galton`. What are the units of the variances?
Can you tell from the variances which group is the tallest: fathers, mothers, or adult children?
Mothers and fathers have about the same variance. Yet the heights of the children have a variance that is more than twice as big. Let’s see why this is.
As you might expect, the mothers are all females and the fathers are all males. But the children, whose heights are recorded in `height`, are a mixture of males and females. So the large variance, 12.8 square-inches, is a combination of the systematic variation between males and females and the person-to-person variation within each group.
Galton %>%
  group_by(sex) %>%
  mutate(hdev = height - mean(height)) %>%
  summarize(v_sex_adjusted = var(hdev))
sex | v_sex_adjusted |
---|---|
F | 5.618415 |
M | 6.925288 |
Because the data have been grouped by `sex`, the `mean(height)` is found separately for males and females.
Compare these sex-adjusted variances to `vm` and `vf` from the calculation made at the start of this exercise.

The variance calculations made using `summarize()`, `var()`, and `group_by()` can be done another way using regression modeling.
lm(height ~ sex, data=Galton) %>%
model_eval() %>%
group_by(sex) %>%
summarize(var(.resid))
Using training data as input to model_eval().
sex | var(.resid) |
---|---|
F | 5.618415 |
M | 6.925288 |
Consider these calculations of the variation in `age` in the `Whickham` data frame:
Whickham %>% summarize(var(age))
var(age) |
---|
303.8756 |
Whickham %>% summarize(var(age - 5))
var(age - 5) |
---|
303.8756 |
Whickham %>% summarize(var(age - mean(age)))
var(age - mean(age)) |
---|
303.8756 |
Explain why the answers are all the same, even though different amounts are being subtracted from `age`.
Consider this graph with 4 nodes:
Is the graph acyclic?
True or false: `c` is caused jointly by `a` and `b`.
True or false: `a` is caused exclusively by `b`.
True or false: `d` gets inputs from all of `a`, `b`, and `c`.
Identify the exogenous nodes in each of these DAGs.
A fundamental problem in analyzing data is to determine which variables are connected to which other variables. We will mainly use regression modeling for this. A simpler but less flexible method is to use a basic principle about variances. The principle is this: if there is no connection between two variables `x` and `y`, then the sum of the variances will equal the variance of the sum. Any deviation from this indicates a connection: the bigger the deviation, the stronger the connection.

To illustrate, consider `dag03` and these variance calculations:
sample(dag03, size=100) %>%
summarize(vx = var(x), vy=var(y), vsum=var(x+y))
vx | vy | vsum |
---|---|---|
2.430999 | 1.903885 | 6.536116 |
The variance of `x+y` (6.5) is larger than the sum of the individual variances (2.4 and 1.9 for `x` and `y`, respectively).
Part A
In `dag04`, shown below, nodes `a`, `b`, and `c` are exogenous, but `d` is not.
sample(dag04, size=100) %>%
summarize(va = var(a), vb = var(b), vsum=var(a+b))
va | vb | vsum |
---|---|---|
1.314656 | 1.152798 | 2.555548 |
Do the variances indicate any connection between nodes `a` and `b`?
Repeat the calculations, but looking for a possible connection between `b` and `d`. Is there a connection?
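A sketch of the same style of check for `b` and `d`, following the pattern used above for `a` and `b`:

```r
sample(dag04, size = 100) %>%
  summarize(vb = var(b), vd = var(d), vsum = var(b + d))
# If vsum is close to vb + vd, the variances give no indication of a connection;
# a vsum well away from vb + vd points to a connection.
```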
Part B
Turn to `dag10`. Using the variance principle, demonstrate that none of the exogenous nodes are connected but that all are connected to `y`.
Optional: Triangles
A triangle is a familiar object. Here’s a picture with the angle between sides a and b marked as \(\theta\).
For a right triangle, the angle \(\theta = 90^\circ\). And, as you know, for a right triangle the Pythagorean theorem holds for the lengths of the sides of the triangle, \(a^2 + b^2 = c^2\).
There is a more general formula that holds for any triangle, regardless of \(\theta\). It is called the “Law of cosines”:
\[a^2 + b^2 - 2ab \cos(\theta) = c^2\ .\] What’s important for us here is that the Law of cosines suggests a powerful and intuitive way to think about the connection between two variables: as an angle! When two variables are not connected, the angle between them is \(\theta = 90^\circ\) as in a right triangle. However, when the two variables are connected, the angle will be something else, say \(30^\circ\) or \(140^\circ\). What the statisticians call the “correlation coefficient” of two variables is really nothing more than the cosine of the angle between them.
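A quick numerical check of this claim, using two arbitrary numeric variables from the built-in `mtcars` data purely for illustration:

```r
x <- mtcars$wt
y <- mtcars$mpg
# Cosine of the angle between the mean-centered variables ...
xc <- x - mean(x); yc <- y - mean(y)
cosine <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
# ... agrees with the correlation coefficient.
c(cosine = cosine, correlation = cor(x, y))
```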
Simulations are often used to plan for readiness, risk, or resource allocation. Here is a problem taught in an operations research course at the US Air Force Academy:
You are the Officer in Charge (OIC) of the 74th AMU (Aircraft Maintenance Unit). You are in charge of 24 A-10s and 220 Airmen. On a typical day, 10 aircraft are scheduled to fly in the morning and 8 of the same aircraft are scheduled to fly in the afternoon. However, 6% of aircraft break before takeoff, or abort, and 12% of the aircraft that takeoff are broken when they land. There is a 40% chance that the broken aircraft can be repaired before the afternoon flight, and a 67% chance that the aborted aircraft can be repaired before afternoon.
- On average, of the 10 original aircraft, how many are available for the afternoon flight?
- How many aircraft will need maintenance during the day?
- For a day of flights, what is the average total hours of maintenance required, per day?
`dag_flights` contains a simulation of each day’s situation. It starts with 10 planes ready. Then it simulates a 6% chance of abort for each of those planes, with the remainder going on the morning flight. Of these, 12% break on landing. The aborted and broken-on-landing planes are sent to maintenance: about 67% of the aborted planes are ready for the afternoon flight, and similarly for 40% of the broken planes. Those ready for the afternoon flight will abort or break in the same way as in the morning.
In the following diagram, `AM` is the number of planes in the morning flight and `PM` is the number of planes in the afternoon flight. `brokeAM` and `abortAM` together constitute the number of planes needing maintenance from the morning flight, and similarly for `brokePM` and `abortPM`.
A simulation of five days’ flying produces this result:
set.seed(102)
sample(dag_flights, size=5)
ready | abortAM | AM | brokeAM | PM | abortPM | brokePM |
---|---|---|---|---|---|---|
10 | 1 | 9 | 1 | 8 | 2 | 1 |
10 | 0 | 10 | 4 | 8 | 1 | 0 |
10 | 1 | 9 | 1 | 10 | 2 | 1 |
10 | 1 | 9 | 2 | 8 | 1 | 0 |
10 | 0 | 10 | 1 | 9 | 0 | 1 |
Generate a large sample of days and use the result to answer the above questions by simple data wrangling.
Simulation <- dag_sample(dag_flights, size=1000)
Find the average number of planes ready for the PM flight.
Find the average number of planes that need maintenance. (Add up all the “abort” or “broke” columns.)
In order to calculate the number of maintenance hours needed, you need some additional information: the average number of maintenance hours for each plane that needs maintenance. Using previous maintenance records, this was found to be 15.3 hours.
THIS WILL BE TURNED OFF INITIALLY
Simulation %>%
  mutate(need_maintenance = brokeAM + brokePM + abortAM + abortPM) %>%
  summarize(PM_ready = mean(PM), num_maintained = mean(need_maintenance),
            hours = mean(15.3 * need_maintenance))
PM_ready | num_maintained | hours |
---|---|---|
9.136 | 3.225 | 49.3425 |
Consider this simple DAG:
dag_draw(dag_prob_21.1)
print(dag_prob_21.1)
x ~ exo(1)
y ~ x + exo(2)
The `exo()` function in DAG formulas means “pure exogenous noise.” The noise will have a standard deviation equal to the argument to `exo()`. So `exo(1)` will have a standard deviation of 1, while `exo(2)` will have a standard deviation of 2.
Based on the standard deviations of `exo(1)` and `exo(2)`, what is the variance of the noise produced by `exo(1)`? Of `exo(2)`?
The `y` variable consists of a simple sum of `x` and exogenous noise from `exo(2)`. The variance of the sum will be the sum of the variances. Based on your answers to (1), what will the variance of `y` be?
Generate a sample of size 1000 or so from `dag_prob_21.1`:
Samp <- sample(dag_prob_21.1, size=1000)
From `Samp`, use wrangling to calculate the variance of `x` and of `y`.
What is the R2 of the model `y ~ x` fitted to `Samp`?
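A sketch of the commands for the last two items, using the `R2()` summary function that appears elsewhere in these exercises:

```r
# Variances of x and y in the sample
Samp %>% summarize(vx = var(x), vy = var(y))

# R-squared of the model y ~ x fitted to the sample
lm(y ~ x, data = Samp) |> R2()
```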
The `moderndive::amazon_books` data frame gives the `list_price` and number of pages (`num_pages`). Build a model `list_price ~ num_pages` and calculate how much of the variation in `list_price` comes from `num_pages`.
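A minimal sketch of the calculation, again assuming the `R2()` summary function used in these exercises:

```r
Books_mod <- lm(list_price ~ num_pages, data = moderndive::amazon_books)
R2(Books_mod)   # the Rsquared column is the fraction of variation accounted for
```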
Whenever you include more explanatory variables in a model, the R2 gets bigger.
Compare the R2 of these two models trained on the Galton
data.
height ~ sex
height ~ sex + mother + father
How much of the variation in `height` is accounted for by `mother` and `father`? (Hint: How much did R2 increase between the two models?)

Now add one more explanatory variable to the model: `nkids`, which gives the total number of children in the family.
How much of the variation in `height` is accounted for by `nkids`?

lm(height ~ sex, data=Galton) |> R2()
n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
898 | 1 | 0.5101643 | 933.1846 | 0.5096176 | 0 | 1 | 896 |
lm(height ~ sex + mother + father, data=Galton) |> R2()
n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
898 | 3 | 0.6396752 | 529.0317 | 0.6384661 | 0 | 3 | 894 |
Since R2 increased from 51% to 64%, `father` and `mother` account for 13% of the variance in `height`.
lm(height ~ sex + mother + father + nkids, data=Galton) |> R2()
n | k | Rsquared | F | adjR2 | p | df.num | df.denom |
---|---|---|---|---|---|---|---|
898 | 4 | 0.640721 | 398.1333 | 0.6391117 | 0 | 4 | 893 |
`nkids` accounts for virtually none of the variation in `height`.
Here are several graphs of basic regression models with zero, one, or two explanatory variables. For each graph:
Consider a data frame with two variables, `x` and `y`. Here are the box-and-whisker plots of the two variables.
Do you see any outliers in `x` or `y`? Explain what you see in the graphs that accounts for your answer.

Even when individual variables do not show outliers, there can be outliers in the relationship between the two variables, as in the following graph.
(Hint: try fitting `lm(y ~ x, data=...)` and examining the output of `model_eval()`.)

The blue line in the following plot shows the model fitted by `lm()`. The left end of the line is pulled upward by the outliers; the right end is forced downwards. Excluding the outliers from the data used for fitting would address the matter. But we have more sophisticated methods that both identify outliers and make the model robust to them. The red line shows the model fitted by `MASS::rlm()`, where `rlm` stands for “robust linear modeling.”
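A sketch of how such a comparison can be made. The data-frame name `Outlier_data` is a placeholder for whatever data frame holds the plotted points:

```r
library(MASS)
library(ggplot2)
fit_ls  <- lm(y ~ x, data = Outlier_data)    # ordinary least squares (blue)
fit_rlm <- rlm(y ~ x, data = Outlier_data)   # robust linear modeling (red)
ggplot(Outlier_data, aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = coef(fit_ls)[1],  slope = coef(fit_ls)[2],  color = "blue") +
  geom_abline(intercept = coef(fit_rlm)[1], slope = coef(fit_rlm)[2], color = "red")
```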
Here is a model with two explanatory variables, fitted with `lm()` and summarized with `regression_summary()`.
Hill_racing %>%
  lm(time ~ distance + climb, data = .) %>%
  regression_summary()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -469.976937 | 32.3582241 | -14.52419 | 0 |
distance | 253.808295 | 3.7843322 | 67.06819 | 0 |
climb | 2.609758 | 0.0593826 | 43.94821 | 0 |
Using the information from the regression summary, calculate the confidence interval on the three coefficients.
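One way to do this is with the usual approximation, estimate ± 2 × standard error. For example, for the `distance` coefficient:

```r
# Approximate 95% confidence interval: estimate +/- 2 * std.error
253.808295 + c(-2, 2) * 3.7843322
```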
Explain the difference between the variation of a variable and the sampling variation of a summary.
Does the variance in a variable change substantially as more data is collected, say increasing the sample size \(n\) by a factor of 10? If so, by how much?
Does the sampling variation of a summary change substantially as more data is collected, say increasing the sample size \(n\) by a factor of 10? If so, by how much?
Write each of the following intervals in both [top, bottom] and center \(\pm\) spread form using a sensible number of digits.
Assuming that each of the above intervals is a confidence interval, fill in the following table. (Use the sensible number of digits from your previous replies.)
 | margin of error | standard error |
---|---|---|
a. | | |
b. | | |
c. | | |
Fit the model outcome ~ expenditure and give confidence intervals. Explain that participation has to do with the willingness to fill in a survey.
set.seed(101)
npts <- 600
Dat <- tibble::tibble(y = 350 + 200*rnorm(npts, sd=0.5), x=1, samp=rep(1:30, each=20))
Stats <- df_stats(y ~ samp, data=Dat, mn = mean(), sd = sd()) %>%
  mutate(top = mn + 2*sd, bottom = mn - 2*sd)
P1 <- ggplot(Dat, aes(x=samp, y=y)) +
  geom_jitter(alpha=0.3, width=0.15, height=0) +
  geom_point(data=Stats, aes(y=mn, x=samp),
             color="red", size=2) +
  geom_errorbar(data=Stats, aes(ymin=bottom, ymax=top, x=samp), y=NA,
                color="red") +
  scale_x_continuous(breaks=c()) + xlab("") +
  scale_y_continuous(breaks=100*(0:6))
Grand_stats <- df_stats(mn ~ 1, data=Stats, gmean=mean(), gsd=sd()) %>%
  mutate(top = gmean + 2*gsd, bottom = gmean - 2*gsd)
For_violin <- tibble::tibble(
  v = with(Grand_stats, rnorm(1000, mean=gmean, sd=gsd))
)
P2 <- ggplot(Stats, aes(x=1, y=mn)) +
  geom_jitter(alpha=0.3, color="red", height=0, width=0.05) +
  geom_violin(data=For_violin, aes(y=v, x=1),
              fill="red", color=NA, alpha=0.2) +
  geom_errorbar(data=Grand_stats, aes(x=1, ymin=bottom, ymax=top),
                y=NA, color="red", width=0.05) +
  scale_y_continuous(breaks=100*(0:6), limits=c(0,600)) +
  ylab("") +
  scale_x_continuous(breaks=c()) + xlab("")
gridExtra::grid.arrange(P1, P2, ncol=2, widths=c("80", "20"))
A 1995 article recounted an incident with a scallop fishing boat. In order to protect the fishery, the law requires that the average weight of scallops caught be larger than 1/36 pound. The particular ship involved returned to port with 11,000 bags of frozen scallops. The fisheries inspector randomly selected 18 bags as the ship was being unloaded, finding the average weight of the scallops in each of those bags. The resulting measurements are displayed below, in units of 1/36 pound. (That is, a value of 1 is exactly 1/36 pound while a value of 0.90 is \(\frac{0.90}{36}=0.025\) pound.)
Sample <- tibble::tribble(
  ~ scallops,
  0.93, 0.88, 0.85, 0.91, 0.91, 0.84, 0.90, 0.98, 0.88,
  0.89, 0.98, 0.87, 0.91, 0.92, 0.99, 1.14, 1.06, 0.93)
lm(scallops ~ 1, data = Sample) |> conf_interval(level=0.99)
term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | 0.8802122 | 0.9316667 | 0.9831212 |
If the average of the 18 measurements is below 1.0, a penalty is imposed. For instance, an average of 0.97 leads to 40% confiscation of the cargo, while 0.93 and 0.89 incur 95- and 100-percent confiscation, respectively.
The inspection procedure—select 18 bags at random and calculate the mean weight of the scallops therein, penalize if that mean is below 1/36 pound—is an example of a “standard operating procedure.” The government inspector doesn’t need to know any statistics or make any judgement. Just count, weigh, and find the mean.
Designing the procedure presumably involves some collaboration between a fisheries expert (“What’s the minimum allowed weight per scallop? I need scallops to have a fighting chance of reaching reproductive age.”), a statistician (“How large should the sample size be to give the desired precision? If the precision is too poor, the penalty will be effectively arbitrary.”), and an inspector (“You want me to sample 200 bags? Not gonna happen.”)
A. Which of the numbers in the above report correspond to the mean weight per scallop (in units of 1/36 pound)?
There is a legal subtlety. If the regulations state, “Mean weight must be above 1/36 pound,” then those caught by the procedure have a legitimate claim to insist that there be a good statistical case that the evidence from the sample reliably relates to a violation.
B. Which of the numbers in the above report corresponds to a plausible upper limit on what the mean weight has been measured to be?
Back to the legal subtlety …. If the regulations state, “The mean weight per scallop from a random sample of 18 bags must be 1/36 pound or larger,” then the question of evidence doesn’t come up. After all, the goal isn’t necessarily that the mean be greater than 1/36th pound, but that the entire procedure be effective at regulating the fishery and fair to the fishermen. Suppose that the real goal is that scallops weigh, on average, more than 1/34 of a pound. In order to ensure that the sampling process doesn’t lead to unfair allegations, the nominal “1/36” minimum might reflect the need for some guard against false accusations.
C. Transpose the whole confidence interval to where it would be if the target were 1/34 of a pound (that is, \(\frac{1.06}{36}\) pound). Does the confidence interval from a sample of 18 bags cross below 1.0?
An often-heard critique of such procedures is along the lines of, “How can a sample of 18 bags tell you anything about what’s going on in all 11,000 bags?” The answer is that the mean of 18 bags—on its own—doesn’t tell you how the result relates to the 11,000 bags. However, the mean with its confidence interval does convey what we know about the 11,000 bags from the sample of 18.
D. Suppose the procedure had been defined as sampling 100 bags, rather than 18. Using the numbers from the above report, estimate in \(\pm\) format how wide the confidence interval would be.
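A reminder of the scaling rule that is useful here: with everything else unchanged, the width of a confidence interval on a mean shrinks roughly in proportion to \(1/\sqrt{n}\). A sketch of the arithmetic, starting from the 99% interval reported above for n = 18:

```r
width_18  <- 0.9831212 - 0.8802122          # width of the n = 18 interval
width_100 <- width_18 * sqrt(18 / 100)      # approximate width for n = 100
c(width_18 = width_18, width_100 = width_100, plus_or_minus = width_100 / 2)
```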
Consider a situation where a measurement, say the number of highway deaths in Minnesota, is seen to increase from one year to the next. Perhaps the data look like this:
Highway <- tibble::tribble(
  ~ year, ~ deaths,
  2021, 403,
  2022, 430
)
A. Do you think these data are sufficient for concluding that there is an upward trend in highway deaths? (By “trend,” we mean an increase that is likely to persist into the future.) Explain why or why not.
I think many people would not see much of a trend in these data. The question we want to address here is whether statistical methods would or would not point to a trend.
For technical reasons that are not important to us here, an appropriate method for building a model of counts is “poisson regression.” This works very much like lm()
but is invoked differently.
glm(deaths ~ year, data=Highway, family=poisson) |> conf_interval()
Waiting for profiling to be done...
term | .lwr | .coef | .upr |
---|---|---|---|
(Intercept) | -400.0723002 | -125.0601785 | 149.54320 |
year | -0.0709928 | 0.0648486 | 0.20089 |
B. Based on the above report, do you think the statistical report points clearly to an upward trend, or not? Explain your reasoning.
For demonstration purposes only, we are going to make up data to see how much data is needed to give compelling (statistical) evidence of an upward trend. To that end, open an Rmd document and create a new chunk. In that chunk, paste the code from both of the above chunks. Run the code to make sure you get the same confidence interval as shown above.
In your Rmd document, edit the data to add another year, 2023, with about 460 deaths.
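One way to make the extension in your Rmd document, using the made-up 2023 value suggested above:

```r
Highway <- Highway %>%
  tibble::add_row(year = 2023, deaths = 460)   # made-up value for demonstration

glm(deaths ~ year, data = Highway, family = poisson) |> conf_interval()
```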
C. Would the confidence interval calculated on the model fitted to the extended data point clearly to a trend?
Now add a fourth row to the data, for the year 2024. You will try different values for deaths
in 2024 to see what the confidence interval has to say.
D. Find the largest number of deaths for 2024 that creates a confidence interval that does not point clearly to a trend.
E. Find the smallest number of deaths for 2024 such that the confidence interval does point clearly to a trend.
The data frame `math300::Gilbert` has a row for each of 1641 shifts at a VA hospital. (For the story behind the data, give the command `?Gilbert`.) The variable `death` records whether or not a death occurred on that shift, while `gilbert` records whether nurse Kristen Gilbert was on duty during the shift.
A. Using `mutate()`, transform `death` to a zero-one variable where 1 indicates that a death occurred on the shift.
B. Using the data from (A), fit the regression model `death ~ gilbert` and extract the confidence interval on the `gilberton_duty` coefficient.
i. The coefficient on `gilberton_duty` indicates that Gilbert being on duty is associated with a 13.1 percentage point increase in the risk of a death during the shift. Gilbert was on duty for 257 shifts. Using the coefficient (`.coef`), how many deaths can reasonably be attributed to Gilbert?
ii. The `.lwr` and `.upr` bounds of the confidence interval suggest a range in the number of deaths that these data indicate can be attributed to Gilbert. What is that range?
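To make the arithmetic in (i) and (ii) concrete: multiply the per-shift risk increase by the 257 shifts Gilbert worked. The lower and upper bounds shown here are placeholders; substitute the `.lwr` and `.upr` values from your own `conf_interval()` output.

```r
shifts <- 257
0.131 * shifts                         # point estimate (.coef) of attributable deaths
c(lwr = 0.10, upr = 0.16) * shifts     # placeholder bounds; use your .lwr and .upr
```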
How would you explain the
The following table refers to a model with explanatory variables `dist` and `accuracy`. The table shows the `.output` of the model for each of four combinations of the explanatory variables.
model_eval(mod, dist=c(100,200), accuracy = c(50,60)) |> select(-.lwr, -.upr)
dist | accuracy | .output |
---|---|---|
100 | 50 | -42.14096 |
200 | 50 | -21.52042 |
100 | 60 | -39.59864 |
200 | 60 | -18.97810 |
A. At an accuracy of 50, what is the effect size of the `.output` with respect to `dist`? (Be sure to take into account both the difference in `.output` and the difference in `dist`.)
B. At a distance of 100, what is the effect size of the `.output` with respect to `accuracy`?
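As a reminder of the arithmetic, using part (A) as the example: the effect size is the change in `.output` divided by the corresponding change in the explanatory variable. Part (B) works the same way with `accuracy`.

```r
# Effect size with respect to dist, holding accuracy at 50
(-21.52042 - (-42.14096)) / (200 - 100)
```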
`Price_mod` is a model of the sales `price` at auction of antique clocks, as a function of the `age` of the clock and the number of `bidders` involved in the auction.
Price_mod <- lm(price ~ age + bidders, data = Clock_auction)
Use the following table of the evaluated model to answer the following questions.
model_eval(Price_mod, age=c(100,150), bidders=c(5,10)) |> select(-.lwr, -.upr)
age | bidders | .output |
---|---|---|
100 | 5 | 507.2968 |
150 | 5 | 1061.6295 |
100 | 10 | 827.4311 |
150 | 10 | 1381.7638 |
A. What is the effect size of the model `.output` with respect to `age`?
B. What is the effect size of the model `.output` with respect to `bidders`?
C. `age` is in years and `.output` is in USD. `bidders` is a pure number representing the number of people bidding. But rather than calling the units “people,” let’s call it “bidders.”
i. What are the units of the answer to (A)?
ii. What are the units of the answer to (B)?
D. Explain, in everyday terms, the commercial meaning of the effect size of `price` with respect to `bidders`.
The Boston Marathon is the oldest annual marathon in the US, starting in 1897. The winning runs each year are recorded in the `math300::Boston_marathon` data frame. In this exercise, you are going to look at the records for 1990 and earlier. (In another Lesson, we will look at what the post-1990 data have to say.)
To start, create a data frame `Bos` with just the 1990 and earlier data. This is ordinary data wrangling, not modeling.
Bos <- Boston_marathon %>% filter(year <= 1990)
head(Bos)   # to see what the data frame looks like
year | name | country | time | sex | minutes |
---|---|---|---|---|---|
1990 | Gelindo Bordin | Italy | 02:08:19 | male | 128.3167 |
1989 | Abebe Mekonnen | Ethiopia | 02:09:06 | male | 129.1000 |
1988 | Ibrahim Hussein | Kenya | 02:08:43 | male | 128.7167 |
1987 | Toshihiko Seko | Japan | 02:11:50 | male | 131.8333 |
1986 | Robert de Castella | Australia | 02:07:51 | male | 127.8500 |
1985 | Geoff Smith | England | 02:14:05 | male | 134.0833 |
Use the `minutes` variable to represent the winners’ running duration.
A. Create a graphic showing how the race times have varied over the years. You just need to fill in the blanks in this scaffold, but you’ll have to decide which variable should be matched to which aesthetic (that is, x, y, color).
ggplot(Bos, aes(x=______, y=_____, color=____)) +
geom_point()
ggplot(Bos, aes(x=year, y=minutes, color=sex)) +
geom_point()
B. Now to fit a simple regression model to the data. You can use these commands:
mod1 <- lm(minutes ~ year, data=Bos)
model_plot(mod1, x = year, color=sex)
i. Comment on how well `mod1` matches the data. In particular, are there ways in which the model fails to capture the pattern of change of the winning race time over the years?
ii. Whatever misgivings you have from (i), calculate the effect size of `year` on `minutes`. The following command does the basic calculation. The effect size is the rate of change of the `.output` with respect to `year`. Note that the effect size is the same for female and male runners; `sex` isn't included in `mod1`.
model_eval(mod1, year=c(1980, 1990), sex=c("female", "male")) |> select(-.upr, -.lwr)

year | sex | .output |
---|---|---|
1980 | female | 141.9835 |
1990 | female | 140.1760 |
1980 | male | 141.9835 |
1990 | male | 140.1760 |
iii. Using the effect size you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.
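A sketch of the arithmetic for the prediction, using the effect size implied by the table above (about -0.18 minutes per year) and the assumed 1990 winning times:

```r
effect_size <- (140.1760 - 141.9835) / (1990 - 1980)   # roughly -0.18 minutes per year
years_ahead <- 2020 - 1990
c(female = 145 + years_ahead * effect_size,
  male   = 130 + years_ahead * effect_size)
```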
C. Repeat the steps in (B), but for a model that does include `sex` as an explanatory variable: `minutes ~ year + sex`. (Call the model `mod2`.) Now the `.output` will be different for the females and the males.
i. Draw a graph of the model, using the same commands as in (B), but for `mod2`.
ii. Describe ways in which the model grossly mis-represents the data.
iii. Calculate the effect size for females and males using the `model_eval()` command from (B), but with `mod2`. Do the effect sizes differ for males and females?
iv. Using the effect size(s) you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.
D. We will do this one more time, but for a slightly different model form: `minutes ~ year * sex`. (Call the model `mod3`.) Note that for the tilde expression in this model, `*` is used to join the explanatory variables rather than `+`. That is very important.
i. Draw a graph of the model, using the same commands as in (B) and (C), but for `mod3`.
ii. `mod3` is much closer to the data than either `mod1` or `mod2`. Explain what is different about `mod3` that allows it to be more faithful to the data.
iii. Calculate the effect size for females and males using the `model_eval()` command from (B or C), but with `mod3`. Do the effect sizes differ for males and females?
iv. Using the effect sizes you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.
In Question 24-3 we looked at precipitation in California using the model specification `precip ~ orientation + altitude + distance`. We concluded that `distance` didn’t have much to say about `precip`. So in this Question we will drop `distance` from the model.
In addition, we will make another change. If you have ever hiked near the crest between two mountains, you might have noticed that the vegetation can be substantially different from one side of the crest to another. We would like to create models that take this into account: one model for the “W” orientation and another for the “L” orientation. (We will also get rid of two outlier stations.)
modW <- lm(precip ~ altitude + latitude,
           data = Calif_precip |> filter(orientation == "W", station != "Cresent City"))
modL <- lm(precip ~ altitude + latitude,
           data = Calif_precip |> filter(orientation == "L", station != "Tule Lake"))
Here are graphs of the two models:
A. You can see that the W model and the L model are very different. One difference is that the precipitation is much higher for the W stations than the L stations. How does the higher precipitation for W show up in the graphs? (Hint: Don’t overthink the question!)
B. Another difference between the models has to do with the confidence bands. The bands for the L stations are pretty much flat while those for the W stations tend to slope upwards.
i. What about the altitude confidence intervals on `modW` and `modL` corresponds to the difference?
ii. Calculate R^2^ for both the L model and the W model. What do the different values of R^2^ suggest about how much of the explanation of `precip` is accounted for by each model?
Your boss wants to know the effect of ambient temperature on the range of electric vehicles. You borrow her Tesla each month over the next year. Each month, you drive the Tesla until it needs recharging, taking note of the mileage and the outdoor temperature. Altogether, you have a sample of size \(n=12\) from which you build a model range ~ temperature
. The confidence interval on the effect size of temperature on range is \(-2.4 \pm 4.0\) miles per degree C.
From this confidence interval, your boss concludes that lower temperatures decrease the range.
A. Explain to your boss why this conclusion is not justified by the data.
B. What sample size would be required to reduce the margin of error from \(\pm 4.0\) to \(\pm 1\) miles per degree C?
C. You need to warn your boss that the larger sample, while it will reduce the margin of error, won’t necessarily lead to a justified conclusion that lower temperatures decrease the range. Make up several different confidence intervals that might plausibly result from the larger sample, some of which point to the possibility that lower temperatures actually increase the range.
Just FYI … A better design for the study would be to make several measurements in July and several in January so that your measurements will come from the extremes of temperature rather than the in-the-middle months like October or April.
Here are several graphs of basic regression models with two explanatory variables. For each graph, say whether the model specification includes an interaction between the two explanatory variables.
The following graphs show confidence bands for the same model fitted to two samples of different sizes.
A. Do the two confidence intervals (red and blue) plausibly come from the same model? Explain your reasoning.
B. Which of the confidence intervals comes from the larger sample, red or blue?
C. To judge from the graph, how large is the larger sample compared to the smaller one?
The following graphs show prediction bands for the same model fitted to two samples differing in size by a factor of 16.
A. Which of the prediction intervals comes from the larger sample, red or blue? Explain your reasoning.
C. To judge from the graph, how large is the larger sample compared to the smaller one?
In the graph, a confidence interval and a prediction interval from the same model are shown.
A. Which is the confidence interval, red or blue?
B. To judge from the graph, what is the standard deviation of the residuals from the model?
The graph shows a confidence interval and a prediction interval.
A. Which is the confidence interval, red or blue?
B. Do the confidence interval and the prediction interval come from the same sample of the same system? Explain your reasoning.
An industrial process for making solar photovoltaics involves printing layers of doped silicon onto a large plastic substrate. The integrity of the final product varies from run to run. You are the manager of the production line and have asked the quality control technicians to measure the atmospheric humidity for each run to check if humidity is related to product integrity. Integrity must be at least 14 for the output of the product run to be accepted.
The graph shows the data from 50 production runs along with a prediction band from a model trained on the data.
As manager, you have decided that the probability of the integrity being above 14 must be at least 75% in order to generate an appropriate production quantity without wasting too much material from the rejected production runs.
A. What is the interval of acceptable humidity levels in order to meet the above production standards?
B. As more production runs are made, more data will be collected. Based on what you know about prediction bands, will the top and bottom of the band become closer together as more data accumulates?
IN DRAFT… ## Exercises
Translate Figure 25.1 to a ribbon plot.
Make a DAG data set with 200 points, say `y ~ x`. Count how many points lie outside the prediction band. It should be about 10.
(… the `exo()` component to `y`.)
The data frame `math300::Ticket_sales` comes from the operations research 310 course at the US Air Force Academy. It is used as part of a lesson on “seasonal adjustment” of time series in management. Sales of a product often vary in a seasonal pattern. For instance, September and December are often high-sales months, while February and March are low points. In modeling sales as a function of time, it is useful to be able to capture explicitly the seasonal patterns.
The `Ticket_sales` data frame records sales over only one month, so there is no clear seasonal trend. However, there is a weekly trend; sales tend to be higher on the weekend. In this exercise you will construct and evaluate predictive models based on the `Ticket_sales` data.
To set things up, construct these four models and fit them to the `Ticket_sales` data. In each model, the response variable is `sales`, which is the quantity that we want to use the model to predict.
modA <- lm(sales ~ 1, data = Ticket_sales)
modB <- lm(sales ~ day, data = Ticket_sales)
modC <- lm(sales ~ dow, data = Ticket_sales)
modD <- lm(sales ~ dow + day, data = Ticket_sales[-c(3,4),])
We will focus on the “seasonal” bit of the problem: how the sales differ from one day of the week to another. To simplify the commands, here is an object that lists all the days of the week.
<- c("Monday", "Tuesday", "Wednesday", "Thursday",
week "Friday", "Saturday", "Sunday")
The mechanics of making a prediction from any one of the models are simple. For example, here is a prediction for sales on each of the weekdays, for a day near the end of the month:
model_eval(modD, dow=week, day=28, interval="prediction")
dow | day | .output | .lwr | .upr |
---|---|---|---|---|
Monday | 28 | 4.063776 | 2.634304 | 5.493247 |
Tuesday | 28 | 3.418367 | 1.931522 | 4.905213 |
Wednesday | 28 | 3.772959 | 2.316552 | 5.229367 |
Thursday | 28 | 4.681122 | 3.231134 | 6.131111 |
Friday | 28 | 10.089286 | 8.645206 | 11.533365 |
Saturday | 28 | 10.543878 | 9.111730 | 11.976025 |
Sunday | 28 | 8.302041 | 6.876694 | 9.727388 |
A. The `.output` shows differences in sales among the weekdays that are consistent with everyday experience of when people buy tickets for events. Which of the four models show such differences clearly and which do not? Explain what’s missing in the models that fail to capture the differences from one weekday to another.
B. Look at the prediction intervals. Do you regard them as wide or as narrow? Explain why.
C. Calculate R2 for each of the four models. According to this criterion, which of the four models should make the best predictions?
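A sketch of the R2 calculations for (C), using the `R2()` summary function from earlier exercises:

```r
R2(modA)
R2(modB)
R2(modC)
R2(modD)
```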
D. One reason the prediction intervals might be poor is that the sales on some days may be extraordinarily high or low, outside of the “usual” pattern. You can identify such days from the residuals.
Evaluated <- model_eval(modD)
Plot the residuals (`.resid`) versus `day`. Do any days stand out as outliers?
E. For simplicity, we’re going to fit the four models again, but this time excluding the days that you identified in (D) as outliers. To illustrate, suppose you thought that days 17 and 19 were outliers. (They aren’t, but just suppose.) Here’s a wrangling statement that will make a new data frame `Cleaned` without the outliers.
Cleaned <- Ticket_sales %>% filter(!day %in% c(17, 19))
Re-fit `modD` using the `Cleaned` data and look at the prediction intervals for the days of the week. What has changed compared to the model fitted to `Ticket_sales`?
F. Time for some detective work. The month recorded by `Ticket_sales` is not identified in the data or the documentation. Figure out which month it is. (Hints: 1) How many days are in the month recorded in `Ticket_sales`? 2) What familiar circumstances in the calendar might lead to exceptionally high ticket sales compared to other days of the same weekday?)
A 1990 lawsuit concerned prison overcrowding in Suffolk County, Massachusetts (part of which is in Boston). The county had just opened a new jail, sized to hold 419 inmates. The new jail was the result of a previous lawsuit about the previous jail, where conditions were ruled unconscionable. Regrettably, when the new jail opened, the nightly average prisoner count was about 500. The 1990 lawsuit claimed that the sheriff had violated the court-imposed terms from the previous lawsuit through insufficient planning for the new jail.
Plans for the new jail were made based on prisoner counts available up to 1987, when the design details were finalized. The data are:
Prisoners <- tibble::tribble(
  ~ year, ~ count,
  1980, 190,
  1981, 220,
  1982, 270,
  1983, 310,
  1984, 319,
  1985, 326,
  1986, 322,
  1987, 370
)
The sheriff argued that in 1987, when plans for the new jail were finalized, a jail size of 419 was more than ample, and that he should not be in the position of having to divine what the crime and arrest rates will be in the future. The prisoners countered that the growth pattern is clear from the data from 1980 to 1987. Figure 27.1 shows the prisoner-count data and the confidence and prediction intervals for the model fitted to the data.
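A sketch of how intervals like those in Figure 27.1 can be computed, using `model_eval()` in the same way as earlier exercises (the choice of years to evaluate is just for illustration):

```r
Jail_mod <- lm(count ~ year, data = Prisoners)
# Prediction interval for the years beyond the data used for planning
model_eval(Jail_mod, year = 1988:1990, interval = "prediction")
```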
A. Figure 27.1 shows one band (purple) inside another (blue). Which is the confidence interval and which is the prediction interval?
B. To judge from Figure 27.1, does the sheriff have a good case that his planning for 1990 based on the available data was responsible?
Figure 27.2 is similar to Figure 27.1, except that the single point from 1987 has been deleted. This is part of a process called “sensitivity analysis” which checks whether the results are overly sensitive to single data points.
C. Does Figure 27.2 lead you to a different conclusion than you stated in (B)?
Actual outcome: The court ruled in favor of the sheriff, noting that prisoner populations are inherently unpredictable.
Note: This exercise stems from a 1995 article by Arnold Barnett in Interfaces 25(2).
Federal regulation calls for household appliances to be labelled for energy use. An example of such a label was published with the regulations and is shown below.
Underneath the bold-face $84 are two bars showing intervals. Are these bars confidence intervals or prediction intervals? Explain your reasoning.
Consider the data in the point plot.
As is typical, the data are a combination of signal and noise.
SIMPLE PATTERN TYPE TO DEFINE MODEL. THE CLUSTER IN THE UPPER CORNER IS NOT PART OF A SIMPLE PATTERN.
WHAT ABOUT THE PREDICTION INTERVALS?