The data frame math300::Gilbert records 1641 shifts at a VA hospital. (For the story behind the data, give the command ?Gilbert.) Whether or not a patient died during a shift is indicated by the variable death. The variable gilbert marks whether or not nurse Kristen Gilbert was on duty during the shift.

A. Are the variables numerical or categorical?

B. The data were used as evidence during Gilbert’s trial for murder. The data can be reduced to a compact form by data wrangling:

Gilbert %>%
  group_by(death, gilbert) %>%
  tally()
death gilbert     n
no    not      1027
no    on_duty   157
yes   not       357
yes   on_duty   100

Calculate the proportion of shifts on which a death occurred when Gilbert was on duty, and the proportion when Gilbert was not on duty. Does the shift evidence support the claim that Gilbert was a contributing factor in the deaths?

C. In the Lessons, we will mainly use regression modeling rather than wrangling as a data-reduction technique. In the trial, the relevant question was whether Gilbert was a factor in the deaths. Express this question as a tilde expression involving the variables death and gilbert.

D. The response variable in your tilde expression from (C) is categorical. Regression modeling requires a numerical response variable. Use mutate() and zero_one() to transform the variable death from categorical form to a numerical form that has the value 1 when a death occurred during the shift and 0 otherwise. This will become the response variable for the regression: death ~ gilbert.

E. Carry out the regression using lm(). Look at the coefficients from the fitted model with conf_interval(); these are in the .coef column. Explain how the coefficients correspond to the proportions you calculated in (B).
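As a cross-check on (B) and (E), the tally counts shown above are enough to reconstruct the data and verify that the lm() coefficients match the two proportions. The sketch below uses only base R rather than the math300 workflow:

```r
# Rebuild the 1641 shifts from the four tally counts given above
shifts <- data.frame(
  death   = rep(c(0, 0, 1, 1), times = c(1027, 157, 357, 100)),
  gilbert = rep(c("not", "on_duty", "not", "on_duty"),
                times = c(1027, 157, 357, 100))
)

# Part B: proportion of shifts with a death, by duty status
props <- tapply(shifts$death, shifts$gilbert, mean)
props   # not: 357/1384, on_duty: 100/257

# Part E: the intercept is the "not" proportion; the on_duty
# coefficient is the difference between the two proportions
mod <- lm(death ~ gilbert, data = shifts)
coef(mod)
```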

Write out the R commands to make these graphics, based on the Whickham data frame.

Graph A

Graph B

Graph C

• True or false: Variance refers to a data frame.
• Explain why “standard deviation” and “variance” can be considered equivalent measures.
• True or false: Variance is about a single variable, not the relationship between two variables.

It is undeniably odd that the units of the variance are the square of the units of the variable for which the variance is calculated. For example, if the variable records height in units of meters, the variance of height will have units of square meters. Sometimes, the units of the variance have no physical meaning to most people. A case in point: if the variable records the temperature in degrees C, the variance will have units of square degrees C.

The odd units of the variance make it hard to estimate from a graph. But there is a trick that makes it straightforward.

1. Graph the values of the variable with a jittered point plot. The horizontal axis has no content because variance is about a single variable, not a pair of variables. The horizontal space is needed only to provide room for the jittered points to spread out.

2. Mark the graph with two horizontal lines. The space between the lines should include the central two-thirds of the points in the graph.

3. Find the intercept of each of the two lines with the vertical axis. We’ll call the two intercepts “top” and “bottom.”

4. Find the numerical difference between the top and the bottom. Divide this difference by 2 to get a “by-eye” estimate of the standard deviation. In other words, the standard deviation is roughly the vertical distance from the mid-point between “top” and “bottom” up to the top line.

5. Square the standard deviation to get the value of the variance.
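A quick numerical check of the recipe, using simulated normal data made up for illustration: the gap between the two lines that bracket the central two-thirds of the points is close to twice the standard deviation.

```r
set.seed(1)
x <- rnorm(100000, mean = 10, sd = 3)

# Steps 2-3: the two horizontal lines bracketing the central two-thirds
lines2 <- quantile(x, c(1/6, 5/6))

# Step 4: half the top-to-bottom difference estimates the SD
by_eye_sd <- diff(lines2) / 2
c(by_eye = unname(by_eye_sd), actual = sd(x))   # should be close

# Step 5: square it to estimate the variance
c(by_eye = unname(by_eye_sd)^2, actual = var(x))
```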

For each of the following graphics (in the style described in (1)), estimate the standard deviation and from that the variance of the data in the plot.

Graph Standard Dev. Variance units of sd units of var
A
B
C
D

Consider this violin plot

1. For each group, judge by eye what fraction of the data points have a value of 2 or below.

2. Which of the three groups has multiple peaks in its density?

3. Which group has the lowest median?

Here are the variances of three variables from the Galton data frame; later in this exercise we will break them down by sex.

Galton %>%
  summarize(vh = var(height), vm = var(mother), vf = var(father))
     vh       vm       vf
12.8373 5.322365 6.102164

All three variables—height, mother, father—give the height of a person. The height variable is special only because the people involved are the (adult-aged) children of the respective mother and father.

1. Height is measured in inches in Galton. What are the units of the variances?

2. Can you tell from the variances which group is the tallest: fathers, mothers, or adult children?

Mothers and fathers have about the same variance. Yet the heights of the children have a variance that is more than twice as big. Let’s see why this is.

As you might expect, the mothers are all females and the fathers are all males. But the children, whose heights are recorded in height, are a mixture of males and females. So the large variance, 12.8 square-inches, is a combination of the systematic variation between males and females and the person-to-person variation within each group.

Galton %>%
  group_by(sex) %>%
  mutate(hdev = height - mean(height)) %>%
  summarize(v_sex_adjusted = var(hdev))
sex v_sex_adjusted
F         5.618415
M         6.925288

Because the data have been grouped by sex, the mean(height) is found separately for males and females.

1. The table just calculated gives the variance among the female (adult-aged) children and the variance among the male (adult-aged) children. Compare these to the variances for the mothers and fathers, vm and vf, from the calculation made at the start of this exercise.
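The claim that the large mixed-sex variance combines systematic male-female variation with within-group variation can be checked numerically. The sketch below uses made-up heights, not the Galton data:

```r
# Made-up heights (not Galton) to show why the mixed-sex
# variance exceeds the within-group variances
set.seed(2)
Kids <- data.frame(
  sex    = rep(c("F", "M"), each = 500),
  height = c(rnorm(500, mean = 64, sd = 2.4),
             rnorm(500, mean = 69, sd = 2.6))
)

v_total   <- var(Kids$height)              # mixed-sex variance
grp_mean  <- ave(Kids$height, Kids$sex)    # each person's group mean
v_within  <- var(Kids$height - grp_mean)   # person-to-person variation
v_between <- var(grp_mean)                 # male-vs-female variation

c(total = v_total, within = v_within, between = v_between)
# total is (exactly) within + between
```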

The variance calculations made using summarize(), var() and group_by() can be done another way using regression modeling.

lm(height ~ sex, data=Galton) %>%
model_eval() %>%
group_by(sex) %>%
summarize(var(.resid))
Using training data as input to model_eval().
sex var(.resid)
F 5.618415
M 6.925288
1. Explain what the variance of the residual has to do with the variance of the sexes separately.
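To see the correspondence concretely: for a model with a single categorical explanatory variable, the fitted values are just the group means, so the residuals are exactly the hdev deviations computed earlier. A base-R sketch with made-up data:

```r
set.seed(3)
Kids <- data.frame(
  sex    = rep(c("F", "M"), each = 200),
  height = c(rnorm(200, 64, 2.4), rnorm(200, 69, 2.6))
)

# hdev as in the group_by()/mutate() calculation
hdev <- Kids$height - ave(Kids$height, Kids$sex)

# residuals of height ~ sex: identical, because the fitted values
# of this model are the two group means
r <- resid(lm(height ~ sex, data = Kids))
all.equal(unname(hdev), unname(r))
```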

Consider these calculations of the variation in age in the Whickham data frame:

Whickham %>% summarize(var(age))
var(age)
303.8756
Whickham %>% summarize(var(age - 5))
var(age - 5)
303.8756
Whickham %>% summarize(var(age - mean(age)))
var(age - mean(age))
303.8756

Explain why the answers are all the same, even though different amounts are being subtracted from age.
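The key point is that variance measures spread around the mean; subtracting a constant shifts every value (and the mean) by the same amount, leaving the deviations untouched. Any vector of made-up values shows this:

```r
age <- c(23, 45, 67, 34, 52, 61, 18)   # made-up ages

v0 <- var(age)
v1 <- var(age - 5)           # same deviations from the (shifted) mean
v2 <- var(age - mean(age))   # mean shifted to zero; deviations unchanged
c(v0, v1, v2)                # all three agree
```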

Consider this graph with 4 nodes:

1. Is the graph acyclic?

2. True or false: c is caused jointly by a and b.

3. True or false: a is caused exclusively by b.

4. True or false: d gets inputs from all of a, b, and c.

Identify the exogenous nodes in each of these DAGs.

A fundamental problem in analyzing data is to determine which variables are connected to which other variables. We will mainly use regression modeling for this. A simpler but less flexible method is to use a basic principle about variances. The principle is this:

• If there is no connection between x and y, then the sum of the variances will equal the variance of the sum. Any deviation from this indicates a connection: the bigger the deviation, the stronger the connection.

To illustrate, consider dag03 and these variance calculations:

sample(dag03, size=100) %>%
summarize(vx = var(x), vy=var(y), vsum=var(x+y))
vx vy vsum
2.430999 1.903885 6.536116

The variance of x+y (6.5) is larger than the sum of the individual variances (2.4 and 1.9 for x and y, respectively).
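The principle can also be illustrated without the dag03 simulation, using made-up variables in base R:

```r
set.seed(4)
n <- 10000
x <- rnorm(n)
y <- x + rnorm(n)    # y is connected to x
z <- rnorm(n)        # z is not connected to x

# connected: variance of the sum exceeds the sum of the variances
c(sum_of_vars = var(x) + var(y), var_of_sum = var(x + y))

# unconnected: the two quantities nearly agree
c(sum_of_vars = var(x) + var(z), var_of_sum = var(x + z))
```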

Part A

In dag04, shown below, nodes a, b, and c are exogenous, but d is not.

sample(dag04, size=100) %>%
summarize(va = var(a), vb = var(b), vsum=var(a+b))
va vb vsum
1.314656 1.152798 2.555548
1. Do the variances indicate any connection between nodes a and b?

2. Repeat the calculations, but looking for a possible connection between b and d. Is there a connection?

Part B

Turn to dag10. Using the variance principle, demonstrate that none of the exogenous nodes are connected to one another, but that all of them are connected to y.

Optional: Triangles

A triangle is a familiar object. Here’s a picture with the angle between sides a and b marked as $$\theta$$.

For a right triangle, the angle $$\theta = 90^\circ$$. And, as you know, for a right triangle the Pythagorean theorem holds for the lengths of the sides of the triangle, $$a^2 + b^2 = c^2$$.

There is a more general formula that holds for any triangle, regardless of $$\theta$$. It is called the “Law of cosines”:

$$a^2 + b^2 - 2ab \cos(\theta) = c^2$$

What’s important for us here is that the Law of cosines suggests a powerful and intuitive way to think about the connection between two variables: as an angle! When two variables are not connected, the angle between them is $$\theta = 90^\circ$$, as in a right triangle. However, when the two variables are connected, the angle will be something else, say $$30^\circ$$ or $$140^\circ$$. What statisticians call the “correlation coefficient” of two variables is really nothing more than the cosine of the angle between them.
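The cosine claim is easy to verify numerically: center the two variables, then compute the cosine of the angle between them treated as vectors; it matches cor(). The variables below are made up for illustration:

```r
set.seed(5)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)    # made-up, connected variables

xc <- x - mean(x)           # centering: the angle is measured
yc <- y - mean(y)           # between the centered vectors

cos_theta <- sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
c(cosine = cos_theta, correlation = cor(x, y))   # identical
```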

Simulations are often used to plan for readiness, risk, or resource allocation. Here is a problem taught in an operations research course at the US Air Force Academy:

You are the Officer in Charge (OIC) of the 74th AMU (Aircraft Maintenance Unit). You are in charge of 24 A-10s and 220 Airmen. On a typical day, 10 aircraft are scheduled to fly in the morning and 8 of the same aircraft are scheduled to fly in the afternoon. However, 6% of aircraft break before takeoff, or abort, and 12% of the aircraft that takeoff are broken when they land. There is a 40% chance that the broken aircraft can be repaired before the afternoon flight, and a 67% chance that the aborted aircraft can be repaired before afternoon.

1. On average, of the 10 original aircraft, how many are available for the afternoon flight?
2. How many aircraft will need maintenance during the day?
3. For a day of flights, what is the average total hours of maintenance required, per day?

dag_flights contains a simulation of each day’s situation. It starts with 10 planes ready. Then it simulates a 6% chance of abort for each of those planes, with the remainder going on the morning flight. Of these, 12% break on landing. The aborted and broken-on-landing planes are sent to maintenance: about 67% of the aborted planes are ready for the afternoon flight, and similarly for 40% of the broken planes. Those ready for the afternoon flight will abort or break in the same way as in the morning.
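dag_flights itself does the simulation, but the description above can also be sketched directly in base R. The repair and breakage percentages come from the problem statement; details such as how many repaired planes join the afternoon flight are assumptions here, so the internal logic of dag_flights may differ:

```r
set.seed(6)
simulate_day <- function() {
  ready    <- 10
  abortAM  <- rbinom(1, ready, 0.06)        # 6% abort before takeoff
  AM       <- ready - abortAM               # planes flying in the morning
  brokeAM  <- rbinom(1, AM, 0.12)           # 12% broken on landing
  repaired <- rbinom(1, abortAM, 0.67) +    # aborted planes fixed in time
              rbinom(1, brokeAM, 0.40)      # broken planes fixed in time
  PM       <- (AM - brokeAM) + repaired     # available for the afternoon
  abortPM  <- rbinom(1, PM, 0.06)
  brokePM  <- rbinom(1, PM - abortPM, 0.12)
  c(PM = PM, maintained = abortAM + brokeAM + abortPM + brokePM)
}

Days <- t(replicate(10000, simulate_day()))
colMeans(Days)                     # average availability and maintenance events
mean(15.3 * Days[, "maintained"])  # average daily maintenance hours
```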

In the following diagram, AM is the number of planes in the morning flight and PM is the number of planes in the afternoon flight. brokeAM and abortAM together constitute the number of planes needing maintenance from the morning flight; similarly with brokePM and abortPM.

A simulation of five days’ flying produces this result:

set.seed(102)
sample(dag_flights, size=5)
ready abortAM AM brokeAM PM abortPM brokePM
10 1 9 1 8 2 1
10 0 10 4 8 1 0
10 1 9 1 10 2 1
10 1 9 2 8 1 0
10 0 10 1 9 0 1

Generate a large sample of days and use the result to answer the above questions by simple data wrangling.

Simulation <- sample(dag_flights, size=1000)
1. Find the average number of planes ready for the PM flight.

2. Find the average number of planes that need maintenance. (Add up all the “abort” or “broke” columns.)

In order to calculate the number of maintenance hours needed, you need some additional information: the average number of maintenance hours for each plane that needs maintenance. Using previous maintenance records, this was found to be 15.3 hours.

1. Find the average daily number of maintenance hours required to support the AM and PM flights. (This is your answer to (2) multiplied by 15.3.)
Solution

Simulation %>%
  mutate(need_maintenance = brokeAM + brokePM + abortAM + abortPM) %>%
  summarize(PM_ready = mean(PM), num_maintained = mean(need_maintenance),
            hours = mean(15.3 * need_maintenance))
PM_ready num_maintained   hours
   9.136          3.225 49.3425

Consider this simple DAG:

dag_draw(dag_prob_21.1)
print(dag_prob_21.1)
x ~ exo(1)
y ~ x + exo(2)

The exo() function in DAG formulas means “pure exogenous noise.” The noise will have a standard deviation equal to the argument to exo(). So exo(1) will have a standard deviation of 1, while exo(2) will have a standard deviation of 2.

1. Based on the standard deviations of exo(1) and exo(2), what is the variance of the noise produced by exo(1)? Of exo(2)?

2. The y variable consists of a simple sum of x and exogenous noise from exo(2). The variance of the sum will be the sum of the variances. Based on your answers to (1), what will the variance of y be?

Generate a sample of size 1000 or so from dag_prob_21.1:

Samp <- sample(dag_prob_21.1, size=1000)
1. From Samp, use wrangling to calculate the variance of x and of y.

2. What is the R2 of the model y ~ x fitted to Samp?
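If you don’t have dag_prob_21.1 at hand, the same arithmetic can be checked with rnorm() directly; the variances and R2 follow the reasoning in (1) and (2):

```r
set.seed(7)
n <- 100000
x <- rnorm(n, sd = 1)        # plays the role of exo(1)
y <- x + rnorm(n, sd = 2)    # x plus exo(2) noise

c(var_x = var(x), var_y = var(y))   # about 1 and 1 + 4 = 5

# R-squared of y ~ x: x's 1 unit of variance out of y's 5 units
summary(lm(y ~ x))$r.squared        # about 0.2
```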

The moderndive::amazon_books data frame gives the list_price and number of pages (num_pages). Build a model list_price ~ num_pages and calculate how much of the variation in list_price comes from num_pages.

Whenever you include more explanatory variables in a model, the R2 gets bigger (or, at least, never gets smaller).

Compare the R2 of these two models trained on the Galton data.

• height ~ sex
• height ~ sex + mother + father
1. How much of the variation in height is accounted for by the mother and father? (Hint: How much did R2 increase between the two models?)

Now add one more explanatory variable to the model: nkids which gives the total number of children in the family.

1. How much of the variation in height is accounted for by nkids?
Solution
lm(height ~ sex, data=Galton) |> R2()
n k Rsquared F adjR2 p df.num df.denom
898 1 0.5101643 933.1846 0.5096176 0 1 896
lm(height ~ sex + mother + father, data=Galton) |> R2()
n k Rsquared F adjR2 p df.num df.denom
898 3 0.6396752 529.0317 0.6384661 0 3 894

Since R2 increased from 51% to 64%, father and mother account for 13% of the variance in height.

lm(height ~ sex + mother + father + nkids, data=Galton) |> R2()
n k Rsquared F adjR2 p df.num df.denom
898 4 0.640721 398.1333 0.6391117 0 4 893

nkids accounts for virtually none of the variation in height.

Here are several graphs of basic regression models with zero, one, or two explanatory variables. For each graph:

1. List the explanatory variables (if any).
2. Say whether they are quantitative or categorical.
3. For the categorical variables (if any), say how many levels they have.

Consider a data frame with two variables, x and y. Here are the box-and-whisker plots of the two variables.

1. Is there any sign of outliers in x or y? Explain what you see in the graphs that accounts for your answer.

Even when individual variables do not show outliers, there can be outliers from the relationship between the two variables, as in the following graph.

1. Speculate on how you might create a variable that indicates how far away from the relationship each point is, and use that to detect outliers from the relationship. (Hint: lm(y ~ x, data=...) and model_eval().)

The blue line in the following plot shows the model fitted by lm(). The left end of the line is pulled upward by the outliers; the right end is forced downward. Excluding the outliers from the data used for fitting would address the matter. But we have more sophisticated methods that both identify outliers and make the model robust to them. The red line shows the model fitted by MASS::rlm(), where rlm stands for “robust linear modeling.”

Here is a model with two explanatory variables, fitted with lm() and summarized with regression_summary().

Hill_racing %>%
  lm(time ~ distance + climb, data=.) %>%
  regression_summary()
term           estimate  std.error statistic p.value
(Intercept) -469.976937 32.3582241 -14.52419       0
distance     253.808295  3.7843322  67.06819       0
climb          2.609758  0.0593826  43.94821       0

Using the information from the regression summary, calculate the confidence interval on the three coefficients.
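The conventional recipe (assumed here) is estimate ± 2 × std.error for a 95% interval. Using the numbers from the summary:

```r
term      <- c("(Intercept)", "distance", "climb")
estimate  <- c(-469.976937, 253.808295, 2.609758)
std.error <- c(32.3582241, 3.7843322, 0.0593826)

# confidence interval by hand: estimate plus-or-minus 2 standard errors
data.frame(term,
           .lwr  = estimate - 2 * std.error,
           .coef = estimate,
           .upr  = estimate + 2 * std.error)
```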

1. Explain the difference between the variation of a variable and the sampling variation of a summary.

2. Does the variance in a variable change substantially as more data is collected, say increasing the sample size $$n$$ by a factor of 10? If so, by how much?

3. Does the sampling variation of a summary change substantially as more data is collected, say increasing the sample size $$n$$ by a factor of 10? If so, by how much?
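One way to see the distinction in (2) and (3) numerically, using made-up data: the variance of a variable barely moves as n grows, while the sampling variation of the mean shrinks by about sqrt(10) when n grows tenfold.

```r
set.seed(8)
pop <- rnorm(1e6, mean = 50, sd = 10)

# (2) variance of the variable: stable as n grows
c(var(sample(pop, 100)), var(sample(pop, 1000)))   # both near 100

# (3) sampling variation of the mean: shrinks with n
se_of_mean <- function(n) sd(replicate(2000, mean(sample(pop, n))))
c(se_of_mean(100), se_of_mean(1000))               # ratio about sqrt(10)
```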

Write each of the following intervals in both [top, bottom] and center $$\pm$$ spread form using a sensible number of digits.

1. $$362.231 \pm 15.90632$$
2. $$[29.313, 75.0824]$$
3. $$0.000234 \pm 0.14296$$

Assuming that each of the above intervals is a confidence interval, fill in the following table. (Use the sensible number of digits from your previous replies.)

|    | margin of error | standard error |
|----|-----------------|----------------|
| a. |                 |                |
| b. |                 |                |
| c. |                 |                |
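The conversions in both directions are one-line arithmetic. A sketch, assuming 95% confidence intervals so that the margin of error is the spread and the standard error is about half of it:

```r
# center +/- spread  ->  [bottom, top]
center <- 362.231; spread <- 15.90632
round(c(bottom = center - spread, top = center + spread), 1)

# [bottom, top]  ->  center +/- spread
interval <- c(29.313, 75.0824)
c(center = mean(interval), spread = diff(interval) / 2)

# for a 95% confidence interval: margin of error = spread,
# standard error = spread / 2
```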

Fit the model outcome ~ expenditure and give confidence intervals on the coefficients. Note that participation has to do with the willingness to fill in a survey.

1. Filter out those with participation < 50
2. Filter out those with participation > 50
[Figure-generating code omitted: the left panel shows 30 jittered samples of a variable, each marked in red with its mean and a mean ± 2 SD interval; the right panel shows the spread of the 30 sample means, with a violin of their sampling distribution.]

A 1995 article recounted an incident with a scallop fishing boat. In order to protect the fishery, the law requires that the average weight of scallops caught be larger than 1/36 pound. The particular ship involved returned to port with 11,000 bags of frozen scallops. The fisheries inspector randomly selected 18 bags as the ship was being unloaded, finding the average weight of the scallops in each of those bags. The resulting measurements are displayed below, in units of 1/36 pound. (That is, a value of 1 is exactly 1/36 pound while a value of 0.90 is $$\frac{0.90}{36}=0.025$$ pound.)

Sample <- tibble::tribble(
  ~ scallops,
  0.93, 0.88, 0.85, 0.91, 0.91, 0.84, 0.90, 0.98, 0.88,
  0.89, 0.98, 0.87, 0.91, 0.92, 0.99, 1.14, 1.06, 0.93)
lm(scallops ~ 1, data = Sample) |> conf_interval(level=0.99)
term             .lwr     .coef      .upr
(Intercept) 0.8802122 0.9316667 0.9831212

If the average of the 18 measurements is below 1.0, a penalty is imposed. For instance, an average of 0.97 leads to 40% confiscation of the cargo, while 0.93 and 0.89 incur 95-percent and 100-percent confiscation, respectively.

The inspection procedure—select 18 bags at random and calculate the mean weight of the scallops therein, penalize if that mean is below 1/36 pound—is an example of a “standard operating procedure.” The government inspector doesn’t need to know any statistics or make any judgement. Just count, weigh, and find the mean.

Designing the procedure presumably involves some collaboration between a fisheries expert (“What’s the minimum allowed weight per scallop? I need scallops to have a fighting chance of reaching reproductive age.”), a statistician (“How large should the sample size be to give the desired precision? If the precision is too poor, the penalty will be effectively arbitrary.”), and an inspector (“You want me to sample 200 bags? Not gonna happen.”)

A. Which of the numbers in the above report correspond to the mean weight per scallop (in units of 1/36 pound)?

There is a legal subtlety. If the regulations state, “Mean weight must be above 1/36 pound,” then those caught by the procedure have a legitimate claim to insist that there be a good statistical case that the evidence from the sample reliably relates to a violation.

B. Which of the numbers in the above report corresponds to a plausible upper limit on what the mean weight has been measured to be?

Back to the legal subtlety …. If the regulations state, “The mean weight per scallop from a random sample of 18 bags must be 1/36 pound or larger,” then the question of evidence doesn’t come up. After all, the goal isn’t necessarily that the mean be greater than 1/36th pound, but that the entire procedure be effective at regulating the fishery and fair to the fishermen. Suppose that the real goal is that scallops weigh, on average, more than 1/34 of a pound. In order to ensure that the sampling process doesn’t lead to unfair allegations, the nominal “1/36” minimum might reflect the need for some guard against false accusations.

C. Transpose the whole confidence interval to where it would be if the target were 1/34 of a pound (that is, $$\frac{1.06}{36}$$). Does the confidence interval from a sample of 18 bags cross below 1.0?

An often-heard critique of such procedures is along the lines of, “How can a sample of 18 bags tell you anything about what’s going on in all 11,000 bags?” The answer is that the mean of 18 bags—on its own—doesn’t tell you how the result relates to the 11,000 bags. However, the mean with its confidence interval does convey what we know about the 11,000 bags from the sample of 18.

D. Suppose the procedure had been defined as sampling 100 bags, rather than 18. Using the numbers from the above report, estimate in $$\pm$$ format how wide the confidence interval would be.
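A rough calculation for (D), assuming the interval half-width scales as 1/sqrt(n):

```r
# half-width of the 99% interval from the 18-bag sample above
half_18  <- (0.9831212 - 0.8802122) / 2   # about 0.0515

# widths scale like 1/sqrt(n), so with 100 bags:
half_100 <- half_18 * sqrt(18 / 100)
half_100                                   # about 0.022
```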

Consider a situation where a measurement, say the number of highway deaths in Minnesota, is seen to increase from one year to the next. Perhaps the data look like this:

Highway <- tibble::tribble(
~ year, ~ deaths,
2021, 403,
2022, 430,
)

A. Do you think these data are sufficient for concluding that there is an upward trend in highway deaths? (By “trend,” we mean an increase that is likely to persist into the future.) Explain why or why not.

I think many people would not see much of a trend in these data. The question we want to address here is whether statistical methods would or would not point to a trend.

For technical reasons that are not important to us here, an appropriate method for building a model of counts is “poisson regression.” This works very much like lm() but is invoked differently.

glm(deaths ~ year, data=Highway, family=poisson) |> conf_interval()
Waiting for profiling to be done...
term .lwr .coef .upr
(Intercept) -400.0723002 -125.0601785 149.54320
year -0.0709928 0.0648486 0.20089

B. Based on the above report, do you think the statistical report points clearly to an upward trend, or not? Explain your reasoning.

For demonstration purposes only, we are going to make up data to see how much data is needed to give compelling (statistical) evidence of an upward trend. To that end, open an Rmd document and create a new chunk. In that chunk, paste the code from both of the above chunks. Run the code to make sure you get the same confidence interval as shown above. Then extend the data with a third row, for the year 2023, choosing a value for deaths that continues the pattern, and refit the model.

C. Does the confidence interval calculated on the model fitted to the extended data point clearly to a trend?

Now add a fourth row to the data, for the year 2024. You will try different values for deaths in 2024 to see what the confidence interval has to say.

D. Find the largest number of deaths for 2024 that creates a confidence interval that does not point clearly to a trend.

E. Find the smallest number of deaths for 2024 such that the confidence interval does point clearly to a trend.
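A self-contained sketch of the workflow in (C)-(E). The 2023 and 2024 counts below are invented for illustration, not data; glm() with family=poisson and a profile confidence interval are the same tools as above:

```r
# Made-up continuation of the death counts (hypothetical values)
Highway <- data.frame(year = 2021:2024,
                      deaths = c(403, 430, 455, 480))

mod <- glm(deaths ~ year, data = Highway, family = poisson)
ci  <- suppressMessages(confint(mod))
ci["year", ]   # a trend is "clear" when both ends are above zero
```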

The data frame math300::Gilbert has a row for each of 1641 shifts at a VA hospital. (For the story behind the data, give the command ?Gilbert.) The variable death records whether or not a death occurred on that shift, while gilbert records whether nurse Kristen Gilbert was on duty.

A. Using mutate(), transform death to a zero-one variable where 1 indicates that a death occurred on the shift.

B. Using the data from (A), fit the regression model death ~ gilbert and extract the confidence interval on the gilberton_duty coefficient.

i. The coefficient on gilberton_duty indicates that Gilbert being on duty is associated with a 13.1 percentage point increase in the risk of a death during the shift. Gilbert was on duty for 257 shifts. Using the coefficient (.coef), how many deaths can reasonably be attributed to Gilbert?
ii. The .lwr and .upr bounds of the confidence interval suggest a range in the number of deaths that these data indicate can be attributed to Gilbert. What is that range?

How would you explain the

The following table refers to a model with explanatory variables dist and accuracy. The table shows the .output of the model for each of four combinations of the explanatory variables.

model_eval(mod, dist=c(100,200), accuracy = c(50,60)) |> select(-.lwr, -.upr)
dist accuracy .output
100 50 -42.14096
200 50 -21.52042
100 60 -39.59864
200 60 -18.97810

A. At an accuracy of 50, what is the effect size of the .output with respect to dist? (Be sure to take into account both the difference in .output and the difference in dist.)

B. At a distance of 100, what is the effect size of the .output with respect to accuracy?
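The arithmetic for (A) and (B), using the four .output values from the table (a hand check, not the model_eval() workflow):

```r
# effect size = change in .output / change in the explanatory variable
dist_effect <- ((-21.52042) - (-42.14096)) / (200 - 100)   # at accuracy 50
acc_effect  <- ((-39.59864) - (-42.14096)) / (60 - 50)     # at dist 100
c(dist = dist_effect, accuracy = acc_effect)
```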

Price_mod is a model of the sales price at auction of antique clocks, as a function of the age of the clock and the number of bidders involved in the auction.

Price_mod <- lm(price ~ age + bidders, data = Clock_auction)

Use the following table of the evaluated model to answer the following questions.

model_eval(Price_mod, age=c(100,150), bidders=c(5,10)) |> select(-.lwr, -.upr)
age bidders .output
100 5 507.2968
150 5 1061.6295
100 10 827.4311
150 10 1381.7638

A. What is the effect size of the model .output with respect to age?

B. What is the effect size of the model .output with respect to bidders?

C. age is in years and .output is in USD. bidders is a pure number representing the number of people bidding. But rather than calling the units “people,” let’s call it “bidders.”

i. What are the units of the answer to (A)?
ii. What are the units of the answer to (B)?

D. Explain, in everyday terms, the commercial meaning of the effect size of price with respect to bidders.

The Boston Marathon is the oldest annual marathon in the US, starting in 1897. The winning runs each year are recorded in the math300::Boston_marathon data frame. In this exercise, you are going to look at the records for 1990 and earlier. (In another Lesson, we will look at what the post-1990 data have to say.)

To start, create a data frame Bos with just the 1990 and earlier data. This is ordinary data wrangling, not modeling.

Bos <- Boston_marathon %>% filter(year <= 1990)
head(Bos) # to see what the data frame looks like
year name country time sex minutes
1990 Gelindo Bordin Italy 02:08:19 male 128.3167
1989 Abebe Mekonnen Ethiopia 02:09:06 male 129.1000
1988 Ibrahim Hussein Kenya 02:08:43 male 128.7167
1987 Toshihiko Seko Japan 02:11:50 male 131.8333
1986 Robert de Castella Australia 02:07:51 male 127.8500
1985 Geoff Smith England 02:14:05 male 134.0833

Use the minutes variable to represent the winners’ running duration.

A. Create a graphic showing how the race times have varied over the years. You just need to fill in the blanks in this scaffold, but you’ll have to decide which variable should be matched to which aesthetic (that is, x, y, color).

ggplot(Bos, aes(x=______, y=_____, color=____)) +
geom_point()
ggplot(Bos, aes(x=year, y=minutes, color=sex)) +
geom_point()

B. Now to fit a simple regression model to the data. You can use these commands:

mod1 <- lm(minutes ~ year, data=Bos)
model_plot(mod1, x = year, color=sex)
Warning: Ignoring unknown aesthetics: fill

i. Comment on how well mod1 matches the data. In particular, are there ways in which the model fails to capture the pattern of change of the winning race time over the years?
ii. Whatever misgivings you have from (i), calculate the effect size of year on minutes. The following command does the basic calculation. The effect size is the rate of change of the .output with respect to year. Note that the effect size is the same for female and male runners; sex isn't included in mod1.

model_eval(mod1, year=c(1980, 1990), sex=c("female", "male")) |> select(-.upr, -.lwr)
year sex     .output
1980 female 141.9835
1990 female 140.1760
1980 male   141.9835
1990 male   140.1760

iii. Using the effect size you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.
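The prediction in (iii) can be carried out with simple arithmetic from the table: the slope of mod1 is the change in .output divided by the change in year, and it applies equally to both sexes.

```r
slope <- (140.1760 - 141.9835) / (1990 - 1980)   # about -0.18 min/year

# extrapolate 30 years from the assumed 1990 winning times
c(female = 145 + 30 * slope, male = 130 + 30 * slope)
```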

C. Repeat the steps in (B), but for a model that does include sex as an explanatory variable: minutes ~ year + sex. (Call the model mod2.) Now the .output will be different for the females and the males.

i. Draw a graph of the model, using the same commands as in (B), but for mod2.
ii. Describe ways in which the model grossly mis-represents the data.
iii. Calculate the effect size for females and males using the model_eval() command from (B), but with mod2. Do the effect sizes differ for males and females?
iv. Using the effect size(s) you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.

D. We will do this one more time, but for a slightly different model form: minutes ~ year * sex. (Call the model mod3.) Note that for the tilde expression in this model, * is used to join the explanatory variables rather than +. That is very important.

i. Draw a graph of the model, using the same commands as in (B) and (C), but for mod3.
ii. mod3 is much closer to the data than either mod1 or mod2. Explain what is different about mod3 that allows it to be more faithful to the data.
iii. Calculate the effect size for females and males using the model_eval() command from (B or C), but with mod3. Do the effect sizes differ for males and females?
iv. Using the effect sizes you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.

The Boston Marathon is the oldest annual marathon in the US, starting in 1897. The winning runs each year are recorded in the math300::Boston_marathon data frame. In this exercise, you are going to look at the records for 1990 and earlier. (In another Lesson, we will look at what the post-1990 data have to say.)

To start, create a data frame Bos with just the 1990 and earlier data. This is ordinary data wrangling, not modeling.

Bos <- Boston_marathon %>% filter(year <= 1990)
head(Bos) # to see what the data frame looks like
year name country time sex minutes
1990 Gelindo Bordin Italy 02:08:19 male 128.3167
1989 Abebe Mekonnen Ethiopia 02:09:06 male 129.1000
1988 Ibrahim Hussein Kenya 02:08:43 male 128.7167
1987 Toshihiko Seko Japan 02:11:50 male 131.8333
1986 Robert de Castella Australia 02:07:51 male 127.8500
1985 Geoff Smith England 02:14:05 male 134.0833

Use the minutes variable to represent the winners’ running duration.

A. Create a graphic showing how the race times have varied over the years. You just need to fill in the blanks in this scaffold, but you’ll have to decide which variable should be matched to which aesthetic (that is, x, y, color).

ggplot(Bos, aes(x=______, y=_____, color=____)) +
geom_point()
ggplot(Bos, aes(x=year, y=minutes, color=sex)) +
geom_point()

B. Now to fit a simple regression model to the data. You can use these commands:

mod1 <- lm(minutes ~ year, data=Bos)
model_plot(mod1, x = year, color=sex)

i. Comment on how well mod1 matches the data. In particular, are there ways in which the model fails to capture the pattern of change of the winning race time over the years?
ii. Whatever misgivings you have from (i), calculate the effect size of year on minutes. The following command does the basic calculation. The effect size is the rate of change of the .output with respect to year. Note that the effect size is the same for female and male runners; sex isn't included in mod1.

model_eval(mod1, year=c(1980, 1990), sex=c("female", "male")) |> select(-.upr, -.lwr)
 year sex     .output
 1980 female 141.9835
 1990 female 140.1760
 1980 male   141.9835
 1990 male   140.1760
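As a check on the arithmetic, the rate of change can be computed by hand in base R. This is just a sketch; the two .output values are copied from the model_eval() display above.

```r
# Sketch: the effect size is the rate of change of .output with respect to year.
# These values are copied from the model_eval() display above.
out_1980 <- 141.9835
out_1990 <- 140.1760
effect_size <- (out_1990 - out_1980) / (1990 - 1980)  # minutes per year
effect_size
```

The negative sign indicates that, according to mod1, the winning time falls by a fraction of a minute each year.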

iii. Using the effect size you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.

C. Repeat the steps in (B), but for a model that does include sex as an explanatory variable: minutes ~ year + sex. (Call the model mod2.) Now the .output will be different for the females and the males.

i. Draw a graph of the model, using the same commands as in (B), but for mod2.
ii. Describe ways in which the model grossly mis-represents the data.
iii. Calculate the effect size for females and males using the model_eval() command from (B), but with mod2. Do the effect sizes differ for males and females?
iv. Using the effect size(s) you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.

D. We will do this one more time, but for a slightly different model form: minutes ~ year * sex. (Call the model mod3.) Note that for the tilde expression in this model, * is used to join the explanatory variables rather than +. That is very important.

i. Draw a graph of the model, using the same commands as in (B) and (C), but for mod3.
ii. mod3 is much closer to the data than either mod1 or mod2. Explain what is different about mod3 that allows it to be more faithful to the data.
iii. Calculate the effect size for females and males using the model_eval() command from (B or C), but with mod3. Do the effect sizes differ for males and females?
iv. Using the effect sizes you just calculated, predict what the model value will be in the year 2020. You can assume the 1990 winning times are 145 minutes and 130 minutes respectively for females and males.
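To see concretely what `*` does, here is a minimal base-R sketch on synthetic data (not the Boston data). The `+` model forces one shared slope for year; the `*` model adds a `year:sexmale` coefficient that lets each sex have its own slope.

```r
# Synthetic illustration of `+` versus `*` in a model formula.
# The two groups are built with deliberately different slopes (-2 and -1).
Toy <- data.frame(
  year = rep(1:10, times = 2),
  sex  = rep(c("female", "male"), each = 10)
)
set.seed(1)
Toy$minutes <- ifelse(Toy$sex == "female",
                      150 - 2 * Toy$year,    # female slope: -2
                      140 - 1 * Toy$year) +  # male slope:   -1
               rnorm(20, sd = 0.1)
coef_plus <- coef(lm(minutes ~ year + sex, data = Toy))  # one shared year slope
coef_star <- coef(lm(minutes ~ year * sex, data = Toy))  # extra year:sexmale term
names(coef_star)
```

Fitting the marathon model works the same way: lm(minutes ~ year * sex, data = Bos).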

In Question 24-3 we looked at precipitation in California using the model specification precip ~ orientation + altitude + distance. We concluded that distance didn’t have much to say about precip. So in this Question we will drop distance from the model.

In addition, we will make another change. If you have ever hiked near the crest between two mountains, you might have noticed that the vegetation can be substantially different from one side of the crest to another. We would like to create models that take this into account: one model for the “W” orientation and another for the “L” orientation. (We will also get rid of two outlier stations.)

modW <- lm(precip ~  altitude + latitude,
data = Calif_precip |> filter(orientation=="W", station != "Cresent City"))
modL <- lm(precip ~  altitude + latitude,
data = Calif_precip |> filter(orientation=="L", station != "Tule Lake"))

Here are graphs of the two models:


A. You can see that the W model and the L model are very different. One difference is that the precipitation is much higher for the W stations than the L stations. How does the higher precipitation for W show up in the graphs? (Hint: Don’t overthink the question!)

B. Another difference between the models has to do with the confidence bands. The bands for the L stations are pretty much flat while those for the W stations tend to slope upwards.

i. What feature of the altitude confidence intervals for modW and modL corresponds to this difference?
ii. Calculate R^2^ for both the L model and the W model. What do the different values of R^2^ suggest about how much of the explanation of precip is accounted for by each model?
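If an R^2^ helper isn't at hand, base R's summary() reports it directly. A minimal sketch on the built-in cars data (the modW and modL fits follow the same pattern):

```r
# Sketch: extracting R-squared from a fitted lm model with base R.
# Shown on the built-in `cars` data; apply the same idea to modW and modL.
toy_mod <- lm(dist ~ speed, data = cars)
r2 <- summary(toy_mod)$r.squared
r2
```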

Your boss wants to know the effect of ambient temperature on the range of electric vehicles. You borrow her Tesla each month over the next year. Each month, you drive the Tesla until it needs recharging, taking note of the mileage and the outdoor temperature. Altogether, you have a sample of size $n=12$ from which you build a model range ~ temperature. The confidence interval on the effect size of temperature on range is $-2.4 \pm 4.0$ miles per degree C.

From this confidence interval, your boss concludes that lower temperatures decrease the range.

A. Explain to your boss why this conclusion is not justified by the data.

B. What sample size would be required to reduce the margin of error from $\pm 4.0$ to $\pm 1$ miles per degree C?

C. You need to warn your boss that the larger sample, while it will reduce the margin of error, won’t necessarily lead to a justified conclusion that lower temperatures decrease the range. Make up several different confidence intervals that might plausibly result from the larger sample, some of which point to the possibility that the conclusion might be that lower temperatures increase range.
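As a reminder of the arithmetic behind (B): the margin of error shrinks roughly in proportion to $1/\sqrt{n}$, so cutting it by a factor $k$ requires about $k^2$ times as much data. A sketch, using the $\pm 4.0$ margin from the interval above:

```r
# Sketch: margin of error scales roughly as 1/sqrt(n),
# so shrinking it by a factor k requires about k^2 times the data.
n_old   <- 12
moe_old <- 4.0   # from the confidence interval -2.4 +/- 4.0
moe_new <- 1.0
n_new <- n_old * (moe_old / moe_new)^2
n_new   # 12 * 16 = 192
```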

Just FYI … A better design for the study would be to make several measurements in July and several in January so that your measurements will come from the extremes of temperature rather than the in-the-middle months like October or April.

Here are several graphs of basic regression models with two explanatory variables. For each graph, say whether the model specification includes an interaction between the two explanatory variables.

The following graphs show confidence bands for the same model fitted to two samples of different sizes.

A. Do the two confidence intervals (red and blue) plausibly come from the same model? Explain your reasoning.

B. Which of the confidence intervals comes from the larger sample, red or blue?

C. To judge from the graph, how large is the larger sample compared to the smaller one?

The following graphs show prediction bands for the same model fitted to two samples differing in size by a factor of 16.

A. Which of the prediction intervals comes from the larger sample, red or blue? Explain your reasoning.

B. To judge from the graph, how large is the larger sample compared to the smaller one?

In the graph, a confidence interval and a prediction interval from the same model are shown.

A. Which is the confidence interval, red or blue?

B. To judge from the graph, what is the standard deviation of the residuals from the model?

The graph shows a confidence interval and a prediction interval.

A. Which is the confidence interval, red or blue?

B. Do the confidence interval and the prediction interval come from the same sample of the same system? Explain your reasoning.

An industrial process for making solar photovoltaics involves printing layers of doped silicon onto a large plastic substrate. The integrity of the final product varies from run to run. You are the manager of the production line and have asked the quality control technicians to measure the atmospheric humidity for each run to check if humidity is related to product integrity. Integrity must be at least 14 for the output of the product run to be accepted.

The graph shows the data from 50 production runs along with a prediction band from a model trained on the data.

As manager, you have decided that the probability of the integrity being above 14 must be at least 75% in order to generate an appropriate production quantity without wasting too much material from the rejected production runs.

A. What is the interval of acceptable humidity levels in order to meet the above production standards?

B. As more production runs are made, more data will be collected. Based on what you know about prediction bands, will the top and bottom of the band become closer together as more data accumulates?

IN DRAFT… ## Exercises

1. Translate Figure 25.1 to a ribbon plot.

2. Make a DAG data set with 200 points, say y ~ x. Count how many points lie outside the prediction band. It should be about 10.

Using training data as input to model_eval().

3. Show several prediction intervals from different models, some of which have been calculated at a level different from 95%. Which ones are these? (Use different relationships, varying up the slope and exo() component to y.) Using training data as input to model_eval().
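A sketch of draft item 2, using base R's predict() in place of model_eval() on seeded synthetic data (the count will vary with the seed):

```r
# Simulate a straight-line y ~ x data set, then count how many of the
# 200 points fall outside the 95% prediction band.
set.seed(3)
n <- 200
x <- runif(n)
y <- 2 + 3 * x + rnorm(n)
mod <- lm(y ~ x)
# predict() warns when a prediction interval is evaluated on the training data
pb <- suppressWarnings(predict(mod, interval = "prediction"))  # level = 0.95
outside <- sum(y < pb[, "lwr"] | y > pb[, "upr"])
outside   # close to 5% of 200 on average
```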

The data frame math300::Ticket_sales comes from the operations research 310 course at the US Air Force Academy. It is used as part of a lesson on “seasonal adjustment” of time series in management. Sales of a product often vary in a seasonal pattern. For instance, September and December are often high-sales months, while February and March are low points. In modeling sales as a function of time, it is useful to be able to capture explicitly the seasonal patterns.

The Ticket_sales data frame records sales over only one month, so there is no clear seasonal trend. However, there is a weekly trend; sales tend to be higher on the weekend. In this exercise you will construct and evaluate predictive models based on the Ticket_sales data.

To set things up, construct these four models and fit to the Ticket_sales data. In each model, the response variable is sales, which is the quantity that we want to use the model to predict.

modA <- lm(sales ~ 1, data = Ticket_sales)
modB <- lm(sales ~ day, data = Ticket_sales)
modC <- lm(sales ~ dow, data = Ticket_sales)
modD <- lm(sales ~ dow + day, data = Ticket_sales[-c(3,4),])

We will focus on the “seasonal” bit of the problem: how the sales differ from one day of the week to another. To simplify the commands, here is an object that lists all the days of the week.

week <- c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")

The mechanics of making a prediction from any one of the models are simple. For example, here is a prediction for sales on each of the weekdays on a day near the end of the month:

model_eval(modD, dow=week, day=28, interval="prediction")
dow       day   .output      .lwr      .upr
Monday     28  4.063776  2.634304  5.493247
Tuesday    28  3.418367  1.931522  4.905213
Wednesday  28  3.772959  2.316552  5.229367
Thursday   28  4.681122  3.231134  6.131111
Friday     28 10.089286  8.645206 11.533365
Saturday   28 10.543878  9.111730 11.976025
Sunday     28  8.302041  6.876694  9.727388

A. The .output shows differences in sales among the weekdays that are consistent with everyday experience of when people buy tickets for events. Which of the four models shows such differences clearly and which do not? Explain what’s missing in the models that fail to capture the differences from one weekday to another.

B. Look at the prediction intervals. Do you regard them as wide or as narrow? Explain why.

C. Calculate R^2^ for each of the four models. According to this criterion, which of the four models should make the best predictions?

D. One reason the prediction intervals might be poor is that sales on some days may be extraordinarily high or low, outside the “usual” pattern. You can identify such days from the residuals.

Evaluated <- model_eval(modD)

Plot the residuals (.resid) versus day. Do any days stand out as outliers?
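The residual plot can also be sketched with base R. This illustration uses synthetic data with a planted outlier, since Ticket_sales itself requires the math300 package:

```r
# Synthetic sketch: an outlier stands out in a plot of residuals vs. day.
set.seed(2)
Toy <- data.frame(day = 1:30)
Toy$sales <- 5 + 0.1 * Toy$day + rnorm(30, sd = 0.5)
Toy$sales[17] <- Toy$sales[17] + 8          # plant an outlier on day 17
toy_mod <- lm(sales ~ day, data = Toy)
plot(resid(toy_mod) ~ Toy$day)              # day 17 sits far from the rest
which.max(abs(resid(toy_mod)))              # index of the largest residual
```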

E. For simplicity, we’re going to fit the four models again, but this time excluding the days that you identified in (D) as outliers. To illustrate, suppose you thought that days 17 and 19 were outliers. (They aren’t, but just suppose.) Here’s a wrangling statement that will make a new data frame Cleaned without the outliers.

Cleaned <- Ticket_sales %>% filter(!day %in% c(17, 19))

Re-fit modD using the Cleaned data and look at the prediction intervals for the days of the week. What has changed compared to the model fitted to Ticket_sales?

F. Time for some detective work. The month recorded by Ticket_sales is not identified in the data or the documentation. Figure out what month it is. (Hints: 1) How many days are in the month recorded in Ticket_sales? 2) What familiar circumstances in the calendar might lead to exceptionally high ticket sales compared to other days of the same weekday?)

A 1990 lawsuit concerned prison overcrowding in Suffolk County, Massachusetts (part of which is in Boston). The county had just opened a new jail, sized to hold 419 inmates. The new jail was the result of a previous lawsuit about the old jail, where conditions were ruled unconscionable. Regrettably, when the new jail opened, the nightly average prisoner count was about 500. The 1990 lawsuit claimed that the sheriff had violated the court-imposed terms from the previous lawsuit through insufficient planning for the new jail.

Plans for the new jail were made based on prisoner counts available up to 1987, when the design details were finalized. The data are:

Prisoners <- tibble::tribble(
~ year, ~ count,
1980, 190,
1981, 220,
1982, 270,
1983, 310,
1984, 319,
1985, 326,
1986, 322,
1987, 370
)
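The model behind Figure 27.1 can be sketched in base R. It is an assumption here that the fitted model is a straight line, count ~ year, extrapolated to 1990 with a 95% prediction interval; the data frame is redefined with data.frame so the sketch is self-contained.

```r
# Sketch: straight-line fit to the 1980-1987 counts, extrapolated to 1990.
Prisoners <- data.frame(
  year  = 1980:1987,
  count = c(190, 220, 270, 310, 319, 326, 322, 370)
)
mod <- lm(count ~ year, data = Prisoners)
# Prediction interval for the 1990 nightly count
predict(mod, newdata = data.frame(year = 1990), interval = "prediction")
```

Here predict() plays the role of model_eval(..., interval = "prediction") used elsewhere in the Lessons.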

The sheriff argued that in 1987, when plans for the new jail were finalized, a jail size of 419 was more than ample, and that he should not be in the position of having to divine what the crime and arrest rates would be in the future. The prisoners countered that the growth pattern is clear in the data from 1980 to 1987. Figure 27.1 shows the prisoner-count data and the confidence and prediction intervals for the model fitted to the data.

A. Figure 27.1 shows one band (purple) inside another (blue). Which is the confidence interval and which is the prediction interval?

B. To judge from Figure 27.1, does the sheriff have a good case that his planning for 1990 based on the available data was responsible?

Figure 27.2 is similar to Figure 27.1, except that the single point from 1987 has been deleted. This is part of a process called “sensitivity analysis” which checks whether the results are overly sensitive to single data points.

C. Does Figure 27.2 lead you to a different conclusion than you stated in (B)?

Actual outcome: The court ruled in favor of the sheriff, noting that prisoner populations are inherently unpredictable.

Note: This exercise stems from a 1995 article by Arnold Barnett in Interfaces 25(2).

Federal regulation calls for household appliances to be labelled for energy use. An example of such a label was published with the regulations and is shown below.

Underneath the bold-face \$84 are two bars showing intervals. Are these bars confidence intervals or prediction intervals? Explain your reasoning.

Consider the data in the point plot.

As is typical, the data are a combination of signal and noise.

SIMPLE PATTERN TYPE TO DEFINE MODEL. THE CLUSTER IN THE UPPER CORNER IS NOT PART OF A SIMPLE PATTERN.