Activities D: Tutorial 4 – UATX Computing Tutorials: Data and Modeling

Activity D.1 Table D.1 looks at a model of penguin bill length versus flipper length, sex, and species. The only point of the model is to generate confidence intervals for use in the multiple-choice questions that follow.

Table D.1: Confidence intervals in [lower, upper] format.

Penguins |>
  model_train(bill_length ~ flipper + sex + species) |>
  conf_interval()

# A tibble: 5 × 4
  term                .lwr  .coef   .upr
  <chr>              <dbl>  <dbl>  <dbl>
1 (Intercept)      10.2    18.2   26.2  
2 flipper           0.0580  0.101  0.143
3 sexmale           2.44    3.00   3.57 
4 speciesChinstrap  8.74    9.43  10.1  
5 speciesGentoo     4.70    5.97   7.25

What’s the margin of error for the speciesGentoo coefficient?

What’s the margin of error for the sexmale coefficient?

What’s the margin of error for the flipper coefficient?

What’s the margin of error for the (Intercept) coefficient?

Notice that each of the margins of error has been written to two significant digits. This is always appropriate for a standard error. Report the center of the confidence interval to the same last significant digit.

Report the confidence interval on the speciesChinstrap coefficient using the $\pm$ format with the appropriate number of digits per the rule just stated.

Activity D.2 The data frame UK_heights (source) gives the average height of males and females from the period 1998 to 2022.

Active R chunk D.1

Active R chunk D.1 takes the mean of the mean heights from 1998 to 2022, and also looks at the mean heights of the children in Galton. (The 2.54 in the mutate() line converts Galton’s inches to today’s cm.)

To judge from the results of Active R chunk D.1, how do Galton’s heights differ from those today?

Contemporary people are taller than in Galton’s day.

Contemporary people are shorter than in Galton’s day.

No meaningful comparison can be made, since no statement is given of the precision of the estimates.

question id: UK-height-a

The wrangling commands in Active R chunk D.1 are pretty simple, but doing the calculation using modeling gives additional information, as in Active R chunk D.2:

Active R chunk D.2

As regards women’s heights, what conclusion can be properly drawn from the results of Active R chunk D.2?

Today’s women are discernably taller than in Galton’s day.

Today’s women are discernably shorter than in Galton’s day.

There is no discernible difference in women’s heights today versus in Galton’s day.

question id: UK-height-b

Active R chunk D.2 shows the difference between mens’ and womens’ average height: on average, men are about 13 cm taller than women. What can you properly conclude about the difference between mens’ and womens’ heights today versus in Galton’s day?

Compared to women, today’s men are discernably taller than in Galton’s day.

Compared to women, today’s men are discernably shorter than in Galton’s day.

There is no discernible difference in men’s heights today versus in Galton’s day.

question id: UK-height-c

The confidence intervals in Active R chunk D.2 are wider for Galton’s people than for people today. There are several possible reasons, but a simple possibility is that the sample size for Galton was much smaller than for the the collection of samples from 1998 to 2022. Assuming that this simple possibility is the right one, roughly how big is the modern sample (taking all the different years together)? Hint: You can find out the sample size in Galton by a simple computation.

Keep in mind that the modern data involve 24 different yearly samples, so the sample size in each year is considerably smaller than the 24-year total.

Active R chunk D.3 looks at the overall trend in height from 1998 to 2022, both graphically and quantitatively.

Active R chunk D.3

Judging from Active R chunk D.3, is there a discernable upward trend in height from 1998 to 2022? Explain your reasoning, taking note of the thickness of the confidence bands.

Intercepts can be tricky to interpret, particularly when there is a quantitative explanatory variable (like year). Notice that in the results from Active R chunk D.3, the intercept has a huge margin of error: $\pm 15$ cm. Speculate about why. (Hint: The Romans might be involved.)

Activity D.3 The graph shows the grade-point average (GPA) of about 5200 graduates of UT Austin. The blue bars mark the confidence interval on the school-by-school mean of the GPA.

Some of the intervals are long, some short. Explain why.

Wait a minute! The GPA is already an average. Does it make sense to compute an mean of an average?

No. The mean GPA is the same as the GPA.

Yes. A “mean” is in some subtle way different from an “average.”

No. You can’t take the mean of an average.

Yes. The “mean” is across all students, the GPA is for one student.

question id: 4-utsat-CI-2

As a hint, here’s a direct calculation that you can try.

Activity D.4

Nature always seems trying to talk to us as if she had some great secret to tell. — John Lubbock

The simulation named noisy_process() contains a secret. Your job is to uncover it, to break the code.

Here’s how. Generate data from noisy_process, producing a data frame with four variables: u, v, w, and x. One of these variables is a function of the other three, and you can, in principle, find the secret by looking at the coefficients on a regression model of one variable versus the other three. But you don’t yet know which of the four should be the response variable. That’s part of the secret. Another part of the secret is the formula that relates the response variable to the other variables. We can tell you a few things about it, however.

A model formula like a ~ b + c + d will produce the coefficients of the relationship among the four variables. (Probably you already figured this out: each of a, b, … corresponds to one of the variables u, v, ….)
The coefficients are all integers (that is, “whole numbers” … -3, -2, -1, 0, 1, 2, 3, 4, …).
The response variable (whatever it be) is obscured by random noise. That’s how you keep a secret! But it means your output will be somewhat different each time you run the simulation.

Loading webR...

# For knitr-R chunks
noisy_process <- datasim_make(
  u <- rnorm(n),
  v <- rnorm(n) - 2*u,
  w <- rnorm(n) + 3*v,
  x <- 1 + 3*u  - 4*v - w + 20*rnorm(n)
)

For instance, suppose you think that u is the appropriate variable to use as the response. Your model specification would be u ~ v + w + x. (The order of the explanatory variables doesn’t matter.) Here’s what I got when I ran Active R chunk D.4 with this model and a sample size of 10.

All the coefficients from this model are small, so perhaps the secret coefficients are all zero. (Although we can’t rule out -1 for the v coefficient because that’s included in the confidence interval.) How can we know for sure? Collect a bigger sample, say n = 1000 here or maybe even more. If the tighter confidence intervals don’t include any integer, then you can rule out that model.

Active R chunk D.4

Your task: Try out the simulation, using each of u, v, w, and x as the response variable in turn. Make the sample size large enough so that you can be confident whether the resulting coefficients are integers or not. In the end, you will find the appropriate response variables and the (integer) coefficients for the intercept and the explanatory variables.

What did you find?

webR Code Links

R History Command Contents