Instructor Teaching Notes for Lesson 20



Daniel Kaplan


March 7, 2023

Review of Lesson 19

Looking at the results of the divide-and-measure activity

The entire box is 14.43 “inches” long. This should be the total of the left, right, and middle measurements, but people tended to round.

Thirds <- readr::read_csv("")
Rows: 16 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Student_initials
dbl (3): left, middle, right

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Long_form <- tidyr::pivot_longer(Thirds, !Student_initials, names_to = "position")
Long_form |> group_by(Student_initials) |> summarize(tot = sum(value), v = var(value))
# A tibble: 16 × 3
   Student_initials   tot        v
   <chr>            <dbl>    <dbl>
 1 AMA               13.8 0.0208  
 2 BAK               14.5 0.271   
 3 DQS               14.4 0.416   
 4 EDS               16   0.646   
 5 EFSF              14.1 0.00750 
 6 EJD               13.8 0.271   
 7 HAM               14.2 0.188   
 8 HJB               14.3 0.000833
 9 IMW               14   0.396   
10 JK                14.2 0       
11 JTSF              14.8 0.271   
12 KDL               14.2 0.0625  
13 KZC               14   0.583   
14 RHP               14.2 0       
15 RRP               14.2 0.188   
16 SJA               14.5 1.65    
Long_form |> summarize(vmeas = var(value))
# A tibble: 1 × 1
  vmeas
  <dbl>
1 0.240
lm(value ~ position, data=Long_form) |> conf_interval()
# A tibble: 3 × 4
  term             .lwr .coef  .upr
  <chr>           <dbl> <dbl> <dbl>
1 (Intercept)     4.29  4.50  4.70 
2 positionmiddle  0.392 0.678 0.964
3 positionright  -0.126 0.159 0.445
ggplot(Long_form, aes(y=value, x=position)) + geom_jitter(width=0.2, alpha=0.5)


Remember that statistics focuses on variation in the characteristics of a set of multiple specimens. The characteristics of each individual specimen are recorded in a row of a data frame. The data frame itself, with its multiple rows, represents the set of specimens. Each characteristic is arranged as a column in the data frame. We call such columns “variables,” a name that emphasizes that our particular interest is to understand/explain/account-for the variation of the values stored in the column.

In a regression model, we attempt to understand/explain/account-for the variation in a single variable, called the “response variable.” We accomplish this explanation by associating the variation in the response with the simultaneous variation in other variables called “explanatory variables.”

The lm() model-building function does the work of quantifying the associations. Your task in model building is to provide data for training and to specify which explanatory variables you want to use to account for the variation in the response variable. The specification takes the form of a tilde expression, with the response variable on the left of the tilde and the explanatory variables on the right. All these variables must be in the data frame used for training. We say such variables are “observed.”
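As a minimal sketch of the tilde-expression specification described above, the following uses R's built-in mtcars data frame as a stand-in for training data. The choice of mtcars and of the variables mpg, wt, and hp is an assumption made only for illustration. The base-R confint() is used here in place of the conf_interval() helper that appears elsewhere in these notes.

```r
# Illustration only: mtcars stands in for the training data frame.
# Response variable: mpg. Explanatory variables: wt and hp.
# All three are columns of mtcars, that is, they are "observed."
mod <- lm(mpg ~ wt + hp, data = mtcars)

coef(mod)     # one coefficient per explanatory variable, plus the intercept
confint(mod)  # base-R confidence intervals on those coefficients
```

Adding or removing a name on the right side of the tilde changes which variables the model uses to account for variation in the response.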

There are usually other characteristics that are relevant to the system being studied that are not observed, that is, they are not in the data frame. Ignoring them is a bad idea: an unobserved characteristic can still shape the associations among the variables that are observed.

Starting Lesson 20

Today is a meta-day. It is about tools for learning about statistical methods and gaining insight into why certain kinds of questions/techniques come up over and over again as you work on genuine statistical problems.

The two kinds of tools for learning are:

  1. Tools for thinking and communicating about hypotheses about causal connections.
    • Diagrams called “DAGs” for sketching out causation.
    • Generating random, simulated data consistent with the mechanism described by a DAG.
  2. Ways to automate the process of random trials. This is purely a labor-saving measure. You are not responsible for writing the code for this automation, but you should learn to read it well enough to say what is going on.

Causation examples

  1. Systolic blood pressure in the elderly:
    • Experiment shows that lowering SBP reduces mortality.
    • Observation shows that lower SBP is associated with increased mortality.
  2. Congressional elections
    • Among incumbents, higher election spending is associated with worse vote outcomes.
  3. Vitamin D and disease
    • Low vitamin D is linked to adverse outcomes in many diseases.
    • Ill people go outside less often, so they get less sunlight. Vitamin D is also an acute phase reactant: it declines as inflammatory cytokines rise in acute and chronic disease. Randomized trials show no evidence that vitamin D supplementation lessens mortality risk in such conditions.
    • Bring up article
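The vitamin D example can be mimicked with a small simulation. This is a hedged sketch, not an analysis of real data: the variable names (illness, vitamin_d, risk) and all the coefficients are hypothetical. Illness lowers vitamin D and raises risk, while vitamin D is given no effect on risk at all.

```r
# Hypothetical mechanism, made-up coefficients: illness drives both variables.
set.seed(101)
n <- 10000
illness   <- rnorm(n)                   # severity of illness
vitamin_d <- -1.0 * illness + rnorm(n)  # illness lowers vitamin D
risk      <-  1.0 * illness + rnorm(n)  # illness raises risk; vitamin D plays no role

# Leaving illness out of the model: vitamin D appears protective.
naive <- coef(lm(risk ~ vitamin_d))
# Including illness: the vitamin D coefficient shrinks toward zero.
adjusted <- coef(lm(risk ~ vitamin_d + illness))

naive
adjusted
```

Under these assumptions the naive slope is about -0.5 in expectation (covariance -1 divided by variance 2), even though the simulation contains no causal link from vitamin D to risk.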

Directed acyclic graphs (DAGs)

A DAG is a format for writing down which characteristics, either observed or unobserved, are important in the operation of a system.

A good dictionary definition of “system” is:

A set of things working together as parts of a mechanism or an interconnecting network.

Graphs, Directed, Acyclic

Sampling from DAGs

In Math 300Z, DAGs have been augmented with a simulation mechanism. This consists of formulas that are invoked to create each variable in the DAG.
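To give a feel for what "formulas that are invoked to create each variable" might look like, here is a base-R sketch of a hypothetical two-node DAG, x -> y. The function name sim_dag and the coefficients are assumptions for illustration. This is not the course's own simulation machinery.

```r
# Hypothetical DAG: x -> y. Each variable gets its own formula.
sim_dag <- function(n) {
  x <- rnorm(n)                # x has no parents: pure random noise
  y <- 1.5 * x + 4 + rnorm(n)  # y is computed from its parent x, plus noise
  data.frame(x = x, y = y)
}

set.seed(300)
Sim <- sim_dag(500)  # each row is one simulated specimen
head(Sim)

# A model trained on the simulated data should roughly recover the formula for y.
coef(lm(y ~ x, data = Sim))
```

Because the mechanism is known exactly, a simulation like this lets you check whether a modeling technique recovers what was built in.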

Activity: Life Savers

Repeating trials

foo <- do(100000)*sum(runif(10))
ggplot(foo, aes(x=" ", y = sum)) + geom_violin(alpha=0.5)
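For readers who want to see the same repeat-a-trial idea without the do() helper, base R's replicate() is one rough equivalent. This sketch uses fewer trials (10000) than the line above, purely to keep it quick.

```r
# Each trial: add up 10 uniform random numbers between 0 and 1.
set.seed(20)
sums <- replicate(10000, sum(runif(10)))

length(sums)  # one result per trial
mean(sums)    # theory: 10 * 0.5 = 5
var(sums)     # theory: 10 * (1/12)
```

The output is a plain numeric vector rather than a data frame, so it needs to be wrapped with data.frame(sum = sums) before being passed to ggplot().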

Trials <- do(100) * {lm(x ~ y, data=sample(dag03, size=5)) |> R2()}
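Reading the line above from the inside out: draw a sample of 5 rows, fit a model, extract R², and repeat 100 times. The sketch below follows the same steps in base R. It does not use dag03. Instead it assumes a hypothetical mechanism in which x and y are unrelated, and one_trial is an illustrative name.

```r
# One trial: simulate a tiny sample, fit a model, return R-squared.
# Assumption: x and y are generated independently (NOT the dag03 mechanism).
one_trial <- function(n = 5) {
  x <- rnorm(n)
  y <- rnorm(n)
  summary(lm(x ~ y))$r.squared
}

set.seed(5)
r2 <- replicate(100, one_trial())  # 100 repeated trials
summary(r2)                        # spread of R-squared across trials
```

Even with no connection between x and y, R² from a sample of size 5 is often far from zero, which is the kind of pattern repeated trials make visible.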