14  Simulation

Carl Wieman is a Nobel-prize-winning physicist and professor of education at Stanford University. Wieman writes, “For many years, I had two parallel research programs: one blasting atoms with lasers to see how they’d behave, and another studying how people learn.” Some of Wieman’s work on learning deals with the nature of “expertise.” He points out that experts have ways to monitor their own thinking and learning; they have a body of knowledge relevant to checking their own understanding.

This Lesson presents you with tools to help you monitor your understanding of statistical methods. The monitoring strategy is based on simulation. In a simulation, you set up a hypothetical world where you get to specify the relationships between variables. The simulation machinery lets you automatically generate data as if collected in this hypothetical world. You can then apply statistical methods, such as regression modeling, to the simulated data and check the extent to which the results, for instance model coefficients, agree with the hypothetical world you created. If they agree, the statistical method is doing its job. If not, you have discovered limits to the statistical method, and you can explore how you might modify the method to give better agreement.

Like real data, a simulation involves a mixture of signal and noise. You implement the signal by setting precise mathematical rules for the relationships between your simulation variables. The noise comes in when you modify those rules to include measured amounts of noise.

Pure noise

We regard the residuals from a model as “noise” because they are entirely disconnected from the pattern defined by the tilde expression that directs the training of the model. There might be other patterns in the data—other explanatory variables, for instance—that could account for the residuals.

For simulation purposes, it is helpful to have an inexhaustible source of noise that is guaranteed to be unexplainable by any pattern defined over any set of potential variables. We call this pure noise.

There are mathematical mechanisms that can produce pure noise, that is, noise immune to explanation. Such mechanisms are called “random number generators.” R offers many random number generators with different properties, which we will discuss in Lesson 15. In this Lesson, we will use the rnorm() random number generator just because rnorm() generates noise that looks generically like the residuals that come from models. But in principle we could use others. We use datasim_make() to construct our data-generating simulations. Here is a simulation that makes two variables consisting of pure random noise.

noise_sim <- datasim_make(
  x <- rnorm(n),
  y <- rnorm(n)
)

Once a data-generating simulation has been constructed, we can draw a sample from it of whatever size n we like:

set.seed(153)
noise_sim |> sample(n = 5)
x y
2.819099 -0.3191096
-0.524167 -1.3198173
1.194761 -2.2864853
-1.741019 -0.7891444
-0.449941 -0.8128883

Although the numbers produced by the simulation are random, they are not entirely haphazard. Each variable is unconnected to the others, and each row is independent. Collectively, however, the random values have specific properties. The output above shows that the numbers tend to be in the range -2 to 2. In Figure 14.1, you can see that the distribution of each variable is densest near zero and becomes less dense rapidly as the values go past 1 or -1. This is the so-called “normal” distribution, hence the name rnorm() for the random-number generator that creates such numbers.

Code
set.seed(106)
noise_sim |> datasim_run(n=5000) |>
  pivot_longer(c(x, y), 
               names_to = "variable_name", values_to = "value") |>
  point_plot(value ~ variable_name, annot = "violin", 
             point_ink = 0.1, size = 0)
Figure 14.1: Distribution of the x and y variables from the simulation.
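
The claim that the values rarely stray much beyond -2 to 2 can also be checked numerically. Here is a minimal sketch (not part of the original demonstration): it computes the fraction of a large sample falling within 2 units of zero, which for normally distributed noise should be roughly 95%. The exact fraction will vary from sample to sample.

noise_sim |> sample(n = 10000) |>
  summarize(frac_within_2 = mean(abs(x) <= 2))  # expect a value near 0.95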

Another property of the numbers generated by rnorm(n) is that their mean is zero and their variance is one.

noise_sim |> sample(n=10000) |>
  summarize(mean(x), mean(y), var(x), var(y))
mean(x) mean(y) var(x) var(y)
0.000382 0.00152 0.998 1
But they aren’t exactly what they ought to be!

Most people would likely agree that the means and variances in the above report are approximately zero and one, respectively, but are not precisely so.

This has to do with a subtle feature of random numbers. We used a sample size of n = 10,000, but we might equally well have used a sample of size 1. Would the mean of such a small sample be zero? If that were required, the numbers would hardly be random!

The mean of random numbers from rnorm(n) won’t be exactly zero (except very rarely, and only by chance!). But the mean will tend to get closer to zero as n gets larger. To illustrate, here are the means and variances from a sample that’s 100 times larger, n = 1,000,000:

noise_sim |> sample(n=1000000) |>
  summarize(mean(x), mean(y), var(x), var(y))
mean(x) mean(y) var(x) var(y)
0.00118 0.000586 1 0.999

The means and variances can drift far from their theoretical values for small samples. For instance:

noise_sim |> sample(n=10) |>
  summarize(mean(x), mean(y), var(x), var(y))
mean(x) mean(y) var(x) var(y)
0.342 -0.155 0.424 0.776

Recall the claim made earlier in this Lesson that rnorm() generates a new batch of random numbers every time, unrelated to previous or future batches. In such a case, the model y ~ x will, in principle, have an x-coefficient of zero. R2 will also be zero, as will the model values. That is, x tells us nothing about y. Figure 14.2 verifies the claim with an annotated point plot:

Code
noise_sim |> sample(n=10000) |>
  point_plot(y ~ x, annot = "model", 
             point_ink = 0.1, model_ink = 1, size=0.1) |>
  gf_theme(aspect.ratio = 1)
Figure 14.2: A sample of ten-thousand points from noise_sim. The round cloud is symptomatic of a lack of relationship between the x and y values. The model values are effectively zero; x has nothing to say about y.
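
The graphical check can be supplemented with a numerical one. Here is a hedged sketch using the same model_train(), conf_interval(), and R2() tools that appear later in this Lesson: with pure noise, the confidence interval on the x coefficient should (almost always) include zero, and R2 should be very close to zero. The exact numbers depend on the particular random sample.

noise_sim |> sample(n = 10000) |>
  model_train(y ~ x) |>
  conf_interval()   # the x interval should include zero

noise_sim |> sample(n = 10000) |>
  model_train(y ~ x) |>
  R2()              # Rsquared should be very near zero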

Simulations with a signal

The model values in Figure 14.2 are effectively zero: there is no signal in noise_sim. If we want a signal between variables in a simulation, we need to state the data-generating rule so that there is a relationship between x and y. For example:

signal_sim <- datasim_make(
  x <- rnorm(n),
  y <- 3.2 * x + rnorm(n)
)
Code
signal_sim |> sample(n=10000) |>
  point_plot(y ~ x, annot = "model", 
             model_ink = 1, point_ink = 0.1, size=0.1)
Figure 14.3: A sample of ten-thousand points from signal_sim where the y values are defined to be y <- 3.2 * x + rnorm(n). The cloud is elliptical and has a slant.
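
This is the monitoring strategy of the Lesson in miniature: the simulation was built with a known coefficient of 3.2 on x, so we can check whether modeling recovers it. A minimal sketch (the interval endpoints will vary with the random sample):

signal_sim |> sample(n = 10000) |>
  model_train(y ~ x) |>
  conf_interval()   # the x interval should bracket the 3.2 used in the rule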

Example: Heritability of height

Simulations can be set up to implement a hypothesis about how the world works. The hypothesis might or might not be on target. It’s not even necessary that the hypothesis be completely realistic. Still, data from the simulation can be compared to field or experimental data.

Consider the following simulation in which each row of data gives the heights of several generations of a family. The simulation will be a gross simplification, as is often the case when starting to theorize. There will be a single hypothetical “mid-parent,” who reflects the average height of a real-world mother and father. The children—“mid-children”—will have a height mid-way between real-world daughters and sons.

height_sim <- datasim_make(
  mid_grandparent <- 66.7 + 2 * rnorm(n),
  mid_parent <- 17.81 + 0.733 * mid_grandparent +  0.99 * rnorm(n),
  mid_child <- 17.81 + 0.733 * mid_parent + 0.99 * rnorm(n),
  mid_grandchild <- 17.81 + 0.733 * mid_child + 0.99 * rnorm(n)
)

Note that the formulas for the heights of the mid-parents, mid-children, and mid-grandchildren are similar. The simulation imagines that the heritability of height from parents is the same in every generation. However, the simulation has to start from some “first” generation. We use the grandparents for this.


We can easily sample four generations of simulated heights for as many families as we like:

sim_data <- height_sim |> sample(n=100000)

The simulation results compare well with the authentic Galton data:

Galton2 <- Galton |> mutate(mid_parent = (mother + father)/2)
Galton2 |> summarize(mean(mid_parent), var(mid_parent))
mean(mid_parent) var(mid_parent)
66.65863 3.066038
sim_data |> summarize(mean(mid_parent), var(mid_parent))
mean(mid_parent) var(mid_parent)
66.70651 3.098433

The mean and variance of mid_parent from the simulation closely match those from Galton. Similarly, the model coefficients agree: the intercept from the simulation corresponds to the Galton intercept plus one-half of the Galton coefficient on sexM, reflecting that the mid-child is halfway between female (F) and male (M) heights.

Mod_galton <- Galton2 |> 
  model_train(height ~ mid_parent + sex)
Mod_sim    <- sim_data |> 
  model_train(mid_child ~ mid_parent)
Mod_galton |> conf_interval()
term .lwr .coef .upr
(Intercept) 9.8042633 15.2013993 20.5985354
mid_parent 0.6520547 0.7328702 0.8136858
sexM 4.9449933 5.2280332 5.5110730
Mod_sim    |> conf_interval()
term .lwr .coef .upr
(Intercept) 17.8402226 18.0725882 18.3049537
mid_parent 0.7257281 0.7292103 0.7326925
Mod_galton |> R2()
n k Rsquared F adjR2 p df.num df.denom
898 2 0.638211 789.4087 0.6374025 0 2 895
Mod_sim    |> R2()
n k Rsquared F adjR2 p df.num df.denom
1e+05 1 0.6275154 168464.1 0.6275117 0 1 99998
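
As a quick arithmetic check of the intercept claim, using the rounded coefficients reported above: the Galton intercept plus one-half of the Galton coefficient on sexM reproduces the 17.81 intercept used in the simulation rule (the 0.733 slope in the rule likewise comes from the Galton mid_parent coefficient).

15.20 + 5.23 / 2   # about 17.8, matching the 17.81 in the simulation rule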

Each successive generation relates to the previous one in the same way. For instance, modeling the mid-grandchild on the mid-child produces essentially the same coefficients:

sim_data |> model_train(mid_grandchild ~ mid_child) |> conf_interval()
term .lwr .coef .upr
(Intercept) 17.4321058 17.6846047 17.9371036
mid_child 0.7310966 0.7348802 0.7386638

… and all generations have about the same mean height:

sim_data |> 
  summarize(mean(mid_parent), mean(mid_child), mean(mid_grandchild))
mean(mid_parent) mean(mid_child) mean(mid_grandchild)
66.70651 66.71566 66.71262
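
The constant mean is built into the simulation rule: a parent generation averaging about 66.7 inches produces a child generation with the same average, since the rnorm() noise has mean zero and does not shift the average. A quick check with the rounded coefficients from the rule:

17.81 + 0.733 * 66.7   # about 66.7, so the mean reproduces itself each generation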

However, the simulated variability decreases from generation to generation. That’s unexpected, given that each generation relates to its parents similarly.

sim_data |> 
  summarize(var(mid_parent), var(mid_child), var(mid_grandchild))
var(mid_parent) var(mid_child) var(mid_grandchild)
3.098433 2.625568 2.396332

To use Galton’s language, this is “regression to mediocrity,” with each generation being closer to the mean height than the parent generation.
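
The shrinking variance can also be traced to the simulation rule. Because the added rnorm() noise is independent of the parent height, each generation’s variance is 0.733 squared times the previous generation’s variance plus 0.99 squared. A rough check against the variances reported above:

0.733^2 * 3.098 + 0.99^2   # about 2.64, near var(mid_child) of 2.63
0.733^2 * 2.626 + 0.99^2   # about 2.39, near var(mid_grandchild) of 2.40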

Note

FYI … Phenotype vs genotype

Modern genetics distinguishes between the “phenotype” of a trait and the “genotype” that is the mechanism of heritability. The phenotype is not directly inherited; it reflects outside influences combined with the genotype. The above simulation reflects an early theory of inheritance based on “phenotype.” (See Figure 14.5.) However, in part due to data like Galton’s, the phenotype model has been rejected in favor of genotypic inheritance.

(a) Phenotype inherited
(b) Genotype inherited
Figure 14.5: Two different models of genetic inheritance. The phenotypic model reflects very early ideas about genetics. The genotypic model is more realistic.