Toe bone connected to the foot bone
Foot bone connected to the heel bone
Heel bone connected to the ankle bone
Ankle bone connected to the leg bone
Leg bone connected to the knee bone
Knee bone connected to the thigh bone
Thigh bone connected to the hip bone
Hip bone connected to the back bone
Back bone connected to the shoulder bone
Shoulder bone connected to the neck bone
Neck bone connected to the head bone—“Dry Bones,” a spiritual by James Weldon Johnson andJ. Rosamond Johnson.
Sometimes our understanding of systems comes in small, simple fragments; factor A is connected to B, B to C, and so on. Each of the fragments tells us little about the whole, but the entire collection gives a comprehensive understanding of the system in its complexity.
This Lesson introduces “simulation,” a way to put together comparatively simple components so that conclusions can be drawn about the whole. A characteristic example comes from engineering. In aeronautical engineering, for instance, wind tunnel testing is used to estimate the performance of a proposed new aircraft in various ways, such as its aerodynamic drag coefficient. The measured drag (a force) is influenced not just by the shape of the aircraft but also by factors such as the density of the air, velocity of the air, cross-section area of the aircraft, etc. Any noise in the measurement of these factors will propagate into the result of the final measurement. Knowing the amount of noise in the final measurement is key to determining whether the final result is precise enough for the purpose at hand.
Every simulation is based on some hypothesis about how the system works. Obviously, we would like the assumptions contained in the hypothesis to give a good description of how the real-world works, but ultimately, any conclusion drawn from the simulation is just a consequence of the assumptions made. So we must always be careful to check that the assumptions used in the simulation are reasonable.
Readers will be familiar with simulations from video games: the whole world presented on the screen is a simulation. Our simulations will not be nearly so complicated. The output from our simulations will be data frames, not graphics.
Constructing a simulation
To construct a simulation, we give instructions for every variable in the simulation and how each variable depends on the others (if at all). In a simulation, our instructions do not dictate every detail of the calculations. Instead, random numbers play an important role as the ultimate source of the values. To illustrate, we’ll turn again to the relationship between parents’ and child’s height, but this time impose our own ideas about what might be going on. The purpose will be to check out our ideas as possible explanations for the sort of data seen in Galton.
First, we’ll make a mother and also a father. We won’t worry what are the family backgrounds of these parents: just their heights. (Height is the only thing Galton recorded about the parents.) We want the simulated parents to be more or less like the parents in Galton. A basic way to accomplish this is to give the simulated mothers and fathers the same mean and variance as in Galton.
To run the simulation, we specify the value we desire for n. For convenience, we will keep it small at, say, n = 6.
datasim_run(Height_sim, n=6)
The overall process here has been
Decide what variables we want to simulate and, as appropriate here, specify what “spread” of numbers we want the variables to have.
Build the simulation. We will use the datasim_make() function for this. The arguments to datasim_make() specify the details of the simulation. We haven’t yet explained what these are doing, but just the syntax may give you a clue.
Having built the simulation, run it with datasim_run(). The choice of the argument n= is up to you. We’re using small n now so that you can easily read the data frame created by datasim_run(), but later we’ll tend to use larger n.
Later, to avoid ambiguity, we will switch from “spread” to a technical word: “distribution.”
Focus now on the two arguments that were given to datasim_make(). Both arguments are complete commands, even if they are not yet familiar to you. Each involves the “assignment” token <-. To the left of <- goes the name of the variable to be created. That’s the easy part.
The harder part is what goes to the right side of <-, the instructions for generating the values in the variables. Here is where you need to learn a handful of new functions. In Height_sim, we use only one of this handful, the function rnorm().
Rnorm() is a “random number generator”, that is, the output is constructed at random. But even though the output is random, it has been made by a process that draws on specified properties, called “parameters.” For the rnorm() random number generator (for short, RNG) those properties are the mean and standard deviation.
Recall that the standard deviation is the square root of the variance. In rnorm(), the sd= argument sets the square root of the variance of the output-generating process.
Mysticism of random numbers
Although we call functions like rnorm() “random number generators,” in reality the arithmetic behind them does not invoke any chance mechanism. Even so, decades of work by mathematicians and computer programmers has produced an arithmetic mechanism that avoids any discernible pattern other than the specified parameters.
To have an easier way to grasp where the random numbers used in our calculations come from, mathematical statisticians have invented a simple-sounding vocabulary. At the core of this vocabulary is “population.” In everyday speech, “population” refers to a collection of people, as in the population of a city or of the world. The mathematical statistician’s “population” is typically infinite in size, from which could be drawn a sample of any size whatsoever without ever selecting the same element twice!
I like to think of the mathematical statistician’s “population” as if it were an assembly line. An assembly line for producing a particular design of toy at Santa’s workshop at the North Pole. Since every little boy or girl deserves a toy that is uniquely their own, Santa’s toys are made with slight deviations from item to item, but those deviations are kept in line according to design parameters or “tolerances.”
Santa can order a delivery of any number of items from the production line. We humans, on the other hand, don’t have direct access to the production line, just to whichever items have been produced and distributed to children. That is, we humans can work only with samples from the population.
So far as random number generators are concerned, many people like to imagine the numbers as coming from a production line at the North Pole. But, in reality, the production line is in your own computer, and neither Santa, nor his elves, nor anything else mysterious is going on.
We will work with several different RNGs, each of which produces its output in a different way that gives the numbers, in aggregate, different properties. But for now we will focus on rnorm().
To get a quick glimpse of the properties of numbers produced by rnorm(), we will collect a much larger sample from Height_sim and look at the data graphically. (We will defer including children in the simulation until the next Lesson.)
The pivot_longer() wrangling verb combines two or more column into a single column, adding a second column that says which original column each value comes from. Here’s the result applied to the data frame in ?tbl-heightsim1.
parent height
------- -------
mom 63.3
dad 70.7
mom 65.4
dad 68.9
mom 62.5
dad 71.5
mom 64.6
dad 68.6
mom 64.8
dad 70.5
mom 66.8
dad 67.2
Code
# wrangle into a convenient form for graphics ...Long_height <- Height_sim |>datasim_run(n=1000, seed=101) |>pivot_longer(c(mom, dad), names_to="parent", values_to="height") # ... and make the graphics
Warning in tilde_graph(Long_height, height ~ parent, annot = "violin", alpha = 0.3, : x-axis variable is numerical, so only one violin drawn for all rows.
Perhaps you want to use ntiles() or factor() on that variable?
Warning in geom_path(aes(x = xticks, y = ybase), data = Ruler, ...): Ignoring
unknown parameters: `scale`
Warning in geom_segment(aes(x = xticks, xend = xticks, y = ybase, yend = ymid),
: Ignoring unknown parameters: `scale`
Warning in geom_text(aes(x = xticks, y = ytop, label = labels), size = 3, :
Ignoring unknown parameters: `scale`
Figure 13.1: The distributions of the mom and dad values from the Height_sim simulation.
Relative density
The random numbers generated by rnorm() and shown in Figure 13.1, are created in a way that gives them a very definite “structure.” For instance, the mom random numbers are densest at about 64, quite common between 67 and 61, sparse outside this range and hardly discernible below about 57 or above about 71.
In order to avoid vague terms like sparse or common, we will turn to the semi-quantitative presentation of density provided by the violin plot. As you can see, at each and every value of the variable on the vertical axis, the violin has a width. The relative widths at any two values, say 60 and 65 in Figure 13.1, tell you the “relative density of data points at those two value. Just by eye, you can see that the width at 60 is about one-quarter of the width at 65.
But how to measure those relative widths? Remember, the horizontal axis in Figure 13.1 refers to the levels of a categorical variable; there are no numerical tick marks on the horizontal axis. This is where the word “relative” comes into play. It does not matter what ruler you use to measure the width so long as you use the same ruler consistently. For convenience, we have placed such a ruler in Figure 13.1. Using that ruler, you can measure the width of the mom violin at value 60 as about 2, and the width at value 65 as about 8. In other words, data points for mom are about 4 times more common at height 65 than at height 60.
Dueling graphical conventions
An important graphical convention in these Lessons is that the graphics frame is always about variables so that data points can be placed in it. The vertical and horizontal axes are always mapped to variables (as is color, if used). Violin plots, showing the relative density of data with respect to the vertical variable (e.g. height in Figure 13.1) have smoothly changing widths from bottom to top: thin where there are few data points, fat where there are many.
There is another widely used convention for displaying density, which involves a graphical frame that is different. Figure 13.2 gives an example.
Code
gf_density(~ height, fill =~ parent, data=Long_height)
Figure 13.2: A convention for displaying relative density where the vertical axis does not come from a variable. Effectively, the scale on the vertical axis provides a way to measure the width of the half-violin.
The format of Figure 13.2, let’s call it “density-as-axis,” is accepted and used by all conventional textbooks. If you continue on in statistics, you will certainly need to get used to that format. We don’t like the form because it breaks the cardinal rule that each of the three components of the graphics frame—x-axis, y-axis, color—should correspond to a variable.
The density-as-axis format shows exactly the same information as the violin format. To see the relationship, let’s redraw the violin format with two simple changes:
Figure 13.3: The density-as-axis format involves showing only half-violins and swapping the meaning of the vertical and horizontal axes.
Creating dependents
Time to return to the simulation of children’s and parents’ heights that we started in ?sec-sim-construction. We will now give each of the mom/dad pairs a child. The reader will scarcelyneed reminding of the basic facts of reproduction. For the sake of simplicity in this introduction to simulation, we will suspend the complexity of the child’s sex or gender, and create simulated children who are all similar with no sex or gender. (In the exercises, we present a more complex, sexed simulation.)
Remember that this is a simulation, so we get to make up the facts of life. On the other hand, insofar as we want to use the simulation to guide our thinking about the real-world situation, it’s helpful to make the simulation realistic. Almost always in simulations, we start with simplistic models, refining them only when the simplifications are seen to influence the conclusions in an important way.
For our made-up children, we will let the child’s height be the average of the mother’s and father’s height, with a bit of random noise added to mimic the variation seen in children’s height even with the same mother and father.
Looking back to Section 13.1, note that the heights of mothers and fathers we assigned variances consistent with that seen in the Galton data, a little more than 5 square-inches for each. We’ll let the (unisex) children vary in height by this same amount. This gives an overall simulation of:
Square-inches? Remember that the variance, as introduced in Lesson 8.1, is the average squared difference between pairs of values in a variable. The standard deviation is merely the square root of the variance. The rnorm() RNG happens to have been programmed to take the standard deviation as an argument. Had the programmer made a different decision, that argument could equally well have be made to correspond to the variance.
Having constructed a complete—if not necessarily correct—model of the relationship between parents’ and child’s heights let’s generate the simulated data, summarize it, and compare it to the same sorts of summaries of the Galton data. In contrast the the actual data, the simulated data can be of any size whatsoever. We’ll take a large sample for reasons that will be explained in Lesson 19.
Sim_data <-datasim_run(Height_sim, n =100000)
Summarizing with models:
The coefficients from the simulation are close to 1/2 for each of the mom and dad. This is what we would expect given that the child’s height was set to (mom + dad)/2 plus random noise.
term .lwr .coef .upr
------------ ------- ------- ------
(Intercept) 13.900 22.300 30.80
mother 0.187 0.283 0.38
father 0.290 0.380 0.47
To judge from the model coefficients, it looks like the simulation gives a stronger dependence of the child’s height on the parents’.
For Galton, the coefficients are about 1/3, less than the 1/2 from the simulation.
Why do the simulation results differ from the Galton-data results? One extreme possibility is that the difference is merely an accident, the play of randomness in the simulation and in the selection of the Galton sample. This is an important possibility to consider. Managing that possibility turns out to be a central task of statistical thinking, a task for which excellent tools are available, as you will see in Lesson 19.
A taste of what’s coming up … our main tool for understanding the play of randomness is called a “confidence interval.” The .lwr and .upr in the coefficient report give the bounds of that interval.
Another extreme possibility is that the difference between the simulated and real data is due to a defect in the mechanism used in the simulation for setting child’s height. Until we get to Lesson 19 with its tools for quantifying the possible play of randomness, you’ll have to take the author’s word that a defect in mechanism is what’s going on.
Let’s also look at the amount of variation in children’s heights:
Here, however, we know about a real-world mechanism that has been left out of the simulation: the child’s sex. In Galton, the children are of both sexes; Height_sim has sexless children.
A quick way to do the comparison between Galton and the height simulation more fairly, is to look at the variation in children’s height within each sex, like this:
sex var(height) var(mother) var(father)
---- ------------ ------------ ------------
F 5.618415 5.184809 6.979624
M 6.925288 5.448853 5.289675
Accounting for sex, the variation in the children’s heights is just about the same as the variation in mother’s and in father’s heights. Still, this is different from the situation in the simulation.
We will eventually have to revise the simulation to implement a mechanism that will give results closer to those in Galton. Before we can do that, however, we have to understand more about the nature of rnorm().
Exercises
Exercise 13.1
A few problems involving measuring the relative widths of the violins at different values of the y variable.
Exercise 13.2
When modeling the simulated data using the specification child ~ mom + dad, the coefficients were consistent with the mechanism used in the simulation.
Explain in detail what aspect of the simulation corresponds to the coefficient found when modeling the simulated data.
As you explained in (a), the specification child ~ mom + dad gives a good match to the coefficients from modeling the simulated data, and makes intuitive sense. However, the specification dad ~ child + mom does not accord with what we know about biology. Fit the specification dad ~ child + mom to simulated data and explain what about the coefficients contradicts the intuitive notion that the mother’s height is not a causal influence on the father’s height.