```
|>
Galton summarize(sd(height))
```

# 8 Statistical thinking & variation

The central object of statistical thinking is **variation**. Whenever we consider more than one item of the same kind, , the items will likely differ from one another. Sometimes, the differences appear slight, as with the variation among fish in a school. Sometimes the differences are large, as with country-to-country variation in population size, land area, or economic production. Even carbon atomsâ€”all made of identical protons, neutrons, and electronsâ€”differ one from another in terms of energy state, bonds to molecular neighbors, and so on.

In many settings, variation is inevitable. Example: The Earth is a spinning sphere. Consequently, the intensity of sunlight necessarily varies from place to place and time of day leading to wind and turbulence. Example: In biology, variation is a built-in feature of genetic mechanisms. results from genetic and environmental impacts, even at the level of a cell or even a virus. In human activities, sometimes variation is a nuisance as in industrial production; sometimes it provides important structure as with the days of the week and seasons of the year; sometimesâ€”as in artâ€”it is encouraged.

In the same era, the increased use of time-keeping in navigation and the need to make precise land surveys across large distances led to detailed astronomical observations. The measurements of the positions and alignment of stars and planets from different observatories were slightly inconsistent even when taken simultaneously. Such inconsistencies were deemed to be the result of error. Consequently, the â€śtrueâ€ť position and alignment was considered to be that offered by the most esteemed, prestigious, and authoritative observatory.

After 1800, attitudes began to changeâ€”slowly. Rather than referring to a single authoritative source, astronomers and surveyors used arithmetic to construct an artificial summary by averaging the varying individual observations. A Belgian astronomer, Adolphe QuĂ©telet, started to apply the astronomical techniques to social data. Among other things, in the 1830s QuĂ©telet introduced the notion of a *crime rate*. At the time, this was counter-intuitive because there is no central authority that regulates the commission of crimes.

Up through the early 1900s, the differences between individual observations and the summary were described as â€śerrorsâ€ť and â€śdeviations.â€ť But the summary is built from the observations; any â€śerrorâ€ť in the observations must pass through, to some extent, to the summary.

In natural science and engineering, it became important to be able to measure the likely amount of â€śerrorâ€ť in a summary, for instance, to determine whether the summary is consistent or inconsistent with a hypothesis. Similarly, if QuĂ©telet was to compare crime rates across state boundaries, he needed to know how precise his measurement was.

To illustrate, suppose QuĂ©telet was comparing crime rates in France and England. If the summaries were exact, a simple numerical comparison would suffice to establish the differences. However, since summaries are not infinitely precise, it is essential to consider their imprecision in making judgments of difference.

The challenge for the student of statistics is to overcome years of training suggesting wrongly that you can compare groups by comparing averages. The averages themselves are not sufficient. Statistical thinking is based on the idea that **variation** is an essential component of comparison. Comparing averages can be misleading without considering the specimen-to-specimen variation simultaneously.

As you learn to think statistically, it will help to have a concise definition. The following captures much of the essence of statistical thinking:

Statistic thinking is the accounting for variationin the context ofwhat remains unaccounted for.

As we start, the previous sentence may be obscure. It will begin to make more and more sense as you work through these successive *Lessons* where, among other things, you will â€¦

- Learn how to measure variation;
- Learn how to account for variation;
- Learn how to measure what remains unaccounted for.

## Measuring variation

Yet another style for describing variationâ€”one that will take primary place in these Lessonsâ€”uses only a **single-number**. Perhaps the simplest way to imagine how a *single* number can capture variation is to think about the numerical *difference* between the top and bottom of an interval description. We are throwing out some information in taking such a distance as the measure of variation. Taken together, the top and bottom of the interval describe two things: the *location* of the values and how different the values are from one another. These are both important, but it is the difference between values that gives a pure description of variation.

Early pioneers of statistics took some time to agree on a standard way of measuring variation. For instance, should it be the distance between the top and bottom of a 50% interval, or should an 80% interval be used, or something else? Ultimately, the selected standard is not about an interval but something more fundamental: the distances between pairs of individual values.

To illustrate, suppose the `gestation`

variable had only two entries, say, 267 and 293 days. The *difference* between these is \(267-293 = -26\) days. Of course, we donâ€™t intend to measure *distance* with a negative number. One solution is to use the absolute value of the difference. However, for subtle mathematical reasons relating to the Pythagorean theorem, we avoid negative numbers by using the *square of the difference*, that is, \((293 - 267)^2 = 676\) days-squared.

To extend this straightforward measure of variation to data with \(n > 2\) is simple: look at the square difference between every possible pair of values, then average. For instance, for \(n=3\) with values 267, 293, 284, look at the differences \((267-293)^2, (267-284)^2\) and \((293-284)^2\) and average them! This simple way of measuring variation is called the â€śmodulusâ€ť and dates from at least 1885. Since then, statisticians have standardized on a closely related measure, the â€ś**variance**,â€ť which is the modulus divided by \(2\). Either one would have been fine, but honoring convention offers important advantages; like the rest of the world of statistics, weâ€™ll use the variance to measure variation.

## Variance as pairwise-differences

Figure fig-explain-modulus is a jitter plot of the `gestation`

duration variable from the `Gestation`

data frame. The graph has no explanatory variable because we are focusing on just one variable: `gestation.`

The range in the values of `gestation`

runs from just over 220 days to just under 360 days.

Each red line in Figure fig-explain-modulus connects two randomly selected values from the variable. Some lines are short; the values are pretty close (in vertical offset). Some of the lines are long; the values differ substantially.

Only a few pairs of points have been connected with the red lines. To connect every possible pair of points would fill the graph with so many lines that it would be impossible to see that each line connects a pair of values.

The average square of the linesâ€™ lengths (in the vertical direction) is called the â€śmodulus.â€ť We wonâ€™t use this word going forward in these Lessons; we accept that the conventional description of variation is the â€śvariance.â€ť Still, the modulus has a more natural explanation than the variance. Numerically, the variance is half the modulus.

Calculating the variance is straightforward using the `var()`

function. Remember, `var()`

is similar to the other reduction functionsâ€”e.g. `mean()`

and `median()`

â€”that distill multiple values into a single number. As always, the reduction functions need to be used *within* the arguments of a data wrangling function.of a set of data-frame rows to a single summary is accomplished with the `summarize()`

wrangling command.

The variance is a numerical quantity whose units come from the variable itself. `height`

in the `Galton`

data frame is measured in inches. The variance averages the square of differences, so `var(height)`

will have units of *square-inches*.

## From variance to â€śstandard deviationâ€ť

If you have studied statistics before, you have probably encountered the â€ś**standard deviation**.â€ť We avoid this terminology; it is long-winded and wrongly suggests a departure from the normal. Calculating the standard deviation involves two steps: first, find the variance; then take the square root. These two steps are automated in the `sd()`

summary function.

A consequence of the use of squaring in defining the variance is the units of the result. `gestation`

is measured in days, so `var(gestation)`

is measured in days-squared.

Use `summary()`

to compute both the variance and standard deviation of the `flipper`

length in the `Penguins`

data frame.

After viewing the numbers, pipe the result into `mutate()`

and do a simple calculation to demonstrate the numerical relationship between variance and standard deviation.

## Exercises

## Exercises

**Exercise 8.1** The two jitter + violin graphs below show the distribution of two different variables, X and Y. Which variable has more variability?

*Answer*:

There is about the same level of variability in variable A and variable B. This surprises some people. Remember, the amount of variability has to do with the spread of *values* of the variable. In variable B, those values are have a 95% prediction interval of about 30 to 65, about the same as for variable A. There are two things about plot (b) that suggest to many people that there is more variability in variable B.

- The larger horizontal spread of the dots. Note that variable B is shown along the vertical axis. The horizontal spread imposed by jittering is completely arbitrary: the only values that count are on the y axis.

- The scalloped, irregular edges of the violin plot.

On the other hand, some people look at the clustering of the data points in graph (b) into several discrete values, creating empty spaces in between. To them, this clustering implies less variability. And, in a way, it does. But the *statistical* meaning of variability has to do with the overall spread of the points, not whether they are restricted to discrete values.

**Exercise 8.2**

- True or false: Variance refers to a data frame.
*Answer*: False. Variance refers to a variable. - Are â€śstandard deviationâ€ť and â€śvarianceâ€ť the same thing?
*Answer*: Almost. Standard deviation is the square root of variance. Many people prefer to use standard deviation because its units are the same as those of the variable. (Variance has the variableâ€™s units*squared*. But variance has simpler mathematical properties.) - True or false: Variance is about a single variable, not the relationship between two variables.
*Answer*: True

**Exercise 8.3** It is undeniably odd that the units of the variance are the square of the units of the variable for which the variance is calculated. For example, if the variable records *height* in units of meters, the variance of height will have units of square-meters. Sometime, the units of the variance have no physical meaning to most people. A case in point, if the variable records the *temperature* in degreesC, the variance will have units of degreesC^{2}.

The odd units of the variance make it hard to estimate from a graph. But there is a trick that makes it straightforward.

Graph the values of the variable with a jittered point plot. The horizontal axis has no content because variance is about a single variable, not a pair of variables. The horizontal space is needed only to provide room for the jittered points to spread out.

Mark the graph with two horizontal lines. The space between the lines should include the central two-thirds of the points in the graph.

Find the intercept of each of the two lines with the vertical axis. Weâ€™ll call the two intercepts â€śtopâ€ť and â€śbottom.â€ť

Find the numerical difference between the top and the bottom. Divide this difference by 2 to get a â€śby-eyeâ€ť estimate of the

**standard deviation**. That is, the standard deviation is the length of a vertical line drawn from the mid-point of â€śtopâ€ť and â€śbottomâ€ť to the top.Square the

**standard deviation**to get the value of the**variance**.

For each of the following graphics (in the style described in (1)), estimate the standard deviation and from that the variance of the data in the plot.

```
Warning in geom_jitter(point_ink = 0.3): Ignoring unknown parameters: `point_ink`
Ignoring unknown parameters: `point_ink`
Ignoring unknown parameters: `point_ink`
Ignoring unknown parameters: `point_ink`
```

Graph | Standard Dev. | Variance | units of sd | units of var |
---|---|---|---|---|

A | ||||

B | ||||

C | ||||

D |

**Exercise 8.4** Here is the variance of three variables from the `Galton`

data frame.

```
|>
Galton summarize(vh = var(height), vm = var(mother), vf = var(father))
```

vh | vm | vf |
---|---|---|

12.8373 | 5.322365 | 6.102164 |

All three variablesâ€”`height`

, `mother`

, `father`

â€”give the height of a person. The `height`

variable is special only because the people involved are the (adult-aged) children of the respective mother and father.

Height is measured in inches in

`Galton`

. What are the units of the variances?*Answer*: Variance has the units of the square of the variable. In this case, thatâ€™s â€śsquare-inches.â€ťCan you tell from the variances which group is the tallest: fathers, mothers, or adult children?

*Answer*: Variance is about the differences between pairs of values in a variable. It does not have anything to say about whether the values are high or low, just about how they differ one to another.

Mothers and fathers have about the same variance. Yet the heights of the children have a variance that is more than twice as big. Letâ€™s see why this is.

As you might expect, the mothers are all females and the fathers are all males. But the children, whose heights are recorded in `heights`

, are a mixture of males and females. So the large variance, 12.8 square-inches, is a combination of the systematic variation between males and females and the person-to-person variation within each group.

```
|>
Galton mutate(hdev = height - mean(height), .by = sex) |>
summarize(v_sex_adjusted = var(hdev), .by = sex)
```

sex | v_sex_adjusted |
---|---|

M | 6.925288 |

F | 5.618415 |

Because the data have been grouped by `sex`

, the `mean(height)`

is found separately for males and females.

- The table just calculated gives the variance among the female (adult-aged) children and the variance among the male (adult-aged) children. Compare it to the variances for the fathers and mothers,
`vm`

and`vf`

from the calculation made at the start of this exercise.

The variance calculations made using `summarize()`

, `var()`

and `.by = sex`

can be done another way using regression modeling.

```
|> model_train(height ~ sex) |>
Galton model_eval() |>
summarize(var(.resid), .by = sex)
```

`Using training data as input to model_eval().`

sex | var(.resid) |
---|---|

M | 6.925288 |

F | 5.618415 |

- Explain what the variance of the residual has to do with the variance of the sexes separately.
*Answer*: It is roughly the average separate variances.

**Exercise 8.5** Consider these calculations of the variation in `age`

in the `Whickham`

data frame:

`|> summarize(var(age)) Whickham `

var(age) |
---|

303.8756 |

`|> summarize(var(age - 5)) Whickham `

var(age - 5) |
---|

303.8756 |

`|> summarize(var(age - mean(age))) Whickham `

var(age - mean(age)) |
---|

303.8756 |

Explain why the answers are all the same, even though different amounts are being subtracted from `age`

?

*Answer*:

Variance is about the differences between pairs of values in a variable. Adding or subtracting the same quantity to all values does not change any of the differences between them.

**Exercise 8.6**

- Some by hand calculation of variance.
- Units of variance in various settings.
- Variance by eye

## Enrichment topics

Instructors and some students will bring their previous understanding of the measurement of variation to this section. They will likely be bemused by the presentation in this Lesson. First, this Lesson gives prime billing to the â€ś**variance**â€ť (rather than the â€ś**standard deviation**â€ť). Second, the calculation will be done in an unconventional way.

There are three solid reasons for the departure from the convention. I recognize that the usual formula is the correct, computationally efficient algorithm for measuring variation. That algorithm is usually presented algebraically, even though many students do not parse algebraic notation of such complexity:

\[{\large s} \equiv \sqrt{\frac{1}{n-1} \sum_i \left(x_i - \bar{x}\right)^2}\ .\] The first step in the conventional calculation of the standard deviation \(s\) is to find the mean value of \(x\), that is

\[{\large\bar{x}} = \frac{1}{n} \sum_i x_i\] For those students who can parse the formulas, the clear implication is that the standard deviation depends on the mean.

The mean and the variance (or its square root, the standard deviation) are independent. Each can take on any value at all without changing the other. The mean and the variance measure two utterly distinct characteristics. The method shown in the text avoids making the misleading link between the mean and the variance.

As well, the textâ€™s formulation avoids any need to introduce the distracting \(n-1\). The effect of the \(n-1\) is already accounted for in the textâ€™s simple averaging.

Finally, working directly with the **vari**ance verbally reminds us that it is a measure of **vari**ation, avoids the obscure and oddball name â€śstandard deviation,â€ť and simplifies the accounting of variation by removing the need to square standard deviations before working with them.

Instructors should point out to students that the units of the variance are not those of the mean. For instance, the variance of a set of heights will have units height^{2}: area. Itâ€™s reasonable for the units to differ, just as units for gas volume and pressure vary. Variances and means are different quantities measured in different ways.