7  Statistical thinking & variation

Published 2018-12-02

The central object of statistical thinking is variation. Whenever we consider more than one item of the same kind, it’s likely that the items will differ one from another. Sometimes the differences appear slight, as with the variation among fish in a school. Sometimes the differences are large, as with country-to-country variation in population size, land area, or economic production. Even carbon atoms—all made of identical protons, neutrons, and electrons—differ one from another in terms of energy state, bonds to molecular neighbors, and so on.

Expressed in terms of data frames, the “same kind” means the unit of observation, the type of specimen that occupies a row of a data frame.

Sometimes variation is desirable; think how boring it would be to have all students exactly the same. Sometimes the goal is to avoid variation, as with precision-made, interchangeable components intended for assembly-line production. Lack of variation—interchangeability—became an industrial concept around 1800 with the manufacture of guns. Perhaps it’s obvious that achieving interchangeability requires ways to measure with precision, to detect even the smallest differences between a standard part and the parts in production.

In the same era, the increased use of time-keeping in navigation and the need to make precise surveys of land across large distances led to very detailed astronomical observations. Different observatories’ measurements of the positions and alignment of stars and planets were found to be slightly inconsistent even when taken at the same time. Such inconsistencies were deemed to be the result of error. Consequently, the “true” position and alignment was considered to be that offered by the most esteemed, prestigious, and authoritative observatory.

After 1800, attitudes began to change—slowly. Rather than referring to a single observation as best, astronomers and surveyors used arithmetic to construct an artificial summary by averaging the varying individual observations. The average—typically the arithmetic mean—is called an “estimate,” reflecting an understanding that the summary itself is not perfect. The estimate might differ from each of the actual observations, but it was still taken as more authoritative than any of them.

By way of political analogy, in the pre-1800 system the esteemed observatory’s observation was the king or queen and carried absolute authority. The post-1800 system of averages and estimates was more like a democratic process, where the voice of each of the observations had equal weight in the final outcome. This post-1800 democratic conception is very much still with us, with elementary-school students learning how to average at the same time they learn about elections and voting.

Up through the 1900s, terms like “error” and “deviation” were used to name the difference between individual observations and the summary estimate. But as the use of statistical summaries spread rapidly from one field to another, statisticians had to confront the inconvenient fact that summaries themselves could contain error.

This inconvenience might have first become evident when statistical summaries of different groups—say Frenchmen and Englishmen—were compared. If the summaries were exact, a simple numerical comparison would suffice to establish the differences. But since summaries are not infinitely precise, it is essential to consider their imprecision when making judgments of difference.

The challenge for the statistics student is to overcome years of training in the idea that you can compare groups simply by comparing averages. The averages themselves are not sufficient. Statistical thinking is based on the idea that variation is an essential component of comparison. Comparing averages can be misleading without considering, at the same time, the specimen-to-specimen variation.

As you learn to think statistically, it will help to have a concise definition. The following captures much of the essence of statistical thinking:

Statistical thinking is the accounting for variation in the context of what remains unaccounted for.

As we start, the previous sentence may be obscure. It will start to make more and more sense as you work through these successive Lessons where, among other things, you will …

  1. Learn how to measure variation;
  2. Learn how to account for variation;
  3. Learn how to measure what remains unaccounted for.

Measuring variation

Instructors will bring to this section their previous understanding of the measurement of variation. They will likely be bemused by the presentation here. First, this Lesson gives prime billing to the “variance” (rather than the “standard deviation”). Second, the calculation will be done in an unconventional way.

There are three solid reasons for this departure from convention. I recognize that the usual formula is a correct, computationally efficient algorithm for measuring variation. But that algorithm is usually presented algebraically, even though many students cannot parse algebraic notation of such complexity:

\[{\large s} \equiv \sqrt{\frac{1}{n-1} \sum_i \left(x_i - \bar{x}\right)^2}\ .\] The first step in the conventional calculation of the standard deviation \(s\) is to find the mean value of \(x\), that is

\[{\large\bar{x}} = \frac{1}{n} \sum_i x_i\] For those students who can parse the formulas, the clear implication is that the standard deviation depends on the mean.
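
For students who read computer code more easily than algebraic notation, here is a minimal sketch of the conventional calculation carried out in R. The three-value vector x and its values are purely illustrative (they happen to be gestation times, in days, used again later in this Lesson).

x <- c(267, 293, 284)                           # three illustrative gestation times, in days
xbar <- sum(x) / length(x)                      # first step: the mean, x-bar
s <- sqrt(sum((x - xbar)^2) / (length(x) - 1))  # then the standard deviation
s
sd(x)                                           # R's built-in sd() gives the same value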

In reality, the mean and the variance (or its square root, the standard deviation) are entirely independent. Each can take on any value at all without changing the other. The mean and the variance measure two utterly distinct characteristics. The method shown in the text avoids making the misleading link between the mean and the variance.
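
A quick illustration of that independence, again with a made-up vector: adding a constant to every value moves the mean but leaves the variance exactly where it was.

x <- c(267, 293, 284)
mean(x)          # about 281.33
var(x)           # about 174.33
mean(x + 100)    # the mean shifts by 100 ...
var(x + 100)     # ... but the variance is unchanged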

As well, the text’s formulation avoids any need to introduce the distracting \(n-1\). The effect of the \(n-1\) is already accounted for in the text’s simple averaging.

Finally, the very name “variance” reminds us that it is a measure of variation. Working with it directly avoids the obscure and oddball name “standard deviation,” and simplifies the accounting of variation by removing the need to square standard deviations before working with them.

It is important for instructors to point out to students that the units of the variance are not those of the mean. For instance, the variance of a set of heights will have units of height-squared (that is, area). But it’s entirely reasonable for the units to differ; after all, the variance is a different kind of thing than the mean.

Yet another style for describing variation—one that will take primary place in these Lessons—uses only a single number. Perhaps the simplest way to imagine how a single number can capture variation is to think about the numerical difference between the top and bottom of an interval description. In taking such a distance as the measure of variation, we are throwing out some information. Taken together, the top and bottom of the interval describe two things: the location of the values and how different the values are from one another. Both are important, but it is the difference between values that gives a pure description of variation.
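
As a rough sketch of this interval-width style of measuring variation (not the measure these Lessons will ultimately adopt), one might take the width of a 95% summary interval of gestation. The sketch assumes, as in the rest of this Lesson, that the Gestation data frame and the wrangling functions are already loaded; the quantile probabilities 0.025 and 0.975 are just one possible choice.

Gestation %>%
  summarize(bottom = quantile(gestation, 0.025),
            top    = quantile(gestation, 0.975),
            width  = top - bottom)    # a single number describing variation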

Early pioneers of statistics took some time to agree on a standard way of measuring variation. For instance, should it be the distance between the top and bottom of a 50% interval, or should an 80% interval be used, or something else? In the end, the selected standard is not about an interval at all but something rather more basic: the distances between pairs of individual values.

To illustrate, suppose the gestation variable had only two entries, say, 267 and 293 days. The difference between these is \(267-293 = -26\) days. Of course, we don’t intend to measure distance with a negative number. One solution is to use the absolute value of the difference. However, for subtle mathematical reasons relating to—of all things!—the Pythagorean theorem, we avoid the possibility of a negative number by using the square of the difference, that is, \((293 - 267)^2 = 676\) days-squared.

Extending this measure of variation to data with \(n > 2\) is straightforward: find the squared difference between every possible pair of values, then average. For instance, for \(n=3\) with values 267, 293, 284, look at the differences \((267-293)^2, (267-284)^2\) and \((293-284)^2\) and average them! This simple way of measuring variation is called the “modulus” and dates from at least 1885. Since then, statisticians have standardized on a closely related measure, the “variance,” which is the modulus divided by \(2\). Either one would have been fine, but honoring convention offers important advantages; like the rest of the world of statistics, we’ll use the variance to measure variation.
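
Here is that three-value example carried out in R, a small sketch showing that half the average squared pairwise difference (the modulus divided by 2) matches what the var() function reports.

vals <- c(267, 293, 284)
pair_diffs_sq <- c((267 - 293)^2, (267 - 284)^2, (293 - 284)^2)
modulus <- mean(pair_diffs_sq)   # average squared difference over all pairs
modulus / 2                      # the variance: half the modulus, about 174.33
var(vals)                        # R's var() agrees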

Variance as pairwise-differences

Figure 7.1 is a jitter plot of the gestation duration variable from the Gestation data frame. There is no explanatory variable in the graph because we are focusing on just the one variable: gestation. The range in the values of gestation runs from just over 220 days to just under 360 days.

Each red line in Figure 7.1 connects two randomly selected values from the variable. Some of the lines are short; the values are pretty close (in vertical offset). Some of the lines are long; the values differ substantially.

Figure 7.1: The variance is related to the average square difference between all pairs of values in the variable.

Only a few pairs of points have been connected with the red lines. To connect every possible pair of points would fill the graph with so many lines that it would be impossible to see that each line connects a pair of values.

The average of the square of the length of the lines (in the vertical direction) is called the “modulus.” We won’t need to use this word, since the “variance” is the standard description of variability. Numerically, the variance is half the value of the modulus.

Calculating the variance is straightforward using the var() function. Remember, var() is similar to the other summary functions such as mean() or median() that reduce multiple values into a single value. As always, the reduction of a set of data-frame rows to a single summary is accomplished with the summarize() wrangling command.

Tip: In the expression summarize( vgest=var(gestation)), the name vgest is selected for human readability of the results. The name need not have anything to do with the quantity being calculated. So, summarize(rock_paper_scissors=var(gestation)) is perfectly valid from the computer’s point of view, but not so helpful from the human perspective.
Gestation %>%
  summarize(vgest = var(gestation))
    vgest
 --------
  256.887

One consequence of the squaring used in defining the variance shows up in the units of the result. gestation is measured in days, so var(gestation) is measured in days-squared.

Almost all statistics textbooks talk about the “spread” of a set of values and measure it with a quantity called the “standard deviation.”

Gestation %>%
  summarize(standard_deviation = sd(gestation))
  standard_deviation
 -------------------
            16.02769

The standard deviation is simply the square root of the variance.
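
To see that relationship directly, here is a quick check on the Gestation data; the column names v, s, and root_v are arbitrary, chosen only for readability.

Gestation %>%
  summarize(v = var(gestation),
            s = sd(gestation),
            root_v = sqrt(v))    # s and root_v are the same number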

Exercises

The two jitter + violin graphs below show the distribution of two different variables, A and B. Which variable has more variability?

Answer:

There is about the same level of variability in variable A and variable B. This surprises some people. Remember, the amount of variability has to do with the spread of values of the variable. In variable B, those values have a 95% prediction interval of about 30 to 65, about the same as for variable A. There are two things about plot (b) that suggest to many people that there is more variability in variable B.

  1. The larger horizontal spread of the dots. Note that variable B is shown along the vertical axis. The horizontal spread imposed by jittering is completely arbitrary: the only values that count are on the y axis.
  2. The scalloped, irregular edges of the violin plot.

On the other hand, some people look at the clustering of the data points in graph (b) into several discrete values, creating empty spaces in between. To them, this clustering implies less variability. And, in a way, it does. But the statistical meaning of variability has to do with the overall spread of the points, not whether they are restricted to discrete values.

  1. Some by-hand calculation of variance.
  2. Units of variance in various settings.
  3. Variance by eye.
  4. For experts: where the \(n-1\) comes from.