# Chapter 3 Measuring variation

Recall that the purpose of statistical inference is to determine which statistical claims are justified by the data on which they are based. This amounts to asking whether the data provide enough evidence to support a claim. How can we figure out how much data is enough?

An obvious and important way to quantify “how much” is the number of rows in the data frame, that is, the *sample size* \(n\). Perhaps it’s intuitive that more data constitutes more evidence. Some care is required here, since we want to avoid phony creation of large sample sizes by copying earlier rows to make new rows in the data frame. One proper procedure is to insist that each unit of analysis be grabbed at random from a *population* of all the possible units. A data frame constructed by such a procedure is called a *sample of the population*, which is why the number of rows \(n\) is called the sample size.

It’s tempting to elaborate on *how much* evidence we have by counting the number of variables in the data frame. But there is a serious problem here: there is no such thing as the “set of possible variables.” It’s the researcher who determines what will be a variable, and you can in principle make up as many as you like. In the running example from Chapter 1, the variables were age and running time. Sensible. But we might also have recorded the runner’s favorite number, or the time the runner’s brother had breakfast the Tuesday before the race, or anything else, relevant or not. Common sense tells you to avoid such silliness. But what one person considers silly might be sensible to someone else. For instance, many people take astrological signs seriously, but others don’t. Should we count astrological sign as a genuine variable? As it happens, birth month accounts for some of the observed differences in performance of professional athletes. (The reason appears to be that children who are the oldest in their school grade do better as kids in athletics, which leads to them developing confidence and interest in sports and receiving extra attention from coaches.)

The key to measuring *how much* evidence the data provides lies in the sentence, “Birth month accounts for some of the observed differences in performance.” What matters is whether a variable can *explain* or *account for* the variation in an outcome of interest (like athletic performance). We need to be able to say how much variation is in the outcome. As described in the previous chapter, in a statistical model the outcome is represented by the response variable. We’ll measure the variation in the response variable and then compare it to the amount of variation that the statistical model attributes to the explanatory variable(s).

## 3.1 Variance of a numerical variable

Recall that the statistical models we use in this book will always have a numerical response variable. We can quantify the amount of variation in the response variable in many different ways. The conventional way is by a quantity called the *variance*.

There are different ways to calculate the variance of a variable. Most textbooks give a formula that can be used efficiently by a computer. For the purpose of explaining the variance to another person, I like another way.

The starting point is the response variable for which you want to know the variance. Usually, we organize variables into data frames, but for the moment imagine that the individual numbers, \(n\) of them, have been spilled out on the surface of a table. Take two of the numbers at random. Chances are, the two numbers are different but they might, by luck, be exactly the same. Doesn’t matter. To measure the variation of these two numbers, simply subtract one from the other to get the difference, then square the difference. Because of the squaring, it doesn’t matter whether you subtract the first number from the second or *vice versa*. For historical reasons, the *variance* is the square difference divided by two. But if history had worked out differently, the square difference would have been a fine measure of variation itself.

The square difference measures the variation between two numbers. But we want to measure the variation of the whole set of numbers. To do this, repeat the calculation of the square difference for *every possible pair of the numbers on the table*. For instance, if there were \(n=3\) numbers, say

\[5, 9, 3\]

the pairs would be

- 5 - 9 giving a difference of -4, which squares to 16
- 5 - 3 giving a difference of 2, which squares to 4
- 3 - 9 giving a difference of -6, which squares to 36

Now average all the square differences. Averaging 16, 4, 36 gives 18.67. The variance, by historical convention, is half this number, or 9.33.
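The every-possible-pair recipe can be sketched in a few lines of Python. The function name `pairwise_variance` is made up for this illustration. The comparison at the end uses `variance` from Python's standard `statistics` module, which computes the conventional sample variance.

```python
from itertools import combinations
from statistics import variance


def pairwise_variance(values):
    """Half the average square difference over every possible pair.

    `pairwise_variance` is a hypothetical helper written for this sketch.
    """
    # combinations(values, 2) visits each unordered pair exactly once.
    square_diffs = [(a - b) ** 2 for a, b in combinations(values, 2)]
    average_square_diff = sum(square_diffs) / len(square_diffs)
    # Halving is the historical convention described in the text.
    return average_square_diff / 2


numbers = [5, 9, 3]
print(pairwise_variance(numbers))  # the every-pair calculation
print(variance(numbers))           # the conventional calculation, for comparison
```

For the three numbers above, both calculations give the same value, about 9.33.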

When \(n\) is big, there are a lot of possible pairs of numbers. For instance, when \(n = 100\), there are 4950 pairs. That’s why we leave it to the computer to do the calculation, and even then the calculation is re-arranged so that there are only 100 square differences involved.
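The count of pairs and the rearrangement can be written out explicitly. The symbols \(x_i\) for the individual numbers and \(\bar{x}\) for their mean are notation introduced here for illustration. The number of possible pairs is

\[\binom{n}{2} = \frac{n(n-1)}{2}, \qquad \binom{100}{2} = \frac{100 \cdot 99}{2} = 4950\]

and half the average square difference over all pairs is algebraically the same as a sum over the \(n\) individual numbers:

\[\frac{1}{2} \cdot \frac{1}{\binom{n}{2}} \sum_{i<j} (x_i - x_j)^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\]

The right-hand side involves only \(n\) square differences, each between one number and the mean. For the numbers 5, 9, 3 the mean is \(17/3\), the square differences from the mean add up to \(56/3\), and dividing by \(n - 1 = 2\) gives 9.33, as before.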

If you like, you can think of the reason why we square the difference as a convenience to avoid having to worry about whether the difference is positive or negative (which depends only on which of the pair of values you put first in the subtraction). But there is some profound thinking behind the use of squares, which reflects the nature of randomness and, believe it or not, the Pythagorean theorem.

## 3.2 Variance of a categorical variable?

A categorical variable has distinct *levels*, usually represented by labels such as *agree*, *undecided*, and *disagree*. To attempt to describe the amount of variation in a categorical variable we can follow the same process as for numerical variables: spill the collection of \(n\) labels onto a table, pick at random a pair of labels, subtract them, and square the difference.

There’s a big problem, however. What is the numerical value of the difference between *agree* and *undecided*? How does the size of the difference between *agree* and *undecided* compare to the difference between *disagree* and *undecided* or between *agree* and *disagree*? Sometimes there’s a reasonable choice to be made. For example, we might decide that *agree* and *disagree* differ by 2, *agree* and *undecided* differ by 1, and that *disagree* and *undecided* also differ by 1. Even more basic, it’s reasonable to say that the difference between *agree* and *agree* should be zero, and similarly for *disagree* versus *disagree* or *undecided* versus *undecided*.

Notice that all these declared differences can be created by recoding the categorical variable as a numeric variable. For instance, we can change *agree* to 1, *undecided* to 2, and *disagree* to 3. Then just calculate the variance of the numerical variable in the usual way.
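As a minimal sketch of this recoding, the following Python snippet maps the three labels to 1, 2, and 3 and then computes the variance in the usual way. The list of responses is invented for illustration, and `variance` is the sample variance from Python's standard `statistics` module.

```python
from statistics import variance

# Hypothetical survey responses, invented for this illustration.
responses = ["agree", "undecided", "disagree", "agree", "disagree"]

# The recoding described in the text: agree -> 1, undecided -> 2, disagree -> 3.
codes = {"agree": 1, "undecided": 2, "disagree": 3}
numeric = [codes[label] for label in responses]

# The declared differences fall out of the codes automatically.
print(abs(codes["agree"] - codes["disagree"]))   # differ by 2
print(abs(codes["agree"] - codes["undecided"]))  # differ by 1

# Variance of the recoded variable, calculated in the usual way.
print(variance(numeric))
```

A different choice of codes, say 1, 2, and 10, would give a different variance for the same responses, so the answer depends on the recoding.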

Sometimes it’s sensible to translate the levels of a categorical variable into numbers. For instance, with *agree*/*undecided*/*disagree* it’s reasonable to think that *undecided* is in between *agree* and *disagree*. But, in general, there will be no such sense of inbetweenness of categorical levels. Take, for example, a categorical variable whose levels are the names of countries. Or a categorical variable whose levels are political parties: Green, Libertarian, Democratic, Republican. Which levels are between which? (As it happens, people do try to put political parties in sequential order by categorizing them on the scale from Left to Right.)

Without a sense of *inbetweenness* of levels, it’s arbitrary to assign numbers to the various levels. Except in one situation.

Often, categorical variables have only two levels. Yes or no. Dead or alive. Accepted or rejected. Treatment and control. Such variables are sometimes called *binary* (like the 0/1 of computer bits) or *dichotomous* or *binomial* (meaning, having two names) or even *two-level*. In the previous chapter, we called them *indicator* variables.

When dealing with an indicator variable, there’s no level to be in between; there are only two levels and the idea of “in between” requires at least three distinct things. So we can easily agree, regardless of our opinions about how the world works, that the difference is zero between labels that are the same (say, between *yes* and *yes* or between *no* and *no*). And when the labels are different (say, *yes* and *no*) we just need to assign a non-zero number to the difference.

Which number? Should the square-difference between *yes* and *no* be 17, or 328, or 0.3? By convention, we use the number 1 for the square-difference between the two levels of a binary variable. This convention has the advantage of simplifying calculations. It’s also what you will get by coding the two levels as 0 and 1 and treating the indicator variable numerically. But there is another important advantage of the simple choice: any average of a 0/1 variable must always be somewhere in the range from 0 to 1, which is exactly the same scale we use for describing *probability*.
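To make the 0/1 convention concrete, here is a small Python sketch. The list of answers is invented for illustration, and coding *yes* as 1 and *no* as 0 is one arbitrary choice of which level gets the 1. Every square difference between a pair comes out as either 0 or 1, and the average of the coded variable is the proportion of *yes* answers.

```python
from itertools import combinations

# Hypothetical yes/no answers, invented for this illustration.
answers = ["yes", "no", "yes", "yes", "no", "yes", "no", "yes"]

# Treat the indicator numerically: yes -> 1, no -> 0 (an arbitrary choice).
indicator = [1 if answer == "yes" else 0 for answer in answers]

# The average of a 0/1 variable is a proportion, so it lies between 0 and 1.
proportion_yes = sum(indicator) / len(indicator)

# Square difference for every possible pair: 0 for matching labels, 1 otherwise.
square_diffs = [(a - b) ** 2 for a, b in combinations(indicator, 2)]

# Variance by the every-pair recipe: half the average square difference.
pairwise_var = sum(square_diffs) / len(square_diffs) / 2

print(proportion_yes)
print(sorted(set(square_diffs)))
print(pairwise_var)
```

Swapping the coding, so that *no* is 1 and *yes* is 0, changes the proportion to its complement but leaves the variance unchanged, because the square differences between pairs stay the same.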

The simplicity of dealing with indicator variables means that the techniques of statistical inference with an indicator for a categorical response variable are much easier than for non-binary categorical response variables. This is also the most common setting for classical inference.