Chapter 15 Hypothesis Testing on Parts of Models

Everything should be made as simple as possible, but not simpler. – Albert Einstein

It often happens in studying a system that a single explanatory variable is of direct interest to the modeler. Other explanatory variables may be important in the way the system works, but they are not of primary interest to the modeler. As in previous chapters, call the explanatory variable of interest simply the “explanatory variable.” The other explanatory variables, the ones that aren’t of direct interest, are the covariates. The covariates are ordinary variables. Their designation as covariates reflects the interests of the modeler rather than any intrinsic property of the variables themselves.

The source of the interest in the explanatory variable might be to discover whether it is relevant to a model of a response variable. For example, in studying people's heights, one expects that several covariates, for instance the person's sex or the height of their father or mother, play a role. It would be interesting to find out whether the number of siblings a person has, nkids in Galton's dataset, is linked to height. Studying nkids in isolation might be misleading since the other variables are reasonably regarded as influencing height. But doing a whole-model hypothesis test on a model that includes both nkids and the covariates would be uninformative: of course one can account for a significant amount of the variation in height using sex and mother and father. (See the Example in Section 14.3 on height and genetics.) The question about nkids is whether it contributes something to the explanation that goes beyond the covariates. To answer this question, one needs a way of assessing the contribution of nkids on its own, but in the context set by the other variables.

Sometimes you may choose to focus on a single explanatory variable because a decision may hang on the role of that variable. Keep in mind that the role of the explanatory variable of interest can depend on the context set by the covariates. Those covariates might enhance, mask, or reverse the role of the variable of interest. For instance, it's legitimate that wages might vary according to the type of job and the worker's level of experience and education. It's also possible that wages might vary according to sex or race insofar as those variables happen to be correlated with the covariates, that is, correlated with the type of job or level of education and experience. But, if wages depend on the worker's sex or race even taking into account the legitimate factors, that's a different story and suggests a more sinister mechanism at work.

This chapter is about conducting hypothesis tests on a single explanatory variable in the context set by covariates. In talking about general principles, the text will refer to models of the form A ~ B + C or A ~ B + C + D where A is the response variable, B is the variable of interest, and C and D are the variables that you aren’t directly interested in: the covariates.

Recall the definition of statistics offered in Chapter 1:

Statistics is the explanation of variation in the context of what remains unexplained.

It’s important to include covariates in a hypothesis test because they influence both aspects of this definition:

  • Explanation of variation. When the covariates are correlated with the explanatory variable, as they often are, including the covariates in a model will change the coefficients on the explanatory variable.

  • What remains unexplained. Including covariates raises the R² of a model. Or, to put it another way, including covariates reduces the size of the residuals. This can make the explanatory variable look better: smaller residuals generally mean smaller standard errors and higher F statistics. This is not an accounting trick; it genuinely reflects how much of the response variable remains unexplained.

15.1 The Term-by-Term ANOVA Table

The ANOVA table introduced in Chapter 14 provides a framework for conducting hypothesis tests that include covariates. The whole-model ANOVA table lets you compare the "size" of the explained part of the model with the "size" of the residuals. Here is a whole-model report for the model height ~ nkids + sex + father + mother fit to Galton's data:

  Df Sum Sq Mean Sq F value p-value
Model terms 4 7377.9 1845.0 401 0.0000
Residuals 893 4137.1 4.6

The square length of the residual vector is 4137.1; the square length of the fitted model vector is 7377.9. The F test takes into account the number of vectors involved in each – four explanatory vectors (neglecting the intercept term), leaving 893 residual degrees of freedom. The F value, 401, is the ratio of the mean square of the model terms to the mean square of the residuals. Since 401 is much, much larger than 1 (the expected value when the model terms are just random vectors), it's appropriate to reject the null hypothesis; the p-value is effectively zero since F is so huge.
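Reports like this can be generated with R's lm() and anova() functions. Here is a minimal sketch of the whole-model test, assuming Galton's data are in a data frame named galton (the name is an assumption):

  mod  <- lm(height ~ nkids + sex + father + mother, data = galton)
  null <- lm(height ~ 1, data = galton)
  anova(null, mod)  # whole-model comparison: 4 degrees of freedom, sum of squares 7377.9
  # the F value, computed by hand from the table above:
  (7377.9 / 4) / (4137.1 / 893)  # approximately 401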

A term-by-term ANOVA report is much the same, but the report doesn’t just partition variation into “modeled” and “residual,” it goes further by partitioning the modeled variation among the different explanatory terms. There is one row in the report for each individual model term.

  Df Sum Sq Mean Sq F value p-value
nkids 1 185.5 185.5 40.0 0.0001
sex 1 5766.3 5766.3 1244.7 0.0001
father 1 937.6 937.6 202.4 0.0001
mother 1 488.5 488.5 105.5 0.0001
Residuals 893 4137.1 4.6

Notice that the “Residuals” row is exactly the same in the two ANOVA reports. This is because that row just describes the square length of the residuals – it’s the same model with the same residuals being analyzed in the two ANOVA reports. Less obvious, perhaps, is that the sum of squares of the individual model terms adds up across all the terms to give exactly the sum of squares from the whole-model ANOVA report: 185.5 + 5766.3 + 937.6 + 488.5 = 7377.9. The same is true for the degrees of freedom of the model terms: 1 + 1 + 1 + 1 = 4.

What’s new in the term-by-term ANOVA report is that there is a separate mean square, F value, and p-value for each model term. These are calculated in the familiar way: the mean square for any term is just the sum of squares for that term divided by the degrees of freedom for that term. Similarly, the F value for each term is the mean square for that term divided by the mean square of the residuals. For instance, for nkids, the F value is 185.5 / 4.6 = 40. This is translated to a p-value in exactly the same way as in Chapter 14 (using an F distribution with 1 degree of freedom in the numerator and 893 in the denominator – values that come from the corresponding rows of the ANOVA report).
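In R, the term-by-term report is what anova() produces when applied to a single fitted model, and the translation from F to p-value can be checked directly against the F distribution. A sketch, continuing with the hypothetical galton data frame:

  anova(lm(height ~ nkids + sex + father + mother, data = galton))  # one row per term
  # p-value for nkids from its F value, with 1 and 893 degrees of freedom:
  pf(40.0, df1 = 1, df2 = 893, lower.tail = FALSE)  # about 4e-10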

15.2 Covariates Soak Up Variance

An important role of covariates is to account for variance in the response. This reduces the size of residuals. Effectively, the covariates reduce the amount of randomness attributed to the system and thereby make it easier to see patterns in the data. To illustrate, consider a simple question relating to marriage: Do men tend to be older than women when they get married?

The marriage-license dataset contains information on the ages of brides and grooms at the time of their marriage as well as other data on education, race, number of previous marriages, and so on. These data were collected from the on-line marriage license records made available by Mobile County, Alabama in the US. Here’s an excerpt:

BookpageID Person Age
B230p539 Bride 28.7
B230p539 Groom 32.6
B230p677 Bride 52.6
B230p677 Groom 32.3
B230p766 Bride 26.7
B230p766 Groom 34.8
B230p892 Bride 39.6
B230p892 Groom 40.6
… and so on for 49 couples altogether

The Person and Age variables are obvious enough. The BookpageID variable records the location of the marriage license in the Mobile County records as a book number and page number. There is only one license on each page, so this serves as a unique identifier of the couple.

It seems simple enough to test whether men tend to be older than women when they get married: just fit the model Age ~ Person. Here it is:

  Estimate Std. Error t value p-value
Intercept 33.239 2.060 16.13 0.0001
PersonGroom 2.546 2.914 0.87 0.3845

The coefficient on PersonGroom indicates that men are typically 2.5 years older than women when they get married. That's roughly what one would expect from experience – husbands and wives tend to be roughly the same age, with the husband typically a bit older. The confidence interval suggests a different interpretation, however. It's 2.5 ± 5.8, giving reason to believe that a different sample of licenses could give a coefficient of zero, indicating no age difference between men and women. ANOVA tells much the same story:

  Df Sum Sq Mean Sq F value p-value
Person 1 159 159 0.76 0.38
Residuals 96 19967 208.00

The F value, 0.76, is smaller than one. Correspondingly, the p-value is big: 0.38. Based on these data, it’s not appropriate to reject the null hypothesis that men and women are the same age when they get married.

No big deal. Perhaps there just isn’t enough data. With some effort, more data could be collected and this might bring the p-value down to the point where the null could be rejected.

Actually, there’s much more information in this sample than the hypothesis test suggests. The model Age ~ Person ignores a very important structure in the data: Who is married to whom. A person’s age at marriage is very strongly related to the age of their spouse. That is, if you know the age of the wife, you have a lot of information about the age of the husband, and vice versa. This relationship between the husband and wife is carried by the variable BookpageID. Since the model Age ~ Person doesn’t include BookpageID as a covariate, the model tends to overstate the unexplained part of age.

To fix things, add in BookpageID as a covariate and examine the model Age ~ Person + BookpageID. In this model the coefficient on PersonGroom refers to the difference in age between males and females, holding constant the family. That is, this model looks at the difference between the husband and wife within the family. Here’s the term-by-term ANOVA report:

  Df Sum Sq Mean Sq F value p-value
Person 1 159.0 159.0 9.07 0.0041
BookpageID 48 19127.0 398.5 22.77 0.0001
Residuals 48 840.0 17.5

Notice that here the F value of Person is much larger than before, 9.07 versus 0.76 before. Correspondingly the p-value is much smaller: 0.004 versus 0.384. But the sum of squares of Person is exactly the same as it was for the previous model. What’s happened is that the inclusion of BookpageID in the model has dramatically reduced the size of the residuals. The residual mean square is now 17.5 compared to 208 in the model that omitted BookpageID as a covariate.

It’s not that BookpageID is a variable that’s directly of interest. But that doesn’t mean that you shouldn’t include BookpageID in a model. It accounts for some of the variation in Age and in so doing makes it easier for the role of other variables to be seen above the random variation due to sampling.
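Here is a sketch of the two fits in R, assuming the license data are in a data frame named marriage (a hypothetical name) with the variables shown in the excerpt:

  m1 <- lm(Age ~ Person, data = marriage)               # ignores the pairing
  m2 <- lm(Age ~ Person + BookpageID, data = marriage)  # couple as covariate
  anova(m1)  # Person: F = 0.76, p = 0.38
  anova(m2)  # Person: F = 9.07, p = 0.004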

The particular hypothesis test conducted here has a specific name, the paired t-test. The word “paired” reflects the natural pairing in the data: each family has a wife and a husband. Another context in which pairing arises is before-and-after studies, for example measuring a subject’s blood pressure both before and after a drug is given.
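For the two-measurement case, the same comparison can be run directly with R's t.test(). A sketch, again with the hypothetical marriage data frame, assuming its rows are ordered so that each bride lines up with her groom:

  groom <- with(marriage, Age[Person == "Groom"])
  bride <- with(marriage, Age[Person == "Bride"])
  t.test(groom, bride, paired = TRUE)  # the square of this t equals the F on Person above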

A more general term is repeated measures. In a paired situation there are two measurements on each subject, but there can also be situations where there are more than two.

A still more general term is analysis of covariance, which refers simply to a situation where covariates are used to soak up the residuals.

Example: Wages and Race

The Current Population Survey data (from the mid-1980s) can be analyzed to look for signs that there is discrimination based on race in the wages people earn. That dataset has a variable, race, that encodes the race of each wage earner as either Black or White. The obvious, simple model is wage ~ race. Here’s the coefficient report:

  Estimate Std. Error t value p-value
Intercept 8.06 0.63 12.86 0.0001
raceW 1.10 0.67 1.65 0.1001

According to this report, whites earned an average of $1.10 per hour more than Blacks. But the standard error is large: the confidence interval on this estimate is $1.10 ± 1.34. The ANOVA report on the model suggests that these data don't give enough evidence to reject the null hypothesis that race plays no role in wages:

  Df Sum Sq Mean Sq F value p-value
race 1 71.5 71.45 2.71 0.1001
Residuals 532 14005.3 26.33

The F value is 2.7, bigger than the expected value of one, but not so big as to be compelling evidence. The p-value is 0.10, too large to reject the null.

Remember that the p-value compares the explanatory power of race to a typical random term in order to account for sampling variation. In this model, all of the variation in wage is considered random other than that attributed to race. This is not reasonable. There are other sources of variation in wage such as the education of the worker and the type of job. Consider now a model with educ and sector of the economy as covariates: wage ~ race + educ + sector:

  Df Sum Sq Mean Sq F value p-value
race 1 71.45 71.45 3.44 0.0641
educ 1 2017.67 2017.67 97.22 0.0001
sector 7 1112.92 158.99 7.66 0.0001
Residuals 524 10874.66 20.75

Now the F value on race is increased, simply because the size of the residuals has been reduced by inclusion of the covariates.

The lesson here is, again, that including covariates can make the patterns of the explanatory variable of interest stand out more clearly. Admittedly, the change in F here is not huge, but it has come at a very small price: fitting a new model. Collecting data can be expensive. The modeler seeks to extract the most information from that data.
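A sketch of the two models in R, assuming the survey data are in a data frame named cps (a hypothetical name) with variables wage, race, educ, and sector:

  anova(lm(wage ~ race, data = cps))                  # race: F = 2.71, p = 0.10
  anova(lm(wage ~ race + educ + sector, data = cps))  # race: F = 3.44, p = 0.064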

This example will be continued later in this chapter.

15.3 Measuring the Sum of Squares

At a superficial level, interpretation of the term-by-term ANOVA report is straightforward. Use the p-value on each row to decide whether to reject the null hypothesis for that row. For instance, return to the height-model example from before:

height ~ nkids + sex + father + mother

  Df Sum Sq Mean Sq F value p-value
nkids 1 185.46 185.46 40.03 0.0001
sex 1 5766.33 5766.33 1244.67 0.0001
father 1 937.63 937.63 202.39 0.0001
mother 1 488.52 488.52 105.45 0.0001
Residuals 893 4137.12 4.63

The very small p-value on nkids (it’s actually 0.0000000004) prompts a rejection of the null hypothesis on the nkids term. But there are other ANOVA reports on exactly the same model fitted to exactly the same data. Here’s one:

  Df Sum Sq Mean Sq F value p-value
sex 1 5874.57 5874.57 1268.03 0.0001
father 1 1001.11 1001.11 216.09 0.0001
mother 1 490.22 490.22 105.81 0.0001
nkids 1 12.04 12.04 2.60 0.1073
Residuals 893 4137.12 4.63

According to this report, nkids is not making a significant contribution to the model.

Why do the two reports give such different results for nkids? Which is right, the second report’s p-value on nkids of 0.107 or the first report’s p-value of 0.0000000004? Obviously, they lead to different conclusions. But they are both right. It’s just that there are two different null hypotheses involved. In the first report, the null is that nkids on its own doesn’t contribute beyond what’s typical for a random vector. In the second report, the null is that nkids doesn’t contribute beyond the contribution already made by mother, father, and sex. The two reports give us different perspectives on nkids.

In understanding term-by-term ANOVA, it’s helpful to start with an analogy. Suppose you have a sports team consisting of players B, C, D, and so on. As a whole, the team scores 30 points in a game. B scored 3 of these points, C 10 of them, D no points, and so on.

The score reflects the team as a whole but you want to be able to identify particularly strong or weak players as individuals. The obvious thing to do is give each player credit for the points that he or she actually scored: B gets 3 points of credit, C 10 points of credit, D no credit, and so on.

This system, though simple, has its flaws. In particular, the system doesn’t acknowledge the role of teamwork. Perhaps D’s role is primarily defensive; she contributes to the team even though she doesn’t actually do the scoring. C scores a lot, but maybe this is because B does a great job of passing to her. In evaluating an individual player’s contribution, it’s important to take into account the context established by the other players.

Now consider two different ways of assigning credit to individual players. Each involves replaying the game with some changes; this is impractical in real-world game playing but quite feasible when it comes to fitting models.

  • Take out one player. Play the game and record the score. Now take out one player and re-play the game. The player who was taken out gets credit for the difference in scores between the two games.

    What's nice about this method is that it gives the player credit for how he or she contributed to the score beyond what the other players could have accomplished on their own.

  • Take out multiple players. Play the game with a subset of the players. Now replay the game, adding in one new player. The new player gets credit for the difference in scores between the two games.

    As you can imagine, the result of this method will depend on which subset of players was in the game. In one extreme limit, the new player would be the only player: a one-person team.

Now to translate this analogy back to ANOVA. First, the game is played with only one player: a single model term, as in the model height ~ 1. The game's score is recorded as the sum of squares of the fitted model values. Next, another player is put in and the game is replayed: as in height ~ 1 + nkids. More and more games are played, each successive game adding in one more player: height ~ 1 + nkids + sex, then height ~ 1 + nkids + sex + father, and so on. In each game, the score is measured as the sum of squares.

To illustrate the process, reconstruct the term-by-term ANOVA on the model height ~ 1 + nkids + sex + father + mother. Think of this as a set of nested models, as shown in the table. For each of the nested models a "score" – that is, a sum of squares – is found.

Model Sum of Squares Term added Additional Sum of Sq.
height ~ 1 4002376.8 intercept
height ~ 1 + nkids 4002562.3 nkids 185.5
height ~ 1 + nkids + sex 4008328.6 sex 5766.3
height ~ 1 + nkids + sex + father 4009266.2 father 937.6
height ~ 1 + nkids + sex + father + mother 4009754.7 mother 488.5

The “Additional Sum of Squares” assigned to each term reflects how much that term increased the sum of squares compared to the previous model. For instance, adding the nkids term in the second line of the table increased the sum of squares from 4002376.8 to 4002562.3, an increase of 185.5. It’s the increase that’s attributed to nkids.
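Each "score" is the square length of the model's fitted-value vector, which can be computed directly. A sketch with the hypothetical galton data frame:

  sum(fitted(lm(height ~ 1, data = galton))^2)          # 4002376.8
  sum(fitted(lm(height ~ 1 + nkids, data = galton))^2)  # 4002562.3
  # the difference, 185.5, is the sum of squares credited to nkids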

Perhaps you’re wondering, why did nkids go first? ANOVA can take the terms in any order you want. For example:

Model Sum of Squares Term added Additional Sum of Sq.
height ~ 1 4002376.8 intercept
height ~ 1 + mother 4002845.1 mother 468.3
height ~ 1 + mother + father 4003630.8 father 785.7
height ~ 1 + mother + father + sex 4009742.7 sex 6112.0
height ~ 1 + mother + father + sex + nkids 4009754.7 nkids 12.0

Notice that with the new arrangement of terms, the sum of squares assigned to each term is different. In particular, the first report gave nkids a fairly large sum of squares: 185.5. The second report gave nkids only 12.0. This change plays out in the p-value on nkids given in the ANOVA report: it’s 0.107 now compared to 0.0000000004.
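In R, the order of terms in the model formula is the order that anova() uses, so the two reports come from the same model written two ways:

  anova(lm(height ~ nkids + sex + father + mother, data = galton))  # nkids first: p = 0.0000000004
  anova(lm(height ~ mother + father + sex + nkids, data = galton))  # nkids last: p = 0.107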

15.4 ANOVA, Collinearity, and Multi-Collinearity

In sports, it’s easy to see why the performance of one player on a team depends on the context set by the other players. Players specialize in offense and defense, they support one another by passing and blocking or by drawing away the opponent’s defense. A poor defensive player may hurt his own team by depriving it of opportunities to go on the offense. A good team player improves the performance of other players on the team by creating opportunities for them to perform well.

With model terms, the situation is analogous but obviously lacking the aspect of personality and intentionality that shapes team dynamics.

In ANOVA, each model term is assigned a sum of squares: think of this as the distance that the term, playing the role of a team member, "carries the ball" forward. When there are collinear model terms, it is as if each player could have carried the ball over the same ground. This raises the question of who should get credit.

The approach taken in ANOVA in a model like A ~ B + C is to assign player B the credit for the distance the ball would have been carried without player C, that is, the credit from the simulated play A ~ B. Then C gets the rest of the credit for A ~ B + C.

It's important to do things this way to avoid double counting. Marching forward with this sports metaphor, imagine that the accounting were done in another way, measuring the actual distance that each player carries the ball. It might be that only B carries the ball, but C performs an important block that lets B go further. In the carry-the-ball accounting, C would get no credit. But with ANOVA, B gets the credit for the distance that he would have gone without the help of C's block, and C gets the credit for the extra distance that B was able to go because of the block.

Now imagine what will happen with A ~ C + B. This is exactly the same model as before, but now ANOVA has been directed to consider first the play A ~ C to assign credit to player C, then look at the result of A ~ C + B to figure out how much credit to give to B. The model A ~ C corresponds to the play in which C successfully makes his block, but since there is no ball carrier B, player C will get absolutely no credit. A block without a ball carrier is of no use whatsoever! Now add in player B – that is, model A ~ C + B. With both the ball carrier and the blocker, the ball will be moved down the field. But B will get all the credit now, because there was no credit to be given to C in the play A ~ C. This is why the credit assigned by ANOVA depends on the order of terms.

When model terms are orthogonal, it is as if the players had no influence on one another’s performance. For orthogonal model terms, the results of ANOVA are the same, regardless of the order in which they are considered. This simplifies interpretation, since there is no decision to be made about the proper order. But, if your model terms are not orthogonal, you need to be thoughtful in interpreting the ANOVA report, considering the implications of the different possible orderings of model terms.

Example: Height and Siblings

In fitting the model height ~ nkids + sex + father + mother to Galton's data, the p-value on nkids depended substantially on where the term was placed in the ANOVA report. When nkids came first, the p-value was small, leading to rejection of the null hypothesis. But when nkids came last, the p-value was not small enough to justify rejection of the null.

As always, the order dependence of the term-by-term ANOVA results stems from collinearity. You can investigate this directly by modeling nkids as a function of the covariates. Here is the regression report on the model nkids ~ sex + father + mother:

  Estimate Std. Error t value p-value
Intercept 19.2374 3.3772 5.70 0.0001
sexM −0.3652 0.1770 −2.06 0.0394
father −0.1751 0.0359 −4.88 0.0001
mother −0.0123 0.0385 −0.32 0.7488

The model coefficients indicate that boys tend to have fewer siblings than girls. This may sound a bit strange, because obviously within any family every boy has exactly the same number of siblings as every girl. But the model applies across families, so this model is saying that families with boys tend to have fewer children than families with girls. This is a well known phenomenon. Depending on your point of view, you might like to interpret it in either of two ways. Perhaps it's that parents with boys get worn out and stop having children. An alternative is that, particularly in Galton's time when women were definitely second-class citizens, parents who didn't have boys continued to have more children in the hopes of producing a boy. Either way, the model indicates that nkids is collinear with sex. It also appears to be related to father, as you will see later in this example.

A nice way to measure the extent of collinearity is with either the model R² or, equivalently but more directly geometrically, with the angle ϑ between nkids and the fitted model vector. For the model, R² = 0.03, corresponding to ϑ = 80°. That’s practically 90°, hardly any alignment at all! How can that trivial amount of collinearity lead to the dramatic order-dependence of the ANOVA result on nkids?
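Before answering, note that both quantities come from regressing nkids on the covariates. A sketch:

  cmod <- lm(nkids ~ sex + father + mother, data = galton)
  r2 <- summary(cmod)$r.squared  # about 0.03
  acos(sqrt(r2)) * 180 / pi      # about 80 degrees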

To examine this, imagine splitting nkids into two vectors: one that lies exactly in the sex, father, mother subspace and one that is exactly orthogonal to it. These two vectors correspond to the fitted model values and the residuals of the nkids model. Call them aligned and perp and add them to the height model in place of nkids. It's important to remember that neither of these vectors individually points in exactly the same direction as nkids, but their sum does; in other words,

nkids = aligned + perp.
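In R, the aligned part is the vector of fitted values from regressing nkids on the covariates, and perp is the vector of residuals. A sketch:

  cmod <- lm(nkids ~ sex + father + mother, data = galton)
  galton$aligned <- fitted(cmod)  # lies within the sex, father, mother subspace
  galton$perp <- resid(cmod)      # exactly orthogonal to that subspace
  anova(lm(height ~ aligned + perp + sex + father + mother, data = galton))

(Depending on your software, the redundant final term may be reported with zero degrees of freedom or silently dropped.)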

Here’s one ANOVA report, where aligned and perp have been placed first:

  Df Sum Sq Mean Sq F value p-value
aligned 1 3435.49 3435.49 741.55 0.0001
perp 1 12.04 12.04 2.60 0.1073
sex 1 3529.00 3529.00 761.74 0.0001
father 1 401.41 401.41 86.64 0.0001
mother 0 0.00 NA NA NA
Residuals 893 4137.12 4.63

There are several interesting aspects to this report. First, aligned shows up as very significant, with a huge sum of squares compared to the original nkids – the original was only 185.5. This is because aligned happens to point in a very favorable direction, closely related to the very important variable sex. Second, perp gets a very small sum of squares: only 12. Third, sex is much reduced compared to the original ANOVA. This is because aligned has already grabbed much of the sum of squares originally associated with sex. Finally, the mother term gets absolutely no sum of squares and, indeed, has zero degrees of freedom. This is because there is one degree of redundancy among the set aligned, sex, father, and mother. Since mother came last in the report, it had absolutely nothing to contribute.

Now consider an ANOVA report on the same terms, but in a different order:

  Df Sum Sq Mean Sq F value p-value
sex 1 5874.57 5874.57 1268.03 0.0001
father 1 1001.11 1001.11 216.09 0.0001
mother 1 490.22 490.22 105.81 0.0001
aligned 0 0 NA NA NA
perp 1 12.04 12.04 2.60 0.1073
Residuals 893 4137.12 4.63

Coming after father, mother and sex, the aligned vector is the one discarded as being redundant. Indeed, it was constructed to fit exactly into the subspace spanned by father, mother and sex. Coming before aligned in the ANOVA report, the term sex has regained its large sum of squares, indicating that sex is a strong predictor of height.

Now look at perp. Although it comes at the end of the list of terms, it has exactly the same set of values that it had when it came at the beginning of the list. This is because perp is exactly orthogonal to all the other terms in the report – it was constructed that way as the residual from fitting nkids to the other terms. Whenever a term is orthogonal to the others, the sum of squares for that term will not depend on where the term comes in order with respect to the others.

It’s worth pointing out that the sum of squares, F value, and p-value on perp exactly matches that from the original ANOVA report using nkids when nkids came last. This is not a coincidence. It stems from putting father, mother and sex before nkids in the report – those other variables effectively stripped nkids of any component that was aligned with them, leaving only the part of nkids perpendicular to them.

All this may sound like an abstract exercise in geometrical thinking, but it actually carries a strong implication for interpreting the relationship between nkids and height. Earlier, a possible causal relationship was suggested; perhaps in families with many kids, competition for food and other resources caused the kids to be stunted in growth. Or it might be that families that had lots of kids back then were healthier or rich in resources that caused the kids to grow well.

The results with aligned and perp suggest something different. If nkids really did cause height, then the component of nkids perpendicular to sex and the other covariates ought to have some explanatory power with regard to height. On the other hand, if it’s really sex and the other covariates that cause height, and nkids has nothing to do with it, then the perpendicular component of nkids ought to be no better than a random vector. The ANOVA report is consistent with this second scenario. The p-value of 0.107 isn’t sufficiently small to justify rejecting the null that perp plays no role.

Even a very small alignment with genuinely important variables – nkids is at an angle of ϑ = 80° with respect to the sex, father, mother subspace – can create a sum of squares that’s statistically significant. Keep in mind that “significance” in a hypothesis test is defined with respect not to how strong or substantial or authentic the variable’s role is, but how it compares in size to a completely random variable. With large n, even a very small alignment can be much larger than expected for a random vector. There are n=898 cases in Galton’s data, and in a space of dimension 898 it’s very unlikely that a random vector would show even an alignment as far from perpendicular as 80°. (The 95% confidence interval for the angle that a random vector makes with another vector is 86° to 94°.)
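That interval is easy to check by simulation: draw pairs of random vectors with 898 components and compute the angle between them.

  angles <- replicate(10000, {
    u <- rnorm(898); v <- rnorm(898)
    acos(sum(u * v) / sqrt(sum(u^2) * sum(v^2))) * 180 / pi
  })
  quantile(angles, c(0.025, 0.975))  # roughly 86 and 94 degrees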

15.5 Choosing the Order of Terms in ANOVA

The order-dependence of ANOVA results creates profound problems in interpretation. There is, unfortunately, no generally applicable way to identify a single correct ordering. This is because the order-dependence in ANOVA reflects genuine ambiguities in how to interpret data. Eliminating these ambiguities requires adding new information to the analysis, such as the modeler's assumptions or knowledge about the system under study. An illustration of this is given in the example in Section 15.4, where it was assumed (reasonably enough) that the number of children in a family might be the result of covariates such as the children's sex or the height of the mother and father, but that the number of kids doesn't itself cause those covariates.

The mathematical analysis of the order-dependence of term-by-term ANOVA as it stems from collinearity can illuminate the situation, but doesn’t resolve the ambiguity.

Here are some suggestions for dealing with the ambiguity of term-by-term ANOVA. It’s important to remember that despite the problems with ANOVA, it offers the important advantage of allowing covariates to soak up unexplained variance. Other methods such as whole-model ANOVA or the regression report (the topic of Section 12.2) may avoid ambiguity but this is only because they don’t offer the flexibility of ANOVA. Such methods also impose their own hidden assumptions.

Treat ANOVA as a screening method. In medicine, screening tests are ways to get a quick, inexpensive indication of whether a patient has an illness or condition. They tend to have relatively high error rates. Nonetheless, they can be genuinely useful for identifying situations where more invasive or expensive methods can be used to get a more definitive result.

ANOVA provides a quick, powerful, and effective way to assess the strength of evidence for a claim. You can use it to probe whether variables should be included in a model and what covariates might be included as well. The results shouldn’t be taken as proof that a hypothesis is correct. Indeed, you should always keep in mind that hypothesis testing never offers a proof of a hypothesis. It can only produce a rejection of the null or a failure to reject the null.

Arrange the covariates so that they are orthogonal to the explanatory variable. If two explanatory vectors are not at all collinear, that is, if they are exactly orthogonal to each other, then the order of the terms has no effect.

If you are collecting data in the context of an experiment, you may well be able to arrange things so that the explanatory variables and covariates are orthogonal to one another. Chapter 18 on experimental design covers some of the techniques to do this.

Chapter 12 explains that collinearity between explanatory vectors increases standard errors, that is, worsens the precision of your coefficient estimates. This inflation of standard errors is another reason to try to create orthogonality, if you have the means to do so.

Adopt a skeptical attitude. If you are inclined to be skeptical about a variable, look for the arrangement of terms that produces the smallest sum of squares. If, despite this, the variable still has a significant sum of squares, the ambiguity of order-dependence won’t influence your conclusions.

But in interpreting your results, keep in mind that there might be other covariates that weren’t included in your models that might reduce your variable to insignificance.

Adopt an enthusiastic attitude of hypothesis formation. Although ANOVA is formally a mechanism for hypothesis testing, it can also be used for hypothesis formation. Explore your data using ANOVA. If you find an arrangement of explanatory variables that produces a significant result, that’s a good reason to devote some time and energy to thinking more about that result and how to follow up on it.

But don’t be complacent. Once you have formed a hypothesis, you will have to test it skeptically at some point.

Don’t forget other ambiguities. The ambiguity of order dependence is readily seen in ANOVA reports. But there are other ambiguities and uncertainties that you won’t necessarily be aware of. Why did you choose the particular covariates in your analysis? Why did you choose to measure certain variables and not others? Did you include interaction terms (which often introduce collinearities)? Did you consider transformation terms? The example in the previous section treats the number of kids as a simple linear term. But there is reason to think that its effect might be nonlinear, that in small families height is associated with having more siblings, but in large families the reverse is true.

Even if there were no ambiguities of order dependence, there would still be ambiguities and uncertainties in your model results.

Think carefully about how your system works, and model the system rather than just the data. The source of order dependence is collinearity among the explanatory variables. Such collinearity is itself evidence that the explanatory variables are related in some way. A model like A ~ B + C + D is about the relationship between A and the explanatory variables; it doesn’t describe the relationships among the explanatory variables, even though these may be important to understanding how the system works. If you want to understand the system as a whole, you need to model the system as a whole. This involves a set of models. For example, suppose that in the real system, A is caused by both B and C, and D exerts its influence only because it affects B. For such a system, it’s appropriate to consider a set of models: A ~ B + C and B ~ D. Things can quickly get more complicated. For example, suppose C influences D. In that case, the model A ~ B + C would confuse the direct influence of C on A with the indirect influence that C has on A via D and therefore B.

The methods of structural equation modeling (SEM) are designed to deal with such complexity by creating structured sets of models. The structure itself is based in the modeler’s knowledge or assumptions about how the system works.

Example: Wages and Race: Part 2

The example in Section 15.2 looked at the Current Population Survey data to see if there is evidence that race is associated with wages. The simple model wage ~ race produced a p-value on race of 0.10, not strong evidence. However, including the covariates educ and sector made the role of race somewhat clearer as shown by the term-by-term ANOVA report:

  Df Sum Sq Mean Sq F value p-value
race 1 71.45 71.45 3.44 0.0641
educ 1 2017.67 2017.67 97.22 0.0001
sector 7 1112.92 158.99 7.66 0.0001
Residuals 524 10874.66 20.75

Since the sum of squares in term-by-term ANOVA can be order dependent, it’s worthwhile to look at a different ordering of terms. Here’s one with race last:

  Df Sum Sq Mean Sq F value p-value
educ 1 2053.29 2053.29 98.94 0.0001
sector 7 1136.56 162.37 7.82 0.0001
race 1 12.19 12.19 0.59 0.4439
Residuals 524 10874.66 20.75

This report suggests that the data provide no evidence that race plays any role at all.

Which conclusion is right? Do the data provide evidence for the importance of race or not? The answer is that it depends on your view of how the system is working.

Suppose, for instance, that race is really a causal influence in determining people's level of education or the sector of the economy that they are permitted to work in. In such a situation, it's appropriate to put race first in order to reflect the claim that race influences educ and sector; race should therefore get credit for any of the ways it overlaps with educ and sector in accounting for wage.

Now consider an alternative perspective. It’s certainly clear that educ and sector can’t possibly be causing race. But it is possible that a person’s race is correlated with education and sector without having caused them. How? Both education and sector can be influenced by the opportunities and role models that a person had as a child and these in turn are set by the situation in which that person’s parents lived, including racial discrimination against the parents. According to this perspective, the person’s race doesn’t cause their level of education or the sector in which they work. Instead, all these things were causally influenced by other factors. In such a situation, it might be fair to assume in the null hypothesis that educ and sector are primary influences on wage independent of race. To test this null hypothesis, which treats the potential role of race with skepticism, race should come after educ and sector in the ANOVA breakdown.

There’s also a third possibility here. Suppose you are looking for evidence that employers discriminate and you suspect that one mechanism for such discrimination is to restrict which sectors people can work in. That’s not an unreasonable idea. But it’s harder to believe that the employers have influenced the person’s level of education. So, put educ before race and put sector after race:

  Df Sum Sq Mean Sq F value p-value
educ 1 2053.29 2053.29 98.94 0.0001
race 1 35.83 35.83 1.73 0.1894
sector 7 1112.92 158.99 7.66 0.0001
Residuals 524 10874.66 20.75

This set of assumptions leads to failure to reject the null on race: the skeptic’s attitude toward race is supported by the analysis.

The core of the matter is whether covariates such as sector or educ are mechanisms through which the effects of race play out, or whether they provide alternative, non-racial explanations for the variation in wage. Depending on which one of these you believe, you can get very different results from ANOVA.

In this wages and race example, people with different views can each see that the data are consistent with those views. The data provide evidence for the person who thinks that race influences wages through the mechanism of educ and sector. Similarly, for the person who says that educ and sector are the determinants of wage, the data are consistent with the view that race doesn’t play a role.

ANOVA treats a covariate as coming either before or after the explanatory variable. In contrast, the regression report treats the explanatory variable and the covariates simultaneously.

The coefficient t-test is a hypothesis test on a coefficient in a model. The null hypothesis is that, in the population, that coefficient is zero but all the other coefficients are exactly as given by the model fit. Under this null hypothesis, the sampling distribution of the coefficient of interest is centered on zero. The standard deviation of the sampling distribution is the standard error found in the model regression report.

To illustrate, consider the model height ~ nkids + sex + mother + father fitted to Galton’s height data:

  Estimate Std. Error t value p-value
Intercept 16.1877 2.7939 5.79 0.0001
nkids −0.0438 0.0272 −1.61 0.1073
sexM 5.2099 0.1442 36.12 0.0001
mother 0.3210 0.0313 10.27 0.0001
father 0.3983 0.0296 13.47 0.0001

According to this model, the coefficient on nkids is −0.0438 with a standard error of 0.0272. Under the null hypothesis for nkids, the sampling distribution of the coefficient has a mean of zero and a standard deviation of 0.0272. The p-value of the observed coefficient, 0.107, tells how likely it is for that value or a more extreme one to be observed as a random draw from the sampling distribution. In this model, the p-value for nkids is not small enough to justify rejecting the null hypothesis.
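The regression report is produced by R's summary(), and the translation from t to p-value can be checked against the t distribution. A sketch with the hypothetical galton data frame:

  summary(lm(height ~ nkids + sex + mother + father, data = galton))
  2 * pt(-0.0438 / 0.0272, df = 893)  # two-sided p-value for nkids, about 0.107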

A nice feature of the regression report is that it conducts a hypothesis test on each of the coefficients at the same time. This makes it easy to scan for coefficients that are statistically non-zero. On the other hand, this also introduces the need to adjust the rejection threshold to take into account the multiple hypothesis tests that are being conducted simultaneously. For example, in the above report, a Bonferroni adjustment would change the rejection threshold from the conventional 0.05 to 0.05/4 = 0.0125, since there are four coefficients being tested. (The hypothesis test on the intercept would not be of interest, since it doesn't relate to any of the explanatory variables.)

A not-so-nice feature of the regression report is that collinearity tends to produce high p-values on the coefficients of all of the collinear vectors; the report itself gives no hint that any one of the collinear vectors might have a low p-value if the others were dropped from the model. To find this out, you need to try various orderings of the terms in the ANOVA report.

As always, the easiest situation for interpretation is when there is no collinearity at all between the explanatory variable and the covariates. In such a case, the regression report p-value is exactly the same as the p-value from ANOVA. However, when there is a correlation between the explanatory variable and the covariates, the regression report suffers from the same sorts of ambiguities as ANOVA; it's just that these are hidden because a simple re-arrangement of terms doesn't change the regression report.

Example: Wages and Race: Part 3

Earlier in this chapter, the Current Population Survey data was explored using term-by-term ANOVA to investigate the potential role of race. The conclusions reached differed depending on the assumptions made by the modeler as reflected by the order of terms in the ANOVA report.

Here is yet another analysis, using the regression report:

  Estimate Std. Error t value p-value
Intercept 0.2083 1.4298 0.15 0.8842
raceW 0.4614 0.6021 0.77 0.4439
educ 0.5274 0.0965 5.47 0.0001
sectorconst 3.0204 1.1320 2.67 0.0079
sectormanag 4.3935 0.7857 5.59 0.0001
sectormanuf 1.5382 0.7400 2.08 0.0381
sectorother 1.6285 0.7301 2.23 0.0261
sectorprof 3.0602 0.6947 4.41 0.0001
sectorsales −0.0085 0.8734 −0.01 0.9922
sectorservice −0.1574 0.6940 −0.23 0.8206

That raceW comes first in this report doesn’t signify anything. In a regression report, the values don’t depend at all on the order of the terms.

The valuable thing about the regression report is that it gives not just the statistical significance of the term, but a measure of the actual strength of the term. This report indicates that whites earned $0.46 ± 1.20 more than Blacks, after adjusting for the sector of the economy and the educational level of the worker. The p-value reflects the wide confidence interval: at 0.44 there’s no basis for rejecting the null hypothesis that the population coefficient on raceW is zero.

Because the regression report is independent of the order of terms, it's tempting to think that it's somehow more objective than term-by-term ANOVA. That's wrong. The causal model implicit in the regression report has its own point of view: that each of the terms in the model contributes to the response variable directly, not through the mediation of another variable. Thus, for the person who thinks that sector is merely a mechanism for discrimination by race, the regression report is not an appropriate form of analysis. Similarly, the regression report isn't appropriate for the person who thinks that it's really educ and sector that determine wages and that race only appears to come into play because it happens to be correlated with those other variables. The regression report is appropriate only for the person for whom causation isn't an issue. This might be because they want to construct a prediction model or because they believe that none of the explanatory variables causes the others.

Example: Testing Universities

University students are used to being given tests and graded as individuals. But who grades the universities to find out how good a job the institutions are doing?

This need not be hard. Most universities require an entrance test – the SAT is widely used in the US for undergraduate admissions, the GRE, LSAT, MCAT, etc. for graduate school. Why not give students the entrance test a second time when they leave and compare the two results? If students at Alpha University show a large increase compared to students at Beta University, then there is evidence that Alpha does a better job than Beta.

There are good and bad reasons why such testing is not widely done. University faculty as a group are notoriously independent and don't like being evaluated. They are suspicious of standardized tests for evaluation of their teaching (although they don't often oppose using them for admissions purposes). They point out that there are many factors outside of their control that influence the results, such as the students' preparedness, motivation, and work ethic. (On the other hand, universities are happy to take credit for the students who succeed.)

One legitimate obstacle to meaningful testing of universities is that students are not required to take an exit test and not motivated to perform well on it. A student who is applying to university has to take the entrance exam and has good reason to try to do well on it – favorable admissions decisions and scholarship awards often follow good test results. But the student who is soon to graduate has no such motivation, either to take the exit exam or to do well on it. So what does the difference between the entrance and exit scores show: How much the students learned or that the graduating students have something better to do with their time than take an exit exam?

The Collegiate Learning Assessment (CLA) is a standardized test consisting of several essay questions that probe a student's critical thinking skills (Klein 2007). (See www.cae.org.) As of 2008, more than 200 universities and colleges had used the exam. The CLA is given to one group of first-year students and another group of final-year students. The scores of the two groups are then compared.

Ideally, the same students would be tested in both their first and last years. If this were done, each student could be compared to himself or herself and covariates such as previous preparation or overall aptitude would be held constant within each entry-to-exit comparison. Unfortunately, such longitudinal designs were used at only 15% of the CLA schools. At the remaining schools, for reasons of cost and administrative complexity, a cross-sectional design was used, where the two groups consist of different students. When the study is cross-sectional, it can be important to adjust for covariates that may differ from one group to another.

In the case of the CLA, the students’ performance on other standardized tests (such as the entrance SAT) is used for the adjustment. As it happens, there is a strong correlation between the SAT score and other explanatory variables. This collinearity tends to increase the standard error of the estimate of school-to-school differences.

Two studies of the CLA adjustment methodology concluded that the adjustments so inflated standard errors that no clear inference could be made about school-to-school differences (Basken 2008). The studies called for longitudinal testing to be done in order to increase the reliability of the results. But, short of requiring students to take both the entrance and exit exams, it's not clear how the longitudinal approach can be implemented.

15.6 Non-Parametric Statistics

It sometimes happens that one or more of your variables have outliers or a highly skewed distribution. In this situation, the extreme values of the variable can have an unduly strong influence on the fit of the model. As a result, the p-values calculated from ANOVA can be misleading.

Common sense suggests deleting the outliers or extreme values from the data used to fit the model. Certainly this is an appropriate step if you have good reason to think that the extreme values are bogus.

A more moderate approach is to fit two models: one with and one without the extreme values. If the two models give similar results, then you can reasonably conclude that the extreme values are not having an undue effect.

But if the two models give very different results, you have a quandary. Which model should you use? On the one hand, you don't want some cases to have an "undue" influence. On the other hand, those cases might be telling you something about the system and you need to be able to guard against the criticism that you have altered your data.

There is a simple and effective way to deal with this situation for the purpose of deciding whether the extreme values are dominating the p-value calculations. The approach is to transform your data in a way that is honest to the extreme values, keeping them at the edge of the distribution but bringing them closer to the mainstream. The easiest such transform is called the rank transform. It replaces each value in a variable with the position of the value in the set of values. For example, suppose you have a variable with these values:

23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3

It’s obvious that the seventh value is quite different from the other six. Applying the rank transform, this variable becomes

2, 1, 3, 5, 4, 6, 7.

That is, 23.4 is the 2nd smallest value, 17.3 is the 1st smallest value, 26.8 is the 3rd smallest, and so on. With the rank transform, the extreme value of 7352.3 becomes nothing special: it has rank 7 because it is the seventh smallest value. There are never outliers in a rank transformed variable. Nevertheless, the rank transformed data are honest to the original data in the sense that each value retains its position relative to the other values.
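In R, the rank transform is the built-in rank() function:

  x <- c(23.4, 17.3, 26.8, 32.8, 31.3, 34.5, 7352.3)
  rank(x)  # 2 1 3 5 4 6 7

A rank transform can also be applied directly inside a model formula, for example lm(rank(wage) ~ race + educ + sector, data = cps), with the hypothetical cps data frame from earlier.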

Analysis of rank-transformed data is called non-parametric statistics. The coefficients from models of rank-transformed data are hard to interpret, because they describe how the response value changes due to a one-step increase in rank rather than a one-unit increase in the value of the variable. Still, the p-values can be interpreted in a standard way.

When outliers or extreme values are a concern, it’s worth comparing the p-values from the original (parametric) model and the non-parametric model fitted to the rank-transformed data. If the p-values lead to the same conclusion, you know that the extreme values are not shaping the results. If the p-values for the two models lead to different conclusions then your work is just beginning and you need to think carefully about what is going on.

15.7 Sample Size and Power

If the p-value on your explanatory variable of interest is small enough, you're entitled to reject the null. If it's not so small, you are left in the ambiguous situation of "failing to reject." This failure might be because the explanatory variable is indeed irrelevant to the response variable. That's a valuable finding. On the other hand, there can be a much more mundane explanation for the failure: your sample size was not large enough.

In ANOVA, the p-value gets smaller as F gets bigger. In turn, F reflects both the substance of the effect and the sample size n: larger n produces larger F, as described in Equation 14.2.

The fundamental issue, however, is not F, but the power of your study. In picking a sample size n for your study, your goal should be to produce a study with a large enough power so that, if you do fail to reject the null, you will have good reason to prefer the null hypothesis to the alternative hypothesis as a reasonable representation of your system.

Typically, you start a study with some working hypothesis about how the explanatory variable is related to the response. This working hypothesis is given the name alternative hypothesis. (The word “alternative” is simply the statistical vocabulary for distinguishing the working hypothesis from the “null” hypothesis.) The goal of this section is to show how to translate your alternative hypothesis into a decision about the appropriate sample size n to produce good power.

It may seem odd to be thinking about sample size here, at the end of a chapter on analyzing data. After all, you have to have your data already in hand before you can analyze it! But modeling is a process of asking questions. That process often involves going back and collecting more or different data based on the results of the analysis of earlier data.

When deciding on a sample size, think ahead to the analysis you will eventually do on the data you collect. It’s likely that your analysis will be analysis of covariance or something like it. So you will end up with an ANOVA table, looking perhaps like this:

Term Df Sum Sq Mean Sq F value p value
Intercept 1 500
Covariates 3 200
Explanatory Var 1 50 50 2 0.19
Residuals 10 250 25
TOTAL 15 1000

From your knowledge of the covariates and the explanatory variable, you should know the degrees of freedom of each model term even before you collect the data. But until you actually have the data, you won't be able to fill in the actual values for the sums of squares. Instead, make a guess – hopefully an informed guess based on your working hypothesis. In the table above, the guessed values are the sums of squares, such as the 200 guessed for the covariates.

What information might you have to work from? Perhaps you have already done a preliminary study and collected just a bit of data – the n=15 cases here. Then you have the ANOVA report from those data. Or perhaps there is an existing study in the literature from a study of the same explanatory variable or an analogous one in a setting similar to the one you are working on. You’ll need to be creative, but realistic.

For the present, assume that the above ANOVA table came from a preliminary study with n=15 cases. The F value of 2 is suggestive, but the p-value is large; you cannot reject the null based on this data set. But perhaps a larger study would be able to reveal what’s merely suggested in the preliminary study.

Imagine what the table would look like for larger n. For simplicity here, consider what would happen if you doubled the number of cases, to n=30. (In reality, you might need to consider some number much larger than the n=30 used in this example.) Of course, you can’t know exactly what the larger n will give until you actually collect the data. But, for planning, it’s sensible to assume that your extended study might look something like the preliminary study. If that’s so, doubling the number of cases will typically double the total sum of squares. (Why? Because you’re adding up twice as many cases. For instance, the sum of squares of 3 4 5 is 50; the sum of squares of 3 4 5 3 4 5 is 100.)

Increasing the number of cases doesn’t change the degrees of freedom of the explanatory variables. The intercept, the covariates, and the explanatory variable terms still have the same degrees of freedom as before. As a result, for these terms the mean square will double.

In contrast, the degrees of freedom of the residual will increase with the number of cases in the sample. The sum of squares of the residual will also increase, but the net effect is that the mean square of the residual will stay roughly the same.

To illustrate, the following table shows the ANOVA tables side by side for a preliminary study with n=15 and an expanded study with n=30.

Preliminary study, n=15:

Term D.F. Sum Sq. Mean Sq. F
Intercept 1 500 500
Covariates 3 200 67
Explan. Var. 1 50 50 2
Residuals 10 250 25
TOTAL 15 1000

New study, n=30:

Term D.F. Sum Sq. Mean Sq. F
Intercept 1 1000 1000
Covariates 3 400 133
Explan. Var. 1 100 100 5
Residuals 25 500 20
TOTAL 30 2000

Overall, by doubling n you can anticipate that the F value will be more than doubled. If n were increased even further, the anticipated F value would similarly be increased further, as in Equation 14.2.

The goal in setting the sample size n is to produce statistical significance – a small p-value. As a rule of thumb for the F distribution, F above 4 or 5 will give a small p-value when the sample size n is around 25 or bigger. (More precisely, when the degrees of freedom of the denominator is around 25 or bigger.)

But targeting such an F value would be a mistake. The reason is that F itself is subject to sampling fluctuations. If your sample size is just large enough to produce a p-value of 0.05 when you hit the target F, then there's roughly a 50% probability that sampling fluctuations will produce an F that's too small and consequently a p-value that's too large.

To avoid such a mishap, it's appropriate to choose the sample size n large enough that sampling fluctuations are unlikely to produce an unsatisfactory p-value. Recall that the power is the probability that your study will lead to rejection of the null hypothesis when the alternative hypothesis is true. Your goal in choosing n is to produce a large power.

The calculation behind power is somewhat complicated, but the table gives an idea of the power for studies with different sample sizes n and different anticipated F.

Power for various sample sizes given the alternative hypothesis value for F.
Anticipated F n=10 n=20 n=100 n → ∞
3 17% 23% 28% 29%
5 33% 42% 49% 51%
10 64% 76% 83% 85%
20 92% 97% 99% 99%

This table was constructed for an explanatory variable with one degree of freedom. This could be a quantitative variable (for example, family income) or a categorical variable with two levels (for example, sex). As the table indicates, the anticipated F must be quite high – 10 or so – to achieve a power of 90% or higher. When the degrees of freedom in the numerator is higher, so long as n is large, high power can be achieved with considerably smaller F. For example, with five degrees of freedom in the numerator, F=5 gives roughly 90% power.
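Power can also be estimated by simulation for your own design, which is often easier than a formal calculation. A minimal sketch: power_sim is a hypothetical helper that simulates a two-level explanatory variable whose effect size is given in residual standard deviations, then counts how often the null is rejected at the 0.05 level.

  power_sim <- function(n, effect, reps = 5000) {
    mean(replicate(reps, {
      g <- rep(0:1, length.out = n)         # two-level explanatory variable
      y <- effect * g + rnorm(n)            # response under the alternative
      anova(lm(y ~ g))[1, "Pr(>F)"] < 0.05  # does this simulated study reject?
    }))
  }
  power_sim(30, 1)  # roughly 0.75 for a one-SD difference with n = 30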

If your study has low power, then failing to reject the null doesn’t give much reason to take the null as reasonable: your study is likely to be inconclusive. When you aim for an F of about 5, your power will be only about 50%. If you want a high power, say 90%, your study should be designed to target a high F.

Studies are often expensive and difficult to conduct and so you want to be careful in designing them. The table above gives only a rough idea about power. Since the power calculations are complicated, you may want to consult with a professional to do the calculation that will be appropriate for your specific study design.