# Chapter 1 Introduction

All models are wrong. Some models are useful. – George Box

Art is a lie that tells the truth. – Pablo Picasso

This book is about statistical modeling, two words that themselves require some definition.

“Modeling” is a process of asking questions. “Statistical” refers in part to data – the statistical models you will construct will be rooted in data. But it refers also to a distinctively modern idea: that you can measure what you don’t know and that doing so contributes to your understanding.

There is a saying, “A person with a watch knows the time. A person with two watches is never sure.” The statistical point of view is that it’s better not to be sure. With two watches you can see how they disagree with each other. This provides an idea of how precise the watches are. You don’t know the time exactly, but knowing the precision tells you something about what you don’t know. The non-statistical certainty of the person with a single watch is merely an uninformed self-confidence: the single watch provides no indication of what the person doesn’t know.

The physicist Ernest Rutherford (1871-1937) famously said, “If your experiment needs statistics, you ought to have done a better experiment.” In other words, if you can make a good enough watch, you need only one: no statistics. This is bad advice. Statistics never hurts. A person with two watches that agree perfectly not only knows the time, but has evidence that the watches are working at high precision. Sensibly, the official world time is based on an average of many atomic clocks. The individual clocks are fantastically precise; the point of averaging is to know when one or more of the clocks is drifting out of precision.
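The drift-detection idea can be sketched with a toy calculation. The clock readings below are invented for illustration; the point is only that comparing each clock to the consensus of the group reveals which one has wandered.

```python
import statistics

# Hypothetical clock readings, in seconds of offset from an arbitrary
# reference; these numbers are invented for illustration.
readings = [0.002, -0.001, 0.001, 0.000, -0.002, 0.250]

# Use the median as a consensus time: it is barely affected by one bad clock.
consensus = statistics.median(readings)

# A clock whose reading sits far from the consensus is drifting.
deviations = [abs(r - consensus) for r in readings]
drifting_clock = deviations.index(max(deviations))  # index 5, the 0.250 clock
```

The median rather than the mean serves as the consensus here precisely because a single drifting clock would pull the mean toward itself, masking its own misbehavior.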

Why “statistical modeling” and not simply “statistics” or “data analysis?” Many people imagine that data speak for themselves and that the purpose of statistics is to extract the information that the data carry. Such people see data analysis as an objective process in which the researcher should, ideally, have no influence. This can be true when very simple issues are involved; for instance, how precise is the average of the atomic clocks used to set official time or what is the difference in time between two events? But many questions are much more complicated; they involve many variables and you don’t necessarily know what is doing what to what. (Box, Hunter, and Hunter 2005)

The conclusions you reach from data depend on the specific questions you ask. Like it or not, the researcher plays an active and creative role in constructing and interrogating data. This means that the process involves some subjectivity. But this is not the same as saying anything goes. Statistical methods allow you to make objective statements about how the data answer your questions. In particular, the methods help you to know if the data show anything at all.

The word “modeling” highlights that your goals, your beliefs, and your current state of knowledge all influence your analysis of data. The core of the scientific method is the formation of hypotheses that can be tested and perhaps refuted by experiment or observation. Similarly, in statistical modeling, you examine your data to see whether they are consistent with the hypotheses that frame your understanding of the system under study.

## Example: Applying to Law School

A student is applying to law school. The schools she applies to ask for her class rank, which is based on the average of her college course grades.

A simple statistical issue concerns the precision of the grade-point average. This isn’t a question of whether the average was correctly computed or whether the grades were accurately recorded. Instead, imagine that you could send two essentially identical students to essentially identical schools. Their grade-point averages might well differ, reflecting perhaps the grading practices of their different instructors or slightly different choices of subjects or random events such as illness or mishaps or the scheduling of classes. One way to think about this is that the students’ grades are to some extent random, contingent on factors that are unknown or perhaps irrelevant to the students’ capabilities.

How do you measure the extent to which the grades are random? There is no practical way to create “identical” students and observe how their grades differ. But you can look at the variation in a single student’s grades – from class to class – and use this as an indication of the size of the random influence in each grade. From this, you can calculate the likely range of the random influences on the overall grade-point average.
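This calculation is easy to carry out. In the sketch below the grades are hypothetical; the class-to-class standard deviation estimates the random influence on each grade, and dividing by the square root of the number of grades gives the likely size of the random influence on the grade-point average (the standard error).

```python
import statistics

# Hypothetical course grades for one student on a 4-point scale.
grades = [3.3, 3.7, 4.0, 3.0, 3.7, 3.3, 4.0, 3.7]

gpa = statistics.mean(grades)
class_to_class_sd = statistics.stdev(grades)  # variation from class to class
standard_error = class_to_class_sd / len(grades) ** 0.5  # precision of the GPA
```

Notice that the standard error shrinks as more grades accumulate: the average of many grades is more precise than any single grade.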

Statistical models let you go further in interpreting grades. It’s a common belief that there are easy- and hard-grading teachers and that a grade reflects not just the student’s work but the teacher’s attitude and practices. Statistical modeling provides a way to use data on grades to see whether teachers grade differently and to correct for these differences between teachers. Doing this involves some subtlety, for example taking into account the possibility that strong students take different courses than weaker students.

## Example: Nitrogen Fixing

All plants need nitrogen to grow. Since nitrogen is the primary component of air, there is plenty around. But it’s hard for plants to get nitrogen from the air; they get it instead from the soil. Some plants, like alder and soybean, support nitrogen-fixing bacteria in nodules on the plant roots. The plant creates a hospitable environment for the bacteria; the bacteria, by fixing nitrogen in the soil, create a good environment for the plant. In a word, symbiosis.

Biologist Michael Anderson is interested in how genetic variation in the bacteria influences the success with which they fix nitrogen. One can imagine using this information to breed plants and bacteria that are more effective at fixing nitrogen and thereby reducing the need for agricultural fertilizer.

Anderson has a promising early result. His extensive field studies indicate that different genotypes of bacteria fix nitrogen at different rates. Unfortunately, the situation is confusing since the different genotypes tend to populate different areas with different amounts of soil moisture, different soil temperatures, and so on. How can he untangle the relative influences of the genotype and the other environmental factors in order to decide whether the variation in genotype is genuinely important and worth further study?

## Example: Sex Discrimination

A large trucking firm is being audited by the government to see if the firm pays wages in a discriminatory way. The audit finds wage discrepancies between men and women for “office and clerical workers” but not for other job classifications such as technicians, supervisors, sales personnel, or “skilled craftworkers.” It finds no discrepancies based on race.

A simple statistical question is whether the observed difference in average wages for men and women office and clerical workers is based on enough data to be reliable. In answering this question, it actually makes a difference what other groups the government auditors looked at when deciding to focus on sex discrimination in office and clerical workers.

Further complicating matters are the other factors that contribute to people’s wages: the kind of job they have, their skill level, their experience. Statistical models can be used to quantify these contributions and how they connect to one another. For instance, it turns out that men on average tend to have more job experience than women, and some or all of the men’s higher average wages might be due to this.

Models can help you decide whether this potential explanation is plausible. For instance, if you see that both men’s and women’s wages increase with experience in the same way, you might be more inclined to believe that job experience is a legitimate factor rather than just a mask for discrimination.

## 1.1 Models and their Purposes

Many of the toys you played with as a child are models: dolls, balsa-wood airplanes with wind-up propellers, wooden blocks, model trains. But so are many serious objects of the adult world: architectural plans, bank statements, train schedules, the results of medical diagnostic tests, the signals transmitted by a telephone, the equations of physics, the genetic sequences used by biologists. There are too many to list.

What all models have in common is this:

A model is a representation for a particular purpose.

A model might be a physical object or it might be an idea, but it always stands for something else: it’s a representation. Dolls stand for babies and animals, architectural plans stand for buildings and bridges, a white blood-cell count stands for the function of the immune system.

When you create a model, you have (or ought to have) a purpose in mind. Toys are created for the entertainment and (sometimes) edification of children. The various kinds of toys – dolls, blocks, model airplanes and trains – have a form that serves this purpose. Unlike the things they represent, the toy versions are small, safe, and inexpensive.

Models always leave things out and get some things – many things – wrong. Architectural plans are not houses; you can’t live in them. But they are easy to transport, copy, and modify. That’s the point. Telephone signals – unlike the physical sound waves that they represent – can be transported over long distances and even stored. A train schedule tells you something important but it obviously doesn’t reproduce every aspect of the trains it describes; it doesn’t carry passengers.

Statistical models revolve around data. But even so, they are first and foremost models. They are created for a purpose. The intended use of a model shapes its appropriate form and determines the sorts of data that can properly be used to build it.

There are three main uses for statistical models. They are closely related, but distinct enough to be worth enumerating.

1. Description. Sometimes you want to describe the range or typical values of a quantity. For example, what’s a “normal” white blood cell count? Sometimes you want to describe the relationship between things. Example: What’s the relationship between the price of gasoline and consumption by automobiles?
2. Classification or prediction. You often have information about some observable traits, qualities, or attributes of a system you observe and want to draw conclusions about other things that you can’t directly observe. For instance, you know a patient’s white blood-cell count and other laboratory measurements and want to diagnose the patient’s illness.
3. Anticipating the consequences of interventions. Here, you intend to do something: you are not merely an observer but an active participant in the system. For example, people involved in setting or debating public policy have to deal with questions like these: To what extent will increasing the tax on gasoline reduce consumption? To what extent will paying teachers more increase student performance?

The appropriate form of a model depends on the purpose. For example, a model that diagnoses a patient as ill based on an observation of a high number of white blood cells can be sensible and useful. But that same model could give absurd predictions about intervention: Do you really think that lowering the white blood cell count by bleeding a patient will make the patient better?

To anticipate correctly the effects of an intervention you need to get the direction of cause and effect correct in your models. But for a model used for classification or prediction, it may be unnecessary to represent causation correctly. Instead, other issues, e.g., the reliability of data, can be the most important. One of the thorniest issues in statistical modeling – with tremendous consequences for science, medicine, government, and commerce – is how you can legitimately draw conclusions about interventions from models based on data collected without performing these interventions.

## 1.2 Observation and Knowledge

How do you know what you know? How did you find it out? How can you find out what you don’t yet know? These are questions that philosophers have addressed for thousands of years. The views that they have expressed are complicated and contradictory.

From the earliest times in philosophy, there has been a difficult relationship between knowledge and observation. Sometimes philosophers see your knowledge as emerging from your observations of the world, sometimes they emphasize that the way you see the world is rooted in your innate knowledge: the things that are obvious to you.

This tension plays out on the pages of newspapers as they report the controversies of the day. Does the death penalty deter crime? Does increased screening for cancer reduce mortality?

Consider the simple, obvious argument for why severe punishment deters crime. Punishments are things that people don’t like. People avoid what they don’t like. If crime leads to punishment, then people will avoid committing crime.

Each statement in this argument seems perfectly reasonable, but none of them is particularly rooted in observations of actual and potential criminals. It’s artificial – a learned skill – to base knowledge such as “people avoid punishment” on observation. It might be that this knowledge was formed by our own experiences, but usually the only explanation you can give is something like “that’s been my experience,” or one or two anecdotes.

When observations contradict opinions – opinions are what you think you know – people often stick with their opinions. Put yourself in the place of someone who believes that the death penalty really does deter crime. You are presented with accurate data showing that when a neighboring state eliminated the death penalty, crime did not increase. So do you change your views on the matter? A skeptic can argue that it’s not just punishment but also other factors that influence the crime rate, for instance the availability of jobs. Perhaps a generally improving economic condition in the other state kept the crime rate steady even at a time when society was imposing lighter punishments.

It’s difficult to use observation to inform knowledge because relationships are complicated and involve multiple factors. It isn’t at all obvious how people can discover or demonstrate causal relationships through observation. Suppose one school district pays teachers well and another pays them poorly. You observe that the first district has better student outcomes than the second. Can you legitimately conclude that teacher pay accounts for the difference? Perhaps something else is at work: greater overall family wealth in the first district (which is what enabled them to pay teachers more), better facilities, smaller classes, and so on.

Historian Robert Hughes concisely summarized the difficulty of trying to use observation to discover causal relationships. In describing the extensive use of hanging in 18th and 19th century England, he wrote, “One cannot say whether public hanging did terrify people away from crime. Nor can anyone do so, until we can count crimes that were never committed.” (Hughes 1988) To know whether hanging did deter crime, you would need to observe a counterfactual, something that didn’t actually happen: the crimes in a world without hanging. You can’t observe counterfactuals. So you need somehow to generate observations that give you data on what happens for different levels of the causal variable.

A modern idea is the controlled experiment. In its simplest ideal form, a controlled experiment involves changing one thing – teacher pay, for example – while holding everything else constant: family wealth, facilities, etc.

The experimental approach to gaining knowledge has had great success in medicine and science. For many people, experiment is the essence of all science. But experiments are hard to perform and sometimes not possible at all. How do you hold everything else constant? Partly for this reason, you rarely see reports of experiments when you read the newspaper, unless the article happens to be about a scientific discovery.

Scientists pride themselves on recording their observations carefully and systematically. Laboratories are filled with high-precision instrumentation. The quest for precision culminates perhaps in the physicist’s fundamental quantities. For instance, the mass of the electron is reported as (9.10938215 ± 0.00000045) × 10⁻³¹ kg. The precision is about one part in twenty million.

Contrast this extreme precision with the humble speed measurements from a policeman’s radar gun (perhaps a couple of miles or kilometers per hour – one part in 50) or the weight indicated on a bathroom scale (give or take a kilogram or a couple of pounds – about one part in 100 for an adult).

All such observations and measures are the stuff of data, the records of observations. Observations do not become data by virtue of high precision or expensive instrumentation or the use of metric rather than traditional units. For many purposes, data of low precision are used. An ecologist’s count of the number of mating pairs of birds in a territory is limited by the ability to find nests. A national census of a country’s population, conducted by the government, can be precise to only a couple of percent. Physicists counting neutrinos in huge observatories buried under mountains to shield them from extraneous events wait for months for their results, and in the end the results are precise to only one part in two.

The precision that is needed in data depends on the purpose for which the data will be used. The important question for the person using the data is whether the precision, whatever it may be, is adequate for the purpose at hand. To answer this question, you need to know how to measure precision and how to compare this to a standard reflecting the needs of your task. The scientist with expensive instrumentation and the framer of social policy both need to deal with data in similar ways to understand and interpret the precision of their results.

It’s common for people to believe that conclusions drawn from data apply to certain areas – science, economics, medicine – but aren’t terribly useful in other areas. In teaching, for example, almost all decisions are based on “experience” rather than observation. Indeed, there is often strong resistance to making formal observations of student progress, which are seen as interfering with the teaching process.

This book is based on the idea that techniques for drawing valid conclusions from observations – data – are valuable for two groups of people. The first group is scientists and others who routinely need to use statistical methods to analyze experimental and other data.

The second group is everybody else. All of us need to draw conclusions from our experiences, even if we’re not in a laboratory. It’s better to learn how to do this in valid ways, and to understand the limitations of these ways, than to rely on an informal, unstated process of opinion formation. It may turn out that in any particular area of interest there are no useful data. In such situations, you won’t be able to use the techniques. But at least you will know what you’re missing. You may be inspired to figure out how to supply it or to recognize it when it does come along, and you’ll be aware of when others are misusing data.

As you will see, the manner in which the data are collected plays a central role in what sorts of conclusions can be legitimately made; data do not always speak for themselves. You will also see that strongly supported statements about causation are difficult to make. Often, all you can do is point to an “association” or a “correlation,” a weaker form of statement.

Statistics is sometimes loosely described as the “science of data.” This description is apt, particularly when it covers both the collection and analysis of data, but it does not mean much until you understand what data are. That’s the subject of the next chapter.

## 1.3 The Main Points of this Book

1. Statistics is about variation. Describing and interpreting variation is a major goal of statistics.
2. You can create empirical, mathematical descriptions not only of a single trait or variable but also of the relationships between two or more traits. Empirical means based on measurements, data, observations.
3. Models let you split variation into components: “explained” versus “unexplained.” How to measure the size of these components and how to compare them to one another is a central aspect of statistical methodology. Indeed, this provides a definition of statistics:

Statistics is the explanation of variation in the context of what remains unexplained.

4. By collecting data in ways that require care but are quite feasible, you can estimate how reliable your descriptions are, e.g., whether it’s plausible that you should see similar relationships if you collected new data. This notion of reliability is very narrow and there are some issues that depend critically on the context in which the data were collected and the correctness of assumptions that you make about how the world works.
5. Relationships between pairs of traits can be studied in isolation only in special circumstances. In general, to get valid results it is necessary to study entire systems of traits simultaneously. Failure to do so can easily lead to conclusions that are grossly misleading.
6. Descriptions of relationships are often subjective – they depend on choices that you, the modeler, make. These choices are generally rooted in your own beliefs about how the world works, or the theories accepted as plausible within some community of inquiry.
7. If data are collected properly, you can get an indication of whether the data are consistent or inconsistent with your subjective beliefs or – and this is important – whether you don’t have enough data to tell either way.
8. Models can be used to explore the sensitivity of your conclusions to different beliefs. People who disagree in their views of how the world works may not be able to reconcile their differences based on data, but they will be able to decide objectively whether their own or the other party’s beliefs are reasonable given the data.
9. Notwithstanding everything said above about the strong link between your prior, subjective beliefs and the conclusions you draw from data, by collecting data in a certain context – experiments – you can dramatically simplify the interpretation of the results. It’s actually possible to remove the dependence on identified subjective beliefs by intervening in the system under study experimentally.
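Point 3's split of variation into explained and unexplained components can be illustrated with a small calculation. The wage numbers below are invented; the identity that the two components sum to the total variation is the general fact.

```python
import statistics

# Hypothetical wages for workers in two groups (invented numbers).
groups = {
    "group A": [10.0, 12.0, 11.0, 13.0],
    "group B": [14.0, 16.0, 15.0, 17.0],
}

all_wages = [w for wages in groups.values() for w in wages]
grand_mean = statistics.mean(all_wages)

# Total variation: squared deviations from the grand mean.
total = sum((w - grand_mean) ** 2 for w in all_wages)

# "Explained" variation: how far the group means sit from the grand mean.
explained = sum(
    len(wages) * (statistics.mean(wages) - grand_mean) ** 2
    for wages in groups.values()
)

# "Unexplained" variation: the spread remaining within each group.
unexplained = sum(
    (w - statistics.mean(wages)) ** 2
    for wages in groups.values()
    for w in wages
)

# The two components always add up to the total variation.
```

Comparing the size of the explained component to the unexplained one is the heart of much statistical methodology, a theme that later chapters develop in detail.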

This book takes a different approach than most statistics texts. Many people want statistics to be presented as a kind of automatic, algorithmic way to process data. People look for mathematical certainty in their conclusions. After all, there are right-or-wrong answers to the mathematical calculations that people (or computers) perform in statistics. Why shouldn’t there be right-or-wrong answers to the conclusions that people draw about the world?

The answer is that there can be, but only when you are dealing with narrow circumstances that may not apply to the situations you want to study. An insistence on certainty and provable correctness often results in irrelevancy.

The point of view taken in this book is that it is better to be useful than to be provably certain. The objective is to introduce methods and ideas that can help you deal with drawing conclusions about the real world from data. The methods and ideas are meant to guide your reasoning; even if the conclusions you draw are not guaranteed by proof to be correct, they can still be more useful than the alternative, which is the conclusions that you draw without data, or the conclusions you draw from simplistic methods that don’t honor the complexity of the real system.