Chapter 21 Samples and sampling

Data scientists often work with “big data” in the form of comprehensive collections of records. For instance, Amazon has a record of every purchase made. Cable television companies have records, for each customer, of every key press on the remote control. The term data warehouse evokes images of large-scale, long-term storage.

Statisticians tend to think of data using a different metaphor: a population rather than a warehouse. For statisticians, the objects or events for which data can be recorded are like the inhabitants of a city or a forest. You’re hardly likely to see a population of forest creatures lined up for systematic access. And unlike objects stored in warehouses, populations can grow or diminish by their own mechanisms, without the supervision of a central authority.

For a statistician, data are a sample from a population. The sample is very much the result of chance encounters between the recorder of data and the individual members of the population going about their own business. To a newcomer, this may seem haphazard, but there are advantages to thinking this way. The sample-of-a-population perspective reminds us that statistical conclusions are often intended to be applied to a larger body than the sample itself. And the perspective raises the question of whether the process of collecting data has produced a sample that is representative of the wider population.

Such considerations are not rendered irrelevant by having access to a comprehensive data warehouse. It’s misplaced confidence to believe that big data alone is a guarantee that the conclusions drawn will be germane to new or future events that are not yet in the data warehouse. To a statistician, the warehouse is merely a large sample. Some events, such as the purchase of a book from Amazon, have been included in the sample. Other events, perhaps shaped by a momentary distraction that caused a customer not to complete the purchase, are not in the warehouse even though they may be informative.

Understanding the processes that determine which events end up in the sample and which do not can provide valuable insight for interpreting the system being studied. This chapter is about some of the sampling processes that can make interpretation straightforward and others that can produce misleading results.

21.1 Simple random sampling

A conceptual process for drawing a sample that is statistically representative of a population is, misleadingly, called “simple” random sampling. Drawing a series of cards from a well-shuffled deck, repeatedly flipping a coin, or rolling a die are examples in which a random sample can be said to be simple. A deck of cards can be thought of as a “population”. In drawing cards, you have equal access to every member of the population. In a coin flip, you have similarly equal access to each possible outcome – heads or tails – and each new flip is utterly unconnected in outcome to previous or future flips.
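
These random processes are easy to mimic on a computer. Here is a minimal sketch in R (the language of the mosaicData package used later in this chapter); the set.seed() call merely makes the random draws reproducible.

```r
set.seed(100)
sample(1:52, size = 5)                          # deal 5 cards from a 52-card deck
sample(c("H", "T"), size = 10, replace = TRUE)  # ten independent coin flips
sample(1:6, size = 3, replace = TRUE)           # three rolls of a die
```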

Of course, we’re not usually interested in studying cards or coin flips. A more genuine setting might be, for instance, the population of people with diabetes, which you wish to sample to probe some hypothesis about that disease. If the people with diabetes were all seated in a vast array of numbered chairs, you could generate a simple random sample by having a computer random number generator produce a sequence of seat numbers. But that is not reality. Instead, many people with diabetes will be hard to reach, living in isolation or in remote places, perhaps not even knowing that they have diabetes.

In practice, taking a simple random sample requires organization and planning. To make the sample genuinely random, you need to have access in some way to the entire population so that you can pick any member with equal probability. For instance, if you want a sample of students at a particular university, you can get a list of all the students from the university registrar and use a computer to pick randomly from the list. Such a list of the entire set of possible cases is called a sampling frame.
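
As a sketch of how this might look in R: suppose the registrar’s list has been read into a vector named roster (a hypothetical name). Then sample() picks each student with equal probability:

```r
# Hypothetical sampling frame: one entry per enrolled student.
roster <- paste0("student", 1:18000)

set.seed(101)
chosen <- sample(roster, size = 100)  # a simple random sample of 100 students
head(chosen)
```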

In a sense, the sampling frame is the definition of the population for the purpose of drawing conclusions from the sample. For instance, the diabetes researcher might be able to locate the records of all people diagnosed with diabetes in the last year at a set of medical clinics. A random sample from that sampling frame can reasonably be assumed to represent that particular population, but not necessarily the population of all people with diabetes.

You should always be careful to define your sampling frame precisely. If you decide to sample university students by picking randomly from those who enter the front door of the library, you will get a sample that might not be typical of all university students. There’s nothing wrong with using the library students for your sample, but you need to be aware that your sample will be representative of just the library students, not necessarily all students.

When sampling at random, use formal random processes. For example, if you are sampling students who walk into the library, you can flip a coin for each arriving student to decide whether to include that student in your sample. When your sampling frame is in the form of a list, it’s wise to use a computer random number generator to select the cases to include in the sample.
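
The coin flip, too, is easy to delegate to the random number generator. In this sketch the number of arriving students is invented; each one is included with probability 1/2:

```r
set.seed(102)
arrivals <- 40                                   # hypothetical count of students entering
flips <- rbinom(arrivals, size = 1, prob = 0.5)  # 1 = "heads" = include in sample
which(flips == 1)                                # which arrivals to interview
```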

A convenience sample is one where the sampling frame is defined mainly by what is convenient for the researcher. For example, during lectures I often sample from the set of students in my class. These students – the ones who take statistics courses from me – are not necessarily representative of all university students. It might be fine to take a convenience sample in a quick, informal, preliminary study. But don’t make the mistake of assuming that the convenience sample is representative of the population. Even if you believe it yourself, how will you convince the people who are skeptical about your results?

When cases are selected in an informal way, they may not be representative of the broader population. This is called sampling bias. For example, in deciding which of the students walking into the library to interview, you might consciously or subconsciously select those who seem most approachable or who don’t seem to be in a hurry.

There are many possible sources of sampling bias. In surveys, sampling bias can come from non-response or self-selection. Perhaps some of the students you selected at random from the people entering the library declined to participate in your survey. This non-response can make your sample non-representative. Or perhaps some people you didn’t pick at random walked up to see what you were doing and asked to be surveyed themselves. Such self-selected people are often different from the people you would pick at random.

Internet surveys are notoriously susceptible to self-selection bias. The people who respond are those who know about the survey and are interested enough to take the time to complete it.

Another common sort of bias is survival bias, where the sample consists of people or objects that have gotten through some preliminary filter. For instance, a standardized test of school performance might be given at the start and end of the school year in order to quantify students’ progress. No matter how extensive the sample (all the students in the school, say), only those students who completed the school year will be included. Such students may not be representative of the larger group of students, which includes those who moved during the year, perhaps due to family dislocation.

21.2 Example: Survival bias

Consider a study of survival after the onset of dementia (Wolfson, Wolfson, and the Clinical Progression of Dementia Study Group 2001). Suppose we grab, as our sample, all the people who were in a clinic during some time interval and compute their average survival time. People who survive for many years have many chances to be at the clinic during the interval; people who die soon after onset have few. The long survivors are therefore over-represented in the sample, and the average survival calculated from them exaggerates the typical survival time.

Simulation of survival bias:
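
A minimal sketch in R, with made-up numbers: disease onsets are scattered uniformly over a century, survival after onset averages five years, and the sample consists of everyone whose illness spans a clinic visit at year 90. Long survivors are more likely to be caught by the visit, so the sample average exaggerates survival.

```r
set.seed(103)
n <- 100000
onset    <- runif(n, min = 0, max = 100)  # year of disease onset
survival <- rexp(n, rate = 1 / 5)         # years survived after onset (mean 5)

# The "clinic" sample: people whose illness spans the visit at year 90.
in_clinic <- onset <= 90 & (onset + survival) >= 90

mean(survival)             # population mean: about 5 years
mean(survival[in_clinic])  # clinic-sample mean: about twice as long
```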

ACLU study of how fast cars drive on the highway.

When do people drive on the highway: MD Attorney General data.

21.3 Regression to the mean

IN DRAFT: Treat it as an unrepresentative sampling problem. Relate it to a collider.

21.4 Longitudinal and cross-sectional samples

Data are often collected to study the links between different traits. For example, the data in Table 21.1 are a small part of a larger data set of the speeds of runners in a ten-mile race held in Washington, D.C. in 2008. The variable time gives the time from the start gun to the finish line, in seconds. Such data might be used to study the link between age and speed, for example to find out at what age people run the fastest and how much they slow down as they age beyond that.

Table 21.1: A few entries from the TenMileRace data table in the mosaicData package.

state  time   net  age  sex
MD     4948  4918   47    M
VA     3349  3346   26    M
MD     6156  5726   30    F
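
The full data set can be loaded in R. The package and table names come from the caption above; the column selection is just for display.

```r
# install.packages("mosaicData")   # if the package is not yet installed
library(mosaicData)
head(TenMileRace[, c("state", "time", "net", "age", "sex")])
```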

This sample is a cross section: a snapshot of the population that includes people of different ages. Each person is included only once.

Another type of sample is longitudinal, in which the cases are tracked over time, each person being included more than once in the data frame. Table 21.2 shows a small part of a longitudinal data set for the runners. The individual runners have been tracked from year to year, so each person shows up in multiple rows.

Table 21.2: A longitudinal data set showing runners’ times in races over several years. The ID is a unique identifier given to each runner.

ID  year  age  sex        net        gun  nruns
R1  2003   37    F  100.13333  103.30000      2
R1  2008   42    F         NA  112.36667      2
R2  2005   14    F   89.36667   94.36667      2
R2  2006   15    F   71.21667   74.83333      2
R3  2005   31    M   85.35000   86.88333      4
R3  2006   32    M   91.75000   92.53333      4

If your concern is to understand how individuals change as they age, it’s best to collect data that show such change in individuals. Using cross-sectional data to study a longitudinal problem is risky. Suppose, as seems likely, that runners who are slow tend to drop out of racing as they age, so the older runners who do participate are those who tend to be faster. This could bias your estimate of how running speed changes with age.
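
The sketch below, with invented numbers, shows the mechanism. Every simulated runner slows by exactly half a minute per year, but slower and older runners are more likely to drop out, so a cross-sectional fit understates the slowdown.

```r
set.seed(104)
n <- 50000
base_time <- rnorm(n, mean = 60, sd = 8)  # net time (minutes) at age 30
age  <- sample(30:70, n, replace = TRUE)
time <- base_time + 0.5 * (age - 30)      # every runner slows 0.5 min/year

# Dropout model (an assumption): slower and older runners are less
# likely to keep racing.
keep <- runif(n) < plogis(4 - 0.15 * (age - 30) - 0.3 * (base_time - 60))

coef(lm(time ~ age))              # full population: slope near 0.5
coef(lm(time[keep] ~ age[keep]))  # survivors only: slope well below 0.5
```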

21.5 Exercises

21.6 Probabilities and data

Experience has shown that people often over-generalize the applicability of a classifier’s output. As a trivial example, the weather-prediction classifier constructed from data for Saint Paul, Minnesota (a northern-tier state in the center of the North American continent) is not likely to be of much use in Miami, Florida (a southern coastal state). Similarly, a classifier built from data on the patients in a medical clinic can be considered applicable to people similar to those patients, but not necessarily to a broader segment of the population. We’ll return to this important matter in Chapter ??.

References

Wolfson, Christina, David B. Wolfson, and Clinical Progression of Dementia Study Group. 2001. “A Reevaluation of the Duration of Survival After the Onset of Dementia.” New England Journal of Medicine 344 (15): 1111–6.