Chapter 6 A process of statistical investigation

STILL EARLY IN DRAFT

Every statistical project is different. Here we provide an outline to guide you and to help you keep track of the important components of the process.

An individual might work on only part of the process. Still, to make informed decisions about the paths to take on your journey, you should be familiar with all the components.

6.1 Identify your goal

Prediction: Gain knowledge of a hard-to-measure variable (perhaps because it hasn’t happened yet, perhaps because it’s just hard to measure) based on variables that are more easily available. This also includes the identification of anomalies – e.g. credit card fraud, a system malfunction, … – which can be seen as a kind of prediction problem in which you assess how likely the observed situation is according to an everything-is-fine model.
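To make the everything-is-fine idea concrete, here is a minimal sketch (in Python, with made-up numbers) that scores an observation under a simple normal baseline model and flags it when the model finds it very unlikely:

```python
import math

# Hypothetical "everything-is-fine" model: routine transaction amounts are
# roughly normal with a known mean and standard deviation (invented values).
FINE_MEAN, FINE_SD = 50.0, 15.0

def fine_model_density(x):
    """Density of the observation under the everything-is-fine model."""
    z = (x - FINE_MEAN) / FINE_SD
    return math.exp(-0.5 * z * z) / (FINE_SD * math.sqrt(2 * math.pi))

def is_anomaly(x, threshold=1e-4):
    """Flag observations that the fine model considers very unlikely."""
    return fine_model_density(x) < threshold

print(is_anomaly(55))    # a typical amount
print(is_anomaly(900))   # far outside the fine model
```

The threshold, like the baseline model itself, is a modeling choice; in practice the fine model would be fit to historical data rather than assumed.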

Hypothesis formation: Much like prediction, but with the aim of suggesting relationships worth investigating further rather than forecasting a particular outcome.

Intervention & Experiment: You want to know how components of the real world are causally connected.

Decision making: You need to know the value of outcomes and evaluate the pros and cons.
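As a toy illustration of weighing the value of outcomes, here is a sketch that compares two actions by expected loss; the probability and the loss table are invented for illustration:

```python
# Hypothetical decision: treat or not, given P(disease) and assumed losses.
p_disease = 0.10

# Made-up loss table: loss[action][state].
loss = {
    "treat":    {"disease": 1.0,  "healthy": 2.0},   # side effects if healthy
    "no_treat": {"disease": 20.0, "healthy": 0.0},   # untreated disease is costly
}

def expected_loss(action, p):
    return p * loss[action]["disease"] + (1 - p) * loss[action]["healthy"]

# Pick the action with the smaller expected loss.
best = min(loss, key=lambda a: expected_loss(a, p_disease))
print(best, expected_loss(best, p_disease))
```

The point is not the particular numbers but that the prediction alone (here, P(disease)) does not determine the decision; the losses attached to each outcome matter too.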

Mosteller and Tukey (1977, p. 268) offer seven purposes of regression, or, as I would paraphrase it, seven types of questions that regression analysis may help answer. Summarized, these seven purposes are:

  1. to get a summary;
  2. to set aside the effect of a variable;
  3. as a contribution to an attempt at causal analysis;
  4. to measure the size of an effect;
  5. to try to discover a mathematical or empirical law;
  6. for prediction;
  7. to get a variable out of the way.

Source: ICOTS 2, Terry Speed. (Reference: Mosteller, Frederick & Tukey, John W. (1977). Data Analysis and Regression. Sydney: Addison-Wesley Publishing Company.)

6.2 What existing knowledge do you have?

Knowing, for instance, population demographics can help you later in the process.

6.3 Draw a causal diagram and identify confounders/covariates and sampling biases

For intervention/experiment, causal relationships are at the core.

The common claim that “correlation is not causation” is not quite right, but it can serve as a helpful reminder (like “don’t put all your eggs in one basket”) to be careful. And why be careful?

  1. Choosing a response variable does not mean that the corresponding entity in the real world is being caused by the explanatory factors. So choose carefully and approach claims skeptically. Example: does ice cream cause drowning? Certainly drowning doesn’t cause ice-cream consumption. But they may have a common cause: X <- C -> Y. Another possibility is more subtle: X -> C <- Y, where conditioning on the common effect C can induce an association between X and Y.

  2. Confounding. Diagram: X <- C -> Y, and possibly X -> Y. The problem is that X <- C -> Y makes it appear that X and Y move in tandem, even when there is no direct causal connection. Closely related is sampling bias.

Identify all plausible factors that might be involved in the relationship between X and Y. (Example: education was not used when stratifying 2016 presidential polls.)
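A quick simulation (invented data, in Python) shows how the X <- C -> Y pattern produces correlation without any direct causal link:

```python
import random

random.seed(0)

# Common cause C drives both X (say, ice-cream sales) and Y (drownings);
# there is no X -> Y arrow anywhere in the simulation.
n = 10_000
C = [random.gauss(0, 1) for _ in range(n)]            # e.g. temperature
X = [c + random.gauss(0, 1) for c in C]
Y = [c + random.gauss(0, 1) for c in C]

def corr(a, b):
    """Pearson correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    va = sum((x - ma) ** 2 for x in a) / len(a)
    vb = sum((y - mb) ** 2 for y in b) / len(b)
    return cov / (va * vb) ** 0.5

print(round(corr(X, Y), 2))  # near 0.5 despite no direct causal link
```

With these (arbitrary) variances the theoretical correlation is exactly 0.5, all of it created by the common cause C.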

6.4 Create a sampling and data-collection plan to deal with confounders/covariates

As much as possible, measure the quantities you identified as confounding/stratifying. When interpreting and communicating your results later in the process, keep in mind the ones you had no way to measure.

Additional precautions against confounding/sampling-bias

  1. Random selection within strata. Why? It provides a kind of certificate, believable by others, that your sampling wasn’t biased.

  2. Assignment. Cut off any possible relationship X <- C. You can do this by replacing it with U -> X, where U is the assignment mechanism, leaving C -> Y and possibly X -> Y. But you can fool yourself (“this patient isn’t healthy enough to be put on the experimental drug” or “this is a lost cause, so there’s no risk in trying the new drug”). So random assignment is a good way to go.

Still, keep a record of covariates.
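The contrast between confounded observational assignment and random assignment can be simulated; the numbers below are invented, with a true treatment effect of 1.0:

```python
import random

random.seed(1)

n = 10_000

# Confounder C: underlying health. Outcome improves with health and treatment.
def outcome(treated, c):
    return c + (1.0 if treated else 0.0) + random.gauss(0, 1)

# Observational: healthier patients are more likely to get the drug (X <- C).
obs = []
for _ in range(n):
    c = random.gauss(0, 1)
    treated = random.random() < (0.8 if c > 0 else 0.2)
    obs.append((treated, outcome(treated, c)))

# Randomized: a coin flip decides treatment, cutting the X <- C arrow.
rct = []
for _ in range(n):
    c = random.gauss(0, 1)
    treated = random.random() < 0.5
    rct.append((treated, outcome(treated, c)))

def naive_effect(data):
    """Difference in mean outcome between treated and untreated."""
    t = [y for d, y in data if d]
    u = [y for d, y in data if not d]
    return sum(t) / len(t) - sum(u) / len(u)

print(round(naive_effect(obs), 2))  # overstates the true effect of 1.0
print(round(naive_effect(rct), 2))  # close to the true effect of 1.0
```

The same comparison of group means is badly biased in the observational data and approximately unbiased under randomization, which is exactly why a record of covariates still matters for the first case.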

Intent-to-treat and instrumental variables approaches.

SEE MATERIALS IN 033-samples.Rmd

  1. Census (example: employment discrimination and hypothesis testing)
  2. Simple random sample
  3. Cluster sampling
  4. Case/control study (always retrospective. Refer to example about the HBC high-blood pressure study in classification error)
  5. Cohort (prospective?)
  6. Retrospective cohort study
  7. Experiment (refer to fun examples from Kahneman’s Thinking Fast and Slow)
  8. A/B testing
  9. Pathetic reaching out
    • telephone polling
    • convenience samples

To avoid unnecessary and confusing abstraction, I’ll start with a specific setting: predicting whether a person will develop diabetes. Ideally, you should have a precise definition of “develop diabetes”: what constitutes diabetes, the time horizon of the prediction, and so on. There are many ways in which appropriate data can be collected. The different ways are called study designs.

One possible study design is a prospective cohort study. A cohort is a group of subjects with some features in common and others that differ. In a cohort study, a group of subjects is identified (“assembling” the cohort), say adults in the US with no previous signs of diabetes, and relevant observations are made of existing conditions that will play the role of explanatory variables, e.g. age, sex, weight, diet, exercise, … whatever you think might be relevant. In a prospective study, the group is then followed forward over time and the eventual outcome – developing diabetes in our example – is recorded. To use the data for prediction for a new subject, compare the conditions for the new subject (age, sex, weight, etc.) to the set of observations on the cohort subjects. Pick out the members of the cohort whose original observations best resemble those of the new subject. Let’s call this the matching subset of the cohort. Then tabulate the eventual outcomes of the matching subset. This tabulation constitutes the prediction for the new subject.
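A toy sketch of the matching-and-tabulating procedure, with an invented miniature cohort (two explanatory variables and a simple distance-based notion of "best resemble"):

```python
# Toy prospective-cohort data: (age, bmi, developed_diabetes).
# All values are invented for illustration.
cohort = [
    (45, 31, True),  (47, 30, True),  (44, 29, False),
    (30, 22, False), (33, 23, False), (62, 28, True),
    (58, 27, False), (46, 32, True),  (31, 21, False),
]

def matching_subset(cohort, age, bmi, k=3):
    """The k cohort members whose baseline observations best resemble
    the new subject (here: smallest Euclidean distance in age and BMI)."""
    dist = lambda rec: ((rec[0] - age) ** 2 + (rec[1] - bmi) ** 2) ** 0.5
    return sorted(cohort, key=dist)[:k]

def predict(cohort, age, bmi, k=3):
    """Tabulate outcomes in the matching subset; the fraction of positives
    is the predicted probability for the new subject."""
    subset = matching_subset(cohort, age, bmi, k)
    return sum(outcome for _, _, outcome in subset) / k

print(predict(cohort, age=46, bmi=30))  # matches are mostly positive
print(predict(cohort, age=32, bmi=22))  # matches are mostly negative
```

What counts as "resembling" (which variables, what distance, how many matches) is a modeling choice, and with a realistic cohort it would be made with much more care than here.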

It’s helpful to distinguish a prospective cohort study from a retrospective cohort study. In a retrospective study, the cohort subjects are traced backward in time rather than followed forward. That is, the outcome is already known at the time the cohort is assembled, and the values of the explanatory variables are extracted from historical records, say, the subjects’ medical records.

THIS IS FOR DETECTION, NOT PREDICTION. Still another design is a case/control study. If the outcome being studied is rare, a cohort might have to be very large in order to include enough people with the condition to be predicted. To illustrate, suppose we believe that drinking large amounts of sugary beverages (e.g. “Big Gulps”) is a factor that can be used to predict the onset of diabetes. For instance, according to the US Centers for Disease Control and Prevention (“National Diabetes Statistics Report, 2017” 2018), in a randomly selected group of 1000 US adults without signs of diabetes, approximately 7 will develop diabetes in the next year.
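The quoted rate makes the sample-size problem easy to see with a little arithmetic (the cohort size and target case count below are arbitrary):

```python
# Using the CDC figure quoted above: about 7 of 1000 adults without signs
# of diabetes develop it within a year.
rate = 7 / 1000

cohort_size = 5000
expected_cases = rate * cohort_size
print(expected_cases)        # about 35 new cases expected in a year

# Cohort size needed to expect, say, 100 cases to analyze:
target_cases = 100
needed = target_cases / rate
print(round(needed))         # over fourteen thousand subjects
```

This is exactly the pressure that motivates a case/control design: start from the (rare) cases and sample controls, rather than assembling a huge cohort and waiting.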

Possible introduction or enrichment for need to look at multiple variables. Hill-1937a-WA-I.pdf under “Definition of Statistics” and the following “Planning” section.

6.5 Generalization

A nice set of examples of selection bias is in Hill-1937a-WA-II.pdf.

6.6 Create a model representation of your system

A model is a representation for a purpose. You want something you can easily manipulate because you are going to be doing experiments on the model to understand/interpret the real-world system better.

6.7 Evaluate technical performance

Feedback loop with (4)

6.8 Interpret and communicate

Apply loss functions.

Express risk sensibly, attribute risk (causality) responsibly.

Express your uncertainty. Standardize your results (adjustment) to help decision-makers see contrasts that are meaningful. (Example: Mexico and US death rates.)
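A minimal sketch of direct standardization, with invented two-age-group populations (not the actual Mexico/US data): the country with the lower crude death rate can nonetheless have the higher rate in every age group, because its population is younger.

```python
# Hypothetical populations; all numbers are invented.
# Each entry: age group -> (death rate per 1000, share of the population).
country_a = {"young": (1.0, 0.70), "old": (40.0, 0.30)}
country_b = {"young": (2.0, 0.40), "old": (30.0, 0.60)}

def crude_rate(pop):
    """Population-wide rate: age-specific rates weighted by the
    country's own age structure."""
    return sum(rate * share for rate, share in pop.values())

# A shared standard age structure lets decision-makers compare like with like.
standard = {"young": 0.55, "old": 0.45}

def standardized_rate(pop, standard):
    """Age-specific rates reweighted by the standard age structure."""
    return sum(pop[g][0] * standard[g] for g in standard)

print(crude_rate(country_a), crude_rate(country_b))
print(standardized_rate(country_a, standard),
      standardized_rate(country_b, standard))
```

Here the crude comparison and the standardized comparison point in opposite directions, which is the contrast that adjustment is meant to surface.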

Be attentive to false discovery.

Don’t be afraid to frame things in terms of causation, but only do so if you have handled the possibility of confounding in a responsible way.

Not p-values:

    • ASA editorial, March 2019
    • Nature article, March 2019

6.9 On a second reading

After reading the methods, go back and re-read this.

Other points …

6.10 Sensitivity

We shouldn’t rely only on sampling variation to indicate what we know about an effect size. We should look at a variety of models and model architectures, and at proxies for the measured quantities (since a variable is not necessarily what we want it to be), to get a sense of how much variation there is among models of equal plausibility.

“The Statistical Confidence Game” – why do we focus on doing the same thing over and over?

We should get similar results when using different proxies for the effect, e.g. in the SAT data use expenditures, but also teachers’ salaries and class size, and perhaps building and administrative expenses.
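A simulation sketch (invented data, not the SAT data) of the same idea: if two proxies measure the same underlying quantity, the effect estimates built from them should roughly agree.

```python
import random

random.seed(2)

# Simulated district data: a latent "resources" variable drives both proxies
# (expenditures, teachers' salaries) and the outcome (scores).
n = 500
resources = [random.gauss(0, 1) for _ in range(n)]
expenditure = [r + random.gauss(0, 0.5) for r in resources]
salary = [r + random.gauss(0, 0.5) for r in resources]
score = [2 * r + random.gauss(0, 1) for r in resources]

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# If the proxies measure the same thing, the estimates should be close.
print(round(slope(expenditure, score), 2))
print(round(slope(salary, score), 2))
```

When estimates from equally plausible proxies disagree badly, that disagreement is itself evidence about the effect, of a kind that sampling-variation intervals around any single estimate would hide.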

Motivated by Michael Lavine’s essay in The American Statistician, “Frequentist, Bayes, or Other?” https://doi.org/10.1080/00031305.2018.1459317

Also see Steven Ziliak’s article in the same issue, from which these quotes are taken:

G-7 Minimize “Real Error” with the 3 R’s: Represent, Replicate, Reproduce

A test of significance on a single set of data is nearly valueless. Fisher’s p, Student’s t, and other tests should only be used when there is actual repetition of the experiment. “One and done” is scientism, not scientific. Random error is not equal to real error, and is usually smaller and less important than the sum of nonrandom errors. Measurement error, confounding, specification error, and bias of the auspices, are frequently larger in all the testing sciences, agronomy to medicine. Guinnessometrics minimizes real error by repeating trials on stratified and balanced yet independent experimental units, controlling as much as possible for local fixed effects.

G-6 Economize With “Less Is More”: Small Samples of Independent Experiments

Small-sample analysis and distribution theory has an economic origin and foundation: changing inputs to the beer on the large scale (for Guinness, enormous global scale) is risky, with more than money at stake. But smaller samples, as Gosset showed in decades of barley and hops experimentation, does not mean “less than”, and Big Data is in any case not the solution for many problems.

References

“National Diabetes Statistics Report, 2017.” 2018. Centers for Disease Control and Prevention. https://www.cdc.gov/diabetes/pdfs/data/statistics/national-diabetes-statistics-report.pdf.