Lessons in Statistical Thinking

Author

Daniel Kaplan

Published

February 16, 2023

Detail from Paul Signac, La Corne d’Or. 1907

Preface

Note to students in Math 300

Up to now, Math 300 has been following the ModernDive textbook. For the remainder of the semester, however, we will continue with the lessons in this little book: Lessons in Statistical Thinking.

Lessons in Statistical Thinking is an update and reconsideration of the concepts and methods needed to extract information from data. Such an update is needed because the canon of traditional introductory statistics texts has long been obsolescent and fails to address the needs of the contemporary data scientist and decision-maker. That canon stems from an influential 1925 book, Ronald Fisher’s Statistical Methods for Research Workers. Research workers of that era typically ran small benchtop or field experiments with a dozen or fewer observations on each of two treatments. A first task with such small data is to rule out the possibility that calculated differences might reflect only the accidental arrangement of numbers into groups.

Perhaps emblematic of the current dissatisfaction with small-data methods is the controversy over “statistical significance.” Although situated at the core of many statistics textbooks, significance testing has little to do with the everyday meaning of “significant” as “important” or “relevant.” This article in the prestigious science journal Nature details the controversy. Figure 18.1 reproduces a cartoon from that article that puts the shortcomings of “statistical significance” in a historical context.

Figure 18.1: A cartoon published along with an article in Nature, “Retire statistical significance,” showing this once-respected idea heading to the graveyard for outdated and misleading “scientific” concepts such as phlogiston and aether.

Obsession with the mantra, “Correlation is not causation,” is another sign of the obsolescence of the traditional introduction to statistics. Around 1910, the pioneers of statistics were the first to emphasize an important innovation in scientific method: the randomized controlled trial (RCT). Adoption of the RCT in the twentieth century put several branches of science on a new footing. And many statistics instructors see “correlation is not causation” as a slogan pointing to the genuine importance of RCTs. Unfortunately, the mantra has been over-interpreted to mean the impossibility of causal knowledge without RCTs, as in the following cartoon (Figure 18.2):

Figure 18.2: XKCD’s take on correlation and causation.

Nowadays, when data are used to inform policy decisions in many areas, being statistically literate includes the need to make justifiable conclusions about causality. One approach to this was highlighted by the 2019 Nobel Prize in economics; breaking down complex issues of global poverty into smaller, more manageable questions where an RCT is feasible. Another approach, using “natural experiments” was honored by the 2021 Nobel Prize.

There is no Nobel in computer science. The equivalent is the Turing Award, which in 2011 was awarded for “fundamental contributions to … probabilistic and causal reasoning.” That such prestigious awards are being given in the last decade demonstrates how recent and how important causal reasoning is and why the fundamentals of causality ought to be part a modern introduction to statistics. Consequently, they are a major theme in these Lessons.

Statistical thinking

The work of today’s data scientists is often to discover novel connections among multiple variables and to guide decision-making. It is common for data to be available in large masses from observations rather than experiments. One common purpose is “prediction,” which might be as simple as the uses of medical screening tests or as breathtaking as machine-learning techniques of “artificial intelligence.” Another pressing need from data analysis is to understand possible causal connections between variables.

The twenty lessons that follow describe a way of thinking that is historically novel, unfamiliar to most otherwise well-educated people, and incredibly useful for making sense of the world and what data can tell us about the world.

Under construction

These links are to the working drafts in development for revisions to the first eighteen Lessons. At present, we use the first half of the ModernDive textbook to cover much of this material.

  1. Data frames
  2. Graphics basics
  3. Reading R
  4. Databases
  5. Operating on variables
  6. TBA
  7. TBA
  8. Tidy data
  9. TBA
  10. Exploratory data analysis
  11. Residuals
  12. Regression
  13. TBA
  14. Graphics for regression
  15. Categorical response variables
  16. Specialized regressions
  17. TBA
  18. Machine learning