- Table of contents
- Preface

**I Introduction**

- 1 Variation and covariation
- 2 Data and information
- 3 Data graphics
- 4 Stratification and summary
- 5 Prediction
- 6 Simulation

**II Process**

- 7 Process, priors & planning
- 8 Case study: from purpose to result
- 9 Bayes’ rule

**III Modeling frameworks**

- 10 Modeling functions
- 11 Models that learn
- 12 Confounding
- 13 Effect size
- 14 Causal networks
- 15 Sampling variation

**IV Evaluation**

- 16 Model performance
- 17 Classification error
- 18 Cross validation
- 19 Partitioning variance
- 20 Calculating confidence intervals with resampling
- 21 Small data

**V Interpretation**

- 22 DRAFT: Loss functions
- 23 False discovery

The origins of recorded history are, literally, data. Five thousand years ago, in Mesopotamia, the climate was changing. Retreating sources of irrigation water called for an organized and coordinated response, beyond the scope of isolated clans of farmers. To provide this response, a new social structure – government – was established and grew. Taxes were owed and paid, for which some kind of record was needed. Food grain had to be measured and stored, livestock counted, trades and shipments memorialized.

Writing emerged as the technological innovation to keep track of all this. We know about this today because the records were incised on soft clay tablets and baked into permanence. Then, when the records were no longer needed, they were recycled as building materials for the growing settlements and cities. Archeologists started uncovering these tablets more than 100 years ago, spending decades to decipher the meaning of the simple marks in clay.

Record-keeping today is infinitely more elaborate and refined, inscribed in computer memory and transmitted, summarized, and retabulated with high-speed electronics. In addition to commercial records of buying and selling, we record transactions of all sorts: medical visits, tests, prescriptions, images; web-browsing searches and histories; shared photographs on social media; DNA sequences and gene expression; remote sensing of detailed geography and weather; the minutiae of government and other organizations; comings and goings across borders and in the public square; instant-by-instant readings from automobiles and roads and the many other networks that compose modern life. The hoped-for beneficial use of this torrent of data is the work of today’s data scientists: people who can enable data to speak, inform decision makers and researchers, and create new services such as self-driving cars and the pinpoint delivery of fertilizer and the other inputs of sustainable agriculture.

Much of data science is concerned with the immense technological apparatus involved in recording, securing, and accessing data. But, ultimately, data is of no use until it is turned into information. There is much technology involved here, too. And there is a human side: the judgment and expertise needed to see patterns and distinguish those that are meaningful from those that are merely accidental fluctuations.

The main reason for this break with tradition is the increasing centrality of statistical reasoning in decision making in the worlds of business and governance, and the emergence of new modes of science that call for novel statistical and data-science methodology.

Those familiar with my earlier textbooks (e.g., Kaplan (2011)) will not be surprised to see that this book engages with multiple explanatory variables from the very beginning. This is completely consistent with mainstream aspirational guidelines for teaching statistics, such as the American Statistical Association’s *Guidelines for Assessment and Instruction in Statistics Education* (GAISE) report (American Statistical Association 2016), which states:

Give students experience with multivariable thinking. We live in a complex world in which the answer to a question often depends on many factors. Students will encounter such situations within their own fields of study and everyday lives. We must prepare our students to answer challenging questions that require them to investigate and explore relationships among many variables. Doing so will help them to appreciate the value of statistical thinking and methods.

Throughout this book, you’ll see how embedding statistics in data science can directly address the GAISE goals of helping students become critical consumers of statistically based results, undertake statistics as an investigative process, explore variability and the consequences and important uses of randomness, and use statistical *models* appropriately. Naturally, the book also treats statistical inference seriously, but it does so in a way that emphasizes the important shifts in orientation that should be prompted by the American Statistical Association’s call for a markedly reduced role for p-values (Wasserstein and Lazar 2016) and the gradual acceptance of the new tools developed in the “causal revolution” (see, e.g., Pearl and Mackenzie 2018).

Instructors and students who want to use computation without any start-up cost will find the StatPREP LittleApps useful. These web-based interactive apps, which I developed, use graphics identical in style and meaning to the data graphics in this book.

For the student or instructor who is starting to engage with command-based statistical computing, this book is distributed with “tutorials” that provide, with no set-up cost, a way to use R that even people with no previous computing experience find straightforward and accessible. And for those who want fully fledged use of R, all the commands introduced in the tutorials work in any of the modes of using R directly, for instance in R Markdown documents.

*Data science* is an emerging computational discipline with roots in contemporary problems. Almost always these problems involve large amounts of data collected, organized, and accessed by computers. The goals for working with such data are varied: predicting the preferences of individual people for consumer products or news feeds; examining government or business or clinical medical records to answer questions such as the efficacy of a proposed program or an intervention in public health; detecting and classifying rare events such as credit-card fraud; finding useful patterns in clouds of text or data that might help identify harmful interactions between medicines or extract meaning from thousands of documents.

*Statistics* as a discipline emerged in the decades around 1900.

Statistics as a field can and did exist without data science. The contexts to which statistics has traditionally been applied are very different from the contexts in which data science is important. Traditional statisticians had to find ways to deal with very limited amounts of data, and so the mathematics of small data became central to the self-definition of statistics. Statisticians had to help steer medicine and science away from arguments based on anecdote, and so great emphasis was placed on the techniques of random sampling and random assignment in experiments. And statisticians had to do their work in a world without electronic computers or even the idea of software. Lacking software and the machines to run it, statisticians replaced randomness with deterministic, exactly repeatable idealizations of it. These idealizations are presented as algebraic formulas, derived from mathematical stand-ins for procedures and supplemented with elaborate tables of standardized probabilities.

Today, with the ready availability of computers and sophisticated software, many aspects of statistics can often be better explained by simulating random sampling and reading off results directly, rather than through the limiting and difficult formalism of algebra. Statistics matters to the use of data at a level deeper than the algebraic mechanisms traditionally used for communication. If statistics had not already existed, data science would need to invent it. If invented today, statistical concepts would likely be communicated with algorithms rather than algebra. This book does exactly that.

An “algorithm,” despite the somewhat off-putting name, is simply a way of describing a new computation in terms of computations that we already know how to perform. Statistical algorithms are built out of surprisingly simple components: randomization, repetition and iteration, tabulation, and the construction of models of how one *response* variable is associated with one or more *explanatory* variables.
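To make this concrete, the components just listed (randomization, repetition, tabulation) are enough to construct a sampling distribution by pure simulation. The sketch below is in Python rather than the R used in the book's tutorials, and the "population" of incomes is invented for illustration:

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 10,000 simulated household incomes.
population = [random.lognormvariate(10, 0.5) for _ in range(10_000)]

def sample_mean(n):
    """Randomization: draw one random sample, then summarize it."""
    return statistics.mean(random.sample(population, n))

# Repetition: many random samples produce a sampling distribution.
trials = [sample_mean(100) for _ in range(1000)]

# Tabulation: read the center and spread of the sampling
# distribution directly from the simulated trials.
print(statistics.mean(trials), statistics.stdev(trials))
```

Reading the spread directly off the tabulated trials takes the place of the traditional algebraic shortcut (the standard error formula); the simulation and the formula describe the same thing.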

The ways in which the world of data has changed in the last 50 years go well beyond providing an algorithm-based way to describe statistics. The problems and applications of concern have changed as well. As two examples, consider one problem found in every traditional textbook and another important problem found in none of them. The traditional textbook topic of “confidence intervals” – these describe the uncertainty stemming from the process of random sampling – is undoubtedly important when the primary source of uncertainty is sampling variation. That’s the case with the small data sets of traditional statistics. In data science, data sets are often large. With the consequent diminishment of sampling variation, confidence intervals fail to capture important sources of ambiguity and uncertainty in the situations faced by data scientists.
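The diminishment of sampling variation can itself be read off from a simulation. In this invented sketch (Python here; the book's tutorials use R), the width of a 95% interval for a sample mean is found by repeated sampling at two sample sizes:

```python
import random
import statistics

random.seed(2)

def simulated_ci_width(n, trials=1000):
    """Width of a 95% interval for the sample mean, by simulation:
    draw many samples of size n, record each mean, and read off the
    middle 95% of the resulting sampling distribution."""
    means = sorted(
        statistics.mean(random.gauss(50, 10) for _ in range(n))
        for _ in range(trials)
    )
    return means[int(0.975 * trials)] - means[int(0.025 * trials)]

small = simulated_ci_width(25)    # small-data setting
big = simulated_ci_width(2500)    # 100x more cases

# With 100x the data, the interval is roughly 10x narrower.
print(f"n=25: {small:.2f}   n=2500: {big:.2f}")
```

With enough data the interval becomes vanishingly narrow, yet any bias in how the data were collected remains untouched, which is exactly the uncertainty the interval fails to capture.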

The second example concerns a topic not found in traditional texts: causality. To be more precise, traditional texts tie causality exclusively to the results of randomized experiments. Such experiments are indeed a marvelous way of addressing causality … when you can do them. But experiments are often impractical, unethical, or impossible in the time available. Traditional texts insist that “correlation is not causation” and scare students with “lurking variables” against which the texts provide no defense. Tradition disparages so-called “observational data,” that is, data that are not the outcome of a randomized experiment. For the many, many situations in which observational data is all we’ve got (and all we’re going to get), recent developments in statistical methodology actually do provide methods to draw responsible inferences. But these methods rely on statistical techniques capable of untangling the web of causal influences. When described in an algebraic formalism, these techniques are incomprehensible and mysterious to the vast majority of even Ph.D.-level researchers. When described algorithmically, they make much more sense. And the advent of “machine learning” means there often is no algebraic representation at all.
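One of the simplest such techniques, stratification, can be demonstrated by simulation rather than algebra. In this invented example (Python, with made-up numbers), a lurking variable z drives both x and y, so x and y are associated even though x has no causal effect on y; comparing like with like within each level of z makes the association disappear:

```python
import random
import statistics

random.seed(3)

# A lurking variable z drives both the "treatment" x and the
# outcome y; x itself has NO causal effect on y in this simulation.
data = []
for _ in range(10_000):
    z = random.choice([0, 1])
    x = 1 if random.random() < (0.8 if z else 0.2) else 0
    y = 5 * z + random.gauss(0, 1)
    data.append((x, y, z))

def mean_y(rows):
    return statistics.mean(y for _, y, _ in rows)

# Naive comparison: x looks strongly "associated" with y ...
naive = (mean_y([r for r in data if r[0] == 1])
         - mean_y([r for r in data if r[0] == 0]))

# ... but stratifying on z, i.e. comparing like with like,
# makes the apparent effect vanish.
within_z = [
    mean_y([r for r in data if r[0] == 1 and r[2] == z])
    - mean_y([r for r in data if r[0] == 0 and r[2] == z])
    for z in (0, 1)
]
print(f"naive difference: {naive:.2f}")
print(f"within-stratum differences: {within_z}")
```

The same logic, expressed in algebraic conditioning formulas, is what many researchers find impenetrable; as a short procedure on simulated data, it is transparent.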

*Stats for Data Science* re-imagines statistics as if it were being invented today, alongside data science. It emphasizes concepts and techniques relating directly to data with multiple variables, to constructing predictive models of individual outcomes, and to making responsible inferences about causal relationships from data.

American Statistical Association. 2016. “Guidelines for Assessment and Instruction in Statistics Education.” American Statistical Association. http://www.amstat.org/asa/files/pdfs/GAISE/GaiseCollege_Full.pdf.

Kaplan, Daniel T. 2011. *Statistical Modeling: A Fresh Approach*. 2nd ed. Project Mosaic Books. https://project-mosaic-books.com.

Pearl, Judea, and Dana Mackenzie. 2018. *The Book of Why: The New Science of Cause and Effect*. Basic Books.

Wasserstein, R. L., and N. A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” *The American Statistician* 70. http://dx.doi.org/10.1080/00031305.2016.1154108.