Statistical Modeling (2e)
Daniel T Kaplan
The purpose of this book is to provide an introduction to statistics that gives readers a sufficient mastery of statistical concepts, methods, and computations to apply them to authentic systems. By “authentic,” I mean the sort of multivariable systems often encountered when working in the natural or social sciences, commerce, government, law, or any of the many contexts in which data are collected with an eye to understanding how things work or to making predictions about what will happen.
The world is uncertain and complex. We deal with the complexity and uncertainty with a variety of strategies including the scientific method and the discipline of statistics.
Statistics deals with uncertainty, quantifying it so that you can assess how reliable — how likely to be repeatable — your findings are. The scientific method deals with complexity: reduce systems to simpler components, define and measure quantities carefully, do experiments in which some conditions are held constant but others are varied systematically.
Beyond helping to quantify uncertainty and reliability, statistics provides another great insight of which most people are unaware. When dealing with systems involving multiple influences, it is possible and best to deal with those influences simultaneously. By appropriate data collection and analysis, the confusing tangle of influences can sometimes be straightened out. In other words, statistics goes hand-in-hand with the scientific method when it comes to dealing with complexity and understanding how systems work.
The statistical methods that can accomplish this are often considered advanced: multiple regression, analysis of covariance, logistic regression, among others. With appropriate software, any method is accessible in the sense of being able to produce a summary report on the computer. But a method is useful only when the user has a way to understand whether the method is appropriate for the situation, what the method is telling us about the data, and what the method is not capable of revealing. Computer scientist Richard Hamming (1915-1998) said: “The purpose of computing is insight, not numbers.” Without a solid understanding of the theory that underlies a method, the numbers generated by the computer may not give insight.
Advanced methods of statistics can give tremendous insight. For this reason, these methods need to be accessible both computationally and theoretically to the widest possible audience. Historically, access has been limited because few people have the algebraic skills needed to approach the methods in the way they are usually presented. But there are many paths to understanding and I have undertaken to find one — the “fresh approach” in the title — that takes the greatest advantage of the actual skills that most people already have in abundance.
In trying to meet that challenge, I have made many unconventional choices. Theory becomes simpler when there is a unified framework for treating many aspects of statistics, so I have chosen to present just about everything in the context of models: descriptive statistics as well as inference.
George Cobb (???) cogently describes the logic of statistical inference as the three Rs: “Randomize, Repeat, Reject.” In a decade of teaching statistics, I have found that students can understand this algorithmic logic much better than the derivations of algebraic formulas for means and standard deviations of sampling distributions.
Consequently, algebraic notation and formulas are strongly de-emphasized in this book. The traditional role that formulas have played in providing instructions for how to carry out a calculation is no longer essential for effective use of statistical methods. Software now implements the calculations. What’s needed is not a formula-based description that allows people to reproduce what computers do, but a way to understand the methods at a high level so that the rapidity and reliability of computers in performing calculations can be used to provide insight into real-world problems.
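The “Randomize, Repeat, Reject” logic mentioned above can be sketched in a few lines of R. This is only an illustrative sketch, not an example from the book: the data values, the variable names (outcome, group), the helper function diff_means, and the choice of 2000 trials are all assumptions made up for this illustration.

```r
# Made-up illustrative data: an outcome measured in two groups of five.
set.seed(1)  # make the random shuffles reproducible
outcome <- c(5.1, 4.8, 6.0, 5.5, 5.9, 6.8, 7.1, 6.4, 7.5, 6.9)
group   <- rep(c("A", "B"), each = 5)

# Hypothetical helper: difference in group means (B minus A).
diff_means <- function(y, g) mean(y[g == "B"]) - mean(y[g == "A"])

# The statistic computed on the data as actually observed.
observed <- diff_means(outcome, group)

# Randomize and Repeat: shuffle the group labels many times, recomputing
# the statistic each time. The shuffled values show what to expect if
# group membership had no connection to the outcome.
n_trials <- 2000
shuffled <- replicate(n_trials, diff_means(outcome, sample(group)))

# Reject: the fraction of shuffles at least as extreme as the observed
# value. A small fraction is evidence against "no connection."
p_value <- mean(abs(shuffled) >= abs(observed))
p_value
```

No formula for a sampling distribution is needed here; the shuffling itself generates the reference distribution against which the observed difference is judged.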
And then there is software. Traditionalists think that statistics should be taught without computers in order to help develop conceptual understanding. Others think that it is silly to ignore a technology that is universally used in practice and greatly expands our capabilities. Both points of view have merit.
The main body of this book is presented in a way that makes little or no reference to software; the statistical concepts are paramount. However, most chapters have a section on computational technique that shows how to get things done and aims to give the reader concrete skills in the analysis of data.
The software used is R, a modern, powerful and freely available system for statistical computations and graphics. The book assumes that you know nothing at all about scientific software and, accordingly, introduces R from basics. If you have experience with statistics, you probably already have a preferred software package. So long as that software will fit linear models with multiple explanatory variables and produce a more-or-less standard regression report, it can be used to follow this book. That said, I strongly encourage you to consider learning and using R. You can learn it easily by following the examples and can be doing productive statistics very quickly. Not only will you easily be able to fit models and get reports, but you can use R to explore ideas such as resampling and randomization. If you now use “educational” software, learning R will give you a professional-level tool for use in the future.
For many instructors, this book can support a nice second course in statistics, a follow-up to a conventional first introductory course. Increasingly, such a course is needed as more and more young people encounter basic statistical ideas in grade school and many of the topics of the conventional university course are absorbed into the high-school curriculum. At Macalester College, where I developed this book, mainstream students of biology, economics, political science, and so on use this book for their first statistics course. Accordingly, the book is written to be self-contained, making no assumption that readers have had any previous formal study in statistics.
This second edition of Statistical Modeling: A Fresh Approach provides me the opportunity to implement many suggestions provided by readers and instructors who used the First Edition. An early chapter now introduces simple, group-wise models. This allows the problems of confounding to be demonstrated earlier and therefore motivates the more sophisticated modeling techniques that are the central theme of the book. Another early chapter introduces confidence intervals via re-sampling.
The second edition continues to feature R, but now makes use of the mosaic package distributed through the standard R channels.
One of the most distinctive aspects of the First Edition was the use of geometry to provide a theory for statistical models. Many people find that geometrical explanations strongly support their development of an understanding of the statistical methods. Many others found the geometrical material an undesired detour. To accommodate both groups, I have broken down the basics of the geometrical material into short sections in early chapters. More extensive introductory material about vectors, subspaces, etc., as well as interpretations of the advanced modeling techniques in geometrical terms, is being published on-line on the book’s web site.
I have been fortunate to have the assistance and support of many people. Some of the colleagues who have played important roles are David Bressoud, George Cobb, Dan Flath, Tom Halverson, Gary Krueger, Weiwen Miao, Phil Poronnik, Victor Addona, Alicia Johnson, Karen Saxe, Michael Schneider, and Libby Shoop. Critical institutional support was given by Brian Rosenberg, Jan Serie, Dan Hornbach, Helen Warren, and Diane Michelfelder at Macalester and Mercedes Talley at the Keck Foundation.
I received encouragement from many in the statistics education community, including George Cobb, Joan Garfield, Dick De Veaux, Bob delMas, Julie Legler, Milo Schield, Paul Alper, Dennis Pearl, Jean Scott, Ben Hansen, Tom Short, Andy Zieffler, Sharon Lane-Getaz, Katie Makar, Michael Bulmer, Frank Shaw, and the participants in our monthly “Stat Chat” sessions. Helpful suggestions came from Simon Blomberg, Dominic Hyde, Michael Lavine, Erik Larson, Julie Dolan, and Kendrick Brown. Michael Edwards helped with proofreading. Nick Trefethen and Dave Saville provided important insights about the geometry of fitting linear models.
It’s important to recognize the role played by the developers of the R software: the “core” R team as well as the group of volunteers who have provided numerous packages that extend R’s capabilities. Hadley Wickham, in particular, developed the ggplot2 package used to create many of the graphics in this Second Edition, as well as a remarkable array of other utilities for treating data in a unified way. The design of R (and its progenitor S) is not just a matter of good software design, but of a brilliant understanding and systematization of statistics that makes the underlying logic of statistics accessible to students as well as experts. Further extending the reach of R, J.J. Allaire, Joe Chang, and Joshua Paulson have created the RStudio interface to R, which makes it much easier to teach and learn with R.
Special thanks are due to Randall Pruim and Nicholas Horton who, as mosaic activists, have improved the extensions to R used in this book and provided a wide range of suggestions that have found their way into the Second Edition.
Thanks also go to the hundred or so students at Macalester College who enrolled in the early, experimental sessions of Math 155 where many of the ideas in this book were first tested. Among those students, I want to acknowledge particular help from Alan Eisinger, Caroline Ettinger, Bernd Verst, Wes Hart, Sami Saqer, and Michael Snavely. Approximately 500 Macalester students have used the First Edition of this book, many of whom have helped identify errors and suggested clarifications and other improvements.
Crucial early support for this project was provided by a grant from the Howard Hughes Medical Institute. A Keck Foundation grant was important to the continuing refinement of the approach and the writing of this book. Google provided summer-of-code funding for my student Andrew Rich to develop interactive applets that can be used along with this book.
Finally, my thanks and love to my wife, Maya, and daughters, Tamar, Liat, and Netta, who endured the many, many hours during which I was preoccupied by some or another statistics-related enthusiasm, challenge, or difficulty.