The two terms, “statistical learning” and “machine learning,” reflect mainly the artificialities of academic disciplines.
“Data Science” is a new term that reflects the reality that both statistical learning and machine learning are about data. Techniques and concepts from both statistics and computer science are essential.
Computer scientists took this on.
Story from early days of machine translation:
Statistical approach:
Where did the sample of phrases come from?
Result: Google translate.
Early days: computer systems with key words and search systems (as in library catalogs)
Now: dimension reduction (e.g. singular value decomposition), angle between specific documents and what might be called “eigen-documents”
Result: Google search
Each student in the class has a personal repository on GitHub. The instructor is also a contributor to this repository and can see anything in it. Complete instructions for doing this are in the appendix.
github.com/dtkaplan/math253-bobama
We discussed what “machine learning” means and saw some examples of situations where machine-learning techniques have been used successfully to solve problems that had at best clumsy solutions before. (Natural language translation, catalogs of large collections of documents.)
We worked through the process of connecting RStudio to GitHub, so that you can use your personal repository for organizing, backing up, and handing in your work.
The Day-1 programming activity introduced some basic components of R: assignments, strings, vectors, etc.
“Data science” lies at the intersection of statistics and computer science.
“Learning” is an attractive word and suggests that “machine learning” is an equivalent for what humans do. Perhaps it is to some extent …
But “modeling” is a more precise term. We will be building models of various aspects of the world based on data.
Wait until the end of the semester.
We will be doing only supervised learning until late in the course.
There are fundamental trade-offs that describe the structure of learning from data. There are also trade-offs that arise between different methods of learning. Finally, there are dichotomies that stem from the different purposes for learning.
These dichotomies provide a kind of road map to tell you where you are and identify where you might want to go.
And, as always, it’s important to know why you are doing what you’re doing: your purpose.
Example: Malignancy of cancer from appearance of cells. Works for guiding treatment. Does it matter why malignant cells have the appearance they do?
Story: Mid-1980s. Heart rate variability spectral analysis and Holter monitors. (Holters were cassette tape recorders set to record ECG very, very slowly. Spectral analysis breaks down the overall signal into periodic components.) Very large spike at 0.03 Hz seen in people who will soon die.
Could use for prediction, but researchers were also interested in the underlying physiological mechanism. Causal influences. We want to use observations to inform our understanding of what influences what.
Story continued: The very large spike was the “wow and flutter” in the cassette tape mechanism. This had an exact periodicity: a spike in the spectrum. If the person was sick, their heart rate was steady: they had no capacity to vary it as other conditions in the body (arterial compliance, venous tone) called for. Understanding what happens in cardiac illness is, in part, about understanding how the various control systems interact.
In traditional statistics, this is often tied up with the concept of “degrees of freedom.”
Not flexible:
Individual fits miss how the explanatory variables interact. ISL Figure 2.1
Flexible:
Such detailed patterns are more closely associated with physical science data than with social/economic data. ISL Figure 2.2
And in multiple variables:
Not flexible:
ISL Figure 2.4
Flexible:
ISL Figure 2.6
Many learning techniques produce models that are not easily interpreted in terms of the working of the system. Examples: neural networks, random forests, etc. The role of input variables is implicit. Characterizing it requires experimenting on the model. In other learning techniques, the role of the various inputs and their interactions is explicit (e.g. model coefficients).
The reason to use a black-box model is that it can be flexible. So this tradeoff might be called “flexibility vs interpretability.”
A quick characterization of several model architectures (which they call “statistical learning methods”)
ISL Figure 2.7
How good can we make a model? How do we describe how good it is?
What does this mean? (from p. 19)
\[\begin{array}{rcl} E(Y - \hat{Y})^2 & = & E[f(X) + \epsilon - \hat{f}(X)]^2\\ & = & \underbrace{[f(X) - \hat{f}(X)]^2}_{Reducible} + \underbrace{Var(\epsilon)}_{Irreducible}\\ \end{array}\]
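A small R simulation can make the decomposition concrete. Here both `f` and `f_hat` are hypothetical choices invented for illustration; `f_hat` is given a constant bias of 0.3 so the reducible part is known exactly.

```r
# Hypothetical illustration: a known f(x) and a deliberately imperfect f_hat(x)
f     <- function(x) sin(2 * x)        # the "Platonic" f(X)
f_hat <- function(x) sin(2 * x) + 0.3  # an estimate with a constant bias
set.seed(1)
x <- runif(10000)
y <- f(x) + rnorm(10000, sd = 0.5)     # irreducible: Var(epsilon) = 0.25
mean((y - f_hat(x))^2)                 # total error: about 0.09 + 0.25
mean((f(x) - f_hat(x))^2)              # reducible part: exactly 0.3^2 = 0.09
```

No amount of improvement in `f_hat` can get the total error below the 0.25 contributed by the noise.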
Notation:
Regression: quantitative response (value, probability, count, …)
Classification: categorical response with more than two categories. (When there are just two categories, regression (e.g. logistic regression) does the job.)
ISL Figure 2.8
Accuracy (flexibility) vs interpretability
We always want models to be accurate. Whether we need to be able to interpret the model depends on our overall purpose.
Reducible error vs irreducible error
It’s good to know how accurate our models can get. That gives a goal for trying out different types of models to know when we don’t need to keep searching.
A classification setting: Blood cell counts.
Build a machine which takes a small blood sample and examines and classifies individual white blood cells.
Blood cell classification
The classification is to be based on two measured inputs, shown on the x- and y-axes.
Training data has been developed where the cell was classified “by hand.” In medicine, this is sometimes called the gold standard. The gold standard is sometimes not very accurate. Here, each cell is one dot. The color is the type of the cell: granulocytes, lymphocytes, monocytes, …
In constructing a theory, it’s good to have a system you can play with where you know exactly what is going on: e.g. a simulation.
The dark blue line in the left panel is a function the authors created for use in a simulation:
ISL Figure 2.9
The dots are data the textbook authors generated from evaluating the function at a few dozen values of \(x\) and adding noise to each result.
The difference between the dots’ vertical position and the function value is the residual, which they are calling the error. The mean square error MSE is
\[\mbox{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2\]
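In R the MSE is a one-liner; the helper name here is just for illustration.

```r
# Mean square error, following the formula above
mse <- function(y, f_of_x) mean((y - f_of_x)^2)
```

For a model fitted with `lm()`, `mse(y, predict(fit))` computes it against the fitted values.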
Looking again at the left panel in Figure 2.9, you can see three different functions that they have fitted to the data. It’s not important right now, but you might as well know what these model architectures are:
Each of these three functions was fitted to the data. Another word for fitted is trained. As such, we use the term training error for the difference between the data points and the fitted functions. Also, because the functions are not the Platonic \(f(x)\), they are written \(\hat{f}(x)\).
For each of the functions, the training MSE is
\[\mbox{Training MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\]
The right panel of the graph shows something completely different: both axes differ from those in the left panel.
ISL Figure 2.10
When examining training MSE, the more flexible model has the smaller MSE. This answer is pre-ordained, regardless of the actual shape of the Platonic \(f(x)\).
In traditional regression, we use ANOVA or adjusted \(R^2\) to help avoid this inevitability that more complicated models will be closer to the training data. Both of those traditional methods inflate the estimate of the MSE by taking into account the “degrees of freedom,” df, in the model and how that compares to the number of cases \(n\) in the training dataset. The inflation looks like
\[ \frac{n}{n - \mbox{df}} \]
So when \(\mbox{df} \rightarrow n\), we inflate the MSE quite a lot.
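A quick numerical check of the inflation factor, with \(n\) and df values chosen purely for illustration:

```r
# Inflation factor n / (n - df) for a few illustrative df values
n  <- 50
df <- c(2, 10, 25, 45)
n / (n - df)    # about 1.04, 1.25, 2, 10: explodes as df approaches n
```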
Another approach to this is to use testing MSE rather than training MSE. So pick the model with flexibility at the bottom of the U-shaped testing MSE curve.
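A sketch of the train/test idea in R, with simulated data and a hold-out split (all specific choices here are illustrative, not from the text):

```r
# Training vs testing MSE with a hold-out split
set.seed(101)
dat <- data.frame(x = runif(200))
dat$y <- sin(2 * dat$x) + rnorm(200, sd = 0.3)
train <- sample(200, 100)                         # half the cases for training
fit <- lm(y ~ poly(x, 5), data = dat[train, ])    # a fairly flexible model
mean((dat$y[train] - predict(fit))^2)             # training MSE
mean((dat$y[-train] -
        predict(fit, newdata = dat[-train, ]))^2) # testing MSE
```

Repeating this for models of increasing flexibility traces out the U-shaped testing MSE curve.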
In traditional regression, we get at the variance by using confidence intervals on parameters. The broader the confidence interval, the higher the variation from random sample to random sample. These confidence intervals come from normal theory or from bootstrapping. Bootstrapping is a simulation of the variation in model fit due to training data.
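A minimal bootstrap sketch on simulated data, showing how resampling the training cases reveals fit-to-fit variation in a coefficient (names and numbers are illustrative):

```r
# Bootstrapping the slope of a simple regression
set.seed(1)
n <- 100
x <- runif(n); y <- 2 * x + rnorm(n)
slopes <- replicate(500, {
  i <- sample(n, replace = TRUE)      # resample cases with replacement
  coef(lm(y[i] ~ x[i]))[2]            # refit and keep the slope
})
quantile(slopes, c(0.025, 0.975))     # a 95% bootstrap interval for the slope
```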
Bias decreases with higher flexibility.
Variance tends to increase with higher flexibility.
Irreducible error is constant.
ISL Figure 2.12
Simulation: Make and edit a file Day-03.Rmd
Add three different sources of variation. The width of the individual sources is measured by the standard deviation sd=
n <- 1000
sd( rnorm(n, sd = 3) + rnorm(n, sd = 1) + rnorm(n, sd = 2) )
## [1] 3.652297
rexp()
The variance of the sum of independent random variables is the sum of the variances of the individual random variables.
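Applying this rule to the simulation above gives the theoretical standard deviation of the sum, which the simulated value approximates:

```r
sqrt(3^2 + 1^2 + 2^2)   # about 3.742, close to the simulated 3.652
```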
\[E( y - \hat{f}(x) )^2 = \mbox{Var}(\hat{f}(x)) + [\mbox{Bias}(\hat{f}(x))]^2 + \mbox{Var}(\epsilon)\]
Breaks down the total “error” into three independent sources of variation:
\[\underbrace{E( y - \hat{f}(x) )^2}_{\mbox{Total error}} = \underbrace{\mbox{Var}(\hat{f}(x))}_{\mbox{source 3.}} + \underbrace{[\mbox{Bias}(\hat{f}(x))]^2}_{\mbox{source 2.}} + \underbrace{\mbox{Var}(\epsilon)}_{\mbox{source 1.}}\]
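The three sources can be estimated by simulation at a single point \(x_0\), repeatedly regenerating training data and refitting. Everything here (the true `f`, the sample size, the noise level, the choice of a straight-line model) is a hypothetical setup for illustration:

```r
# Simulating the three sources of error at one point x0
set.seed(2)
f  <- function(x) sin(2 * x)
x0 <- 0.5
preds <- replicate(2000, {
  x <- runif(50); y <- f(x) + rnorm(50, sd = 0.3)
  fit <- lm(y ~ x)                  # a rigid model: low variance, some bias
  predict(fit, newdata = data.frame(x = x0))
})
var(preds)                # source 3: Var(f_hat(x0)) across training sets
(mean(preds) - f(x0))^2   # source 2: squared bias at x0
0.3^2                     # source 1: Var(epsilon), the irreducible part
```

The three quantities sum (approximately) to the expected squared error \(E(y - \hat{f}(x_0))^2\).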