Topic 1: Introduction

Math 253: Statistical Computing & Machine Learning

Daniel Kaplan

Statistical and Machine Learning

The two terms, “statistical learning” and “machine learning,” reflect mainly the artificialities of academic disciplines.

“Data Science” is a new term that reflects the reality that both statistical and machine learning are about data. Techniques and concepts from both statistics and computer science are essential.

Example 1: Machine translation of natural languages

Computer scientists took this on.

Story from early days of machine translation:

Statistical approach:

Where did the sample of phrases come from?

Result: Google translate.

Example 2: From library catalogs to latent semantic indexing

Early days: computer systems with key words and search systems (as in library catalogs)

Now: dimension reduction (e.g. singular value decomposition), angle between specific documents and what might be called “eigen-documents”

Result: Google search

Computing technique

Each student in the class has a personal repository on GitHub. The instructor is also a contributor to this repository and can see anything in it. Complete instructions for doing this are in the appendix.

  1. Set up some communications and security systems (e.g. an RSA key)
  2. Clone your repository from GitHub. It is at an address like github.com/dtkaplan/math253-bobama.

Day 1 Programming Activity

Review of Day 1

We discussed what “machine learning” means and saw some examples of situations where machine-learning techniques have been used successfully to solve problems that had at best clumsy solutions before. (Natural language translation, catalogs of large collections of documents.)

We worked through the process of connecting RStudio to GitHub, so that you can use your personal repository for organizing, backing up, and handing in your work.

The Day-1 programming activity introduced some basic components of R: assignments, strings, vectors, etc.

Theoretical concepts ISL §2.1

“Data science” lies at the intersection of statistics and computer science.

Statistics concepts

Computing concepts

Cross fertilization

Many techniques

“Learning” is an attractive word and suggests that “machine learning” is an equivalent for what humans do. Perhaps it is to some extent …

But “modeling” is a more precise term. We will be building models of various aspects of the world based on data.

Unsupervised learning

Wait until the end of the semester.

We will be doing only supervised learning until late in the course.

Supervised learning:

Basic dichotomies in machine learning

There are fundamental trade-offs that describe the structure of learning from data. There are also trade-offs that arise between different methods of learning. Finally, there are dichotomies that stem from the different purposes for learning.

These dichotomies provide a kind of road map to tell you where you are and identify where you might want to go.

And, as always, it’s important to know why you are doing what you’re doing: your purpose.

Purposes for learning:

Dichotomies

Prediction versus mechanism

Example: Malignancy of cancer from appearance of cells. Works for guiding treatment. Does it matter why malignant cells have the appearance they do?

Story: Mid-1980s. Heart rate variability spectral analysis and Holter monitors. (Holters were cassette tape recorders set to record ECG very, very slowly. Spectral analysis breaks down the overall signal into periodic components.) Very large spike at 0.03 Hz seen in people who will soon die.

Could use for prediction, but researchers were also interested in the underlying physiological mechanism. Causal influences. We want to use observations to inform our understanding of what influences what.

Story continued: The very large spike was the “wow and flutter” in the cassette tape mechanism. This had an exact periodicity: a spike in the spectrum. If the person was sick, their heart rate was steady: they had no capacity to vary it as other conditions in the body (arterial compliance, venous tone) called for. Understanding what happens in cardiac illness is, in part, about understanding how the various control systems interact.

Flexibility versus variance

In traditional statistics, this is often tied up with the concept of “degrees of freedom.”

Not flexible:

Individual fits miss how the explanatory variables interact. ISL Figure 2.1

Flexible:

Such detailed patterns are more closely associated with physical science data than with social/economic data. ISL Figure 2.2

And in multiple variables:

Not flexible:

ISL Figure 2.4

Flexible:

ISL Figure 2.6

Black box vs interpretable models

Many learning techniques produce models that are not easily interpreted in terms of the working of the system. Examples: neural networks, random forests, etc. The role of input variables is implicit. Characterizing it requires experimenting on the model. In other learning techniques, the role of the various inputs and their interactions is explicit (e.g. model coefficients).

The reason to use a black-box model is that it can be flexible. So this tradeoff might be called “flexibility vs interpretability.”

A quick characterization of several model architectures (which they call “statistical learning methods”)

ISL Figure 2.7

Reducible versus irreducible error

How good can we make a model? How do we describe how good it is?

What does this mean? (from p. 19)

\[\begin{array}{rcl} E(Y - \hat{Y})^2 & = & E[f(X) + \epsilon - \hat{f}(X)]^2\\ & = & \underbrace{[f(X) - \hat{f}(X)]^2}_{Reducible} + \underbrace{Var(\epsilon)}_{Irreducible}\\ \end{array}\]

Notation:

Regression versus classification

Regression: quantitative response (value, probability, count, …)

Classification: categorical response with more than two categories. (When there are just two categories, regression (e.g. logistic regression) does the job.)

Supervised versus unsupervised

ISL Figure 2.8

Programming Activity 1

Using R/Markdown

Review of Day 2

Trade-offs/Dichotomies

A Classifier example

A classification setting: Blood cell counts.

Build a machine which takes a small blood sample and examines and classifies individual white blood cells.

Blood cell classification

The classification is to be based on two measured inputs, shown on the x- and y-axes.

Training data has been developed where the cell was classified “by hand.” In medicine, this is sometimes called the gold standard. The gold standard is sometimes not very accurate. Here, each cell is one dot. The color is the type of the cell: granulocytes, lymphocytes, monocytes, …

Programming Activity 2

Some basics with data

Day 3 theory: accuracy, precision, and bias

Figure 2.10

In constructing a theory, it’s good to have a system you can play with where you know exactly what is going on: e.g. a simulation.

The dark blue line in the left panel is a function the authors created for use in a simulation:

ISL Figure 2.9

The dots are data the textbook authors generated from evaluating the function at a few dozen values of \(x\) and adding noise to each result.

The difference between the dots’ vertical position and the function value is the residual, which they are calling the error. The mean square error MSE is

\[\mbox{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2\]

Looking again at the left panel in Figure 2.9, you can see three different functions that they have fitted to the data. It’s not important right now, but you might as well know what these model architectures are:

  1. Linear regression line (orange)
  2. Smoothing splines (green and light blue). A smoothing spline is a functional form with a parameter: the smoothness. The green function is less smooth than the light blue function.
  3. That smoothness measure can also be applied to the linear regression form

Each of these three functions was fitted to the data. Another word for fitted is trained. As such, we use the term training error for the difference between the data points and the fitted functions. Also, because the fitted functions are not the Platonic \(f(x)\), they are written \(\hat{f}(x)\).

For each of the functions, the training MSE is

\[\mbox{Training MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\]
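The pattern in the figure can be reproduced with a small simulation. This is a sketch, not the book's code: the function `f()`, the noise level, and the `df` values are invented for illustration, with `smooth.spline()` standing in for the smoothing splines and `lm()` for the regression line.

```r
# Simulate data from a known f(x), fit models of increasing flexibility,
# and compare their training MSE.
set.seed(1)
f <- function(x) sin(2 * x) + 0.5 * x        # a made-up "Platonic" f(x)
x <- runif(50, 0, 5)
y <- f(x) + rnorm(50, sd = 0.5)              # add irreducible noise

train_mse <- function(y, yhat) mean((y - yhat)^2)

fit_lm     <- lm(y ~ x)                      # linear regression: least flexible
fit_smooth <- smooth.spline(x, y, df = 6)    # moderately flexible spline
fit_wiggly <- smooth.spline(x, y, df = 25)   # very flexible spline

mse_lm     <- train_mse(y, fitted(fit_lm))
mse_smooth <- train_mse(y, predict(fit_smooth, x)$y)
mse_wiggly <- train_mse(y, predict(fit_wiggly, x)$y)
c(mse_lm, mse_smooth, mse_wiggly)   # most flexible fit has smallest training MSE
```

As the next paragraphs explain, the flexible model winning on training MSE is pre-ordained, not evidence that it is the best model.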

The right panel of the graph is something completely different: both axes differ from those in the left panel.

Another example: A smoother simulated \(f(x)\).

ISL Figure 2.10

What’s the “best” of these models?

When examining training MSE, the more flexible model has the smaller MSE. This answer is pre-ordained, regardless of the actual shape of the Platonic \(f(x)\).

In traditional regression, we use ANOVA or adjusted \(R^2\) to help avoid the inevitability that more complicated models will be closer to the training data. Both of those traditional methods inflate the estimate of the MSE by taking into account the “degrees of freedom,” df, of the model and how that compares to the number of cases \(n\) in the training dataset. The inflation looks like

\[ \frac{n}{n - \mbox{df}} \]

So when \(\mbox{df} \rightarrow n\), we inflate the MSE quite a lot.
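A quick numerical illustration of that inflation factor (the specific df values here are made up for the example):

```r
# The inflation factor n/(n - df) for a training set of n = 50 cases.
# As df approaches n, the factor blows up.
n  <- 50
df <- c(2, 10, 25, 45)
n / (n - df)        # 1.04, 1.25, 2, and 10: rapid growth as df nears n
```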

Another approach to this is to use testing MSE rather than training MSE. So pick the model with flexibility at the bottom of the U-shaped testing MSE curve.
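Here is a hedged sketch of that train/test approach, with invented data and `df` values; the train/test indexing trick is the same one used in the programming activities.

```r
# Split simulated data into training and testing halves, then compute
# testing MSE as flexibility (spline df) grows.
set.seed(2)
f <- function(x) sin(2 * x)                  # made-up true function
x <- runif(200, 0, 5)
y <- f(x) + rnorm(200, sd = 0.5)

in_train <- sample(200, 100)                 # index half the cases for training
x_tr <- x[in_train];  y_tr <- y[in_train]
x_te <- x[-in_train]; y_te <- y[-in_train]

test_mse <- sapply(c(2, 5, 10, 20, 40), function(df) {
  fit <- smooth.spline(x_tr, y_tr, df = df)
  mean((y_te - predict(fit, x_te)$y)^2)      # MSE on data the fit never saw
})
test_mse   # typically falls, then rises again: the U shape
```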

Why is testing MSE U-shaped?

In traditional regression, we get at the variance by using confidence intervals on parameters. The broader the confidence interval, the higher the variation from random sample to random sample. These confidence intervals come from normal theory or from bootstrapping. Bootstrapping is a simulation of the variation in model fit due to training data.

Bias decreases with higher flexibility.

Variance tends to increase with higher flexibility.

Irreducible error is constant.

ISL Figure 2.12

Measuring the variance of independent sources of variation

Simulation: Make and edit a file Day-03.Rmd.

Explore

Add three different sources of variation. The width of the individual sources is measured by the standard deviation sd=.

n <- 1000
sd( rnorm(n, sd = 3) + rnorm(n, sd = 1) + rnorm(n, sd = 2) )
## [1] 3.652297
  • Divide into small groups and
    • construct a theory about how the variation in the individual components relates to the variation in the whole.
    • test whether your theory works for other random distributions, e.g. rexp()

Result (Don’t read until you’ve drawn your own conclusions!)


The variance of the sum of independent random variables is the sum of the variances of the individual random variables.
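A quick check of that result, using a larger sample and one non-normal source (the parameters here are ours, not from the activity):

```r
# For independent random variables, variances add.
set.seed(3)
n <- 100000
x <- rnorm(n, sd = 3)         # variance 9
y <- rnorm(n, sd = 1)         # variance 1
z <- rexp(n, rate = 1/2)      # works for non-normal sources too; variance 4
var(x + y + z)                # close to 9 + 1 + 4 = 14
```

Note that it is the variances, not the standard deviations, that add: the sd of the sum is about \(\sqrt{14} \approx 3.74\), matching the simulation above.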

Equation 2.7

\[E( y - \hat{f}(x) )^2 = \mbox{Var}(\hat{f}(x)) + [\mbox{Bias}(\hat{f}(x))]^2 + \mbox{Var}(\epsilon)\]

Breaks down the total “error” into three independent sources of variation:

  1. How \(y_i\) differs from \(f(x_i)\). This is the irreducible noise: \(\epsilon\)
  2. How \(\hat{f}(x_i)\) (if fitted to the testing data) differs from \(f(x_i)\). This is the bias.
  3. How the particular \(\hat{f}(x_i)\) fitted to the training data differs from the \(\hat{f}(x_i)\) that would be the best fit to the testing data.

\[\underbrace{E( y - \hat{f}(x) )^2}_{\mbox{Total error}} = \underbrace{\mbox{Var}(\hat{f}(x))}_{\mbox{source 3.}} + \underbrace{[\mbox{Bias}(\hat{f}(x))]^2}_{\mbox{source 2.}} + \underbrace{\mbox{Var}(\epsilon)}_{\mbox{source 1.}}\]
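The variance and bias terms can be estimated by simulation: refit the same model architecture to many fresh training sets and watch how \(\hat{f}(x_0)\) varies at a single point \(x_0\). Everything here (the function, the noise level, df = 10) is invented for illustration.

```r
# Estimate the pieces of Eq. 2.7 at one point x0 by repeated refitting.
set.seed(4)
f     <- function(x) sin(2 * x)   # made-up true function
x0    <- 2.5
sigma <- 0.5

fhat_x0 <- replicate(500, {
  x <- runif(50, 0, 5)            # a fresh training set each time
  y <- f(x) + rnorm(50, sd = sigma)
  predict(smooth.spline(x, y, df = 10), x0)$y
})

pieces <- c(variance    = var(fhat_x0),                  # source 3
            bias_sq     = (mean(fhat_x0) - f(x0))^2,     # source 2
            irreducible = sigma^2)                       # source 1
pieces
```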

Programming Activity 3

Indexing on data: training and testing data sets

Review of Day 3

Start Thursday 15 Sept.

Programming Basics I

Indexing on data: training and testing data sets