Topic 3: Foundations: linear algebra, likelihood and Bayes’ rule

Math 253: Statistical Computing & Machine Learning

Daniel Kaplan

The topics in this section — linear algebra, Bayes’ rule, and likelihood — underlie many of the machine-learning techniques we will be studying later in the course. Bayes’ rule is a way to flip conditional probabilities. Among other things it allows you to interpret data in the light of previous knowledge and belief (which to be fancy, we can call “theory”). Likelihood is a unifying principle for using data to estimate model parameters and is fundamental in statistical theory. It’s also an essential part of Bayes’ rule. And linear algebra is used throughout statistics and machine learning. Among other things, it’s at work behind the motivation and calculations of regression.

Linear Algebra

The idea here is not to teach you linear algebra, but to expose you to some of the terminology and operations of linear algebra, so that when you see it again later in the course you’ll have a good start.

For linear algebra folks: projection is the Q part of the QR decomposition; R is the “solve” part.

* In economics, they write things in an older style: solve \(M b = y\) for \(b\). But \(M\) may not be square, so there is no regular inverse.
* Pre-multiply by \(M^T\) to get \(M^T M b = M^T y\).
* Invert to get \(b = (M^T M)^{-1} M^T y\). The matrix \((M^T M)^{-1} M^T\) is called the pseudo-inverse.
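
As a quick illustration (an added sketch with made-up data, not part of the original notes), the pseudo-inverse reproduces what R’s built-in least-squares routines compute:

set.seed(1)
M <- cbind(1, runif(50), runif(50))   # a 50 x 3 model matrix: not square
y <- drop(M %*% c(2, -1, 3)) + rnorm(50)
solve(t(M) %*% M) %*% t(M) %*% y      # b = (M^T M)^{-1} M^T y
qr.solve(M, y)                        # the same answer via QR
coef(lm(y ~ M - 1))                   # and via lm()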

Arithmetic of linear algebra operations

  1. Addition comes for free. Confirm this.
  2. Scalar multiplication comes for free. Confirm this.
  3. Write a function for the dot product.
vdot <- function(v, w) {
  sum(v * w)
}   
  4. Write a function for the vector length.
vlength <- function(v) {
  sqrt(vdot(v, v))
}
  5. Write a function for the cosine of the angle between two vectors.
vangle <- function(v, w, degrees = FALSE) {
   theta <- acos(vdot(v, w) / (vlength(v) * vlength(w)))
  
   if (degrees) theta * (180 / pi )
   else theta
}
  6. Write a function to project vector \(\vec{a}\) onto \(\vec{b}\). Subtracting the result from \(\vec{a}\) gives the component of \(\vec{a}\) orthogonal to \(\vec{b}\), so we can decompose \(\vec{a}\) into two components relative to \(\vec{b}\). Show that the supposedly orthogonal component really is orthogonal to \(\vec{b}\): that is, the dot product is 0.
vproject <- function(v, onto) { # the red thing
  onto * vlength(v) * cos(vangle(v, onto)) / vlength(onto)
}
vresid <- function(v, onto) {
  v - vproject(v, onto)
}
  7. Generalization: Write a function to orthogonalize a matrix M.
  8. Generalization: Write a function to calculate the projection of V onto M.
vdot <- function(a, b) {
  sum(a * b)
}
vlen <- function(a) sqrt(vdot(a, a))
vcos <- function(a, b) { 
  vdot(a, b) / (vlen(a) * vlen(b))
}
vangle <- function(a, b, degrees = FALSE) {
  res <- acos(vcos(a, b))
  if (degrees) res <- res * 180 / pi
  
  res
}
vproj <- function(a, onto) {
  vlen(a) * vcos(a, onto) * onto / vlen(onto)
}

The geometry of fitting

Precision of the coefficients

\[ \mbox{standard error of B coef.} = | \mbox{residuals} | \frac{1}{| \mbox{B} |}\ \frac{1}{\sin( \theta )}\ \frac{1}{\sqrt{n}}\ \sqrt{\frac{n}{n-m}}\]
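
Here is a minimal simulation sketch (an added example with made-up data) that checks the formula against lm() in the simplest setting: two explanatory vectors and no intercept, so \(m = 2\) and \(\theta\) is the angle between B and the other explanatory vector.

set.seed(101)
n <- 100
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)                      # correlated with x1
y <- 2 * x1 - 1 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2 - 1)
resid_length <- sqrt(sum(resid(fit)^2))        # | residuals |
theta <- acos(sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2))))
# the formula: |resid| * (1/|B|) * (1/sin(theta)) * (1/sqrt(n)) * sqrt(n/(n - m))
resid_length / (sqrt(sum(x1^2)) * sin(theta)) / sqrt(n) * sqrt(n / (n - 2))
summary(fit)$coefficients["x1", "Std. Error"]  # should agree with the line above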

Likelihood and Bayes

We accept that our models won’t produce an \(\hat{f}(x)\) that always matches \(y\). There is the irreducible error \(\epsilon\), in addition to variance and bias.

We’re using mean square error or sum of square errors as a measure of how far \(\hat{f}(x_i)\) is from the actual result \(y_i\).

Now we’re going to look at the mismatch in terms of probabilities: what is the probability of observing any particular \(y_i\), given our \(\hat{f}(x_i)\)?

Let’s quantify probability.

Summary of Day 8

We finished up our brief introduction to linear algebra and started discussing probability. I suggested the rather broad definition of a probability as a number between zero and one.

Day 9 Announcements

Make sure you’ve accepted the invitation to the discussion group.

Reading for Thursday: “What is Bayesian statistics and why everything else is wrong”

What’s a probability?


Conditional probability

The probability of an event in a given state of the world. That state of the world might have been set up by another event having occurred.

Inverting conditional probabilities

What we want is \(p(\mbox{state of world} | \mbox{observations})\). I’ll write this as \(p(\theta | {\cal O})\).

Tree with cancer (benign or malignant) and cell shape (round, elongated, ruffled)

SPACE FOR THE TREE

SEE PAPER NOTES. (remember to transcribe them here)

For example: we observe ruffled cells. What is the chance that the tumor is malignant?

Of the 10000 people in the study,

* 7000 had benign tumors, of which 10% or 700 had ruffled cells
* 3000 had malignant tumors, of whom 60% or 1800 had ruffled cells

So, of the 2500 people with ruffled cells, 1800 had malignant tumors: \(p( \theta | {\cal O} ) = 1800/2500 = 0.72\).
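
The same answer comes from Bayes’ rule directly. Here is a small added sketch of that calculation:

p_malignant <- 3000 / 10000
p_ruffled_given_malignant <- 0.60
p_ruffled_given_benign <- 0.10
p_ruffled <- p_ruffled_given_malignant * p_malignant +
  p_ruffled_given_benign * (1 - p_malignant)
p_ruffled_given_malignant * p_malignant / p_ruffled   # = 1800/2500 = 0.72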

Summary of Day 9

Announcement: Read “What is Bayesian Statistics and why everything else is wrong.”

We derived Bayes’ rule (simple!) from fundamental principles of probability. There are three components of the formula that have individual names: the prior, the likelihood, and the posterior.

Controversy! Many people believe that performing calculations using a prior is unscientific. This is because the prior reflects the views of the researcher, rather than a solid fact of reality. Nonetheless, everyone agrees that the likelihood is meaningful. Many of the estimation problems of statistics amount to finding parameters that maximize the likelihood. (People who think that using a prior is a reasonable way of doing business, like the author, point out that the model itself is a subjective choice.)

The in-class programming task involved calculating a likelihood with a model that involved the exponential probability distribution. In that task, we were not doing the full Bayesian calculation. Instead, we took as our point estimate of the parameter the argmax, that is, the parameter value that produces the highest likelihood.

Likelihood example

Consider a mechanism like that behind Figure 3.1 in ISLR.

Figure 3.1 from ISLR

There seems to be a straight-line relationship between Sales and TV. You can also see that the variance of the residuals is bigger for larger values of TV. The usual least-squares estimator is based on maximizing the likelihood of a model like \(\mbox{Sales} = a + b \mbox{TV} + \epsilon\), where \(\epsilon\) is “iid normal”. But we know the \(\epsilon\) estimated from the data won’t be iid. Our model is therefore wrong. That’s not necessarily a problem, since all models are wrong (they are models after all!) but some models are useful. So the iid model might be useful.

Let’s make some similar data on which we can demonstrate a likelihood calculation.

TV <- runif(500)
Sales <- 3.5 + 12 * TV + (3* TV + 0.75) * rnorm(length(TV))
plot(TV, Sales, ylim = c(0, 25))

Here’s a simple calculation with a “wrong” model:

lm(Sales ~ TV)
## 
## Call:
## lm(formula = Sales ~ TV)
## 
## Coefficients:
## (Intercept)           TV  
##        3.40        12.42

If we like, however, we can do the calculation with a “better” model, say, taking the following as the model behind our likelihood.

Sales_likelihood <- function(params) {
  a <- params[1]
  b <- params[2]
  c <- params[3]
  d <- params[4]

  # Negate so that the minimization routine will produce the maximum likelihood
  -sum(log(dnorm(Sales - (a + b * TV), mean = 0, sd = c + d * TV )))
}
Sales_likelihood(c(5, 10, 1, 1))
## [1] 1257.368
nlm(Sales_likelihood, c(5, 10, 1, 1))
## $minimum
## [1] 1057.491
## 
## $estimate
## [1]  3.5354724 12.1258700  0.6342547  3.0234677
## 
## $gradient
## [1] 2.070850e-05 8.419254e-06 3.251444e-05 3.429254e-05
## 
## $code
## [1] 1
## 
## $iterations
## [1] 25

Exponential probability density

What’s the time between random events, e.g. 500-year storms or earthquakes in a region that has a big one roughly every 100 years?

Earthquake warning in Southern California, late Sept. 2016

But over the last week, anxieties were particularly heightened, and the natural denial that is part of living in earthquake country was harder to pull off.

A swarm of seismic activity at the Salton Sea that began a week ago prompted scientists to say there was an elevated risk for a big San Andreas fault earthquake. By Monday [Oct 3, 2016], that risk had lessened.

But the impact of that warning was still being felt. For some, it meant checking quake safety lists. Others looked at preparing for the Big One, such as bolting bookshelves to walls, installing safety latches on kitchen cabinets and strapping down televisions.

Why has the risk gotten smaller? How much smaller?

Meanwhile, further north …

From *The Really Big One*, an article in the New Yorker about discoveries in the last few decades that established a high risk in the Pacific Northwest for an earthquake of magnitude 9.

We now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

The last paragraph …

All day long, just out of sight, the ocean rises up and collapses, spilling foamy overlapping ovals onto the shore. Eighty miles farther out, ten thousand feet below the surface of the sea, the hand of a geological clock is somewhere in its slow sweep. All across the region, seismologists are looking at their watches, wondering how long we have, and what we will do, before geological time catches up to our own.

Have students propose distributions and justify them.

California earthquake warning, reprise

The Salton Sea earthquake happens. Our prior on large \(\lambda\) immediately surges, so there is a significant probability of a quake in the next hours. But as more time goes by, that probability goes down.

Plot out the posterior for different values of D:

The area to the right of 5 days (in expected time to the next quake) is the conventional model.

Plot this out as a function of \(t = 1/\lambda\). Changing variables means multiplying the density by the Jacobian \(| d\lambda / dt | = 1/t^2 = \lambda^2\), which is what the code below does.

D <- .1
lambda <- 100/(1:5000) 
# prior: proportional to lambda: 
#  small lambda unlikely, so short time to next earthquake
prior <- function(lambda) (ifelse(lambda < .2, 25, 1/lambda))
plot(lambda, ( prior(lambda)), type = "l", xlim = c(0,5))

plot(1/lambda, exp( - lambda * D) * prior(lambda) * (lambda^2), 
     type = "l", xlab = "Expected time to the big one, days.",
     xlim = c(0, 20))
lines(1/lambda, lambda*.005/prior(lambda), col = "red")

For small D, the “urgency” part of the prior overwhelms the likelihood. As D gets bigger, we revert to the standard model.

The Price is Right!

The Price is Right is a game show in which contestants compete to guess the price of a prize. The winner is the person whose guess is closest to the actual price, considering just those contestants who guess a price less than or equal to the actual price.

Strategy:

  1. First person to guess: an honest guess, hedged on the low side.
  2. Second person: bias guess to be far from the first person’s guess.
  3. Third person:
  4. Fourth person: Zero, or just above one of the other guesses.

Play this game. Call down 4 contestants. What’s the price of this yacht?

Now suppose that, rather than playing a strategic game biased toward the last guesser, we want to evaluate political prognosticators. The winner should be the person who makes the best prediction rather than the best guess.

Game: Predict the number of electoral college votes for Donald Trump.

Game: Predict the results of the Ukrainian Parliament’s vote of no confidence in Prime Minister Arseniy Yatsenyuk. How many votes for no confidence were there? (Actual result, for comparing with your prediction: 194 out of 339.)

Play this game asking people to draw the probability distribution of their prediction.

Have the contestants keep their identity secret at first.

Draw a density on the board. Give them a vertical scale for density, insisting that each of their densities has area one.

  1. The Likelihood Game: Who won? How to evaluate the predictions?

  2. The Bayesian Game:
    • The contestants reveal their identity
    • What’s your posterior probability on each of them?

From likelihood to Bayes

Multiply likelihood by prior probability. Normalize so that total probability is 1.
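
As a small illustrative sketch (the numbers here are made up, not from class), a grid calculation makes the recipe concrete: likelihood times prior, then normalize.

theta <- seq(0.01, 0.99, by = 0.01)               # grid of candidate parameter values
prior <- dbeta(theta, 2, 2)                       # an assumed prior, for illustration
likelihood <- dbinom(7, size = 10, prob = theta)  # say we saw 7 successes in 10 tries
posterior <- likelihood * prior
posterior <- posterior / sum(posterior)           # normalize: total probability is 1
plot(theta, posterior, type = "l", xlab = "theta", ylab = "posterior probability")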

Choosing models using maximum likelihood

Straight-line model: \(y_i = a + b x_i + \epsilon_i\), with Gaussian errors \(\epsilon_i\) drawn from the density

\[f(x \; | \; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }\]

What happens when you take the log? The product of Gaussian densities turns into a sum, which is why least squares drops out; see the expression below.
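
Concretely, filling in the algebra, the log-likelihood of the data for the straight-line model is

\[\log L(a, b, \sigma) = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - (a + b x_i)\right)^2\]

so, for any fixed \(\sigma\), maximizing the likelihood over \(a\) and \(b\) is exactly minimizing the sum of square residuals.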

Question: What about minimizing the absolute value of the residuals, rather than the square? That corresponds to a two-sided exponential (Laplace) error distribution, \(\frac{\lambda}{2} \exp(-\lambda |x|)\).

Day 9 Review

Likelihood.

Example: Mean and standard deviation

data_vals <- runif(10000, min = 20, max = 50)
# data_vals <- rexp(10000, rate = 1) 
mean(data_vals)
## [1] 35.04812
median(data_vals)
## [1] 35.18501
sd(data_vals)
## [1] 8.70109
IQR(data_vals)
## [1] 14.89111
LL_gaussian <- function(params) {
  center = params[1]
  spread = params[2]
  sum(log(dnorm(data_vals, mean = center, sd = spread)))
}
optim(par = c(25, 10), LL_gaussian, control = list(fnscale = -1))
## $par
## [1] 35.05039  8.70036
## 
## $value
## [1] -35823.37
## 
## $counts
## function gradient 
##       53       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
LL_exp <- function(params) {
  center = params[1]
  spread = params[2]
  sum(log(dexp(abs(data_vals - center), rate = 1/spread)))
}
optim(par = c(25, 10), LL_exp, control = list(fnscale = -1))
## $par
## [1] 35.185767  7.513027
## 
## $value
## [1] -30166.14
## 
## $counts
## function gradient 
##       59       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Reading: What is Bayesian Statistics

Link to “What is Bayesian Statistics and why everything else is wrong.”

Go through sections 1, 2, and 4 in class: about the likelihood calculation, the p-value calculation, and the philosophical criticism of p-values.

Find the likelihood of observing \(k\) cases out of \(E\) employees with a rate of \(\theta\).

E <- 145
k <- 8
theta <- .01
L <- function(theta) dbinom(k, size = E, prob = theta)

Making the plot of likelihood versus \(\theta\)

x <- seq(0,.10, length = 100)
y <- L(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")
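
As an added aside, the maximum of this likelihood (the point estimate by maximum likelihood) can be found numerically; for a binomial likelihood it is just \(k/E\):

optimize(L, interval = c(0, 0.2), maximum = TRUE)$maximum
k / E   # the binomial MLE, about 0.055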

Emphasize the choice of what detail of the sampling model to use. Just this school in isolation?

Suppose we consider that there are 1000 schools near high-tension lines. Our school presumably has one of the highest rates, since other schools with bigger numbers would have come forward. Let’s imagine that our school is in the top 10%. This is like calculating that, of 10 schools, the 8 cancer cases we observed are the most of any of those 10. What does this give for the likelihood of theta?

nschools <- 2
Lschools <- function(theta) {
  prob_of_k <- dbinom(k, size = E, prob = theta)
  less_than_k <- pbinom(k - 0.5, size = E, prob = theta)^(nschools - 1)
  prob_of_k * less_than_k
}
x <- seq(0,.05, length = 100)
y <- Lschools(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")

Programming Basics: Conditionals

x <- rnorm(10)
table(x < 0)
## 
## FALSE  TRUE 
##     7     3
ifelse(x >= 0, sqrt(x), "bogus")
## Warning in sqrt(x): NaNs produced
##  [1] "0.894373484707928" "1.05755578985983"  "0.561229531943972"
##  [4] "0.283877332330318" "bogus"             "bogus"            
##  [7] "0.311034048939835" "0.8440657630661"   "bogus"            
## [10] "1.25796758166101"

Both ifelse() and if … else can be nested.

ifelse() examples

library(ISLR)
data(Default) # loan balances and defaults.
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
you_pay <- with(Default,
  ifelse(balance / 10  <  income, 0.10 * balance, 0.05 * income))
sum(you_pay)
## [1] 835374.9

Exercise: Determine the annual payment amount for student loans.

if … else … examples

  1. Calculate the median of a set of values.
    • if an odd number in the set, then the middle value
    • if an even number in the set, the mean of the two values in the middle.
my_median <- function(x) {
  if (length(x) %% 2 == 1) {
    # odd length
    sort(x)[ceiling(length(x)/2)]
  } else {
    # even length
    inds <- length(x)/2 + c(0, 1)
    mean(sort(x)[inds])

  }
}
my_median(1:11)
## [1] 6
median(1:11)
## [1] 6
  1. pmax() is a function that takes the case-by-case maximum of each of two input vectors. We’re going to add some error checking.
my_pmax <- function(v1, v2) ifelse(v1 > v2, v1, v2)
my_pmax3 <- function(v1, v2, v3){
  ifelse(v1 >= v2,
         my_pmax(v1, v3),
         my_pmax(v2, v3)
  )
  }
my_pmax4 <- function(v1, v2, v3, v4) {
  ifelse(v1 >= v2, 
         my_pmax3(v1, v3, v4),
         my_pmax3(v2, v3, v4))
}
my_pmax_any <- function(...) {
  vectors <- list(...)
  if (length(vectors) == 2) my_pmax(...)
  else if (length(vectors) == 3) my_pmax3(...)
  else if (length(vectors) == 4) my_pmax4(...)
  else stop("my_pmax_any() handles only 2, 3, or 4 vectors")
}


my_pmax(rnorm(10), rnorm(10))
##  [1] -0.32131029  0.24776045 -0.02983905  0.61739202  0.80969198
##  [6]  1.16987760 -1.25000089 -0.28313383 -0.20915314 -0.22586907

Simple

Write a function that returns, case by case, the maximum across any number of input vectors.

max_in_parallel <- function(...) {
  Vecs <- list(...)
}
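
One possible way to finish this (a sketch, not the official solution) is to fold my_pmax() across however many vectors were supplied, using base R’s Reduce():

max_in_parallel <- function(...) {
  Vecs <- list(...)
  Reduce(my_pmax, Vecs)   # elementwise max, combining two vectors at a time
}
max_in_parallel(rnorm(5), rnorm(5), rnorm(5))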

Blood testing

It is known that 5% of the members of a population have disease X, which can be discovered by a blood test (that is assumed to perfectly identify both diseased and nondiseased populations). Suppose that N people are to be tested, and the cost of the test is nontrivial. The testing can be done in two ways:

  1. Everyone can be tested separately; or
  2. the blood samples of k people are pooled to be analyzed.

Assume that N = nk with n being an integer. If the test is negative, all the people tested are healthy (that is, just this one test is needed). If the test result is positive, each of the k people must be tested separately (that is, a total of k + 1 tests are needed for that group).

  1. For fixed k what is the expected number of tests needed in (B)?
  2. Find the k that will minimize the expected number of tests in (B).
  3. Using the k that minimizes the number of tests, on average how many tests does (B) save in comparison with (A)? Be sure to check your answer using an empirical simulation.
ntests <- function(p = 0.05, npools = 500, pool_size = 10) {
  # generate the number of infected people in each pool
  infected_in_pool <- rbinom(npools, prob = p, size = pool_size)
  # if one or more in a pool is infected, pool_size + 1 tests, 
  # otherwise 1 test
  tests_in_each_pool <- ifelse(infected_in_pool > 0, pool_size + 1, 1)
  # total across all pools
  sum(tests_in_each_pool)
}
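
As an added check on question 1 (using the default values above): the expected number of tests per pool of size \(k\) is \(1 + k (1 - (1-p)^k)\), which the simulation should match on average.

p <- 0.05; pool_size <- 10; npools <- 500
npools * (1 + pool_size * (1 - (1 - p)^pool_size))  # expected total for scheme (B), roughly 2500
npools * pool_size                                  # scheme (A): test everyone, 5000 tests
mean(replicate(1000, ntests(p = p, npools = npools, pool_size = pool_size)))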

Can we do this recursively to get more savings?

people <- runif(100000) < .05
ntests <- function(population) {
  if ( (! any(population)) || length(population) == 1) {
    # we're done!
    total_tests <- 1
  } else {
    # Split into two groups and test again
    split_point <- round(length(population)/2)
    group1 <- population[1:split_point]
    group2 <- population[(split_point + 1) : length(population)]
    total_tests <-
      ntests(group1) + 
      ntests(group2) + 1  
    # + 1 for the test that said we need to divide the groups
  } 
  
  total_tests
}

How many tests needed for a population of 10,000 with a prevalence of 1%?

library(mosaic)
do(10) * ntests(runif(10000) < 0.01)
##    ntests
## 1    1509
## 2    1459
## 3    1385
## 4    1379
## 5    1333
## 6    1505
## 7    1419
## 8    1239
## 9    1393
## 10   1323

Try other prevalences to find prevalence at which it’s no longer worthwhile to pool the samples.

The (hyper)-volume of the hypersphere.

A framework for the volumes: \(C_n r^n\).

TASK: Find \(C_4\), \(C_5\), … , \(C_8\).

Programming approach:

  1. Write the logic. Give explicit names to all quantities that you might want to change later.
dim <- 3  # might vary
npts <- 1000  # might vary
pts <- matrix(runif(dim * npts, min=0, max=1), ncol=dim)
dists <- rowSums( pts^2 )
(2^dim) * sum(dists <= 1) / length(dists)
## [1] 4.072
  2. Make the “might vary” quantities the arguments to a function that encapsulates the rest of the logic.
sphere_volume <- function(dim=3, npts=1000000) {
  pts <- matrix(runif(dim * npts, min=0, max=1), ncol=dim)
  dists <- rowSums( pts^2 )
  
  (2^dim) * sum(dists <= 1) / length(dists)
}
sapply(1:20, FUN = sphere_volume)
##  [1] 2.000000 3.141956 4.187528 4.928192 5.258080 5.213376 4.707328
##  [8] 4.094976 3.347968 2.600960 1.933312 1.224704 0.802816 0.507904
## [15] 0.425984 0.065536 0.262144 0.000000 0.000000 0.000000

Volume of the encompassing hyper-cube (side 2, running from \(-1\) to \(1\) in each dimension):

2 ^ (1:20)
##  [1]       2       4       8      16      32      64     128     256
##  [9]     512    1024    2048    4096    8192   16384   32768   65536
## [17]  131072  262144  524288 1048576

So the volume of the encompassed hyper-sphere shrinks toward zero as a fraction of the volume of the encompassing hyper-cube.

Theoretical formula: \(V_n(R) = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2} + 1)}R^n\)

sapply(1:20, FUN=function(n) (pi^(n/2) / gamma(n/2 + 1)) )
##  [1] 2.00000000 3.14159265 4.18879020 4.93480220 5.26378901 5.16771278
##  [7] 4.72476597 4.05871213 3.29850890 2.55016404 1.88410388 1.33526277
## [13] 0.91062875 0.59926453 0.38144328 0.23533063 0.14098111 0.08214589
## [19] 0.04662160 0.02580689

Find the surface area, \(D_n r^{n-1}\).

In-class programming activity

Explanation of draw poker