Topic 3 Foundations: linear algebra, likelihood and Bayes’ rule

The topics in this section — linear algebra, Bayes’ rule, and likelihood — underlie many of the machine-learning techniques we will be studying later in the course. Bayes’ rule is a way to flip conditional probabilities. Among other things it allows you to interpret data in the light of previous knowledge and belief (which to be fancy, we can call “theory”). Likelihood is a unifying principle for using data to estimate model parameters and is fundamental in statistical theory. It’s also an essential part of Bayes’ rule. And linear algebra is used throughout statistics and machine learning. Among other things, it’s at work behind the motivation and calculations of regression.

3.1 Linear Algebra

The idea here is not to teach you linear algebra, but to expose you to some of the terminology and operations of linear algebra, so that when you see it again later in the course you’ll have a good start.

  • A vector — a column of numbers. The dimension is the count of numbers in the vector.
  • A space: the set of all possible vectors of a given dimension.
  • Scalar multiplication
  • Vector addition: walk the first vector, then the second.
  • Linear combination: do scalar multiplication on each vector, then add.
  • A matrix — a collection of vectors (all the same dimension).
  • Dot product: a basic calculation on vectors, which gives:
    • the length (via Pythagoras)
    • the angle between two vectors
    • orthogonality: when two vectors are perpendicular, their dot product is zero.
  • Matrix operation: Linear combination.
    • Take a linear combination of the vectors in a matrix. Analogous to taking a trip. Result: a vector representing the end-point of the trip.
  • The subspace spanned by the matrix: the set of all possible points you can get to with a linear combination.
  • Matrix operation: Orthogonalization — Find perpendicular vectors that span the same subspace as a matrix. Example: draw the picture for two vectors \(\vec{a}\) and \(\vec{b}\).
  • Matrix operation: Projection
    • Given a matrix M and a vector V, find the closest point in the subspace of M to the vector V. How? Orthogonalize matrix M, then for each vector in the orthogonalized M find the component of \(V\) aligned with that vector; the sum of those components is the projection. (Subtracting the projection from \(V\) leaves the part of \(V\) orthogonal to the subspace.)
  • Matrix operation: inversion — the inverse operation to linear combination.
    • given an end-point in the space spanned by M, figure out a linear combination that will get you there.
  • Vector to vector operation: Outer product. col vector \(\times\) row vector.
    • Can generalize to operations other than \(\times\).

For linear algebra folks: Projection is the Q part of the QR decomposition; R is the solve part.

  • In economics, things are written in an older style: solve \(M b = y\) for \(b\). But \(M\) may not be square, so there is no regular inverse.
  • Pre-multiply by \(M^T\) to get \(M^T M b = M^T y\).
  • Invert to get \(b = (M^T M)^{-1} M^T y\). The matrix \((M^T M)^{-1} M^T\) is called the pseudo-inverse.
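
For concreteness, here is a minimal R sketch of that calculation; the matrix M and the target y are made-up values, not from the notes.

M <- cbind(c(1, 0, 1), c(1, 2, 0))   # two column vectors in three-dimensional space
y <- c(3, 2, 2)                      # a target point, generally not in the span of M
# The pseudo-inverse solution b = (M^T M)^{-1} M^T y
b <- solve(t(M) %*% M) %*% t(M) %*% y
b
# The same answer via the QR route (orthogonalize, then solve)
qr.solve(M, y)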

3.2 Arithmetic of linear algebra operations

  1. Addition comes for free. Confirm this.
  2. Scalar multiplication comes for free. Confirm this.
  3. Write a function for the dot product.
vdot <- function(v, w) {
  sum(v * w)
}   
  4. Write a function for the vector length.
vlength <- function(v) {
  sqrt(vdot(v, v))
}
  5. Write a function for the cosine of the angle between two vectors.
vangle <- function(v, w, degrees = FALSE) {
   theta <- acos(vdot(v, w) / (vlength(v) * vlength(w)))
  
   if (degrees) theta * (180 / pi )
   else theta
}
  6. Write a function to project vector \(\vec{a}\) onto \(\vec{b}\). Subtracting the result from \(\vec{a}\) will give the component of \(\vec{a}\) orthogonal to \(\vec{b}\). So we can decompose \(\vec{a}\) into two components relative to \(\vec{b}\). Show that the supposedly orthogonal component is really orthogonal to \(\vec{b}\) — that is, the dot product is 0.
vproject <- function(v, onto) { # the red thing
  onto * vlength(v) * cos(vangle(v, onto)) / vlength(onto)
}
vresid <- function(v, onto) {
  v - vproject(v, onto)
}
  7. Generalization: Write a function to orthogonalize a matrix M.
  8. Generalization: Write a function to calculate the projection of V onto M. (Both are sketched after the code below.)
vdot <- function(a, b) {
  sum(a * b)
}
vlen <- function(a) sqrt(vdot(a, a))
vcos <- function(a, b) { 
  vdot(a, b) / (vlen(a) * vlen(b))
}
vangle <- function(a, b, degrees = FALSE) {
  res <- acos(vcos(a, b))   # angle in radians
  if (degrees) res <- res * 180 / pi
  
  res
}
vproj <- function(a, onto) {
  vlen(a) * vcos(a, onto) * onto / vlen(onto)
}
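
The two generalizations at the end of the list above are not worked out in these notes. Here is one possible sketch, a plain Gram-Schmidt pass built on the vproject() and vresid() functions defined earlier; it does no checking for zero-length or linearly dependent columns.

orthogonalize <- function(M) {
  Q <- M
  for (j in seq_len(ncol(M))) {
    v <- M[, j]
    if (j > 1) {
      # remove the parts of this column aligned with the earlier,
      # already-orthogonalized columns
      for (k in 1:(j - 1)) v <- vresid(v, Q[, k])
    }
    Q[, j] <- v
  }
  Q
}

project_onto <- function(V, M) {
  # Projection of V onto the subspace spanned by the columns of M:
  # add up the components of V along each orthogonalized column.
  Q <- orthogonalize(M)
  result <- rep(0, length(V))
  for (j in seq_len(ncol(Q))) result <- result + vproject(V, Q[, j])
  result
}

M <- cbind(c(1, 0, 1), c(1, 2, 0))
V <- c(3, 2, 2)
project_onto(V, M)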

3.3 The geometry of fitting

  • Data tables: cases and variables.
  • Case space (the rows of the matrix) and variable space (the columns).
  • A quantitative variable is a vector.
  • A categorical variable can be encoded as a set of “dummy” vectors.
  • Response variable and explanatory variable
  • The linear projection problem: find the point in the subspace spanned by the explanatory variables that is closest to the response. That linear combination is the best-fitting model. (See the sketch after this list.)
    • One explanatory and the response
    • Two explanatory on board and the response on the board (perfect, but meaningless fit)
    • Two explanatory in three-space and the response (residual likely)
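
A small R sketch of this geometry (the data are made up): the fitted values from lm() are exactly the projection of the response onto the subspace spanned by the explanatory columns (plus the intercept), so the residual vector is orthogonal to each explanatory vector.

x1 <- c(1, 2, 3, 4, 5)
x2 <- c(2, 1, 4, 3, 6)
y  <- c(1, 3, 2, 5, 4)

fit  <- lm(y ~ x1 + x2)
yhat <- fitted(fit)    # the closest point in the subspace: the projection
res  <- y - yhat       # the part of y orthogonal to the subspace

# Orthogonality check: these dot products are zero (up to round-off)
sum(res * x1)
sum(res * x2)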

3.4 Precision of the coefficients

\[ \mbox{standard error of B coef.} = | \mbox{residuals} | \frac{1}{| \mbox{B} |}\ \frac{1}{\sin( \theta )}\ \frac{1}{\sqrt{n}}\ \sqrt{\frac{n}{n-m}}\]

  • \(m\) — degrees of freedom in model
  • \(\theta\) — angle between this model vector and the space spanned by the others
  • B — this model vector
  • residuals — the residual vector

3.5 Likelihood and Bayes

We accept that our models won’t produce an \(\hat{f}(x)\) that always matches \(y\). There is the irreducible error \(\epsilon\), in addition to variance and bias.

  • Variance: a measure of how far off our \(\hat{f}()\) is from the one we would have been able to construct with an infinite amount of data: \(\hat{f}_\infty()\).
  • Bias: a measure of how far off \(\hat{f}_\infty()\) is from \(f()\).

We’re using the mean squared error or the sum of squared errors as a measure of how far \(\hat{f}(x_i)\) is from the actual result \(y_i\).

Now we’re going to look at the difference in terms of probabilities: what would be the probability of the observed \(y_i\) given our \(\hat{f}(x_i)\)?

Let’s quantify probability.

3.6 Summary of Day 8

We finished up our brief introduction to linear algebra and started discussing probability. I suggested the rather broad definition of a probability as a number between zero and one.

3.7 Day 9 Announcements

Make sure you’ve accepted the invitation to the discussion group.

Reading for Thursday: “What is Bayesian statistics and why everything else is wrong”

3.7.1 What’s a probability?

  • Chances of something happening
  • Frequentist: Number of “favorable events” / number of events
  • Bayesian: a number between 0 and 1.

\(p(\,\cdot \mid \cdot, \cdot, \cdot\,)\)

  • densities
  • cumulative — this is really what probability refers to.
  • discrete events
  • joint events
  • conditional events
  • relating joint and conditional: p(A & X) = p(A | X) p(X) = p(X | A) p(A)
  • Bayes’ rule: p(A | X) = p(X | A) p(A) / p(X)

3.8 Conditional probability

The probability of an event in a given state of the world. That state of the world might have been set up by another event having occurred.

3.9 Inverting conditional probabilities

What we want is \(p(\mbox{state of world} | \mbox{observations})\). I’ll write this \(p(\theta | {\cal O})\).

Tree with cancer (benign or malignant) and cell shape (round, elongated, ruffled)

SPACE FOR THE TREE

SEE PAPER NOTES. (remember to transcribe them here)

For example: we observe ruffled cells; what is the chance that the tumor is malignant?

Of the 10,000 people in the study,

  • 7000 had benign tumors, of which 10% (700) had ruffled cells
  • 3000 had malignant tumors, of whom 60% (1800) had ruffled cells

So, of the 2500 people with ruffled cells, 1800 had malignant tumors: \(p( \theta | {\cal O} ) = p(\mbox{malignant} \,|\, \mbox{ruffled}) = 1800/2500 = 0.72\).
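
The same arithmetic written as a Bayes'-rule calculation in R, using the counts from the study above:

p_malignant <- 3000 / 10000     # prior probability of a malignant tumor
p_benign    <- 7000 / 10000
p_ruffled_given_malignant <- 0.60
p_ruffled_given_benign    <- 0.10

# p(ruffled), adding over the two branches of the tree: 2500 / 10000
p_ruffled <- p_ruffled_given_malignant * p_malignant +
  p_ruffled_given_benign * p_benign

# Bayes' rule: p(malignant | ruffled) = 1800 / 2500 = 0.72
p_ruffled_given_malignant * p_malignant / p_ruffled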

3.10 Summary of Day 9

Announcement: Read “What is Bayesian Statistics and why everything else is wrong.”

We derived Bayes’ rule (simple!) from fundamental principles of probability. There are three components of the formula that have individual names:

  • The prior. What you know before you examine the data.
  • The likelihood. A conditional probability: given a theory (that is, a way that you think the world works), what is the probability of the data you have observed?
    • Typically a likelihood is built around a model that has both a deterministic and a random component. For instance, the output \(y_i\) depends on an input \(x_i\) according to \(y_i = a + b x_i + \epsilon_i\).
    • We specify the properties (e.g., the variance) of the random component as part of the theory. For instance, we might specify that \(\epsilon\) is drawn from a normal distribution with mean zero and variance \(c^2\).
    • The likelihood can be calculated by multiplying together the probabilities of each of the observed values. Continuing with the example in the previous points, \(\epsilon_i = y_i - (a + b x_i)\), so the calculation would be likelihood \(= \prod_i\) dnorm(\(\epsilon_i\), mean = 0, sd = c).
    • In order to have acceptable performance in computer arithmetic, we generally calculate the log likelihood. This allows us to turn the product in the above point into a sum.
  • The posterior. Our updated beliefs given the data.

Controversy! Many people believe that performing calculations using a prior is unscientific. This is because the prior reflects the views of the researcher, rather than a solid fact of reality. Nonetheless, everyone agrees that the likelihood is meaningful. Many of the estimation problems of statistics amount to finding parameters that maximize the likelihood. (People who think that using a prior is a reasonable way of doing business, like the author, point out that the model itself is a subjective choice.)

The in-class programming task involved calculating a likelihood with a model that involved the exponential probability distribution. In that task, we were not doing the full Bayesian calculation. Instead, we took as our point estimate of the parameter the argmax, that is, the parameter value that produces the highest likelihood.
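
The task itself is not reproduced in these notes, but the flavor of the calculation is easy to sketch. The waiting-time data below are simulated, so the numbers are arbitrary: write the log likelihood of an exponential model as a function of its rate parameter, evaluate it over a grid, and take the argmax.

waits <- rexp(200, rate = 0.5)   # made-up "observed" waiting times

# Log likelihood of an exponential model with the given rate
LL <- function(rate) sum(log(dexp(waits, rate = rate)))

rates   <- seq(0.1, 2, by = 0.01)
ll_vals <- sapply(rates, LL)
rates[which.max(ll_vals)]        # point estimate: the argmax, near 1 / mean(waits)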

3.11 Likelihood example

Consider a mechanism like that behind Figure 3.1 in ISLR.

Figure 3.1 from ISLR

There seems to be a straight-line relationship between Sales and TV. You can also see that the variance of the residuals is bigger for larger values of TV. The usual least-squares estimator is based on maximizing the likelihood of a model like \(\mbox{Sales} = a + b \mbox{TV} + \epsilon\), where \(\epsilon\) is “iid normal”. But we know the \(\epsilon\) estimated from the data won’t be identically distributed: their spread grows with TV. Our model is therefore wrong. That’s not necessarily a problem, since all models are wrong (they are models, after all!) but some models are useful. So the iid model might still be useful.

Let’s make some similar data on which we can demonstrate a likelihood calculation.

TV <- runif(500)
Sales <- 3.5 + 12 * TV + (3* TV + 0.75) * rnorm(length(TV))
plot(TV, Sales, ylim = c(0, 25))

Here’s a simple calculation with a “wrong” model:

lm(Sales ~ TV)
## 
## Call:
## lm(formula = Sales ~ TV)
## 
## Coefficients:
## (Intercept)           TV  
##       3.391       11.910

If we like, however, we can do the calculation with a “better” model, say, taking the following as the model behind our likelihood.

Sales_likelihood <- function(params) {
  a <- params[1]
  b <- params[2]
  c <- params[3]
  d <- params[4]

  # Negative log likelihood: negate so that the minimization routine (nlm) finds the maximum likelihood
  -sum(log(dnorm(Sales - (a + b * TV), mean = 0, sd = c + d * TV )))
}
Sales_likelihood(c(5, 10, 1, 1))
## [1] 1263.285
nlm(Sales_likelihood, c(5, 10, 1, 1))
## $minimum
## [1] 1040.835
## 
## $estimate
## [1]  3.4181833 11.8437802  0.6808023  2.8594718
## 
## $gradient
## [1]  1.303770e-05  1.746993e-06 -1.318767e-05  8.905789e-06
## 
## $code
## [1] 1
## 
## $iterations
## [1] 23

3.12 Exponential probability density

What’s the time between random events, e.g. 500-year storms or earthquakes in a region that has a big one roughly every 100 years?
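
For a quick feel for the model (assuming a memoryless process with a mean recurrence interval of 100 years, so rate \(\lambda = 1/100\) per year):

lambda <- 1 / 100    # one big event per 100 years, on average

pexp(30, rate = lambda)        # chance of at least one event in the next 30 years
1 - pexp(200, rate = lambda)   # chance of waiting 200 years or more

# The density of the waiting time
curve(dexp(x, rate = lambda), from = 0, to = 500,
      xlab = "Years until the next event", ylab = "Density")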

Earthquake warning in Southern California, late Sept. 2016

But over the last week, anxieties were particularly heightened, and the natural denial that is part of living in earthquake country was harder to pull off.

A swarm of seismic activity at the Salton Sea that began a week ago prompted scientists to say there was an elevated risk for a big San Andreas fault earthquake. By Monday [Oct 3, 2016], that risk had lessened.

But the impact of that warning was still being felt. For some, it meant checking quake safety lists. Others looked at preparing for the Big One, such as bolting bookshelves to walls, installing safety latches on kitchen cabinets and strapping down televisions.

Why has the risk gotten smaller? How much smaller?

3.12.1 Meanwhile, further north …

From “The Really Big One,” an article in the New Yorker about discoveries in the last few decades that established a high risk in the Pacific Northwest for an earthquake of magnitude 9:

We now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.

It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.

The last paragraph …

All day long, just out of sight, the ocean rises up and collapses, spilling foamy overlapping ovals onto the shore. Eighty miles farther out, ten thousand feet below the surface of the sea, the hand of a geological clock is somewhere in its slow sweep. All across the region, seismologists are looking at their watches, wondering how long we have, and what we will do, before geological time catches up to our own.

Have students propose distributions and justify them.

3.13 California earthquake warning, reprise

The Salton Sea earthquake happens. Our prior on large \(\lambda\) immediately surges, so there is a significant probability of a quake in the next hours. But as more time goes by, that probability goes down.

  • We’re interested in \(\lambda\) in the exponential distribution \(\lambda \exp(-\lambda t)\). This has the cumulative distribution \(1 - \exp(-\lambda t)\).
  • Observation: Earthquake hasn’t occurred after D days. Likelihood is 1 minus the cumulative, or \(\exp(-\lambda D)\).
  • Prior: a mix of the conventional (very small lambda) and some small probability of very high lambda.

Plot out the posterior for different values of D:

  • D = 0.1 two hours after the quake.
  • D = 1 a day after the quake
  • D = 3 three days after the quake

The area to the right of 5 days (in expected time to the next quake) is the conventional model.

Plot this out as a function of \(1/\lambda\), so we need to adjust the density by the Jacobian \(\left| d\lambda \, / \, d(1/\lambda) \right| = \lambda^2\).

D <- .1
lambda <- 100/(1:5000) 
# prior: flat (height 25) for small lambda, i.e. the conventional model with a
#  long expected time to the next quake, plus a 1/lambda tail for large lambda
prior <- function(lambda) ifelse(lambda < .2, 25, 1/lambda)
plot(lambda, ( prior(lambda)), type = "l", xlim = c(0,5))

plot(1/lambda, exp( - lambda * D) * prior(lambda) * (lambda^2), 
     type = "l", xlab = "Expected time to the big one, days.",
     xlim = c(0, 20))
lines(1/lambda, lambda*.005/prior(lambda), col = "red")

For small D, the “urgency” part of the prior overwhelms the likelihood. As D gets bigger, we revert to the standard model.

3.14 The Price is Right!

The Price is Right is a game show in which contestants compete to guess the price of a prize. The winner is the contestant whose guess is closest to the actual price, considering only those contestants who guessed a price less than or equal to the actual price.

Strategy:

  1. First person to guess: an honest guess, hedged on the low side.
  2. Second person: bias guess to be far from the first person’s guess.
  3. Third person:
  4. Fourth person: Zero, or just above one of the other guesses.

Play this game. Call down 4 contestants. What’s the price of this yacht?

Now, suppose rather than being a strategic game biased toward the last guesser, we wanted to evaluate political prognosticators. The winner should be the person who makes the best prediction rather than the best guess.

Game: Predict the number of electoral college votes for Donald Trump.

Game: Predict the results of the Ukrainian Parliament’s vote of no confidence in Prime Minister Arseniy Yatsenyuk. How many votes for no confidence were there?¹

Play this game asking people to draw the probability distribution of their prediction.

  • Suppose you know something about the contestants.
    • David Moore from International Studies
    • Gary Krueger from Economics
    • Sybill Trelawney from Divination Science
    • Jesse Ventura from Political Science
  • You’ve been asked to assign a probability to each contestant. You’ll use this probability to weight each of their future predictions.

Have the contestants keep their identity secret at first.

Draw a density on the board. Give them a vertical scale for density, insisting that each of their densities has area one.

  1. The Likelihood Game: Who won? How to evaluate the predictions?

  2. The Bayesian Game:
    • The contestants reveal their identity
    • What’s your posterior probability on each of them?

3.15 From likelihood to Bayes

Multiply likelihood by prior probability. Normalize so that total probability is 1.
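
A minimal sketch of that update, in the spirit of the prediction game above. All of the numbers here are made up: the prior weights on the four contestants, each contestant's prediction summarized as a normal density, and the outcome that is eventually observed.

prior <- c(moore = 0.3, krueger = 0.3, trelawney = 0.1, ventura = 0.3)

centers <- c(240, 290, 350, 270)   # centers of the contestants' densities
spreads <- c(30, 20, 60, 40)       # and their spreads

actual <- 300                      # the outcome that was eventually observed

# Likelihood: each contestant's density, evaluated at the actual outcome
likelihood <- dnorm(actual, mean = centers, sd = spreads)

# Posterior: multiply by the prior, then normalize so the total is 1
posterior <- prior * likelihood / sum(prior * likelihood)
round(posterior, 3)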

3.16 Choosing models using maximum likelihood

  • We model the error as random, with a probability distribution we choose. Often this distribution has parameters.
  • To find the error, we need to make an assumption of what the parameters of the deterministic model are.
    • Make that assumption.
    • Make a similar assumption for the parameters of the probability distribution.
    • Find the errors.
    • Calculate the probability of those errors given the probability distribution that we choose. That’s the likelihood for the assumed parameters.
  • Repeat in order to modify the assumptions to increase the likelihood.

Straight line model: \(y_i = m x_i + b + \epsilon_i\)

Gaussian errors:

\[f(x \; | \; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }\]

What happens when you take the log … why it’s sum of squares.
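
Sketching that step: write \(\epsilon_i = y_i - (m x_i + b)\) and treat the \(\epsilon_i\) as independent draws from the Gaussian density above with \(\mu = 0\). Then

\[\log \prod_i f(\epsilon_i \,|\, 0, \sigma) = -\frac{n}{2} \log(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_i \epsilon_i^2 ,\]

so for any fixed \(\sigma\), making the log likelihood as large as possible over \(m\) and \(b\) is the same as making the sum of squares \(\sum_i \epsilon_i^2\) as small as possible.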

Question: What about minimizing the absolute value of the residuals, rather than the square? That corresponds to assuming a two-sided exponential (Laplace) distribution for the errors, like \(\frac{\lambda}{2} \exp(-\lambda |x|)\).

3.17 Day 9 Review

Likelihood.

  • Choose the “best” of a set of competing models. Often the set is parameterized by quantities such as slope \(m\), intercept \(b\), rate \(\lambda\), mean, standard deviation, …
  • “Figure of merit” for each model is the probability of the data given the model.
  • Often, models have a deterministic part (e.g. \(m x + b\)) and a random part \(\epsilon\).
  • Part of the model is our choice for the distribution of \(\epsilon\).
  • Given that distribution, and treating each error \(\epsilon_i\) as random, to calculate the likelihood we find the probability of each \(\epsilon_i\) and multiply them together.
  • For practical reasons (both algebraic and computational) we work with the log-likelihood.

Example: Mean and standard deviation

data_vals <- runif(10000, min = 20, max = 50)
# data_vals <- rexp(10000, rate = 1) 
mean(data_vals)
## [1] 35.07677
median(data_vals)
## [1] 35.01521
sd(data_vals)
## [1] 8.670784
IQR(data_vals)
## [1] 15.02683
LL_gaussian <- function(params) {
  center = params[1]
  spread = params[2]
  sum(log(dnorm(data_vals, mean = center, sd = spread)))
}
optim(par = c(25, 10), LL_gaussian, control = list(fnscale = -1))
## $par
## [1] 35.077971  8.671052
## 
## $value
## [1] -35788.48
## 
## $counts
## function gradient 
##       53       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
LL_exp <- function(params) {
  center = params[1]
  spread = params[2]
  sum(log(dexp(abs(data_vals - center), rate = 1/spread)))
}
optim(par = c(25, 10), LL_exp, control = list(fnscale = -1))
## $par
## [1] 35.015632  7.515695
## 
## $value
## [1] -30171.14
## 
## $counts
## function gradient 
##       55       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

3.18 Reading: What is Bayesian Statistics

Link to “What is Bayesian Statistics and why everything else is wrong.”

Go through sections 1, 2, and 4 in class: about the likelihood calculation, the p-value calculation, and the philosophical criticism of p-values.

Find the likelihood of observing \(k\) cases out of \(E\) employees with a rate of \(\theta\).

E <- 145
k <- 8
theta <- .01
L <- function(theta) dbinom(k, size = E, prob = theta)

Making the plot of likelihood versus \(\theta\)

x <- seq(0,.10, length = 100)
y <- L(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")

Emphasize the choice of what detail of the sampling model to use. Just this school in isolation?

Suppose we consider that there are 1000 schools near high-tension lines. Our school presumably has one of the highest rates, since other schools with bigger numbers would have come forward. Let’s imagine that our school is in the top 10%. This is like calculating that, of 10 schools, the 8 cancer cases we observed are the most of any of those 10. What does this give for the likelihood of theta?

nschools <- 10   # being the biggest of 10 corresponds to being in the top 10%
Lschools <- function(theta) {
  prob_of_k <- dbinom(k, size = E, prob = theta)
  less_than_k <- pbinom(k - 0.5, size = E, prob = theta)^(nschools - 1)
  prob_of_k * less_than_k
}
x <- seq(0,.05, length = 100)
y <- Lschools(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")

3.19 Programming Basics: Conditionals

  • a function: ifelse(condition, yes_value, no_value), which carries out the test for each element of the condition vector.
  • a special form: if (condition) {statements} else {statements}
x <- rnorm(10)
table(x < 0)
## 
## FALSE  TRUE 
##     5     5
ifelse(x >= 0, sqrt(x), "bogus")
## Warning in sqrt(x): NaNs produced
##  [1] "0.415098097926984" "bogus"             "0.78650648752133" 
##  [4] "bogus"             "0.539146279368726" "1.72054962703838" 
##  [7] "0.889947136565264" "bogus"             "bogus"            
## [10] "bogus"

Both of these can be nested.
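
For instance (a small made-up classification rule), the special form nests as an if ... else if ... chain, and ifelse() nests by putting one call inside another:

sign_word <- function(x) {      # works on a single number
  if (x > 0) {
    "positive"
  } else if (x < 0) {
    "negative"
  } else {
    "zero"
  }
}
sign_word(-3)

x <- c(-2, 0, 5)                # works element-by-element on a vector
ifelse(x > 0, "positive", ifelse(x < 0, "negative", "zero"))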

3.20 ifelse() examples

library(ISLR)
## 
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
## 
##     College
data(Default) # loan balances and defaults.
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
you_pay <- with(Default,
  ifelse(balance / 10  <  income, 0.10 * balance, 0.05 * income))
sum(you_pay)
## [1] 835374.9

Determine annual payment amount for student loans. E.g.

  • If it’s a student, no payment, otherwise $100
  • If it’s a student, no payment. Otherwise 10% of the balance.
  • If the balance is less than 10 times the income, 10% of the balance, otherwise 5% of income.
  • For those in default, nothing. For the others, any of the above three schemes.
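
A sketch of the last scheme, nesting the third rule inside a check on default status (the 10% and 5% rates are the same ones used in you_pay above):

annual_payment <- with(Default,
  ifelse(default == "Yes",
         0,                                  # in default: no payment
         ifelse(balance / 10 < income,
                0.10 * balance,              # balance small relative to income
                0.05 * income)))
sum(annual_payment)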

3.21 if … else … examples

  1. Calculate the median of a set of values.
    • if an odd number in the set, then the middle value
    • if an even number in the set, the mean of the two values in the middle.
my_median <- function(x) {
  if (length(x) %% 2 == 1) {
    # odd length
    sort(x)[ceiling(length(x)/2)]
  } else {
    # even length
    inds <- length(x)/2 + c(0, 1)
    mean(sort(x)[inds])

  }
}
my_median(1:11)
## [1] 6
median(1:11)
## [1] 6
  2. pmax() is a function that takes the case-by-case maximum of two or more input vectors. We’re going to add some error checking.
my_pmax <- function(v1, v2) ifelse(v1 > v2, v1, v2)
my_pmax3 <- function(v1, v2, v3){
  ifelse(v1 >= v2,
         my_pmax(v1, v3),
         my_pmax(v2, v3)
  )
  }
my_pmax4 <- function(v1, v2, v3, v4) {
  ifelse(v1 >= v2, 
         my_pmax3(v1, v3, v4),
         my_pmax3(v2, v3, v4))
}
my_pmax_any <- function(...) {
  vectors <- list(...)
  if (length(vectors) == 2) my_pmax(...)
  else if (length(vectors) == 3) my_pmax3(...)
  else if (length(vectors) == 4) my_pmax4(...)
  else stop("my_pmax_any() handles only 2 to 4 vectors.")
}


my_pmax(rnorm(10), rnorm(10))
##  [1]  1.05113667  0.42541644  0.98409896  0.98163834 -0.37525939
##  [6]  0.60408141  0.06943719  1.50049940  1.49312466  1.17298846
  • Unless the vectors are either numeric or character, and of the same type, throw an error.
  • Add an argument handle_na which, if TRUE replaces NA with -Inf for the purposes of the comparison.
  • Add an argument na_rid=
    • if “either”, throw away the cases where either of the values is NA
    • if “both”, throw away cases where both are NA and otherwise handle NA as -Inf
    • if “neither”, keep all cases.
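
One possible sketch of the first two additions (the na_rid= variants are left as an exercise, and the -Inf trick only makes sense for numeric vectors, so character inputs are passed through untouched):

my_pmax_checked <- function(v1, v2, handle_na = FALSE) {
  # Insist that the inputs are both numeric or both character
  same_type <- (is.numeric(v1) && is.numeric(v2)) ||
    (is.character(v1) && is.character(v2))
  if (!same_type) stop("v1 and v2 must both be numeric or both be character.")

  # Optionally treat NA as -Inf for the purposes of the comparison
  if (handle_na && is.numeric(v1)) {
    v1 <- ifelse(is.na(v1), -Inf, v1)
    v2 <- ifelse(is.na(v2), -Inf, v2)
  }

  ifelse(v1 > v2, v1, v2)
}

my_pmax_checked(c(1, NA, 3), c(2, 2, NA), handle_na = TRUE)  # gives 2 2 3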

3.22 Simple

Write functions that return, case by case,

  • the maximum of two vectors.
  • three vectors
  • four vectors
  • five vectors

  • Write a supervisory function that does it for 1 to 5 vectors. Use ...

max_in_parallel <- function(...) {
  Vecs <- list(...)
}
  • Write a supervisory function that will handle more than 5 vectors.
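
One way to handle any number of vectors is to fold the two-vector my_pmax() across the arguments; this is one possible sketch, not the only approach:

max_in_parallel <- function(...) {
  Vecs <- list(...)
  # Combine the vectors two at a time, carrying the running maximum along
  Reduce(my_pmax, Vecs)
}

max_in_parallel(c(1, 5, 2), c(4, 0, 3), c(2, 2, 9), c(0, 7, 1))  # gives 4 7 9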

3.23 Blood testing

It is known that 5% of the members of a population have disease X, which can be discovered by a blood test (that is assumed to perfectly identify both diseased and nondiseased populations). Suppose that N people are to be tested, and the cost of the test is nontrivial. The testing can be done in two ways:

  1. Everyone can be tested separately; or
  2. the blood samples of k people are pooled to be analyzed.

Assume that N = nk with n being an integer. If the test is negative, all the people tested are healthy (that is, just this one test is needed). If the test result is positive, each of the k people must be tested separately (that is, a total of k + 1 tests are needed for that group).

  1. For fixed k what is the expected number of tests needed in (B)?
  2. Find the k that will minimize the expected number of tests in (B).
  3. Using the k that minimizes the number of tests, on average how many tests does (B) save in comparison with (A)? Be sure to check your answer using an empirical simulation.
ntests <- function(p = 0.05, npools = 500, pool_size = 10, nsims=1000) {
  # generate the number of infected people in each pool
  infected_in_pool <- rbinom(npools, prob = p, size = pool_size)
  # if one or more in a pool is infected, pool_size+1 tests, 
  # otherwise 1 test
  tests_in_each_pool <- ifelse(infected_in_pool > 0, pool_size + 1, 1)
  # total across all pools
  sum(tests_in_each_pool)
}
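
For question 1 the expected count can also be written down directly: a pool of k people needs 1 test with probability \((1-p)^k\) and \(k+1\) tests otherwise. A sketch that uses this to find the best pool size (N = 5000 people is an arbitrary choice, and N/k is treated as the number of pools):

expected_tests <- function(k, p = 0.05, N = 5000) {
  per_pool <- (1 - p)^k * 1 + (1 - (1 - p)^k) * (k + 1)
  (N / k) * per_pool
}

k_values  <- 2:20
exp_tests <- sapply(k_values, expected_tests)
k_values[which.min(exp_tests)]   # pool size minimizing the expected number of tests
min(exp_tests)                   # compare to the N tests needed to test everyone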

Can we do this recursively to get more savings?

people <- runif(100000) < .05
ntests <- function(population) {
  if ( (! any(population)) || length(population) == 1) {
    # we're done!
    total_tests <- 1
  } else {
    # Split into two groups and test again
    split_point <- round(length(population)/2)
    group1 <- population[1:split_point]
    group2 <- population[(split_point + 1) : length(population)]
    total_tests <-
      ntests(group1) + 
      ntests(group2) + 1  
    # + 1 for the test that said we need to divide the groups
  } 
  
  total_tests
}

How many tests needed for a population of 10,000 with a prevalence of 1%?

library(mosaic)
do(10) * ntests(runif(10000) < 0.01)
##    ntests
## 1    1373
## 2    1475
## 3    1497
## 4    1273
## 5    1217
## 6    1247
## 7    1499
## 8    1321
## 9    1253
## 10   1411

Try other prevalences to find prevalence at which it’s no longer worthwhile to pool the samples.
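
A sketch of that comparison, reusing the recursive ntests() above (each entry is a single simulated population of 10,000, so expect some noise):

prevalences  <- c(0.01, 0.05, 0.10, 0.20, 0.30, 0.40)
tests_needed <- sapply(prevalences, function(p) ntests(runif(10000) < p))
# Pooling stops paying off as the counts approach 10,000 (testing everyone)
rbind(prevalence = prevalences, tests = tests_needed)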

3.24 The (hyper)-volume of the hypersphere.

A framework for the volumes: \(C_n r^n\).

  • \(n=1\) — the line segment of length \(2r\). Volume is \(2r^1\) so \(C_1 = 2\).
  • \(n=2\) — the circle of radius \(r\): volume is \(\pi r^2\), so \(C_2 = \pi\)
  • \(n=3\) — the sphere of radius \(r\): volume is \(\frac{4}{3}\pi r^3\), so \(C_3 = \frac{4}{3} \pi\)

TASK: Find \(C_4\), \(C_5\), … , \(C_8\).

Programming approach:

  1. Write the logic. Give explicit names to all quantities that you might want to change later.
dim <- 3  # might vary
npts <- 1000  # might vary
pts <- matrix(runif(dim * npts, min=0, max=1), ncol=dim)
dists <- rowSums( pts^2 )
(2^dim) * sum(dists <= 1) / length(dists)
## [1] 4.312
  2. Make the “might vary” quantities the arguments to a function that encapsulates the rest of the logic.
sphere_volume <- function(dim=3, npts=1000000) {
  pts <- matrix(runif(dim * npts, min=0, max=1), ncol=dim)
  dists <- rowSums( pts^2 )
  
  (2^dim) * sum(dists <= 1) / length(dists)
}
sapply(1:20, FUN = sphere_volume)
##  [1] 2.000000 3.143072 4.192328 4.929392 5.285536 5.171712 4.689920
##  [8] 4.050944 3.212800 2.506752 1.961984 1.253376 0.892928 0.540672
## [15] 0.360448 0.327680 0.393216 0.000000 0.000000 0.000000

Volume of the encompassing hyper-cube (side length 2):

2 ^ (1:20)
##  [1]       2       4       8      16      32      64     128     256
##  [9]     512    1024    2048    4096    8192   16384   32768   65536
## [17]  131072  262144  524288 1048576

So the volume of the encompassed hyper-sphere goes to zero percent of the volume of the encompassing hyper-cube.

Theoretical formula: \(V_n(R) = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2} + 1)}R^n\)

sapply(1:20, FUN=function(n) (pi^(n/2) / gamma(n/2 + 1)) )
##  [1] 2.00000000 3.14159265 4.18879020 4.93480220 5.26378901 5.16771278
##  [7] 4.72476597 4.05871213 3.29850890 2.55016404 1.88410388 1.33526277
## [13] 0.91062875 0.59926453 0.38144328 0.23533063 0.14098111 0.08214589
## [19] 0.04662160 0.02580689

3.25 Find the surface area, \(D_n r^{n-1}\).

  • \(D_1 = 2\)
  • \(D_2 = 2 \pi\)
  • \(D_3 = 4 \pi\)
  • Find \(D_4\), \(D_5\), …, \(D_7\)
  • Two ways:
    1. Take the derivative wrt \(r\) of the volume.
    2. Find the fraction of points in a narrow shell between distance 0.99 and 1.01 from the origin.
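
A sketch of the second method, reusing the one-orthant sampling trick from sphere_volume(): estimate the volume of the thin shell between radius 0.99 and 1.01, then divide by its thickness, 0.02. (Points with radius slightly above 1 near the axes fall outside the unit cube, so the estimate has a small downward bias.)

sphere_surface <- function(dim = 3, npts = 1000000) {
  pts <- matrix(runif(dim * npts, min = 0, max = 1), ncol = dim)
  r <- sqrt(rowSums(pts^2))
  shell_volume <- (2^dim) * sum(r > 0.99 & r < 1.01) / npts
  shell_volume / 0.02
}
sapply(2:7, FUN = sphere_surface)
# For comparison: D_2 = 2 pi is about 6.28 and D_3 = 4 pi is about 12.57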

3.26 In-class programming activity

Explanation of draw poker

  • cards: ranks and suits
  • hands: royal flush, straight flush, …

¹ Actual result for you to compare your prediction to: 194 out of 339.