The topics in this section — linear algebra, Bayes’ rule, and likelihood — underlie many of the machine-learning techniques we will be studying later in the course. Bayes’ rule is a way to flip conditional probabilities. Among other things it allows you to interpret data in the light of previous knowledge and belief (which to be fancy, we can call “theory”). Likelihood is a unifying principle for using data to estimate model parameters and is fundamental in statistical theory. It’s also an essential part of Bayes’ rule. And linear algebra is used throughout statistics and machine learning. Among other things, it’s at work behind the motivation and calculations of regression.
The idea here is not to teach you linear algebra, but to expose you to some of the terminology and operations of linear algebra, so that when you see it again later in the course you’ll have a good start.
For linear algebra folks: projection is the Q part of the QR decomposition; R is the solve part.

* In economics, they write things in an older style: solve \(M b = y\) for \(b\). But \(M\) may not be square, so there is no regular inverse.
* Pre-multiply by \(M^T\) to get \(M^T M b = M^T y\).
* Invert to get \(b = (M^T M)^{-1} M^T y\). The matrix \((M^T M)^{-1} M^T\) is called the pseudoinverse.
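As a quick numerical check of that algebra (a sketch; the matrix M and response y below are made up), the pseudoinverse gives the same coefficients as R’s built-in least-squares solver:

set.seed(1)
M <- cbind(1, runif(10), runif(10))            # a non-square "design" matrix
y <- M %*% c(2, -3, 5) + rnorm(10, sd = 0.1)   # made-up response
b_pseudo <- solve(t(M) %*% M) %*% t(M) %*% y   # (M^T M)^{-1} M^T y
b_qr <- qr.solve(M, y)                         # least squares via QR
cbind(b_pseudo, b_qr)                          # the two sets of coefficients agree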
vdot <- function(v, w) {
  sum(v * w)
}
vlength <- function(v) {
  sqrt(vdot(v, v))
}
vangle <- function(v, w, degrees = FALSE) {
  theta <- acos(vdot(v, w) / (vlength(v) * vlength(w)))
  if (degrees) theta * (180 / pi) else theta
}
vproject <- function(v, onto) { # the red thing
  onto * vlength(v) * cos(vangle(v, onto)) / vlength(onto)
}
vresid <- function(v, onto) {
  v - vproject(v, onto)
}
vdot <- function(a, b) {
  sum(a * b)
}
vlen <- function(a) sqrt(vdot(a, a))
vcos <- function(a, b) {
  vdot(a, b) / (vlen(a) * vlen(b))
}
vangle <- function(a, b, degrees = FALSE) {
  res <- acos(vcos(a, b))   # angle in radians
  if (degrees) res <- res * 180 / pi
  res
}
vproj <- function(a, onto) {
  # component of a along onto: |a| cos(theta) times the unit vector of onto
  vlen(a) * vcos(a, onto) * onto / vlen(onto)
}
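A quick check of the first set of functions with made-up vectors: the projection and the residual add back up to the original vector, and the residual is orthogonal to the direction we projected onto.

v <- c(3, 4, 0)
w <- c(1, 1, 1)
p <- vproject(v, onto = w)
r <- vresid(v, onto = w)
p + r                          # recovers v
vdot(r, w)                     # zero: the residual is orthogonal to w
vangle(v, w, degrees = TRUE)   # the angle between v and w, in degrees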
\[ \mbox{standard error of B coef.} = | \mbox{residuals} | \frac{1}{| \mbox{B} |}\ \frac{1}{\sin( \theta )}\ \frac{1}{\sqrt{n}}\ \sqrt{\frac{n}{n-m}}\]
We accept that our models won’t produce an \(\hat{f}(x)\) that always matches \(y\). There is the irreducible error \(\epsilon\), in addition to variance and bias.
We’re using mean square error or sum of square errors as a measure of how far \(\hat{f}(x_i)\) is from the actual result \(y_i\).
Now we’re going to look at the difference in terms of probabilities: what is the probability of observing any particular \(y_i\), given our \(\hat{f}(x_i)\)?
Let’s quantify probability.
We finished up our brief introduction to linear algebra and started discussing probability. I suggested the rather broad definition of a probability as a number between zero and one.
Make sure you’ve accepted the invitation to the discussion group.
Reading for Thursday: “What is Bayesian statistics and why everything else is wrong”
\(p(\mbox{event} \mid \mbox{state of the world})\)
The probability of an event in a given state of the world. That state of the world might have been set up by another event having occurred.
What we want is \(p(\mbox{state of world} | \mbox{observations})\). I’ll write this as \(p(\theta | {\cal O})\).
Tree with cancer (benign or malignant) and cell shape (round, elongated, ruffled)
SPACE FOR THE TREE
SEE PAPER NOTES. (remember to transcribe them here)
For example, if we observe ruffled cells, what is the chance that the tumor is malignant?
Of the 10000 people in the study,
* 7000 had benign tumors, of which 10% or 700 had ruffled cells
* 3000 had malignant tumors, of whom 60% or 1800 had ruffled cells
So, of the 700 + 1800 = 2500 people with ruffled cells, 1800 had malignant tumors: \(p( \theta | {\cal O} ) = 1800/2500 = 0.72\).
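The same tree calculation in R, using the numbers above:

benign <- 7000; malignant <- 3000
ruffled_benign <- 0.10 * benign         # 700
ruffled_malignant <- 0.60 * malignant   # 1800
# p(malignant | ruffled) = malignant-and-ruffled over all ruffled
ruffled_malignant / (ruffled_benign + ruffled_malignant)   # 0.72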
Announcement: Read “What is Bayesian Statistics and why everything else is wrong.”
We derived Bayes’ rule (simple!) from fundamental principles of probability. There are three components of the formula that have individual names: the prior, the likelihood, and the posterior.

The likelihood of an individual residual under a normal error model is dnorm(\(\epsilon_i\), mean = 0, sd = c).

Controversy! Many people believe that performing calculations using a prior is unscientific. This is because the prior reflects the views of the researcher, rather than a solid fact of reality. Nonetheless, everyone agrees that the likelihood is meaningful. Many of the estimation problems of statistics amount to finding parameters that maximize the likelihood. (People who think that using a prior is a reasonable way of doing business, like the author, point out that the model itself is a subjective choice.)
The in-class programming task involved calculating a likelihood with a model that involved the exponential probability distribution. In that task, we were not doing the full Bayesian calculation. Instead, we took as our point estimate of the parameter the argmax, that is, the parameter value that produces the highest likelihood.
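A minimal sketch of that kind of calculation (the waits data below are made up, not the in-class dataset): write the log-likelihood under an exponential model and take the argmax over a grid of candidate rates.

waits <- c(2.1, 0.4, 3.7, 1.2, 0.8, 5.0)   # hypothetical waiting times
LL_exp_rate <- function(rate) sum(dexp(waits, rate = rate, log = TRUE))
rates <- seq(0.01, 2, length.out = 500)
rates[which.max(sapply(rates, LL_exp_rate))]   # argmax; close to 1 / mean(waits)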
Consider a mechanism like that behind Figure 3.1 in ISLR.
Figure 3.1 from ISLR
There seems to be a straight-line relationship between Sales and TV. You can also see that the variance of the residuals is bigger for larger values of TV. The usual least-squares estimator is based on maximizing the likelihood of a model like \(\mbox{Sales} = a + b \mbox{TV} + \epsilon\), where \(\epsilon\) is “iid normal”. But we know the \(\epsilon\) estimated from the data won’t be iid. Our model is therefore wrong. That’s not necessarily a problem, since all models are wrong (they are models after all!) but some models are useful. So the iid model might be useful.
Let’s make some similar data on which we can demonstrate a likelihood calculation.
TV <- runif(500)
Sales <- 3.5 + 12 * TV + (3* TV + 0.75) * rnorm(length(TV))
plot(TV, Sales, ylim = c(0, 25))
Here’s a simple calculation with a “wrong” model:
lm(Sales ~ TV)
##
## Call:
## lm(formula = Sales ~ TV)
##
## Coefficients:
## (Intercept) TV
## 3.40 12.42
If we like, however, we can do the calculation with a “better” model, say, taking the following as the model behind our likelihood.
Sales_likelihood <- function(params) {
  a <- params[1]
  b <- params[2]
  c <- params[3]
  d <- params[4]
  # Negate so that the minimization routine will produce the maximum likelihood
  -sum(log(dnorm(Sales - (a + b * TV), mean = 0, sd = c + d * TV)))
}
Sales_likelihood(c(5, 10, 1, 1))
## [1] 1257.368
nlm(Sales_likelihood, c(5, 10, 1, 1))
## $minimum
## [1] 1057.491
##
## $estimate
## [1] 3.5354724 12.1258700 0.6342547 3.0234677
##
## $gradient
## [1] 2.070850e-05 8.419254e-06 3.251444e-05 3.429254e-05
##
## $code
## [1] 1
##
## $iterations
## [1] 25
What’s the time between random events, e.g. 500-year storms or earthquakes in a region that has a big one roughly every 100 years?
Earthquake warning in Southern California, late Sept. 2016
But over the last week, anxieties were particularly heightened, and the natural denial that is part of living in earthquake country was harder to pull off.
A swarm of seismic activity at the Salton Sea that began a week ago prompted scientists to say there was an elevated risk for a big San Andreas fault earthquake. By Monday [Oct 3, 2016], that risk had lessened.
But the impact of that warning was still being felt. For some, it meant checking quake safety lists. Others looked at preparing for the Big One, such as bolting bookshelves to walls, installing safety latches on kitchen cabinets and strapping down televisions.
Why has the risk gotten smaller? How much smaller?
From *The Really Big One*, an article in the New Yorker about discoveries in the last few decades that established a high risk in the Pacific Northwest for an earthquake of magnitude 9.
We now know that the Pacific Northwest has experienced forty-one subduction-zone earthquakes in the past ten thousand years. If you divide ten thousand by forty-one, you get two hundred and forty-three, which is Cascadia’s recurrence interval: the average amount of time that elapses between earthquakes. That timespan is dangerous both because it is too long—long enough for us to unwittingly build an entire civilization on top of our continent’s worst fault line—and because it is not long enough. Counting from the earthquake of 1700, we are now three hundred and fifteen years into a two-hundred-and-forty-three-year cycle.
It is possible to quibble with that number. Recurrence intervals are averages, and averages are tricky: ten is the average of nine and eleven, but also of eighteen and two. It is not possible, however, to dispute the scale of the problem.
The last paragraph …
All day long, just out of sight, the ocean rises up and collapses, spilling foamy overlapping ovals onto the shore. Eighty miles farther out, ten thousand feet below the surface of the sea, the hand of a geological clock is somewhere in its slow sweep. All across the region, seismologists are looking at their watches, wondering how long we have, and what we will do, before geological time catches up to our own.
Have students propose distributions and justify them.
The Salton Sea earthquake happens. Our prior on large \(\lambda\) immediately surges, so there is a significant probability of a quake in the next hours. But as more time goes by, that probability goes down.
Plot out the posterior for different values of D:
The area to the right of 5 days (in expected time to the next quake) is the conventional model.
Plot this out as a function of \(1/\lambda\). Since we are changing variables to \(\tau = 1/\lambda\), we need to adjust the density by the Jacobian \(\left| \frac{d\lambda}{d\tau} \right| = \lambda^2\).
D <- .1
lambda <- 100/(1:5000)
# prior: proportional to lambda:
# small lambda unlikely, so short time to next earthquake
prior <- function(lambda) (ifelse(lambda < .2, 25, 1/lambda))
plot(lambda, prior(lambda), type = "l", xlim = c(0, 5))
plot(1/lambda, exp(-lambda * D) * prior(lambda) * (lambda^2),
     type = "l", xlab = "Expected time to the big one, days.",
     xlim = c(0, 20))
lines(1/lambda, lambda*.005/prior(lambda), col = "red")
For small D, the “urgency” part of the prior overwhelms the likelihood. As D gets bigger, we revert to the standard model.
The Price Is Right is a game show in which contestants compete to guess the price of a prize. The winner is the person whose guess is closest to the actual price, considering just those contestants who guessed a price less than or equal to the actual price.
Strategy: the last contestant to guess can bid just above the highest earlier guess, which biases the game toward the last guesser.
Play this game. Call down 4 contestants. What’s the price of this yacht?
Now, suppose rather than being a strategic game biased toward the last guesser, we wanted to evaluate political prognosticators. The winner should be the person who makes the best prediction rather than the best guess.
Game: Predict the number of electoral college votes for Donald Trump.
Game: Predict the results of the Ukrainian Parliament’s vote of no confidence in Prime Minister Arseniy Yatsenyuk. How many votes for no confidence were there? Actual result for you to compare your prediction to: 194 out of 339.
Play this game asking people to draw the probability distribution of their prediction.
Have the contestants keep their identity secret at first.
Draw a density on the board. Give them a vertical scale for density, insisting that each of their densities has area one.
The Likelihood Game: Who won? How to evaluate the predictions?
Multiply likelihood by prior probability. Normalize so that total probability is 1.
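For a handful of discrete candidate states, that recipe is just a few lines of R (the numbers below are made up):

prior <- c(A = 0.5, B = 0.3, C = 0.2)           # hypothetical prior probabilities
likelihood <- c(A = 0.10, B = 0.40, C = 0.25)   # density each candidate assigned to the actual outcome
posterior <- prior * likelihood
posterior / sum(posterior)                      # normalize so total probability is 1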
Straight-line model: \(y_i = a + b x_i + \epsilon_i\)
Gaussian errors \(\epsilon_i\) have the density
\[f(x \; | \; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi} } \; e^{ -\frac{(x-\mu)^2}{2\sigma^2} }\]
What happens when you take the log of the likelihood, and why maximizing it amounts to minimizing the sum of squares:
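Written out, with iid Gaussian errors of standard deviation \(\sigma\):
\[ \log L = \sum_{i=1}^n \log\left( \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(y_i - \hat{f}(x_i))^2}{2\sigma^2}} \right) = -n \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2 ,\]
so for fixed \(\sigma\), maximizing the likelihood is the same as minimizing the sum of squared residuals.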
Question: What about minimizing the absolute value of the residuals, rather than the square?
Answer: that corresponds to assuming a two-sided exponential (Laplace) error distribution, \(\frac{\lambda}{2} \exp(-\lambda |x|)\).
Likelihood.
Example: Mean and standard deviation
data_vals <- runif(10000, min = 20, max = 50)
# data_vals <- rexp(10000, rate = 1)
mean(data_vals)
## [1] 35.04812
median(data_vals)
## [1] 35.18501
sd(data_vals)
## [1] 8.70109
IQR(data_vals)
## [1] 14.89111
LL_gaussian <- function(params) {
  center <- params[1]
  spread <- params[2]
  sum(log(dnorm(data_vals, mean = center, sd = spread)))
}
optim(par = c(25, 10), LL_gaussian, control = list(fnscale = -1))
## $par
## [1] 35.05039 8.70036
##
## $value
## [1] -35823.37
##
## $counts
## function gradient
## 53 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
LL_exp <- function(params) {
  center <- params[1]
  spread <- params[2]
  sum(log(dexp(abs(data_vals - center), rate = 1/spread)))
}
optim(par = c(25, 10), LL_exp, control = list(fnscale = -1))
## $par
## [1] 35.185767 7.513027
##
## $value
## [1] -30166.14
##
## $counts
## function gradient
## 59 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
Link to “What is Bayesian Statistics and why everything else is wrong.”
Go through sections 1, 2, and 4 in class: about the likelihood calculation, the p-value calculation, and the philosophical criticism of p-values.
Find the likelihood of observing \(k\) cases out of \(E\) employees with a rate of \(\theta\).
E <- 145
k <- 8
theta <- .01
L <- function(theta) dbinom(k, size = E, prob = theta)
Making the plot of likelihood versus \(\theta\)
x <- seq(0,.10, length = 100)
y <- L(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")
Emphasize the choice of what detail of the sampling model to use. Just this school in isolation?
Suppose we consider that there are 1000 schools near high-tension lines. Our school is presumably one of the highest rates, since other schools who had bigger numbers would come forward. Let’s imagine that our school is in the top 10%. This is like calculating that of 10 schools, the 8 cancer cases we observed are the most of any of those 10. What does this give for the likelihood of theta?
nschools <- 10   # our school is like the worst of 10 schools
Lschools <- function(theta) {
  prob_of_k <- dbinom(k, size = E, prob = theta)
  less_than_k <- pbinom(k - 0.5, size = E, prob = theta)^(nschools - 1)
  prob_of_k * less_than_k
}
x <- seq(0,.05, length = 100)
y <- Lschools(x)
plot(x, y, type = "l", xlab = "theta", ylab = "likelihood")
ifelse(condition, yes_value, no_value) carries out the test for each element in a vector, while if (condition) {statements} else {statements} works on a single condition.
x <- rnorm(10)
table(x < 0)
##
## FALSE TRUE
## 7 3
ifelse(x >= 0, sqrt(x), "bogus")
## Warning in sqrt(x): NaNs produced
## [1] "0.894373484707928" "1.05755578985983" "0.561229531943972"
## [4] "0.283877332330318" "bogus" "bogus"
## [7] "0.311034048939835" "0.8440657630661" "bogus"
## [10] "1.25796758166101"
Both of these can be nested.
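For example (a made-up vector), one ifelse() nested inside another sorts each element into one of three categories:

x <- c(-2, 0, 3, -1, 5)
ifelse(x < 0, "negative",
       ifelse(x == 0, "zero", "positive"))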
ifelse() examples
library(ISLR)
data(Default) # loan balances and defaults.
head(Default)
## default student balance income
## 1 No No 729.5265 44361.625
## 2 No Yes 817.1804 12106.135
## 3 No No 1073.5492 31767.139
## 4 No No 529.2506 35704.494
## 5 No No 785.6559 38463.496
## 6 No Yes 919.5885 7491.559
you_pay <- with(Default,
                ifelse(balance / 10 < income, 0.10 * balance, 0.05 * income))
sum(you_pay)
## [1] 835374.9
Determine annual payment amount for student loans. E.g.
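Here is one hypothetical rule as a sketch (the annual_payment() function and its 10%-of-balance, 5%-of-income numbers are made up, echoing the you_pay calculation above):

annual_payment <- function(balance, income) {
  # hypothetical rule: pay 10% of the balance, but never more than 5% of income
  pmin(0.10 * balance, 0.05 * income)
}
annual_payment(balance = c(12000, 40000), income = c(30000, 30000))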
my_median <- function(x) {
  if (length(x) %% 2 == 1) {
    # odd length
    sort(x)[ceiling(length(x) / 2)]
  } else {
    # even length
    inds <- length(x) / 2 + c(0, 1)
    mean(sort(x)[inds])
  }
}
my_median(1:11)
## [1] 6
median(1:11)
## [1] 6
pmax() is a function that takes the case-by-case maximum of two input vectors. We’re going to add some error checking.
my_pmax <- function(v1, v2) ifelse(v1 > v2, v1, v2)
my_pmax3 <- function(v1, v2, v3) {
  ifelse(v1 >= v2,
         my_pmax(v1, v3),
         my_pmax(v2, v3))
}
my_pmax4 <- function(v1, v2, v3, v4) {
  ifelse(v1 >= v2,
         my_pmax3(v1, v3, v4),
         my_pmax3(v2, v3, v4))
}
my_pmax_any <- function(...) {
  vectors <- list(...)
  # use an if/else chain so the matching case is what gets returned
  if (length(vectors) == 2) my_pmax(...)
  else if (length(vectors) == 3) my_pmax3(...)
  else if (length(vectors) == 4) my_pmax4(...)
  else stop("my_pmax_any() handles only 2 to 4 vectors")
}
my_pmax(rnorm(10), rnorm(10))
## [1] -0.32131029 0.24776045 -0.02983905 0.61739202 0.80969198
## [6] 1.16987760 -1.25000089 -0.28313383 -0.20915314 -0.22586907
Exercise: add a handle_na (or na_rid) argument which, if TRUE, replaces NA with -Inf for the purposes of the comparison, and otherwise leaves NA as NA. Write functions that return, case by case, the maximum of two, three, four, and five vectors.
Then write a supervisory function that does it for 1 to 5 vectors. Use ...
max_in_parallel <- function(...) {
  Vecs <- list(...)
}
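One possible way to fill in that skeleton (a sketch, not the official solution): fold my_pmax() across however many vectors were supplied, with a handle_na switch that treats NA as -Inf during the comparison, as the exercise describes.

max_in_parallel <- function(..., handle_na = FALSE) {
  Vecs <- list(...)
  if (handle_na) {
    # treat NA as -Inf so that any actual value beats a missing one
    Vecs <- lapply(Vecs, function(v) ifelse(is.na(v), -Inf, v))
  }
  # combine the vectors two at a time, keeping the case-by-case maximum
  Reduce(my_pmax, Vecs)
}
max_in_parallel(c(1, NA, 3), c(2, 2, NA), c(0, 1, 1), handle_na = TRUE)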
It is known that 5% of the members of a population have disease X, which can be discovered by a blood test (that is assumed to perfectly identify both diseased and nondiseased people). Suppose that N people are to be tested, and the cost of the test is nontrivial. The testing can be done in two ways:

* Each person can be tested separately; in this case N tests are required.
* The blood samples of k people can be pooled and analyzed together. Assume that N = nk with n being an integer. If the pooled test is negative, all the people in that pool are healthy (that is, just this one test is needed). If the test result is positive, each of the k people must be tested separately (that is, a total of k + 1 tests are needed for that group).
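As a back-of-the-envelope check: with pools of size \(k\) and prevalence \(p\), a pool tests negative with probability \((1-p)^k\), so
\[ E[\mbox{tests per pool}] = (1-p)^k \cdot 1 + \left(1 - (1-p)^k\right)(k+1) = 1 + k\left(1 - (1-p)^k\right) .\]
For \(p = 0.05\) and \(k = 10\) this is about \(1 + 10 \times 0.40 \approx 5\) tests per pool of 10 people, versus 10 for individual testing.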
ntests <- function(p = 0.05, npools = 500, pool_size = 10) {
  # generate the number of infected people in each pool
  infected_in_pool <- rbinom(npools, size = pool_size, prob = p)
  # if one or more in a pool is infected, pool_size + 1 tests,
  # otherwise 1 test
  tests_in_each_pool <- ifelse(infected_in_pool > 0, pool_size + 1, 1)
  # total across all pools
  sum(tests_in_each_pool)
}
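A quick sanity check of the simulation against that expectation, using the defaults above (500 pools of 10 people at 5% prevalence):

ntests()                              # one simulated total number of tests
500 * (1 + 10 * (1 - (1 - 0.05)^10))  # analytic expectation, roughly 2507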
Can we do this recursively to get more savings?
people <- runif(100000) < .05
ntests <- function(population) {
  if ((!any(population)) || length(population) == 1) {
    # we're done!
    total_tests <- 1
  } else {
    # Split into two groups and test again
    split_point <- round(length(population) / 2)
    group1 <- population[1:split_point]
    group2 <- population[(split_point + 1):length(population)]
    total_tests <-
      ntests(group1) +
      ntests(group2) + 1
    # + 1 for the test that said we need to divide the groups
  }
  total_tests
}
How many tests needed for a population of 10,000 with a prevalence of 1%?
library(mosaic)
do(10) * ntests(runif(10000) < 0.01)
## ntests
## 1 1509
## 2 1459
## 3 1385
## 4 1379
## 5 1333
## 6 1505
## 7 1419
## 8 1239
## 9 1393
## 10 1323
Try other prevalences to find prevalence at which it’s no longer worthwhile to pool the samples.
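One way to do that (a sketch; the grid of prevalences is arbitrary): rerun the recursive ntests() at several prevalences and compare against the 10,000 tests that individual testing would require.

prevalences <- c(0.01, 0.05, 0.1, 0.2, 0.3, 0.4)
pooled <- sapply(prevalences, function(p) ntests(runif(10000) < p))
rbind(prevalence = prevalences, pooled_tests = pooled)
# pooling stops being worthwhile once pooled_tests approaches 10000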
A framework for the volumes: \(C_n r^n\).
TASK: Find \(C_4\), \(C_5\), … , \(C_8\).
Programming approach:
dim <- 3 # might vary
npts <- 1000 # might vary
pts <- matrix(runif(dim * npts, min=0, max=1), ncol=dim)
dists <- rowSums( pts^2 )
(2^dim) * sum(dists <= 1) / length(dists)
## [1] 4.072
sphere_volume <- function(dim = 3, npts = 1000000) {
  pts <- matrix(runif(dim * npts, min = 0, max = 1), ncol = dim)
  dists <- rowSums(pts^2)
  (2^dim) * sum(dists <= 1) / length(dists)
}
sapply(1:20, FUN = sphere_volume)
## [1] 2.000000 3.141956 4.187528 4.928192 5.258080 5.213376 4.707328
## [8] 4.094976 3.347968 2.600960 1.933312 1.224704 0.802816 0.507904
## [15] 0.425984 0.065536 0.262144 0.000000 0.000000 0.000000
Volume of the unit hyper-cube:
2 ^ (1:20)
## [1] 2 4 8 16 32 64 128 256
## [9] 512 1024 2048 4096 8192 16384 32768 65536
## [17] 131072 262144 524288 1048576
So as the dimension grows, the volume of the inscribed hyper-sphere becomes a vanishingly small fraction of the volume of the encompassing hyper-cube.
Theoretical formula: \(V_n(R) = \frac{\pi^{n/2}}{\Gamma(\frac{n}{2} + 1)}R^n\)
sapply(1:20, FUN=function(n) (pi^(n/2) / gamma(n/2 + 1)) )
## [1] 2.00000000 3.14159265 4.18879020 4.93480220 5.26378901 5.16771278
## [7] 4.72476597 4.05871213 3.29850890 2.55016404 1.88410388 1.33526277
## [13] 0.91062875 0.59926453 0.38144328 0.23533063 0.14098111 0.08214589
## [19] 0.04662160 0.02580689
Explanation of draw poker