```r
Dice_rolls |>
  group_by(outcome) |>
  summarize(count = n())
```

```
# A tibble: 6 × 2
  outcome count
    <int> <int>
1       1   184
2       2   180
3       3   179
4       4   192
5       5   156
6       6   109
```
We have mentioned (just mentioned, mind you) a statistical quantity called the “likelihood.” Depending on how you look at things, a likelihood can be seen from four closely related perspectives:
1. A conditional probability, namely, the probability of the observed outcome given a hypothesis. For instance, suppose we hypothesize that a playing die is “loaded,” that is, more likely to come up six: say, one time in 4 as opposed to one time in 6. In this case, the likelihood is p(6 | loaded) = 25%. Note that “loaded” is our hypothesis. In a likelihood, the given part of the conditional probability is always a hypothesis. Therefore, the probability p(loaded | 6) is not a likelihood, even though it is a conditional probability.
2. A function of the possible observations. This function takes a possible observation as input and returns as output the likelihood of that observation given the hypothesis. For instance:
obs. | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
prob when “loaded” | 15% | 15% | 15% | 15% | 15% | 25% |
Note that adding up the probabilities over all the possible observations (1-6) gives a total probability of 1.
3. A function of the hypotheses under consideration. For instance, based on your understanding of how the loaded die was made, you might entertain three hypotheses: underloaded, fair, loaded.
hypothesis | likelihood p(6 | hypothesis)
---|---
underloaded | 10%
fair | 16.67% (that is, 1/6)
loaded | 25%
Notice that the sum of the three probabilities is not 1. It doesn’t need to be. You can think of the three likelihoods as probabilities on three different planets. On Planet Underloaded the probability of observing 6 is 10%. On the other two planets, the probability of observing 6 is something else and need have nothing at all to do with the probability on Planet Underloaded.
4. A function of two inputs, namely:
    - input 1: the possibilities for the observations (1-6)
    - input 2: the different hypotheses under consideration
hypothesis | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|
underloaded | 18% | 18% | 18% | 18% | 18% | 10% |
fair | 16.67% | 16.67% | 16.67% | 16.67% | 16.67% | 16.67% |
loaded | 15% | 15% | 15% | 15% | 15% | 25% |
The function in (3) is merely a slice of the table just above, taking only the “6” column.
The function in (2) is also a slice of the above table, taking only the “loaded” row.
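For concreteness, here is a sketch of how this two-input likelihood function could be stored as an R data frame. The name `Hypotheses` anticipates the data frame used in Task 2 below; the layout (one row per outcome, one probability column per hypothesis) is an assumption based on how it is used there.

```r
library(tibble)

# One row per possible outcome, one column of probabilities per hypothesis.
# The numbers come from the table above.
Hypotheses <- tibble(
  outcome     = 1:6,
  underloaded = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.10),
  fair        = rep(1/6, 6),
  loaded      = c(0.15, 0.15, 0.15, 0.15, 0.15, 0.25)
)
```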
Recall that fitting (or “training”) a model is done by taking the observed data and finding the values of parameters for the model specification that produce a model output that is “as close as possible” to the observed response variable. Almost always in statistics, “as close as possible” is defined to mean the parameter values that produce the maximum likelihood of the observed data.
In this section, you will “fit a model” of dice rolls in the sense of picking the hypothesis that maximizes the likelihood of the observed data.
The data are the results from rolling a die 1000 times and are stored under `outcome` in the data frame `Dice_rolls`. (To generate the data, you will need to run the first chunk in this document, the one that says `include=FALSE`.)
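The `include=FALSE` chunk itself is not reproduced here. If you want a sense of what such a chunk might contain, the sketch below simulates 1000 rolls; the seed and the probabilities are purely illustrative, not the values actually used.

```r
# Illustrative only: simulate 1000 rolls of a possibly unfair die
set.seed(101)
Dice_rolls <- tibble::tibble(
  outcome = sample(1:6, size = 1000, replace = TRUE,
                   prob = c(0.18, 0.18, 0.18, 0.18, 0.18, 0.10))
)
```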
Task 1: Use `View()` to look at `Dice_rolls` and confirm that it has 1000 rows. Then, use data wrangling to count the number of rows for each of the possible outcomes. From this table of counts, comment on which of the three hypotheses is most likely.
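One possible wrangling approach (a sketch; `count()` is shorthand for `group_by()` followed by `summarize(n())`):

```r
library(dplyr)

# Tally how many of the 1000 rolls landed on each face
Dice_rolls |>
  count(outcome)
```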
Task 2: Now we are going to calculate the likelihood of the observed data given each of the hypotheses. For this purpose, you have been provided with a data frame named `Hypotheses` that records the likelihood function.
Steps for calculating the likelihood for one of the hypotheses:
1. `Temp <- Dice_rolls |> left_join(Hypotheses)`
2. `View()` the data frame `Temp` and make sure you understand how it is structured. QUESTION: How many rows are there in `Temp`? What are the names of the columns in `Temp` that were not already in `Dice_rolls`?
3. Using `summarize()`, calculate the product (`prod()`) of all of the entries in each hypothesis column. Why product and not sum? Each of the rows in `Temp` records one event. Now consider the `fair` column. The entries in `fair` are the probabilities of seeing that row's outcome (given the “fair” hypothesis). When there are multiple events (as in the 1000 rows of `Dice_rolls`), the joint probability of seeing all of those particular outcomes is the product of the probabilities of seeing each row's outcome. QUESTION: What are the numerical values of the likelihood of all 1000 recorded events for each of the three hypotheses under consideration: “underloaded,” “fair,” and “loaded”? (Advance warning: This calculation is fast, but unreliable when done by computer. So the results may surprise you.)
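Putting the three steps together, a minimal sketch of the calculation (assuming the join key is `outcome` and that the hypothesis columns are named as in the table above):

```r
library(dplyr)

# Step 1: attach each roll's probability under every hypothesis
Temp <- Dice_rolls |>
  left_join(Hypotheses, by = "outcome")

# Steps 2-3: the likelihood of all 1000 rolls is the product of the
# per-roll probabilities, computed separately for each hypothesis
Temp |>
  summarize(
    L_underloaded = prod(underloaded),
    L_fair        = prod(fair),
    L_loaded      = prod(loaded)
  )
```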
Task 3: The calculation you were tasked with in Task 2 is mathematically correct, but gives unreliable results on the computer. The reason has to do with the nature of arithmetic on the computer. The hardware chips and software that implement arithmetic do arithmetic in a way that is indistinguishable from idealized arithmetic … most of the time. But there are gotchas which computer experts know about and work to avoid by making slight modifications to the calculations.
One of the gotchas has to do with computer arithmetic being done with a finite number of digits: only about 16 significant decimal digits are stored. Usually we never look past a few digits, but let’s print 22 digits so that we can see the effects of round-off:
```
[1] 0.3333333333333333148296
[1] 1.333333333333333259318
```
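Those two lines are the values of 1/3 and 4/3 printed to 22 digits. The exact command is not shown in this document, but something like the following would reproduce them:

```r
# Print 1/3 and 4/3 to 22 digits to expose round-off in the last few places
print(1/3, digits = 22)
print(4/3, digits = 22)
```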
In the calculation of likelihood in Task 2, the problem is that multiplying 1000 small numbers together gives a result so small that the computer rounds it off to zero. Experts fix the problem by doing such calculations with logarithms.
Repeat the likelihood calculation from Task 2, but instead of using, say, `prod(fair)`, calculate `sum(log(fair))`. The result will be the “log-likelihood,” but it will still be easy to read off which hypothesis gives the highest likelihood.
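A sketch of the log-likelihood version, continuing from the `Temp` data frame built in Task 2 (same assumptions about column names as before):

```r
# Sums of logs replace products of probabilities, avoiding underflow to zero
Temp |>
  summarize(
    logL_underloaded = sum(log(underloaded)),
    logL_fair        = sum(log(fair)),
    logL_loaded      = sum(log(loaded))
  )
```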
QUESTION: Which is the highest log likelihood?
Based on the observed data, how much better is the “underloaded” hypothesis than the “fair” hypothesis? One way that statisticians present the answer is with the “likelihood ratio” test. A likelihood ratio larger than 10 is considered “strong” evidence for the favored hypothesis.
Let’s calculate the likelihood ratio comparing “fair” to “loaded.” To do this, you need to remember a little bit of high-school math about logarithms: log(A/B) = log(A) - log(B). A/B is the ratio while log(A/B) is the logarithm of the ratio.
The logarithm of the ratio (“fair”/“loaded”) is \(-1791 - (-1841) = 50\). So log(A/B) = 50, but what is A/B itself? Another bit of high-school math: to undo the logarithm, use the exponential function. So the ratio will be \(e^{50}\), which is roughly \(5 \times 10^{21}\).
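In R, the arithmetic is a single line (using the log-likelihoods quoted above):

```r
# Undo the log: the likelihood ratio of "fair" to "loaded"
exp(-1791 - (-1841))   # exp(50), about 5.2e21
```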
This is a HUGE number. So big that it’s reasonable to say that the data completely rule out the possibility of the “loaded” hypothesis.
QUESTION: What’s the likelihood ratio comparing the “fair” to the “underloaded” hypotheses?