# 29 Competing hypotheses

\[\newcommand{\Ptest}{\mathbb{P}} \newcommand{\Ntest}{\mathbb{N}} \newcommand{\given}{\ |\!\!|\ }\]

There are many yes-or-no conditions. A patient has a disease or does not. A credit-card transaction is genuine or fraudulent.

But it is not always straightforward to figure out, at the time the patient comes to the clinic or the credit-card transaction is made, whether the condition is yes or no. If we could wait, the condition might reveal itself: the patient gets critically ill or the credit-card holder complains about an unauthorized charge. But we can’t wait. We want to treat the patient *before* he or she gets critically ill. We want to block the credit-card transaction before it is completed.

Instead of waiting, we measure whatever relevant variables we can when the patient arrives at the clinic or the credit-card transaction has been submitted for approval. For the patient, we might look at the concentration of specific markers for cancer in the blood. For the transaction, we might look at the shipping address to see if it matches the credit-card holder’s genuine address. Such variables may provide an indication, imperfect though it may be, of whether the condition is yes or no.

A **classifier** is a statistical model used to *predict* the unknown outcome of a yes-or-no situation from information that is already available. This Lesson concerns three closely related topics about classifiers: how we collect data for training the model, how we summarize the performance of the classifier, and how we “tune” the classifier.

## Identifying cases

Consider this news report and note the time lag between collection of the dietary explanatory variables and the response variable—whether the patient developed pancreatic cancer.

> Higher vitamin D intake has been associated with a significantly reduced risk of pancreatic cancer, according to a study released last week. Researchers combined data from two prospective studies that included 46,771 men ages 40 to 75 and 75,427 women ages 38 to 65. They identified 365 cases of pancreatic cancer over 16 years. Before their cancer was detected, subjects filled out dietary questionnaires, including information on vitamin supplements, and researchers calculated vitamin D intake. After statistically adjusting for age, smoking, level of physical activity, intake of calcium and retinol and other factors, the association between vitamin D intake and reduced risk of pancreatic cancer was still significant. Compared with people who consumed less than 150 units of vitamin D a day, those who consumed more than 600 units reduced their risk by 41 percent. (*New York Times*, 19 Sept. 2006, p. D6)

This was not an experiment; it was an observational study without any intervention to change anyone’s diet.

## The training sample

In building a classifier, we have a similar situation. Perhaps we can perform the blood test today, but that gives us only the test result, not the subject’s true condition. We might have to wait years for that condition to reveal itself. Only at that point can we measure the performance of the classifier.

To picture the situation, let’s imagine many people enrolled in the study, some of whom have the condition and some who don’t. On Day 1 of the study, we test everyone and get a raw score on a scale from 0 to 40. The results are shown in Figure 29.1. Each glyph is a person. The varying locations are meant to help us later on; for now, just think of them as representing where each person lives in the world. The different shapes of glyph—circle, square, triangle—are meant to remind you that people are different from one another in age, gender, risk factors, etc.

Each person took a blood test. The raw result from that test is a score from 0 to 40. The distribution of scores is shown in the right panel of the figure. We also show the score in the world-plot; the higher the raw score, the more blue the glyph. On Day 1, it isn’t known who has the condition and who does not.

Having recorded the raw test results for each person, we wait. In the pancreatic cancer study, they waited 16 years for the cancer to reveal itself.

… waiting …

After the waiting period, we can add a new column to the original data: whether the person has the condition (C) or doesn’t (H).

Figure 29.2 shows the distribution of raw test scores for the C group and the H group. The scores are those recorded on Day 1, but after waiting to find out the patients’ conditions, we can subdivide them into those who have the condition (C) and those who don’t (H).

## Applying a threshold

To finish the classifier, we need to identify a “**threshold score**.” Raw scores above this threshold will generate a \({\mathbb{P}}\) test result; scores below the threshold generate an \({\mathbb{N}}\) test result.

We can make a good guess at an appropriate threshold score from the presentation in the right panel of Figure 29.2. The objective in setting the threshold is to distinguish the C group from the H group. Setting the threshold at a score around 3 does a pretty good job.

It helps to give names to the two test results: \({\mathbb{P}}\) and \({\mathbb{N}}\). Anyone with a score above 3 has result \({\mathbb{P}}\); anyone with a score below 3 has an \({\mathbb{N}}\) result.

## False positives and false negatives

Look closely at Figure 29.3 and note the gray dots in the C group and the blue dots in the H group. These are errors. But there are two kinds of errors.

False-positive: blue dots in the H group. The “positive” refers to the \({\mathbb{P}}\) test result, the “false” simply means the test result was wrong.

False-negative: gray dots in the C group. The “negative” refers to the \({\mathbb{N}}\) result. Again, the “false” means simply that the test result is out of line with the actual condition of the person.

In the training sample shown in Figure 29.3, there are 300 people altogether and 17 false-negatives. This gives a false-negative rate of about 6%. Similarly, there are 30 false-positives, a false-positive rate of 10%.
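The counting behind those rates can be sketched in a few lines. This is an illustration, not the book’s actual training data: the `error_rates` function and the tiny sample below are made up to show the bookkeeping.

```python
# Illustrative sketch (not the book's actual data): counting the two kinds
# of error that a threshold classifier makes on a training sample.

def error_rates(scores, conditions, threshold):
    """scores: raw test scores; conditions: 'C' or 'H' for each person.
    Returns (false_negative_count, false_positive_count)."""
    fn = sum(1 for s, c in zip(scores, conditions) if c == "C" and s <= threshold)
    fp = sum(1 for s, c in zip(scores, conditions) if c == "H" and s > threshold)
    return fn, fp

# A tiny made-up training sample: (raw score, true condition).
sample = [(0.5, "H"), (1.2, "H"), (4.0, "H"), (2.5, "C"), (6.1, "C"), (9.3, "C")]
scores = [s for s, _ in sample]
conditions = [c for _, c in sample]

fn, fp = error_rates(scores, conditions, threshold=3)
print(fn, fp)  # 1 1 — the C person at 2.5 is a false negative, the H at 4.0 a false positive
```

With the training sample in Figure 29.3, the same function would return the 17 false-negatives and 30 false-positives quoted above.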

Naturally, the objective when building a classifier is to avoid errors. One way to avoid errors is by careful “**feature engineering**.” Here, “features” refers to the inputs to the classifier model. Often, the designer of the classifier has multiple variables (“features”) to work with. (See example.) Choosing a good set of features can be the difference between a successful classifier and one that makes so many mistakes as to be useless.

We will use the name “Bullseye” to refer to a major, national, big-box retailing chain which sells, among many other products, dog food. Sales are largely determined by customer habits; people tend to buy where and what they have previously bought. There are many places to buy dog food, for instance pet supermarkets and grocery stores.

One strategy for increasing sales involves discount coupons. A steep discount provides a consumer incentive to try something new and, maybe, leads to consumers forming new habits. But, from a sales perspective, there is little point in providing discounts to people who already have the habit of buying dog food from the retailer. Instead, it is most efficient to provide the discount only to people who don’t yet have that habit.

The Bullseye marketing staff decided to build a classifier to identify pet owners who already shop at Bullseye but do not purchase dog food there. The data available, from Bullseye’s “loyalty” program, consisted of individual customers’ past purchases of the tens of thousands of products sold at Bullseye.

Which of these many products to use as indicators of a customer’s potential to switch to Bullseye’s dog food? This is where feature engineering comes in. Searching through Bullseye’s huge database, the feature engineers identified that customers who buy dog food also buy carpet cleaner. But many people buy carpet cleaner who don’t buy dog food. The engineers searched for purchases that might distinguish dog owners from other users of carpet cleaner.

The feature engineers’ conclusion: Send dog-food coupons to people who buy carpet cleaner but do not buy diapers. Admittedly, this will leave out the people who have both dogs and babies: these are false negatives. It will also lead to coupons being sent to petless, spill-prone people whose children, if any, have moved beyond diapers: false-positives.

## Threshold, sensitivity and specificity

In Figure 29.3 the threshold between \({\mathbb{P}}\) and \({\mathbb{N}}\) is set at a score of 3. That might have been a good choice, but it pays to take a more careful look.

That graph is hard to read because the scores have a very long-tailed distribution; the large majority of scores are below 2 but the scores go up to 40. To make it easier to compare scores between the C and H groups, Figure 29.4 shows the scores on a nonlinear axis. Each score is marked as a letter: “P” means \({\mathbb{P}}\), “N” means \({\mathbb{N}}\). False results are colored red.

Moving the threshold up would reduce the number of false-positives. At the same time, the larger threshold would *increase* the number of false-negatives. Figure 29.5 shows what the situation would be if the threshold had been set at, say, 10 or 0.5.

By setting the threshold larger, as in Figure 29.5(a), the number of false-negatives (red Ns) increases but the number of false-positives (red Ps) goes down. Setting the threshold lower, as in Figure 29.5(b), reduces the number of false-negatives but increases the number of false-positives.

This trade-off between the number of false-positives and the number of false-negatives is characteristic of classifiers.

Figure 29.6 shows the overall pattern for false results versus threshold. At a threshold of 0, all test results are \({\mathbb{P}}\). Hence, none of the C group results are false; if there are no \({\mathbb{N}}\) results, there cannot be any false-negatives. On the other hand, all of the H group are false-positives.

Increasing the threshold changes the results. At a threshold of 1, many of the H group—about 50%—are being correctly classified as \({\mathbb{N}}\). Unfortunately, the higher threshold introduces some negative results for the C group. So the fraction of correct results in the C group goes down to about 90%. This pattern continues: raising the threshold improves the fraction correct in the H group and lowers the fraction correct in the C group.

There are two names given to the fraction of correct classifications, depending on whether one is looking at the C group or the H group. The fraction correct in the C group is called the “**sensitivity**” of the test. The fraction correct in the H group is the “**specificity**” of the test.

The sensitivity and the specificity, taken together, summarize the error rates of the classifier. Note that there are two error rates: one for the C group and another for the H group. Figure 29.6 shows that, depending on the threshold used, the sensitivity and specificity can be very different from one another.
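The threshold sweep in Figure 29.6 can be simulated. Everything in this sketch is an assumption for illustration: the two long-tailed score distributions are invented lognormals, not the book’s data, so the exact values will not match the figure, but the trade-off pattern does.

```python
# Illustrative simulation of the sensitivity/specificity trade-off versus
# threshold. The score distributions are assumed (lognormal), not the book's.
import random

random.seed(0)
h_scores = [random.lognormvariate(0.0, 0.7) for _ in range(150)]  # H group: mostly low scores
c_scores = [random.lognormvariate(1.5, 0.7) for _ in range(150)]  # C group: mostly higher scores

def sens_spec(threshold):
    """Fraction of C correctly flagged P, and fraction of H correctly flagged N."""
    sensitivity = sum(s > threshold for s in c_scores) / len(c_scores)
    specificity = sum(s <= threshold for s in h_scores) / len(h_scores)
    return sensitivity, specificity

for t in [0, 0.5, 1, 3, 10]:
    sens, spec = sens_spec(t)
    print(f"threshold {t:>4}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

At threshold 0 every score tests \(\mathbb{P}\), so sensitivity is 1 and specificity is 0, just as the text describes; raising the threshold trades sensitivity for specificity.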

Ideally, both the sensitivity and specificity would be 100%. In practice, high sensitivity means lower specificity and *vice versa*.

Sensitivity and specificity will be particularly important when we take into consideration the **prevalence**, that is, the fraction of the population with condition C.

## Prevalence

The “**prevalence**” of C is the fraction of the population who have condition C. Prevalence is an important factor in the performance of a classifier.


Lesson 29 used a **training sample**, first shown in Figure 29.3 and duplicated here in the margin. The training sample allowed us to look at the consequences of the choice of threshold used in the test. That training sample had roughly equal numbers of people from the C and H groups. It’s sensible to use such a training sample in order to make sure both the C and H groups are well represented.

The prevalence among the actual population is usually very different than in the training sample. Figure 29.9 illustrates the typical situation: many people in the H group and few people in the C group.

The prevalence can be seen by how densely the H group is populated compared to the C group. The prevalence depicted in Figure 29.9 is about 10%; that is, one in ten people has condition C. In real-world conditions, prevalence is often much lower, perhaps 0.1%. Indeed, epidemiologists often move away from a percentage scale when quantifying prevalence, using instead “cases per 100,000.”

Even though the prevalence is different in Figure 29.9 than in Figure 29.3, the sensitivity is exactly the same. Likewise for the specificity.

We don’t usually have comprehensive testing of a population, so drawing a picture like Figure 29.9 has to be done theoretically based on the limited information available: prevalence (from surveys of the population) as well as sensitivity and specificity (from the training sample). This is easy to do.

The first step is to determine the number in the C group and in the H group using the population size. If the population size is \(N\), then the number in the C group will be \(p(C) N\). We are writing the prevalence here as a probability, the probability \(p(C)\) that a randomly selected person from the population has condition C. Similarly, the size of the H group is \((1-p(C)) N\).

Consider now the sensitivity. Sensitivity is relevant only to the C group; it tells the fraction in the C group who will be correctly classified. That’s enough information to know how many people in C to color blue (for \(\mathbb{P}\)) or gray (for \(\mathbb{N}\)).

Similarly, the specificity tells us what fraction among the H group to color blue and gray.

This is how Figure 29.9 was generated: specifying population size \(N\), prevalence \(p(C)\), and sensitivity and specificity. The false-positives are the blue dots in the H group, the false-negatives are the gray dots in the C group.
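The bookkeeping just described can be sketched in a few lines. The specific numbers below (\(N = 1000\), 10% prevalence, sensitivity 0.95, specificity 0.50) are assumptions chosen for illustration, not taken from the figure.

```python
# Sketch of the bookkeeping behind a figure like 29.9: from population size,
# prevalence, sensitivity, and specificity, work out the four groups of people.
# The input numbers below are assumptions for illustration.

def classify_population(N, prevalence, sensitivity, specificity):
    n_C = prevalence * N          # people with the condition
    n_H = (1 - prevalence) * N    # people without it
    return {
        "TP": sensitivity * n_C,          # blue dots in the C group (correct)
        "FN": (1 - sensitivity) * n_C,    # gray dots in the C group (false negatives)
        "TN": specificity * n_H,          # gray dots in the H group (correct)
        "FP": (1 - specificity) * n_H,    # blue dots in the H group (false positives)
    }

counts = classify_population(N=1000, prevalence=0.10,
                             sensitivity=0.95, specificity=0.50)
print({k: round(v) for k, v in counts.items()})  # {'TP': 95, 'FN': 5, 'TN': 450, 'FP': 450}
```

Note how the low prevalence makes the false positives (450) vastly outnumber the true positives (95), even with high sensitivity.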

## From the patient’s point of view

Figure 29.9 is drawn from the perspective of the epidemiologist or test developer. But it doesn’t directly provide information of use to the patient, simply because the patient has only a test result (\(\mathbb{P}\) or \(\mathbb{N}\)) but no definitive knowledge of the actual condition (C or H).

Re-organizing the epidemiologist’s graph can put it in a form relevant to the patient. Instead of grouping people by C or H, we can group them by \(\mathbb{P}\) or \(\mathbb{N}\). This perspective is shown in Figure 29.11, which shows exactly the same people as in Figure 29.9 but arranged differently.

For the patient who has gotten a \(\mathbb{P}\) result, the left panel of Figure 29.11 is highly informative. The patient can see that only a small fraction of the people testing \(\mathbb{P}\) actually have condition C. (The people with C are shown as filled symbols.)

The test result \(\mathbb{P}\) is not definitive; it is merely a clue.

## Likelihood

A “clue” is a piece of information or an observation that tells something about a mystery, but not usually everything. As an example, consider a patient who has just woken up from a coma and doesn’t know what month it is. That is a mystery. With no information at all, it is almost equally likely to be any month. So the hypotheses in contention might be labeled Jan, Feb, March, and so on.

The person looks out the window and observes snow falling. The observation of snow is a clue. It tells something about what month it might be, but not everything. For instance, the possibility that it is July becomes much less likely if snow has been observed; the possibility that it is February (or January or March) becomes more likely.

Statistical thinkers often have to make use of clues. Suppose the coma patient is a statistician. She might try to quantify the likelihood of each month given the observation of snow. Here’s a reasonable try:

| Month | Probability of seeing snow when looking out the window for the first time each day | Notation |
|---|---|---|
| January | 2.3% | \(p(\text{snow} \given \text{January})\) |
| February | 3.5% | \(p(\text{snow} \given \text{February})\) |
| March | 2.1% | \(p(\text{snow} \given \text{March})\) |
| April | 1.2% | \(p(\text{snow} \given \text{April})\) |
| May | 0.5% | … and so on … |
| June | 0.1% | |
| July | 0 | |
| August | 0 | |
| September | 0.2% | |
| October | 0.6% | |
| November | 0.9% | |
| December | 1.4% | |

The table lists 12 probabilities, one for each month. For the coma patient, these probabilities let her look up which months it is likely to be. For this reason, the probabilities are called “**likelihoods**.”

The coma patient has 12 hypotheses for which month it is. The table as a whole is a “**likelihood function**” describing how the likelihood varies from one hypothesis to another. Think of the entries in the table as having been radioed back to Earth from the 12 hypothetical planets, January through December.

It is helpful, I think, to have a notation that reminds us when we are dealing with a likelihood and a likelihood function. We will use the fancy \({\cal L}\) to identify a quantity as a likelihood. The coma patient is interested in the likelihood of snow, which we will write \({\cal L}_\text{snow}\). From the table we can see that the likelihood of snow is a function of the month, that is \({\cal L}_\text{snow}(\text{month})\), where month can be any of January through December.

This likelihood function has a valuable purpose: It will allow the coma patient to calculate the probability of it being any of the twelve months given her observation of snow, that is \(p(\text{month} {\ |\!\!|\ } \text{snow})\).

In general, likelihoods are useful for converting knowledge like \({\cal L}_a(b)\) into the form \(p(b {\ |\!\!|\ } a)\). The formula for doing the conversion is called “**Bayes’ Rule**.”

The form of Bayes’ rule appropriate to the coma patient allows her to calculate the probability of it being any given month from the likelihoods. We also need to account for February, with only 28 days, being shorter than the other months. So we will define a probability function, \(p(\text{month}) = \frac{\text{number of days in month}}{365}\)

**Bayes’ Rule** \[p(\text{month} {\ |\!\!|\ } \text{snow}) = \frac{{\cal L}_\text{snow}(\text{month}) \cdot p(\text{month})}{{\cal L}_\text{snow}(\text{Jan}) \cdot p(\text{Jan}) +
{\cal L}_\text{snow}(\text{Feb}) \cdot p(\text{Feb}) + \cdots +
{\cal L}_\text{snow}(\text{Dec}) \cdot p(\text{Dec})}\]
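The coma patient’s whole calculation fits in a few lines. The likelihoods are the ones from the table above, and the prior \(p(\text{month})\) is the month’s length divided by 365, exactly as just defined.

```python
# A sketch of the coma patient's Bayes' Rule calculation: likelihoods from the
# table above, with p(month) proportional to the number of days in each month.

likelihoods = {   # L_snow(month), read off the table as fractions
    "Jan": 0.023, "Feb": 0.035, "Mar": 0.021, "Apr": 0.012,
    "May": 0.005, "Jun": 0.001, "Jul": 0.0, "Aug": 0.0,
    "Sep": 0.002, "Oct": 0.006, "Nov": 0.009, "Dec": 0.014,
}
days = {"Jan": 31, "Feb": 28, "Mar": 31, "Apr": 30, "May": 31, "Jun": 30,
        "Jul": 31, "Aug": 31, "Sep": 30, "Oct": 31, "Nov": 30, "Dec": 31}

prior = {m: d / 365 for m, d in days.items()}   # p(month)
denominator = sum(likelihoods[m] * prior[m] for m in days)   # the sum in Bayes' Rule
posterior = {m: likelihoods[m] * prior[m] / denominator for m in days}

print(f"p(Feb | snow) = {posterior['Feb']:.2f}")   # 0.26
print(f"p(Jul | snow) = {posterior['Jul']:.2f}")   # 0.00: snow rules out July
```

Seeing snow raises February from a prior of 28/365 (under 8%) to about 26%, and drives July and August to zero.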

## How serious is it, Doc?

Imagine a patient getting a \(\mathbb{P}\) test result and wondering what the probability is of his having condition C. That is, he wants to know \(p(C {\ |\!\!|\ } \mathbb{P})\). This is equivalent to asking, “How serious is it, Doc?”

The doctor could point to Figure 29.10 as her answer. That figure was generated by creating a population with the relevant prevalence, using the sensitivity and specificity to determine the fraction of the C and H groups with \(\mathbb{P}\) or \(\mathbb{N}\) results respectively, then *re-organizing* into new groups: the \(\mathbb{P}\) group and the \(\mathbb{N}\) group.

Alternatively, we can do the calculation in the same way we did for the coma patient seeing snow. There, the observation of snow was the clue. Now, the test result \(\mathbb{P}\) is the clue. One of the relevant likelihoods to interpret \(\mathbb{P}\) is \({\cal L}_{\mathbb{P}}(C)\): the likelihood for a person who genuinely has condition C of getting a \(\mathbb{P}\) result. Of course, this is just another way of writing the sensitivity.

Similarly, the specificity is \({\cal L}_{\mathbb{N}}(H)\). But since our person got a \(\mathbb{P}\) result, the likelihood \({\cal L}_{\mathbb{N}}(H)\) is not directly relevant. (It would be relevant only to a person with an \(\mathbb{N}\) result.) Fortunately, there is a simple relationship between \({\cal L}_{\mathbb{P}}(H)\) and \({\cal L}_{\mathbb{N}}(H)\). If we know the probability of an H person getting an \(\mathbb{N}\) result, we can figure out the probability of an H person getting a \(\mathbb{P}\) result. \[{\cal L}_{\mathbb{P}}(H) = 1 - {\cal L}_{\mathbb{N}}(H)\]

Bayes’ Rule for the person with a \(\mathbb{P}\) result is

\[p(C{\ |\!\!|\ } \mathbb{P}) = \frac{{\cal L}_{\mathbb{P}}(C) \cdot p(C)}{{\cal L}_{\mathbb{P}}(C) \cdot p(C) + {\cal L}_{\mathbb{P}}(H) \cdot p(H)}\]

Suppose that \(p(C) = 1\%\) for this age of patient. (Consequently, \(p(H) = 99\%\).) And imagine that the test taken by the patient has a threshold score of 1. From Figure 29.6 we can look up the sensitivity (\({\cal L}_{\mathbb{P}}(C) = 0.95\)) and specificity (0.50, so \({\cal L}_{\mathbb{P}}(H) = 1 - 0.50 = 0.50\)) for the test. Substituting these numerical values into Bayes’ Rule gives

\[p(C {\ |\!\!|\ } \mathbb{P}) = \frac{0.95\times 0.01}{0.95\times 0.01 + 0.50\times 0.99} = 1.9\%\] The \(\mathbb{P}\) result has changed the probability that the patient has C from 1% to 1.9%. That’s big proportionally, but not so big in absolute terms.

The advantage of the Bayes’ Rule form of the calculation over the \(\mathbb{P}\) group in Figure 29.10 is that it is very easy to do the Bayes’ Rule calculation for any value of prevalence \(p(C)\). Why would we be interested in doing this?

Typically the prevalence of a condition is different for different groups in the population. For example, for an 80-year-old with a family history of C the prevalence might be 20% rather than the 1% that applied to the patient in the previous example. For the 80-year-old, the probability of having C given a \(\mathbb{P}\) result is substantially different from the 1.9% found in the example:

\[p(C {\ |\!\!|\ } \mathbb{P}) = \frac{0.95\times 0.2}{0.95\times 0.2 + 0.50\times 0.8} = 32\%\]
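Since only the prevalence changes between these two calculations, it is natural to package Bayes’ Rule as a function of the prevalence. A sketch, using the two likelihoods from the examples above:

```python
# Bayes' Rule for a person with a P result, packaged so it can be re-run for
# any prevalence. Default likelihoods are the text's: L_P(C) = 0.95, L_P(H) = 0.50.

def p_condition_given_positive(prevalence, L_P_C=0.95, L_P_H=0.50):
    p_C, p_H = prevalence, 1 - prevalence
    return (L_P_C * p_C) / (L_P_C * p_C + L_P_H * p_H)

print(f"{p_condition_given_positive(0.01):.1%}")  # 1.9%  — the 1%-prevalence patient
print(f"{p_condition_given_positive(0.20):.1%}")  # 32.2% — the 80-year-old with family history
```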

## Screening tests

The reliability of a \(\mathbb{P}\) result differs depending on the prevalence of C. A consequence of this is that medical screening tests are recommended for one group of people but not for another.

For instance, the US Preventive Services Task Force (USPSTF) issues recommendations about a variety of medical screening tests. According to the Centers for Disease Control (CDC) summary:

> The USPSTF recommends that women who are 50 to 74 years old and are at average risk for breast cancer get a mammogram every two years. Women who are 40 to 49 years old should talk to their doctor or other health care provider about when to start and how often to get a mammogram.

Recommendations such as this can be baffling. Why recommend mammograms only for people 50 to 74? Why not for older women as well? And how come women 40-49 are only told to “talk to their doctor?”

The CDC summary needs decoding. For instance, the “talk to [your] doctor” recommendation really means, “We don’t think a mammogram is useful to you, but we’re not going to say that straight out because you’ll think we are denying you something. We’ll let your doctor take the heat, although typically if you ask for a mammogram, your doctor will order one for you. If you are a woman younger than 40, a mammogram is even less likely to give a useful result, so unlikely that we won’t even hint you should talk to a doctor.”

The reason mammograms are not recommended for women 40-49 is that the prevalence for breast cancer is much lower in that group of people than in the 50-74 group. The prevalence of breast cancer is even lower in women younger than 40.

So what about women 75+? The prevalence of breast cancer is high in this group, but at that age, non-treatment is likely to be the most sensible option. Cancers can take a long while to develop from the stage identified on a mammogram, and at age 75+ it’s not likely to be the cause of eventual death.

The USPSTF web site goes into some detail about the reasoning for their recommendations. It’s worthwhile reading to see what considerations went into their decision-making process.

Let’s look more closely at the details of breast-cancer screening. The reported sensitivity of digital mammography is 90% and the specificity is 85%.^{2}

The National Cancer Institute publishes cancer-risk tables. Figure 29.12 shows the NCI table for breast cancer.


Women 60 to 70 have a risk of about 2%—we will take this as the prevalence. Out of 1000 such women:

- 20 women have breast cancer, of whom 90% will receive a \(\mathbb{P}\) test result. Consequently, 18 women with cancer get a \(\mathbb{P}\) result.
- 980 women do not have breast cancer. Since the test specificity is 85%, the probability of a \(\mathbb{P}\) test result is 15%, so 147 women in this group will get a \(\mathbb{P}\) result.

Altogether, 165 out of the group of 1000 women will have a \(\mathbb{P}\) result, of whom 18 have cancer. Thus, for a woman with a \(\mathbb{P}\) result, the prevalence is 18/165, that is, 11%. Using the same logic, for a woman with a \(\mathbb{N}\) result, the risk of cancer is reduced from 2% to about 0.2%.

The point of the screening test is to identify at low cost those women at higher risk of cancer. For mammography, that higher risk is 11%. This is by no means a definitive result.

Now imagine that a different test for breast cancer is available, perhaps one that is more invasive and expensive and therefore not appropriate for women at low risk of cancer. Imagine that the sensitivity and specificity of this expensive test are also 90% and 85% respectively. Applying this second test to the 165 women who received a \(\mathbb{P}\) result on the first test, about 16 of the women with cancer will get a second \(\mathbb{P}\) result. But there are also about 147 people in the group of 165 who do not have cancer. These have a \(1-0.85\) chance of a \(\mathbb{P}\) result. Thus, there will be 22 women who do not have cancer but who nonetheless get a \(\mathbb{P}\) result on the second test. The risk of having cancer for the \(16+22\) women who have gotten a \(\mathbb{P}\) result on both tests is \(16/38 = 42\%\).
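The two rounds of mammography arithmetic above can be carried out directly on the counts. The numbers here are the text’s: 1000 women, 2% prevalence, sensitivity 90%, specificity 85%.

```python
# The screening arithmetic from the text, carried through two rounds of testing.
# Counts are per 1000 women at 2% prevalence; sensitivity 0.90, specificity 0.85.

sens, spec = 0.90, 0.85

# Round 1: everyone is tested.
cancer, healthy = 20, 980
cancer_P = cancer * sens              # ~18 true positives
healthy_P = healthy * (1 - spec)      # ~147 false positives
risk_after_1 = cancer_P / (cancer_P + healthy_P)
print(f"risk after one positive test:  {risk_after_1:.0%}")   # 11%

# Round 2: re-test only the positives from round 1.
cancer_PP = cancer_P * sens           # ~16 still positive, with cancer
healthy_PP = healthy_P * (1 - spec)   # ~22 still positive, without cancer
risk_after_2 = cancer_PP / (cancer_PP + healthy_PP)
print(f"risk after two positive tests: {risk_after_2:.0%}")   # 42%
```

Each round of testing applies the same multipliers to the current counts, which is why a second \(\mathbb{P}\) result raises the risk so sharply.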

EXERCISE: Calculate the cancer risk for those who get a positive result on the first test and a negative result on the second. (About 1.4%.)

EXERCISE: What happens with a third, even more invasive/expensive test, with the same 90/85 sensitivity/specificity? What is the risk for the women who get a positive result on that test as well? (About 82%.)

### The Loss Function


In order to set the threshold at an optimal level, it is important to measure the impact of the positive or negative test result. This impact of course will depend on whether the test is right or wrong about the person’s true condition. It is conventional to measure the impact as a “**loss**,” that is, the amount of harm that is done.

If the test result is right, there’s no loss. Of course, it’s not nice that a person has condition C, but a \(\mathbb{P}\) test result will steer our actions to treat the condition appropriately: no loss in that.

Typically, the loss stemming from a false negative is reckoned as more than the loss of a false positive. A false negative will lead to failure to treat the person for a condition that he or she actually has.

In contrast, a false-positive will lead to unnecessary treatment. This also is a loss that includes several components that would have been avoided if the test result had been right. The cost of the treatment itself is one part of the loss. The harm that a treatment might do is another part of the loss. And the anxiety that the person and his or her family go through is still another part of the loss. These losses are not necessarily small. The woman who gets a false positive breast-cancer diagnosis will suffer from the effects of chemotherapy and the loss of breast tissue. The man who gets a false-positive prostate-cancer diagnosis may end up with urinary incontinence and impotence.

The aim in setting the threshold is to minimize the total loss: the loss incurred by a false negative times the number of false negatives, plus the loss incurred by a false positive times the number of false positives. Those counts should be reckoned for the population in which the classifier will actually be used, so they depend on the prevalence as well as on the sensitivity and specificity.

In Lesson 29, we saw that the threshold for transforming a raw test score into a \(\mathbb{P}\) or \(\mathbb{N}\) result determined the sensitivity and specificity of the test. (See Figure 29.6.) Of course, it’s best if both sensitivity and specificity are as high as possible, but there is a trade-off between the two: increasing sensitivity by lowering the threshold will decrease specificity. Likewise, raising the threshold will improve specificity but lower sensitivity.

The “**loss function**” provides a way to set an optimal value for the threshold. It is a function, because the loss depends on whether the test result is a false-positive or a false-negative.

Suppose that the

## Exercises

DRAFT

A new driver has just gotten her license and wants to arrange car insurance. In order to set the premium (price of insurance), the insurance company needs an estimate of the accident risk.

At the start, it is reasonable to assume a relatively high risk (per mile). Use this to form a prior, then multiply it by the likelihood of not being in an accident over the miles driven in the first year.

DRAFT OF EXERCISE Bayes theorem in odds form:

Odds of the alternative hypothesis after seeing the data = odds of the alternative before seeing the data \(\times\) \(p(\text{Data} \given \text{Alt}) / p(\text{Data} \given \text{Null})\).

The quantity \(p(\text{Data} \given \text{Alt})/p(\text{Data} \given \text{Null})\) is called the Bayes factor.

Calculate the Bayes factor for a given sensitivity and specificity. Use this to calculate the posterior odds and translate these back into the posterior probability.
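One way this exercise might be worked out in code. The likelihood values are the ones from the earlier patient example (sensitivity 0.95, \({\cal L}_{\mathbb{P}}(H) = 0.50\)); the function itself is a sketch of the odds form, not a prescribed solution.

```python
# Sketch of the odds form of Bayes' theorem for a positive test result.
# For a P result, the Bayes factor is p(P | C) / p(P | H), i.e.
# sensitivity / (1 - specificity).

def posterior_prob_after_positive(prior_prob, sensitivity, specificity):
    prior_odds = prior_prob / (1 - prior_prob)
    bayes_factor = sensitivity / (1 - specificity)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1 + post_odds)      # translate odds back to probability

# Reproduces the earlier example: prevalence 1%, sensitivity 0.95, specificity 0.50.
print(f"{posterior_prob_after_positive(0.01, 0.95, 0.50):.1%}")  # 1.9%
```

This agrees with the direct Bayes’ Rule calculation in the “How serious is it, Doc?” section, as it must: the odds form is just an algebraic rearrangement.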