Lesson 34 take-aways

Category: Class sessions
Keywords: classifier, logistic regression, threshold, loss function, false-positive/false-negative
Author: Daniel Kaplan
Published: April 23, 2023

  1. We talked about “classifiers,” a mechanism aimed at distinguishing people with a given condition (e.g., prospective coronary heart disease, colon cancer, …) from people without that condition.
  2. The classifier takes measurements from a person and returns either a positive (\(\mathbb P\)) or a negative (\(\mathbb N\)) result.
  3. The two numbers that summarize the performance of the test are:
    • Sensitivity: The fraction of people with the condition who test \(\mathbb P\), which is the correct result for such people.
    • Specificity: The fraction of people without the condition who test \(\mathbb N\), which is the correct result for this other group of people.
  4. An incorrect result is called “false,” a correct result is “true.”
    • For people who do not have the condition, a \(\mathbb P\) result is false. Such people are called “false positives.” A good classifier keeps the rate of false positives low because such people may receive treatment which is not needed.
    • For people who have the condition, a \(\mathbb N\) result is false. Such people are called “false negatives.” A good classifier keeps the rate of false negatives low because such people do not receive the treatment which is called for.
  5. By changing a number called the “threshold,” the false negative rate can be improved, but at the cost of a higher false positive rate. And vice versa. That is, there is always a trade-off between false positives and false negatives.
  6. We resolve the trade-off by looking at the overall loss due to the mistaken classifications. The loss for a false positive is generally different from (and often less than) the loss for a false negative. To calculate the overall loss, multiply the false-positive rate and the false-negative rate by their respective losses, and add. Choose the threshold that minimizes the overall loss.
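
In symbols (notation introduced here for convenience, not used in the Lesson itself), the overall loss rate produced by a given threshold can be written as

\[ \text{loss rate} = r_{FP}\,\ell_{FP} + r_{FN}\,\ell_{FN}, \]

where \(r_{FP}\) and \(r_{FN}\) are the fractions of the target population who end up as false positives and false negatives at that threshold, and \(\ell_{FP}\) and \(\ell_{FN}\) are the per-person losses assigned to each kind of mistake. The chosen threshold is the one that makes this quantity smallest.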

More detail

  1. A classifier is a machine that, based on measurements of some sort, assigns a categorical level to the object the measurements came from. We can make this less abstract by talking about classifiers in the context of medical screening. There, the “object” is a patient; the possible categorical levels are \(\mathbb{P}\) and \(\mathbb{N}\), meaning a “positive” test result or a “negative” test result.

  2. In this Lesson, we demonstrated how to build a classifier. The classifier takes the form of a statistical model fitted to training data.

  3. Skipping, for the moment, to the use of a classifier that’s already been built: using it is a matter of evaluating the statistical model at the inputs relevant to the patient, then applying a threshold to the model output to make the choice between \(\mathbb{P}\) and \(\mathbb{N}\). (A short code sketch follows the two points below.) Interpreting the meaning of \(\mathbb{P}\) and \(\mathbb{N}\) requires that we know about:

    1. The performance of the classifier, which is summarized by two numbers, the sensitivity and the specificity. Using these two numbers (along with a third number, called the prevalence) is the topic of Lesson 35.
    2. You may hear people talk about the “accuracy” of a classifier. We’ll show how to calculate the accuracy, specificity, and sensitivity in the process of building the classifier. Note, though, that “accuracy” by itself is inadequate for describing the classifier.
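
As a concrete illustration of item 3, here is a minimal sketch of using an already-built classifier on a single new patient. The model object `mod` is the logistic-regression model built in item 4 below; the column names (`age`, `BMI`) and the threshold value are illustrative assumptions, not taken from the Lesson.

```r
# Hypothetical measurements for one new patient (column names are assumed)
new_patient <- data.frame(age = 55, BMI = 31)

# Evaluate the fitted model at the patient's inputs to get a score ...
score <- predict(mod, newdata = new_patient, type = "response")

# ... then apply a threshold to turn the score into P or N
threshold <- 0.20        # illustrative value; choosing it well is step 4.iii
ifelse(score > threshold, "P", "N")
```
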
  4. There are three steps to building a classifier:

    1. Assembling training data. This is a very difficult and drawn-out process, usually done in a clinical setting and involving hundreds or thousands of test subjects.
      1. In our example, we used the math300::Framingham data frame, which is the product of a lot of work by many people.
      2. The training data includes one or more measurements made on each of the test subjects. In Framingham such measurements included age, sex, smoking status, whether the subject was taking medicine for high blood pressure, BMI, and so on. In the well-known screening tests (e.g. for breast cancer, prostate cancer, colon cancer) the measurement is often the concentration in blood or stool of a particular antigen.
      3. The training data must include an outcome variable. With Framingham, the outcome is TenYearCHD, which records whether or not the subject developed coronary heart disease (CHD) in the ten-year follow-up period (that is, the ten years after the measurements in (b) were made). The outcome variable is a zero-one variable.
    2. Building a statistical model of the outcome variable as a function of one or more of the measurements in (b). The role of this model, once built, is to convert the measurements in (b) into a score. Often in such model building, care is taken to identify the measurements in (b) that give the best “performance.” Performance refers to producing a wide range of model output values that correlate well with the outcome variable.
    3. The final step is to establish a threshold. This is a number, in the units of the score. When the score is above this threshold, the classifier produces a \(\mathbb P\) result. A score below the threshold means the result is \(\mathbb N\).

    In class, we used logistic regression to build the statistical model (section 4.ii). Then we evaluated the model on the training data to produce a score for each subject in the training data. Each of these subjects either developed CHD (in the ten-year follow-up) or did not. The fraction of subjects who developed CHD is called the “training prevalence.” In the example constructed for the table below (in (5)), there are 1000 subjects altogether, of whom 50 + 150 = 200 had CHD as the outcome. The training prevalence is the fraction of the whole who had the CHD outcome: 200/1000 = 20% in the example below.
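
Here is a minimal sketch of that model-building step using base R’s glm(), assuming the math300 package is installed and provides the Framingham data frame as in class. The explanatory variables (`age`, `BMI`) are assumed names chosen for illustration; any of the Framingham measurements could be used, and the workflow in class may have differed in detail.

```r
library(math300)   # assumed to make the Framingham data frame available

# Keep the outcome and two (assumed) explanatory columns; drop incomplete rows
training <- na.omit(Framingham[, c("TenYearCHD", "age", "BMI")])

# Logistic regression: model the zero-one outcome as a function of the measurements
mod <- glm(TenYearCHD ~ age + BMI, family = binomial, data = training)

# Score each training subject by evaluating the model on that subject's inputs
training$score <- predict(mod, type = "response")

# Training prevalence: the fraction of subjects who had the CHD outcome
mean(training$TenYearCHD)
```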

  5. The process of choosing a threshold (section 4.iii) is an important part of the decision-making guided by the test. It works like this:

    1. Choose a candidate threshold. Apply this to each of the scores in the model-evaluated training data to produce a new column which we can call the output of the candidate test. This will have entries that are \(\mathbb P\) or \(\mathbb N\) depending on whether the score is above the candidate threshold.
    2. Tally up the table using the outcome (disease or not) and the test result (\(\mathbb P\) or \(\mathbb N\)). This will produce a table like this:

| patient outcome | test output | count | description |
|---|---|---|---|
| Disease | \(\mathbb P\) | 150 | “true positive” |
| Disease | \(\mathbb N\) | 50 | “false negative” |
| Healthy | \(\mathbb P\) | 75 | “false positive” |
| Healthy | \(\mathbb N\) | 725 | “true negative” |
  6. From the above table, you can easily compute three measurements of the performance of the test (a code sketch for these calculations appears after this list):
  • sensitivity: The fraction of the diseased subjects who tested positive. In the example, that’s 150/(150+50) = 75%.
  • specificity: The fraction of the healthy subjects who had a negative test result. Here that’s 725/(725+75) = 90.6%.

Note that neither the sensitivity nor the specificity reflects the training prevalence, since each is calculated within either the disease group or the healthy group.

  • accuracy: The “accuracy” is the fraction of all subjects who received a correct test result, that is, one that matches the patient outcome. Here that’s the fraction with “true” in the description, that is, (150+725)/1000 = 87.5%. The problem with “accuracy” is that it depends strongly on the training prevalence.
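
A minimal sketch of this tallying-and-measuring step, continuing from the `training` data frame and scores computed above. With real data the counts will differ from the illustrative 1000-person table, and the candidate threshold here is just an example.

```r
candidate <- 0.20                                    # a candidate threshold

# Classify each training subject and record their true status
training$test   <- factor(ifelse(training$score > candidate, "P", "N"),
                          levels = c("P", "N"))
training$status <- ifelse(training$TenYearCHD == 1, "Disease", "Healthy")

counts <- table(training$status, training$test)
counts

# Performance measures computed from the table of counts
sensitivity <- counts["Disease", "P"] / sum(counts["Disease", ])
specificity <- counts["Healthy", "N"] / sum(counts["Healthy", ])
accuracy    <- (counts["Disease", "P"] + counts["Healthy", "N"]) / sum(counts)
c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy)
```
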
  7. The training prevalence is an artifact of the data collection process and is typically much higher than the prevalence in the overall population, that is, the population prevalence.

  8. Remember that we are still in the process of evaluating the merits of the candidate threshold used to construct the table above. Now we want to re-arrange the counts in that table to reflect the population prevalence.

    1. The training prevalence was 20%. Suppose that the prevalence in the population to whom the test is targeted is 5%.
    2. Construct a new table of counts where the sensitivity and specificity are exactly as found in the training-data/candidate-threshold table, but where the prevalence is the desired 5%. We can accomplish this by arithmetically increasing the counts in the healthy category so that there are altogether 4000 people in the table. (200 diseased out of 4000 total is a prevalence of 5%.) In increasing the number of healthy people, we need to be careful to keep the specificity at the level found from the training data: 90.6%. That is, we will change the healthy numbers so that there are 3800 healthy people (giving 4000 total, including the diseased), of whom 90.6% got a \(\mathbb N\) result. This gives us 3800 × 0.906 ≈ 3443 in the last row and 3800 × (1 − 0.906) ≈ 357 in the second-to-last row. (See the code sketch after the adjusted table below.)

The adjusted table that reflects the population prevalence (as opposed to the training prevalence) is:

| patient outcome | test output | count | description |
|---|---|---|---|
| Disease | \(\mathbb P\) | 150 | “true positive” |
| Disease | \(\mathbb N\) | 50 | “false negative” |
| Healthy | \(\mathbb P\) | 357 | “false positive” |
| Healthy | \(\mathbb N\) | 3443 | “true negative” |
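
The re-scaling arithmetic behind the adjusted table can be written as a short sketch. It uses the sensitivity and specificity from the illustrative training table (75% and 90.6%) rather than model output.

```r
sens <- 0.75; spec <- 0.906
n_disease      <- 200          # keep the diseased group exactly as in training
pop_prevalence <- 0.05         # the prevalence in the target population

n_total   <- n_disease / pop_prevalence    # 4000 people in the adjusted table
n_healthy <- n_total - n_disease           # 3800 healthy people

adjusted <- data.frame(
  outcome = c("Disease", "Disease", "Healthy", "Healthy"),
  test    = c("P", "N", "P", "N"),
  count   = round(c(n_disease * sens,         # true positives:   150
                    n_disease * (1 - sens),   # false negatives:   50
                    n_healthy * (1 - spec),   # false positives: ~357
                    n_healthy * spec))        # true negatives:  ~3443
)
adjusted
```
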
  9. Time to evaluate the candidate threshold.
    1. We can measure how good the threshold is by calculating the “accuracy” of the test on the results in the adjusted table. Unlike the accuracy on the training results, which depends on an unrealistic prevalence, the adjusted table has the right prevalence. The accuracy is (150 + 3443)/4000 = 89.8%. Looking at things from the other side, the “mistake” rate is 10.2%.
    2. It’s usually the case that the cost of making a mistake is very different for a “false negative” than for a “false positive.” A person with a false negative will not receive the appropriate treatment for his or her disease. We call this a “loss,” and it may be a very big problem for that person. On the other hand, a person with a false positive will receive unnecessary treatment, which may be harmful or risky, or impose financial or emotional costs. That is also a loss. It is a matter of judgment what the relative sizes of the loss from a false negative and the loss from a false positive should be.
  10. Let’s suppose that the false-negative loss is 10 and the false-positive loss is 0.5. (These are just made-up numbers for the example.) Then, referring to the table above, the total loss is \(50 \times 10 + 357 \times 0.5 = 678.5\). The loss rate, that is, the loss per person, is 678.5/4000 ≈ 0.17 per person.

The test developers can repeat all these calculations for other candidate thresholds. That will give them a set of thresholds and a set of corresponding loss rates. Pick the threshold with the lowest loss rate.
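
A minimal sketch of that repeat-and-compare process, continuing from the `training` data frame built above. The per-mistake losses and the target prevalence are the made-up numbers from the example; in practice they would come from subject-matter expertise.

```r
loss_fn <- 10; loss_fp <- 0.5   # made-up per-person losses from the example
pop_prevalence <- 0.05          # target-population prevalence

# Loss per person in the target population for one candidate threshold
loss_rate <- function(threshold) {
  test <- training$score > threshold
  sick <- training$TenYearCHD == 1
  sens <- mean(test[sick])       # fraction of diseased subjects testing P
  spec <- mean(!test[!sick])     # fraction of healthy subjects testing N
  pop_prevalence * (1 - sens) * loss_fn +
    (1 - pop_prevalence) * (1 - spec) * loss_fp
}

thresholds <- seq(0.05, 0.60, by = 0.01)
losses <- sapply(thresholds, loss_rate)
thresholds[which.min(losses)]    # the candidate with the lowest loss rate
```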

Note the importance of the relative sizes of the false-positive loss and the false-negative loss. Expertise in the area of application of the test and follow-up treatments is important to assign meaningful values to the comparative losses. Our 10 and 0.5 are just for the purposes of example.