Chapter 17 Classification error

An earlier chapter introduced prediction error, which compares, for each row in test data, the actual value of the response variable to the model output generated when that row’s values for the explanatory variables are inputs to the model: \[\mbox{error} = \mbox{actual value} - \mbox{model output} .\] That chapter also introduced the root mean square error (RMSE) as a measure of the performance of a prediction model, and we used RMSE to compare models, declaring that a model with a smaller RMSE is “better” than a model with a larger RMSE. That’s a valuable rule of thumb and one that’s used widely in statistics.
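To make the calculation concrete, here is a minimal sketch of computing RMSE in Python, assuming `actual` and `predicted` are arrays holding the test-data response values and the corresponding model outputs (the numbers are hypothetical):

```python
import numpy as np

actual = np.array([132, 145, 118, 160])     # hypothetical test-set response values
predicted = np.array([128, 150, 121, 149])  # hypothetical model outputs

errors = actual - predicted                 # error = actual value - model output
rmse = np.sqrt(np.mean(errors ** 2))        # root mean square error
print(rmse)                                 # about 6.5
```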

For data scientists, building models is not an exercise in mathematics. A model is intended for use in some way. The model is part of a larger system that guides decisions or actions. The utility of a model is a matter of its fitness for purpose, the extent to which the decisions and actions guided by the model output are helpful or harmful. That’s subtly different from asking whether the model output matches the actual outcome.

The focus in this chapter is on classifiers, particularly those where the categorical response variable has two possible levels, and where the output of the classifier is used to select one of two possible actions. For instance, suppose that you have been tasked by a health benefits company – call it HBC – to use a collection of health records to develop a system to predict whether a person will develop high blood pressure. Those people who the classifier indicates are at strong risk of high blood pressure will be invited to participate in a diet and exercise program to reduce the risk.

17.1 Building versus using a classifier

Continuing with the story of the health benefits company …

To start, think about how the company, which we’re calling HBC, will select the data to give you for building the classifier. HBC covers, let’s say, 250,000 people. Their medical records are confidential so HBC can’t give you a dump of all their records. HBC will be judicious in selecting records. They may even be sophisticated when it comes to classifiers, and divide the data into training and testing datasets, giving you only the training data and holding back the testing data so that they can evaluate the classifier you will be constructing for them.

One way for them to create the training data is to look through their entire set of records for people who already have high blood pressure. There will be many of them. The prevalence of high blood pressure in adults (in the US) is about one in three. Depending on the demographics of the HBC clients, perhaps 25,000 have already been diagnosed.

The classifier is being built for the purpose of predicting who is at strong risk of high blood pressure. So HBC will track back in time the medical history of the 25,000 to find their records, say, five years before the high blood pressure was diagnosed. For people who have just been diagnosed, this will mean going back five years. For people who were diagnosed ten years ago, this means going back 15 years in the records.

Suppose they can find 3000 people with such records. That’s the data they will send you, perhaps withholding 1000 for their own testing purposes.

Of course, to build the classifier you also need data on people who will not develop high blood pressure. It helps if the data for people who will not develop high blood pressure are similar to those who will: similar ages, sexes, ethnicities, etc. Here’s what the HBC data scientists might do: For each of the 3000 people already identified with high blood pressure, they will scan through their medical records for a match, that is, a person who is similar – same age, same sex, … – but who does not have high blood pressure. Then, for those matches, they will extract the medical records from five years before their matching partner was diagnosed with high blood pressure.
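As a rough sketch of this matching step, the following code pairs each case with one not-yet-used control of the same age and sex. The DataFrame `records` and its columns `age`, `sex`, and `has_hbp` are hypothetical stand-ins for HBC’s records:

```python
import pandas as pd

def match_controls(records: pd.DataFrame) -> pd.DataFrame:
    """For each case, pick one unused control with the same age and sex."""
    cases = records[records["has_hbp"]]
    pool = records[~records["has_hbp"]].copy()   # candidate controls
    matched = []
    for _, case in cases.iterrows():
        candidates = pool[(pool["age"] == case["age"]) &
                          (pool["sex"] == case["sex"])]
        if not candidates.empty:
            pick = candidates.index[0]
            matched.append(pick)
            pool = pool.drop(index=pick)         # use each control at most once
    return records.loc[matched]
```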

The end result of this data collection process is that you receive data on 4000 people: 2000 who will go on to have high blood pressure and another 2000 who will not. We call these two groups of 2000 the cases and the controls.

As already mentioned, it’s a good practice for you to build your classifier using training data and evaluate its performance using testing data. For the sake of simplicity in this chapter, imagine that you have randomly divided your cases and controls each into two groups of 1000 people: one group to use for training and the other for testing. (Chapter 18 introduces a method, k-fold cross validation, that lets you use all your data for both training and testing.)
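A minimal sketch of such a split, assuming `cases` and `controls` are pandas DataFrames of 2000 rows each (the names and the random seed are illustrative):

```python
import pandas as pd

def split_half(df: pd.DataFrame, seed: int = 42):
    train = df.sample(n=1000, random_state=seed)  # 1000 rows, chosen at random
    test = df.drop(index=train.index)             # the remaining 1000 rows
    return train, test

cases_train, cases_test = split_half(cases)
controls_train, controls_test = split_half(controls)
train = pd.concat([cases_train, controls_train])
test = pd.concat([cases_test, controls_test])
```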

Developing a classifier based on this data is the easy part. The response variable will be whether the person was a case or a control. The logistic family of models is appropriate for such a classifier. You’ll select explanatory variables using whatever insight you have. If you’re diligent, you’ll try different candidates for sets of explanatory variables to find the best candidate for distinguishing cases from controls.
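For instance, one candidate classifier might be fit as below. This is only a sketch: the explanatory variables `age`, `bmi`, and `smoker`, and the 0/1 response `case`, are hypothetical column names in the `train` data from the split above.

```python
from sklearn.linear_model import LogisticRegression

explanatory = ["age", "bmi", "smoker"]   # one candidate set of explanatory variables
model = LogisticRegression()
model.fit(train[explanatory], train["case"])  # case: 1 = will develop high blood pressure
```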

For each candidate classifier, you’ll construct a summary of how well that classifier performs on the test data. You’ll use that summary to decide which candidate is the best, and then send your classifier to HBC.

What should that summary look like? Many people presume, incorrectly, that a good summary is the error rate: what fraction of the rows in your testing data is misclassified. (Or, in the same misguided way, you may prefer to use the accuracy, the fraction of rows that are properly classified. Of course the error rate and the accuracy tell the same story. The accuracy is just one minus the error rate.)

The error rate (or, equivalently, the accuracy) can be severely misleading. To see this, imagine how HBC is going to use the classifier you develop. They will apply the model function you give them to each of their clients who has not yet been diagnosed with high blood pressure. For each client, this results in a prediction, a probability that the person will develop high blood pressure in the next five years. HBC will send the people with the greatest risk of developing high blood pressure an invitation to the risk-reduction intervention.
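Continuing the fitting sketch above, HBC’s use of the classifier might look like this, where `clients` is a hypothetical DataFrame with the same explanatory variables for the not-yet-diagnosed clients, and the 0.8 threshold is purely illustrative:

```python
# probability of developing high blood pressure, according to the classifier
probs = model.predict_proba(clients[explanatory])[:, 1]
invite = clients[probs >= 0.8]   # invite the highest-risk clients
```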

Unfortunately, the probability put out by your classifier is not necessarily the risk of developing high blood pressure. Remember that your classifier is based on data in which half of the people will develop high blood pressure in the next five years. The data were constructed that way. But the actual population of people to whom HBC will apply the classifier may, as a group, face a very different risk of developing high blood pressure. For instance, for the 225,000 HBC clients who have not been diagnosed with high blood pressure, perhaps only 5-10 percent will actually develop it in the next five years.

The fraction of a group of people who will develop a condition over a given period is called the incidence rate of that condition. The incidence rate in your testing data is 50%. The incidence rate for the 225,000 HBC clients might be, say, 5-10%.

The error rate in your test data reflects both the performance of the classifier and the incidence rate. Given that the incidence rate for the HBC client body is different from that in your testing data, reporting the error rate for your testing data is misleading.

17.2 Two ways to be right, two ways to be wrong

It’s understandable that a classifier may not have perfect performance. After all, it’s trying to make a prediction based on limited data, and randomness may play a role.

It’s important to recognize that there are different ways of making a mistake, and these different ways have very different consequences. One kind of mistake, called a false positive, involves a classifier output that’s positive (i.e. the classifier indicates that the patient will develop high blood pressure) but which is wrong. The consequence of this sort of mistake in the present example is that a patient who will not benefit from the diet and exercise program (at least so far as high blood pressure is concerned) will be invited to participate even so.

The other kind of mistake is called a false negative. Here, the classifier output is that the patient will not develop high blood pressure, but the patient will indeed go on to have high blood pressure. The consequence of this kind of mistake is different: the patient will not be invited to participate in the diet and exercise program, missing what might be an important opportunity to maintain good health.

The nomenclature signals that a mistake has been made with the word “false.” The kind of mistake is either “positive” or “negative”, corresponding to the output of the classifier.

When the classifier gets things right, that is a “true” result. As with the false results, a true result is possible both for a “positive” and a “negative” classifier output. So the two ways of getting things right are called true positive and true negative.
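Tallying the four kinds of results on test data is a simple counting exercise. A sketch, assuming `actual` and `output` are boolean arrays (True meaning positive) with one entry per test row:

```python
import numpy as np

actual = np.array([True, True, False, False, True])   # hypothetical test outcomes
output = np.array([True, False, False, True, True])   # hypothetical classifier outputs

true_pos = np.sum(output & actual)     # positive output, and correct
false_pos = np.sum(output & ~actual)   # positive output, but wrong
true_neg = np.sum(~output & ~actual)   # negative output, and correct
false_neg = np.sum(~output & actual)   # negative output, but wrong
```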

17.3 Specificity and sensitivity

Instead of using a single number like the error rate to describe a classifier’s performance, it’s better to use two numbers. One of the numbers refers to the group of people who will develop the condition. In our HBC example, those are the people who will develop high blood pressure in five years. The number, called the sensitivity of the classifier, is the probability that, for a person in this group, the classifier gives a positive result.

The other number is called specificity and refers to another group of people: those who will not develop the condition. The specificity is the probability, in this group, that the classifier will deliver a negative result.

Both sensitivity and specificity are probabilities that the classifier will give a correct output. The two probabilities apply to completely different groups: respectively those who will and those who won’t develop the condition. That is, the sensitivity tells us about the classifier’s performance on the cases. The specificity tells us about the classifier’s performance on the controls.

Calculating the sensitivity and specificity of a classifier is straightforward. For sensitivity, look at the performance of the classifier on the cases. Suppose, for instance, that out of 1000 cases, the classifier declared 850 to have positive results. The sensitivity will then be 85%.

For specificity, look at the performance of the classifier on the controls. Out of the 1000 controls, if the classifier gives a negative output for 600, the specificity would be 60%.
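In terms of the four counts tallied earlier, a sketch of the two calculations, checked against the numbers in the text:

```python
def sens_spec(true_pos, false_neg, true_neg, false_pos):
    sensitivity = true_pos / (true_pos + false_neg)  # performance on the cases
    specificity = true_neg / (true_neg + false_pos)  # performance on the controls
    return sensitivity, specificity

print(sens_spec(850, 150, 600, 400))   # (0.85, 0.6), as in the text
```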

A perfect classifier will have a sensitivity of 100% and a specificity of 100%. An imperfect classifier will have lower sensitivity and/or specificity.

How well the classifier works in actual use depends on the incidence rate as well as the sensitivity and specificity. To illustrate, let’s apply our classifier with a hypothetical sensitivity of 85% and specificity of 60% to two different groups. One has an incidence of 50%, the other an incidence of 10%.

When the incidence is 50%, we know that out of 1000 people 500 will develop the condition and 500 won’t. For the 500 who will develop the condition, the sensitivity tells us that 85% will get the right classifier outcome: the true positive count will be 500 * 85% = 425. Correspondingly, the false negative count will be 75.

For the 500 who will not develop the condition, the specificity tells us that the classifier result will be correct 60% of the time. This means that the true negative count will be 500 * 60% = 300. Correspondingly, the false positive count will be 200.

Just so you can see how the error rate can be misleading, note that the error count in the population with 50% incidence will be 75 + 200 = 275 out of the 1000, an error rate of 27.5%.

Now consider a different population with an incidence rate of 10%. There will be 100 people out of 1000 who will develop the condition. The classifier will get 85% of these right. The remaining 900 people will not develop the condition. The classifier will get 60% of these right. Overall, the various counts are:

  • True positive: 100 * 85% = 85
  • False negative: 100 - 85 = 15
  • True negative: 900 * 60% = 540
  • False positive: 900 - 540 = 360

When applied to the group with the 10% incidence rate, the error count will be 360 + 15 = 375 out of 1000, an error rate of 37.5%.
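The arithmetic generalizes: given the sensitivity, the specificity, and the incidence rate, the error rate follows directly. A sketch that reproduces both calculations above:

```python
def error_rate(sensitivity, specificity, incidence):
    false_neg = incidence * (1 - sensitivity)        # cases the classifier misses
    false_pos = (1 - incidence) * (1 - specificity)  # non-cases flagged positive
    return false_neg + false_pos

print(error_rate(0.85, 0.60, 0.50))   # 0.275, the 50%-incidence group
print(error_rate(0.85, 0.60, 0.10))   # 0.375, the 10%-incidence group
```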

17.4 Example: Accuracy of airport security screening

Airplane passengers have, for decades, gone through a security screening process involving identity checks, “no fly” lists, metal detection, imaging of baggage, random pat-downs, and such. How accurate is such screening? Almost certainly, the accuracy is not as good as an extremely simple, no-input alternative process: automatically identify every passenger as “not a security problem.” We can estimate the accuracy of the “not a security problem” classifier by guessing what fraction of airplane passengers are indeed a threat to aircraft. In the US alone, there are about 2.5 million airplane passengers each day and security problems of any sort rarely happen. So the accuracy of the no-input classifier is something like 99.999%.

The actual screening system, using metal detectors, baggage x-rays, and so on, will have a lower accuracy. We know this since it regularly misidentifies innocent people as security problems.

The problem here is not with airport security screening, but with the flawed use of accuracy as a measure of performance. Indeed, achieving super-high accuracy is not the objective of the security screening process. Instead, the objective is to deter security problems by convincing potential terrorists that they are likely to get caught before they can get on a plane. This has to do with the sensitivity of the system. The specificity of the system, although important to the everyday traveller, is not what deters the terrorist.