Model Performance Metrics for Binary Classification
Before we start building predictive models in earnest, we would like to know how we can determine, once we've created a model, whether it is "good" in some sense of the word. As you may imagine, this question has received a lot of attention from researchers and practitioners. Consequently, there is a wide variety of model performance metrics to choose from.
Note
For an idea of the range of options, have a look at the scikit-learn model evaluation page: https://scikit-learn.org/stable/modules/model_evaluation.html#model-evaluation.
When selecting a model performance metric to assess the predictive quality of a model, it's important to keep two things in mind.
Appropriateness of the metric for the problem
Metrics are typically only defined for a specific class of problems, such as classification or regression. For a binary classification problem, several metrics characterize the correctness of the yes or no question that the model answers. An additional level of detail here is how often the model is correct for each class, the positive and negative classes. We will go into detail on these metrics here. On the other hand, regression metrics are aimed at measuring how close a prediction is to the target quantity. If we are trying to predict the price of a house, how close did we come? Are we systematically over- or under-estimating? Are we getting the more expensive houses wrong but the cheaper ones right? There are many possible ways to look at regression metrics.
Does the metric answer the business question?
Whatever class of problem you are working on, there will be many choices for the metric. Which one is the right one? And even then, how do you know if a model is "good enough" in terms of the metric? At some level, this is a subjective question. However, we can be objective when we consider what the goal of the model is. In a business context, typical goals are to increase profit or reduce loss. Ultimately, you need to unify your business question, which is often related to money in some way, and the metric you will use to judge your model.
For example, in our credit default problem, is there a particularly high cost associated with not correctly identifying accounts that will default? Is this more important than potentially misclassifying some of the accounts that won't default?
Later in the book, we'll incorporate the concept of relative costs and benefits of correct and incorrect classifications in our problem and conduct a financial analysis. First, we'll introduce you to the most common metrics used to assess the predictive quality of binary classification models, the kinds of model we need to build for our case study.
Splitting the Data: Training and Test Sets
In the scikit-learn introduction of this chapter, we introduced the concept of using a trained model to make predictions on new data that the model had never "seen" before. It turns out this is a foundational concept in predictive modeling. In our quest to create a model that has predictive capabilities, we need some kind of measure of how well the model can make predictions on data that were not used to fit the model. This is because in fitting a model, the model becomes "specialized" at learning the relationship between features and response on the specific set of labeled data that were used for fitting. While this is nice, in the end we want to be able to use the model to make accurate predictions on new, unseen data, for which we don't know the true value of the labels.
For example, in our case study, once we deliver the trained model to our client, they will then generate a new dataset of features like those we have now, except instead of spanning the period from April to September, they will span from May to October. Our client will then use the model with these features to predict whether accounts will default in November.
In order to know how well we can expect our model to predict which accounts will actually default in November (which won't be known until December), we can take our current dataset and reserve some of the data we have, with known labels, from the model training process. This data is referred to as test data and may also be called out-of-sample data since it consists of samples that were not used in training the model. Those samples used to train the model are called training data. The practice of holding out a set of test data gives us an idea of how the model will perform when it is used for its intended purpose, to make predictions on samples that were not included during model training. In this chapter, we'll create an example train/test split to illustrate different binary classification metrics.
We will use the convenient train_test_split functionality of scikit-learn to split the data so that 80% will be used for training, holding 20% back for testing. These percentages are a common way to make such a split; in general, you want enough training data to allow the algorithm to adequately "learn" from a representative sample of data. However, these percentages are not set in stone. If you have a very large number of samples, you may not need as large a percentage of training data, since you will be able to achieve a pretty large, representative training set with a lower percentage. We encourage you to experiment with different sizes and see the effect. Also, be aware that every problem is different with respect to how much data is needed to effectively train a model. There is no hard and fast rule for sizing your training and test sets.
For our 80/20 split, we can use the code shown in the following snippet:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['EDUCATION'].values.reshape(-1,1),
    df['default payment next month'].values,
    test_size=0.2, random_state=24)
Notice that we've set test_size to 0.2, or 20%. The size of the training data will be automatically set to the remainder, 80%. Let's examine the shapes of our training and test data to see whether they are as expected.
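For example, a quick check of the shapes might look like this (a minimal sketch using the arrays created above):
print(X_train.shape)   # training features: roughly 80% of the rows, 1 column
print(X_test.shape)    # test features: the remaining roughly 20% of the rows
print(y_train.shape)   # training labels: same number of rows as X_train
print(y_test.shape)    # test labels: same number of rows as X_test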
You should confirm for yourself that the number of samples (rows) in the training and test sets is consistent with an 80/20 split.
In making the train/test split, we've also set the random_state parameter, which is a random number seed. Using this parameter allows a consistent train/test split across runs of this notebook. Otherwise, the random splitting procedure would select a different 20% of the data for testing each time the code was run.
The first argument to train_test_split is the features, in this case just EDUCATION, and the second argument is the response. There are four outputs: the features of the samples in the training and test sets, respectively, and the corresponding response variables that go with these sets of features. All this function has done is randomly select 20% of the row indices from the dataset and subset out these features and responses as test data, leaving the rest for training. Now that we have our training and test data, it's good to make sure the nature of the data is the same between these sets. In particular, is the fraction of the positive class similar?
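One way to check this is shown in the following sketch; because the labels are coded as 0 and 1, the mean of each label array is the positive class fraction:
import numpy as np
print('Positive class fraction in training data: {}'.format(np.mean(y_train)))
print('Positive class fraction in test data: {}'.format(np.mean(y_test)))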
The positive class fractions in the training and test data are both about 22%. This is good, as we can say that the training set is representative of the test set. In this case, since we have a pretty large dataset with tens of thousands of samples, and the classes are not too imbalanced, we didn't have to take precautions to ensure this happens.
However, you can imagine that if the dataset were smaller, and the positive class very rare, it may be that the class fractions would be noticeably different between the training and test sets, or worse yet, there might be no positive samples at all in the test set. In order to guard against such scenarios, you could use stratified sampling, with the stratify keyword argument of train_test_split. This procedure also makes a random split of the data into training and test sets but guarantees that the class fractions will be equal or very similar.
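A minimal sketch of a stratified split, changing only the stratify argument relative to our earlier call:
X_train, X_test, y_train, y_test = train_test_split(
    df['EDUCATION'].values.reshape(-1,1),
    df['default payment next month'].values,
    test_size=0.2, random_state=24,
    stratify=df['default payment next month'].values)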
Note
Out-of-time testing
If your data contains both features and responses that span a substantial period of time, it's a good practice to try making your train/test split over time. For example, if you have two years of data with features and responses from every month, you may wish to try sequentially training the model on 12 months of data and testing on the next month, or the month after that, depending on what is operationally feasible when the model will be used. You could repeat this until you've exhausted your data, to get a few different test scores. This will give you useful insights into model performance because it simulates the actual conditions the model will face when it is deployed: a model trained on old features and responses will be used to make predictions on new data. In the case study, the responses only come from one point in time (credit defaults within one month), so this is not an option here.
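As a purely hypothetical sketch of such a split (the DataFrame df_time, its month and response columns, and the feature_columns list are assumptions for illustration, not part of our case study data):
# Train on the first 12 months of data, test on the following month
train_rows = df_time['month'] <= 12
test_rows = df_time['month'] == 13
X_train_time = df_time.loc[train_rows, feature_columns].values
y_train_time = df_time.loc[train_rows, 'response'].values
X_test_time = df_time.loc[test_rows, feature_columns].values
y_test_time = df_time.loc[test_rows, 'response'].values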
Classification Accuracy
Now we proceed to fit an example model to illustrate binary classification metrics. We will continue to use logistic regression with near-default options, choosing the same options we demonstrated in Chapter 1, Data Exploration and Cleaning:
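A minimal sketch of what this might look like (the solver choice here is an assumption for illustration, not necessarily the exact options shown in Chapter 1):
from sklearn.linear_model import LogisticRegression
example_lr = LogisticRegression(solver='liblinear')   # near-default options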
Next, we train the model using the labeled data from our training set, and then immediately use the trained model to make predictions on the features of the samples from the held-out test set:
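A sketch of these two steps, reusing the training and test sets from the split above (the variable name example_lr follows the usage later in this chapter):
example_lr.fit(X_train, y_train)       # learn from the labeled training data
y_pred = example_lr.predict(X_test)    # binary predictions for the test set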
We've stored the model-predicted labels of the test set in a variable called y_pred. How should we now assess the quality of these predictions? We have the true labels, in the y_test variable. First, we will compute what is probably the simplest of all binary classification metrics: accuracy. Accuracy is defined as the proportion of samples that were correctly classified.
One way to calculate accuracy is to create a logical mask that is True whenever the predicted label is equal to the actual label, and False otherwise. We can then take the average of this mask, which will interpret True as 1 and False as 0, giving us the proportion of correct classifications:
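A minimal sketch of this calculation:
import numpy as np
is_correct = y_pred == y_test   # logical mask: True where the prediction matches the true label
np.mean(is_correct)             # proportion of correct classifications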
This indicates that the model is correct 78% of the time. While this is a pretty straightforward calculation, there are actually easier ways to calculate accuracy using the convenience of scikit-learn. One way is to use the trained model's .score method, passing the features of the test data to make predictions on, as well as the test labels. This method makes the predictions and then does the same calculation we performed previously, all in one step. Or, we could import scikit-learn's metrics library, which includes many model performance metrics, such as accuracy_score. For this, we pass the true labels and the predicted labels:
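Sketches of both approaches, each of which should return the same accuracy as the manual calculation:
# Using the trained model's .score method
example_lr.score(X_test, y_test)

# Using the metrics library's accuracy_score function
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred)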
These all give the same result, as they should. Now that we know how accurate the model is, how do we interpret this metric? On the surface, an accuracy of 78% may sound good. We are getting most of the predictions right. However, an important test for the accuracy of binary classification is to compare things to a very simple hypothetical model that only makes one prediction: this hypothetical model predicts the majority class for every sample, no matter what the features are. While in practice this model is useless, it provides an important extreme case with which to compare the accuracy of our trained model. Such extreme cases are sometimes referred to as null models.
Think about what the accuracy of such a null model would be. In our dataset, we know that about 22% of the samples are positive. So, the negative class is the majority class, with the remaining 78% of the samples. Therefore, a null model for this dataset, which always predicts the majority negative class, will be right 78% of the time. Now when we compare our trained model here to such a null model, it becomes clear that an accuracy of 78% is actually not very useful. We can get the same accuracy with a model that doesn't pay any attention to the features.
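As a quick sketch, we can compute this null accuracy on the test set directly as the majority class fraction:
import numpy as np
majority_fraction = max(np.mean(y_test), 1 - np.mean(y_test))
majority_fraction   # accuracy of a null model that always predicts the majority class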
While we can interpret accuracy in terms of a majority-class null model, there are other binary classification metrics that delve a little deeper into how the model is performing for negative and positive samples separately.
True Positive Rate, False Positive Rate, and Confusion Matrix
In binary classification, there are just two labels to consider: positive and negative. As a more descriptive way to look at model performance than the accuracy of prediction across all samples, we can also look at the accuracy of only those samples that have a positive label. The proportion of these that we successfully predict as positive is called the true positive rate (TPR). If we say that P is the number of samples in the positive class in the test data, and TP is the number of true positives, defined as the number of positive samples that were predicted to be positive by the model, then the TPR is as follows:
TPR = TP / P
The flip side of the true positive rate is the false negative rate (FNR). This is the proportion of positive test samples that we incorrectly predicted as negative. Such errors are called false negatives, and if FN is the number of false negatives, the false negative rate is calculated as follows:
FNR = FN / P
Since all the positive samples are either correctly or incorrectly predicted, the sum of the number of true positives and the number of false negatives equals the total number of positive samples. Mathematically, P = TP + FN, and therefore, using the definitions of TPR and FNR, we have the following:
TPR + FNR = TP / P + FN / P = (TP + FN) / P = 1
Since the TPR and FNR sum to 1, it's sufficient to just calculate one of them.
Similar to the TPR and FNR, there are the true negative rate (TNR) and the false positive rate (FPR). If N is the number of negative samples, TN is the number of these that are correctly predicted as negative (true negatives), and FP is the number incorrectly predicted as positive (false positives), then:
TNR = TN / N, FPR = FP / N, and TNR + FPR = 1
True and false positives and negatives can be conveniently summarized in a table called a confusion matrix. A confusion matrix for a binary classification problem is a 2 x 2 matrix where the true class is along one axis and the predicted class is along the other. The confusion matrix gives a quick summary of how many true and false positives and negatives there are:

                    Predicted negative    Predicted positive
Actual negative            TN                    FP
Actual positive            FN                    TP
Since we hope to make correct classifications, we hope that the diagonal entries of the confusion matrix (that is, the entries along a diagonal line from the top left to the bottom right: TN and TP) are relatively large, while the off-diagonal entries are relatively small, as these represent incorrect classifications. The accuracy metric can be calculated from the confusion matrix by adding up the entries on the diagonal, which are the correct predictions, and dividing by the total number of all predictions: Accuracy = (TN + TP) / (TN + FP + FN + TP).
Exercise 2.02: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python
In this exercise, we'll use the test data and model predictions from the logistic regression model we created previously, using only the EDUCATION feature. We will illustrate how to manually calculate the true and false positive and negative rates, as well as the numbers of true and false positives and negatives needed for the confusion matrix. Then we will show a quick way to calculate a confusion matrix with scikit-learn. Perform the following steps to complete the exercise, noting that some code from the previous section must be run before doing this exercise (as seen on GitHub):
Note
The Jupyter notebook for this exercise can be found here: https://packt.link/S02kz.
- Run this code to calculate the number of positive samples:
P = sum(y_test)
P
The output should appear like this:
1155
Now we need the number of true positives. These are samples where the true label is 1 and the prediction is also 1. We can identify these with a logical mask selecting the samples that are positive (y_test==1) AND (& is the logical AND operator in Python) also have a positive prediction (y_pred==1).
- Use this code to calculate the number of true positives:
TP = sum( (y_test==1) & (y_pred==1) )
TP
Here is the output:
0
The true positive rate is the proportion of true positives to positives, which of course would be 0 here.
- Run the following code to obtain the TPR:
TPR = TP/P
TPR
You will obtain the following output:
0.0
Similarly, we can identify the false negatives.
- Calculate the number of false negatives with this code:
FN = sum( (y_test==1) & (y_pred==0) )
FN
This should output the following:
1155
We'd also like the FNR.
- Calculate the FNR with this code:
FNR = FN/P
FNR
This should output the following:
1.0
What have we learned from the true positive and false negative rates?
First, we can confirm that they sum to 1. This fact is easy to see because the TPR = 0 and the FNR = 1. What does this tell us about our model? On the test set, at least for the positive samples, the model has in fact acted as a majority-class null model. Every positive sample was predicted to be negative, so none of them was correctly predicted.
- Let's find the TNR and FPR of our test data. Since these calculations are very similar to those we looked at previously, we show them all at once and illustrate a new Python function:
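A sketch of this calculation, following the same pattern as the TPR and FNR above and using print with string formatting (the exact message text is illustrative):
N = sum(y_test==0)                       # number of negative samples
TN = sum( (y_test==0) & (y_pred==0) )    # true negatives
FP = sum( (y_test==0) & (y_pred==1) )    # false positives
TNR = TN/N
FPR = FP/N
print('The true negative rate is {} and the false positive rate is {}'
      .format(TNR, FPR))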
In addition to calculating the TNR and FPR in a similar way to the TPR and FNR previously, we demonstrate the print function in Python along with the .format method for strings, which allows substitution of variables in locations marked by curly braces {}. There is a range of options for formatting numbers, such as including a certain number of decimal places.
Note
For additional details, refer to https://docs.python.org/3/tutorial/inputoutput.html.
Now, what have we learned here? In fact, our model behaves exactly like the majority-class null model for all samples, both positive and negative. It's clear we're going to need a better model.
While we have manually calculated all the entries of the confusion matrix in this exercise, in scikit-learn there is a quick way to do this. Note that in scikit-learn, the true class is along the vertical axis and the predicted class is along the horizontal axis of the confusion matrix, as we presented earlier.
- Create a confusion matrix in scikit-learn with this code:
metrics.confusion_matrix(y_test, y_pred)
You will obtain the following output:
All the information we need to calculate the TPR, FNR, TNR, and FPR is contained in the confusion matrix. We also note that there are many more classification metrics that can be derived from the confusion matrix. In fact, some of these are actually synonyms for ones we've already examined here. For example, the TPR is also called recall and sensitivity. Along with recall, another metric that is often used for binary classification is precision: this is the proportion of positive predictions that are correct (as opposed to the proportion of positive samples that are correctly predicted). We'll get more experience with precision in the activity for this chapter.
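As a sketch of how these rates can be read off the scikit-learn confusion matrix programmatically (the unpacking order follows the layout described above, with the true class along the vertical axis and TN in the top-left entry):
tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)   # true positive rate (recall, sensitivity)
fnr = fn / (tp + fn)   # false negative rate
tnr = tn / (tn + fp)   # true negative rate
fpr = fp / (tn + fp)   # false positive rate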
Note
Multiclass classification
Our case study involves a binary classification problem, with only two possible outcomes: the account does or does not default. Another important type of machine learning classification problem is multiclass classification. In multiclass classification, there are several possible mutually exclusive outcomes. A classic example is image recognition of handwritten digits; a handwritten digit should be only one of 0, 1, 2, … 9. Although multiclass classification is outside the scope of this book, the metrics we are learning now for binary classification can be extended to the multiclass setting.
Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?
Now that we're familiar with accuracy, true and false positives and negatives, and the confusion matrix, we can explore new ways of using logistic regression to learn about more advanced binary classification metrics. So far, we've only considered logistic regression as a "black box" that can learn from labeled training data and then make binary predictions on new features. While we will learn about the workings of logistic regression in detail later in the book, we can begin to peek inside the black box now.
One thing to understand about how logistic regression works is that the raw predictions – in other words, the direct outputs from the mathematical equation that defines logistic regression – are not binary labels. They are actually probabilities on a scale from 0 to 1 (although, technically, the equation never allows the probabilities to be exactly equal to 0 or 1, as we'll see later). These probabilities are only transformed into binary predictions through the use of a threshold. The threshold is the probability above which a prediction is declared to be positive, and below which it is negative. The default threshold in scikit-learn is 0.5. This means any sample with a predicted probability of at least 0.5 is identified as positive, and any with a predicted probability below 0.5 is classified as negative. However, we are free to use any threshold we want. In fact, choosing the threshold is one of the key flexibilities of logistic regression, as well as other machine learning classification algorithms that estimate probabilities of class membership.
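As a sketch of this relationship, using the .predict_proba method that we introduce formally in the next exercise (the 0.25 value is just an illustrative alternative threshold):
# Probability of the positive class for each test sample
proba_pos = example_lr.predict_proba(X_test)[:, 1]
# The default binary predictions correspond to a threshold of 0.5...
pred_default = (proba_pos >= 0.5).astype(int)
# ...but any other threshold can be applied to the same probabilities
pred_lower_threshold = (proba_pos >= 0.25).astype(int)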
Exercise 2.03: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
In the following exercise, we will get familiar with the predicted probabilities of logistic regression and how to obtain them from a scikit-learn model.
We can begin to discover predicted probabilities by further examining the methods available to us on the logistic regression model object that we trained earlier in this chapter. Recall that before, once we trained the model, we could then make binary predictions using the values of features from new samples by passing these values to the .predict method of the trained model. These are predictions made on the assumption of a threshold of 0.5.
However, we can directly access the predicted probabilities of these samples, using the .predict_proba method. Perform the following steps to complete the exercise, keeping in mind that you will need to recreate the same model trained previously in the chapter if you are starting a new notebook:
Note
The Jupyter notebook for this exercise can be found here: https://packt.link/yDyQn. The notebook contains the prerequisite steps of training the model and should be executed prior to the first step shown here.
- Obtain the predicted probabilities for the test samples using this code:
y_pred_proba = example_lr.predict_proba(X_test)
y_pred_proba
The output should be as follows:
We see in the output of this, which we've stored in y_pred_proba, that there are two columns. This is because there are two classes in our classification problem: negative and positive. Assuming the negative labels are coded as 0 and the positives as 1, as they are in our data, scikit-learn will report the probability of negative class membership as the first column, and positive class membership as the second.
Since the two classes are mutually exclusive and are the only options, the sum of predicted probabilities for the two classes should equal 1 for every sample. Let's confirm this.
First, we can use np.sum along axis 1 (that is, across the columns) to calculate the sum of probabilities for each sample.
- Calculate the sum of predicted probabilities for each sample with this code:
prob_sum = np.sum(y_pred_proba,1)
prob_sum
The output is as follows:
array([1., 1., 1., ..., 1., 1., 1.])
It certainly looks like all 1s. We should check to see that the result is the same shape as the array of test data labels.
- Check the array shape with this code:
prob_sum.shape
This should output the following:
(5333,)
Good; this is the expected shape. Now, let's check that each value is 1. We use np.unique to show all the unique elements of this array. This is similar to DISTINCT in SQL. If all the probability sums are indeed 1, there should only be one unique element of the probability array: 1.
- Show all unique array elements with this code:
np.unique(prob_sum)
This should output the following:
array([1.])
After confirming our belief in the predicted probabilities, we note that since class probabilities sum to 1, it's sufficient to just consider the second column, the predicted probability of positive class membership. Let's capture these in an array.
- Run this code to put the second column of the predicted probabilities array (predicted probability of membership in the positive class) in an array:
pos_proba = y_pred_proba[:,1]
pos_proba
The output should be as follows:
What do these probabilities look like? One way to find out, and a good diagnostic for model output, is to plot the predicted probabilities. A histogram is a natural way to do this, for which we can use the matplotlib function hist(). Note that if you execute a cell with only the histogram function, you will get the output of the NumPy histogram function returned before the plot. This includes the number of samples in each bin and the locations of the bin edges.
- Execute this code to see histogram output and an unformatted plot (not shown here):
plt.hist(pos_proba)
The output is as follows:
This may be useful information for you and could also be obtained directly from the np.histogram() function. However, here we're mainly interested in the plot, so we adjust the font size and add some axis labels.
- Run this code for a formatted histogram plot of predicted probabilities:
mpl.rcParams['font.size'] = 12
plt.hist(pos_proba)
plt.xlabel('Predicted probability of positive class for test data')
plt.ylabel('Number of samples')
The plot should look like this:
Notice that in the histogram of probabilities, there are only four bins that actually have samples in them, and they are spaced fairly far apart. This is because there are only four unique values for the EDUCATION feature, which is the only feature in our example model.
Also, notice that all the predicted probabilities are below 0.5. This is the reason every sample was predicted to be negative, using the 0.5 threshold. We can imagine that if we set our threshold below 0.5, we would get different results. For example, if we set the threshold at 0.25, all of the samples in the smallest bin to the far right of Figure 2.26 would be classified as positive, since the predicted probability for all of these is above 0.25. It would be informative for us if we could see how many of these samples actually had positive labels. Then we could see whether moving our threshold down to 0.25 would improve the performance of our classifier by classifying the samples in the rightmost bin as positive.
In fact, we can visualize this easily, using a stacked histogram. This will look a lot like the histogram in Figure 2.27, except that the negative and positive samples will be colored differently. First, we need to distinguish between positive and negative samples in the predicted probabilities. We can do this by indexing our array of predicted probabilities with logical masks; first to get positive samples, where y_test == 1, and then to get negative samples, where y_test == 0.
- Isolate the predicted probabilities for positive and negative samples with this code:
pos_sample_pos_proba = pos_proba[y_test==1]
neg_sample_pos_proba = pos_proba[y_test==0]
Now we want to plot these as a stacked histogram. The code is similar to the histogram we already created, except that we will pass a list of arrays to be plotted, which are the arrays of probabilities for positive and negative samples we just created, and a keyword indicating we'd like the bars to be stacked, as opposed to plotted side by side. We'll also create a legend so that the colors are clearly identifiable on the plot.
- Plot a stacked histogram using this code:
plt.hist([pos_sample_pos_proba, neg_sample_pos_proba],
         histtype='barstacked')
plt.legend(['Positive samples', 'Negative samples'])
plt.xlabel('Predicted probability of positive class')
plt.ylabel('Number of samples')
The plot should look like this:
The plot shows us the true labels of the samples for each predicted probability. Now we can consider what the effect would be of lowering the threshold to 0.25. Take a moment and think about what this would mean, keeping in mind that any sample with a predicted probability at or above the threshold would be classified as positive.
Since nearly all the samples in the small bin to the right of Figure 2.28 are negative samples, if we were to decrease the threshold to 0.25, we would erroneously classify these as positive samples and increase our FPR. At the same time, we still wouldn't have managed to classify many, if any, positive samples correctly, so our TPR wouldn't increase very much at all. Making this change would appear to decrease the accuracy of the model.
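We can check this reasoning numerically with a short sketch, reusing pos_proba and y_test from this exercise (the 0.25 threshold is the illustrative value discussed above):
pred_at_025 = (pos_proba >= 0.25).astype(int)        # classify with a 0.25 threshold
TP_025 = sum( (y_test==1) & (pred_at_025==1) )
FP_025 = sum( (y_test==0) & (pred_at_025==1) )
print('TPR at a threshold of 0.25: {}'.format(TP_025/sum(y_test==1)))
print('FPR at a threshold of 0.25: {}'.format(FP_025/sum(y_test==0)))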
The Receiver Operating Characteristic (ROC) Curve
Deciding on a threshold for a classifier is a question of finding the "sweet spot" where we are successfully recovering enough true positives, without incurring too many false positives. As the threshold is lowered more and more, there will be more of both. A good classifier will be able to capture more true positives without the expense of a large number of false positives. What would be the effect of lowering the threshold even more, with the predicted probabilities from the previous exercise? It turns out there is a classic method of visualization in machine learning, with a corresponding metric that can help answer this kind of question.
The receiver operating characteristic (ROC) curve is a plot of the pairs of TPRs (y-axis) and FPRs (x-axis) that result from lowering the threshold down from 1 all the way to 0. You can imagine that if the threshold is 1, there are no positive predictions since a logistic regression only predicts probabilities strictly between 0 and 1 (endpoints not included). Since there are no positive predictions, the TPR and the FPR are both 0, so the ROC curve starts out at (0, 0). As the threshold is lowered, the TPR will start to increase, hopefully faster than the FPR if it's a good classifier. Eventually, when the threshold is lowered all the way to 0, every sample is predicted to be positive, including all the samples that are, in fact, positive, but also all the samples that are actually negative. This means the TPR is 1 but the FPR is also 1. In between these two extremes are the reasonable options for where you may want to set the threshold, depending on the relative costs and benefits of true and false positives and negatives for the specific problem being considered. In this way, it is possible to get a complete picture of the performance of the classifier at all different thresholds to decide which one to use.
We could write the code to determine the TPRs and FPRs of the ROC curve by using the predicted probabilities and varying the threshold from 1 to 0. Instead, we will use scikit-learn's convenient functionality, which will take the true labels and predicted probabilities as inputs and return arrays of TPRs, FPRs, and the thresholds that lead to them. We will then plot the TPRs against the FPRs to show the ROC curve. Run this code to use scikit-learn to generate the arrays of TPRs and FPRs for the ROC curve, importing the metrics module if needed:
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, pos_proba)
Now we need to produce a plot. We'll use plt.plot, which will make a line plot using the first argument as the x values (FPRs), the second argument as the y values (TPRs), and the shorthand '*-' to indicate a line plot with star symbols where the data points are located. We add a straight-line plot from (0, 0) to (1, 1), which will appear in red ('r') and as a dashed line ('--'). We've also given the plot a legend (which we'll explain shortly), as well as axis labels and a title. This code produces the ROC plot:
plt.plot(fpr, tpr, '*-')
plt.plot([0, 1], [0, 1], 'r--')
plt.legend(['Logistic regression', 'Random chance'])
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
And the plot should look like this:
What have we learned from our ROC curve? We can see that it starts at (0,0) with a threshold high enough so that there are no positive classifications. Then the first thing that happens, as we imagined previously when lowering the threshold to about 0.25, is that we get an increase in the FPR, but very little increase in the TPR. The effects of continuing to lower the threshold so that the other bars from our stacked histogram plot in Figure 2.28 would be included as positive classifications are shown by the subsequent points on the line. We can see the thresholds that lead to these rates by examining the threshold array, which is not part of the plot. View the thresholds used to calculate the ROC curve using this code:
thresholds
The output should be as follows:
array([1.2549944 , 0.2549944 , 0.24007604, 0.22576598, 0.21207085])
Notice that the first threshold is actually above 1; practically speaking, it just needs to be a threshold that's high enough that there are no positive classifications.
Now consider what a "good" ROC curve would look like. As we lower the threshold, we want to see the TPR increase, which means our classifier is doing a good job of correctly identifying positive samples. At the same time, ideally the FPR should not increase that much. The ROC curve of an effective classifier would hug the upper left corner of the plot: high TPR, low FPR. You can imagine that a perfect classifier would get a TPR of 1 (recovers all the positive samples) and an FPR of 0 and appear as a sort of square starting at (0,0), going up to (0,1), and finishing at (1,1). While in practice this kind of performance is highly unlikely, it gives us a limiting case.
Further consider what the area under the curve (AUC) of such a classifier would be, remembering integrals from calculus if you have studied it. The AUC of a perfect classifier would be 1, because the shape of the curve would be a square on the unit interval [0, 1].
On the other hand, the line labeled as "Random chance" in our plot is the ROC curve that theoretically results from flipping an unbiased coin as a classifier: it's just as likely to get a true positive as a false positive, so lowering the threshold introduces more of each in equal proportion and the TPR and FPR increase at the same rate. The AUC under this ROC would be half of the perfect classifier's, as you can see graphically, and would be 0.5.
So, in general, the ROC AUC is going to be between 0.5 and 1 (although values below 0.5 are technically possible). Values close to 0.5 indicate the model can do little better than random chance (coin flip) as a classifier, while values closer to 1 indicate better performance. The ROC AUC is a key metric for the quality of a classifier and is widely used in machine learning. The ROC AUC may also be referred to as the C-statistic (concordance statistic).
Since this is such an important metric, scikit-learn has a convenient way to calculate the ROC AUC. Let's see what the ROC AUC of the logistic regression classifier is; we can pass the same information that we did to the roc_curve function. Calculate the area under the ROC curve with this code:
metrics.roc_auc_score(y_test, pos_proba)
And observe the output:
0.5434650477972642
The ROC AUC for the logistic regression is pretty close to 0.5, meaning it's not a very effective classifier. This may not be surprising, considering we have expended no effort to determine which features out of the candidate pool are actually useful at this point. We're just getting used to model fitting syntax and learning the way to calculate model quality metrics using a simple model containing only the EDUCATION feature. Later on, by considering other features, hopefully we'll get a higher ROC AUC.
Note
ROC curve: How did it get that name?
During World War II, radar receiver operators were evaluated on their ability to judge whether something that appeared on their radar screen was in fact an enemy aircraft or not. These decisions involved the same concepts of true and false positives and negatives that we are interested in for binary classification. The ROC curve was devised as a way to measure the effectiveness of operators of radar receiver equipment.
Precision
Before embarking on the activity, we will consider the classification metric briefly introduced previously: precision. Like the ROC curve, this diagnostic is useful over a range of thresholds. Precision is defined as follows:
Precision = TP / (TP + FP)
Consider the interpretation of this, in the sense of varying the threshold across the range of predicted probabilities, as we did for the ROC curve. At a high threshold, there will be relatively few samples predicted as positive. As we lower the threshold, more and more will be predicted as positive. Our hope is that as we do this, the number of true positives increases more quickly than the number of false positives, as we saw on the ROC curve. Precision looks at the ratio of the number of true positives to the sum of true and false positives. Think about the denominator here: what is the sum of true and false positives?
This sum is in fact the total number of positive predictions, since all positive predictions will be either correct or incorrect. So, precision measures the ratio of positive predictions that are correct to all positive predictions. For this reason, it is also called the positive predictive value. If there are very few positive samples, precision gives a more critical assessment of the quality of a classifier than the ROC AUC. As with the ROC curve, there is a convenient function in scikit-learn to calculate precision, together with recall (also known as the TPR), over a range of thresholds: metrics.precision_recall_curve. Precision and recall are often plotted together to assess the quality of positive predictions as far as what fraction are correct, while at the same time considering what fraction of the positive class a model is able to identify. We'll plot a precision-recall curve in the following activity.
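As a preview of the call signature, here is a minimal sketch using the test labels and EDUCATION-based predicted probabilities from earlier in this chapter (the activity will do this with a different feature):
precision, recall, thresholds_pr = metrics.precision_recall_curve(y_test, pos_proba)
plt.plot(recall, precision, '*-')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-recall curve')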
Why might precision be a useful measure of classifier performance? Imagine that for every positive model prediction, you are going to take some expensive course of action, such as a time-consuming review of content that was flagged as inappropriate by an automated procedure. False positives would waste the valuable time of human reviewers. You would want to be sure that you were making the right decisions on what content received a detailed review. Precision could be a good metric to use in this situation.
Activity 2.01: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
In this activity, you'll train a logistic regression model using a feature besides EDUCATION. Then you will graphically assess the trade-off between precision and recall, as well as calculate the area underneath a precision-recall curve. You will also calculate the ROC AUC on both the training and test sets and compare them.
Perform the following steps to complete the activity:
Note
The code and the resulting output for this activity have been loaded in a Jupyter notebook that can be found here: https://packt.link/SvAOD.
- Use scikit-learn's train_test_split to make a new set of training and test data. This time, instead of EDUCATION, use LIMIT_BAL, the account's credit limit, as the feature.
- Train a logistic regression model using the training data from your split.
- Create the array of predicted probabilities for the test data.
- Calculate the ROC AUC using the predicted probabilities and the true labels of the test data. Compare this to the ROC AUC from using the EDUCATION feature.
- Plot the ROC curve.
- Calculate the data for the precision-recall curve on the test data using scikit-learn's functionality.
- Plot the precision-recall curve using matplotlib.
- Use scikit-learn to calculate the area under the precision-recall curve. You should get a value of approximately 0.315.
- Now recalculate the ROC AUC, except this time do it for the training data. How is this different, conceptually and quantitatively, from your earlier calculation?
Note
The Jupyter notebook containing the Python code solution for this activity can be found here: https://packt.link/SvAOD. Detailed step-wise solution to this activity can be found via this link.