From Practical Data Science with R, Second Edition by Nina Zumel and John Mount. This article discusses how to evaluate the effectiveness of a classification model, using a spam filter as the running example.
Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout at manning.com.
Evaluating classification models
A classification model places examples into one of two or more categories. For measuring classifier performance, we’ll first introduce the incredibly useful tool called the confusion matrix and show how it can be used to calculate many important evaluation scores. The first score we’ll discuss is accuracy.
A ready-to-go example (with a good description) is the Spambase dataset. Each row of this dataset is a set of features measured for a specific email, plus an additional column telling whether the mail was spam (unwanted) or nonspam (wanted). We'll quickly build a spam classification model using logistic regression to get results to evaluate. Download the file Spambase/spamD.tsv from GitHub and then perform the steps shown in the following listing.
Listing 1 Building and applying a logistic regression spam model
spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')     ❶

spamTrain <- subset(spamD, spamD$rgroup >= 10)                  ❷
spamTest <- subset(spamD, spamD$rgroup < 10)

spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam'))

spamFormula <- as.formula(paste('spam == "spam"',               ❸
                                paste(spamVars, collapse = ' + '),
                                sep = ' ~ '))

spamModel <- glm(spamFormula,                                   ❹
                 family = binomial(link = 'logit'),
                 data = spamTrain)

spamTrain$pred <- predict(spamModel, newdata = spamTrain,       ❺
                          type = 'response')
spamTest$pred <- predict(spamModel, newdata = spamTest,
                         type = 'response')
❶ Read in the data
❷ Split the data into training and test sets
❸ Create a formula which describes the model
❹ Fit the logistic regression model
❺ Make predictions on the training and test sets
The spam model predicts the probability that a given email is spam. A sample of the results of our simple spam classifier is shown in the next listing.
Listing 2 Spam classifications
sample <- spamTest[c(7,35,224,327), c('spam','pred')]
print(sample)
## spam pred ❶
## 115 spam 0.9903246227
## 361 spam 0.4800498077
## 2300 nonspam 0.0006846551
## 3428 nonspam 0.0001434345
❶ The first column gives the actual class label (spam or nonspam). The second column gives the predicted probability that an email is spam. If the probability is greater than 0.5, the email is labeled "spam"; otherwise, it's labeled "nonspam".
THE CONFUSION MATRIX
The absolute most interesting summary of classifier performance is the confusion matrix. This matrix is a table counting how often each combination of known outcomes (the truth) occurred in combination with each prediction type. For our email spam example, the confusion matrix is calculated by the following R command.
Listing 3 Spam confusion matrix
confmat_spam <- table(truth = spamTest$spam,
                      prediction = ifelse(spamTest$pred > 0.5,
                                          "spam", "nonspam"))
print(confmat_spam)
##          prediction
## truth     nonspam spam
##   nonspam     264   14
##   spam         22  158
The rows of the table (labeled truth) correspond to the true labels of the data: whether each email is spam or not. The columns of the table (labeled prediction) correspond to the predictions the model makes. The first cell of the table (truth = "nonspam" and prediction = "nonspam") corresponds to the 264 emails in the test set that aren't spam, and that the model (correctly) predicts aren't spam. These correct negative predictions are called true negatives.
NOTE 
Confusion matrix conventions A number of tools, as well as Wikipedia, draw confusion matrices with the truth values controlling the x-axis in the figure. This is likely due to the math convention that the first coordinate in matrices and tables names the row (vertical offset), not the column (horizontal offset). It's our feeling that direct labels, such as "pred" and "actual", are much clearer than any convention. Also note that in residual graphs the prediction is always on the x-axis, and being visually consistent with this important convention is a benefit.
It's standard terminology to refer to data items which are in the class of interest as positive instances, and those not in the class of interest as negative instances. In our scenario, spam emails are positive instances, and nonspam emails are negative instances.
In a two-by-two confusion matrix, every cell has a special name, as illustrated in table 1.
Table 1 Two-by-two confusion matrix

                               Prediction = NEGATIVE      Prediction = POSITIVE
                               (predicted as nonspam)     (predicted as spam)
Truth = NEGATIVE (nonspam)     True negatives (TN)        False positives (FP)
Truth = POSITIVE (spam)        False negatives (FN)       True positives (TP)
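The names in table 1 map directly onto the confmat_spam table from listing 3. As a small sketch (the variable names TN, FP, FN, and TP are ours, introduced only for illustration):

```r
# Rows are truth, columns are prediction, as in listing 3
TN <- confmat_spam[1, 1]  # true negatives: nonspam predicted as nonspam (264)
FP <- confmat_spam[1, 2]  # false positives: nonspam predicted as spam (14)
FN <- confmat_spam[2, 1]  # false negatives: spam predicted as nonspam (22)
TP <- confmat_spam[2, 2]  # true positives: spam predicted as spam (158)
```

All the scores discussed below are ratios built from these four counts.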
Using this summary, we can now start to calculate various performance metrics of our spam filter.
TIP 
Changing a score to a classification Note that we converted the numerical prediction score into a decision by checking if the score was above or below 0.5. This means that if the model returned a higher than 50% probability that an email is spam, we classify it as spam. For some scoring models (like logistic regression) the 0.5 score is likely a threshold that gives a classifier with reasonably good accuracy. Accuracy isn’t always the end goal, and for unbalanced training data the 0.5 threshold won’t be good. Picking thresholds other than 0.5 can allow the data scientist to trade precision for recall (two terms that we’ll define later in this article). You can start at 0.5, but consider trying other thresholds and looking at the ROC curve (see section 6.2.5). 
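For instance, a stricter threshold of 0.9 (an arbitrary value we use here only for illustration) marks an email as spam only when the model is at least 90% sure. A minimal sketch, reusing spamTest and its pred column from listing 1:

```r
# Rebuild the confusion matrix at a stricter, illustrative threshold of 0.9.
# Fewer emails clear the higher bar, so precision tends to rise
# while recall tends to fall.
confmat_spam_strict <- table(truth = spamTest$spam,
                             prediction = ifelse(spamTest$pred > 0.9,
                                                 "spam", "nonspam"))
print(confmat_spam_strict)
```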
ACCURACY
Accuracy answers the question: “When the spam filter says this email is or isn’t spam, what’s the probability that it’s correct?” For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items. It’s what fraction of the time the classifier is correct. This is shown in figure 1.
Figure 1 Accuracy
At the least, you want a classifier to be accurate. Let’s calculate the accuracy of the spam filter:
(confmat_spam[1,1] + confmat_spam[2,2]) / sum(confmat_spam)
## [1] 0.9213974
The error of around 8% is unacceptably high for a spam filter, but it’s good for illustrating different sorts of model evaluation criteria.
Before we move on, we’d like to share the confusion matrix of a good spam filter. In the next listing we create the confusion matrix for the Akismet comment spam filter from the WinVector blog.
Listing 4 Entering the Akismet confusion matrix by hand
confmat_akismet <- as.table(matrix(data = c(288, 17, 1, 13865),
                                   nrow = 2, ncol = 2))
rownames(confmat_akismet) <- rownames(confmat_spam)
colnames(confmat_akismet) <- colnames(confmat_spam)
print(confmat_akismet)
##         nonspam  spam
## nonspam     288     1
## spam         17 13865
Because the Akismet filter uses link destination clues and determination from other websites (in addition to text features), it achieves a more acceptable accuracy.
(confmat_akismet[1,1] + confmat_akismet[2,2]) / sum(confmat_akismet)
## [1] 0.9987297
More importantly, Akismet seems to have suppressed fewer good comments. Our next section on precision and recall will help quantify this distinction.
WARNING 
Accuracy is an inappropriate measure for unbalanced classes Suppose we have a situation with a rare event (say, severe complications during childbirth). If the event we're trying to predict is rare (say, around 1% of the population), the null model that says the rare event never happens is extremely (99%) accurate. The null model is more accurate than a useful (but not perfect) model that identifies 5% of the population as being "at risk" and captures all of the bad events within that 5%. This isn't any sort of paradox: accuracy simply isn't a good measure for events that have an unbalanced distribution or unbalanced costs.
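The arithmetic in this warning is easy to check with the made-up population it describes:

```r
n <- 10000         # illustrative population size
n_event <- 100     # the rare event occurs in 1% of the population

# Null model: predict "no event" for everyone; it's wrong only on the 100 events
null_accuracy <- (n - n_event) / n        # 0.99

# Useful model: flags 5% (500 people) as "at risk" and captures all 100 events,
# so TP = 100, FP = 400, TN = 9500, FN = 0
useful_accuracy <- (100 + 9500) / n       # 0.96

null_accuracy > useful_accuracy           # TRUE: the useless model looks better
```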
PRECISION AND RECALL
Another evaluation measure used by machine learning researchers is a pair of numbers called precision and recall. These terms come from the field of information retrieval and are defined as follows.
Precision answers the question “If the spam filter says this email is spam, what’s the probability that it’s spam?” Precision is defined as the ratio of true positives to predicted positives. This is shown in figure 2.
Figure 2 Precision
We can calculate the precision of our spam filter as follows:
confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[1,2])
## [1] 0.9186047
It's only a coincidence that the precision is close to the accuracy number we reported earlier. Again, precision is how often a positive indication turns out to be correct. It's important to remember that precision is a function of the combination of the classifier and the dataset. It doesn't make sense to ask how precise a classifier is in isolation; it's only sensible to ask how precise a classifier is for a given dataset. The hope is that the classifier is similarly precise on the overall population the dataset is drawn from: a population with the same distribution of positive instances as the dataset.
In our email spam example, 92% precision means 8% of what was flagged as spam wasn't spam. This is an unacceptable rate for losing possibly important messages. Akismet, on the other hand, had a precision of over 99.99%, so it throws out very few nonspam emails.
confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[1,2])
## [1] 0.9999279
The companion score to precision is recall. Recall answers the question “Of all the spam in the email set, what fraction did the spam filter detect?” Recall is the ratio of true positives over all positives, as shown in figure 3.
Figure 3 Recall
Let’s compare the recall of the two spam filters.
confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])
## [1] 0.8777778

confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[2,1])
## [1] 0.9987754
For our email spam filter this is 88%, which means about 12% of the spam email we receive still makes it into our inbox. Akismet has a recall of 99.88%. In both cases most spam is tagged (we have high recall) and precision is emphasized over recall. This is appropriate for a spam filter, because it’s more important to not lose nonspam email than it is to filter every single piece of spam out of our inbox.
It’s important to remember this: precision is a measure of confirmation (when the classifier indicates positive, how often it’s correct), and recall is a measure of utility (how much the classifier finds of what there is to find). Precision and recall tend to be relevant to business needs and are good measures to discuss with your project sponsor and client.
F1
In situations like this, some people prefer to have one number with which to compare all the different choices. One such score is the F1 score. The F1 score measures a trade-off between precision and recall. It's defined as the harmonic mean of the precision and recall. This is most easily shown with an explicit calculation.
precision <- confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[1,2])
recall <- confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])

(F1 <- 2 * precision * recall / (precision + recall))
## [1] 0.8977273
Our spam filter, with 0.92 precision and 0.88 recall, has an F1 score of 0.90. F1 is one when a classifier has perfect precision and recall, and goes to zero for classifiers with either low precision or low recall (or both). Suppose you think your spam filter is losing too much real email and you want to make it "pickier" about marking email as spam; that is, you want to increase its precision. Quite often, increasing the precision of a classifier also lowers its recall: in this case, a pickier spam filter marks fewer real spam emails as spam, allowing them into your inbox. If the filter's recall falls too far as its precision increases, the F1 drops, which possibly means that you traded away too much recall for the better precision.
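One way to see this trade-off concretely is to recompute precision, recall, and F1 at several thresholds. The helper below is our own sketch, not one of the book's listings; it assumes the spamTest frame and its pred column from listing 1:

```r
# Hypothetical helper: precision, recall, and F1 at a given score threshold
score_threshold <- function(threshold) {
  pred_label <- ifelse(spamTest$pred > threshold, "spam", "nonspam")
  tp <- sum(pred_label == "spam" & spamTest$spam == "spam")
  fp <- sum(pred_label == "spam" & spamTest$spam == "nonspam")
  fn <- sum(pred_label == "nonspam" & spamTest$spam == "spam")
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(precision = precision,
    recall = recall,
    F1 = 2 * precision * recall / (precision + recall))
}

# A pickier threshold should raise precision and lower recall (and perhaps F1)
score_threshold(0.5)
score_threshold(0.9)
```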
SENSITIVITY AND SPECIFICITY
Suppose you want to use the spam filter, which was trained on your work email, on your personal email account. It's possible the filter will work fine as-is, because the nature of spam (the length of the email, the words used, the number of links, etc.) probably doesn't change much between the two email accounts. But the proportion of spam you get on your personal account may be different than on your work email, and this can change the measured performance of the spam filter on your personal email.
Note: 
The spam filter performance can also change because the nature of the nonspam is different, too: the words commonly used are different; the number of links or images in a legitimate email may be different; the email domains of people you correspond with may be different. For this discussion, we assume that the proportion of spam email is the main reason that a spam filter’s performance is different. 
Let's see how changes in the proportion of spam can change the performance metrics of the spam filter. Here, we simulate having email sets with both higher and lower proportions of spam than the data that we trained the filter on.
Listing 5 Comparing spam filter performance on data with different proportions of spam
set.seed(234641)

N <- nrow(spamTest)
pull_out_ix <- sample.int(N, 100, replace = FALSE)
removed <- spamTest[pull_out_ix, ]                               ❶

get_performance <- function(sTest) {                             ❷
    proportion <- mean(sTest$spam == "spam")
    confmat_spam <- table(truth = sTest$spam,
                          prediction = ifelse(sTest$pred > 0.5,
                                              "spam", "nonspam"))
    precision <- confmat_spam[2,2] / sum(confmat_spam[,2])
    recall <- confmat_spam[2,2] / sum(confmat_spam[2,])
    list(spam_proportion = proportion,
         confmat_spam = confmat_spam,
         precision = precision,
         recall = recall)
}

sTest <- spamTest[-pull_out_ix, ]                                ❸
get_performance(sTest)
## $spam_proportion
## [1] 0.3994413
##
## $confmat_spam
##          prediction
## truth     nonspam spam
##   nonspam     204   11
##   spam         17  126
##
## $precision
## [1] 0.919708
##
## $recall
## [1] 0.8811189

get_performance(rbind(sTest, subset(removed, spam == "spam")))   ❹
## $spam_proportion
## [1] 0.4556962
##
## $confmat_spam
##          prediction
## truth     nonspam spam
##   nonspam     204   11
##   spam         22  158
##
## $precision
## [1] 0.9349112
##
## $recall
## [1] 0.8777778

get_performance(rbind(sTest, subset(removed, spam == "nonspam"))) ❺
## $spam_proportion
## [1] 0.3396675
##
## $confmat_spam
##          prediction
## truth     nonspam spam
##   nonspam     264   14
##   spam         17  126
##
## $precision
## [1] 0.9
##
## $recall
## [1] 0.8811189
❶ Pull one hundred emails out of the test set at random.
❷ A convenience function to print out the confusion matrix, precision, and recall of the filter on a test set.
❸ Look at performance on a test set with the same proportion of spam as the training data
❹ Add back only additional spam, and the test set has a higher proportion of spam than the training set
❺ Add back only nonspam, and the test set has a lower proportion of spam than the training set.
Note that the recall of the filter is the same in all three cases: about 88%. When the data has more spam than the filter was trained on, the filter has higher precision, which means it throws a lower proportion of nonspam email out. This is good! When the data has less spam than the filter was trained on, the precision is lower, meaning the filter throws out a higher fraction of nonspam email. This is undesirable.
Because there are situations where a classifier or filter may be used on populations where the prevalence of the positive class (in this example, spam) varies, it’s useful to have performance metrics which are independent of the class prevalence. One such pair of metrics is sensitivity and specificity. This pair of metrics is common in medical research, because tests for diseases and other conditions are used on different populations, with differing prevalence of a given disease or condition.
Sensitivity is also called the true positive rate and it’s exactly equal to recall. Specificity is also called the true negative rate: it’s the ratio of true negatives to all negatives. This is shown in figure 4.
Figure 4 Specificity
Sensitivity and recall answer the question, "What fraction of the spam does the spam filter find?" Specificity answers the question, "What fraction of the nonspam does the spam filter correctly identify as nonspam?"
We can calculate specificity for our spam filter:
confmat_spam[1,1] / (confmat_spam[1,1] + confmat_spam[1,2])
## [1] 0.9496403
One minus the specificity is also called the false positive rate. False positive rate answers the question “What fraction of nonspam will the model classify as spam?” You want the false positive rate to be low (or the specificity to be high), and the sensitivity to also be high. Our spam filter has a specificity of about 0.95, which means that it marks about 5% of nonspam email as spam.
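The false positive rate can be read off the same confusion matrix:

```r
# False positive rate: fraction of nonspam that the filter flags as spam
confmat_spam[1,2] / (confmat_spam[1,1] + confmat_spam[1,2])
## [1] 0.05035971
```

This agrees with one minus the specificity of 0.9496403 computed above.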
An important property of sensitivity and specificity is this: if you flip your labels (switch from spam being the class you're trying to identify to nonspam being the class you're trying to identify), you just swap sensitivity and specificity. Also, a trivial classifier that always says positive or always says negative will return a zero score on specificity or sensitivity, respectively. Useless classifiers always score poorly on at least one of these measures.
Why have both precision/recall and sensitivity/specificity? Historically, these measures come from different fields, but each has advantages. Sensitivity/specificity is good for fields, like medicine, where it’s important to have an idea how well a classifier, test, or filter separates positive from negative instances independently of the distribution of the different classes in the population. But precision/recall give you an idea how well a classifier or filter works on a specific population. If you want to know the probability that an email identified as spam is really spam, you must know how common spam is in that person’s email box, and the appropriate measure is precision.
SUMMARY: USING COMMON CLASSIFICATION PERFORMANCE MEASURES
You should use these standard scores while working with your client and sponsor to see which measure best models their business needs. For each score, ask whether they need that score to be high, and then run a quick thought experiment with them to confirm you've captured their business need. You should then be able to write a project goal in terms of a minimum bound on a pair of these measures. Table 2 shows a typical business need and an example follow-up question for each measure.
Table 2 Classifier performance measures business stories

Accuracy
  Typical business need: "We need most of our decisions to be correct."
  Follow-up question: "Can we tolerate being wrong 5% of the time? And do users see mistakes like spam marked as nonspam or nonspam marked as spam as being equivalent?"

Precision
  Typical business need: "Most of what we marked as spam had darn well better be spam."
  Follow-up question: "That guarantees that most of what's in the spam folder is spam, but it isn't the best way to measure what fraction of the user's legitimate email is lost. We could cheat on this goal by sending all our users a bunch of easy-to-identify spam which we correctly identify. Maybe we want good specificity."

Recall
  Typical business need: "We want to cut down on the amount of spam a user sees by a factor of ten (eliminate 90% of the spam)."
  Follow-up question: "If 10% of the spam gets through, will the user see mostly nonspam mail or mostly spam? Will this result in a good user experience?"

Sensitivity
  Typical business need: "We have to cut a lot of spam; otherwise, the user won't see a benefit."
  Follow-up question: "If we cut spam down to 1% of what it is now, would that be a good user experience?"

Specificity
  Typical business need: "We must be at least three nines on legitimate email; the user must see at least 99.9% of their nonspam email."
  Follow-up question: "Will the user tolerate missing 0.1% of their legitimate email, and should we keep a spam folder the user can look at?"
One conclusion for this dialogue process on spam classification could be to recommend writing the business goals as maximizing sensitivity while maintaining a specificity of at least 0.999.
That's all for this article. If you want to learn more about the book, you can check it out on our browser-based liveBook reader here and in this slide deck.