From Practical Data Science with R, Second Edition by Nina Zumel and John Mount

This article discusses how to evaluate the effectiveness of a classification model, using a spam filter as the running example.

Take 37% off Practical Data Science with R, Second Edition by entering fcczumel3 into the discount code box at checkout.

Evaluating classification models

A classification model places examples into one of two or more categories. For measuring classifier performance, we’ll first introduce the incredibly useful tool called the confusion matrix and show how it can be used to calculate many important evaluation scores. The first score we’ll discuss is accuracy.

Example Scenario

Suppose we want to classify email into spam (email we don’t want) and non-spam (email we want).


A ready-to-go example (with a good description) is the Spambase dataset. Each row of this dataset is a set of features measured for a specific email, plus an additional column telling whether the email was spam (unwanted) or non-spam (wanted). We'll quickly build a spam classification model using logistic regression to get results to evaluate. Download the file Spambase/spamD.tsv from GitHub and then perform the steps shown in the following listing.

Listing 1 Building and applying a logistic regression spam model

 spamD <- read.table('spamD.tsv',header=T,sep='\t') 
 spamTrain <- subset(spamD,spamD$rgroup  >= 10)     
 spamTest <- subset(spamD,spamD$rgroup < 10)
 spamVars <- setdiff(colnames(spamD), list('rgroup','spam')) 
 spamFormula <- as.formula(paste('spam == "spam"',
                paste(spamVars, collapse = ' + '),sep = ' ~ '))
 spamModel <- glm(spamFormula,family = binomial(link = 'logit'), 
                                  data = spamTrain)
 spamTrain$pred <- predict(spamModel,newdata = spamTrain,
                              type = 'response')
 spamTest$pred <- predict(spamModel,newdata = spamTest,  
                             type = 'response')

  Read in the data

  Split the data into training and test sets

  Create a formula which describes the model

 Fit the logistic regression model

 Make predictions on the training and test sets

The spam model predicts the probability that a given email is spam. A sample of the results of our simple spam classifier is shown in the next listing.

Listing 2 Spam classifications

 (sample <- spamTest[c(7,35,224,327), c('spam','pred')])
 ##          spam         pred   
 ## 115      spam 0.9903246227
 ## 361      spam 0.4800498077
 ## 2300 non-spam 0.0006846551
 ## 3428 non-spam 0.0001434345

  The first column gives the actual class label (spam or non-spam). The second column gives the predicted probability that the email is spam. If the probability is greater than 0.5, the email is labeled "spam"; otherwise, it's "non-spam".


The single most useful summary of classifier performance is the confusion matrix. This matrix is a table that summarizes the classifier's predictions against the known data categories.

The confusion matrix is a table counting how often each combination of known outcomes (the truth) occurred in combination with each prediction type. For our email spam example, the confusion matrix is calculated by the following R command.

Listing 3 Spam confusion matrix

 confmat_spam <- table(truth = spamTest$spam,
                       prediction = ifelse(spamTest$pred > 0.5,                         
                                           "spam", "non-spam"))
 ##          prediction
 ## truth   non-spam spam
 ##   non-spam   264   14
 ##   spam        22  158

The rows of the table (labeled truth) correspond to the label of the datums: whether they’re spam or not. The columns of the table (labeled prediction) correspond to the prediction which the model makes. The first cell of the table (truth = “non-spam” and prediction = “non-spam”) corresponds to the 264 emails in the test set which aren’t spam, and that the model (correctly) predicts are not spam. These correct negative predictions are called true negatives.
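To make the cell names concrete, here is a small sketch that re-enters the test-set confusion matrix by hand (as we do for Akismet in listing 4) and pulls out each of the four counts:

```r
# Re-enter the spam filter's test-set confusion matrix by hand
confmat_spam <- as.table(matrix(c(264, 22, 14, 158), nrow = 2,
                dimnames = list(truth = c("non-spam", "spam"),
                                prediction = c("non-spam", "spam"))))

TN <- confmat_spam["non-spam", "non-spam"]  # 264 true negatives
FP <- confmat_spam["non-spam", "spam"]      #  14 false positives
FN <- confmat_spam["spam", "non-spam"]      #  22 false negatives
TP <- confmat_spam["spam", "spam"]          # 158 true positives
```

Indexing by dimension names rather than by position makes it harder to mix up which cell is which.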


Confusion matrix conventions

A number of tools, as well as Wikipedia, draw confusion matrices with the truth values controlling the x-axis in the figure. This is likely due to the math convention that the first coordinate in matrices and tables names the row (vertical offset), not the column (horizontal offset). It's our feeling that direct labels, such as "pred" and "actual", are much clearer than any convention. Also note that in residual graphs the prediction is always on the x-axis, and being visually consistent with this important convention is a benefit.


It’s a standard terminology to refer to datums which are in the class of interest as positive instances, and those not in the class of interest as negative instances. In our scenario, spam emails are positive instances, and non-spam emails are negative instances.

In a two-by-two confusion matrix, every cell has a special name, as illustrated in table 1.

Table 1 Two-by-two confusion matrix

                             Prediction=NEGATIVE          Prediction=POSITIVE
                             (predicted as non-spam)      (predicted as spam)

 Truth=NEGATIVE (non-spam)   True negatives (TN)          False positives (FP)
                             confmat_spam[1,1]=264        confmat_spam[1,2]=14

 Truth=POSITIVE (spam)       False negatives (FN)         True positives (TP)
                             confmat_spam[2,1]=22         confmat_spam[2,2]=158

Using this summary, we can now start to calculate various performance metrics of our spam filter.


Changing a score to a classification

Note that we converted the numerical prediction score into a decision by checking if the score was above or below 0.5. This means that if the model returned a higher than 50% probability that an email is spam, we classify it as spam. For some scoring models (like logistic regression) the 0.5 score is likely a threshold that gives a classifier with reasonably good accuracy. Accuracy isn’t always the end goal, and for unbalanced training data the 0.5 threshold won’t be good. Picking thresholds other than 0.5 can allow the data scientist to trade precision for recall (two terms that we’ll define later in this article). You can start at 0.5, but consider trying other thresholds and looking at the ROC curve (see section 6.2.5).
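As a sketch of trying other thresholds, the helper below builds a confusion matrix for any cutoff; the toy truth/score vectors and the threshold values are made up for illustration:

```r
# Build a confusion matrix for a given score threshold.
# 'truth' is the vector of true labels, 'pred' the predicted probabilities.
confusion_at <- function(truth, pred, threshold) {
  table(truth = truth,
        prediction = ifelse(pred > threshold, "spam", "non-spam"))
}

# Toy example data (made up for illustration)
truth <- c("spam", "spam", "non-spam", "non-spam", "spam")
pred  <- c(0.9, 0.6, 0.4, 0.1, 0.3)

confusion_at(truth, pred, 0.5)
```

On the real data from listing 1 you could compare, say, `confusion_at(spamTest$spam, spamTest$pred, th)` for several values of `th` and watch the off-diagonal counts trade against each other.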



Accuracy answers the question: “When the spam filter says this email is or isn’t spam, what’s the probability that it’s correct?” For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items. It’s what fraction of the time the classifier is correct. This is shown in figure 1.

Figure 1 Accuracy

At the least, you want a classifier to be accurate. Let’s calculate the accuracy of the spam filter:

 (confmat_spam[1,1] + confmat_spam[2,2]) / sum(confmat_spam)
 ## [1] 0.9213974                     

The error of around 8% is unacceptably high for a spam filter, but it’s good for illustrating different sorts of model evaluation criteria.

Before we move on, we’d like to share the confusion matrix of a good spam filter. In the next listing we create the confusion matrix for the Akismet comment spam filter from the Win-Vector blog.

Listing 4 Entering the Akismet confusion matrix by hand

 confmat_akismet <- as.table(matrix(data=c(288-1,17,1,13882-17),nrow=2,ncol=2))
 rownames(confmat_akismet) <- rownames(confmat_spam)
 colnames(confmat_akismet) <- colnames(confmat_spam)
 ##       non-spam  spam
 ## non-spam   287     1
 ## spam        17 13865

Because the Akismet filter uses link destination clues and determination from other websites (in addition to text features), it achieves a more acceptable accuracy.

 (confmat_akismet[1,1] + confmat_akismet[2,2]) / sum(confmat_akismet)  
 ## [1] 0.9987297                      

More importantly, Akismet seems to have suppressed fewer good comments. Our next section on precision and recall will help quantify this distinction.


Accuracy is an inappropriate measure for unbalanced classes

Suppose we have a situation with a rare event (say, severe complications during childbirth). If the event we're trying to predict is rare (say, around 1% of the population), the null model that says the rare event never happens is extremely (99%) accurate. The null model is more accurate than a useful (but not perfect) model that identifies 5% of the population as being "at risk" and captures all of the bad events within that 5%. This isn't any sort of paradox; it's that accuracy isn't a good measure for events that have an unbalanced distribution or unbalanced costs.
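The arithmetic is easy to check. This sketch uses made-up counts (a population of 10,000 with a 1% event rate) to compare the null model's accuracy to the useful model described above:

```r
pop <- 10000
positives <- 100                  # 1% rare event

# Null model: predict "no event" for everyone
null_correct <- pop - positives
null_acc <- null_correct / pop    # 0.99 accuracy, yet useless

# Useful model: flags 5% of the population and catches every event
flagged <- 500
useful_correct <- positives + (pop - flagged)   # TP + TN
useful_acc <- useful_correct / pop              # 0.96 accuracy: lower, but far more useful
```

The null model "wins" on accuracy while finding none of the events we care about, which is exactly why accuracy is the wrong yardstick here.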



Another evaluation measure used by machine learning researchers is a pair of numbers called precision and recall. These terms come from the field of information retrieval and are defined as follows.

Precision answers the question “If the spam filter says this email is spam, what’s the probability that it’s spam?” Precision is defined as the ratio of true positives to predicted positives. This is shown in figure 2.

Figure 2 Precision

We can calculate the precision of our spam filter as follows:

 confmat_spam[2,2] / (confmat_spam[2,2]+ confmat_spam[1,2])
 ## [1] 0.9186047                       

It's only a coincidence that the precision is close to the accuracy number we reported earlier. Again, precision is how often a positive indication turns out to be correct. It's important to remember that precision is a function of the combination of the classifier and the dataset. It doesn't make sense to ask how precise a classifier is in isolation; it's only sensible to ask how precise a classifier is for a given dataset. The hope is that the classifier is similarly precise on the overall population that the dataset is drawn from: a population with the same distribution of positive instances as the dataset.

In our email spam example, 92% precision means 8% of what was flagged as spam wasn't spam. This is an unacceptable rate for losing possibly important messages. Akismet, on the other hand, had a precision of over 99.99%, so it throws out very little non-spam email.

 confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[1,2])
 ## [1] 0.9999279                       

The companion score to precision is recall. Recall answers the question “Of all the spam in the email set, what fraction did the spam filter detect?” Recall is the ratio of true positives over all positives, as shown in figure 3.

Figure 3 Recall

Let’s compare the recall of the two spam filters.

 confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])
 ## [1] 0.8777778
 confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[2,1])
 ## [1] 0.9987754

For our email spam filter this is 88%, which means about 12% of the spam email we receive still makes it into our inbox. Akismet has a recall of 99.88%. In both cases most spam is tagged (we have high recall) and precision is emphasized over recall. This is appropriate for a spam filter, because it’s more important to not lose non-spam email than it is to filter every single piece of spam out of our inbox.

It’s important to remember this: precision is a measure of confirmation (when the classifier indicates positive, how often it’s correct), and recall is a measure of utility (how much the classifier finds of what there is to find). Precision and recall tend to be relevant to business needs and are good measures to discuss with your project sponsor and client.


Example scenario:

Suppose you had multiple spam filters to choose from, each with different values of precision and recall. How do you pick the spam filter to use?


In situations like this, some people prefer to have one number to compare all the different choices by. One such score is the F1 score. The F1 score measures a tradeoff between precision and recall. It’s defined as the harmonic mean of the precision and recall. This is most easily shown with an explicit calculation.

 precision <- confmat_spam[2,2] / (confmat_spam[2,2]+ confmat_spam[1,2])
 recall <- confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])
 (F1 <- 2 * precision * recall / (precision + recall) )
 ## [1] 0.8977273                       

Our spam filter with 0.92 precision and 0.88 recall has an F1 score of 0.90. F1 is one when a classifier has perfect precision and recall, and goes to zero for classifiers which have either low precision or recall (or both). Suppose you think that your spam filter is losing too much real email, and you want to make it "pickier" about marking email as spam; that is, you want to increase its precision. Quite often, increasing the precision of a classifier also lowers its recall: in this case, a pickier spam filter may also mark fewer real spam emails as spam, allowing more spam into your inbox. If the filter's recall falls too low as its precision increases, this results in a lower F1. This possibly means that you traded off too much recall for better precision.
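As a sketch of this tradeoff, compare our filter's F1 to that of a hypothetical pickier filter (the 0.98/0.60 precision/recall pair is made up for illustration):

```r
# Harmonic mean of precision and recall
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

f1(0.92, 0.88)   # our filter at the 0.5 threshold: about 0.90
f1(0.98, 0.60)   # hypothetical pickier filter: about 0.74
```

Even though the pickier filter has noticeably higher precision, its much lower recall drags the harmonic mean down, signaling that too much recall was given up.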


Example Scenario:

You have successfully trained a spam filter with acceptable precision and recall, using your work email as training data. Now you want to use that same spam filter on a personal email account that you use primarily for your photography hobby. Will the filter work as well?


It's possible the filter will work fine on your personal email as-is, because the nature of spam (the length of the email, the words used, the number of links, and so on) probably doesn't change much between the two email accounts. But the proportion of spam you get on the personal email account may be different than on your work email, and this can change the performance of the spam filter on your personal email.


The spam filter performance can also change because the nature of the non-spam is different, too: the words commonly used are different; the number of links or images in a legitimate email may be different; the email domains of people you correspond with may be different. For this discussion, we assume that the proportion of spam email is the main reason that a spam filter’s performance is different. 


Let's see how changes in the proportion of spam can change the performance metrics of the spam filter. Here, we simulate having email sets with both higher and lower proportions of spam than the data that we trained the filter on.

Listing 5 Comparing spam filter performance on data with different proportions of spam

 N <- nrow(spamTest)
 pull_out_ix <- sample.int(N, 100, replace = FALSE)
 removed <- spamTest[pull_out_ix, ]                 
 get_performance <- function(sTest) {              
   proportion <- mean(sTest$spam == "spam")
   confmat_spam <- table(truth = sTest$spam,
                         prediction = ifelse(sTest$pred > 0.5,
                                             "spam", "non-spam"))
   precision <- confmat_spam[2,2] / sum(confmat_spam[,2])
   recall <- confmat_spam[2,2] / sum(confmat_spam[2,])
   list(spam_proportion = proportion,
        confmat_spam = confmat_spam,
        precision = precision, recall = recall)
 }
 sTest <- spamTest[-pull_out_ix, ]
 get_performance(sTest)         
 ## $spam_proportion
 ## [1] 0.3994413
 ## $confmat_spam
 ##           prediction
 ## truth      non-spam spam
 ##   non-spam      204   11
 ##   spam           17  126
 ## $precision
 ## [1] 0.919708
 ## $recall
 ## [1] 0.8811189
 get_performance(rbind(sTest, subset(removed, spam=="spam")))
 ## $spam_proportion        
 ## [1] 0.4556962
 ## $confmat_spam
 ##           prediction
 ## truth      non-spam spam
 ##   non-spam      204   11
 ##   spam           22  158
 ## $precision
 ## [1] 0.9349112
 ## $recall
 ## [1] 0.8777778
 get_performance(rbind(sTest, subset(removed, spam=="non-spam")))  
 ## $spam_proportion
 ## [1] 0.3396675
 ## $confmat_spam
 ##           prediction
 ## truth      non-spam spam
 ##   non-spam      264   14
 ##   spam           17  126
 ## $precision
 ## [1] 0.9
 ## $recall
 ## [1] 0.8811189

❶  Pull one hundred emails out of the test set at random.

❷  A convenience function to print out the confusion matrix, precision, and recall of the filter on a test set.

❸  Look at performance on a test set with the same proportion of spam as the training data

❹  Add back only additional spam, and the test set has a higher proportion of spam than the training set

❺  Add back only non-spam, and the test set has a lower proportion of spam than the training set.

Note that the recall of the filter is the same in all three cases: about 88%. When the data has more spam than the filter was trained on, the filter has higher precision, which means it throws a lower proportion of non-spam email out. This is good! When the data has less spam than the filter was trained on, the precision is lower, meaning the filter throws out a higher fraction of non-spam email. This is undesirable.

Because there are situations where a classifier or filter may be used on populations where the prevalence of the positive class (in this example, spam) varies, it’s useful to have performance metrics which are independent of the class prevalence. One such pair of metrics is sensitivity and specificity. This pair of metrics is common in medical research, because tests for diseases and other conditions are used on different populations, with differing prevalence of a given disease or condition.

Sensitivity is also called the true positive rate and it’s exactly equal to recall. Specificity is also called the true negative rate: it’s the ratio of true negatives to all negatives. This is shown in figure 4.

Figure 4 Specificity

Sensitivity and recall answer the question "What fraction of the spam does the spam filter find?" Specificity answers the question "What fraction of the non-spam does the spam filter correctly identify as non-spam?"

We can calculate specificity for our spam filter:

 confmat_spam[1,1] / (confmat_spam[1,1] + confmat_spam[1,2])
 ## [1] 0.9496403                       

One minus the specificity is also called the false positive rate. False positive rate answers the question “What fraction of non-spam will the model classify as spam?” You want the false positive rate to be low (or the specificity to be high), and the sensitivity to also be high. Our spam filter has a specificity of about 0.95, which means that it marks about 5% of non-spam email as spam.
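A small helper makes the pairing explicit. This sketch re-enters the test-set confusion matrix by hand (as in listing 4) and computes all three quantities:

```r
# Re-enter the spam filter's test-set confusion matrix by hand
confmat_spam <- as.table(matrix(c(264, 22, 14, 158), nrow = 2,
                dimnames = list(truth = c("non-spam", "spam"),
                                prediction = c("non-spam", "spam"))))

sensitivity <- function(cm) cm[2, 2] / sum(cm[2, ])  # TP / all true positives
specificity <- function(cm) cm[1, 1] / sum(cm[1, ])  # TN / all true negatives

sensitivity(confmat_spam)                  # 0.878 -- identical to recall
specificity(confmat_spam)                  # 0.950
fpr <- 1 - specificity(confmat_spam)       # false positive rate, 0.050
```

Note that both functions divide by a row sum of the confusion matrix, which is why neither quantity depends on the proportion of spam in the test set.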

An important property of sensitivity and specificity is this: if you flip your labels (switch from spam being the class you’re trying to identify to non-spam being the class you’re trying to identify), you switch sensitivity and specificity. Also, a trivial classifier that always says positive or always says negative always returns a zero score on either sensitivity or specificity. Useless classifiers always score poorly on at least one of these measures.
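You can verify the label-flipping property directly: making non-spam the positive class just reverses the rows and columns of the confusion matrix, which swaps the two row-based ratios. A sketch:

```r
cm <- as.table(matrix(c(264, 22, 14, 158), nrow = 2,
      dimnames = list(truth = c("non-spam", "spam"),
                      prediction = c("non-spam", "spam"))))

sens <- function(cm) cm[2, 2] / sum(cm[2, ])
spec <- function(cm) cm[1, 1] / sum(cm[1, ])

# Flip the labels: non-spam becomes the class we're trying to identify
flipped <- cm[2:1, 2:1]

sens(cm) == spec(flipped)   # TRUE: old sensitivity is the flipped specificity
spec(cm) == sens(flipped)   # TRUE: old specificity is the flipped sensitivity
```

This symmetry is one reason the sensitivity/specificity pair is a natural way to describe how well a test separates two classes, regardless of which class you call "positive."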

Why have both precision/recall and sensitivity/specificity? Historically, these measures come from different fields, but each has advantages. Sensitivity/specificity is good for fields, like medicine, where it’s important to have an idea how well a classifier, test, or filter separates positive from negative instances independently of the distribution of the different classes in the population. But precision/recall give you an idea how well a classifier or filter works on a specific population. If you want to know the probability that an email identified as spam is really spam, you must know how common spam is in that person’s email box, and the appropriate measure is precision.


You should use these standard scores while working with your client and sponsor to decide which measures best model their business needs. For each score, ask whether they need that score to be high, and then run a quick thought experiment with them to confirm you've captured their business need. You should then be able to write a project goal in terms of a minimum bound on a pair of these measures. Table 2 shows a typical business need and an example follow-up question for each measure.

Table 2 Classifier performance measures business stories


Accuracy
Typical business need: "We need most of our decisions to be correct."
Follow-up question: "Can we tolerate being wrong 5% of the time? And do users see mistakes like spam marked as non-spam or non-spam marked as spam as being equivalent?"

Precision
Typical business need: "Most of what we marked as spam had darn well better be spam."
Follow-up question: "That guarantees that most of what's in the spam folder is spam, but it isn't the best way to measure what fraction of the user's legitimate email is lost. We could cheat on this goal by sending all our users a bunch of easy-to-identify spam which we correctly identify. Maybe we want good specificity."

Recall
Typical business need: "We want to cut down on the amount of spam a user sees by a factor of ten (eliminate 90% of the spam)."
Follow-up question: "If 10% of the spam gets through, will the user see mostly non-spam mail or mostly spam? Will this result in a good user experience?"

Sensitivity
Typical business need: "We have to cut a lot of spam, otherwise the user won't see a benefit."
Follow-up question: "If we cut spam down to 1% of what it is now, would that be a good user experience?"

Specificity
Typical business need: "We must be at least three nines on legitimate email; the user must see at least 99.9% of their non-spam email."
Follow-up question: "Will the user tolerate missing 0.1% of their legitimate email, and should we keep a spam folder the user can look at?"


One conclusion for this dialogue process on spam classification could be to recommend writing the business goals as maximizing sensitivity while maintaining a specificity of at least 0.999.

That’s all for this article. If you want to learn more about the book, you can check it out on our browser-based liveBook reader here and in this slide deck.