By Henrik Brink, Joseph W. Richards, and Mark Fetherolf

In this article, excerpted from Real-World Machine Learning, we describe the difficulties that arise when evaluating ML models.


Figure 1

The primary goal of supervised machine learning is accurate prediction. We want our ML model to be as accurate as possible when predicting on new data (for which the target variable is unknown). Said in a different way, we want our models, which have been built from some training data, to generalize well to new data. That way, when we deploy the model in production, we can be assured that the predictions generated are of high quality.

Therefore, when we evaluate the performance of a model, we want to determine how well that model will perform on new data. This seemingly simple task is fraught with complications and pitfalls that can befuddle even the most experienced ML users. In this article, we describe the difficulties that arise when evaluating ML models and propose a simple workflow to overcome those issues and achieve unbiased estimates of model performance.

The Problem: Over-fitting and Model Optimism

To describe the challenges associated with estimating the predictive accuracy of a model, it is easiest to start with an example.

Imagine that we want to predict the production of bushels of corn per acre on a farm as a function of the proportion of that farm’s planting area that was treated with a new pesticide. We have training data for 100 farms for this regression problem. A plot of the target (bushels of corn per acre) versus the feature (% of the farm treated) makes it clear that an increasing, non-linear relationship exists and that the data also have random fluctuations (see Figure 2).


Figure 2

Now, suppose that we want to use a simple non-parametric ML regression modeling technique to build a predictive model for corn production as a function of proportion of land treated. One of the simplest ML regression models is kernel smoothing. Kernel smoothing operates by taking local averages: for each new data point, the value of the target variable is modeled as the average of the target variable for only the training data whose feature value is close to the feature value of the new data point. A single parameter, called the bandwidth parameter, controls the size of the window for the local averaging.
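The local-averaging idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code behind the article's figures: the Gaussian weighting and the function name `kernel_smooth` are assumptions, since the article does not say which kernel was used.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Predict y at x_new as a locally weighted average of y_train.

    Assumes a Gaussian kernel; `bandwidth` sets the width of the averaging window.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))

    # squared distance from every new point (rows) to every training point (columns)
    sq_dist = (x_new[:, None] - x_train[None, :]) ** 2
    # subtract each row's minimum so the largest weight in a row is exactly 1;
    # the normalized weights are unchanged and tiny bandwidths cannot produce 0/0
    sq_dist -= sq_dist.min(axis=1, keepdims=True)
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)

    # weighted average of the training targets
    return weights @ y_train / weights.sum(axis=1)

x = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, 2.0, 3.0])
print(kernel_smooth(x, y, [0.0], bandwidth=0.01))   # tiny window: follows the nearest point
print(kernel_smooth(x, y, [0.0], bandwidth=100.0))  # huge window: close to the mean of y
```

With a very small bandwidth the prediction is dominated by the single nearest training point, and with a very large bandwidth it approaches the overall average of the target.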

Figure 3 demonstrates what happens for different values of the kernel smoothing bandwidth parameter. For large values of the bandwidth, almost all of the training data are averaged together to predict the target at each value of the input feature. This causes the model to be very flat and to under-fit the obvious trend in the training data. Likewise, for very small values of the bandwidth, only one or two training instances are used to determine the model output at each feature value. Therefore, the model effectively traces every bump and wiggle in the data. This susceptibility to model the intrinsic noise in the data instead of the true signal is called over-fitting. Where we want to be is somewhere in the Goldilocks zone: not too under-fit and not too over-fit.


Figure 3: Three fits of a kernel smoothing regression model to the corn production training set.

Now, let’s get back to the problem at hand: determining how well our ML model will generalize to predict the corn output from data on different farms. The first step in this process is to select an evaluation metric that captures how good our predictions are. For regression, the standard metric for evaluation is mean squared error (MSE), which is the average squared difference between the true value of the target variable and the model-predicted value.
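In code, the metric is a short function. The sketch below uses NumPy; the name `mean_squared_error` is chosen here for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted target values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# errors of 1, 0 and -2 give squared errors of 1, 0 and 4, so the MSE is 5/3
print(mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Because the errors are squared, a few large misses raise the MSE much more than many small ones.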

This is where things get tricky. Evaluated on the training set, the error (measured by MSE) of our model predictions gets ever smaller as the bandwidth parameter decreases. This should not be unexpected: the more flexibility we allow the model, the better it will do at tracing the patterns (both the signal and the noise) in the training data. However, the models with the smallest bandwidths are severely over-fit to the training data because they trace every random fluctuation in the training set. Using these models to predict on new data will result in poor predictive accuracy because the new data will have their own unique random noise signatures that are different from those in the training set.

Thus, there is a divergence between the training set error and the generalization error of an ML model. This divergence is exemplified in Figure 4 on the corn production data. For small values of the bandwidth parameter, the MSE evaluated on the training set is extremely small while the MSE evaluated on new data (in this case, 10,000 new instances) is much larger. Simply put, the performance of the predictions of a model evaluated on the training set is not indicative of the performance of that model on new data. Therefore, it is extremely dangerous to evaluate the performance of a model on the same data that was used to train the model.
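The divergence can be reproduced on simulated data. The sketch below is an illustration under stated assumptions, not the article's corn data: the trend `5 * x**2`, the noise level, the Gaussian kernel smoother, and the three bandwidths are all hypothetical choices.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Gaussian-weighted local average (the kernel choice is an assumption)."""
    sq_dist = (x_new[:, None] - x_train[None, :]) ** 2
    sq_dist -= sq_dist.min(axis=1, keepdims=True)  # numerical stability only
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return weights @ y_train / weights.sum(axis=1)

def simulate(n, rng):
    """Hypothetical data: an increasing non-linear trend plus Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = 5.0 * x ** 2 + rng.normal(0.0, 0.5, size=n)
    return x, y

rng = np.random.default_rng(0)
x_train, y_train = simulate(100, rng)   # training set
x_new, y_new = simulate(10_000, rng)    # stand-in for "new data"

results = {}
for bandwidth in [0.005, 0.05, 0.5]:
    fit_train = kernel_smooth(x_train, y_train, x_train, bandwidth)
    fit_new = kernel_smooth(x_train, y_train, x_new, bandwidth)
    mse_train = np.mean((y_train - fit_train) ** 2)
    mse_new = np.mean((y_new - fit_new) ** 2)
    results[bandwidth] = (mse_train, mse_new)
    print(f"bandwidth={bandwidth:<6} training MSE={mse_train:.3f}  new-data MSE={mse_new:.3f}")
```

In a setup like this, the smallest bandwidth should show the lowest training MSE while its new-data MSE is clearly worse, which is the optimism described above.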

CAUTION: Double-dipping the training data
Using the training data for both model fitting and evaluation can make you overly optimistic about the performance of the model. This can cause you to ultimately choose a sub-optimal model that performs poorly when predicting on new data.

As we see on the corn production data, choosing the model with the smallest training set MSE causes the selection of the model with the smallest bandwidth. On the training set, this model yields an MSE of 0.08. However, when applied to new data, the same model yields an MSE of 0.50, which is much worse than the optimal model (bandwidth=0.12 and MSE=0.27).

We need an evaluation metric that better approximates the performance of the model on new data. This way, we can be confident about the accuracy of our model when deployed to make predictions on new data. This is the topic of the next subsection.


Figure 4: Comparison of the training set error to the error on new data for the corn production regression problem. The training set error is an overly optimistic measure of the performance of the model for new data, particularly for small values of the bandwidth parameter. Obviously, using the training set error as a surrogate for the prediction error on new data will get us into a lot of trouble.

The Solution: Cross-validation

We have diagnosed the challenge in model evaluation: the training set error is not indicative of the error of the model when applied to new data. To get a good estimate of what our error rate will be for new data, we must use a more sophisticated methodology, called cross-validation, that rigorously employs the training set to evaluate what the accuracy will be on new data.

The two most commonly used methods for cross-validation are the holdout method and K-fold cross-validation.

The Holdout Method

Using the same training data to both fit and evaluate the accuracy of a model produces accuracy metrics that are overly optimistic. The easiest way around this is to use separate training and testing subsets, using only the training subset to fit the model and only the testing subset to evaluate the accuracy of the model.

This approach is referred to as the holdout method, because a random subset of the training data is held out from the training process. Practitioners typically leave out 30% of the data as the testing subset. The basic algorithmic flow of the holdout method is shown in Figure 5 and Python pseudo-code is given in Listing 1.


Figure 5: Flowchart of the holdout method of cross-validation. Here, the dark green boxes denote the target variable.

Listing 1: Cross-validation with the Holdout Method

# assume that we begin with two inputs:
#     features - a NumPy matrix of input features
#     target - a NumPy array of target variables corresponding to those features
# train, predict, and evaluate_acc are placeholders for your own
# model-fitting, prediction, and scoring routines

import numpy as np

N = features.shape[0]
N_train = int(np.floor(0.7 * N))

# randomly shuffle the row indices, then split them into
# disjoint training and testing index sets
idx = np.random.permutation(N)
idx_train = idx[:N_train]
idx_test = idx[N_train:]

# break your data into training and testing subsets
features_train = features[idx_train,:]
target_train = target[idx_train]
features_test = features[idx_test,:]
target_test = target[idx_test]

# build a model on the training set
model = train(features_train, target_train)

# generate model predictions on the testing set
preds_test = predict(model, features_test)

# evaluate the accuracy of the predictions
accuracy = evaluate_acc(preds_test, target_test)

Now, let’s apply the holdout method to the corn production data. For each value of the bandwidth parameter, we apply the holdout method (using a 70% / 30% split) and compute the MSE on the predictions for the held-out 30% of data. Figure 6 demonstrates how the holdout method estimates of MSE stack up to the MSE of the model when applied to new data. Two main things stand out:

  1. The error estimates computed by the holdout method are very close to the ‘new data’ error of the model. They are certainly much closer than the training set error estimates (Figure 4), particularly for small bandwidth values.
  2. The holdout error estimates are very noisy. They bounce around wildly compared to the smooth curve that represents the error on new data.

We could beat down the noise by doing repeated random training-testing splits and averaging the result. However, over multiple iterations, each data point will be assigned to the testing set a different number of times, which could bias our result.
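The noisiness of a single holdout estimate, and the effect of averaging over repeated splits, can be seen in a small simulation. This is a sketch under assumptions: the simulated data, the Gaussian kernel smoother, the bandwidth of 0.1, and the choice of 50 repetitions are all hypothetical.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Gaussian-weighted local average (the kernel choice is an assumption)."""
    sq_dist = (x_new[:, None] - x_train[None, :]) ** 2
    sq_dist -= sq_dist.min(axis=1, keepdims=True)  # numerical stability only
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return weights @ y_train / weights.sum(axis=1)

def holdout_mse(x, y, bandwidth, rng, train_frac=0.7):
    """One holdout estimate of the MSE from a single random train/test split."""
    idx = rng.permutation(len(x))
    n_train = int(np.floor(train_frac * len(x)))
    train, test = idx[:n_train], idx[n_train:]
    preds = kernel_smooth(x[train], y[train], x[test], bandwidth)
    return np.mean((y[test] - preds) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=100)
y = 5.0 * x ** 2 + rng.normal(0.0, 0.5, size=100)  # hypothetical data

# repeat the random 70/30 split many times for one fixed bandwidth
estimates = np.array([holdout_mse(x, y, 0.1, rng) for _ in range(50)])
print(f"single-split estimates range from {estimates.min():.3f} to {estimates.max():.3f}")
print(f"average over 50 splits: {estimates.mean():.3f}")
```

The spread between the smallest and largest single-split estimates shows how much one holdout run depends on which instances happen to land in the testing subset.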

A better approach is to do K-fold cross-validation.


Figure 6: Comparison of the holdout error MSE to the MSE on new data, using the corn production data set. The holdout error is an unbiased estimate of the error of each model on new data. However, it is a very noisy estimator that fluctuates wildly between 0.14 and 0.40 for bandwidths in the neighborhood of the optimal model (bandwidth = 0.12).

K-Fold Cross-Validation

A better, but more computationally intensive, approach to cross-validation is K-fold cross-validation. Like the holdout method, K-fold cross-validation relies on quarantining subsets of the training data during the learning process. The primary difference is that K-fold cross-validation begins by randomly splitting the data into K disjoint subsets, called folds (typical choices for K are 5, 10, or 20). For each fold, a model is trained on all of the data except the data from that fold and is subsequently used to generate predictions for the data from that fold.

Once all K folds are cycled through, the predictions for each fold are aggregated and compared to the true target variable to assess accuracy. A pictorial representation of K-fold cross-validation is shown in Figure 7 and pseudo-code is given in Listing 2.


Figure 7: Flowchart of the K-fold cross-validation.

Finally, let’s apply K-fold cross-validation to the corn production data. For each value of the bandwidth parameter, we apply K-fold cross-validation with K=10 and compute the cross-validated MSE on the predictions. Figure 8 demonstrates how the K-fold cross-validation MSE estimates stack up to the MSE of the model when applied to new data. Clearly, the K-fold cross-validation error estimate is very close to the error of the model on future data.

Listing 2: Cross-validation with K-fold Cross-validation

# assume that we begin with two inputs:
#     features - a NumPy matrix of input features
#     target - a NumPy array of target variables corresponding to those features
# train, predict, and evaluate_acc are placeholders for your own
# model-fitting, prediction, and scoring routines

import numpy as np

N = features.shape[0]
K = 10 # number of folds

preds_kfold = np.empty(N)

# randomly assign each instance to one of K folds of (nearly) equal size
folds = np.random.permutation(N) % K

# loop through the cross-validation folds
for ii in np.arange(K):
    # break your data into training and testing subsets
    features_train = features[folds != ii,:]
    target_train = target[folds != ii]
    features_test = features[folds == ii,:]

    # build a model on the training set
    model = train(features_train, target_train)

    # generate and store model predictions on the testing set
    preds_kfold[folds == ii] = predict(model, features_test)

# evaluate the accuracy of the predictions
accuracy = evaluate_acc(preds_kfold, target)


Figure 8: Comparison of the K-fold cross-validation error MSE to the MSE on new data, using the corn production data set. The K-fold CV error is a very good estimate for how the model will perform on new data, allowing us to use it confidently to forecast the error of the model and to select the best model.
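Selecting a bandwidth by K-fold cross-validation can be sketched end to end as follows. This is an illustration under assumptions rather than the procedure behind Figure 8: the simulated data, the Gaussian kernel smoother, the bandwidth grid, and the helper name `kfold_mse` are all hypothetical.

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth):
    """Gaussian-weighted local average (the kernel choice is an assumption)."""
    sq_dist = (x_new[:, None] - x_train[None, :]) ** 2
    sq_dist -= sq_dist.min(axis=1, keepdims=True)  # numerical stability only
    weights = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return weights @ y_train / weights.sum(axis=1)

def kfold_mse(x, y, bandwidth, K=10, seed=0):
    """Cross-validated MSE of the kernel smoother for one bandwidth."""
    n = len(x)
    # balanced random fold labels; the fixed seed reuses the same folds
    # for every bandwidth so that the errors are directly comparable
    folds = np.random.default_rng(seed).permutation(n) % K
    preds = np.empty(n)
    for k in range(K):
        test = folds == k
        # fit on the other K-1 folds, predict the held-out fold
        preds[test] = kernel_smooth(x[~test], y[~test], x[test], bandwidth)
    return np.mean((y - preds) ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=100)
y = 5.0 * x ** 2 + rng.normal(0.0, 0.5, size=100)  # hypothetical data

bandwidths = np.array([0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5])
cv_errors = np.array([kfold_mse(x, y, h) for h in bandwidths])
best = bandwidths[np.argmin(cv_errors)]

for h, err in zip(bandwidths, cv_errors):
    print(f"bandwidth={h:<6} 10-fold CV MSE={err:.3f}")
print("selected bandwidth:", best)
```

Every instance is predicted exactly once by a model that never saw it, so the selected bandwidth is the one with the lowest estimated error on unseen data rather than the lowest training error.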

Some things to look out for when using cross-validation

Cross-validation gives us a way to estimate how accurately our ML models will predict when deployed in the wild. This is extremely powerful, as it allows us to select the best model for our task.

However, there are a few things to watch out for when applying cross-validation on real-world data:

  • The larger the number of folds used in K-fold cross-validation, the better the error estimates will be but the longer your program will take to run. Solution: Use at least 10 folds when you can. For models that train and predict very quickly, you can use leave-one-out cross-validation (K = number of data instances).
  • Cross-validation methods (including both the holdout and K-fold methods) assume that the training data form a representative sample from the population of interest. If you plan to deploy the model to predict on some new data, then those data should be well-represented by the training data. If not, then the cross-validation error estimates may be overly optimistic for the error rates on future data. Solution: Ensure that any potential biases in the training data are addressed and minimized.
  • Some data sets use features that are temporal in nature, for instance, using last month’s revenue to forecast this month’s revenue. If this is the case for your data, then you must ensure that information from the future is never used to predict the past. Solution: Structure your cross-validation holdout set or K folds so that the training set data are all collected before the testing set data.
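For the temporal case in the last item, a time-ordered holdout split can be sketched as below. The function name `time_ordered_holdout` and the month numbers are hypothetical, and the 70% / 30% proportion simply mirrors the holdout example earlier in the article.

```python
import numpy as np

def time_ordered_holdout(timestamps, train_frac=0.7):
    """Return (train_idx, test_idx) with every training row earlier than every testing row.

    Assumes the timestamps are distinct; tied timestamps could straddle the split.
    """
    order = np.argsort(timestamps, kind="stable")  # row indices from oldest to newest
    n_train = int(np.floor(train_frac * len(order)))
    return order[:n_train], order[n_train:]

# hypothetical month numbers for ten observations, stored in arbitrary row order
months = np.array([7, 1, 4, 10, 2, 9, 3, 8, 5, 6])
train_idx, test_idx = time_ordered_holdout(months)

print("training months:", np.sort(months[train_idx]))
print("testing months: ", np.sort(months[test_idx]))
```

Unlike a random split, this one never lets the model train on observations collected after the ones it is tested on.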


We started by discussing how to evaluate models in general terms. It was clear that we can’t double dip the training data and use it for evaluation as well as training. Instead we introduced cross-validation as a more robust method of model evaluation.

  • Holdout cross-validation is the simplest form of cross-validation, where a testing set is held out for prediction, in order to better estimate the generalizability of the model.
  • K-fold cross-validation, where K folds are held out one at a time, was introduced to give even less uncertain estimates of the model performance. This improvement comes at a higher computational cost. If computationally feasible, the best estimate is obtained when K = number of samples, also known as leave-one-out cross-validation.
  • The basic model evaluation workflow was introduced. In condensed form, the steps are:
    • Acquire and pre-process the dataset for modeling (Chapter 2) and determine the appropriate ML method and algorithm (Chapter 3).
    • Build models and make predictions using either the holdout or K-fold cross-validation methods, depending on the computing resources available.
    • Evaluate the predictions with the performance metric of choice. Common performance metrics for classification are introduced in section 4.2, and common performance metrics for regression in section 4.3.
    • Tweak data and model until the desired model performance is obtained. In sections 5-8 we will look at various methods for increasing the model performance in common real-world scenarios.
  • For classification models, we introduced a few model performance metrics to be used in step 3 of the above workflow. These techniques included simple counting accuracy, the confusion matrix, receiver-operator characteristics, the ROC curve and the area under the ROC curve.
  • For regression models, we introduced the root-mean-square error and R-squared estimators, and we discussed the usefulness of simple visualizations such as the prediction-vs-actual scatter plot and the residual plot.
  • We introduced the concept of tuning parameters and showed how to optimize a model with respect to those parameters using a grid search algorithm.




Under / over-fitting

Using a model that is too simple / too complex for the problem at hand.

Evaluation metric

A number that characterizes the performance of the model.

Mean squared error

A specific evaluation metric used in regression models.


Cross-validation

The method of splitting the training set into two or more separate training/testing sets in order to better assess the accuracy.

Holdout method

A form of cross-validation where a single test set is held out of the model fitting routine for testing purposes.

K-fold cross-validation

A kind of cross-validation where the data are split into K random disjoint sets (folds). The folds are held out one at a time, and cross-validated on models built on the remainder of the data.

Confusion matrix

A matrix showing for each class the number of predicted values that were correctly classified or not.

ROC – Receiver operator characteristic

A curve characterizing the trade-off between the true positive rate and the false positive rate of a classifier as its decision threshold is varied.

AUC – Area under the ROC curve

An evaluation metric for classification tasks defined from the area under the ROC curve of false positives versus true positives.

Tuning parameter

An internal parameter to a machine learning algorithm, such as the bandwidth parameter for kernel smoothing regression.

Grid search

A brute force strategy for selecting the best values for the tuning parameters to optimize an ML model.