From Ensemble Methods for Machine Learning by Gautam Kunapuli

Our first case study explores a medical decision-making task: breast cancer diagnosis. We will see how to use scikit-learn’s homogeneous parallel ensemble modules in practice. Specifically, we will train and evaluate the performance of three homogeneous parallel algorithms, each characterized by increasing randomness: bagging with decision trees, random forests and ExtraTrees.


Take 40% off Ensemble Methods for Machine Learning by entering fcckunapuli into the discount code box at checkout at manning.com.


Breast-cancer diagnosis is the task of classifying patient tumor data into one of two classes: malignant vs. benign. This task (and similar medical tasks) is at the core of the medical diagnostic decision-support systems that are beginning to come into wider use in the healthcare industry. Machine-learning algorithms for such tasks can greatly reduce the false-negative rate (where patients who do have cancer go undiagnosed until it is too late) and improve the chances of timely treatment and survival.

This task is challenging owing to subtle variations in patient data, which make it very difficult for a single model to achieve high performance. Bagging and random forests, however, are able to learn effective models. We tackle this case study with scikit-learn's implementations of bagging and random forests and compare the performance of the learned ensembles with individual (non-ensemble, traditional) machine-learning models. The results show that ensemble methods can significantly outperform individual models, giving you a concrete, real-world example of how ensembles can be successful.

Listing 1: Loading and pre-processing

 
 In [1]:
 from sklearn.datasets import load_breast_cancer
 dataset = load_breast_cancer()
  
 # Convert to a Pandas DataFrame so we can visualize nicely
 import pandas as pd
 import numpy as np
 df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
 i = np.random.permutation(len(dataset['target']))  # shuffle the rows
 df = df.iloc[i, :7]                                 # keep only the first 7 features for a compact preview
 df['diagnosis'] = dataset['target'][i]              # append the labels (0 = malignant, 1 = benign)
 df = df.reset_index()
 df.head()
  
 Out[1]: (table: the first five rows of the shuffled data frame)

Next, pre-process the data by standardizing the features to have zero mean and unit standard deviation. Standardization is always good practice, especially here, where the scales of the features differ vastly.

 
In [2]:
 from sklearn.preprocessing import StandardScaler
 X, y = dataset['data'], dataset['target']
 X = StandardScaler().fit_transform(X)

Bagging, Random Forest and ExtraTrees

The key difference between random forests and bagging variants such as random subspaces and random patches is where the feature sampling occurs. Random forests exclusively use randomized decision trees as base estimators, and they perform feature sampling inside the tree-learning algorithm: every time a decision node is grown, a fresh random subset of features is considered for the split.

Random subspaces and random patches, on the other hand, are not restricted to tree learning and can use any learning algorithm as a base estimator. They sample features only once, outside the base learning algorithm, before each base estimator is trained.
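To make this distinction concrete, here is a small sketch (illustrative only; the parameter values are not tuned) of how each flavor of feature sampling is configured in scikit-learn: BaggingClassifier samples features once per base estimator through max_features, whereas RandomForestClassifier re-samples features at every split via its own max_features argument.

 # Illustrative sketch: where feature sampling happens in each ensemble
 from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
 from sklearn.tree import DecisionTreeClassifier

 # Random subspaces: each base estimator sees one random subset of features,
 # sampled once before that estimator is trained
 subspaces = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                               max_features=0.5, bootstrap=False,
                               n_estimators=100)

 # Random patches: subsample both training examples and features per estimator
 patches = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                             max_samples=0.75, max_features=0.5,
                             n_estimators=100)

 # Random forest: a fresh random subset of features is drawn at every decision node
 forest = RandomForestClassifier(max_features='sqrt', n_estimators=100)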

scikit-learn provides an ExtraTreesClassifier that supports out-of-bag estimation and parallelization, much like BaggingClassifier and RandomForestClassifier. Note that ExtraTrees typically do not perform bootstrap sampling (bootstrap=False by default), because base-estimator diversity already comes from the extreme randomization of the splits. In the listings below, however, we pass bootstrap=True so that, like the other two ensembles, ExtraTrees also trains each tree on a bootstrap sample.
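For example, an ExtraTreesClassifier can be set up with out-of-bag estimation and parallel training much like the other two ensembles; a minimal sketch (illustrative values; note that out-of-bag estimation requires bootstrap=True):

 # Illustrative sketch: ExtraTrees with out-of-bag (OOB) estimation and parallel training
 from sklearn.ensemble import ExtraTreesClassifier

 xt = ExtraTreesClassifier(n_estimators=100,
                           bootstrap=True,   # draw bootstrap samples (needed for OOB)
                           oob_score=True,   # estimate generalization on OOB samples
                           n_jobs=-1)        # grow trees in parallel
 xt.fit(X, y)
 print('OOB accuracy estimate: {0:.3f}'.format(xt.oob_score_))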

Once we have pre-processed our data set, we will train and evaluate bagging with decision trees, random forests and ExtraTrees in order to answer the following questions:

  1. How does the ensemble performance change with ensemble size? That is, what happens when our ensembles get bigger and bigger?
  2. How does the ensemble performance change with base-learner complexity? That is, what happens when our individual base estimators become more and more complex? In this case study, since all three ensemble methods use decision trees as base estimators, a natural measure of complexity is the size of the tree, which we can cap by its depth or by its number of leaf nodes; larger trees are more complex. The short sketch after this list makes this concrete.
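
Here is that sketch, a small standalone example (using the standardized X and y from the pre-processing step above) showing how max_leaf_nodes caps how large, and therefore how complex, a single decision tree can grow:

 # Illustrative sketch: max_leaf_nodes caps the size (complexity) of a single tree
 from sklearn.tree import DecisionTreeClassifier

 for leaves in [2, 8, 32]:
     tree = DecisionTreeClassifier(max_leaf_nodes=leaves).fit(X, y)
     print('max_leaf_nodes={0:2d} -> depth={1}, leaves={2}'.format(
           leaves, tree.get_depth(), tree.get_n_leaves()))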

Ensemble size vs. ensemble performance


Doctors make many decisions regarding patient care every day: diagnosis (what disease does the patient have?), prognosis (how will the disease progress?) and treatment planning (how should the disease be treated?), to name a few. They make these decisions based on a patient's health records, medical history, family history, test results and so on.

The specific data set we will use is the Wisconsin Breast Cancer (WDBC) data set, a common benchmark data set in machine learning. Since 1993, the WDBC data has been used to benchmark the performance of dozens of machine-learning algorithms.

The WDBC data set was originally created by applying feature extraction techniques to patient biopsy medical images. More concretely, for each patient, the data describe the size and texture of the cell nuclei of cells extracted during biopsy.

The WDBC data set is available directly in scikit-learn; we loaded and standardized it in Listing 1. First, we look at how training and test performance change with ensemble size, that is, we compare the behavior of the three algorithms as the parameter n_estimators increases. Since the test set is held out during training, the test error is a useful estimate of how well we will do on future data, that is, generalize. To keep our conclusions from being at the mercy of a single random split, we repeat the experiment 20 times and average the results.
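Before training, it is also worth a quick look at the size and class balance of the data; a small sketch using X and y from Listing 1:

 # Quick look at the data set: number of samples, features, and class balance
 print('Samples: {0}, features: {1}'.format(X.shape[0], X.shape[1]))
 print('Class counts (0 = malignant, 1 = benign): {0}'.format(np.bincount(y)))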

Listing 2: Training and test errors with increasing ensemble size.

 
In [3]:
 import pickle
 import os
  
 from sklearn.model_selection import train_test_split
 from sklearn.tree import DecisionTreeClassifier
 from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
 from sklearn.metrics import accuracy_score
  
 max_leaf_nodes = 8
 n_runs = 20
 n_estimator_range = range(2, 20, 1)
  
 bag_trn_error = np.zeros((n_runs, len(n_estimator_range)))  # Train error for the bagging classifier
 rf_trn_error = np.zeros((n_runs, len(n_estimator_range)))   # Train error for the random forest classifier
 xt_trn_error = np.zeros((n_runs, len(n_estimator_range)))   # Train error for the extra trees classifier
  
 bag_tst_error = np.zeros((n_runs, len(n_estimator_range)))  # Test error for the bagging classifier
 rf_tst_error = np.zeros((n_runs, len(n_estimator_range)))   # Test error for the random forest classifier
 xt_tst_error = np.zeros((n_runs, len(n_estimator_range)))   # Test error for the extra trees classifier
  
 for run in range(0, n_runs):
     print('Run {0}'.format(run))
  
     # Split into train and test sets
     X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25)
  
     for j, n_estimators in enumerate(n_estimator_range):
         # Train using bagging
         # (note: scikit-learn 1.2 and later use estimator instead of base_estimator)
         bag_clf = BaggingClassifier(
             base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
             n_estimators=n_estimators, max_samples=0.5, n_jobs=-1)
         bag_clf.fit(X_trn, y_trn)
         bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
         bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))
  
         # Train using random forests
         rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
                                         n_estimators=n_estimators, n_jobs=-1)
         rf_clf.fit(X_trn, y_trn)
         rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
         rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))
  
         # Train using extra trees
         xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
                                       n_estimators=n_estimators, n_jobs=-1)
         xt_clf.fit(X_trn, y_trn)
         xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
         xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))
  
     results = (bag_trn_error, bag_tst_error, \
                rf_trn_error, rf_tst_error, \
                xt_trn_error, xt_tst_error)
  
  
 In [4]:
 %matplotlib inline
  
 import matplotlib.pyplot as plt
  
 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
 n_estimator_range = range(2, 20, 1)
  
 # Plot the training error
 m = np.mean(bag_trn_error*100, axis=0)
 ax[0].plot(n_estimator_range, m, linewidth=3)
  
 m = np.mean(rf_trn_error*100, axis=0)
 ax[0].plot(n_estimator_range, m, linestyle='--', linewidth=3)
  
 m = np.mean(xt_trn_error*100, axis=0)
 ax[0].plot(n_estimator_range, m, linestyle='-.', linewidth=3)
  
 ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
 ax[0].set_xlabel('Number of Estimators', fontsize=16)
 ax[0].set_ylabel('Training Error (%)', fontsize=16)
 ax[0].axis([0, 21, 1, 9])
 ax[0].grid()
  
 # Plot the test error
 m = np.mean(bag_tst_error*100, axis=0)
 ax[1].plot(n_estimator_range, m, linewidth=3)
  
 m = np.mean(rf_tst_error*100, axis=0)
 ax[1].plot(n_estimator_range, m, linestyle='--', linewidth=3)
  
 m = np.mean(xt_tst_error*100, axis=0)
 ax[1].plot(n_estimator_range, m, linestyle='-.', linewidth=3)
  
 ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
 ax[1].set_xlabel('Number of Estimators', fontsize=16)
 ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=16)
 ax[1].grid()
 ax[1].axis([0, 21, 1, 9]);
 plt.tight_layout()
 Out[4]: (figure: training error, left, and hold-out test error, right, vs. number of estimators for the three ensembles)


Base learner complexity vs. ensemble performance

Next, we compare the behavior of the three algorithms as the complexity of the base learners increases. There are several ways to control the complexity of the base decision trees: maximum depth, maximum number of leaf nodes, impurity criteria and so on. Here, we compare the performance of the three ensemble methods as complexity increases, with complexity controlled by max_leaf_nodes.


Recall that highly complex trees are inherently unstable and sensitive to small perturbations in the data. This means that, in general, if we increase the complexity of the base learners, we will need many more of them to successfully reduce the variance of the overall ensemble. Here, however, we have fixed n_estimators=10.
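To see this instability concretely, here is a small illustrative sketch (not part of the listings) that trains two unpruned trees on different bootstrap samples of the same data and measures how often their predictions disagree:

 # Illustrative sketch: unpruned trees trained on different bootstrap samples
 # can disagree on a noticeable fraction of points -- the instability that
 # averaging over an ensemble is meant to smooth out
 rng = np.random.RandomState(0)
 n = X.shape[0]
 boot1 = rng.choice(n, size=n, replace=True)
 boot2 = rng.choice(n, size=n, replace=True)

 tree1 = DecisionTreeClassifier().fit(X[boot1], y[boot1])
 tree2 = DecisionTreeClassifier().fit(X[boot2], y[boot2])
 disagree = np.mean(tree1.predict(X) != tree2.predict(X))
 print('Fraction of points where the two trees disagree: {0:.3f}'.format(disagree))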

 
In [5]:
 n_estimators = 10
 max_leaf_nodes = 8
 n_runs = 20
 n_leaf_range = [2, 4, 8, 16, 24, 32]
  
 bag_trn_error = np.zeros((n_runs, len(n_leaf_range)))  # Train error for the bagging classifier
 rf_trn_error = np.zeros((n_runs, len(n_leaf_range)))   # Train error for the random forest classifier
 xt_trn_error = np.zeros((n_runs, len(n_leaf_range)))   # Train error for the extra trees classifier
  
 bag_tst_error = np.zeros((n_runs, len(n_leaf_range)))  # Test error for the bagging classifier
 rf_tst_error = np.zeros((n_runs, len(n_leaf_range)))   # Test error for the random forest classifier
 xt_tst_error = np.zeros((n_runs, len(n_leaf_range)))   # Test error for the extra trees classifier
  
 for run in range(0, n_runs):
     print('Run {0}'.format(run))
  
     # Split into train and test sets
     X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25)
  
     for j, max_leaf_nodes in enumerate(n_leaf_range):
         # Train using bagging
         bag_clf = BaggingClassifier(
             base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
             n_estimators=n_estimators, max_samples=0.5, n_jobs=-1)
         bag_clf.fit(X_trn, y_trn)
         bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
         bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))
  
         # Train using random forests
         rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
                                         n_estimators=n_estimators, n_jobs=-1)
         rf_clf.fit(X_trn, y_trn)
         rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
         rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))
  
         # Train using extra trees
         xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
                                       n_estimators=n_estimators, n_jobs=-1)
         xt_clf.fit(X_trn, y_trn)
         xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
         xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))
            
     results = (bag_trn_error, bag_tst_error, \
                rf_trn_error, rf_tst_error, \
                xt_trn_error, xt_tst_error)
            
    
  
  
  
 In [6]:
 %matplotlib inline
  
 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
 n_leaf_range = [2, 4, 8, 16, 24, 32]
  
 # Plot the training error
 m = np.mean(bag_trn_error*100, axis=0)
 ax[0].plot(n_leaf_range, m, linewidth=3, marker='o')
  
 m = np.mean(rf_trn_error*100, axis=0)
 ax[0].plot(n_leaf_range, m, linestyle='--', linewidth=3, marker='o')
  
 m = np.mean(xt_trn_error*100, axis=0)
 ax[0].plot(n_leaf_range, m, linestyle='-.', linewidth=3, marker='o')
  
 ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
 ax[0].set_xlabel('Number of Leaf Nodes per Tree', fontsize=16)
 ax[0].set_ylabel('Training Error (%)', fontsize=16)
 ax[0].set_xticks(n_leaf_range)
 ax[0].grid()
  
 # Plot the test error
 m = np.mean(bag_tst_error*100, axis=0)
 ax[1].plot(n_leaf_range, m, linewidth=3, marker='o')
  
 m = np.mean(rf_tst_error*100, axis=0)
 ax[1].plot(n_leaf_range, m, linestyle='--', linewidth=3, marker='o')
  
 m = np.mean(xt_tst_error*100, axis=0)
 ax[1].plot(n_leaf_range, m, linestyle='-.', linewidth=3, marker='o')
  
 ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
 ax[1].set_xlabel('Number of Leaf Nodes per Tree', fontsize=16)
 ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=16)
 ax[1].set_xticks(n_leaf_range)
 ax[1].grid();
 plt.tight_layout()
 Out[6]: (figure: training error, left, and hold-out test error, right, vs. number of leaf nodes per tree)


Feature importances with Random Forests

Finally, we see how we can use the feature importances provided by the random forest ensemble to identify the most predictive features for breast-cancer diagnosis. Such an analysis adds interpretability to the model and can be very helpful when communicating and explaining models to domain experts such as doctors.


Feature importances with label correlations

First, let’s peek into the data set to see if we can discover some interesting relationships among the features and the diagnosis. This type of analysis is typical when we get a new data set, as we try to learn more about it. Here, our analysis will try to identify which features are most correlated with each other and with the diagnosis (label), so that we can check if random forests can do something similar.


 
In [7]:
 %matplotlib inline
  
 import seaborn as sea
  
 df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
 df['diagnosis'] = dataset['target']
  
 fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))
 cor = np.abs(df.corr())
 sea.heatmap(cor, annot=False, cbar=False, cmap=plt.cm.Reds, ax=ax[0])
  
  
 f = ['mean radius', 'mean perimeter', 'mean area', \
      'worst radius', 'worst perimeter', 'worst area', \
      'radius error', 'perimeter error', 'area error', 'diagnosis']
 cor_zoom = np.abs(df[f].corr())
 sea.heatmap(cor_zoom, annot=True, cbar=False, cmap=plt.cm.Reds, ax=ax[1])
 fig.tight_layout()
  
 Out[7]: (figure: heatmap of absolute pairwise correlations for all features, left, and a zoomed-in view of nine size-related features plus the diagnosis, right)


The heatmap on the left shows that several features (for example, the radius, perimeter and area measurements) are highly correlated with each other; several features are also highly correlated with the label, that is, the diagnosis of malignant vs. benign. Let's identify the 10 features most correlated with the diagnosis label.

 
In [8]:
 label_corr = cor.iloc[:, -1]                    # correlation of every feature with the diagnosis column
 label_corr.sort_values(ascending=False)[1:11]   # skip the first entry (diagnosis vs. itself)
 Out[8]:
 worst concave points    0.793566
 worst perimeter         0.782914
 mean concave points     0.776614
 worst radius            0.776454
 mean perimeter          0.742636
 worst area              0.733825
 mean radius             0.730029
 mean area               0.708984
 mean concavity          0.696360
 worst concavity         0.659610
 Name: diagnosis, dtype: float64
 

Feature importances using random forests

Random forests can also provide feature importances. In scikit-learn, these scores are computed from the mean decrease in impurity: features whose splits reduce impurity the most, averaged over all the trees in the forest, receive higher scores. The listing below illustrates this.

Listing 3: Feature importances in the WDBC data set using random forests

 
In [9]:
 X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.15)
 n_features = X_trn.shape[1]
  
 rf = RandomForestClassifier(max_leaf_nodes=24, n_estimators=50, n_jobs=-1)
 rf.fit(X_trn, y_trn)
 err = 1 - accuracy_score(y_tst, rf.predict(X_tst))
 print('Prediction Error = {0:4.2f}%'.format(err*100))
  
 importance_threshold = 0.02
 for i, (feature, importance) in enumerate(zip(dataset['feature_names'],
                                               rf.feature_importances_)):
  
     if importance > importance_threshold:
         print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
 Prediction Error = 4.65%
 [0] mean radius (score=0.077)
 [1] mean texture (score=0.020)
 [2] mean perimeter (score=0.045)
 [3] mean area (score=0.041)
 [5] mean compactness (score=0.026)
 [6] mean concavity (score=0.036)
 [7] mean concave points (score=0.120)
 [13] area error (score=0.020)
 [20] worst radius (score=0.193)
 [22] worst perimeter (score=0.138)
 [23] worst area (score=0.054)
 [26] worst concavity (score=0.027)
 [27] worst concave points (score=0.096)
 

We can plot the feature importances identified by the random forest ensemble. For the WDBC data set, the random forest identifies the features below as being important. Observe that there is considerable overlap between the important features identified by correlation analysis and by random forests, though their relative rankings differ.

 
In [10]:
 fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))
  
 # Identify the important features
 importance_threshold = 0.02
 idx = np.array(range(n_features))
 imp = np.where(rf.feature_importances_ >= importance_threshold)  # important features
 rest = np.setdiff1d(idx, imp)  # remaining features
  
 # Plot the important features and the rest on a bar chart
 plt.bar(idx[imp], rf.feature_importances_[imp], alpha=0.65)
 plt.bar(idx[rest], rf.feature_importances_[rest], alpha=0.45)
  
 # Print feature names on the bar chart
 for i, (feature, importance) in enumerate(zip(dataset['feature_names'], rf.feature_importances_)):
     if importance > importance_threshold:
         plt.text(i, 0.015, feature, ha='center', va='bottom', rotation='vertical', fontsize=16, fontweight='bold')
         print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
     else:
         plt.text(i, 0.01, feature, ha='center', va='bottom', rotation='vertical', fontsize=16, color='gray')
    
 # Finish the plot   
 fig.axes[0].get_xaxis().set_visible(False)
 plt.ylabel('Feature Importance Score', fontsize=16)
 plt.xlabel('Features for Breast Cancer Diagnosis', fontsize=16);
 plt.tight_layout()
  
 Out[10]: (figure: bar chart of feature-importance scores for all 30 features, with features above the importance threshold highlighted)
  
 [0] mean radius (score=0.077)
 [1] mean texture (score=0.020)
 [2] mean perimeter (score=0.045)
 [3] mean area (score=0.041)
 [5] mean compactness (score=0.026)
 [6] mean concavity (score=0.036)
 [7] mean concave points (score=0.120)
 [13] area error (score=0.020)
 [20] worst radius (score=0.193)
 [22] worst perimeter (score=0.138)
 [23] worst area (score=0.054)
 [26] worst concavity (score=0.027)
 [27] worst concave points (score=0.096)
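
To quantify the overlap mentioned above, here is a quick sketch that intersects the 10 features most correlated with the label (label_corr from In [8]) with the 10 highest-scoring features from the random forest (rf from Listing 3):

 # Compare the top-10 features by label correlation with the top-10 by
 # random forest importance (label_corr and rf come from the cells above)
 top_by_corr = set(label_corr.sort_values(ascending=False)[1:11].index)
 top_by_rf = set(np.array(dataset['feature_names'])[
                     np.argsort(rf.feature_importances_)[::-1][:10]])
 print('Features in both top-10 lists:')
 print(sorted(top_by_corr & top_by_rf))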
 


To summarize the results of our first experiment: as expected, the training error of all three approaches decreases steadily as the number of estimators increases. The test error also decreases with ensemble size and then levels off. Since the test error is an estimate of the generalization error, the experiment confirms our intuition about how these ensemble methods behave in practice.

That’s all for this article.

If you want to learn more about the book, you can check it out on our browser-based liveBook platform here.