From Ensemble Methods for Machine Learning by Gautam Kunapuli

Our first case study explores a medical decision-making task: breast cancer diagnosis. We will see how to use scikit-learn's homogeneous parallel ensemble modules in practice. Specifically, we will train and evaluate the performance of three homogeneous parallel algorithms, each characterized by increasing randomness: bagging with decision trees, random forests and ExtraTrees.
Take 40% off Ensemble Methods for Machine Learning by entering fcckunapuli into the discount code box at checkout at manning.com.
Breast-cancer diagnosis is the task of classifying patient tumor data into one of two classes: malignant vs. benign. This task, and similar medical tasks, is at the core of medical diagnostic decision-support systems that are beginning to come into wider use in the healthcare industry. Machine-learning algorithms for such tasks can greatly reduce the false negative rate (where patients who do have cancer are not diagnosed until it is too late) and improve their chances of treatment and survival.
The task is challenging owing to subtle variations in patient data, which make it very difficult for a single model to achieve high performance. Bagging and random forests, however, are able to learn effective models. We will tackle this case study using scikit-learn's implementations of bagging and random forests and compare the performance of the learned ensembles with that of individual (non-ensemble, traditional) machine-learning models. The results show that ensemble methods can significantly outperform individual models, giving the reader a concrete, real-world example of how ensembles can be successful.
Listing 1: Loading and pre-processing
In [1]:
from sklearn.datasets import load_breast_cancer
dataset = load_breast_cancer()

# Convert to a pandas DataFrame so we can visualize nicely
import pandas as pd
import numpy as np
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
i = np.random.permutation(len(dataset['target']))
df = df.iloc[i, :7]
df['diagnosis'] = dataset['target'][i]
df = df.reset_index()
df.head()

Out[1]: [Table: first five rows of the shuffled data frame, showing the first seven features and the diagnosis column]
Next, we pre-process the data by standardizing the features to have zero mean and unit standard deviation. It is always good practice to standardize the training data, especially when the scales of the features are vastly different.
In [2]:
from sklearn.preprocessing import StandardScaler
X, y = dataset['data'], dataset['target']
X = StandardScaler().fit_transform(X)
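Before diving into the ensembles, it can help to see the single-model-versus-ensemble comparison described above in miniature. The following is a minimal sketch (not one of the book's listings) that compares a lone shallow decision tree against a bagged ensemble of the same trees using 5-fold cross-validation on the standardized data:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

# Illustrative comparison: a single shallow tree vs. a bagged ensemble of 20 such trees
tree = DecisionTreeClassifier(max_leaf_nodes=8)
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(max_leaf_nodes=8),
                        n_estimators=20, max_samples=0.5)
print('single tree accuracy: {0:4.3f}'.format(cross_val_score(tree, X, y, cv=5).mean()))
print('bagged trees accuracy: {0:4.3f}'.format(cross_val_score(bag, X, y, cv=5).mean()))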
Bagging, Random Forest and ExtraTrees
The key difference between random forests and bagging variants such as random subspaces and random patches is where the feature sampling occurs. Random forests exclusively use randomized decision trees as base estimators. Specifically, they perform feature sampling inside the tree-learning algorithm, each time a decision node is added to the growing tree.
Random subspaces and random patches, on the other hand, are not restricted to tree learning and can use any learning algorithm as a base estimator. They randomly sample features once, outside the base learning algorithm, before training each base estimator.
scikit-learn provides an ExtraTreesClassifier that supports out-of-bag estimation and parallelization, much like BaggingClassifier and RandomForestClassifier. Note that ExtraTrees typically do not perform bootstrap sampling (bootstrap=False, by default), as we are able to achieve base estimator diversity through extreme randomization.
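To make the distinction concrete, here is a minimal sketch (not from the original listings) showing where each method performs its feature sampling; the estimator settings are illustrative:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Random subspaces: each base estimator is trained on a random subset of
# features, sampled once, outside the base learning algorithm
subspaces = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                              n_estimators=10,
                              bootstrap=False,   # use all training examples...
                              max_features=0.5)  # ...but only half the features

# Random patches: sample training examples AND features per base estimator
patches = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
                            n_estimators=10,
                            max_samples=0.5,     # half the examples
                            max_features=0.5)    # half the features

# Random forest: feature sampling happens at every decision node, inside
# the tree-growing algorithm itself
forest = RandomForestClassifier(n_estimators=10, max_features='sqrt')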
Once we have pre-processed our data set, we will train and evaluate bagging with decision trees, random forests and ExtraTrees in order to answer the following questions:
- How does the ensemble performance change with ensemble size? That is, what happens when our ensembles get bigger and bigger?
- How does the ensemble performance change with base learner complexity? That is, what happens when our individual base estimators become more and more complex? Since all three ensemble methods considered in this case study use decision trees as base estimators, one measure of complexity is tree depth, with deeper trees being more complex.
Ensemble size vs. ensemble performance
Doctors make many decisions about patient care every day: diagnosis (what disease does the patient have?), prognosis (how will their disease progress?) and treatment planning (how should the disease be treated?), to name a few. They base these decisions on a patient's health records, medical history, family history, test results and so on.
The specific data set we will use is the Wisconsin Diagnostic Breast Cancer (WDBC) data set, a common benchmark in machine learning. Since 1993, the WDBC data has been used to benchmark the performance of dozens of machine-learning algorithms.
The WDBC data set was originally created by applying feature-extraction techniques to medical images of patient biopsies. More concretely, for each patient, the data describe the size and texture of the nuclei of cells extracted during biopsy.
WDBC is available in scikit-learn. First, we look at how training and test performance change with ensemble size; that is, we compare the behavior of the three algorithms as the parameter n_estimators increases. Recall that because the test set is held out during training, the test error is generally a useful estimate of how well we will do on future data, that is, how well we generalize. Since we don't want our learning and evaluation to be at the mercy of randomness, we repeat this experiment 20 times and average the results.
Listing 2: Training and test errors with increasing ensemble size.
In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import accuracy_score

max_leaf_nodes = 8
n_runs = 20
n_estimator_range = range(2, 20, 1)

bag_trn_error = np.zeros((n_runs, len(n_estimator_range)))  # Train error for the bagging classifier
rf_trn_error = np.zeros((n_runs, len(n_estimator_range)))   # Train error for the random forest classifier
xt_trn_error = np.zeros((n_runs, len(n_estimator_range)))   # Train error for the extra trees classifier
bag_tst_error = np.zeros((n_runs, len(n_estimator_range)))  # Test error for the bagging classifier
rf_tst_error = np.zeros((n_runs, len(n_estimator_range)))   # Test error for the random forest classifier
xt_tst_error = np.zeros((n_runs, len(n_estimator_range)))   # Test error for the extra trees classifier

for run in range(0, n_runs):
    print('Run {0}'.format(run))

    # Split into train and test sets
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25)

    for j, n_estimators in enumerate(n_estimator_range):
        # Train using bagging
        bag_clf = BaggingClassifier(
            base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
            n_estimators=n_estimators, max_samples=0.5, n_jobs=-1)
        bag_clf.fit(X_trn, y_trn)
        bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
        bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))

        # Train using random forests
        rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
                                        n_estimators=n_estimators, n_jobs=-1)
        rf_clf.fit(X_trn, y_trn)
        rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
        rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))

        # Train using extra trees
        xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
                                      n_estimators=n_estimators, n_jobs=-1)
        xt_clf.fit(X_trn, y_trn)
        xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
        xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))

results = (bag_trn_error, bag_tst_error,
           rf_trn_error, rf_tst_error,
           xt_trn_error, xt_tst_error)

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Plot the training error
m = np.mean(bag_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linewidth=3)
m = np.mean(rf_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linestyle='--', linewidth=3)
m = np.mean(xt_trn_error*100, axis=0)
ax[0].plot(n_estimator_range, m, linestyle='-.', linewidth=3)
ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
ax[0].set_xlabel('Number of Estimators', fontsize=16)
ax[0].set_ylabel('Training Error (%)', fontsize=16)
ax[0].axis([0, 21, 1, 9])
ax[0].grid()

# Plot the test error
m = np.mean(bag_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linewidth=3)
m = np.mean(rf_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linestyle='--', linewidth=3)
m = np.mean(xt_tst_error*100, axis=0)
ax[1].plot(n_estimator_range, m, linestyle='-.', linewidth=3)
ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
ax[1].set_xlabel('Number of Estimators', fontsize=16)
ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=16)
ax[1].axis([0, 21, 1, 9])
ax[1].grid()

plt.tight_layout()

Out[4]: [Figure: mean training error (left) and hold-out test error (right) vs. number of estimators for Bagging, Random Forest and Extra Trees]

As expected, the training error for all the approaches decreases steadily as the number of estimators increases. The test error also decreases with ensemble size and then stabilizes. As the test error is an estimate of the generalization error, our experiment confirms our intuition about the performance of these ensemble methods in practice.
Base learner complexity vs. ensemble performance
Next, we compare the behavior of the three algorithms as the complexity of the base learners increases. There are several ways to control the complexity of the base decision trees: maximum depth, maximum number of leaf nodes, impurity criteria and so on. Here, we compare the performance of the three ensemble methods with complexity controlled by max_leaf_nodes.
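For reference, here is a hypothetical sketch (not from the original) of the main complexity knobs on scikit-learn's decision trees; any one of them could play the role that max_leaf_nodes plays below:

from sklearn.tree import DecisionTreeClassifier

shallow_tree = DecisionTreeClassifier(max_depth=4)                      # Cap the tree depth directly
small_tree = DecisionTreeClassifier(max_leaf_nodes=8)                   # Cap the number of leaves
conservative_tree = DecisionTreeClassifier(min_impurity_decrease=0.01)  # Split only on a sufficient impurity gain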
Recall that highly complex trees are inherently unstable and sensitive to small perturbations in the data. This means that, in general, if we increase the complexity of the base learners, we will need a lot more of them to successfully reduce the variance of the ensemble overall. Here, however, we have fixed n_estimators=10.
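This intuition can be made precise with a standard result (see, e.g., Hastie, Tibshirani and Friedman's The Elements of Statistical Learning; it is not derived in this excerpt). For $B$ identically distributed base estimators, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the ensemble average is

$$\operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2.$$

More complex base learners raise $\sigma^2$, so a larger $B$ is needed to shrink the second term; the extra randomization in random forests and ExtraTrees attacks the first term by driving down the correlation $\rho$ between trees.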
In [5]:
n_estimators = 10
n_runs = 20
n_leaf_range = [2, 4, 8, 16, 24, 32]

bag_trn_error = np.zeros((n_runs, len(n_leaf_range)))  # Train error for the bagging classifier
rf_trn_error = np.zeros((n_runs, len(n_leaf_range)))   # Train error for the random forest classifier
xt_trn_error = np.zeros((n_runs, len(n_leaf_range)))   # Train error for the extra trees classifier
bag_tst_error = np.zeros((n_runs, len(n_leaf_range)))  # Test error for the bagging classifier
rf_tst_error = np.zeros((n_runs, len(n_leaf_range)))   # Test error for the random forest classifier
xt_tst_error = np.zeros((n_runs, len(n_leaf_range)))   # Test error for the extra trees classifier

for run in range(0, n_runs):
    print('Run {0}'.format(run))

    # Split into train and test sets
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.25)

    for j, max_leaf_nodes in enumerate(n_leaf_range):
        # Train using bagging
        bag_clf = BaggingClassifier(
            base_estimator=DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes),
            n_estimators=n_estimators, max_samples=0.5, n_jobs=-1)
        bag_clf.fit(X_trn, y_trn)
        bag_trn_error[run, j] = 1 - accuracy_score(y_trn, bag_clf.predict(X_trn))
        bag_tst_error[run, j] = 1 - accuracy_score(y_tst, bag_clf.predict(X_tst))

        # Train using random forests
        rf_clf = RandomForestClassifier(max_leaf_nodes=max_leaf_nodes,
                                        n_estimators=n_estimators, n_jobs=-1)
        rf_clf.fit(X_trn, y_trn)
        rf_trn_error[run, j] = 1 - accuracy_score(y_trn, rf_clf.predict(X_trn))
        rf_tst_error[run, j] = 1 - accuracy_score(y_tst, rf_clf.predict(X_tst))

        # Train using extra trees
        xt_clf = ExtraTreesClassifier(max_leaf_nodes=max_leaf_nodes, bootstrap=True,
                                      n_estimators=n_estimators, n_jobs=-1)
        xt_clf.fit(X_trn, y_trn)
        xt_trn_error[run, j] = 1 - accuracy_score(y_trn, xt_clf.predict(X_trn))
        xt_tst_error[run, j] = 1 - accuracy_score(y_tst, xt_clf.predict(X_tst))

results = (bag_trn_error, bag_tst_error,
           rf_trn_error, rf_tst_error,
           xt_trn_error, xt_tst_error)

In [6]:
%matplotlib inline
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Plot the training error
m = np.mean(bag_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linewidth=3, marker='o')
m = np.mean(rf_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linestyle='--', linewidth=3, marker='o')
m = np.mean(xt_trn_error*100, axis=0)
ax[0].plot(n_leaf_range, m, linestyle='-.', linewidth=3, marker='o')
ax[0].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
ax[0].set_xlabel('Number of Leaf Nodes per Tree', fontsize=16)
ax[0].set_ylabel('Training Error (%)', fontsize=16)
ax[0].set_xticks(n_leaf_range)
ax[0].grid()

# Plot the test error
m = np.mean(bag_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linewidth=3, marker='o')
m = np.mean(rf_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linestyle='--', linewidth=3, marker='o')
m = np.mean(xt_tst_error*100, axis=0)
ax[1].plot(n_leaf_range, m, linestyle='-.', linewidth=3, marker='o')
ax[1].legend(['Bagging', 'Random Forest', 'Extra Trees'], fontsize=16)
ax[1].set_xlabel('Number of Leaf Nodes per Tree', fontsize=16)
ax[1].set_ylabel('Hold-out Test Error (%)', fontsize=16)
ax[1].set_xticks(n_leaf_range)
ax[1].grid()

plt.tight_layout()

Out[6]: [Figure: mean training error (left) and hold-out test error (right) as the number of leaf nodes per tree increases, for Bagging, Random Forest and Extra Trees]
Feature importances with Random Forests
Finally, we see how feature importances from the random forest ensemble can identify the most predictive features for breast cancer diagnosis. Such analysis adds interpretability to the model and can be very helpful in communicating and explaining such models to domain experts such as doctors.
Feature importances with label correlations
First, let’s peek into the data set to see if we can discover some interesting relationships among the features and the diagnosis. This type of analysis is typical when we get a new data set, as we try to learn more about it. Here, our analysis will try to identify which features are most correlated with each other and with the diagnosis (label), so that we can check if random forests can do something similar.
In [7]:
%matplotlib inline
import seaborn as sea

df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
df['diagnosis'] = dataset['target']

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12, 8))
cor = np.abs(df.corr())
sea.heatmap(cor, annot=False, cbar=False, cmap=plt.cm.Reds, ax=ax[0])

f = ['mean radius', 'mean perimeter', 'mean area',
     'worst radius', 'worst perimeter', 'worst area',
     'radius error', 'perimeter error', 'area error', 'diagnosis']
cor_zoom = np.abs(df[f].corr())
sea.heatmap(cor_zoom, annot=True, cbar=False, cmap=plt.cm.Reds, ax=ax[1])
fig.tight_layout()

Out[7]: [Figure: heatmaps of the absolute pairwise feature correlations (left) and a zoomed-in view of nine size-related features together with the diagnosis (right)]
The heatmap on the left reveals several blocks of strongly inter-correlated features; the zoomed-in view on the right confirms this for the size-related features (radius, perimeter and area). Several features are also highly correlated with the label, that is, the diagnosis as benign or malignant. Let's identify the 10 features most correlated with the diagnosis label.
In [8]:
label_corr = cor.iloc[:, -1]
label_corr.sort_values(ascending=False)[1:11]

Out[8]:
worst concave points    0.793566
worst perimeter         0.782914
mean concave points     0.776614
worst radius            0.776454
mean perimeter          0.742636
worst area              0.733825
mean radius             0.730029
mean area               0.708984
mean concavity          0.696360
worst concavity         0.659610
Name: diagnosis, dtype: float64
Feature importances using random forests
Random forests can also provide feature importances. In scikit-learn, these scores are computed as the mean decrease in impurity that each feature produces across all the trees in the trained forest, and they are exposed through the feature_importances_ attribute. The listing below illustrates this.
Listing 3: Feature importances in the WDBC data set using random forests
In [9]:
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.15)
n_features = X_trn.shape[1]

rf = RandomForestClassifier(max_leaf_nodes=24, n_estimators=50, n_jobs=-1)
rf.fit(X_trn, y_trn)
err = 1 - accuracy_score(y_tst, rf.predict(X_tst))
print('Prediction Error = {0:4.2f}%'.format(err*100))

importance_threshold = 0.02
for i, (feature, importance) in enumerate(zip(dataset['feature_names'],
                                              rf.feature_importances_)):
    if importance > importance_threshold:
        print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))

Prediction Error = 4.65%
[0] mean radius (score=0.077)
[1] mean texture (score=0.020)
[2] mean perimeter (score=0.045)
[3] mean area (score=0.041)
[5] mean compactness (score=0.026)
[6] mean concavity (score=0.036)
[7] mean concave points (score=0.120)
[13] area error (score=0.020)
[20] worst radius (score=0.193)
[22] worst perimeter (score=0.138)
[23] worst area (score=0.054)
[26] worst concavity (score=0.027)
[27] worst concave points (score=0.096)
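Before plotting, we can quickly cross-check these importances against the earlier correlation analysis. The snippet below is a hypothetical addition, not one of the book's listings; it reuses label_corr from In [8] and rf from In [9] to intersect the two top-10 lists:

# Features that appear in both the correlation top 10 and the random
# forest's importance top 10
corr_top10 = set(label_corr.sort_values(ascending=False)[1:11].index)
rf_top10 = set(np.array(dataset['feature_names'])[
                   np.argsort(rf.feature_importances_)[::-1][:10]])
print(sorted(corr_top10 & rf_top10))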
We can plot the feature importances as identified by the random forest ensemble. For the WDBC data set, the random forest identifies the features below as being important. Observe that there is considerable overlap between the important features identified by correlation analysis and those identified by random forests, though their relative rankings differ.
In [10]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 4))

# Identify the important features
importance_threshold = 0.02
idx = np.array(range(n_features))
imp = np.where(rf.feature_importances_ >= importance_threshold)  # Important features
rest = np.setdiff1d(idx, imp)                                    # Remaining features

# Plot the important features and the rest on a bar chart
plt.bar(idx[imp], rf.feature_importances_[imp], alpha=0.65)
plt.bar(idx[rest], rf.feature_importances_[rest], alpha=0.45)

# Print feature names on the bar chart
for i, (feature, importance) in enumerate(zip(dataset['feature_names'],
                                              rf.feature_importances_)):
    if importance > importance_threshold:
        plt.text(i, 0.015, feature, ha='center', va='bottom',
                 rotation='vertical', fontsize=16, fontweight='bold')
        print('[{0}] {1} (score={2:4.3f})'.format(i, feature, importance))
    else:
        plt.text(i, 0.01, feature, ha='center', va='bottom',
                 rotation='vertical', fontsize=16, color='gray')

# Finish the plot
fig.axes[0].get_xaxis().set_visible(False)
plt.ylabel('Feature Importance Score', fontsize=16)
plt.xlabel('Features for Breast Cancer Diagnosis', fontsize=16)
plt.tight_layout()

Out[10]: [Figure: bar chart of feature importance scores for all 30 features, with the important features highlighted and labeled]
[0] mean radius (score=0.077)
[1] mean texture (score=0.020)
[2] mean perimeter (score=0.045)
[3] mean area (score=0.041)
[5] mean compactness (score=0.026)
[6] mean concavity (score=0.036)
[7] mean concave points (score=0.120)
[13] area error (score=0.020)
[20] worst radius (score=0.193)
[22] worst perimeter (score=0.138)
[23] worst area (score=0.054)
[26] worst concavity (score=0.027)
[27] worst concave points (score=0.096)
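One caveat worth knowing: scikit-learn's impurity-based importances are computed from training-set statistics. As a complementary check, scikit-learn's sklearn.inspection.permutation_importance measures how much the held-out accuracy drops when each feature's values are shuffled. Below is a sketch (not from the original) that reuses rf, X_tst and y_tst from Listing 3:

from sklearn.inspection import permutation_importance

# Mean drop in held-out accuracy when each feature is randomly permuted
result = permutation_importance(rf, X_tst, y_tst, n_repeats=10, n_jobs=-1)
for i in np.argsort(result.importances_mean)[::-1][:10]:
    print('{0} (mean accuracy drop = {1:4.3f})'.format(
        dataset['feature_names'][i], result.importances_mean[i]))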
That’s all for this article.
If you want to learn more about the book, you can check it out on our browser-based liveBook platform here.