From Feature Engineering Bookcamp by Sinan Ozdemir
This article series covers:
● Recognizing and mitigating bias in our data and model
● Quantifying fairness through various metrics
● Applying feature engineering techniques to remove bias from our model without sacrificing model performance

Take 35% off Feature Engineering Bookcamp by entering fccozdemir into the discount code box at checkout at manning.com.
Building a Baseline Model
Check out part 1 for an intro to the dataset and how bias influences machine learning models, making fairness important to consider when dealing with data.
It’s time to build our baseline ML model. For our first pass at our model, we will apply a bit of feature engineering to ensure our model interprets all of our data correctly and spend time analyzing the fairness/performance results of our model.
Feature Construction
As we saw in our EDA, we have three features that each count the number of juvenile offenses of the person in question. Let’s take another look at our three juvenile features.
compas_df[["juv_fel_count", "juv_misd_count", "juv_other_count"]].describe()
Figure 1. We have three different features that each count a subset of prior juvenile offenses. Our goal will be to combine them together into one feature
Let’s add these all up into a new column called juv_count that should be a bit easier to interpret.
Listing 1. Constructing a new Juvenile Offense Count Feature
# feature construction, add up our three juv columns and remove the original features
compas_df['juv_count'] = compas_df[["juv_fel_count", "juv_misd_count", "juv_other_count"]].sum(axis=1)  # A
compas_df = compas_df.drop(["juv_fel_count", "juv_misd_count", "juv_other_count"], axis=1)  # B
#A Construct our new total juvenile offense count
#B Remove the original juvenile features
We now have 1 new feature and we have removed 3 features as a result.
Figure 2. The current state of our training data with our combined juvenile offense count
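To see the row-wise sum in isolation, here is a minimal sketch on a made-up three-row frame. The name toy_df and all of its values are hypothetical stand-ins, not rows from the COMPAS data.

```python
import pandas as pd

# Hypothetical toy rows standing in for compas_df (values are made up)
toy_df = pd.DataFrame({
    "juv_fel_count": [0, 1, 2],
    "juv_misd_count": [0, 0, 1],
    "juv_other_count": [1, 0, 2],
    "priors_count": [3, 0, 7],
})

juv_cols = ["juv_fel_count", "juv_misd_count", "juv_other_count"]

# axis=1 sums across the columns of each row, giving one total per person
toy_df["juv_count"] = toy_df[juv_cols].sum(axis=1)

# Drop the three originals so only the combined count remains
toy_df = toy_df.drop(columns=juv_cols)
```

The combined column carries the same total as the three originals but no longer distinguishes felonies from misdemeanors, which is the trade-off we accept for a feature that is easier to interpret.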
Building our baseline pipeline
Let’s start putting our pipeline together to create our baseline ML model. Let’s begin by splitting up our data into a training and testing set and let’s also instantiate a static Random Forest classifier. We have chosen a Random Forest model here because Random Forest models have the useful feature of calculating feature importance. This will end up being very useful for us. We could have chosen a Decision Tree or even a Logistic Regression as they both also have representations of feature importance but for now, we will go with a Random Forest. Remember that our goal is to manipulate our features and not our model so we will use the same model with the same parameters for all of our iterations. In addition to splitting up our X and our y, we will also split the race column so that we have an easy way to split our test set by race.
Listing 2. Splitting up our data into training and testing sets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Split up our data
X_train, X_test, y_train, y_test, race_train, race_test = train_test_split(
    compas_df.drop('two_year_recid', axis=1),
    compas_df['two_year_recid'],
    compas_df['race'],
    stratify=compas_df['two_year_recid'],
    test_size=0.3,
    random_state=0
)

# our static classifier
classifier = RandomForestClassifier(max_depth=10, n_estimators=20, random_state=0)
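Passing a third array to train_test_split is what keeps the race column aligned with the rows of X and y. The sketch below illustrates that behavior on made-up lists; toy_X, toy_y, and toy_group are hypothetical names and values used only for this example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 people, a binary label, and a group column
toy_X = [[i] for i in range(10)]               # single feature: the row id
toy_y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
toy_group = ["a", "a", "b", "b", "a", "a", "b", "b", "a", "b"]

# Three input arrays come back as six pieces, all split on the same rows
X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(
    toy_X, toy_y, toy_group,
    stratify=toy_y, test_size=0.3, random_state=0)

# Each test row still lines up with its own label and its own group
aligned = all(
    toy_y[row[0]] == label and toy_group[row[0]] == grp
    for row, label, grp in zip(X_te, y_te, g_te)
)
```

Because the group labels travel with the split, we can later slice test-set predictions by group without re-joining anything.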
Now that we have our data split up and our classifier ready to go, let’s start creating our feature pipelines just like we did in our last chapter. First up is our categorical data. Let’s create a pipeline that will one-hot encode our categorical columns and drop one of the two dummy columns only when the categorical feature is binary.
Listing 3. Creating our qualitative pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_features = ['race', 'sex', 'c_charge_degree']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='if_binary'))
])
And for our numerical data, we will standardize our features so that the columns with the outliers we saw in our EDA sit on a comparable scale.
Listing 4. Creating our quantitative pipeline
numerical_features = ["age", "priors_count"]
numerical_transformer = Pipeline(steps=[
    ('scale', StandardScaler())
])
Let’s introduce the ColumnTransformer object from scikit-learn that will help us quickly apply our two pipelines to our specific columns with minimal code.
Listing 5. Putting our pipelines together to create our feature preprocessor
preprocessor = ColumnTransformer(transformers=[
    ('cat', categorical_transformer, categorical_features),
    ('num', numerical_transformer, numerical_features)
])

clf_tree = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', classifier)
])
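To get a feel for what the preprocessor hands to the classifier, here is a small sketch on a made-up frame. The names toy_X and toy_preprocessor and all the category values are hypothetical; the point is only to show how many columns come out the other side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one 3-level category, one binary category, one number
toy_X = pd.DataFrame({
    "race": ["A", "B", "C", "A"],
    "sex": ["Male", "Female", "Male", "Male"],
    "age": [20, 30, 40, 50],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(drop="if_binary"), ["race", "sex"]),
    ("num", StandardScaler(), ["age"]),
])

toy_features = toy_preprocessor.fit_transform(toy_X)

# 3 dummy columns for race + 1 for binary sex + 1 scaled age column
n_columns = toy_features.shape[1]
```

The three-level feature keeps all of its dummy columns while the binary one collapses to a single column, which is the effect of drop='if_binary'.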
With our pipeline set up, we can train it on our training set and run it on our test set.
Listing 6. Running our bias-unaware model on our test set
clf_tree.fit(X_train, y_train)
unaware_y_preds = clf_tree.predict(X_test)
unaware_y_preds will be an array of 0s and 1s, where 0 represents our model predicting that this person will not recidivate and 1 represents our model predicting that this person will recidivate.
Now that we have our model’s predictions on the test set, it’s time to start investigating how fair our ML model truly is.
Measuring bias in our baseline model
To help us dive into our fairness metrics, we are going to be using a module called dalex. Dalex has some excellent features that help visualize different kinds of bias and fairness metrics. Our base object is the Explainer object and with our explainer object, we can obtain some basic model performance.
Listing 7. Using Dalex to explain our model
import dalex as dx

exp_tree = dx.Explainer(clf_tree, X_test, y_test, label='Random Forest Bias Unaware', verbose=True)
exp_tree.model_performance()
Figure 3. Baseline model performance with our bias-unaware model
Our metrics are not amazing but we are concerned with both performance and fairness, so let’s dig into fairness a bit. Our first question is “how much did our model rely on race as a way to predict recidivism?” This question goes hand in hand with our model’s disparate treatment. Dalex has a very handy plot that can be used with tree-based models and linear models to help visualize the features our model is learning the most from.
exp_tree.model_parts().plot()
Figure 4. Feature importance of our bias-unaware model as reported by dalex. This visualization reports permutation-based importance (dropout loss) rather than reading the Random Forest’s feature importance attribute directly, and it shows that priors_count and age are our most important features.
Dalex reports importance in terms of dropout loss, meaning how much the overall “fit” of our model would decrease if the feature in question were entirely removed. According to this chart, our model would lose a lot of information if we lost priors_count but, in theory, would have been better if we dropped race. It would seem that our model isn’t even learning from race at all! This speaks to the model’s unawareness of sensitive features.
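The idea behind a dropout-style importance can be sketched without any library: shuffle one column so it no longer lines up with the labels, then measure how much the loss rises. The sketch below is a simplified illustration under our own assumptions, not dalex's implementation; toy_model, the rows, and the use of misclassification rate as the loss are all hypothetical choices.

```python
import random

# Hypothetical toy "model": it only ever looks at the first feature
def toy_model(row):
    return 1 if row[0] >= 3 else 0

# Made-up rows of [used feature, unused feature]; labels come from the toy model
rows = [[0, 5], [1, 9], [2, 1], [3, 7], [4, 2], [5, 8], [6, 3], [7, 6]]
labels = [toy_model(r) for r in rows]

def loss(data):
    """Misclassification rate (1 - accuracy) of toy_model on data."""
    wrong = sum(toy_model(r) != y for r, y in zip(data, labels))
    return wrong / len(data)

def permuted_loss(col, n_rounds=20, seed=0):
    """Average loss after shuffling one column, breaking its link to the labels."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rounds):
        shuffled = [r[col] for r in rows]
        rng.shuffle(shuffled)
        data = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
        total += loss(data)
    return total / n_rounds

baseline = loss(rows)
increase_used = permuted_loss(0) - baseline     # feature the model relies on
increase_unused = permuted_loss(1) - baseline   # feature the model ignores
```

A feature the model never consults shows no increase in loss when shuffled, which is the same pattern we read off the chart for race.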
Before we begin our no-bias-here dance, we should look at a few more metrics. Dalex also has a model_fairness object we can look at that will calculate several metrics for each of our racial categories.
Listing 8. Outputting Model Fairness
mf_tree = exp_tree.model_fairness(protected=race_test, privileged="Caucasian")
mf_tree.metric_scores
Figure 5. A breakdown of 10 fairness metrics for our bias-unaware model
This package gives us 10 metrics by default. Let’s break down how to calculate each one in terms of True Positives (TP), False Positives (FP), True Negatives (TN), False Negatives (FN), Actual Positives (AP), Actual Negatives (AN), Predicted Positives (PP), and Predicted Negatives (PN). Keep in mind that we can calculate each of these metrics by race:
 TPR(r) = TP / AP (a.k.a. sensitivity)
 TNR(r) = TN / AN (a.k.a. specificity)
 PPV(r) = TP / (PP) (a.k.a. precision)
 NPV(r) = TN / (PN)
 FNR(r) = FN / AP OR 1 – TPR
 FPR(r) = FP / AN OR 1 – TNR
 FDR(r) = FP / (PP) OR 1 – PPV
 FOR(r) = FN / (PN) OR 1 – NPV
 ACC(r) = (TP + TN) / (TP + TN + FP + FN) (Overall accuracy by Race)
 STP(r) = (TP + FP) / (TP + TN + FP + FN) (a.k.a. P[recidivism predicted | Race=r])
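The ten definitions above translate directly into a small helper that works from one group's confusion-matrix counts. This is a minimal sketch: the function name group_metrics and the counts passed to it are made up for illustration, and STP is taken as predicted positives over all cases in the group.

```python
def group_metrics(tp, fp, tn, fn):
    """Return the ten per-group rates from one group's confusion-matrix counts."""
    ap, an = tp + fn, tn + fp      # actual positives / actual negatives
    pp, pn = tp + fp, tn + fn      # predicted positives / predicted negatives
    total = tp + fp + tn + fn
    return {
        "TPR": tp / ap, "TNR": tn / an,
        "PPV": tp / pp, "NPV": tn / pn,
        "FNR": fn / ap, "FPR": fp / an,
        "FDR": fp / pp, "FOR": fn / pn,
        "ACC": (tp + tn) / total,
        "STP": pp / total,
    }

# Made-up counts for one hypothetical group (not taken from the COMPAS data)
scores = group_metrics(tp=40, fp=10, tn=30, fn=20)
```

Running the helper once per racial group on the test set would give a table in the same shape as metric_scores, and the complementary pairs (TPR and FNR, TNR and FPR, and so on) each sum to 1.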
These numbers on their own will not be very helpful so let’s perform a “fairness check” by comparing our values to the privileged group of people: Caucasians. Why are we choosing Caucasians as our privileged group? Well, among a lot of other reasons, if we look at how often our baseline model predicted recidivism between our groups, we will notice that the model is vastly underpredicting Caucasian recidivism compared to actual rates in our test set.
For our purposes, we will focus on TPR, ACC, PPV, FPR, and STP as our main metrics. The reason we are choosing these metrics is that:
 TPR relates to how well our model captures actual recidivism. Of all the times people recidivate, did our model predict them as positive? We want this to be higher.
 ACC is our overall accuracy. It is a fairly well-rounded way to judge our model but will not be taken into consideration in a vacuum. We want this to be higher.
 PPV is our precision. It measures how much we can trust our model’s positive predictions. Of the times our model predicts recidivism, how often was the model correct in that positive prediction? We want this to be higher.
 FPR relates to our model’s rate of predicting recidivism when someone will not actually recidivate. We want this to be lower.
 STP is statistical parity per group. We want these values to be roughly equal across races, meaning our model should be able to reliably predict recidivism based on non-demographic information.
Listing 9. Highlighting Caucasian Privilege
import pandas as pd

# Recidivism by race in our test set
y_test.groupby(race_test).mean()

# Predicted Recidivism by race in our bias-unaware model
pd.Series(unaware_y_preds, index=y_test.index).groupby(race_test).mean()
Figure 6. On the left we have actual recidivism rates by group in our test set, and on the right the rates of recidivism predicted by our baseline bias-unaware model. Our model is vastly underpredicting Caucasian recidivism: nearly 41% of Caucasian people recidivated, while our model only thought 28% of them would. In relative terms, the predicted rate falls roughly 30% short of the actual rate.
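The index=y_test.index argument in the listing matters because the predictions come back as a plain array with no index, while the race labels keep the shuffled index from the split. A minimal sketch with made-up values shows the alignment; every name prefixed with toy_ is hypothetical.

```python
import pandas as pd

# Made-up labels whose index is not 0..n-1, as happens after a random split
toy_y_test = pd.Series([1, 0, 1, 1, 0, 0], index=[42, 7, 19, 3, 88, 51])
toy_race_test = pd.Series(["A", "A", "A", "B", "B", "B"], index=toy_y_test.index)

# Model output is a plain sequence with no index of its own
toy_preds = [1, 0, 0, 0, 0, 1]

# The mean of a 0/1 column per group is that group's rate
actual_rate = toy_y_test.groupby(toy_race_test).mean()

# Re-attach the test index so each prediction is grouped with the right person
predicted_rate = pd.Series(toy_preds, index=toy_y_test.index).groupby(toy_race_test).mean()
```

In this toy case group A has an actual rate of two in three but a predicted rate of one in three, the same kind of gap we see for the Caucasian group in Figure 6.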
For African-American people, the predicted and actual rates of recidivism are very similar, while Caucasians receive a recidivism prediction less than 29% of the time even though their actual rate is almost 41%. The fact that our model is underpredicting the Caucasian group is an indicator that Caucasians are privileged by our model. Part of the reason this is happening is that the data is representative of an unfair justice system. Recall that African-Americans have a higher prior count and that the prior count was the most important feature in our model. Even so, the model is still unable to accurately predict Caucasian recidivism, so it clearly cannot reliably predict recidivism based on the raw data.
Let’s run that fairness check now to see how our bias-unaware model is doing across our five bias metrics.
mf_tree = exp_tree.model_fairness(protected=race_test, privileged="Caucasian")
mf_tree.fairness_check()
Our output is outlined in the following table and at first glance, it is a lot! We’ve highlighted the main areas to focus on. We want each of the values to be between 0.8 and 1.25; the bolded values are those that fall outside of that range and are therefore called out as evidence of bias.
Bias detected in 4 metrics: TPR, PPV, FPR, STP

Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.

Ratios of metrics, based on 'Caucasian'. Parameter 'epsilon' was set to 0.8 and therefore metrics should be within (0.8, 1.25)
                       TPR       ACC       PPV       FPR       STP
African-American  1.633907  1.035994  1.160069  1.701493  1.782456
Hispanic          0.874693  1.007825  0.769363  1.069652  0.915789
Other             1.380835  1.035994  0.876076  1.422886  1.336842
Each value in the table above is the value from the metric_scores table divided by the Caucasian value (our privileged group). For example, the African-American TPR value of 1.633907 is equal to TPR(African-American) / TPR(Caucasian), which is calculated as 0.665 / 0.407.
These ratios are then checked against a four-fifths range of (0.8, 1.25) and if our metric falls outside of that range, we consider that ratio unfair. The ideal value is 1, which indicates that the specified metric for that race is equal to the value of that metric for our privileged group. If we count up the number of ratios outside of that range, we come up with 7 (they are bolded).
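The count of 7 can be reproduced by hand from the ratio table. The sketch below copies the ratios from the fairness check output and applies the range test in plain Python; it is our own illustration of the check, not dalex's code, and the variable names are hypothetical.

```python
# Ratios copied from the fairness check output (each group divided by Caucasian)
ratios = {
    "African-American": {"TPR": 1.633907, "ACC": 1.035994, "PPV": 1.160069,
                         "FPR": 1.701493, "STP": 1.782456},
    "Hispanic": {"TPR": 0.874693, "ACC": 1.007825, "PPV": 0.769363,
                 "FPR": 1.069652, "STP": 0.915789},
    "Other": {"TPR": 1.380835, "ACC": 1.035994, "PPV": 0.876076,
              "FPR": 1.422886, "STP": 1.336842},
}

epsilon = 0.8
low, high = epsilon, 1 / epsilon   # the (0.8, 1.25) range

# Collect every (group, metric) pair whose ratio falls outside the range
violations = [
    (group, metric)
    for group, row in ratios.items()
    for metric, value in row.items()
    if not (low < value < high)
]

# Metrics with at least one violating group
biased_metrics = sorted({metric for _, metric in violations})
```

The violating pairs cover TPR, PPV, FPR, and STP, matching the four metrics named in the output, while ACC stays inside the range for every group.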
We can plot the numbers in the previous table using dalex as well.
mf_tree.plot() # Same numbers from the fairness_check in a plot
Figure 7. Dalex offers a visual breakdown of the 5 main ratios we will focus on broken down by subgroup. We want all of the blue bars to be within the yellow section of the graph and bars in the red section are considered in danger of being biased. We can see that we have some work to do!
To make things a bit simpler, let’s focus on the parity loss of each of the five metrics from our fairness check. Parity loss represents a total score across our disadvantaged groups. Dalex calculates parity loss for a metric as the sum of the absolute values of the logs of the metric ratios in our fairness check.
For example, if we look at the statistical parities of our groups (STP) we have:
STP(African-American) = 0.508
STP(Hispanic) = 0.261
STP(Other) = 0.381
STP(Caucasian) = 0.285
A quick code snippet reveals that the parity loss for STP for our bias-unaware model should be 0.956.
import numpy as np

# STP metrics for unprivileged groups
unpriv_stp = [0.508, 0.261, 0.381]

# STP metrics for privileged group
caucasian_stp = 0.285

# 0.956 appears as light orange in the following figure
sum([abs(np.log(u / caucasian_stp)) for u in unpriv_stp])
We see that the parity loss of STP is 0.956. Luckily, dalex gives us an easier way to calculate parity loss for all five metrics and stack them together in a chart for us. The following figure is the one we will use to compare across our models and the 5 stacks represent the values for each of our five bias metrics. They are stacked up together to represent the overall bias of the model. We want to see the overall stacked length decrease as we become more bias-aware. We will be pairing this stacked parity loss graph with classic ML metrics like accuracy, precision, and recall.
# Plot of parity loss of each metric
mf_tree.plot(type='stacked')
Figure 8. Cumulative parity loss. In this case, smaller is better, meaning less bias. For example, the light orange section on the right-hand side represents the 0.956 we previously calculated by hand. Overall our bias-unaware model is scoring around 3.5, which is our number to beat on the bias front.
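The overall score of around 3.5 can be checked against the ratios from the fairness check. The sketch below applies the parity loss definition (sum of absolute logs of the ratios) to each of the five metrics and adds them up; the dictionary layout and variable names are our own, and only the numbers come from the fairness check output.

```python
import math

# Ratios vs. the Caucasian group, in the order African-American, Hispanic, Other
ratios_by_metric = {
    "TPR": [1.633907, 0.874693, 1.380835],
    "ACC": [1.035994, 1.007825, 1.035994],
    "PPV": [1.160069, 0.769363, 0.876076],
    "FPR": [1.701493, 1.069652, 1.422886],
    "STP": [1.782456, 0.915789, 1.336842],
}

# Parity loss per metric: sum of |log(ratio)| over the unprivileged groups
parity_loss = {
    metric: sum(abs(math.log(r)) for r in group_ratios)
    for metric, group_ratios in ratios_by_metric.items()
}

# Length of the full stacked bar: the five parity losses added together
total_parity_loss = sum(parity_loss.values())
```

Taking the absolute log means a ratio of 0.8 and a ratio of 1.25 are penalized equally, so the score treats a group that falls below the privileged group the same as one that sits above it.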
We now have both a baseline for model performance (from our model performance summary) and a baseline for fairness given by our stacked parity loss chart.
Let’s move on now to how we can actively use feature engineering to mitigate bias in our data.
Mitigating Bias
When it comes to mitigating bias and promoting fairness in our models, we have three main opportunities to do so:
 Preprocessing – bias mitigation as applied to the training data, i.e. before the model has had a chance to train on the training data
 In-processing – bias mitigation applied to a model during the training phase
 Postprocessing – bias mitigation applied to the predicted labels after the model has been fit to the training data
Each phase of bias mitigation has its pros and cons; preprocessing is the phase that maps directly onto feature engineering techniques.
Preprocessing
Preprocessing bias mitigation takes place in the training data before modeling takes place. Preprocessing is useful when we don’t have access to the model itself or the downstream predictions but we do have access to the initial training data.
Two examples of preprocessing bias mitigation techniques that we will implement in this chapter are:
 Disparate Impact Removal – editing feature values to improve group fairness
 Learning Fair Representations – extracting a new feature set by obfuscating the original information regarding protected attributes
By implementing these two techniques, we hope to reduce the overall bias that our model is exhibiting while also trying to enhance our ML pipeline’s performance in the process.
In-processing
In-processing techniques are applied during training time. They usually come in the form of some regularization term or an alternative objective function. In-processing techniques are only possible when we have access to the actual learning algorithm. Otherwise, we’d have to rely on pre- or postprocessing.
Some examples of in-processing bias mitigation techniques include:
 Meta Fair Classifier – uses fairness as an input to optimize a classifier for fairness
 Prejudice Remover – adds a privilege-aware regularization term to our learning objective
Postprocessing
Postprocessing techniques, as the name implies, are applied after training time and are most useful when we need to treat the ML model as a black box and we don’t have access to the original training data.
Some examples of postprocessing bias mitigation techniques include:
 Equalized Odds – modifying predicted labels using a separate optimization objective to make the predictions fairer.
 Calibrated Equalized Odds – modifying the classifier’s scores to make for fairer results.
That’s all for now. In part 3, we will see how to build a bias-aware model.
If you want to learn more about the book, check it out on Manning’s liveBook platform here.