From Feature Engineering Bookcamp by Sinan Ozdemir

This article series covers

●      Recognizing and mitigating bias in our data and model

●      Quantifying fairness through various metrics

●      Applying feature engineering techniques to remove bias from our model without sacrificing model performance


Take 35% off Feature Engineering Bookcamp by entering fccozdemir into the discount code box at checkout at manning.com.


Building a Baseline Model

Check out part 1 for an intro to the dataset and how bias influences machine learning models, making fairness important to consider when dealing with data.

It’s time to build our baseline ML model. For our first pass, we will apply a bit of feature engineering to ensure the model interprets all of our data correctly, and then spend time analyzing its fairness and performance results.

Feature Construction

As we saw in our EDA, we have three features that each count a subset of the juvenile offenses of the person in question. Let’s take another look at those three juvenile features.

 
 compas_df[["juv_fel_count", "juv_misd_count", "juv_other_count"]].describe()
  

Figure 1. We have three different features that each count a subset of prior juvenile offenses. Our goal will be to combine them into one feature.


Let’s add these all up into a new column called juv_count that should be a bit easier to interpret.

Listing 1. Constructing a new Juvenile Offense Count Feature

 
 # feature construction, add up our three juv columns and remove the original features
 compas_df['juv_count'] = compas_df[["juv_fel_count", "juv_misd_count", "juv_other_count"]].sum(axis=1)  # A
 compas_df = compas_df.drop(["juv_fel_count", "juv_misd_count", "juv_other_count"], axis=1)  # B
  

#A Construct our new total juvenile offense count

#B Remove the original juvenile features

We now have one new feature, juv_count, and have removed the three original features as a result.


Figure 2. The current state of our training data with our combined juvenile offense count


Building our baseline pipeline

Let’s start putting our pipeline together to create our baseline ML model. We will begin by splitting our data into a training and testing set and instantiating a static Random Forest classifier. We have chosen a Random Forest here because Random Forest models have the useful feature of calculating feature importances, which will end up being very useful for us. We could have chosen a Decision Tree or even a Logistic Regression, as both also have representations of feature importance, but for now we will go with a Random Forest. Remember that our goal is to manipulate our features, not our model, so we will use the same model with the same parameters for all of our iterations. In addition to splitting up our X and our y, we will also split the race column so that we have an easy way to break our test set down by race.

Listing 2. Splitting up our data into training and testing sets

 
 from sklearn.model_selection import train_test_split
 from sklearn.ensemble import RandomForestClassifier
  
 # Split up our data
 X_train, X_test, y_train, y_test, race_train, race_test = train_test_split(compas_df.drop('two_year_recid', axis=1),
                                                     compas_df['two_year_recid'],
                                                     compas_df['race'],
                                                     stratify=compas_df['two_year_recid'],
                                                     test_size=0.3,
                                                     random_state=0)
  
 # our static classifier
 classifier = RandomForestClassifier(max_depth=10, n_estimators=20, random_state=0)
  

Now that we have our data split up and our classifier ready to go, let’s start creating our feature pipelines just like we did in our last chapter. First up is our categorical data. Let’s create a pipeline that will one-hot encode our categorical columns, dropping the second dummy column only when the categorical feature is binary.

Listing 3. Creating our qualitative pipeline

 
 from sklearn.compose import ColumnTransformer
 from sklearn.pipeline import Pipeline, FeatureUnion
 from sklearn.preprocessing import OneHotEncoder, StandardScaler
  
 categorical_features = ['race', 'sex', 'c_charge_degree']
 categorical_transformer = Pipeline(steps=[
     ('onehot', OneHotEncoder(drop='if_binary'))
 ])
  
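To make the drop='if_binary' behavior concrete, here is a small standalone sketch. The toy DataFrame below is invented purely for illustration (it is not part of the book’s listings), and get_feature_names_out assumes a reasonably recent version of scikit-learn (1.0+).

 
 import pandas as pd
 from sklearn.preprocessing import OneHotEncoder
 
 # Toy frame for illustration only: 'sex' is binary, 'race' has three levels
 toy = pd.DataFrame({
     'sex':  ['Male', 'Female', 'Female', 'Male'],
     'race': ['African-American', 'Caucasian', 'Hispanic', 'Caucasian']
 })
 
 ohe = OneHotEncoder(drop='if_binary')
 encoded = ohe.fit_transform(toy).toarray()
 
 # The binary 'sex' feature is reduced to a single dummy column,
 # while 'race' keeps one column per category
 print(ohe.get_feature_names_out())  # one 'sex_*' column, three 'race_*' columns
 print(encoded.shape)                # (4, 4): 1 column for sex + 3 for race
 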

And for our numerical data, we will scale our data in order to bring down those outliers that we saw in our EDA.

Listing 4. Creating our quantitative pipeline

 
 numerical_features = ["age", "priors_count"]
 numerical_transformer = Pipeline(steps=[
     ('scale', StandardScaler())
 ])
  

Let’s introduce the ColumnTransformer object from scikit-learn that will help us quickly apply our two pipelines to our specific columns with minimal code.

Listing 5. Putting our pipelines together to create our feature preprocessor

 
 preprocessor = ColumnTransformer(transformers=[
         ('cat', categorical_transformer, categorical_features),
         ('num', numerical_transformer, numerical_features)
 ])
  
 clf_tree = Pipeline(steps=[
     ('preprocessor', preprocessor),
     ('classifier', classifier)
 ])
  

With our pipeline set up, we can train it on our training set and run it on our test set.

Listing 6. Running our bias-unaware model on our test set

 
 clf_tree.fit(X_train, y_train)
 unaware_y_preds = clf_tree.predict(X_test)
  

unaware_y_preds will be an array of 0s and 1s, where a 0 represents our model predicting that this person will not recidivate and a 1 represents our model predicting that this person will recidivate.
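As an optional sanity check (not one of the book’s listings), we can quickly look at what share of the test set the model flags as likely to recidivate:

 
 import pandas as pd
 
 # Share of test-set people predicted to recidivate (1) vs. not (0),
 # assuming unaware_y_preds from Listing 6
 pd.Series(unaware_y_preds).value_counts(normalize=True)
 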

Now that we have our model’s predictions on the test set, it’s time to start investigating how fair our ML model truly is.

Measuring bias in our baseline model

To help us dive into our fairness metrics, we are going to be using a module called dalex. Dalex has some excellent features that help visualize different kinds of bias and fairness metrics. Its core object is the Explainer, and with an Explainer we can obtain some basic model performance metrics.

Listing 7. Using Dalex to explain our model

 
 import dalex as dx
  
 exp_tree = dx.Explainer(clf_tree, X_test, y_test, label='Random Forest Bias Unaware', verbose=True)
 exp_tree.model_performance()
  

Figure 3. Baseline model performance with our bias-unaware model


Our metrics are not amazing but we are concerned with both performance and fairness, so let’s dig into fairness a bit. Our first question is “how much did our model rely on race as a way to predict recidivism?” This question goes hand in hand with our model’s disparate treatment. Dalex has a very handy plot that can be used with tree-based models and linear models to help visualize the features our model is learning the most from.

 
 exp_tree.model_parts().plot()
  

Figure 4. Feature importance of our bias-unaware model as reported by dalex. This visualization reports importance in terms of drop-out loss and shows that priors_count and age are our most important features.


Dalex reports importance in terms of drop-out loss, meaning how much the overall “fit” of our model would decrease if the feature in question were entirely removed. According to this chart, our model would lose a lot of information if we lost priors_count but, in theory, would do slightly better if we dropped race. It would seem that our model isn’t learning from race at all! This speaks to the model’s unawareness of sensitive features.
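If you want to sanity-check the same idea outside of dalex, scikit-learn’s permutation_importance offers a rough analogue: shuffle one raw column at a time and measure how much the model’s score drops. This is a sketch of the underlying idea, not dalex’s exact drop-out loss computation, and it assumes clf_tree, X_test, and y_test from the listings above.

 
 from sklearn.inspection import permutation_importance
 
 # Rough analogue of drop-out loss: permute each raw column and
 # measure the resulting drop in ROC AUC on the test set
 perm = permutation_importance(clf_tree, X_test, y_test, n_repeats=5,
                               random_state=0, scoring='roc_auc')
 
 for name, score in sorted(zip(X_test.columns, perm.importances_mean),
                           key=lambda pair: -pair[1]):
     print(f"{name:20s} {score:.4f}")
 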

Before we begin our no-bias-here dance, we should look at a few more metrics. Dalex also has a model_fairness method that will calculate several metrics for each of our racial categories.

Listing 8. Outputting Model Fairness

 
 mf_tree = exp_tree.model_fairness(protected=race_test, privileged = "Caucasian")
 mf_tree.metric_scores
  

Figure 5. A breakdown of 10 fairness metrics for our bias-unaware model


This package gives us 10 metrics by default. Let’s break down how to calculate each one in terms of True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN), Actual Positives (AP), Actual Negatives (AN), Predicted Positives (PP), and Predicted Negatives (PN). Keep in mind that we calculate each of these metrics separately for each race r:

  1. TPR(r) = TP / AP                            (a.k.a. sensitivity)
  2. TNR(r) = TN / AN                            (a.k.a. specificity)
  3. PPV(r) = TP / PP                            (a.k.a. precision)
  4. NPV(r) = TN / PN
  5. FNR(r) = FN / AP  OR  1 – TPR
  6. FPR(r) = FP / AN  OR  1 – TNR
  7. FDR(r) = FP / PP  OR  1 – PPV
  8. FOR(r) = FN / PN  OR  1 – NPV
  9. ACC(r) = (TP + TN) / (TP + TN + FP + FN)    (overall accuracy by race)
  10. STP(r) = (TP + FP) / (TP + TN + FP + FN)   (a.k.a. P[recidivism predicted | Race = r])
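To make these definitions concrete, here is a short sketch (not from the book) that computes a few of them by hand for a single racial group using scikit-learn’s confusion matrix; dalex does the equivalent for every group at once. It assumes y_test, unaware_y_preds, and race_test from the earlier listings.

 
 from sklearn.metrics import confusion_matrix
 
 def group_metrics(y_true, y_pred, race, group):
     """Compute a few of the above metrics for a single racial group."""
     mask = (race == group)
     tn, fp, fn, tp = confusion_matrix(y_true[mask], y_pred[mask]).ravel()
     return {
         'TPR': tp / (tp + fn),                    # sensitivity
         'FPR': fp / (fp + tn),                    # 1 - specificity
         'PPV': tp / (tp + fp),                    # precision
         'ACC': (tp + tn) / (tp + tn + fp + fn),   # accuracy
         'STP': (tp + fp) / (tp + tn + fp + fn),   # P(predicted positive | group)
     }
 
 # e.g. group_metrics(y_test.values, unaware_y_preds, race_test.values, 'Caucasian')
 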

These numbers on their own will not be very helpful, so let’s perform a “fairness check” by comparing our values to those of the privileged group of people: Caucasians. Why are we choosing Caucasians as our privileged group? Well, among a lot of other reasons, if we look at how often our baseline model predicted recidivism for each group, we will notice that the model is vastly under-predicting Caucasian recidivism compared to actual rates in our test set.

For our purposes, we will focus on TPR, ACC, PPV, FPR, and STP as our main metrics. We are choosing these metrics for the following reasons:

  1. TPR relates to how well our model captures actual recidivism. Of all the times people recidivate, did our model predict them as positive? We want this to be higher.
  2. ACC is our overall accuracy. It is a fairly well-rounded way to judge our model but will not be taken into consideration in a vacuum. We want this to be higher.
  3. PPV is our precision. It measures how much we can trust our model’s positive predictions. Of the times our model predicts recidivism, how often was the model correct in that positive prediction? We want this to be higher.
  4. FPR relates to our model’s rate of predicting recidivism when someone will not actually recidivate. We want this to be lower.
  5. STP is statistical parity per group. We want this to be roughly equal across races, meaning our model should be relying on non-demographic information to predict recidivism.

Listing 9. Highlighting Caucasian Privilege

 
 # Recidivism by race in our test set
 y_test.groupby(race_test).mean()
  
 # Predicted Recidivism by race in our bias-unaware model
 pd.Series(unaware_y_preds, index=y_test.index).groupby(race_test).mean()
  

Figure 6. On the left we have actual recidivism rates by group in our test set, and on the right we have the rates of recidivism predicted by our baseline bias-unaware model. Our model is vastly under-predicting Caucasian recidivism: nearly 41% of Caucasian people recidivated, while our model only predicted that 28% of them would, a predicted rate roughly 30% lower than the actual rate.


The predicted and actual rates of recidivism among African-American people are very similar, while Caucasians only receive a recidivism prediction less than 29% of the time even though the actual rate is almost 41%. The fact that our model is under-predicting for the Caucasian group is an indicator that Caucasians are privileged by our model. Part of the reason this is happening is that the data reflects an unfair justice system. Recall that African-Americans have a higher prior count and that prior count was the most important feature in our model; even so, the model is still unable to accurately predict Caucasian recidivism, which tells us it cannot reliably predict recidivism from the raw data alone.

Let’s run that fairness check now to see how our bias-unaware model is doing across our five bias metrics.

 
 mf_tree = exp_tree.model_fairness(protected=race_test, privileged = "Caucasian")
 mf_tree.fairness_check()
  

Our output is outlined in the following table and, at first glance, it is a lot! We want each of the values to fall within the range (0.8, 1.25); the values outside of that range are being called out as evidence of bias.

 
 Bias detected in 4 metrics: TPR, PPV, FPR, STP
 Conclusion: your model is not fair because 2 or more criteria exceeded acceptable limits set by epsilon.
 Ratios of metrics, based on 'Caucasian'. Parameter 'epsilon' was set to 0.8 and therefore metrics should be within (0.8, 1.25)
                        TPR       ACC       PPV       FPR       STP
 African-American  1.633907  1.035994  1.160069  1.701493  1.782456
 Hispanic          0.874693  1.007825  0.769363  1.069652  0.915789
 Other             1.380835  1.035994  0.876076  1.422886  1.336842
  

Each value in the table above is the value from the metric_scores table divided by the Caucasian value (our privileged group). For example, the African-American TPR value of 1.633907 is equal to TPR(African-American) / TPR(Caucasian), which is calculated as 0.665 / 0.407.

These ratios are then checked against the four-fifths range of (0.8, 1.25); if a metric falls outside of that range, we consider that ratio unfair. The ideal value is 1, which indicates that the metric for that race is equal to the metric for our privileged group. If we count up the number of ratios outside of that range, we come up with 7.
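If you’d like to reproduce those ratios yourself, dividing the metric_scores table by its Caucasian row gives the same numbers. This is a sketch, assuming (as in the dalex documentation) that metric_scores comes back as a pandas DataFrame indexed by subgroup:

 
 # Reproduce the fairness_check ratios by hand: divide each subgroup's
 # metric scores by the privileged (Caucasian) row of the metric_scores table
 ratios = mf_tree.metric_scores.div(mf_tree.metric_scores.loc['Caucasian'])
 
 # Flag anything outside the four-fifths window (0.8, 1.25)
 outside = (ratios < 0.8) | (ratios > 1.25)
 print(ratios[['TPR', 'ACC', 'PPV', 'FPR', 'STP']].round(3))
 print(outside[['TPR', 'ACC', 'PPV', 'FPR', 'STP']].sum().sum(), "ratios flagged")
 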

We can plot the numbers in the previous table using dalex as well.

 
 mf_tree.plot()  # Same numbers from the fairness_check in a plot
  

Figure 7. Dalex offers a visual breakdown of the 5 main ratios we will focus on, broken down by sub-group. We want all of the blue bars to fall within the yellow section of the graph; bars in the red section are considered in danger of being biased. We can see that we have some work to do!


To make things a bit simpler, let’s focus on the parity loss of each of the five metrics from our fairness check. Parity loss represents a total score across our disadvantaged groups. Dalex calculates the parity loss of a metric M as the sum, over the unprivileged subgroups, of the absolute value of the natural log of the metric ratios from our fairness check:

parity_loss(M) = Σ | ln( M(subgroup) / M(privileged) ) |, summed over the unprivileged subgroups

For example, if we look at the statistical parities of our groups (STP) we have:

STP(African-American) = 0.508

STP(Hispanic) = 0.261

STP(Other) = 0.381

STP(Caucasian) = 0.285

A quick code snippet reveals that our parity loss for STP for our bias-unaware model should be 0.956

 

 import numpy as np
 
 # STP metrics for unprivileged groups
 unpriv_stp = [0.508, 0.261, 0.381]
 
 # STP metric for the privileged group
 caucasian_stp = 0.285
 
 # 0.956 appears as light orange in the following figure
 sum([abs(np.log(u / caucasian_stp)) for u in unpriv_stp])
  

We see that the parity loss of STP is 0.956. Luckily, dalex gives us an easier way to calculate parity loss for all five metrics and stack them together in a chart. The following figure is the one we will use to compare across our models; the five stacked segments represent the values for each of our five bias metrics, stacked together to represent the overall bias of the model. We want the overall stacked length to decrease as we become more bias-aware. We will pair this stacked parity-loss graph with classic ML metrics like accuracy, precision, and recall.

 
 # Plot of parity loss of each metric
 mf_tree.plot(type = 'stacked')
  

Figure 8. Cumulative parity loss. In this case, smaller is better, meaning less bias. For example, the light orange section on the right-hand side represents the 0.956 we previously calculated by hand. Overall, our bias-unaware model scores around 3.5, which is our number to beat on the bias front.


We now have both a baseline for model performance (from our model performance summary) and a baseline for fairness given by our stacked parity loss chart.

Let’s move on now to how we can actively use feature engineering to mitigate bias in our data.

Mitigating Bias

When it comes to mitigating bias and promoting fairness in our models, we have three main opportunities to do so:

  1. Pre-processing – bias mitigation as applied to the training data, i.e. before the model has had a chance to train on the training data
  2. In-processing – bias mitigation applied to a model during the training phase
  3. Post-processing – bias mitigation applied to the predicted labels after the model has been fit to the training data

Each phase of bias mitigation has pros and cons, and pre-processing is the phase that corresponds directly to feature engineering techniques.

Pre-processing

Pre-processing bias mitigation takes place in the training data before modeling takes place. Pre-processing is useful when we don’t have access to the model itself or the downstream predictions but we do have access to the initial training data.

Two examples of pre-processing bias mitigation techniques that we will implement in this chapter are:

  1. Disparate Impact Removal – editing feature values to improve group fairness (a rough conceptual sketch follows this list)
  2. Learning Fair Representations – extracting a new feature set by obfuscating the original information regarding protected attributes
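As a rough conceptual illustration of the first idea, the sketch below applies a quantile-style repair to a single numeric column: each value is mapped to its within-group quantile and then to the value the overall distribution takes at that quantile. This is only a hand-rolled sketch of the general approach, not the implementation we will use later in the series, and the repair_level parameter and column choice are illustrative assumptions.

 
 import numpy as np
 import pandas as pd
 
 def repair_feature(values, groups, repair_level=1.0):
     """Conceptual sketch of disparate impact removal (quantile repair):
     map each value to its within-group quantile, then to the overall
     distribution's value at that quantile, blending by repair_level."""
     values = pd.Series(values).astype(float)
     overall = np.sort(values.values)
     group_labels = pd.Series(groups).values
     repaired = values.copy()
     for g in np.unique(group_labels):
         mask = (group_labels == g)
         group_vals = values[mask]
         quantiles = group_vals.rank(pct=True)          # within-group quantile
         target = np.quantile(overall, quantiles)       # overall value at that quantile
         repaired[mask] = (1 - repair_level) * group_vals + repair_level * target
     return repaired
 
 # e.g. X_train['priors_count'] = repair_feature(X_train['priors_count'], race_train)
 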

By implementing these two techniques, we hope to reduce the overall bias that our model is exhibiting while also trying to enhance our ML pipeline’s performance in the process.

In-processing

In-processing techniques are applied at training time. They usually come in the form of a regularization term or an alternative objective function. In-processing techniques are only possible when we have access to the actual learning algorithm; otherwise, we have to rely on pre- or post-processing.

Some examples of in-processing bias mitigation techniques include:

  1. Meta Fair Classifier – uses fairness as an input to optimize a classifier for fairness
  2. Prejudice Remover – Implementing a privilege-aware regularization term to our learning objective

Post-processing

Post-processing techniques, as the name implies, are applied after training time and are most useful when we need to treat the ML model as a black box and we don’t have access to the original training data.

Some examples of post-processing bias mitigation techniques include:

  1. Equalized Odds – modifying predicted labels using a separate optimization objective to make the predictions fairer.
  2. Calibrated Equalized Odds – modifying the classifier’s scores to make for fairer results.

That’s all for now. In part 3, we will see how to build a bias-aware model.

If you want to learn more about the book, check it out on Manning’s liveBook platform here.