From Feature Engineering Bookcamp by Sinan Ozdemir
This article series covers
● Recognizing and mitigating bias in our data and model
● Quantifying fairness through various metrics
● Applying feature engineering techniques to remove bias from our model without sacrificing model performance
Generally, the stated goal of a machine learning problem is to build a feature engineering pipeline that maximizes a model's performance on a dataset. Our goal in this article series, however, will be not only to monitor and measure model performance but also to keep track of how our model treats different groups of data, because sometimes data are people.
In our case study today, data are people whose lives are on the line. Data are people who simply want to have the best life they can possibly have. As we navigate waters around bias and discrimination, around systemic privilege and racial discrepancies, we urge you to keep in mind that when we talk about rows we are talking about people, and when we are talking about features, we are talking about aggregating years if not decades of life experiences into a single number, class, or boolean. We must be respectful of our data and of the people our data represent.
Let’s get started.
The COMPAS Dataset
The dataset for this case study is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) dataset which is a collection of criminal offenders screened in Broward County, Florida in the years 2013–2014. In particular, we are looking at a subset of this data that corresponds to a binary classification problem of predicting recidivism (whether or not a person will re-offend) given certain characteristics about an individual. A link to the dataset can be found here: https://www.kaggle.com/danofer/compass
On its face, the problem is pretty simple. Binary classification, no missing data, let’s go! The problem arises when our ML models have very real downstream effects on people’s lives and well-being. As ML engineers and data scientists, much of this burden is on us to create models that are not only performing well but are also generating predictions that can be considered “fair”.
As we go through this chapter, we will define and quantify "fair" in many ways, and ultimately a decision must be rendered on which fairness criterion is best for a particular problem domain. It will be our goal in this chapter to introduce various definitions of fairness and give examples throughout of how each one is meant to be interpreted.
Let’s jump in and start by ingesting our data and taking a look around.
Listing 1. Ingesting the Data
import pandas as pd  #A
import numpy as np  #A

compas_df = pd.read_csv('../data/compas-scores-two-years.csv')  #B
compas_df.head()  #B
#A Import packages
#B Show the first five rows
Figure 1. The first five rows of our COMPAS dataset showing some sensitive information about people who have been incarcerated in Broward County, Florida. Our response label here is “two_year_recid” which represents an answer to the binary question “did this person return to incarceration within 2 years of being released?”
In the original ProPublica study from 2016 that looked into the fairness of the COMPAS algorithm, software, and underlying data, they focused on the decile score given to each person. A decile score is a score from 1–10 that scales data into buckets of 10%. If this word looks somewhat familiar, it's because it is closely related to the idea of a percentile. The idea is that a person is given a score between 1 and 10, where each score represents a bucket holding 10% of the population, so that a certain percentage of people rank higher and a certain percentage rank lower in a metric. For example, if we give someone a decile score of 3, that means that 70% of people should have a higher risk of recidivism (people with scores of 4, 5, 6, 7, 8, 9, and 10) and 20% of people have a lower risk of recidivism (people with scores of 1 and 2). Likewise, a score of 7 means that 30% of people have a higher risk of recidivism (people with scores of 8, 9, and 10) while 60% of people have a lower risk of recidivism (people with scores of 1, 2, 3, 4, 5, and 6).
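To make the decile idea concrete, here is a small sketch (on made-up scores, not the COMPAS data) using pandas' `pd.qcut`, which cuts a population into equal-sized buckets:

```python
import numpy as np
import pandas as pd

# Made-up risk scores for 1,000 hypothetical people (not the COMPAS data)
rng = np.random.default_rng(0)
scores = pd.Series(rng.normal(size=1000))

# pd.qcut splits the population into 10 equal-sized buckets,
# labeled 1 (lowest) through 10 (highest)
deciles = pd.qcut(scores, q=10, labels=range(1, 11)).astype(int)

# Each decile holds exactly 10% of the population
print(deciles.value_counts(normalize=True).sort_index())

# A person with a decile score of 3 outranks the 20% of people in deciles 1-2
below = (deciles < 3).mean()
print(f"share below decile 3: {below:.0%}")  # 20%
```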
The study went further to show the disparities between how decile scores are used and how they don’t always look fair. For example, if we look at how scores are distributed, we can see that scores are given out differently by race. The following snippet will plot a histogram of decile scores by race and highlights a few things:
- African American decile scores are spread out relatively evenly with about 10% of the population residing in each decile score. By definition of a decile score, this on its face is appropriate. 10% of the population should, in theory, live in each decile score
- The Asian, Caucasian, Hispanic, and Other categories seem to have a right skew on decile scores, with a larger-than-expected portion of each category having a decile score of 1 or 2
compas_df.groupby('race')['decile_score'].value_counts(
    normalize=True
).unstack().plot(
    kind='bar',
    figsize=(20, 7),
    title='Decile Score Histogram by Race',
    ylabel='% with Decile Score'
)
Figure 2. We can see clear differences in how decile scores are distributed when broken down by race.
We can see this more clearly by inspecting some basic statistics around decile scores by race.
Figure 3. Taking a look at the means and medians of decile scores by race, we can see for example that the median decile score for African-Americans is 5 (which is expected) but for Caucasians and Hispanics it is 3.
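These group-level statistics come from a simple groupby aggregation; here is a sketch on a toy frame (with the real data, you would group `compas_df` the same way):

```python
import pandas as pd

# Toy stand-in for compas_df, invented purely for illustration
df = pd.DataFrame({
    'race': ['African-American', 'African-American',
             'Caucasian', 'Caucasian', 'Hispanic'],
    'decile_score': [5, 7, 3, 4, 2],
})

# Mean and median decile score per racial group
stats = df.groupby('race')['decile_score'].agg(['mean', 'median'])
print(stats)
```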
We could go on looking at how the ProPublica study interpreted this data, but rather than attempting to re-create these results our approach to this dataset will be focused on building a binary classifier with the data, ignoring the decile score already given to people.
The Problem Statement / Defining Success
As mentioned in the previous section, the ML problem here is one of binary classification. The goal of our model can be summarized by the question:
“Given certain aspects about a person, can we predict recidivism both accurately and fairly?”
The term accurately should be easy enough. We have plenty of metrics to measure model performance including accuracy, precision, and AUC. When it comes to the term “fairly” however, we will need to learn a few new terms and metrics. Before we get into how to quantify bias and fairness, let’s first do some EDA knowing the problem at hand.
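Those performance metrics are all available in scikit-learn; here is a quick sketch on toy labels and predicted probabilities (all values invented for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, roc_auc_score

# Toy ground truth and model scores, invented for illustration
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.6, 0.35, 0.8]

# Threshold probabilities at 0.5 to get hard class predictions
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print(accuracy_score(y_true, y_pred))   # 0.5
print(precision_score(y_true, y_pred))  # 0.5
print(roc_auc_score(y_true, y_prob))    # 0.75
```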
Exploratory Data Analysis
Our goal is to directly model our response label
two_year_recid based on features about people in this dataset. Specifically, we have the following features:
- sex – qualitative, binary “Male” or “Female”
- age – quantitative ratio, in years
- race – qualitative nominal
- juv_fel_count – quantitative, the number of prior juvenile felonies this person has
- juv_misd_count – quantitative, the number of prior juvenile misdemeanors this person has
- juv_other_count – quantitative, the number of juvenile convictions that are neither a felony nor a misdemeanor
- priors_count – quantitative, the number of prior crimes committed
- c_charge_degree – qualitative, binary, ‘F’ for felony and ‘M’ for misdemeanor
And our response label:
- two_year_recid – qualitative, binary, did this person recidivate (commit another crime) within 2 years, yes or no
Note that we have three separate columns counting juvenile offenses. For our models, we may want to combine these into a single column that simply counts the total number of juvenile offenses a person has had.
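Combining the three counters is a one-line row-wise sum; here is a sketch on toy rows (with the real data, the same line would apply to `compas_df`; the `juv_count` column name is our own choice):

```python
import pandas as pd

# Toy rows standing in for the three juvenile-offense columns
df = pd.DataFrame({
    'juv_fel_count':   [0, 1, 2],
    'juv_misd_count':  [1, 0, 0],
    'juv_other_count': [0, 0, 3],
})

# Collapse the three juvenile counters into one total-count feature
juv_cols = ['juv_fel_count', 'juv_misd_count', 'juv_other_count']
df['juv_count'] = df[juv_cols].sum(axis=1)
print(df['juv_count'].tolist())  # [1, 1, 5]
```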
Given our problem statement of creating an accurate and fair model, let’s look at the breakdown of recidivism by race. When we group our dataset by race and look at the rate of recidivism it becomes clear that there are differences in the “base rates” of recidivism. Without breaking down further (by age, criminal history, etc.) there are pretty big differences in recidivism rates between different race categories.
Figure 4. Descriptive statistics of recidivism by race. We can see clear difference in recidivism rates between our different racial groups
We should also note that we have two race categories (Asian and Native American) with extremely small representation in our data. This is an example of sample bias, where the population may not be represented appropriately. This data is taken from Broward County, Florida, where, according to the US census, those identifying as Asian, for example, make up about 4% of the population; whereas in the dataset, they make up about 0.44% of the data.
For our purposes in this book, we will re-label data points whose race is "Asian" or "Native American" as "Other" to avoid any misconceptions in our metrics stemming from two categories of race being so under-represented. Our main reason for this re-labeling is to make the resulting classes more balanced. In our last figure, it's clear that the "Asian" and "Native American" classes are vastly under-represented, and it would therefore be inappropriate to use this dataset to make meaningful predictions about them.
Once we re-label these data points, let’s then plot the actual 2-year recidivism rates for our now four considered race categories.
Listing 2. Re-labeling Under-represented Races
# Re-label two races as Other. This is done purely for educational reasons
# and to avoid addressing issues with a skewed sample in our data
compas_df.loc[
    compas_df['race'].isin(['Native American', 'Asian']), 'race'
] = 'Other'  #A

compas_df.groupby('race')['two_year_recid'].value_counts(
    normalize=True
).unstack().plot(
    kind='bar',
    figsize=(10, 5),
    title='Actual Recidivism Rates by Race'
)  #B
#A Re-label rows with Asian / Native American races as Other
#B Plot Recidivism Rates for the four races we are considering
Figure 5. Bar chart showing recidivism rates by group
Again, we can see that our data shows African-American people recidivating at a higher rate than Caucasian, Hispanic, or Other. This is the result of many systemic factors that we cannot even begin to touch on in this book. For now, let's note that even though recidivism rates differ between groups, the near 50/50 split for African-Americans and the roughly 60/40 split for Caucasians are not radically different rates.
Let’s continue on by looking at our other features a bit more. We have a binary charge degree feature that we will have to encode as a boolean but otherwise looks usable in its current form:
compas_df['c_charge_degree'].value_counts(normalize=True).plot(
    kind='bar',
    title='% of Charge Degree',
    ylabel='%',
    xlabel='Charge Degree'
)
Figure 6. Distribution of felonies and misdemeanors in our dataset by degree. We have about 65% of our charges as F for felonies and the rest are M for misdemeanors.
Let’s wrap up our EDA by looking at a histogram of our remaining quantitative features: age and priors_count. Both of these variables are showing a pretty clear right skew and would benefit from some standardization to reel in those outliers a bit.
Listing 3. Plotting histograms of our quantitative variables
# Right skew on Age
compas_df['age'].plot(
    title='Histogram of Age',
    kind='hist',
    xlabel='Age',
    figsize=(10, 5)
)

# Right skew on Priors as well
compas_df['priors_count'].plot(
    title='Histogram of Priors Count',
    kind='hist',
    xlabel='Priors',
    figsize=(10, 5)
)
Figure 7. Age and Priors Count are showing a right skew in the data. It’s showing that most of the people in our dataset are on the young side but we do have some outliers pulling the average to the right. This will come up again when we investigate model fairness
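One common way to reel in a right skew like this is a log transform; here is a sketch on made-up counts using `np.log1p` (which maps 0 to 0, so zero-prior rows stay at zero):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed prior counts, invented for illustration
priors = pd.Series([0, 0, 1, 1, 2, 3, 5, 8, 20, 38])

# log1p compresses the long right tail while leaving zeros untouched
logged = np.log1p(priors)

# The transform reduces the skewness of the distribution
print(priors.skew(), logged.skew())
```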
With our EDA giving us some initial insight, let’s move on to discussing and measuring the bias and fairness of our models.
Measuring Bias & Fairness
When tasked with making model predictions fair and as unbiased as possible we need to look at a few different ways of formulating and quantifying fairness so that we can quantify how well our ML models are doing.
Disparate Treatment vs Disparate Impact
In general, a model—or really any predictive/decision-making process—can suffer from two forms of bias: Disparate Treatment and Disparate Impact. A model is considered to suffer from Disparate Treatment if predictions are in some way based on a sensitive attribute (such as sex or race). A model could also have a Disparate Impact if the predictions/downstream outcomes of the predictions disproportionately hurt or benefit people with specific sensitive features which can look like predicting higher rates of recidivism for one race over another.
Definitions of Fairness
There are dozens of ways to define fairness in a model, but let's focus on three for now; when we build our baseline model, we will see these again along with several more.
Unawareness is likely the easiest definition of fairness. It states that a model should not include sensitive attributes as a feature in the training data. This way our model will not have access to the sensitive values when training. This definition aligns well with the idea of disparate treatment in that we are literally not allowing the model to see the sensitive values of our data.
The surface-level pro of using unawareness as a definition is that it is very easy to explain to someone that we simply did not use a feature in our model therefore how could it have obtained any bias? The counter-argument to this statement and the major flaw of relying on unawareness to define fairness is that more often than not, the model will be able to re-construct sensitive values by relying on other features that are highly correlated to the original sensitive feature we were trying to be unaware of.
For example, if a recruiter is deciding whether or not to hire a candidate and we wish for them to be sensitive to the sex of the candidate, we could simply blind the recruiter to the candidate’s sex; however, if the recruiter also notices that the candidate listed “fraternities” as a prior volunteer/leadership experience, the recruiter may reasonably put together that the candidate is likely a Male.
Statistical Parity, also known as Demographic Parity or Disparate Impact, is a very common definition for fairness. Simply put, it states that our model’s prediction of being in a certain class (will they recidivate or not) is independent of the sensitive feature. Put as a formula:
P(recidivism | race=African-American) = P(recidivism | race=Caucasian) = P(recidivism | race=Hispanic) = P(recidivism | race=Other)
Put yet another way, to achieve good statistical parity our model should predict equal rates of recidivism for every racial category. The above formula is pretty strict; to relax it, we can lean on the four-fifths rule, which states that the ratio of the selection rate (the rate at which we predict recidivism) for a disadvantaged group to that of the advantaged group may fall within the range (0.8, 1/0.8) and still be considered fair. As a formula, this looks like the following.
0.8 < P(recidivism | race=disadvantaged) / P(recidivism | race=advantaged) < 1/0.8 (= 1.25)
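As a sketch, the four-fifths check can be applied to a dictionary of per-group predicted recidivism rates (all numbers here are invented for illustration, and `passes_four_fifths` is a hypothetical helper, not a library function):

```python
# Hypothetical predicted recidivism rates per group (invented numbers)
rates = {
    'African-American': 0.45,
    'Caucasian': 0.40,
    'Hispanic': 0.42,
    'Other': 0.38,
}

def passes_four_fifths(group_rates, low=0.8):
    """Worst-case selection-rate ratio must fall inside (0.8, 1/0.8)."""
    values = list(group_rates.values())
    ratio = min(values) / max(values)
    return low < ratio < 1 / low

print(passes_four_fifths(rates))  # True, since 0.38 / 0.45 is about 0.84
```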
The pro of using Statistical Parity as a definition of fairness is that it is a relatively easy metric to explain. There is also evidence that using Statistical Parity as a definition can lead to both short-term and long-term benefits for disadvantaged groups (Hu and Chen, WWW 2018).
One caveat of relying on Statistical Parity is that it ignores any possible relationship between our label and our sensitive attribute. In our case, this is actually a good thing, because we want to ignore any correlation between our response (will this person recidivate) and our sensitive attribute in question (race), as that correlation is driven by much bigger factors than our case study can deal with. For any use case our readers may consider in the future, this may not be desired, so please take this into consideration!
Another caveat of relying solely on Statistical Parity is that our ML model could in theory just be "lazy" and select random people from each group, and we would still technically achieve statistical parity. Obviously, our ML metrics should stop our models from doing this, but it's always something to look out for.
Also known as Positive Rate Parity, the Equalized Odds definition of fairness states that our model's prediction of our response should be independent of our sensitive feature, conditional on our response value. In the context of our example, equalized odds would mean the following two conditions are met:
P(recidivism | race=Hispanic, actually recidivated=True) = P(recidivism | race=Caucasian, actually recidivated=True) = P(recidivism | race=African-American, actually recidivated=True) = P(recidivism | race=Other, actually recidivated=True)
P(recidivism | race=Hispanic, actually recidivated=False) = P(recidivism | race=Caucasian, actually recidivated=False) = P(recidivism | race=African-American, actually recidivated=False) = P(recidivism | race=Other, actually recidivated=False)
Another way to view this would be to say our model has equalized odds if:
- Independent of race, our model predicted recidivism rates equally for people who did actually recidivate
- Independent of race, our model predicted recidivism rates equally for people who did not actually recidivate.
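The two conditions above can be checked directly by computing the model's positive-prediction rate within each (race, actual outcome) cell; here is a sketch on made-up labels and predictions (not the COMPAS data):

```python
import pandas as pd

# Made-up groups, true outcomes, and model predictions for illustration
df = pd.DataFrame({
    'race':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'actual': [1,   1,   0,   0,   1,   1,   0,   0],
    'pred':   [1,   0,   0,   0,   1,   0,   1,   0],
})

# P(pred=1 | race, actual): one row per race, one column per actual outcome
rates = df.groupby(['race', 'actual'])['pred'].mean().unstack()
print(rates)

# Equalized odds holds when each column is constant across groups
tpr_gap = rates[1].max() - rates[1].min()  # gap in true-positive rates
fpr_gap = rates[0].max() - rates[0].min()  # gap in false-positive rates
print(tpr_gap, fpr_gap)  # 0.0 0.5 -> TPRs match, FPRs do not
```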
The pro of using equalized odds as our definition is that it penalizes the same "laziness" we talked about with Statistical Parity. It encourages the model to become more accurate in all groups rather than allowing the model to simply randomly predict recidivism to achieve similar rates of prediction between groups.
The biggest flaw is that equalized odds is sensitive to different underlying base rates of the response. In our data, we saw that African-Americans recidivated at a higher rate than the other three racial categories. If this were a scenario where we believed there were some natural differences between racial groups and recidivism rates, then equalized odds would not be a good metric for us. In our case, this will not be an issue, as we reject the idea that these base rates relating to race and recidivism reflect natural recidivism rates.
Learn more about building a baseline model in part 2.
If you want to learn more about the book, check it out on Manning’s liveBook platform here.