By Henrik Brink, Joseph W. Richards, and Mark Fetherolf

In this article, excerpted from Real-World Machine Learning, we will look at a few of the most common data pre-processing steps needed for real-world machine learning.



Collecting data is the first step towards preparing it for modeling, but depending on the composition of the dataset it is sometimes necessary to run the data through a few pre-processing steps. Many machine-learning algorithms work only on numerical data: integers and real-valued numbers. The simplest ML datasets come in this format, but many include other types of features, such as categorical features, and some have missing values or need other kinds of processing before they are ready for modeling.

Categorical features

The most common type of non-numerical feature is the categorical feature. A feature is categorical if its values can be placed in buckets and the order of the values is not important. Sometimes this type of feature is easy to identify, for example when it only takes on a few string values such as “spam” and “ham”. In other cases it is not obvious whether a feature is numerical (integer) or categorical. Both may even be valid representations, and the choice can affect the performance of the model.

An example of this could be a feature representing the day of the week. If there is information in the general time of the week (the order of days is important), this feature would be numerical. If the days themselves are more important (for example, Monday and Thursday may be special for some reason), it would be a categorical feature. We are not going to look at model building and performance in this article, but we will introduce a technique for dealing with categorical features. Figure 1 points out categorical features in a few datasets.


Figure 1: Identifying categorical features. At the top is our simple person dataset. At the bottom is a dataset with information about Titanic passengers. The features identified as categorical here are: Survived (whether the passenger survived or not), Pclass (what class the passenger was traveling in), Sex (male or female), and Embarked (from which city the passenger embarked).

Some machine learning algorithms deal with categorical features natively, but generally they need data in numerical form. We can encode categorical features as numbers, one number per category, but we cannot use this encoded data as a true categorical feature because we have then introduced an (arbitrary) order of categories. Recall that one of the properties of categorical features is that they are not ordered. What we can do instead is convert each category in the categorical feature into its own feature, which is 1 wherever that category appears and 0 wherever it does not. In essence, we convert the categorical feature into binary features that can be used by ML algorithms that only support numerical features. Figure 2 illustrates this concept further.


Figure 2: Converting categorical columns to numerical columns.

The pseudo-code for converting categorical features to binary features, as shown in figure 2, would look something like this:

Listing 1 Convert categorical features to numerical binary features

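import numpy as np

def cat_to_num(data):  # data: 1-D NumPy array of category labels
    categories = np.unique(data)
    features = []
    for cat in categories:
        binary = (data == cat)  # True where the value equals this category
        features.append(binary.astype("int"))
    return features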
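As a usage sketch, the same idea can be applied to a small column of marital statuses. The column values and the helper name one_hot are invented here for illustration and are not taken from the datasets in the figures. This variant assumes NumPy and returns the categories alongside a single 0/1 matrix, so that each column can be traced back to its category.

```python
import numpy as np

def one_hot(values):
    """Return the sorted categories and a 0/1 matrix with one column per category."""
    values = np.asarray(values)
    categories = np.unique(values)  # sorted distinct values
    # Broadcasting compares every row against every category at once
    matrix = (values[:, None] == categories[None, :]).astype(int)
    return categories, matrix

# Hypothetical column, invented for illustration
status = ["single", "married", "single", "divorced"]
categories, matrix = one_hot(status)
print(list(categories))
print(matrix)
```

Each row of the matrix contains exactly one 1, in the column that belongs to that row's category.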

While the categorical-to-numerical conversion technique works for most ML algorithms, there are a few algorithms, such as certain types of decision tree algorithms and related algorithms such as Random Forests, that can deal with categorical features natively. Using such algorithms will often yield better results for highly categorical datasets. Our simple person dataset, after conversion of the categorical feature to binary features, is shown in Figure 3.


Figure 3: The simple person dataset after conversion of categorical features to binary numerical features. Original dataset shown in figure 1.

Dealing with missing data

We’ve already seen a few examples of datasets with missing data. In tabular datasets they often appear as empty cells, NaN, None or similar. Missing data are usually artifacts of the data collection process; for some reason a particular value could not be measured for a data instance. Figure 4 shows an example of missing data in the Titanic passenger dataset.


Figure 4: An example of missing values in the Titanic passenger dataset in the Age and Cabin columns. The passenger information has been extracted from various historical sources, so in this case the missing values stem from information that simply couldn’t be found in the sources.

There are two main types of missing data, which we need to handle in different ways. First, data can be missing in a way where the fact that it is missing carries information in itself and could be useful for the ML algorithm. The other possibility is that the data is missing simply because the measurement was impossible, and the reason for its unavailability carries no information. In the Titanic passenger dataset, for example, missing values in the Cabin column may tell us that those passengers were in a different social or economic class, while missing values in the Age column carry no information; the age of a particular passenger at the time simply couldn’t be found.

Let us first consider the case of meaningful missing data. When we believe that there is information in the data being missing, we usually want the ML algorithm to be able to use this information to potentially improve the prediction accuracy. To achieve this we want to convert the missing values into the same format as the column in general. For numerical columns this can be done by setting missing values to -1 or -999, depending on typical values of non-missing values. Simply pick a number at one end of the numerical spectrum to denote missing values, and remember that order is important for numerical columns: you don’t want to pick a value in the middle of the distribution of values.
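A minimal sketch of this sentinel approach, assuming NumPy and a column in which missing entries are stored as NaN. The function name fill_with_sentinel, its range check, and the sample values are assumptions made for illustration.

```python
import numpy as np

def fill_with_sentinel(column, sentinel=-1.0):
    """Replace NaN entries with a sentinel placed outside the observed range."""
    column = np.asarray(column, dtype=float)
    observed = column[~np.isnan(column)]
    # Guard: the sentinel must sit at one end of the spectrum, not inside it
    if observed.size and observed.min() <= sentinel <= observed.max():
        raise ValueError("sentinel lies inside the range of observed values")
    return np.where(np.isnan(column), sentinel, column)

# Hypothetical numerical column with two missing entries
values = [7.25, np.nan, 71.28, 8.05, np.nan]
filled = fill_with_sentinel(values)
print(filled)
```

The range check is one possible safeguard against accidentally choosing a sentinel that collides with real values; whether -1 or -999 is appropriate still depends on the column.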

For a categorical column with meaningful missing data, we can simply create a new category called “missing”, “None” or similar and then handle the categorical feature in the usual way, for example using the technique described in the previous section. Figure 5 shows a simple diagram of what to do with meaningful missing data.


Figure 5: Diagram of what to do with meaningful missing data.
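For the categorical case, a sketch along the following lines may help. It assumes NumPy, assumes that missing entries are stored as None, and uses an invented column of deck letters; the helper name fill_missing_category is hypothetical. After the missing entries are given their own label, the column is converted to binary features in the usual way.

```python
import numpy as np

def fill_missing_category(values, label="missing"):
    """Replace None entries with an explicit category label."""
    return np.array([label if v is None else v for v in values])

# Hypothetical categorical column, invented for illustration
decks = ["C", None, "E", None, "C"]
filled = fill_missing_category(decks)

# Convert to binary features: one column per category, including "missing"
categories = np.unique(filled)
binary = (filled[:, None] == categories[None, :]).astype(int)
print(list(categories))
print(binary)
```

The "missing" label becomes just another column, so the algorithm can learn from the absence of a value.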

Missing data where the absence of a value carries no information needs to be handled differently. In this case, we cannot simply introduce a special number or category because we might introduce data that are flat-out wrong. For example, if we were to change any missing values in the Age column of the Titanic passenger dataset to -1, we would probably hurt the model by messing with the age distribution for no good reason. Some ML algorithms will be able to deal with these truly missing values by simply ignoring them. If not, we need to pre-process the data and replace missing values by guessing the true value. This concept of replacing missing data is called imputation.

There are many ways to impute missing data, but unfortunately there is no one-size-fits-all solution. The easiest, and least desirable, way is to simply remove all instances that have missing values. This not only decreases the predictive power of the model, but also introduces bias if the missing data are not randomly distributed. Another simple way is to assume some temporal order to the data instances and replace each missing value with the column value of the preceding row. With no other information, we are guessing that a measurement hasn’t changed from one instance to the next. Needless to say, this assumption will often be wrong. For extremely big data, however, we will not always be able to apply more sophisticated methods, and these simple methods can be useful.
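Both simple strategies can be sketched in a few lines, assuming NumPy and NaN as the marker for missing values. The function names and sample data below are hypothetical.

```python
import numpy as np

def drop_incomplete_rows(table):
    """Remove every row that contains at least one NaN."""
    table = np.asarray(table, dtype=float)
    keep = ~np.isnan(table).any(axis=1)
    return table[keep]

def forward_fill(column):
    """Replace each NaN with the value of the preceding row."""
    column = np.asarray(column, dtype=float).copy()
    for i in range(1, len(column)):
        if np.isnan(column[i]):
            column[i] = column[i - 1]
    return column  # a NaN in the first row has no predecessor and stays NaN

# Hypothetical data, invented for illustration
table = [[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]]
complete = drop_incomplete_rows(table)
series = forward_fill([1.0, np.nan, np.nan, 4.0])
print(complete)
print(series)
```

Note that forward filling only makes sense when the row order is meaningful, and that a missing value in the very first row cannot be filled this way.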

It is usually better to use a larger portion of the existing data to guess the missing values. To avoid biasing the model, we may replace missing column values by the mean value of the column. With no other information, we’ll make a guess that on average will be closest to the truth. The mean is sensitive to outliers, so depending on the distribution of column values we may want to use the median instead. Both approaches are widely used in machine learning today and work well in many cases. But when we set all missing values to a single new value, we lose any potential correlation with other variables, which may be important for the algorithm to detect appropriate patterns in the data.
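As an illustration, mean and median imputation might look like the following, assuming NumPy and NaN-coded missing values. The function name impute_constant and the ages are invented; the ages include one large value to show how the mean and median differ.

```python
import numpy as np

def impute_constant(column, strategy="mean"):
    """Fill NaN entries with the mean or median of the observed values."""
    column = np.asarray(column, dtype=float)
    if strategy == "mean":
        fill = np.nanmean(column)    # mean that ignores NaN entries
    elif strategy == "median":
        fill = np.nanmedian(column)  # median that ignores NaN entries
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return np.where(np.isnan(column), fill, column)

# Hypothetical ages with one missing entry and one outlier
ages = [22.0, 38.0, np.nan, 35.0, 80.0]
by_mean = impute_constant(ages, "mean")
by_median = impute_constant(ages, "median")
print(by_mean)
print(by_median)
```

Here the mean-based fill is 43.75 while the median-based fill is 36.5, because the single large age pulls the mean upward.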

What we really want to do is predict the value of the missing variable based on all of the data and variables available. Does this sound familiar? This is exactly what machine learning is about, so we are basically thinking about building ML models in order to be able to build ML models. In practice, you will often use a simple algorithm, such as linear or logistic regression, to impute the missing data; it is not necessarily the same as the main ML algorithm used. In any case you are creating a pipeline of ML algorithms that introduces more knobs to turn in order to optimize the model in the end.
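One way to sketch model-based imputation is with an ordinary least-squares linear fit, shown here with NumPy's lstsq. This is only an illustration under simplifying assumptions: the predictor columns themselves have no missing values, the relationship is roughly linear, and the function name and data are hypothetical.

```python
import numpy as np

def impute_by_regression(X, y):
    """Fill NaNs in y with predictions from a least-squares linear fit on X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    known = ~np.isnan(y)
    # Fit only on the rows where the target value is observed
    coef, *_ = np.linalg.lstsq(A[known], y[known], rcond=None)
    filled = y.copy()
    filled[~known] = A[~known] @ coef  # predict the missing entries
    return filled

# Hypothetical data in which y follows 2 * x + 1, with one value missing
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, np.nan, 9.0]
filled = impute_by_regression(x, y)
print(filled)
```

The imputation model adds its own choices (which predictors to use, which algorithm to fit), which is the extra set of knobs mentioned above.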

Again, it is important to realize that there is no best way to deal with truly missing data. We’ve discussed a few different ways in this article and figure 6 attempts to summarize the different possibilities.


Figure 6: Full decision diagram for handling missing values when preparing data for ML modeling.