By Henrik Brink, Joseph W. Richards, and Mark Fetherolf


In this article, excerpted from Real-World Machine Learning, we will look at a few of the most common data pre-processing steps needed for real-world machine learning.


Collecting data is the first step towards preparing it for modeling, but it is sometimes necessary to run the data through a few pre-processing steps, depending on the composition of the dataset. Many machine-learning algorithms work only on numerical data: integers and real-valued numbers. The simplest ML datasets come in this format, but many include other types of features, such as categorical features, and some might include missing values or need other kinds of processing before they are ready for modeling.


Categorical features


The most common type of non-numerical feature is the categorical feature. A feature is categorical if values can be placed in buckets and the order of values is not important. In some cases this type of feature is easy to identify, for example when it only takes on a few string values such as “spam” and “ham.”

In other cases it is not obvious whether a feature is a numerical (integer) feature or categorical. In some cases both may be valid representations, and the choice can affect the performance of the model. An example of this could be a feature representing the day of the week: if there is information in the general time of the week (order of days is important), this feature would be numerical. If the days themselves are more important – e.g., Monday and Thursday may be special for some reason – this would be a categorical feature.
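As a rough illustration of the two representations, the sketch below encodes the same day-of-week value both ways. The names `day_as_number` and `day_as_categories`, and the Monday-first ordering, are assumptions made for this example only. The categorical view uses one 0/1 indicator per day so that no order is implied.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def day_as_number(day):
    """Numerical view: position in the week, so Mon < Tue < ... < Sun."""
    return DAYS.index(day)

def day_as_categories(day):
    """Categorical view: one 0/1 indicator per day, with no implied order."""
    return [1 if day == d else 0 for d in DAYS]

print(day_as_number("Thu"))      # 3
print(day_as_categories("Thu"))  # [0, 0, 0, 1, 0, 0, 0]
```

Which of the two helps the model more depends on the data, so it can be worth trying both.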


Figure 1: Identifying categorical features. At the top is our simple person dataset. At the bottom is a dataset with information about Titanic passengers. The features identified as categorical here are: Survived (whether the passenger survived or not), Pclass (the class the passenger was traveling in), Sex (male or female), and Embarked (the city from which the passenger embarked).

Some machine learning algorithms deal with categorical features natively, but generally they need data in numerical form. We can encode categorical features as numbers, one number per category, but we cannot use this encoded data as a true categorical feature because the encoding introduces an (arbitrary) order of categories. Recall that one of the properties of categorical features is that they are not ordered. What we can do instead is convert each category of the categorical feature into its own feature, with a 1 wherever the category appears and a 0 wherever it does not. In essence, we convert the categorical feature into binary features that can be used by ML algorithms that only support numerical features. Figure 2 illustrates this concept further.



Figure 2: Converting categorical columns to numerical columns.

The pseudo-code for converting categorical features to binary features would look something like this:

Listing 1 Convert categorical features to numerical binary features

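from numpy import asarray, unique

def cat_to_num(data):
    data = asarray(data)  # allow plain lists; enables element-wise comparison
    categories = unique(data)
    features = []
    for cat in categories:
        # 1 where the row belongs to this category, 0 elsewhere
        binary = (data == cat)
        features.append(binary.astype("int"))
    return features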

NOTE: CODE EXAMPLES Readers familiar with the Python programming language may have noticed that the above example is not just pseudo-code, but also valid Python. In this article, we will introduce a code snippet as pseudo-code, but unless otherwise noted it will be actual working code. In order to make the code simpler, we are implicitly importing a few helper libraries, such as numpy and scipy. A lot of functionality from those libraries can be imported easily with the pylab package. If nothing else is noted, code examples should work by importing everything from pylab: from pylab import *
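To make the result of such a conversion easier to inspect, the following minimal sketch uses only plain Python and returns one named 0/1 column per category. The function name `one_hot_columns` and the sample labels are illustrative assumptions; this is a variant of the idea in Listing 1, not a replacement for it.

```python
def one_hot_columns(values):
    """Return {category: [0/1, ...]} with one binary column per distinct value."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

labels = ["spam", "ham", "spam", "ham", "ham"]
columns = one_hot_columns(labels)
for name, column in columns.items():
    print(name, column)
# ham [0, 1, 0, 1, 1]
# spam [1, 0, 1, 0, 0]
```

Each row has exactly one 1 across the new columns, which is a quick sanity check when applying this to a real dataset.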

While the categorical-to-numerical conversion technique works for most ML algorithms, there are a few algorithms, such as certain types of decision tree algorithms and related algorithms such as Random Forests, that can deal with categorical features natively. This will often yield better results for highly categorical datasets, and we will discuss this further in the next chapter, where we will look more closely at choosing the right algorithm given the data and performance requirements of the problem at hand. Our simple person dataset, after conversion of the categorical feature to binary features, is shown in Figure 3.


Figure 3: The simple person dataset after conversion of categorical features to binary numerical features. The original dataset is shown in Figure 1.


Dealing with missing data

We’ve already seen a few examples of datasets with missing data. In tabular datasets they often appear as empty cells, NaN, None or similar. Missing data are usually artifacts of the data collection process; for some reason a particular value could not be measured for a data instance. Figure 4 shows an example of missing data in the Titanic passenger dataset.


Figure 4: An example of missing values in the Titanic passenger dataset in the Age and Cabin columns. The passenger information has been extracted from various historical sources, so in this case the missing values stem from information that simply couldn’t be found in the sources.

There are two main types of missing data, which we need to handle in different ways. First, data can be missing in such a way that the fact that it is missing carries information in itself and could be useful for the ML algorithm. The other possibility is that the data is missing simply because the measurement was impossible, and the reason for its unavailability carries no information. In the Titanic passenger dataset, for example, missing values in the Cabin column may tell us that those passengers were in a different social or economic class, while missing values in the Age column carry no information; the age of a particular passenger simply couldn’t be found.

Let us first consider the case of meaningful missing data. When we believe that there is information in the data being missing, we usually want the ML algorithm to be able to use this information to potentially improve the prediction accuracy. To achieve this, we want to convert the missing values into the same format as the rest of the column. For numerical columns this can be done by setting missing values to -1 or -999, depending on the typical range of the non-missing values. Simply pick a number at one end of the numerical range to denote missing values, and remember that order is important for numerical columns: you don’t want to pick a value in the middle of the distribution of values.

For a categorical column with meaningful missing data, we can simply create a new category called “missing”, “None” or similar and then handle the categorical feature in the usual way, for example using the technique described in the previous section. Figure 5 shows a simple diagram of what to do with meaningful missing data.


Figure 5: Diagram of what to do with meaningful missing data.
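The two cases for meaningful missing data can be sketched in a few lines of plain Python. The helper names (`is_missing`, `fill_numeric`, `fill_categorical`), the default sentinel of -1, and the sample values are assumptions for illustration; the sentinel should be chosen to lie outside the range of the real values in the column.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def fill_numeric(values, sentinel=-1):
    """Replace missing numbers with a sentinel at one end of the value range."""
    return [sentinel if is_missing(v) else v for v in values]

def fill_categorical(values, label="missing"):
    """Replace missing categories with an explicit extra category."""
    return [label if is_missing(v) else v for v in values]

measurements = [7.25, None, 71.28, float("nan")]  # made-up numerical column
decks = ["C", None, "E", None]                    # made-up categorical column
print(fill_numeric(measurements))  # [7.25, -1, 71.28, -1]
print(fill_categorical(decks))     # ['C', 'missing', 'E', 'missing']
```

After this step the categorical column can be converted to binary features like any other, with “missing” becoming one more column.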

Missing data whose absence carries no information must be handled differently. In this case, we cannot simply introduce a special number or category, because we might introduce data that are flat-out wrong. For example, if we were to change any missing values in the Age column of the Titanic passenger dataset to -1, we would probably hurt the model by distorting the age distribution for no good reason. Some ML algorithms will be able to deal with these truly missing values by simply ignoring them. If not, we need to pre-process the data and replace missing values by guessing the true value. This concept of replacing missing data is called imputation.

There are many ways to impute missing data, but there is unfortunately no one-size-fits-all solution. The easiest and least desirable way is to simply remove all instances for which there are missing values. This will not only decrease the predictive power of the model, but also introduce biases if the missing data are not randomly distributed. Another simple way is to assume some temporal order to the data instances and replace missing values with the column value of the preceding row. With no other information, we are guessing that a measurement hasn’t changed from one instance to the next. Needless to say, this assumption will often be wrong. For extremely big data, however, we will not always be able to apply more sophisticated methods, and these simple methods can be useful.

It is usually better to use a larger portion of the existing data to guess the missing values. To avoid biasing the model, we may replace missing column values by the mean value of the column. With no other information, this is a guess that on average will be closest to the truth. The mean is sensitive to outliers, so depending on the distribution of column values we may want to use the median instead. These approaches are widely used in machine learning today and work well in many cases.
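The simple imputation strategies described above (carrying the previous value forward, or filling with the mean or median) can be sketched as follows. This is a minimal illustration in plain Python; the function names and the made-up `ages` column are assumptions, and the single large value is included only to show how an outlier pulls the mean away from the median.

```python
import math
from statistics import mean, median

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def forward_fill(values):
    """Replace each missing value with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if not is_missing(v):
            last = v
        filled.append(last)  # stays None if nothing has been observed yet
    return filled

def fill_with(values, statistic=mean):
    """Replace missing values with a statistic of the observed values."""
    observed = [v for v in values if not is_missing(v)]
    guess = statistic(observed)
    return [guess if is_missing(v) else v for v in values]

ages = [22.0, None, 26.0, 35.0, None, 80.0]  # made-up values
print(forward_fill(ages))       # [22.0, 22.0, 26.0, 35.0, 35.0, 80.0]
print(fill_with(ages, mean))    # missing entries become 40.75
print(fill_with(ages, median))  # missing entries become 30.5
```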
When we set all missing values to a single new value, however, we lose any potential correlation with other variables, which may be important for the algorithm to detect appropriate patterns in the data. What we really want to do is predict the value of the missing variable based on all of the data and variables available. Does this sound familiar? This is exactly what machine learning is about, so we are basically thinking about building ML models in order to be able to build ML models. In practice, you will often impute the missing data with a simple algorithm, such as linear or logistic regression, which is not necessarily the same as the main ML algorithm. In any case, you are creating a pipeline of ML algorithms that introduces more knobs to turn in order to optimize the final model. Again, it is important to realize that there is no single best way to deal with truly missing data. We’ve discussed a few different ways in this article, and Figure 6 attempts to summarize the different possibilities.


Figure 6: Full decision diagram for handling missing values when preparing data for ML modeling.
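The idea of imputing by prediction can be illustrated with the simplest possible model: a straight line fitted by ordinary least squares on the rows where the value is known. The sketch below is an assumption-laden toy, not a recommended pipeline. It uses a single hypothetical predictor column, made-up numbers, and the illustrative names `fit_line` and `impute_by_regression`.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_by_regression(predictor, target):
    """Fill None entries in target using a line fitted on the complete rows."""
    known = [(x, y) for x, y in zip(predictor, target) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y
            for x, y in zip(predictor, target)]

predictor = [1.0, 2.0, 3.0, 4.0]   # made-up, fully observed column
target = [10.0, 20.0, None, 40.0]  # made-up column with one gap
print(impute_by_regression(predictor, target))  # third value is close to 30
```

Unlike a mean or median fill, the imputed value here depends on the other column, which preserves some of the relationship between the two variables.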


Simple feature engineering

We won’t go into domain-specific and more advanced feature-engineering techniques in this article, but it’s worth mentioning the basic idea of simple pre-processing of the data in order to make the model better. We use our Titanic example again as the motivation in this section. In Figure 7 we take another look at a part of the data, and in particular the Cabin feature. Without processing, the Cabin feature is not necessarily very useful. Some of the values seem to include multiple cabins, and even a single cabin wouldn’t seem like a good categorical feature because all cabins would be independent “buckets”. If we want to predict, for example, whether a passenger survived or not, living in a particular cabin instead of the neighboring cabin may not have any predictive power.

Figure 7: Various cabin values in the Titanic dataset. Some include multiple cabins, while others are missing. In addition, cabin identifiers themselves may not be good categorical features.

Living in a particular region of the ship, though, could potentially be important for survival. For single cabin IDs, we could extract the letter as a categorical feature and the number as a numerical feature, assuming they denote different parts of the ship. This doesn’t handle multiple cabin IDs, but since it looks like all multiple cabins are close to each other, it will probably be fine to only extract the first cabin ID. We could include the number of cabins in a new feature, though, which could also be relevant. All in all, we will create three new features from the Cabin feature. The code for this simple extraction could look like the following:

Listing 2 Simple feature extraction on Titanic cabins

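def cabin_features(data):
    features = []
    for cabin in data:
        # Treat NaN/None/empty strings as missing; split() ignores extra spaces
        cabins = cabin.split() if isinstance(cabin, str) else []
        n_cabins = len(cabins)
        # Use only the first cabin ID: leading letter plus room number
        cabin_char = cabins[0][0] if n_cabins > 0 else "X"
        digits = cabins[0][1:] if n_cabins > 0 else ""
        cabin_num = int(digits) if digits.isdigit() else -1
        features.append([cabin_char, cabin_num, n_cabins])
    return features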
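To see what the three extracted features look like for individual values, here is a small worked example for a single Cabin cell. It is a regex-based variant written for illustration, not the listing itself; the name `parse_first_cabin`, the pattern, and the sample cells are assumptions, while the "X" and -1 placeholders follow Listing 2.

```python
import re

# One letter followed by an optional room number, e.g. "C85" or "D"
CABIN_PATTERN = re.compile(r"^([A-Za-z])(\d*)$")

def parse_first_cabin(value):
    """Return [letter, room_number, n_cabins] for one Cabin cell."""
    cabins = value.split() if isinstance(value, str) else []
    if not cabins:
        return ["X", -1, 0]  # missing cell
    match = CABIN_PATTERN.match(cabins[0])
    if match is None:
        return ["X", -1, len(cabins)]  # unexpected format
    letter, digits = match.groups()
    return [letter, int(digits) if digits else -1, len(cabins)]

for cell in ["C85", "C23 C25 C27", "", None]:
    print(repr(cell), "->", parse_first_cabin(cell))
# 'C85' -> ['C', 85, 1]
# 'C23 C25 C27' -> ['C', 23, 3]
# '' -> ['X', -1, 0]
# None -> ['X', -1, 0]
```

The letter is itself a categorical feature, so it would still need to be converted to binary features before being passed to an algorithm that only accepts numbers.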

By now it should be no surprise what we mean by feature engineering: using existing features to engineer new features that increase the value of the original data using our knowledge of the data or domain in question.