From Pandas Workout by Reuven Lerner. This article discusses cleaning data for use with Pandas.
Take 35% off Pandas Workout by entering fcclerner2 into the discount code box at checkout at manning.com.
In the late 1980s, I worked for a company that wanted to know how much rain had fallen in a large number of US cities. Their solution? They gave me a list of cities and phone numbers, and asked me to call each of them in sequence, recording the previous day’s rainfall in an Excel spreadsheet. Nowadays, getting that sort of information (and many other types of information) is pretty easy. Not only do many governments provide data sets for free, but numerous companies make data available for a price. No matter what topic you’re researching, data is almost certainly available. The only questions are where you can get it, how much it’ll cost, and what format the data comes in.
Actually, you should ask another question, too: How accurate is the data you’re using?
All too often, we assume that if we’re downloading a CSV file from an official-looking Web site, the data it contains is good. But the data that we download frequently has problems. That shouldn’t surprise us, given that the data comes from people (who can make a variety of types of mistakes) and machines (which make different types of mistakes). Maybe someone accidentally misnamed a file, or entered data into the wrong field. And maybe the automatic sensors whose inputs were used in collecting the data were broken, or offline. Maybe the servers were down for a day, or someone misconfigured the XML feed-reading system, or the routers were being rebooted, or a backhoe cut the Internet line.
All of this assumes that there was actually data to begin with. Often we’ll have missing data because there wasn’t any data to record.
This is why I’ve often heard data scientists say that 80 percent of their job involves cleaning data. What does it mean to “clean data”? Here is a partial list, followed by a short sketch showing a few of these steps in action:
- rename columns
- rename the index
- remove irrelevant columns
- split one column into two
- combine two or more columns into one
- remove non-data rows
- remove repeated rows
- remove rows with missing data (aka NaN)
- replace NaN data with a single value
- replace NaN data via interpolation
- standardize strings
- fix typos in strings
- remove whitespace from strings
- correct the types used for columns
- identify and remove outliers
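To give a taste of what these steps look like in practice, here is a minimal sketch that applies a few of them to a small, made-up data frame (the column names and values are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'Temp ': [12.5, np.nan, 13.1, 13.1],
                   'city': ['chicago ', 'Chicago', 'boston', 'boston']})

df = df.rename(columns={'Temp ': 'temp'})           # rename a column
df['city'] = df['city'].str.strip().str.title()     # remove whitespace, standardize strings
df = df.drop_duplicates()                           # remove repeated rows
df['temp'] = df['temp'].fillna(df['temp'].mean())   # replace NaN data with a single value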
The importance of cleaning your data, and thus ensuring that your analysis is as accurate as possible, cannot be overstated.
In this article, we’ll thus be looking at a few pandas techniques for cleaning our data. First, we’ll look at a few ways in which we can handle NaN values.
Useful references
Table 1. What you need to know

| Concept | What is it? |
| --- | --- |
| s.isnull() | Returns a boolean series indicating where there are null (typically NaN) values |
| df.isnull() | Returns a boolean data frame indicating where there are null (typically NaN) values |
| s.replace() | Replace values in one or more columns with other values |
| s.map() | Apply a function to each element of a series, returning the result of that application on each element |
| s.fillna() | Replace NaN values with another value |
| df.dropna() | Remove rows with NaN values |
| s.str | Working with textual data |
| df.sort_index() | Reorder the rows of a data frame based on the values in its index, in ascending order |
| pd.read_excel() | Create a data frame based on an Excel spreadsheet |
| s.value_counts() | Returns a sorted (descending frequency) series counting how many times each value appears in s |
| s.unique() | Returns a series with the unique (i.e., distinct) values in s |
How much is missing?
We’ve already seen, on a number of occasions, that data frames (and series) can contain NaN values. One question we often want to answer is: How many NaN values are there in a given column? Or, for that matter, in a data frame?
One solution is to calculate things yourself. There is a count method you can run on a series, which returns the number of non-null values in the series. That, combined with the shape of the series, can tell you how many NaN values there are:
s.shape[0] - s.count() ❶
❶ Returns an integer, the number of null elements
This is tedious and annoying. And besides, shouldn’t pandas provide us with a way to do this? Indeed it does, in the form of the isnull method. If you call isnull on a column, it returns a boolean series, one that has True where there is a NaN value, and False in other places. You can then apply the sum method to the series, which will return the number of True values, thanks to the fact that Python’s boolean values inherit from integers, with True acting as 1 and False as 0:
s.isnull().sum() ❶
❶ Calculate the number of NaN values in s
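For example, on a small, made-up series, both approaches give the same answer:

import pandas as pd
import numpy as np

s = pd.Series([10, 20, np.nan, 40, np.nan])
s.shape[0] - s.count()    # 5 elements - 3 non-null = 2
s.isnull().sum()          # also 2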
If you run isnull on a data frame, then you will get a new data frame back, with True and False values indicating whether there is a null value in that particular row-column combination. And of course, you can then run sum on the resulting data frame, finding how many NaN values there are in each column:
df.isnull().sum() ❶
❶ Calculate the number of NaN values in each column
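For example, given a small, hypothetical data frame with NaN values in both columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 6]})
df.isnull().sum()    # a    1
                     # b    2
                     # dtype: int64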
Instead of summing the results of a call to isnull, you can also use the any and all methods, both of which return boolean values. Called with axis=1, any will return True for each row in which at least one of the values is True, and all will return True for each row in which all of the values are True. You can thus do the following:
df[~df.isnull().any(axis=1)] ❶
❶ Show only the rows without NaN
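Continuing with the hypothetical df from the previous example, you can see how any and all differ along axis=1:

df.isnull().any(axis=1)         # [True, True, False]: rows 0 and 1 each contain a NaN
df.isnull().all(axis=1)         # [False, True, False]: only row 1 is entirely NaN
df[~df.isnull().any(axis=1)]    # keeps only row 2, the one complete row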
Finally, the df.info method returns a wealth of information about the data frame on which it is run, including the name and type of each column, a summary of how many columns there are of each type, and the estimated memory usage. If the data frame is small enough, then it’ll also show you how many null values there are in each column. However, this calculation can take some time, so df.info will only count null values when the number of columns is below a certain threshold (the pd.options.display.max_info_columns option). If you’re above that threshold, then you’ll need to tell pandas explicitly to count, by passing show_counts=True:
df.info(show_counts=True) ❶
❶ Get full information about the data frame df, including the number of null values in each column
pandas defines both isna and isnull for both series and data frames. What’s the difference between them? Actually, there is no difference. If you look at the pandas documentation, you’ll find that they’re identical except for the name of the method being called. I mostly use isnull, but if you prefer to go with isna, then be my guest.
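You can verify this yourself with the series s from the earlier example; the two methods return identical results:

s.isna().equals(s.isnull())    # True, since the two methods are aliases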
Note that both of these are different from np.isnan, a function defined in NumPy, the library on top of which pandas is built. I try to stick with the methods that pandas defines, which, in my experience, integrate better into the rest of the system.
Rather than using ~, which pandas uses to invert boolean series and data frames, you can often use the notnull methods, which exist for both series and data frames.
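For example, these two expressions select the same non-null elements from the series s:

s[s.notnull()]     # keep only the non-null values
s[~s.isnull()]     # same result, inverting isnull with ~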
That’s all for this article. If you want to see more, check out the book on Manning’s liveBook platform here.